Workflow for Protein Allocation Model Parameter Estimation
This subdirectory contains the framework that performs the parametrization workflow. The workflow starts from an initial parametrization. Enzymes are selected based on model simulations and on the effect of each individual enzyme on the simulated growth rate. The parameters of the selected enzymes are then optimized by a genetic algorithm, which maximizes the fit of the simulations to experimentally observed phenotypes, measured as an R2 value. The whole workflow is repeated until a user-defined maximum number of iterations or error threshold is reached. As a final step, the size of the unused enzyme sector at zero growth is optimized to determine the quantitative protein burden.
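As a rough illustration of the fit measure, the sketch below computes a standard coefficient of determination between measured and simulated values. It assumes the textbook definition of R2; the framework's own error calculation may differ (for example through reaction-specific weights), and the function name and numbers are hypothetical.

```python
def r_squared(measured, simulated):
    """Coefficient of determination between measured and simulated values."""
    mean = sum(measured) / len(measured)
    # Residual sum of squares: distance between simulation and measurement
    ss_res = sum((m - s) ** 2 for m, s in zip(measured, simulated))
    # Total sum of squares: spread of the measurements around their mean
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


# Hypothetical growth rates (1/h) at four substrate uptake rates
measured = [0.10, 0.25, 0.40, 0.55]
simulated = [0.12, 0.24, 0.41, 0.50]
fit = r_squared(measured, simulated)
print(f"R2 = {fit:.3f}")
```

A value close to 1 means the simulations reproduce the measurements well; this is the quantity that threshold_error is compared against.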
Software Structure
The PAM parametrizer is organized as follows:
PAM_parametrizer: contains the PAMParametrizer object with all the functionalities
PAM_data_classes: contains all data classes to store metadata, hyperparameters and results
  ValidationData: experimental data used to determine the R2 value, and the reactions which should be used for validation/plotting
  HyperParameters: all hyperparameters for the parametrizer and the genetic algorithm
  ParametrizationResults: results of the parametrization workflows, such as fluxes, sensitivities and best parameter sets
  FluxResults: class storing the simulation results (fluxes and error to experimental measurements) for a single substrate uptake reaction. Stored in ParametrizationResults.flux_results.
The parametrization workflow can make use of experimental measurements of growth on different substrates. This is enabled by providing a separate ValidationData instance for each substrate. This way, each substrate is stored together with the corresponding experimental data and reactions which should be used for validation. The user can define one main substrate which is purely used to visualize the progress of the parametrization in an effective manner. For each substrate, a separate FluxResults object will be created and stored in ParametrizationResults. As a consequence, the simulations and corresponding errors can be stored for each substrate individually.
PAM parametrizer workflow
The PAMparametrizer is run with the run function. The workflow is organized as follows (in pseudocode):
For each carbon source:
    Calculate relation between substrate uptake rate and translational protein sector
    Plot validation data
    if number_experimental_measurement > hyperparameter.bin_resolution: Sample the validation data
    if binned == before: Perform iteration in bins
while (iteration <= hyperparameters.threshold_iteration) & (final_error <= hyperparameters.threshold_error):
    Initialize result objects in ParametrizationResults
    Run simulations over range of substrate uptake rates
    Based on the sensitivity analysis described in van den Bogaard, et al. (2024), select enzymes to evaluate
    Initiate genetic algorithm with sampled validation data and associated substrate uptake rates
    genetic_algorithm.start() (more details in Genetic Algorithm README file)
    Parse the results of the genetic algorithm, save intermediate results to Excel file
    Reparametrize the model with optimized parameters
    Plot the progress
Optimize the unused enzyme sector at zero growth
Save the resulting png file and parameters to Excel
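The stopping logic of the main loop can be sketched in Python as follows. This is a minimal skeleton, not the framework's run function: run_workflow and the run_iteration callable are hypothetical names, and a single call to run_iteration stands in for all steps inside the loop (simulation, enzyme selection, genetic algorithm, reparametrization).

```python
from typing import Callable, List


def run_workflow(
    run_iteration: Callable[[int], float],
    threshold_iteration: int = 100,
    threshold_error: float = 0.9,
) -> List[float]:
    """Repeat iterations until the R2 threshold or the iteration limit is reached."""
    errors: List[float] = []
    iteration = 1
    final_error = float("-inf")  # no fit has been computed yet
    while iteration <= threshold_iteration and final_error <= threshold_error:
        # Placeholder for: simulate, select enzymes, run the genetic
        # algorithm, reparametrize the model, and compute the new R2 value
        final_error = run_iteration(iteration)
        errors.append(final_error)
        iteration += 1
    return errors


# Hypothetical stand-in whose R2 improves by 0.2 per iteration
history = run_workflow(lambda i: min(1.0, 0.2 * i))
print(len(history), "iterations, final R2:", history[-1])
```

The loop keeps running while the fit is at or below threshold_error, so a higher R2 ends the workflow early; otherwise threshold_iteration caps the number of rounds.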
Data objects
The PAMparametrizer is a complex framework that requires different types of input from the user and generates different types of output. To orchestrate the data movement through the framework, the inputs and outputs of the PAMparametrizer are organized in four different data classes in the PAM_data_classes.py file. Below follows a description of each of the classes and how it is used by the framework.
SectorConfig
The main power of the PAM is that it contains sectors representing proteins that are not included explicitly in protein-constrained metabolic models, but that do occupy a variable amount of protein space. Normally, two sectors are added: the UnusedEnzymeSector and the TranslationalProteinSector. The amount of protein in these sectors is related to the (carbon) substrate uptake rate. This poses a challenge: the relation between the carbon substrate and the amount of protein in these sectors is unique for each substrate. Nevertheless, when a metabolic model is not limited by the amount of protein, there is a linear relation between the carbon uptake rate and the growth rate, which depends on the way the microbe metabolizes the substrate. We can thus use the relationship between the growth rate and the protein sector in regimes that are not protein-limited to compute the relation between the protein sector and the various substrates. This relationship is stored in the input Excel file for the PAMparametrizer and provided to the PAMparametrizer object using the SectorConfig TypedDict:
sectorname: The sector identifier as present in the PAM
slope: relationship between sector (g/gCDW) and the growth rate (1/h)
intercept: sector (g/gCDW) at zero growth
substrate_range: range of substrate uptake rates in which the relation between sector and growth rate is valid
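To make the conversion concrete, the sketch below mirrors the four fields in a TypedDict and rewrites the growth-rate relation in terms of the substrate uptake rate. The Python types, the helper sector_vs_uptake, and all numbers are assumptions for illustration; the linear growth law mu = a * q + b is the one described above for regimes that are not protein-limited.

```python
from typing import List, Tuple, TypedDict


class SectorConfig(TypedDict):
    """Fields as listed above; the Python types are assumptions."""
    sectorname: str
    slope: float                  # g/gCDW per 1/h
    intercept: float              # g/gCDW at zero growth
    substrate_range: List[float]  # uptake rates where the relation holds


def sector_vs_uptake(
    config: SectorConfig,
    growth_per_uptake: float,
    growth_at_zero_uptake: float = 0.0,
) -> Tuple[float, float]:
    """Rewrite sector = slope * mu + intercept in terms of the uptake rate q,
    assuming mu = growth_per_uptake * q + growth_at_zero_uptake."""
    slope_q = config["slope"] * growth_per_uptake
    intercept_q = config["intercept"] + config["slope"] * growth_at_zero_uptake
    return slope_q, intercept_q


# Hypothetical numbers, chosen only for illustration
translational = SectorConfig(
    sectorname="TranslationalProteinSector",
    slope=0.05,
    intercept=0.04,
    substrate_range=[0.0, 10.0],
)
slope_q, intercept_q = sector_vs_uptake(translational, growth_per_uptake=0.1)
print(f"sector = {slope_q:.4f} * q + {intercept_q:.4f}")
```

Because the growth law differs per substrate, the same growth-based slope and intercept yield a different uptake-based relation for each carbon source.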
ValidationData
The PAMparametrizer needs exchange fluxes to determine how good the simulations with the new parameter set are. These flux rates can be measured in different conditions, most commonly using different carbon sources. This poses a complex problem: how do we provide the framework with not only the experimental measurements, but also the precise conditions in which the measurements have been performed, and which reactions are most representative to see the progress? The ValidationData object tries to streamline this data handling issue. For each condition to which the parameter set should be optimized, the corresponding ValidationData object should be created. It stores the following information:
id: substrate uptake identifier
valid_data: dataframe containing the experimental measurements
validation_range: range of substrate uptake rates to consider for validation
reactions_to_validate: reactions to use for error calculation
reactions_to_plot: reactions to use for visualization
sector_config: configuration of all sectors related to the substrate uptake rate (as SectorConfig)
When the sector configuration is left empty, it is determined automatically from the relation between the sector and the growth rate. The PAMparametrizer recalculates the sector parameters using the relation between the growth rate and the substrate uptake rate of a PAM with a large protein pool (similar to a genome-scale model). If the relation between the sector parameters and the growth rate is unknown, only the translational sector is assumed to be related to the substrate uptake rate. The sector parameters are then obtained by assuming a relationship between growth rate and translational protein content similar to the one established for E. coli.
In the PAMparametrizer object, all the ValidationData objects are stored and used by the framework to validate the new parameter sets.
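The idea of one data object per condition can be sketched with a small stand-in class. ValidationDataSketch is not the framework's ValidationData: its constructor, the use of a list of records in place of the dataframe, the in_range helper, and the example identifiers and values are all assumptions made to keep the example self-contained.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ValidationDataSketch:
    """Illustrative stand-in mirroring the fields listed above."""
    id: str                                # substrate uptake identifier
    valid_data: List[Dict[str, float]]     # records replace the dataframe
    validation_range: Tuple[float, float]  # uptake rates to consider
    reactions_to_validate: List[str]
    reactions_to_plot: List[str] = field(default_factory=list)
    sector_config: Dict[str, dict] = field(default_factory=dict)

    def in_range(self) -> List[Dict[str, float]]:
        """Keep only measurements whose uptake rate lies in validation_range."""
        low, high = self.validation_range
        return [row for row in self.valid_data if low <= row[self.id] <= high]


# Hypothetical measurements; uptake is written as a negative flux here
glucose = ValidationDataSketch(
    id="EX_glc__D_e",
    valid_data=[
        {"EX_glc__D_e": -2.0, "BIOMASS": 0.18},
        {"EX_glc__D_e": -6.0, "BIOMASS": 0.48},
        {"EX_glc__D_e": -12.0, "BIOMASS": 0.70},
    ],
    validation_range=(-10.0, 0.0),
    reactions_to_validate=["BIOMASS"],
    reactions_to_plot=["BIOMASS"],
)
print(len(glucose.in_range()), "measurements inside the validation range")
```

A second substrate would get its own object with its own measurements and range, which keeps conditions from being mixed during error calculation.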
HyperParameters
A genetic algorithm is a very flexible tool for optimizing a set of parameters based on some type of error calculation. It can, however, be sensitive to different hyperparameters, such as the mutation probability and rate, the number of generations, and the population size. On top of that, the PAMparametrizer itself also needs some manual input, for example the number of iterations, the convergence threshold, and the number of kcats to optimize. For easy handling and reproducibility, all this information is saved in the HyperParameters data object.
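One plausible reading of the convergence settings is sketched below: the error counts as converged when the most recent threshold_nmbr_convergence errors differ from one another by at most threshold_convergence. The function has_converged and this exact criterion are assumptions, not the framework's confirmed implementation.

```python
from typing import Sequence


def has_converged(
    final_errors: Sequence[float],
    threshold_convergence: float = 1e-3,
    threshold_nmbr_convergence: int = 3,
) -> bool:
    """True if the last few errors changed by no more than the threshold."""
    if len(final_errors) < threshold_nmbr_convergence:
        return False  # not enough iterations to judge convergence
    recent = final_errors[-threshold_nmbr_convergence:]
    # Compare each error with the one from the following iteration
    return all(
        abs(later - earlier) <= threshold_convergence
        for earlier, later in zip(recent, recent[1:])
    )


# Hypothetical error histories
print(has_converged([0.5, 0.8, 0.8004, 0.8007]))
print(has_converged([0.5, 0.6, 0.7]))
```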
Below follows a summary of all the inputs you can change in this object, including an explanation of their effects.
| Attribute | Default value | Explanation |
|---|---|---|
| PAMparametrizer | | |
| threshold_error | 0.9 | R2 value at which the workflow stops |
| threshold_iteration | 100 | maximum number of iterations |
| threshold_convergence | 1e-3 | maximum difference between subsequent errors to consider the error to be converging |
| threshold_nmbr_convergence | 3 | number of final errors to consider for convergence |
| number_of_kcats_to_mutate | 5 | number of kcats to optimize per iteration |
| genetic_algorithm_file_base | 'genetic_algorithm_run_' | base of the files in which the genetic algorithm results are saved |
| filename_extension | '' | filename extension to recognize output files with |
| genetic_algorithm | GAPOUniform | personalized genetic algorithm object |
| genetic_algorithm_hyperparams | dict() | dictionary containing all hyperparameters for the genetic algorithm |
| Genetic Algorithm | | |
| mutation_probability | 0.5 | probability with which an individual (solution) is mutated in a generation |
| mutation_rate | 0.5 | probability with which an attribute (e.g. gene) of an individual is mutated |
| population_size | 10 | number of individuals (solutions) per population |
| crossover_probability | 0.8 | probability with which two individuals/offspring are crossed over |
| number_generations | 20 | number of consecutive generations per gene flow event |
| number_gene_flow_events | 2 | number of gene flow events |
| init_attribute_probability | 0 | probability with which the initial population is mutated |
| fitness_class | 'Fitfun_params_uniform' | name of the file containing the personalized fitness function |
| processes | 2 | number of parallel workers |
| time_limit | 600 | time limit in seconds |
| error_weights | dict() | reactions which should have a different impact on the error calculation |
| folderpath_save | Path(r"Results") | path for saving results |
| overwrite_intermediate_results | True | if True, saved intermediate results are overwritten |
| print_progress | True | if True, progress of the genetic algorithm is printed |
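The difference between mutation_probability and mutation_rate is easiest to see in code: the first decides whether an individual is mutated at all, the second decides per attribute. The sketch below is an assumption-laden illustration, not the GAPOUniform implementation; mutate_population and the multiplicative factor used as mutation operator are placeholders.

```python
import random
from typing import List, Optional


def mutate_population(
    population: List[List[float]],
    mutation_probability: float = 0.5,
    mutation_rate: float = 0.5,
    rng: Optional[random.Random] = None,
) -> List[List[float]]:
    """Return a mutated copy of the population (the input is left untouched)."""
    rng = rng or random.Random()
    mutated = []
    for individual in population:
        new = list(individual)
        # Individual level: is this solution mutated in this generation?
        if rng.random() < mutation_probability:
            for i, value in enumerate(new):
                # Attribute level: is this particular attribute changed?
                if rng.random() < mutation_rate:
                    # Placeholder operator: scale by a random factor
                    new[i] = value * rng.uniform(0.5, 2.0)
        mutated.append(new)
    return mutated


# Hypothetical population of two individuals with two kcat-like attributes
population = [[1.0, 2.0], [3.0, 4.0]]
print(mutate_population(population, rng=random.Random(42)))
```

With both defaults at 0.5, roughly a quarter of all attributes change per generation on average, which is why the two settings are best tuned together.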
ParametrizationResults
The PAMparametrizer not only needs complex input but also generates complex output. For this reason, the output is organized in two different data structures: (1) ParametrizationResults, to save the (intermediate) results of the entire framework, and (2) FluxResults, to save the (intermediate) results of the simulation of various conditions.
The ParametrizationResults object is created when the PAMparametrizer is initialized. It saves the following information:
substrate_uptake_reactions: list of all substrate uptake reactions to simulate
esc_df: dataframe to store the enzyme sensitivities
sensitive_enzymes: dataframe to store the most sensitive enzymes selected for each iteration
flux_results: dictlist to store the FluxResults objects for all conditions
best_individuals: dataframe to store the best individuals for genetic algorithm optimization
computational_time: dataframe to store the computational costs of each run
final_errors: dataframe to store the final error of each simulation
FluxResults
The FluxResults object is stored in the ParametrizationResults. It saves the results, mostly flux rates, of the simulations of a specific condition, which can be used later to calculate the error to the experimental data (stored in ValidationData). It stores the following:
id: substrate uptake identifier
error_df: dataframe to store the error for this specific condition
fluxes_df: dataframe to store the reaction fluxes required to calculate the error (defined by ValidationData.reactions_to_validate)
substrate_range: a list with the substrate uptake rates to validate. The substrate uptake rates are obtained from ValidationData.sampled_valid_data
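The per-substrate bookkeeping can be sketched as follows. FluxResultsSketch is a hypothetical stand-in, not the framework's FluxResults: plain lists and dictionaries replace the dataframes, a regular dict replaces the dictlist, and the identifiers and values are illustrative only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FluxResultsSketch:
    """Illustrative stand-in; plain containers replace the dataframes."""
    id: str                      # substrate uptake identifier
    substrate_range: List[float]
    fluxes: Dict[str, List[float]] = field(default_factory=dict)
    errors: List[float] = field(default_factory=list)

    def add_fluxes(self, reaction_id: str, values: List[float]) -> None:
        # One simulated flux is expected per substrate uptake rate
        if len(values) != len(self.substrate_range):
            raise ValueError("need one flux value per substrate uptake rate")
        self.fluxes[reaction_id] = list(values)


# One result object per substrate uptake reaction, keyed by its identifier
flux_results = {
    rxn_id: FluxResultsSketch(id=rxn_id, substrate_range=[-2.0, -6.0])
    for rxn_id in ("EX_glc__D_e", "EX_ac_e")
}
flux_results["EX_glc__D_e"].add_fluxes("BIOMASS", [0.18, 0.48])
flux_results["EX_glc__D_e"].errors.append(0.85)  # hypothetical R2 value
print(sorted(flux_results))
```

Keeping simulations and errors separate per substrate means the fit on one carbon source can be inspected without touching the results of another.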