Workflow for Protein Allocation Model Parameter Estimation

This subdirectory contains the framework which performs the parametrization workflow. The workflow starts from an initial parametrization. Based on model simulations and the effect of individual enzymes on the simulated growth rate, enzymes are selected. The parameters of these enzymes will be optimized by a genetic algorithm, which aims to maximize the fit of the simulations to experimentally observed phenotypes by means of an R2 value. The whole workflow is repeated until a user-defined maximum number of iterations or error threshold is reached. As a final step, the amount of unused enzyme sector at zero growth is optimized to determine the quantitative protein burden.

Software Structure

The PAM parametrizer is organized as follows:

  1. PAM_parametrizer: contains the PAMParametrizer object with all the functionalities

  2. PAM_data_classes: contains all data classes to store metadata, hyperparameters and results

    • ValidationData: experimental data used to determine the R2 value and the reactions which should be used for validation/plotting

    • Hyperparameter: all hyperparameters for the parametrizer and the genetic algorithm

    • ParametrizationResults: results of the parametrization workflows, such as fluxes, sensitivities and best parameter sets

    • FluxResults: class storing the simulation results (fluxes and error to experimental measurements) for a single substrate uptake reaction. Stored in ParameterizationResults.flux_results.

The parametrization workflow can make use of experimental measurements of growth on different substrates. This is enabled by providing a separate ValidationData instance for each substrate. This way, each substrate is stored together with the corresponding experimental data and reactions which should be used for validation. The user can define one main substrate which is purely used to visualize the progress of the parametrization in an effective manner. For each substrate, a separate FluxResults object will be created and stored in ParametrizationResults. As a consequence, the simulations and corresponding errors can be stored for each substrate individually.

PAM parametrizer workflow

The PAMparametrizer is run with the run function. The workflow is organized as follows (in pseudocode):

For each carbon source:

Calculate relation between substrate uptake rate and translational protein sector

Plot validation data

if number_experimental_measurement > hyperparameter.bin_resolution: Sample the validation data

if binned == before: Perform iteration in bins

while (iteration <= hyperparameters.threshold_iteration) & (final_error <= hyperparameters.threshold_error):

  1. Initialize result objects in ParametrizationResults.

  2. Run simulations over range of substrate uptake rates

  3. Based on the sensitivity analysis described in van den Bogaard, et al. (2024), select enzymes to evaluat

  4. Initiate genetic algorithm with sampled validation data and associated substrate uptake rates

  5. genetic_algorithm.start() (more details in Genetic Algorithm README file)

  6. Parse the results of the genetic algorithm, save intermediate results to excel file

  7. Reparametrize the model with optimized parameters

  8. Plot the progress Optimize the unused enzyme sector at zero growth Save the resulting png file and parameters to excel

Data objects

The PAMparametrizer is a complex framework requiring different types of input from the user and generates different types of output. To orchestrate the data movement through the framework, the inputs and outputs of the PAMparametrizer are organized in four different data classes in the PAM_data_classes.py file. Below follows a description of each of the classes, and how it is used by the framework.

SectorConfig

The main power of the PAM is that it contains sectors which represent proteins which are not included explicitely in protein-constrained metabolic models, but do occupy a variable amount of protein space. Normally, 2 sectors are added: the UnusedEnzymeSector and the TranslationalProteinSector. The amount of proteins in these sectors are related to the (carbon) substrate uptake rate. This poses a challenge: the relation between carbon substrate and the amount of proteins in these sectors is unique for each substrate. Nevertheless, when a metabolic model is not limited by the amount of proteins, there is a linear relation between the carbon uptake and the growth rate, depending on the way the microbe metabolizes the substrate. We can thus use the relationship between the growth rate and protein sector in not protein-limited regimes to compute the relation between the protein sector and the various substrates. This relationship is stored in the input Excel file for the PAMparametrizer and provided to the PAMparametrizer object using the SectorConfig TypedDict:

  • sectorname: The sector identifier as present in the PAM

  • slope: relationship between sector (g/gCDW) and the growth rate (1/h)

  • intercept: sector (g/gCDW) at zero growth

  • substrate_range: range of substrate uptake rates in which the relation between sector and growth rate is valid

ValidationData

The PAMparametrizer needs exchange fluxes to determine how good the simulations with the new parameter set are. These flux rates can be measured in different conditions, most commonly using different carbon sources. This poses a complex problem: how do we provide the framework with not only the experimental measurements, but also the precise conditions in which the measurements have been performed, and which reactions are most representative to see the progress? The ValidationData object tries to streamline this data handling issue. For each condition to which the parameter set should be optimized, the corresponding ValidationData object should be created. It stores the following information:

  • id: substrate uptake identifier

  • valid_data: dataframe containing the experimental measurements

  • validation_range: range of substrate uptake range to consider for validation

  • reactions_to_validate: reactions to use for error calculation

  • reactions_to_plot: reactions to use for visualization

  • sector_config: configuration of all sectors related to the substrate uptake rate (as SectorConfig)

When the sector_configs is left empty, it will be automatically determined using the relation between the sector and the growth rate. The PAMparametrizer recalculates the sector parameters using the relation between the growth rate and substrate uptake rate of a PAM with a large protein pool (similar to a genome-scale model). If the relation between the sector parameters and the growth rate are unknown, only the translational sector is assumed to be related to the substrate uptake rate. The sector parameters are then obtained, assuming a similar growth rate to translational protein content relationship as established for E. coli.

In the PAMparametrizer object, all the ValidationData objects are stored and used by the framework to validate the new parametersets.

HyperParameters

A genetic algorithm is a very flexible tool for optimizing a set of parameters based on some type of error calculation. It can, however, be sensitive to different hyperparameters, such as the mutation probability and rate, the number of generations, the population size, etc. On top of that, the PAMparametrizer itself also needs some manual input. For examples the number of iterations, the convergence threshold, the number of kcats to optimize, etc. For easy handling and easy reproduction, all this information is saved in the HyperParameters data object.

Below follows a summary off all the inputs you can change in this object, including an explanation of its effects.

Attribute

Default value

Explanation

PAMparametrizer

threshold_error

0.9

R2 value at which workflow stops

threshold_iteration

100

max number of iterations

threshold_convergence

1e-3

max difference between subsequent errors to consider the error to be converging

threshold_nmbr_convergence

3

number of final error to consider for convergence

number_of_kcats_to_mutate

5

number of k to optimize per iteration

genetic_algorithm_file_base

‘genetic_algorithm_run_’

base of files in which the genetic algorithm results are saves

filename_extension

‘’

filename extension to recognize output files with

genetic_algorithm

GAPOUniform

personalized genetic algorithm object
(if you want to change something in the genetic algorithm structure)

genetic_algorithm_hyperparams

dict()
see below

dictionary containing all hyperparameters for the genetic algorithm
see below for all inputs

Genetic Algorithm

mutation_probability

0.5

probability with which an individual (solution) is mutated in a generation

mutation_rate

0.5

probability with which an attribute (e.g. gene) of an individual is mutated

population_size

10

number of individuals (solution) per population

crossover_probability

0.8

probability with which two indivduals/offsprings are crossed over

number_generations

20

number of consecutive generations per gene flow event

number_gene_flow_events

2

number of gene flow events,
i.e. merging of multiple populations independently evolved on parallel workers

init_attribute_probability

0

probability with which the initial population is mutated

fitness_class

‘Fitfun_params_uniform’

name of the file containing the personalized fitness function

processes

2

number of parallel workers

time_limit

600

time limit in seconds

error_weights

dict()

reaction which should have a different impact on the error calculation
than other reactions
e.g. {‘EX_ac_e’:5}:acetate has 5 times more impact on the error than the other reactions

folderpath_save

Path(r”Results”)

path for saving results

overwrite_intermediate_results

True

if true, saved intermediate results are overwritten

print_progress

True

if True, progress of the genetic algorithm is printed

ParametrizationResults

The PAMparametrizer does not only need a complex input, but also generated a complex output. For this reason, the output is organized in two different data structures: (1) ParametrizationResults, to save the (intermediate) results of the entire framework, and (2) FluxResults, to save the (intermediate) results of the simulation of various conditions.

The ParametrizationResults object is initiated upon initiation of the PAMparametrizer. It saves the following information:

  • substrate_uptake_reactions: list of all substrate uptake reactions to simulate

  • esc_df: dataframe to store the enzyme sensitivities

  • sensitive_enzymes: dataframe to store the most sensitive enzymes selected for each iteration

  • flux_results: dictlist to store the FluxResults objects for all conditions

  • best_individuals: dataframe to store the best individuals for genetic algorithm optimization

  • computational_time: dataframe to store the computational costs of each run

  • final_errors: dataframe to store the final error of each simulation

FluxResults

The FluxResults object is stored in the ParametrizationResults. It saves the results, mostly flux rates, of the simulations of a specific conditions, which can be used at a later time point to calculate the error to experimental data (stored in ValidationData). It stores the following:

  • id: substrate uptake identifier

  • error_df: dataframe to store the error for this specific conditions

  • fluxes_df: dataframe to store the reaction fluxes required to calculate the error (defined by ValidationData.reactions_to_validate)

  • substrate_range: a list with the substrate uptake rates to validate. The substrate uptake rates are obtained from ValidationData.sampled_valid_data