Parametrizing a PAM: from GEM to PAM
Building a Protein Allocation Model (PAM) from a genome-scale metabolic model (GEM) requires several inputs and design choices from the user. In this workflow, all the steps, requirements, and inputs are listed. All steps are, where possible, connected to scripts and notebooks which support the process of going from GEM to PAM. Let’s get started!
1. The Genome-Scale Metabolic Model Sanity Check
The basis of each PAM is a genome-scale metabolic model of high-quality. To create a PAM out of a GEM, the following requirements should be met:
100% mass balanced
100% charge balanced
Solid parametrization of the ATP maintenance (both growth- and non-growth-associated)
Model can be pickled and unpickled (if this is not the case, you might want to rename all metabolites which are called St)
The following is not required, but makes your life easier:
Reactions annotated with KEGG ids
Metabolites annotated with KEGG ids
Genes annotated with KEGG ids
>60% of all reactions should be annotated with gene-reaction associations
Also, make sure you have the following information
Regular expression to recognize gene-ids (e.g.,
r'\b([b|s]\d{4})\b'for the b1234 locus tags in E. coli)Potential exceptions in identifier/mapping format (an example is
s0001which is used for unknown gene)
2. Building the Protein Sectors
You can find scripts to help you build the protein sectors in Scripts/i1_preprocessing. The scripts can be run in random
order, although we suggest starting with the ActiveEnzymesSector.
2.1 Initiating the ActiveEnzymesSector
Notebook: Scripts/i1_preprocessing/0_parse_kcat_values_GotEnzymes.ipynb (personlization needed)
Input:
Genome-scale model
Default protein sector parameter file
Data/proteinAllocationModel_EnzymaticData_empty.xlsxUniProt mapping between genes and proteins (instructions in notebook)
All kcats for the microorganism from GotEnzymes (instructions in notebook)
Defaults:
Reactions or genes which cannot be mapped to any protein identifier will be given a default id, length and molar mass :
Enzyme_<gene>orEnzyme_<reaction>Proteins which cannot be mapped to a kcat value are assigned a default value of 13.7 (BRENDA mean value according to Bar-Even et al. (2011))
Each protein represents a functional enzyme: enzyme complexes are represented by ids which combine all subunits (
peptide1_peptide2) and the molar masses and lengths are summedEach row represents an individual enzyme-reaction relation
Output:
Excel file saved in
Results/1_preprocessing/proteinAllocationModel_EnzymaticData_<your-model>_<yymmdd>.xlsxSheet:
ActiveEnzymeswith the following information
rxn_id |
enzyme_id |
direction |
kcat_values |
kegg_id |
Reactants |
Products |
EC |
GPR |
gene |
Length |
molMass |
|---|---|---|---|---|---|---|---|---|---|---|---|
str |
str |
Literal[‘f’,‘b’] |
float |
str |
List[str] |
List[str] |
str |
str |
List[str] |
int |
int |
model reaction id |
UniProt protein id |
in 1/s |
reaction KEGG id |
all reactants in KEGG id |
all products in KEGG id |
all EC numbers associated with the reaction |
and/or relation between genes and reaction |
list of genes coding for all peptides associated with the protein |
length of protein in aa |
molar mass of protein in kDa |
2.2 Parametrizing the TranslationalProteinSector
Notebook: Scripts/i1_preprocessing/0_translational_sector_config.ipynb (personlization needed)
Input:
Genome-scale model
PAM set up method (see the PAModelpy documentation), examples are in
Modules/utils/pam_generation.pyDefault protein sector parameter file
Data/proteinAllocationModel_EnzymaticData_empty.xlsxorResults/1_preprocessing/proteinAllocationModel_EnzymaticData_<your-model>_yymmdd.xlsx(if you already created the active enzymes sector)Either one of the following:
proteomics measurements of the entire proteome [g/gCDW] or [g/gPROT], an estimate of the protein coverage by the measurement, the amount of protein per gCDW, and a functional annotation of the measured peptides (e.g., with cluster of orthologous genes annotations)
assumption that the microorganism is similar enough to an organism which does have all from point 1. available.
Output:
Excel file saved in
Results/1_preprocessing/proteinAllocationModel_EnzymaticData_<your-model>_yymmdd.xlsxSheet:
Translationalwith the following information for at least the relation to growth rate (Value_for_growthcolumn), and possibly for the carbon source with a good amount of experimental datapoints (Valuecolumn):
Parameter |
Description |
|---|---|
|
Identifier related to protein fraction associated with translational proteins |
|
Translational protein fraction at zero growth rate. [g_protein/g_CDW] |
|
Change in translational protein fraction per unit change of the associated reaction. |
|
Molar mass of the translational enzymes. [kDa] |
|
Range of values of susbstrate uptake which are associated with linear growth regime (no fermentation) |
2.3 Parametrizing the UnusedEnzymesSector
Notebook: Scripts/i1_preprocessing/0_unused_enzyme_determination.ipynb (personlization needed)
Input:
Default protein sector parameter file
Data/proteinAllocationModel_EnzymaticData_empty.xlsxorResults/1_preprocessing/proteinAllocationModel_EnzymaticData_<your-model>_yymmdd.xlsx(if you already created the active enzymes sector)Either one of the following (for the slope):
maximal growth rate of a mutant obtained after adaptive laboratory evolution (ALE) in the microbes preferred carbon source
assumption that the microorganism is similar enough to an organism which does have all from point 1. available. You can use the fractional increase of the growth rate after ALE to calculate the absolute maximal growth rate for your organism.
Either one of the following (for the intercept):
protein overexpression experiment in which the protein overexpression is quantified in g/g_totalprotein or g/gCDW and the growth rate for each expression strength is measured. These datapoints can be used to find the amount of unused protein at zero growth (37% for E. coli according to Bruggeman et al. (2020))
assumption that the microorganism is similar enough to an organism which does have all from point 1. available.
Output:
Excel file saved in
Results/1_preprocessing/proteinAllocationModel_EnzymaticData_<your-model>_yymmdd.xlsxSheet:
UnusedEnzymewith the following information for at least the relation to growth rate (Value_for_growthcolumn), and possibly for the carbon source with a good amount of experimental datapoints (Valuecolumn):
Parameter |
Description |
|---|---|
|
Identifier related to protein fraction associated with the unused enzyme sector |
|
Unused enzyme fraction at zero growth rate. [g_protein/g_CDW] |
|
Change in unused enzyme fraction per unit change of the associated reaction. |
|
Molar mass of unused enzymes. [kDa] |
|
Range of values of susbstrate uptake which are associated with linear growth regime (no fermentation) |
3. Design Choices
Now that all the parameters are parsed, there are several points to consider. In this step, it is important to keep your aim in mind. The genetic algorithm which forms the basis of the PAMparametrizer is only informed about the diffusion limit, and not about any other biophysical mechanism. It is therefore important to determine to which extent the protein content affects the metabolic phenotype of your microorganism (with the total protein content), what your starting position is (by scaling the kcat values), which conditions you are using for parametrization, and what data you keep aside for testing the results.
3.1 Total protein content
The total protein content is a key parameter as it determines how much protein the model can allocate to metabolic processes. It can be deduced from data or from any combination of the following assumptions:
Assumption: 50% of the proteome is allocated to housekeeping proteins
Assumption: the total protein content does not change with changing conditions
Assumption: use the metabolic protein fraction from a closely related organism (e.g., E. coli: 0.258 g/gCDW)
Quantitative proteomics data available? You can use the proteins in the model to determine the fraction of total protein mass measured which is in the model. Please correct this fraction for the amount of enzymes which is measured vs. the total amount of enzymes
For some (eukaryotic) organisms, the assumption that the total protein content does not change with changing conditions does not hold. In these cases, you can introduce a linear relationship between the amount of protein space and any model reaction by adding this as a custom sector (see the PAModelpy documentation).
3.2 Scaling kcat values
In some cases, the kcat values are underestimated during the initial model configuration. In this case,
it might take the genetic algorithm many iterations to increase all kcat values to prevent a too stringent protein
burden. To improve the starting position in the parametrization, you can consider to scale all kcat. You can check
which scaling factor would be most efficient for your application using the Scripts/i1_preprocessing/1_scale_kcats_AES.py script.
You can easily adapt the parameter file using the increase_kcats_in_parameter_file from PAModelpy in your PAMparametrizer setup method
function from.
3.3 Which conditions?
The quality and quantity of your experimental physiology measurements determine the quality of the parametrized PAM. The PAMparametrizer does not have any biophysical information to perform the parametrization, besides the ones you give it! If you are interested in a specific condition or phenotype, it makes sense to emphasize these datapoints in the parametrization by adding weights to the associated reactions. For example in the case of E. coli, we are interested in the overflow metabolic phenotype, so we want to make sure that acetate secretion starts at the right growth rate and increases with the right slope. We can inform the PAMparametrizer about this by adding weights to the growth rate and the acetate exchange reaction in the HyperParams data object (see PAMparamertizer setup instructions).
In this preparatory stage, you can overlay the experimental data you have with the conditions and phenotypes you want to model. The following questions can help you in your design choices:
Are there specific reactions that are representative of a specific phenotype or conditions?
Are you interested in growth on different carbon sources?
How much metabolic flexibility (and thus different pathways to parametrize) does your microbe have and how can you cover the entire range of pathways with your experimental measurements (e.g., different growth rates or carbon sources)?
3.4 Getting the validation data
Once you have decided on the conditions you want to use for parametrization, it is time to get the physiology data for these conditions in shape. Each carbon source gets its own ValidationData object as explained in the PAMparamertizer setup instructions. To easily transfer your measured fluxes to a validation data object, you need the following:
Units: mmol/gCDW/h
Mapping between the metabolites and the exchange reaction ids (such as
EX_glc__D_efor glucose uptake)Correct sign of measured fluxes according to the definition in the model (remember, all uptake fluxes should be negative!)
At least a substrate uptake rate and a growth rate per condition, preferably also O2 and CO2 exchange rates
All measurements for the same strain and medium composition (can be tricky if you are getting data from literature)
The best format to store the information is the following:
<Biomass reaction id> |
<substrate uptake id> |
<other exchane reaction id> |
|---|---|---|
…. |
Check the direction in the model! |
Check the direction in the model! |
Keep in mind that it is good practice to have separate training and validation datasets. In the end, you will need some data to check whether the PAM you generated performs well on unseen conditions. As experimental measurements are often limited, it is wise to use data that the PAMparametrizer cannot use as validation data. Such as:
Measurements with different cultivation strategies (autotrophy versus heterotrophy)
Measurements for growth on different cultivation media composition (minimal versus complex)
Intracellular flux measurements
4. Working with the PAMparametrizer
4.1 Building and running the PAMparametrizer
Instructions: PAMparamertizer setup instructions
Examples: Scripts/i2_parametrization
Input:
Genome-scale model (see step 1)
Excel file with initial sector parameters (
Results/1_preprocessing/proteinAllocationModel_EnzymaticData_<your-model>_yymmdd.xlsx, see step 2)Total protein content for the PAModel (see step 3.1)
PAM set up method (see PAModelpy documentation and examples in
Modules/utils/pam_generation.py)Sufficient experimentally measured exchange rates (see step 3.3 and step 3.4)
Other design choices defined as discussed in step 3
filename extension to easily find your simulation results back, can be defined in the HyperParams object
Running: pamparametrizer.run()
Output:
Results/2_parametrization/diagnostics/pam_parametrizer_diagnostics_<your-filename-extension>.xlsxResults/2_parametrization/progress/pam_parametrizer_progress_<your-filename-extension>.png
4.2 Creating an ensemble of models
A genetic algorithm is not a deterministic method, which means the changes in kcat values are different for each run. Although the alternative models have a comparable phenotype, it is important to always run the framework more than once and create what we call an ‘ensemble of models’. This ensemble represents the solution space in which the turnover numbers can deviate while still resulting in a specific metabolic phenotype. It can therefore be informative to create multiple models: not only are you more certain of the simulated phenotype, you can also test how dependent certain modifications to the model, strain or conditions are on the kcat values!
It is advisable to create at least 5 PAMs for your organism. The more PAMs, the more certain you can be about specific simulated behaviours. Nevertheless, as each parametrization can take anywhere between 5 and 30 hours on a normal desktop laptop, 5 different PAMs should sample the solution space properly without burdening your computer or HPC too much.
5. Analyzing the Paramaterization Results
Be aware that the PAMparametrizer and the genetic algorithm are two separate software objects. The PAMparametrizer is a workflow that utilizes the genetic algorithm to optimize the turnover number in a PAM. The PAMparametrizer uses the output of the genetic algorithm to improve the model simulations. To ensure there is a proper distinction between the two, the results are saved separately. Please bear this difference in mind when reading through the following sections.
5.1 The output
The PAMparametrizer stores the output of the parametrization process in two files:
Results/2_parametrization/diagnostics/pam_parametrizer_diagnostics_<your-filename-extension>.xlsxResults/2_parametrization/progress/pam_parametrizer_progress_<your-filename-extension>.png
The pam_parametrizer_diagnostics file contains the output of the genetic algorithm and some configurations,
such as weights on specific reactions and the configurations of the protein sectors for different outputs.
The pam_parametrizer_progress file shows the simulation results after every round of parametrization for a selected substrate
and selected reactions. It allows you a quick look into the simulation results and how they match to experimental observations.
It is the same plot that is shown when the PAMparametrizer is run in real time.
The pam_parametrizer_diagnostics file is build up out of different sheets, each storing information about a specific feature or process.
Below the content and layout of each sheet is discussed.
5.1.1 Best_Individuals
Format
run_id |
enzyme_id |
direction |
rxn_id |
kcat[s-1 |
ga_error |
|---|---|---|---|---|---|
iteration of the genetic algorithm |
enzyme id |
‘f’ or ‘b’ |
reaction id |
modified kcat value in 1/s |
final R |
Source: genetic algorithm
Interpretation:
Each time the genetic algorithm is run, it returns a list of reaction-direction-enzyme relationships and the optimized
kcat values (represented as the ‘best individual’ within a population of alternative kcat values.
The Best_Individual sheets stores the information about each run of the genetic algorithm, which means that a
reaction-direction-enzyme relationship can occur multiple times with different kcat values. This information
can be easily transformed to a parameter Excel file (Scripts/i3_analysis/create_new_AES_from_parametrization_results.py)
or directly to a PAM using the parameter file which was used during the parametrization
(create_pamodel_from_diagnostics_file from Modules/utils/pam_generation.py).
5.1.2 Computational_time
Format
run_id |
time_s |
time_h |
|---|---|---|
iteration of the genetic algorithm |
time in s |
time in h |
Source: PAMparametrizer
Interpretation: wall-clock time used for each iteration of the PAMparametrizer
5.1.3 Final_Errors
Format:
run_id |
r_squared |
|---|---|
iteration of the genetic algorithm |
final R |
Source: PAMparametrizer.parametrization_results
Interpretation:
Increasing R
5.1.4 sector_parameters
Format:
substrate_uptake_id |
slope |
intercept |
sector_id |
|---|---|---|---|
exchange reaction id of substrate |
slope between substrate uptake rate and protein fraction of this sector |
intercept of “” |
sector id as defined in the PAM |
source: PAMparametrizer.validation_data
Interpretation: The amount of protein is associated to a specific protein sector is dependent on the substrate uptake rate. The exact parametrization of this relation depends on the relation between substrate uptake rate and growth rate, as discussed in INSERT PUBLICATION SUPPLEMENTS. The PAMparametrizer calculates the slope and intercept for all substrate-dependent protein sectors during the parametrization. Also the amount of un- and underused enzymes at zero growth is optimized in the last step of the PAMparametrizer. This sheet stores this sector configuration for your convenience, so you can easily use it for building condition-dependent PAMs.
5.1.5 reaction_weights
Format:
reaction |
weight |
|---|---|
reaction id of reaction |
weight used for calculating R |
Source: PAMparametrizer.hyperparameters
Interpretation: As discussed in step 3.3, it is desired to implement a weighing scheme to parametrize certain (non-linear) conditions. With such a scheme you can prioritize the fit of the simulation results to specific reactions related to a specific phenotype. In this sheet, these configurations are stored, so you can trace back which settings were used to obtain a specific set of parameters.
5.2 Prebuilt analyses
The results of the PAMparametrizer can be analyzed in various ways: for each alternative parametrization you can study the prediction accuracy for growth on various conditions, you can see how the kcat values of the alternative relate to each other, you can study the sensitivities in the model, etc. In this repository, there are several tools to help you with this. Below, we provide an overview of the analyses which can be done.
5.2.1 kcat distribution
What |
example script |
function |
What you learn |
|---|---|---|---|
plotting kcat histogram |
|
|
Are the alternatives similar distributed? |
|
Does the mean kcat value agree with expectations |
||
statistics of kcat distribution |
|
|
same as previous |
functional distribution of kcat change |
|
|
In which pathway the kcats are changed most? |
5.2.2 Flux prediction accuracy
TODO
5.2.3 Protein Prediction accuracy
TODO