Parametrizing a PAM: from GEM to PAM

Building a Protein Allocation Model (PAM) from a genome-scale metabolic model (GEM) requires several inputs and design choices from the user. In this workflow, all the steps, requirements, and inputs are listed. All steps are, where possible, connected to scripts and notebooks which support the process of going from GEM to PAM. Let’s get started!

1. The Genome-Scale Metabolic Model Sanity Check

The basis of each PAM is a genome-scale metabolic model of high-quality. To create a PAM out of a GEM, the following requirements should be met:

100% mass balanced
100% charge balanced
Solid parametrization of the ATP maintenance (both growth- and non-growth-associated)
Model can be pickled and unpickled (if this is not the case, you might want to rename all metabolites which are called St)

The following is not required, but makes your life easier:

Reactions annotated with KEGG ids
Metabolites annotated with KEGG ids
Genes annotated with KEGG ids
>60% of all reactions should be annotated with gene-reaction associations

Also, make sure you have the following information

Regular expression to recognize gene-ids (e.g., r'\b([b|s]\d{4})\b' for the b1234 locus tags in E. coli)
Potential exceptions in identifier/mapping format (an example is s0001 which is used for unknown gene)

2. Building the Protein Sectors

You can find scripts to help you build the protein sectors in Scripts/i1_preprocessing. The scripts can be run in random order, although we suggest starting with the ActiveEnzymesSector.

2.1 Initiating the ActiveEnzymesSector

Notebook: Scripts/i1_preprocessing/0_parse_kcat_values_GotEnzymes.ipynb (personlization needed)

Input:

Genome-scale model
Default protein sector parameter file Data/proteinAllocationModel_EnzymaticData_empty.xlsx
UniProt mapping between genes and proteins (instructions in notebook)
All k_cats for the microorganism from GotEnzymes (instructions in notebook)

Defaults:

Reactions or genes which cannot be mapped to any protein identifier will be given a default id, length and molar mass : Enzyme_<gene> or Enzyme_<reaction>
Proteins which cannot be mapped to a k_cat value are assigned a default value of 13.7 (BRENDA mean value according to Bar-Even et al. (2011))
Each protein represents a functional enzyme: enzyme complexes are represented by ids which combine all subunits (peptide1_peptide2) and the molar masses and lengths are summed
Each row represents an individual enzyme-reaction relation

Output:

Excel file saved in Results/1_preprocessing/proteinAllocationModel_EnzymaticData_<your-model>_<yymmdd>.xlsx
Sheet: ActiveEnzymes with the following information

rxn_id	enzyme_id	direction	kcat_values	kegg_id	Reactants	Products	EC	GPR	gene	Length	molMass
str	str	Literal[‘f’,‘b’]	float	str	List[str]	List[str]	str	str	List[str]	int	int
model reaction id	UniProt protein id		in 1/s	reaction KEGG id	all reactants in KEGG id	all products in KEGG id	all EC numbers associated with the reaction	and/or relation between genes and reaction	list of genes coding for all peptides associated with the protein	length of protein in aa	molar mass of protein in kDa

2.2 Parametrizing the TranslationalProteinSector

Notebook: Scripts/i1_preprocessing/0_translational_sector_config.ipynb (personlization needed)

Input:

Genome-scale model
PAM set up method (see the PAModelpy documentation), examples are in Modules/utils/pam_generation.py
Default protein sector parameter file Data/proteinAllocationModel_EnzymaticData_empty.xlsx or Results/1_preprocessing/proteinAllocationModel_EnzymaticData_<your-model>_yymmdd.xlsx (if you already created the active enzymes sector)
Either one of the following:
1. proteomics measurements of the entire proteome [g/gCDW] or [g/gPROT], an estimate of the protein coverage by the measurement, the amount of protein per gCDW, and a functional annotation of the measured peptides (e.g., with cluster of orthologous genes annotations)
2. assumption that the microorganism is similar enough to an organism which does have all from point 1. available.

Output:

Excel file saved in Results/1_preprocessing/proteinAllocationModel_EnzymaticData_<your-model>_yymmdd.xlsx
Sheet: Translational with the following information for at least the relation to growth rate (Value_for_growth column), and possibly for the carbon source with a good amount of experimental datapoints (Value column):

Parameter	Description
`id_list`	Identifier related to protein fraction associated with translational proteins
`tps_0`	Translational protein fraction at zero growth rate. [g_protein/g_CDW]
`tps_mu`	Change in translational protein fraction per unit change of the associated reaction.
`mol_mass`	Molar mass of the translational enzymes. [kDa]
`substrate_range`	Range of values of susbstrate uptake which are associated with linear growth regime (no fermentation)

2.3 Parametrizing the UnusedEnzymesSector

Notebook: Scripts/i1_preprocessing/0_unused_enzyme_determination.ipynb (personlization needed)

Input:

Default protein sector parameter file Data/proteinAllocationModel_EnzymaticData_empty.xlsx or Results/1_preprocessing/proteinAllocationModel_EnzymaticData_<your-model>_yymmdd.xlsx (if you already created the active enzymes sector)
Either one of the following (for the slope):
1. maximal growth rate of a mutant obtained after adaptive laboratory evolution (ALE) in the microbes preferred carbon source
2. assumption that the microorganism is similar enough to an organism which does have all from point 1. available. You can use the fractional increase of the growth rate after ALE to calculate the absolute maximal growth rate for your organism.
Either one of the following (for the intercept):
1. protein overexpression experiment in which the protein overexpression is quantified in g/g_totalprotein or g/gCDW and the growth rate for each expression strength is measured. These datapoints can be used to find the amount of unused protein at zero growth (37% for E. coli according to Bruggeman et al. (2020))
2. assumption that the microorganism is similar enough to an organism which does have all from point 1. available.

Output:

Excel file saved in Results/1_preprocessing/proteinAllocationModel_EnzymaticData_<your-model>_yymmdd.xlsx
Sheet: UnusedEnzyme with the following information for at least the relation to growth rate (Value_for_growth column), and possibly for the carbon source with a good amount of experimental datapoints (Value column):

Parameter	Description
`id_list`	Identifier related to protein fraction associated with the unused enzyme sector
`ups_0`	Unused enzyme fraction at zero growth rate. [g_protein/g_CDW]
`ups_mu`	Change in unused enzyme fraction per unit change of the associated reaction.
`mol_mass`	Molar mass of unused enzymes. [kDa]
`substrate_range`	Range of values of susbstrate uptake which are associated with linear growth regime (no fermentation)

3. Design Choices

Now that all the parameters are parsed, there are several points to consider. In this step, it is important to keep your aim in mind. The genetic algorithm which forms the basis of the PAMparametrizer is only informed about the diffusion limit, and not about any other biophysical mechanism. It is therefore important to determine to which extent the protein content affects the metabolic phenotype of your microorganism (with the total protein content), what your starting position is (by scaling the k_cat values), which conditions you are using for parametrization, and what data you keep aside for testing the results.

3.1 Total protein content

The total protein content is a key parameter as it determines how much protein the model can allocate to metabolic processes. It can be deduced from data or from any combination of the following assumptions:

Assumption: 50% of the proteome is allocated to housekeeping proteins
Assumption: the total protein content does not change with changing conditions
Assumption: use the metabolic protein fraction from a closely related organism (e.g., E. coli: 0.258 g/gCDW)
Quantitative proteomics data available? You can use the proteins in the model to determine the fraction of total protein mass measured which is in the model. Please correct this fraction for the amount of enzymes which is measured vs. the total amount of enzymes

For some (eukaryotic) organisms, the assumption that the total protein content does not change with changing conditions does not hold. In these cases, you can introduce a linear relationship between the amount of protein space and any model reaction by adding this as a custom sector (see the PAModelpy documentation).

3.2 Scaling kcat values

In some cases, the k_cat values are underestimated during the initial model configuration. In this case, it might take the genetic algorithm many iterations to increase all k_cat values to prevent a too stringent protein burden. To improve the starting position in the parametrization, you can consider to scale all k_cat. You can check which scaling factor would be most efficient for your application using the Scripts/i1_preprocessing/1_scale_kcats_AES.py script.

You can easily adapt the parameter file using the increase_kcats_in_parameter_file from PAModelpy in your PAMparametrizer setup method function from.

3.3 Which conditions?

The quality and quantity of your experimental physiology measurements determine the quality of the parametrized PAM. The PAMparametrizer does not have any biophysical information to perform the parametrization, besides the ones you give it! If you are interested in a specific condition or phenotype, it makes sense to emphasize these datapoints in the parametrization by adding weights to the associated reactions. For example in the case of E. coli, we are interested in the overflow metabolic phenotype, so we want to make sure that acetate secretion starts at the right growth rate and increases with the right slope. We can inform the PAMparametrizer about this by adding weights to the growth rate and the acetate exchange reaction in the HyperParams data object (see PAMparamertizer setup instructions).

In this preparatory stage, you can overlay the experimental data you have with the conditions and phenotypes you want to model. The following questions can help you in your design choices:

Are there specific reactions that are representative of a specific phenotype or conditions?
Are you interested in growth on different carbon sources?
How much metabolic flexibility (and thus different pathways to parametrize) does your microbe have and how can you cover the entire range of pathways with your experimental measurements (e.g., different growth rates or carbon sources)?

3.4 Getting the validation data

Once you have decided on the conditions you want to use for parametrization, it is time to get the physiology data for these conditions in shape. Each carbon source gets its own ValidationData object as explained in the PAMparamertizer setup instructions. To easily transfer your measured fluxes to a validation data object, you need the following:

Units: mmol/gCDW/h
Mapping between the metabolites and the exchange reaction ids (such as EX_glc__D_e for glucose uptake)
Correct sign of measured fluxes according to the definition in the model (remember, all uptake fluxes should be negative!)
At least a substrate uptake rate and a growth rate per condition, preferably also O2 and CO2 exchange rates
All measurements for the same strain and medium composition (can be tricky if you are getting data from literature)

The best format to store the information is the following:

<Biomass reaction id>	<substrate uptake id>	<other exchane reaction id>
….	Check the direction in the model!	Check the direction in the model!

Keep in mind that it is good practice to have separate training and validation datasets. In the end, you will need some data to check whether the PAM you generated performs well on unseen conditions. As experimental measurements are often limited, it is wise to use data that the PAMparametrizer cannot use as validation data. Such as:

Measurements with different cultivation strategies (autotrophy versus heterotrophy)
Measurements for growth on different cultivation media composition (minimal versus complex)
Intracellular flux measurements

4. Working with the PAMparametrizer

4.1 Building and running the PAMparametrizer

Instructions: PAMparamertizer setup instructions Examples: Scripts/i2_parametrization

Input:

Genome-scale model (see step 1)
Excel file with initial sector parameters (Results/1_preprocessing/proteinAllocationModel_EnzymaticData_<your-model>_yymmdd.xlsx, see step 2)
Total protein content for the PAModel (see step 3.1)
PAM set up method (see PAModelpy documentation and examples in Modules/utils/pam_generation.py)
Sufficient experimentally measured exchange rates (see step 3.3 and step 3.4)
Other design choices defined as discussed in step 3
filename extension to easily find your simulation results back, can be defined in the HyperParams object

Running: pamparametrizer.run()

Output:

Results/2_parametrization/diagnostics/pam_parametrizer_diagnostics_<your-filename-extension>.xlsx
Results/2_parametrization/progress/pam_parametrizer_progress_<your-filename-extension>.png

4.2 Creating an ensemble of models

A genetic algorithm is not a deterministic method, which means the changes in k_cat values are different for each run. Although the alternative models have a comparable phenotype, it is important to always run the framework more than once and create what we call an ‘ensemble of models’. This ensemble represents the solution space in which the turnover numbers can deviate while still resulting in a specific metabolic phenotype. It can therefore be informative to create multiple models: not only are you more certain of the simulated phenotype, you can also test how dependent certain modifications to the model, strain or conditions are on the k_cat values!

It is advisable to create at least 5 PAMs for your organism. The more PAMs, the more certain you can be about specific simulated behaviours. Nevertheless, as each parametrization can take anywhere between 5 and 30 hours on a normal desktop laptop, 5 different PAMs should sample the solution space properly without burdening your computer or HPC too much.

5. Analyzing the Paramaterization Results

Be aware that the PAMparametrizer and the genetic algorithm are two separate software objects. The PAMparametrizer is a workflow that utilizes the genetic algorithm to optimize the turnover number in a PAM. The PAMparametrizer uses the output of the genetic algorithm to improve the model simulations. To ensure there is a proper distinction between the two, the results are saved separately. Please bear this difference in mind when reading through the following sections.

5.1 The output

The PAMparametrizer stores the output of the parametrization process in two files:

Results/2_parametrization/diagnostics/pam_parametrizer_diagnostics_<your-filename-extension>.xlsx
Results/2_parametrization/progress/pam_parametrizer_progress_<your-filename-extension>.png

The pam_parametrizer_diagnostics file contains the output of the genetic algorithm and some configurations, such as weights on specific reactions and the configurations of the protein sectors for different outputs. The pam_parametrizer_progress file shows the simulation results after every round of parametrization for a selected substrate and selected reactions. It allows you a quick look into the simulation results and how they match to experimental observations. It is the same plot that is shown when the PAMparametrizer is run in real time.

The pam_parametrizer_diagnostics file is build up out of different sheets, each storing information about a specific feature or process. Below the content and layout of each sheet is discussed.

5.1.1 Best_Individuals

Format

run_id	enzyme_id	direction	rxn_id	kcat[s-1	ga_error
iteration of the genetic algorithm	enzyme id	‘f’ or ‘b’	reaction id	modified kcat value in 1/s	final R2 value of the genetic algorithm

Source: genetic algorithm

Interpretation: Each time the genetic algorithm is run, it returns a list of reaction-direction-enzyme relationships and the optimized k_cat values (represented as the ‘best individual’ within a population of alternative k_cat values. The Best_Individual sheets stores the information about each run of the genetic algorithm, which means that a reaction-direction-enzyme relationship can occur multiple times with different k_cat values. This information can be easily transformed to a parameter Excel file (Scripts/i3_analysis/create_new_AES_from_parametrization_results.py) or directly to a PAM using the parameter file which was used during the parametrization (create_pamodel_from_diagnostics_file from Modules/utils/pam_generation.py).

5.1.2 Computational_time

Format

run_id	time_s	time_h
iteration of the genetic algorithm	time in s	time in h

Source: PAMparametrizer

Interpretation: wall-clock time used for each iteration of the PAMparametrizer

5.1.3 Final_Errors

Format:

run_id	r_squared
iteration of the genetic algorithm	final R2 after reparametrization

Source: PAMparametrizer.parametrization_results

Interpretation: Increasing R2 with the iteration means that the fit of the simulation results to the experimental measurement is iteratively improving. Please note that the R2 of the genetic algorithm can be different, due to weighing of specific reactions. The weighing is only used in the genetic algorithm. In the genetic algorithm the R2 expresses the fitness of a specific solution, and this value is optimized, while it is only used as a way to quantify the progress in the PAMparametrizer.

5.1.4 sector_parameters

Format:

substrate_uptake_id	slope	intercept	sector_id
exchange reaction id of substrate	slope between substrate uptake rate and protein fraction of this sector	intercept of “”	sector id as defined in the PAM

source: PAMparametrizer.validation_data

Interpretation: The amount of protein is associated to a specific protein sector is dependent on the substrate uptake rate. The exact parametrization of this relation depends on the relation between substrate uptake rate and growth rate, as discussed in INSERT PUBLICATION SUPPLEMENTS. The PAMparametrizer calculates the slope and intercept for all substrate-dependent protein sectors during the parametrization. Also the amount of un- and underused enzymes at zero growth is optimized in the last step of the PAMparametrizer. This sheet stores this sector configuration for your convenience, so you can easily use it for building condition-dependent PAMs.

5.1.5 reaction_weights

Format:

reaction	weight
reaction id of reaction	weight used for calculating R2 in the genetic algorithm

Source: PAMparametrizer.hyperparameters

Interpretation: As discussed in step 3.3, it is desired to implement a weighing scheme to parametrize certain (non-linear) conditions. With such a scheme you can prioritize the fit of the simulation results to specific reactions related to a specific phenotype. In this sheet, these configurations are stored, so you can trace back which settings were used to obtain a specific set of parameters.

5.2 Prebuilt analyses

The results of the PAMparametrizer can be analyzed in various ways: for each alternative parametrization you can study the prediction accuracy for growth on various conditions, you can see how the k_cat values of the alternative relate to each other, you can study the sensitivities in the model, etc. In this repository, there are several tools to help you with this. Below, we provide an overview of the analyses which can be done.

5.2.1 k_cat distribution

What	example script	function	What you learn
plotting kcat histogram	`Scripts/i3_analysis/flux_kcat_distribution.py`	`create_kcat_histogram_old_vs_new`	Are the alternatives similar distributed?
		`create_kcat_joyplot_old_vs_new`	Does the mean kcat value agree with expectations
statistics of kcat distribution	`Scripts/i3_analysis/flux_kcat_distribution.py`	`create_kcat_histogram_old_vs_new`	same as previous
functional distribution of kcat change	`Figures/Scripts/Figure1_iml1515_kcat_analysis.py`	`create_kcat_change_per_cog_barplot`	In which pathway the kcats are changed most?

5.2.2 Flux prediction accuracy

TODO

5.2.3 Protein Prediction accuracy

TODO

Parametrizing a PAM: from GEM to PAM

1. The Genome-Scale Metabolic Model Sanity Check

2. Building the Protein Sectors

2.1 Initiating the ActiveEnzymesSector

2.2 Parametrizing the TranslationalProteinSector

2.3 Parametrizing the UnusedEnzymesSector

3. Design Choices

3.1 Total protein content

3.2 Scaling kcat values

3.3 Which conditions?

3.4 Getting the validation data

4. Working with the PAMparametrizer

4.1 Building and running the PAMparametrizer

4.2 Creating an ensemble of models

5. Analyzing the Paramaterization Results

5.1 The output

5.1.1 Best_Individuals

5.1.2 Computational_time

5.1.3 Final_Errors

5.1.4 sector_parameters

5.1.5 reaction_weights

5.2 Prebuilt analyses

5.2.1 kcat distribution

5.2.2 Flux prediction accuracy

5.2.3 Protein Prediction accuracy

5.2.1 k_cat distribution