Accelerating metabolic models evaluation with statistical metamodels: application to Salmonella infection models

14 Mathematical and numerical models are increasingly used in microbial ecology to model the 15 fate of microbial communities in their ecosystem. These models allow to connect in a mechanistic 16 framework species-level informations, such as the microbial genomes, with macro-scale features, such 17 as species spatial distributions or metabolite gradients. Numerous models are built upon species-18 level metabolic models that predict the metabolic behaviour of a microbe by solving an optimization 19 problem knowing its genome and its nutritional environment. However, screening the community 20 dynamics with these metabolic models implies to solve such an optimization problem by species at 21 each time step, leading to a significant computational load further increased by several orders of 22 magnitude when spatial dimensions are added. 23 In this paper, we propose a statistical framework based on Reproducing Kernel Hilbert Space 24 (RKHS) metamodels that are used to provide fast approximations of the original metabolic model. 25 The metamodel can replace the optimization step in the system dynamics, providing comparable 26 outputs at a much lower computational cost. We will first build a system dynamics model of a 27 simplified gut microbiota composed of a unique commensal bacterial strain in interaction with the 28 host and challenged by a Salmonella infection. Then, the machine learning method will be intro-29 duced, and particularly the ANOVA-RKHS that will be exploited to achieve variable selection and 30 model parsimony. A training dataset will be constructed with the original system dynamics model 31 and hyper-parameters will be carefully chosen to provide fast and accurate approximations of the 32 original model. Finally, the accuracy of the trained metamodels will be assessed, in particular by 33 comparing the system dynamics outputs when the original model is replaced by its metamodel. 34 The metamodel allows an overall relative error of 4 . 71% but reducing the computational load by a 35 speed-up factor higher than 45, while correctly reproducing the complex behaviour occurring dur-36 ing Salmonella infection. These results provide a proof-of-concept of the potentiality of machine 37 learning methods to give fast approximations of metabolic model outputs and pave the way towards 38 PDE-based spatio-temporal models of microbial communities including microbial metabolism and 39 host-microbiota-pathogen interactions. 40


Introduction
Modelling in microbial ecology.Microbial ecology focuses on the study of microbial communities, called microbiota, interacting with their environment and regulated by the microbiota host [32,5].The gut microbiota is such a symbiotic ecosystem composed of a community of hundreds of microbial species living in the large intestine lumen, referred to as the commensals, and regulated by the epithelial cells of the host colon.The main drivers of the microbiota dynamics are the metabolism of each microbial species, the interactions between micro-organisms and their spatio-temporal interactions with the host.
In the specific case of a pathogenic infection, a new player disturbs the system and tries to shift the microbial environment from an healthy homeostasis favourable to the commensals towards a dysbiotic situation favourable to the pathogen, enabling its colonization [27,4].The concept of pathobiome has been introduced [35] as an analysis framework to describe the specific interactions between the commensal microbiota, the host and the pathogen leading to pathogenic infection.
Mathematical and numerical models of the gut microbiota have been recognized as suitable tools for providing mechanistic interpretations of biological observations, predicting the evolution of these ecosystems, for example in pathological situations, or defining controlling actions to lead them towards a targeted state [37,15,36,21].Mathematical models in microbial ecology are population dynamics models describing the microbial population growth, i.e. their metabolism, microbe-microbe interactions and interactions with their environment, in particular the available nutrients.
FBA framework to model microbial metabolism.A classical modelling framework to represent the microbial metabolism is Flux Balance Analysis (FBA) [24,29].FBA relies on metabolic models inferred from microorganism genome: the genes are annotated to identify the biochemical reactions they code for and the whole set of reactions is combined into a genome-scale metabolic network connecting the substrate metabolites the microorganism is able to metabolize to the synthesized biomass and endproducts produced by the microbe.
Namely, if we note (m i ) 1⩽i⩽Nm the set of the N m metabolites that can be found in a micro-organism, and (r j ) 1⩽j⩽Nr the set of the N r reactions coded in the genome, then mass conservation equations can be written on the internal concentration of the metabolites : In this equation, R(m i ) is the subset of reactions involving the metabolite m i , θ mi,j is the stoichiometric coefficient of the metabolite m i in the reaction j (negative for consumption reaction, and positive for production reaction) and ν j is the reaction flux, i.e. the quantity of metabolite involved in the reaction by time and microbial biomass units (the flux unit is mmol.h−1 .g−1 ).In FBA models, an additional fictitious biochemical reaction is considered: the biomass reaction r b , with its corresponding fictional molecule b representing biomass.This comes from an abstraction of the mean content of the cell, and the energetic cost to synthesize it, see for example the works of Battley et al. [2].This reaction connects the biomass precursors to the biomass b with the chemical equation where θ mi,r b is the stoichiometric coefficient of metabolite m i in the biomass reaction r b and M (b) is the subset of metabolites m i that constitute the biomass, i.e. the metabolites needed by the microorganism for growth (to duplicate the genomic material, the metabolism machinery, the cellular membrane, etc...).
The metabolic flux flowing through this biomass equation is noted ν b and is then the amount of microbial biomass produced by time and biomass unit, with unit (g.h −1 .g−1 by convention, or h −1 ).
The FBA models aim to predict this growth rate ν b while observing biological constraints such as the mass conservation equations (1).To achieve this prediction, the FBA framework makes important simplifying assumptions: 1) Steady-state assumption.All internal metabolites are assumed to be at steady-state in the cell, so that the mass conservation equation (1) reduces to a linear system on the flux vector ν := (ν j ) 1⩽j⩽Nr gathering the fluxes of the N r reactions of the metabolic network, where A is the reaction matrix, i.e. the matrix of dimension N m ×N r with A ij := θ mi,j the stoichiometric coefficient of metabolite i in the reaction j, gathering the whole set of conservation equations for the metabolites and reactions involved in the metabolic network; 2) Biomass maximization.The microbes are assumed to be instantaneously maximizing the biomass production in a given nutritional context; 3) Flux constraints.Every flux are constrained by intrinsic limits, related for example to metabolite transporter capacities, or known enzymatic efficiency.These limits are noted c min and c max so that Hence, the biomass production and all the metabolic fluxes in the microbial machinery can be predicted with the constrained optimization FBA problem find ν * ∈ R Nr , such that ν * := This problem searches for the optimal growth rate represented by the component ν b , which is the biomass formation flux.It is obtained by the system under mass-balance and flux constraints.Mathematically speaking, this optimization problem is linear and can be solved using linear programming: very efficient solvers exist for such a problem, even for high dimensional problems like this one, where N r is classically around several thousands.A classical FBA toolbox is the Cobra toolbox (in Matlab environment) [11] or its python equivalent Cobrapy [9].
Nutritional environment described as constraints on uptake fluxes.Important FBA model parameters are constraints on substrate flux from the extracellular compartment into the intracellular compartment, i.e. the first reactions of the metabolic network, enabling nutrients to enter the microbial cell.These constraints represent the possible uptake for the microorganism, hence representing a proxy of the microbe nutritional environment, i.e. the available nutrients for the microbial species to activate its metabolism.
The uptake reactions are exchange reactions, i.e. reactions at the interface between the intra and extracellular media.Indeed, by construction, exchange reactions are reactions between the extracellular pool m i , i.e. the nutritional environment, and the intracellular pool m i of the corresponding metabolite.
If we note c ≤ ν up ≤ 0, we get a mapping F s between c (up) s and the FBA solution for the bacterial strain s where ν * is the FBA solution with the constraints c (up) s for the strain s.This mapping allows to tune the uptake constraints to adapt the FBA prediction to a specific nutritional environment context.We note that by convention, uptake fluxes are negative due to the exchange reaction orientation.
Dynamic FBA Eq. 4 can be used as the second member of an ordinary differential equation (ODE) to compute the growth or consumption rates of a population dynamics equation in a framework termed dynamic FBA or dFBA [19].Let us introduce a generic dFBA model describing the dynamics of a microbial population density b growing on a substrate of density s with metabolic fluxes described by a FBA model 2 and the resulting mapping 4. We have In this equation, c In the sequel, we will simplify the notations by noting The dFBA framework is used in an increasing number of system biology models of the gut microbiota [18,7].However, dFBA involves the resolution of many FBA optimization problems during the time integration inducing high computational costs that can lead to intractable computations when the dFBA is repeated multiple times, like in several intensive numerical applications such as sensitivity analysis, inference or PDEs, advocating for reduction method.
Outline of the paper.This paper aims to 1) adapt a metamodeling method to the context of metabolic models to accelerate the computation of a population dynamics model coupled to a FBA model such as Eq. ( 5), 2) benchmark this method in the specific context of an ODE-based model of the gut environment during the infection of an enteric pathogen: Salmonella enterica Typhimurium.We want to substitute the FBA optimization problem solved at each time step by an approximate model, built with a Reproducing Kernel Hilbert Space (RKHS) metamodeling method.The RKHS metamodel is a machine learning approach: an approximation of the model image is built from the model evaluation in a sample of the state space (i.e. a learning database).This metamodel will be used to predict the model response for new points outside the learning database, with a faster computation than the original optimization problem.
First, we will set up the general framework of the accelerated model using eq.(5) as a toy example to introduce the essential mathematical results for RKHS metamodeling in Sec. 2.Then, we will use the acceleration method on a more evolved population dynamics model of Salmonella infection with the host response in Sec. 3.This population dynamic model will be used to produce a learning database to train the metamodel in Sec. 4. Next, the hyperparameters of the learning method will be selected in Sec.
5 in order to provide a good trade-off between prediction accuracy and computation speed.Finally, the RKHS metamodel will be derived with the selected hyperparameters and its accuracy will be assessed in Sec. 6. See Fig. 1 for a sketch image of the overall methodology.
2 Mathematical framework for the RKHS metamodel

Accelerating a dFBA with a metamodel: general methodology
To accelerate the computation of problem 5, we speed up the evaluation of F b by using a metamodel Fb , resulting in an overall acceleration for the time integration of (5) (see Fig. 1, left panel).Namely, we solve the following problem.
where Fb is the best approximation of F b in a particular functional space, here a specific RKHS called ANOVA-RKHS.We now precise the mathematical framework we use by introducing important results for the global understanding of RKHS metamodeling.We next introduce ANOVA-RKHS that will be used for variable selection.These results are however classical, and we do not provide their proof that can be found in the corresponding references.The main contribution of the paper is the specific adaptations needed for the application of ANOVA-RKHS metamodels to the context of microbial population dynamics models, and in particular the context-specific learning database construction, hyperparameter tuning and selection criteria that will be crucial for tailoring a trade-off between metamodel accuracy and speed-up (see Fig. 1, right panel).

Metamodeling and Hoeffding decomposition
Let us set up the context of metamodeling for metabolic models.We consider X a N up -dimensional random vector of possible metabolic constraints for the FBA model inputs with known distribution where Y s is a N r -dimensional vector and s an index designating the bacterial strain related to the FBA model.In this paper, we will consider real-valued meta-models.For a given 1 ≤ j ≤ N r and a given strain s, building the meta-model m j of the real-valued function F s,j amounts to solve in a given functional space H ⊂ L 2 (P X ), the non-parametric Gaussian regression model [13] Y s j = m j (X) + σε (9) where ε ∼ N (0, 1) is independent of (X) and the variance σ 2 is unknown.
H K ?Hyperparameters  The dFBA framework (upper panel) is defined by the coupling of a FBA metabolic model with a dynamic system.Numerically, this remains to loop over a time integration scheme in which a FBA is solved at each time step.We propose a new framework (lower panel) where the FBA model is replaced by a low-computational-cost metamodel speeding up the time integration process.Right panel: metamodeling framework.We set up the general statistical framework where the flux Y is the output of the FBA model F given the input X.
We then assemble a learning dataset by sampling the input space (X i ) and computing the corresponding FBA output Y i with N obs observations.The metamodel is then defined as the solution of a non-linear non-parametric regression problem in a finite dimensional functional space H K of dimension K with regularization function G.In practice, we will choose a group-lasso regularization to perform feature selection together with the metamodel computation.This regression problem has two hyperparameters to be chosen: the regularization parameter µ, that will tune the number of selected input variables, and the dimension K of the functional space, which is related to the number of observations in the RKHS framework (see Sec. 2 and 5).Selecting lower number of features or lower K decreases the computation load of the metamodel evaluation in a new unseen point x unseen and thus accelerates the ODE model integration but decreases the metamodel accuracy: a trade-off must be sought (see Sec. 5).
where p is a multi-index, P the power set of {1, • • • , N up }, x p denotes the vector with components x j for j ∈ p.The functions m p are L 2 (P X ) functions centered and orthogonal in L 2 (P X ), so that the variance of m j can be decomposed with The Hoeffding decomposition is used to separate principal effects (the function m j,p that involve one unique input variable x i ) from variable interactions (the functions m j,p with |p| > 1, i.e. involving more than one input component).The Hoeffding decomposition is widely used for sensitivity analysis, since Sobol index directly derives from it, or for variable selection: the relative contribution of the functions m j,p in the Hoeffding decomposition allows to neglect the less contributive terms which can lead to discard some input variables if all the functions they are involved in are neglected.

Generalities on RKHS metamodel
Let X be a compact subset of R N up .A definite symmetric kernel is a function The Moore-Aronszajn's theorem ensures a bijective mapping between the space of positive definite kernels and specific Hilbert spaces termed Reproducing Kernel Hilbert spaces (or RKHS).
Theorem 1 (Moore-Aronszajn [1]).Setting k : X × X → R a symmetric positive definite kernel, there exists a unique Hilbert space H k of real-valued functions on X defined as the completion of that endows H k .The kernel k is termed the Reproducing kernel of the RKHS H k .
Reciprocally, if H is a Hilbert space of functions f : X → R endowed with its inner product noted ⟨•, •⟩ H , and if ∀x ∈ X the functional f → f (x) is continuous on H, then H is a RKHS [6].The reproducing kernel of H can be exhibited according to the Riesz theorem: for all x ∈ X , there exists a unique and we have by construction the reproducing property The RKHS framework is very powerful to approximate solutions of the non-linear regression problem 9 on the basis of Namely, we will address the problem of finding where g is a strictly increasing function allowing to regularize the regression problem.As H k is a functional space of a priori infinite dimension, this problem must be discretized to be solved.In the RKHS framework, the Representer theorem reduces this problem to a N obs -dimensional minimization Theorem 2 (Representer Theorem [30]).Any function m j ∈ H k minimizing equation (10) admits a representation of the form so that problem (10) can be replaced by finding or, in vectorial form where K is the Gram matrix obtained with the kernel k and The inference of α * uniquely defines the metamodel m * which can be evaluated in a new point We note that the computational load of eq. ( 13) linearly depends on N obs .

ANOVA-RKHS
In (12), multidimensional kernels can be chosen to assemble the matrix K, resulting in a simple regression problem if g = Id.However, in the context of metabolic modelling, vectors X can be of high dimension (a.e. in our application N up = 9) implying a large number N obs of samples in the learning set to cover this high dimension space.Thinking in term of computational budget for the evaluation of eq. ( 13) which is linearly tuned by N obs , it is appealing to reduce N up and thus the dimension of the space of state variable involved in the metamodel.For a fixed number N obs allowed by the computational budget, the metamodel approximation accuracy is expected to be better in a reduced state variable space (see Sec. 7.3 for a deeper discussion on this aspect): we then adopt a more evolved method based on variable selection framework introduced in [13] and based on a very specific RKHS introduced in [8], the ANOVA-RKHS.
The ANOVA-RKHS H is built as a direct sum of sub-RKHS H p so that a given function f ∈ H will have for Hoeffding decomposition its decomposition on the subspaces H p .Using the ANOVA-RKHS, we will build a metamodel only involving the most significant state variables (i.e. a reduced number N up ), reducing the input space dimension and thus increasing the metamodel accuracy and the computational speed-up for a given computational budget fixed by N obs .The goal of the ANOVA-RKHS is not to accelerate the metamodel computation in (12), but rather to speed up the metamodel evaluation in an unseen point in (13).
Let us note corresponding RKHS H a are chosen on X a , with the additional properties: 1) The RKHS H a can be decomposed as where the kernel associated to the RKHS H 0a being defined as follows [13] p.8: The ANOVA kernel is finally defined by . The corresponding RKHS is finally where H p is the RKHS associated to k p .Let us now take any function f in the ANOVA-RKHS H.We get by the reproducing property and linearity As the functions f p are centered and uncorrelated by construction, this decomposition is also the Hoeffding decomposition of f .This setting will be used for variable selection: in the following, the numerical problem will be set up, with a group-lasso regularization that will select the important variables and variables interactions.

Discretization of the regression problem and metamodel construction
From the representer theorem 2 and the ANOVA-RKHS reproducing property in eq. ( 17), we can state the following finite dimension parametric regression problem: for a given 1 ⩽ j ⩽ N r and a given bacterial strain s, find θs 0,j , ( θs p,j ) p∈P := arg min with K p ∈ R N obs ×N obs the Gram matrix such that (K p j1,j2 ) 1≤j1,j2≤N obs = k p (c j1 , c j2 ), the value of the kernel k p evaluated at constraint points c j1 and c j2 .In this equation, the norm . The term G is a regularization term that writes: with µ an hyperparameter and W some weight matrix.
If the weight matrix is A composite criteria can be chosen such as the ridge group sparse criteria In this exploratory study, we set W = Id, leading to a group-lasso criteria.
To compute K p , a numerical version of the ANOVA-RKHS is needed, and in particular the computation of k a0 and the integrals in Eq. ( 14).These integrals are computed empirically for all 1 ≤ i ≤ N obs once for all and stored for further use with the formulas: where X i is the i-th row of the learning database X and a is the mono-dimensional index.Note that these integrals are respectively mono and bi-dimensional, which limits the computational load.
This estimation problem is a N obs × |P| + 1-dimensional optimization problem, which can be numerically expensive if N up and N obs are large.The problem can be reduced by considering interactions up to a certain order.However, the minimization problem is done off-line once for all.Then, the function F s,j can be approximated in a new point c(up) in the input parameter space by Fs,j (c (up) ) defined with the explicit formula Fs,j (c (up) ) := θs 0,j where F p (c (up) ) is the N obs dimensional vector 1⩽i⩽N obs i.e., the evaluation of the k p kernel at c(up) and the N obs learning set points X i .This analytical formula is fast to compute: it has the complexity of a dot product once k p are evaluated.In practice, we will use Matern kernels for kernels k a , a ∈ {1, • • • , N up } , the parameters of the Matern kernel being fixed to a priori values so that the kernel are computed with the formula (c Note that k 0a in eq. ( 14) is needed to compute k p : the computation of the first integral in eq. ( 14) is done empirically through eq. ( 19) while the others have been computed once for all and stored, reducing the computation time.

Population dynamics model of Salmonella infection, including host inflammatory response
We now contextualize the previous methodology to a dynamic system describing Salmonella infection in the gut lumen.This application example is a sound benchmark to show the potentiality of our method because 1) it is a representative example of the intrinsic complexity of a system biology model of the gut by involving two different metabolic models and 10 metabolites screened in time in two compartments, 2) the model involves stiff dynamics after infection, making it sensitive to flux approximation errors and thus more difficult to approximate by a metamodel.

Biological context of Salmonella infection.
Salmonella Thyphimurium uses a very complex mechanism to invade the gut.Let us characterized the healthy gut homeostasis: it will emphasize by contrast how the pathogen colonizes the intestine lumen.
Healthy gut.The environment of a healthy gut is anaerobic: the commensal micro-organisms are then specialized microbes relying on anaerobic metabolism to grow without oxygen.Actually, a main part of the gut microbiota are strictly anaerobic, meaning that oxygen is harmful to them.With this anaerobic metabolism, the commensal microbiota consumes fibre-derivated sugars (e.g.. glucose and galactose) and produces short-chain fatty acids (SCFA) -mainly butyrate, acetate and propionatethat are absorbed by the host for its own metabolism.The main energetic source for the intestinal cells is butyrate, which is metabolized together with the oxygen carried to the intestine by the blood system.
A virtuous cycle is then set up (see Figure 2a): the commensal microbiota produces butyrate that is metabolized by the host with oxygen; consequently, this oxygen does not diffuse to the lumen ensuring hypoxia and a favorable habitat for the butyrate-producing anaerobes.Salmonella is not very efficient in an anaerobic environment: the pathogen will have to hack this regulation mechanism, in order to create a favorable niche and permit the invasion of the gut.[4,27] Colonized gut.When arrived at the gut lumen, the pathogen releases a virulence factor (sipA) that triggers an inflammation in the epithelial cells (see Figure 2b).The host cells produce neutrophils: these immune cells are sent into the gut lumen where they trap any bacteria they encounter (pathogenic bacteria but also SCFA-producing symbionts).Then, the production of butyrate decreases, and this metabolite is no longer available for the epithelial cells: the oxygen reaching the cells is no longer metabolized and starts flowing in the gut lumen.This oxygen will be harmful for the butyrate-producing anaerobes, which initiates a vicious circle.The oxygen will also oxydize nutrients present in the gut, providing very efficient energetic sources for the pathogen alone, allowing it to take over from the commensal bacteria.Namely, galactose, glucose and thiosulfate will be oxydized into galactarate, glucarate end tetrathionate.In the meantime, inflammation induces the production of nitric oxyde, which is oxydated in nitrate, also very favorable for the pathogen [4,27].Figures sketching these mechanisms can be found in Fig. (2a-2b).(a) Healthy gut at homeostasis: the colon lumen is hypoxic, so that commensal microbiota produces butyrate from sugars, which is consumed by the host with the blood-stream oxygen, regulating anaerobia.(b) Salmonella colonization process: the pathogen triggers inflammation, decreasing commensals levels.Butyrate production drops down, reducing availability for the host.Epithelial cell metabolism switches from aerobic to anaerobic: blood-stream oxygen is no longer consumed and starts flowing in the gut lumen creating an aerobic niche for the pathogen.We will first build a population dynamics model of Salmonella infection.The commensal microbiota will be represented by a unique strain of butyrate-producing bacteria: Faecalibacterium Prauznitzii.

Lumen
This bacteria belongs to one of the dominant genera in the gut microbiota, and is widely studied in the context of probiotic development [20].The model proposed in this section is an adaptation of the works of Muñoz et al. [22], where they model the human colon by dividing it in compartments that are treated as continuous-flow stirred tank reactors (CSTR).
Our adaptation comes from a simplification of the colon into a single CSTR (called luminal compartment), and the novelty comes from the inclusion of FBA for computing the growth rate and the addition of the epithelial compartment representing the epithelium.We need to add the former to our set of equations in order to model the Salmonella infection.Our model aims to reproduce the main steps of the Salmonella infection: 1) the neutrophils (immune system cells that sequester bacteria) release into the colon from the epithelial compartment after the virulence factor triggers the inflammatory response; 2) the resulting drop of butyrate producing bacteria which entails decreased butyrate levels; 3) the metabolic switch, induced by the butyrate drop, of the epithelial cell from aerobic to anaerobic metabolism resulting in oxygen flow into the luminal compartment of the oxygen that is not consumed; 4) the bloom of Salmonella growing in this newly aerobic luminal compartment.A mathematical model describing the infection and the shift from an anaerobic to aerobic environment in the colon has been introduced in [16] at a larger scale.The model we introduce here focuses on the host-microbiota-pathogen metabolic interactions.Many parameters contained in ODE models are normally estimated by fitting the model to experimental data.In view of the lack of it, we will content ourselves to qualitatively representing the 4 steps introduced above that are hallmark of Salmonella infection [27], however, some parameters can be known before hand, such as the hydraulic retention time.
State variables.The model is a compartment model: a first compartment describes the gut lumen while the second stands for the epithelial cells.The luminal compartment describes the dynamics of the bacteria S th and F prau , for Salmonella enterica Typhimurium and Faecalibacterium prauznitsii populations, n l , the luminal neutrophils, and m l a vector containing all the metabolites concentrations of interest in the luminal compartment that describe the nutritional environment.Vector m l is indexed by  2)) for each time step for each species.Nitric oxide has a special role in the host response to the pathogen, since it will react with oxygen to form nitrate which boosts the growth of S th and gives an edge to Salmonella in the competition for resources [27].The epithelial compartment has 4 state variables: n e , N O e , O 2e and but e representing neutrophils, nitric oxide, oxygen and butyrate, respectively.
Each of these state variables is transported to or from the luminal compartment, in order to model the host response effect in the colon to the pathogen invasion.The vector m e indexed by {N O, O 2 , but} will gather the epithelial metabolites.
Luminal compartment.The gut lumen is modelled as an open system, meaning that matter flows through it.A working hypothesis is that the volume of the gut lumen is preserved at all times, meaning that a volume entering the gut must be balanced by a volume going out, thus the gut lumen can be modelled as a reactor [10].The rate of change of the concentration of a component inside the gut lumen depends then on the difference between the input and output flow [22].More precisely , let s be the concentration of a component of interest, then Q in and Q out be the volumetric input and output flow, s in the concentration of the incoming flow, and V the reactor volume.) maps the upper bound of consumption to the uptake rates of metabolites for s ∈ {S th , F prau }.To couple Eq. ( 4) to the state equation, a relation between the state variable and the consumption upper bound c up is needed.We then define where m l,m is the substrate metabolite m of the luminal metabolites m l , L dt is a characteristic consumption time, 1 s (m l ) is an indicator function indicating whether the bacteria s metabolizes the substrate m, ε is a small regularization parameter and S m is the maximal substrate uptake when the metabolite m is at saturation in the media.As the upper bound c (up) s now depends on vector m l and bacterial densities, we will simply denote F s (m l , S th , F prau ) the uptake rates of metabolites for species s.Note that this vector also includes the biomass production rate, denoted by F s,1 (m l , S th , F prau ).Analogously, vector F s,m l (m l , S th , F prau ) is assembled from the uptake rates of metabolites in m l .Finally, we introduce the diag(•) operator, which maps a vector of size n to the corresponding diagonal matrix of size n.
where F S th (resp.F Fprau ) is the FBA metabolic model of the pathogen (resp.the commensal).The parameter ρ represents the trapping by the neutrophils n l .The term α The term d n n l is the death rate of neutrophils.Remark that mathematically we could have added the dilution rate (D m ) of neutrophils to its death rate d n and have a single term, however since neutrophils also die in the epithelial compartment which has no dilution rate we decided to keep this explicit form.
No entry of bacteria takes place, the bacteria getting into the system through initial conditions.
In equation (25), the first term describes the metabolite inflow, with m in a vector containing the concentration in the small intestine of component m l and D the passive dilution rate common to all the inert metabolites.The terms F b,m l (m l , S th , F prau )b for b ∈ {S th , F prau } correspond to the consumption or production of metabolites due to the bacterial metabolism.The term βm l O 2 l corresponds to the oxidation reactions, where β is a diagonal matrix with entries only in the index corresponding to the reduced-oxidized pairs, each metabolite of a reduced-oxidized pair have the same coefficient, but with opposite sign, thus ensuring mass conservation.The term diag(γ)T r (m e , m l ) shows the transport process to the epithelial compartment.We have for the transfer coefficient γ: Epithelial compartment The 4 state variables of the epithelial compartment have the following dynamics The term C but,n n e n e − L n bute K but +bute (L n − n e ) in equation ( 26) (and the analogue term in eq. (  26), (27), and (28), respectively, represent death terms.Terms γ n (n m − n e ) in equation (26) (and all its analogues in other equations) model the transport process towards the luminal compartment, which couple these equations to Eq. ( 22)- (25).Finally terms λ but but e O 2e in both equations ( 28) and ( 29) model the epithelial cell metabolism mainly based on butyrate oxydation.
The system is supplemented with initial conditions Y 0 that can be found in Table A.2.The system was simulated in absence of Salmonella for 40 hours, time at which a pulse of Salmonella is added and models the initial invasion.The model is solved with custom python scripts (see Sec.A in the Annexe).
The FBA models are taken from the literature: the S th model is taken from [26] as provided by Cobrapy [9] while the F prau model is taken from [31].The parameter values can be found in Table A.1.
In Figure 3 a simulation of the system can be found.The abundance of S th , F prau , and neutrophils is first plotted (Fig. by the depletion of thiosulfate (Fig. 3.c, purple), while the second is more based on the consumption of oxidized molecules, allowed by the flow of oxygen, and nitrate coming from the oxidation of N O.We note that oxygen actually recycles the end product of the metabolism of the oxydized molecules, maintaining the favourable niche for Salmonella.We can see that the dynamical system renders all the four steps of Salmonella infection as described in the literature (see Fig. 2b): 1) the inflammation-induced raise of neutrophils 2) the consecutive drop of butyrate-producing bacteria and butyrate, 3) the switch to anaerobic metabolism in the epithelium and the resulting oxygenation of the lumen, favourable niche for 4) the bloom of Salmonella.
In the remainder, we will use the notation to designate the vectorial state variable of the whole dynamical system.

Learning database definition
The assembling of the learning database is linked to the question of sampling the feature space of the RKHS method, which has dimension N up = 9 in our application.Building a uniform sampling of a ninedimensional hypercube necessitates a high number of points to cover all the volume of the hypercube.
To mitigate the number of samples in the learning database, we adopt a supervised strategy: we sample the feature space in the neighbourhood of feature time-series observed during different solutions of the ODE system ( 22)- (29).In this way, the feature co-variance of our learning database is closer to the co-variance imposed by the dynamical system structure.We then compute N sim = 60 repetitions of the ODE system ( 22)-( 29) with random initial conditions sampled in uniform distributions (cf Table B.3 for parameter values), multiplied for the metabolites of the luminal compartment by a Bernoulli distribution simulating their presence/absence to also simulate cases where a metabolite is not initially present in the system.
From these N sim = 60 replicates, we performed a time sampling of the state variables m l (i∆t), S th (i∆t) and F prau (i∆t), i = 1, • • • , N t from which we computed the corresponding FBA constraints using formula (21) to get X 1 after duplicate removal.The matrix X 1 only contains constraints that have been observed during the time course of the system dynamics.To enrich the database around these orbits, we then perturbed X 1 with a multiplicative Gaussian noise (σ = 0.1), and selected samples with resulting all negative constraints (i.e.substrate uptake) to get X 2 .The concatenation X large of X 1 and X 2 leads to a database of N obs = 47942 samples.We subsampled X large by uniformly picking up 1000  Notably thiosulfate accumulates until the infection moment, and then is consumed by S th , the rising levels of nitrate are also linked to the presence of S th and the available oxygen to transform nitric oxide in nitrate.
Plot e shows the behaviour in the epithelial compartment and one can see how the butyrate level drops since the appearance of neutrophils, the oxygen accumulation because of the reduced butyrate levels and the nitric oxide increased explained by the presence of Salmonella that triggers its production.Plot f shows the flows between compartments to show that indeed the accumulations or depletion of metabolites described before is linked to exchanges between compartments.The different stages of the infection are then qualitatively recovered by the ODE system.
samples.Since it is particularly important from a biological point of view to capture the dynamics when a given metabolite is not limiting (i.e. when its concentration is close to S m in eq. ( 21)) and when is nearly depleted (i.e. when its concentrations gets close to zero), we selected 1000 additional points by randomly taking 1000/(N sub * 2) additional samples in the first and last decile of each columns of X large to enrich the database in the distribution limits.We then finally obtained a learning database X with N obs = 2000 samples.Model outputs Y Fprau and Y S th were assembled for each species with the FBA model.The resulting distributions in X and X large can be seen in B.9.

Hyperparameters selection
We now are ready to learn the metamodel, i.e. to solve (18) in order to find the parameters θ providing the best trade-off between Y reconstruction and RKHS subspace selection.

Selection of the group-lasso weight µ
For each species s = S th , F P rau and model output j, we solve the problem (18) for where Ŷs test,j|µ = Fs,j|µ (X test ).
We display in Figures 4 and 5 the respective resulting lasso-paths for F prau and S th .Namely, we compute for each µ, species s and output j the norm n p µ,s,j = ∥ θs p,j|µ ∥ 2 for p ∈ P, where second order interactions only are considered in P, and derive their relative contribution n p µ,s,j / p∈P n p µ,s,j that is displayed in Figures 4 and 5.This relative contribution allows to display the groups p of θs p,j|µ that vanish for increasing µ, and the groups that remain non-null indicating input variables that are necessary to reconstruct the output j.In other words, for increasing µ, the group-lasso penalty becomes preponderant, turning off the parameters corresponding to the RKHS subspace p carrying the lower part of signal variance, which remains to perform variable selection.In the meantime, the loss tends to increase when a group of θ is discarded, since the signal is approximated in lower-dimensional subspaces.
We are then seeking, for each output j, for the parameter µ providing the best trade-off between signal reconstruction and reduced number of selected groups p, synonym of reduced computational load and speed-up.
For F prau (Fig. 4), we first observe that the lasso paths are very similar for the substrates (all the curves are similar in the glucose and galactose plots), indicating that these sugars have a comparable fate in the FBA model and similar influence on butyrate production (butyrate plots, the blue and orange plots are parallel).To predict the growth (F prau plot), both sugars and their interaction are needed to achieve correct predictions (blue, orange and gray lines): the loss curve (dark blue line with stars) shows sharp increases when a group is dropped off.Due to the reduced number of substrates for F prau (N up = 2), all groups are kept for the four model outputs (see Table C.4 for selected µ).
For S th (Fig. 5), input interactions are more complex.We first observe that O 2 intake (blue curve) is always preponderant for all model outputs plots, which is expected for this bacteria able to respire in aerobic environment.Again, glucose and galactose plots are very similar, such as glucarate and galactarate (their oxidated version).For these oxidated sugars, the loss increase (dark blue line with stars) is very limited when groups are dropped-off, indicating that the two groups that are kept (O  Test Loss Figure 4: Lasso path for F prau .For each metamodel, the lasso path is displayed: the relative contribution of the different blocks to the penalty is plotted for several values of the group lasso penalty µ, together with the loss function value.Namely, we plot ∥ θs p,j|µ ∥ 2 / p∈P ∥ θs p,j|µ ∥ 2 wrt µ.For increasing µ, the group carrying less information vanish (i.e. its relative contribution goes to zero), indicating that the remaining groups support the main part of the signal.Dashed dark gray lines indicate order 2 interactions involving the displayed compound.Dashed light gray lines indicate order 2 interactions that do not involve the displayed compound (i.e.involving other compounds).

Selection of the number of functional basis
For given regularization parameters µ, different numbers of functional basis can be involved in the approximation, i.e. according to the Representer theorem 2 different numbers of samples included in the learning set.Again, a trade-off between reconstruction accuracy and computation speed is expected, since more functional basis enlarges the discretized functional space where the optimum is searched in eq. ( 11), allowing for better approximation, but at the cost of additional computations during each metamodel evaluation in (20).
For the µ previously selected, we then performed additional metamodel learnings for varying N obs ∈ {50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700}.We then computed n rep = 5 repetitions of the ODE system ( 22)-( 29), for random initial conditions sampled with the same procedure than for the learning set construction (see Sec. 4), and for the FBA model or its metamodel approximation in eq. ( 22) to (25).The L 2 relative reconstruction error between the dFBA solutions Y ode F BA and their metamodel approximations Y ode mm|N obs is plotted in Fig. 6, together with the computation speed-up, i.e. the computation time ratio using the metamodel in place of the FBA model.
We can observe that the best trade-off between speed-up and reconstruction error is obtained for 500 functional basis.A higher number of basis increases the number of numerical operations and degrades the computation time while a lower number worsens the reconstruction error.More counter-intuitively, the speed-up is decreased for low numbers of functional basis (N obs ≤ 100).This is due to a higher number of blocks p ∈ P that are conserved when the number of observation in the learning basis (i.e. the number of functional basis in the RKHS) is reduced: the block-lasso penalty tends to conserve a higher number of blocks to preserve the data reconstruction, which is mechanically decreased for lower numbers of samples in the learning set.

Validation of the selected RKHS metamodel
The accuracy of the selected RKHS metamodel is first assessed by testing the metamodel with the corresponding FBA model on n test = 1500 unseen points (Fig. 7a and 7b).We can see that the large majority of points lie in the vicinity of the line y = x, providing excellent R2 scores, with minimal value Figure 5: Lasso path for S th .For each metamodel, the lasso path is displayed: the relative contribution of the different blocks to the penalty is plotted for several values of the group lasso penalty µ, together with the loss function value (dark blue line with stars).Namely, we plot ∥ θs p,j|µ ∥ 2 / p∈P ∥ θs p,j|µ ∥ 2 wrt µ.For increasing µ, the groups carrying less information vanish (i.e. its relative contribution goes to zero), indicating that the remaining groups support the main part of the signal.Dashed dark gray lines indicate order 2 interactions involving the displayed compound.Dashed light gray lines indicate order 2 interactions that do not involve the displayed compound (i.e.involving other compounds).linear for sugar consumption for F prau , but the behaviour is more complex for S th , in particular for sugar consumption: sugar FBA uptake (y-axis) can vanish whereas glucose or galactose remain in the media (non-null constraints, x-axis) indicating metabolic switches.This behaviour is correctly predicted by the metamodel.
We then assess the metamodel approximation by comparing the ODE simulations with the FBA (plain lines) and the metamodel (dashed lines, Figure 8).Some limited discrepancies can be observed.
In Fig. 8.a, Salmonella approximation accuracy is reduced in the second phase of growth, when S th takes benefit of the micro-aerobic environment.In the same plot around hour 60, the metamodel is slightly off for F prau , inducing a slight lag for butyrate production around T = 60 (Fig. 8.b, orange curves) which is reflected in the epithelial densities (Fig. 8.e, orange) and trans-epithelial flow (plot 6, orange).
For metabolites, the time courses are particularly well reconstructed, except for glucose after T = 70h which goes awry, reflecting that there was little glucose consumption predicted by the metamodel, whereas in the original system it was completely consumed.Thiosulfate and tetrathionate are slightly off as well which might be linked with the oxygen lag observed in Fig. 8.e and f (blue lines).Less oxygen goes into the luminal compartment during the lag and the formation of tetrathionate by the oxidation of thiosulfate becomes impaired.This mechanism should be observed for other reduced-oxidized pairs, however since they are less abundant the effect might be attenuated.
Altogether, the behaviour of the metamodel is satisfactory in reproducing the dFBA system: it produces an overall reconstruction error ∥Y ode − Ŷ ode ∥ 2 /∥Y ode ∥ 2 of 4, 71% and it accurately renders all different phases of S th infection as observed in Fig. 3, such as F prau and consecutive butyrate drop-off, O 2 and N O flows between epithelial and luminal compartments and the resulting two-phase growth of S th .The metamodel furthermore allows computation speed-up by 45, which is a considerable gain.samples in the learning dataset is more likely to cover the feature space with reduced dimensions.In our context, the feature space has 9 dimensions for S th , and we could provide accurate predictions with 500 points.Working directly with classical 9-dimensional RKHS might have necessitated a higher number of training samples to provide the same accuracy.On the contrary, 500 points provides a good sampling of 1 or 2-dimensional feature spaces as observed in the f p of eq. ( 17).Benchmarking ANOVA-RKHS with other RKHS and other machine learning methods is kept as a perspective for this work.
Additionally, ANOVA-RKHS could be compared or enriched with other functional spaces.In particular, as the response curves of the metabolic models are quite regular except near the origin (see Figs. C.10a and C.10a), other approximation methods could be investigated, such as polynomial regression models.This kind of models could provide faster evaluations by compensating a lower number of functional basis by higher priors on the response shape.Again, variable selection approaches could speed up metamodel evaluation on unseen points.

Exploring other regularization penalties.
In eq. ( 11), we selected a classical group lasso penalty to regularize the optimization problem.This penalty could be problematic in practice since it does not involve the ANOVA-RKHS norm, which is the norm that theoretically ensures the existence of a solution through the Representer theorem 2. However, these difficulties did not occur in the context of the computations presented here.Other regularizations were explored in [14,13] and could be introduced in the future in our package.However, computing the ANOVA-RKHS norm involves the computation of the square root of large (N 2 obs ) dense matrices (as many matrices as card(P)), which can be expensive in computational time and memory, specially if high-order interactions are considered in the Hoeffding decomposition.Hence, dimension reduction techniques or active learning could be coupled with the ANOVA-RKHS method to select at the same time input variables (with the ridge-group-sparse penalty introduced in [13]) and the most informative samples in the testing test.

Conclusion
In this study, we provided a proof-of-concept of the potentiality of machine learning methods to provide fast approximations of metabolic model outputs: these metamodels could replace FBA models in large systems biology models necessitating a massive number of FBA computations such as spatio-temporal models of microbial communities.We leveraged existing metamodeling methods (ANOVA-RKHS), provided strategies for the assembling of the testing dataset, set a framework for hyperparameter selection and assessed the accuracy of the metamodel.Replacing the original FBA models by their metamodel in an ODE system dynamics model of Salmonella infection in an healthy gut accelerated the computations by 45 with a relative error of about 5%.This result makes reachable PDE models of microbial communities involving genome-scale metabolic models such as FBA models, by approximating them with their metamodel.

Acknowledgments
Experiments presented in this paper were carried out using the PlaFRIM experimental testbed, sup- Selected hyperparameter µ that tunes the grouplasso penalty is indicated for each species (rows) and each model output (columns).This parameter provides the best trade-off between signal reconstruction and reduced number of RKHS subspace that are kept for reconstruction.We display for each column 1 ⩽ c ⩽ N up of the database X large its marginal distribution (plain lines) together with the marginal distribution of X (dashed lines) obtained after subsampling and enrichment near the boundaries of X large .As expected, the main modes of X large are conserved in X, while points in the first and last deciles (near the boundaries) are over-represented by construction in X.
on the uptake fluxes ν up of the N up metabolites in the extra-cellular environment, c (up) (s, b) is a function mapping the state variables b and s to the constraints c (up) on the substrates applied in the FBA model 2. As an example, we can set c (up) (b, s) = s L dt b to model the fact that the remaining substrate pool s is shared between the current microbial population b at a time rate L dt .We indicate by F b,1 the biomass production flux (index 1) and F b,s the consumption flux of metabolite s (index s) of the FBA model of b.

Figure 1 :
Figure1: Sketch of the general methodology.Left panel: speeding up dFBA.The dFBA framework (upper panel) is defined by the coupling of a FBA metabolic model with a dynamic system.Numerically, this remains to loop over a time integration scheme in which a FBA is solved at each time step.We propose a new framework (lower panel) where the FBA model is replaced by a low-computational-cost metamodel speeding up the time integration process.Right panel: metamodeling framework.We set up the general statistical framework where the flux Y is the output of the FBA model F given the input X.We then assemble a learning dataset by sampling the input space (X i ) and computing the corresponding FBA output Y i with N obs observations.The metamodel is then defined as the solution of a non-linear non-parametric regression problem in a finite dimensional functional space H K of dimension K with regularization function G.In practice, we will choose a group-lasso regularization to perform feature selection together with the metamodel computation.This regression problem has two hyperparameters to be chosen: the regularization parameter µ, that will tune the number of selected input variables, and the dimension K of the functional space, which is related to the number of observations in the RKHS framework (see Sec. 2 and 5).Selecting lower number of features or lower K decreases the computation load of the metamodel evaluation in a new unseen point x unseen and thus accelerates the ODE model integration but decreases the metamodel accuracy: a trade-off must be sought (see Sec. 5).

Figure 2 :
Figure 2: Simplified illustrations recapitulating the biological regulation in an healthy gut, and S. Typhimurium colonization mechanisms.
and chemical reactions + transport to epithelial compartment.Particularly, under the constant volume hypothesis Q in = Q out = Q.Define D := Q V as the dilution rate, which is the inverse of the hydraulic retention time.Then we can write Qinsin−Qouts V = (s in − s)D.Recall from equation (4) that F s (c (up) s models the deleterious effect of the oxygen level O 2 on the obligate anaerobe F prau , with a Michaelis-Menten dynamics using tuning parameters α and K O2 .The terms D S th and D Fprau indicate the passive dilution plus a bacteria specific death rate.The term γ n (n e −n l ) represents the transfer process from the epithelial compartment.
3.a).Notice how the infection takes place at hour 40 and produces a spike of neutrophils in both the luminal (Fig.3.a,dark green curve) and epithelial compartment (Fig.3.e,dark green).After the immune response led by neutrophils we can observe the decline of F prau and the rise of S th achieving colonization.Plots of Fig.3.b,Fig.3.c and Fig.3.dshow the metabolite concentrations in time in the luminal compartment.Butyrate starts decreasing after S th infection (Fig.3.b,orange) because of the drop of F prau , and eventually the media becomes completely aerobic after hour 60 (Fig.3.b,blue).This can be explained by observing Fig.3.e which illustrates how in the epithelial compartment the decreasing levels of butyrate allow oxygen to accumulate and flow into the luminal compartment (blue), as shown in Fig.3.f(blue) plotting the flow between compartments, i.e. γ(m e − m l ).The same can be observed for nitric oxide (Fig.3.f,green) which starts flowing into the luminal compartment from the beginning of the infection.The growth of S th exhibits two phases (Fig.3.a,red): a first phase is mainly fueled

Figure 3 :
Figure 3: dFBA model of Salmonella infection.The output of the dFBA model of Salmonella infection is plotted.The fate of the different model components is displayed in the luminal and epithelial compartments.Butyrate and oxygen flows between epithelial and luminal compartments is also plotted.The results can be read as follows: Plot a shows the ecological dynamics, i.e. the abundance in time of the commensal microbiota and the pathogen.Neutrophils appear after the infection at hour 40, which affects negatively F prau and allows the posterior S th settlement.Plot b show how butyrate level drops after the infection, because of the decrease in the F prau population and how the colon becomes aerobic after hour 60.Plots c and d show the dynamics of the reduced and oxidized metabolites.Notably thiosulfate accumulates until the infection moment, and then is consumed by S th , the rising levels of nitrate are also linked to the presence of S th and the available oxygen to transform nitric oxide in nitrate.Plot e shows the behaviour in the epithelial compartment and one can see how the butyrate level drops since the appearance of neutrophils, the oxygen accumulation because of the reduced butyrate levels and the nitric oxide increased explained by the presence of Salmonella that triggers its production.Plot f shows the flows between compartments to show that indeed the accumulations or depletion of metabolites described before is linked to exchanges between compartments.The different stages of the infection are then qualitatively recovered by the ODE system.

Figure 6 :
Figure 6: Trade-off between speed-up and accuracy.The average speed-up obtained by replacing the model by the metamodel in 5 repetitions is indicated for varying numbers of functional basis included in the ANOVA-RKHS (red line) with the standard deviation.The average relative reconstruction error 1 Nrep Nrep r=1 ∥Y ode F BA,r −Y ode mm,r|N obs ∥

Figure B. 9 :
Figure B.9: Marginal distributions in the learning database.We display for each column 1 ⩽ c ⩽ N up of the database X large its marginal distribution (plain lines) together with the marginal distribution of X (dashed lines) obtained after subsampling and enrichment near the boundaries of X large .As expected, the main modes of X large are conserved in X, while points in the first and last deciles (near the boundaries) are over-represented by construction in X.

Figure C. 10 :
Figure C.10: Model response.The FBA model value F(c) (blue dots) is plotted with its metamodel approximation F(c) (orange dots, y-axis) for 1600 unseen constraints c (x-axis).
drops, pulling the state variable towards 0 or L n when n e exceeds this threshold.The term V F (S th ) is a Heaviside function in order to simulate the virulence factor that Salmonella secrets triggering neutrophils and the nitric oxide production.The terms d n n e , d N O N O e , and d O2 O 2e in equations ( )) is a bistable term with stable steady-state 0 and L n , the threshold separating the attraction areas being L n bute K but +bute .The threshold bute K but +bute tends to 1 when butyrate is abundant and drops to zero when butyrate level .001, .01,.05,0.075, .1,0.15, .2,.3,.4,.5, .75,1.0, 1.5} and a subsample of N obs = 400 observations of X and Y s and compute the loss L µ,s,j , i.e. the relative reconstruction error on a testing set (X test , Y s test ) of N obs = 300 unseen points of X 2 ,blue line, and galactarate, brown line) are enough for a correct signal reconstruction.The same kind of observation is made for the nitric oxyde, thiosulfate and tetrathionate plots.We next can see that O 2 and nitrate are badly reconstructed (O 2 and nitrate plots, dark blue line with stars), even with the whole set of subspaces (more than 30% loss).Finally, for S th growth rate (S th plot), we keep several groups of inputs, including O 2 , thiosulfate, tetrathionate, glucarate and their interactions (see TableC.4 for selected µ).

Table B . 3 :
by Inria, CNRS (LABRI and IMB), Université de Bordeaux, Bordeaux INP and Conseil Régional d'Aquitaine (see https://www.plafrim.fr).This study received fundings from Inria through the Exploratory Action SLIMMEST (for further information see https://www.inria.fr/en/slimmest).Simon Labarthe got support for this study from the France-Berkeley Fund through the project Articulate.Parameter of the random functions describing the intial conditions of the 60 repetitions of the ODEs computed for the learning database.The lower and upper bounds of the uniform distributions are indicated, together with the Bernouilli parameter that models the presence/absence of the metabolite at t = 0 when relevant.