Metamodel construction for sensitivity analysis

We propose to estimate a metamodel and the sensitivity indices of a complex model m in the Gaussian regression framework. Our approach combines methods for the sensitivity analysis of complex models with statistical tools for sparse non-parametric estimation in the multivariate Gaussian regression model. It rests on the construction of a metamodel approximating the Hoeffding-Sobol decomposition of m. This metamodel belongs to a reproducing kernel Hilbert space constructed as a direct sum of Hilbert spaces, leading to a functional ANOVA decomposition. The estimation of the metamodel is carried out via a penalized least-squares minimization that selects the subsets of variables contributing to the prediction of the output, and it yields estimators of the sensitivity indices of m. We establish an oracle-type inequality for the risk of the estimator, describe the procedure for estimating the metamodel and the sensitivity indices, and assess the performance of the procedure in a simulation study.


Introduction
We consider the Gaussian regression model
$$Y = m(X) + \sigma\varepsilon, \qquad (1)$$
where X is a d-dimensional random vector with a known distribution $P_X = P_1 \times \ldots \times P_d$ on $\mathcal{X}$, a compact subset of $\mathbb{R}^d$, and $\varepsilon$ is independent of X and distributed as a N(0, 1) variable. The variance $\sigma^2$ is unknown and the number of variables d may be large. The function m is a complex, unknown function from $\mathbb{R}^d$ to $\mathbb{R}$; it may present strong non-linearities and high-order interaction effects between its coordinates. On the basis of an n-sample $(Y_i, X_i)$, $i = 1, \ldots, n$, we aim to construct metamodels and perform a sensitivity analysis in order to determine the influence of each variable, or group of variables, on the output.
Our approach combines methods for the sensitivity analysis of a complex model with statistical tools for sparse non-parametric estimation in the multivariate Gaussian regression model. It rests on the construction of a metamodel approximating the Hoeffding-Sobol decomposition of the function m. This metamodel belongs to a reproducing kernel Hilbert space constructed as a direct sum of Hilbert spaces, leading to a functional ANOVA decomposition involving the variables and the interactions between them. The estimation of the metamodel is carried out via a penalized least-squares minimization that selects the subsets of variables X contributing to the prediction of the output Y. Finally, the estimated metamodel yields estimators of the sensitivity indices of m.
A lot of work has been done around meta-modelling and sensitivity index estimation. For a complete account of global sensitivity analysis, see for example the book by Saltelli et al. [35]. Let us briefly present the context of usual global sensitivity analysis. Suppose that we are able to calculate the outputs z of a model m for n realizations of the input vector X, such that $z_i = m(X_i)$ for $i = 1, \ldots, n$. Starting from the values $(z_i, X_i)$, $i = 1, \ldots, n$, the objectives of meta-modelling and global sensitivity analysis are to approximate the function m by what is called a metamodel, and to quantify the influence of some subsets of the variables X on the output z. The metamodel helps to understand the behavior of the model, and allows future calculations to be sped up by using it in place of the original model m.
In particular, when the input variables X are independent and m is square integrable, one may consider the classical Hoeffding-Sobol decomposition [36,41] that writes m according to its functional ANOVA expansion:
$$m(x) = m_0 + \sum_{v \in \mathcal{P}} m_v(x_v), \qquad (2)$$
where $\mathcal{P}$ denotes the set of parts of $\{1, \ldots, d\}$ with dimension 1 to d, and where for all $x \in \mathbb{R}^d$, $x_v$ denotes the vector with components $x_j$ for $j \in v$. The functions $m_v$ are centered and orthogonal in $L^2(P_X)$, leading to the following decomposition of the variance of m: $\mathrm{Var}(m(x)) = \sum_{v \in \mathcal{P}} \mathrm{Var}(m_v(x_v))$. The Sobol sensitivity indices, introduced by Sobol [36], are defined for any group of variables $x_v$, $v \in \mathcal{P}$, by
$$S_v = \frac{\mathrm{Var}(m_v(x_v))}{\mathrm{Var}(m(x))}.$$
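As a quick illustration of these definitions, the sketch below estimates the first-order indices $S_{\{a\}}$ of a toy model (a hypothetical example, not the paper's m) by the pick-freeze Monte Carlo method:

```python
import numpy as np

# Pick-freeze Monte Carlo estimation of the first-order Sobol indices of a toy
# model (a hypothetical example, not the paper's m), with uniform inputs on [0, 1].
rng = np.random.default_rng(0)

def m(x):
    return x[:, 0] + x[:, 1] ** 2 + x[:, 0] * x[:, 1]

N, d = 200_000, 2
A = rng.uniform(size=(N, d))
B = rng.uniform(size=(N, d))
yA = m(A)

S = []
for a in range(d):
    C = B.copy()
    C[:, a] = A[:, a]                 # freeze coordinate a
    yC = m(C)
    # Cov(m(A), m(C)) / Var(m) estimates Var(E[m | X_a]) / Var(m)
    S.append((np.mean(yA * yC) - yA.mean() * yC.mean()) / yA.var())
print(S)   # analytic values: S_1 = 135/279 ~ 0.484, S_2 = 139/279 ~ 0.498
```

The interaction index $S_{\{1,2\}} = 5/279$ accounts for the remaining variance; higher-order indices are estimated analogously by freezing groups of coordinates.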
They quantify the contribution of a subset of variables $x_v$ to the output m(x). Several approaches are available for estimating these sensitivity indices; see for example Iooss and Lemaître [17] for a recent review. Among them, let us consider the one based on metamodel construction, which directly yields the sensitivity indices. Generally one considers metamodels that correspond to an ANOVA-type decomposition and that are candidates to approximate the Hoeffding decomposition of m. The ANOVA-type decomposition leads to consider functions defined as follows:
$$f(x) = f_0 + \sum_{v \in \mathcal{P}} f_v(x_v), \qquad (3)$$
for functions $f_v$ chosen to belong to some functional spaces. The polynomial Chaos construction, see for example Ghanem and Spanos [14], Soize and Ghanem [38], can be used to approximate the Hoeffding decomposition of m. This approach was considered by Blatman and Sudret [5], who propose a method for truncating the polynomial Chaos expansion (so as to keep polynomials with degree less than some integer) and then an algorithm based on least angle regression for selecting the terms in the expansion. For the same purpose, Gu and Wu [15] propose an algorithm based on the hierarchy principle (lower-order effects are more likely to be important than higher-order effects) and on the heredity principle (an interaction can be active only if one or all of its parent effects are also active). This approach joins the one proposed by Bach [2] for variable selection based on hierarchical kernel learning. Inspired by Touzani [39], Durrande et al. [12] propose to approximate m by functions belonging to a reproducing kernel Hilbert space (RKHS). The RKHS is constructed as a direct sum of Hilbert spaces leading to a functional ANOVA decomposition (see Equation (3)), such that the projection of m onto the RKHS is an approximation of the Hoeffding decomposition of m.
Following Lin and Zhang [25], Touzani and Busby [40] propose an algorithm to calculate the penalized least-squares estimator of m on the RKHS space, where the least-squares criterion is penalized by the sum of the norms of m on each Hilbert subspace. This group-lasso type procedure allows both selecting and calculating the non-zero terms in the functional ANOVA decomposition.
Our objective is to propose an estimator of a metamodel approximating the Hoeffding decomposition of m in the Gaussian regression model defined at Equation (1), and to deduce from this estimated metamodel estimators of the sensitivity indices of m. Contrary to the usual setting of sensitivity analysis where $m(X_i)$ is available, only the observations $Y_i$ are available, which leads us to consider the nonparametric multivariate regression setting.
Let us briefly describe the methods related to this regression setting and review their theoretical properties, starting with papers assuming a univariate additive decomposition for the function m in the context of high-dimensional sparse additive models. Precisely, denote by $\mathcal{F}^{1\text{-add}}$ the set of functions f defined on $\mathcal{X}$ such that $f(x) = f_0 + \sum_{a=1}^{d} f_a(x_a)$, where $f_0$ is a constant and where, for $a = 1, \ldots, d$, the functions $f_a$ are centered and square-integrable with respect to $P_a$. For each function f, the set $S_f$ of indices $a \in \{1, \ldots, d\}$ such that $f_a$ is not identically zero is called the active set of f.
Ravikumar et al. [32] propose a group-lasso procedure where each function f a is approximated by its truncated decomposition on a basis of functions. They provide an algorithm and, assuming that the function m belongs to the set F 1-add and that S m is sparse, prove the consistency of the active set and of the risk of the estimator of m.
Meier et al. [27] propose to combine a sparsity penalty (group-lasso) and a smoothness penalty (ridge) for estimating m(x). Considering the fixed design framework, they establish some oracle properties of the empirical risk for estimating the projection of m onto the set of univariate additive functions F 1-add .
Raskutti et al. [31] consider the case where each univariate function $f_a$ belongs to a RKHS and, as Meier et al., combine a sparsity and a smoothness penalty. Assuming that the d variables X are independent, they derive upper bounds for the integrated and the empirical risks, as well as a lower bound for the integrated risk over spaces of sparse additive models, each component of which is bounded with respect to the RKHS norm.
Additive sparse modelling is too restrictive in practical settings because it does not take into account interactions between variables that may affect the relationship between Y and X. The generalization of additive smoothing splines to interaction smoothing splines leading to an ANOVA-type decomposition (see Equation (3)) was proposed by several authors (see for example Wahba [42], Friedman [13], Wahba et al. [43]).
To control smoothness and enforce sparsity in the ANOVA-type decomposition, Gunn and Kandola [16] propose to consider the ANOVA decomposition as a weighted linear sum of kernels and to use a lasso penalty on the weights to select the terms in the decomposition, as well as a ridge penalty to ensure smoothness of the kernel expansion. The COSSO proposed by Lin and Zhang [25] is based on a smoothness penalty defined as the sum of the RKHS norms of the functions $f_v$. The authors study the existence and rate of convergence of the estimator. In a more general framework, where the function m is written as a linear span of a large number of kernels, Koltchinskii and Yuan [21] establish oracle inequalities on the excess risk, assuming that the function m has a sparse representation (the set of $v \in \mathcal{P}$ such that $f^*_v$ is non-zero is sparse). The authors generalize their results to a penalty function that combines sparsity and smoothness [22], as proposed by Meier et al. [27] and Raskutti et al. [31].
Recently, Kandasamy and Wu [19] proposed an estimator called SALSA, based on a ridge penalty, where the ANOVA-type decomposition is restricted to sets $v \in \mathcal{P}$ such that $|v| \leq D_{\max}$. The authors propose to choose $D_{\max}$ via a cross-validation procedure.

Our contributions
Using the functional ANOVA-type decomposition proposed by Durrande et al. [12], we propose an estimator of a metamodel which approximates the Hoeffding decomposition of m. Following the most recent works in the framework of nonparametric estimation of sparse additive models, we propose a penalized least-squares estimator whose penalty function enforces both the sparsity and the smoothness of the terms in the decomposition. We show that our estimator satisfies an oracle inequality with respect to the empirical and integrated risks.
Our procedure allows both selecting and estimating the terms in the ANOVA decomposition and, therefore, selecting the non-zero sensitivity indices and estimating them. In particular, it makes it possible to estimate Sobol indices of high order, a point known to be difficult in practice.
Finally, using convex optimization tools, we develop an algorithm (available on request) in R [30] for calculating the estimator. A simulation study shows the good performance of our estimator in practice.
The paper is organised as follows: the RKHS construction based on ANOVA kernels and the procedure for estimating a metamodel are presented in Section 2. The estimators of the Sobol indices are given in Section 3. The theoretical properties of the metamodel estimator are stated in Theorem 4.1 and Corollaries 4.1 and 4.2, whose proofs are postponed to Sections 7 and 8. Section 5 is devoted to the calculation of the estimator and Section 6 to the simulation study.

Meta-modelling
We start from the Hoeffding decomposition (see Sobol [37] and Van der Vaart [41], p. 157) of the function m, which consists in writing m as in Equation (2), where $\mathcal{P}$ denotes the set of parts of $\{1, \ldots, d\}$ with dimension 1 to d, and where for all $x \in \mathbb{R}^d$, $x_v$ denotes the vector with components $x_j$ for $j \in v$. The terms of this decomposition are orthogonal: for all $v \neq v'$ in $\mathcal{P}$, $E[m_v(X_v)\, m_{v'}(X_{v'})] = 0$. We propose to consider a functional space based on the tensor product of reproducing kernel Hilbert spaces (RKHS), and to approximate the unknown function m by its projection, denoted $f^*$, onto such an RKHS. One of the key points is to construct the space $\mathcal{H}$ such that the terms of the decomposition of a function f in $\mathcal{H}$ correspond to its Hoeffding-Sobol decomposition.

RKHS construction
Let us describe the construction of the spaces $\mathcal{H}$ based on ANOVA kernels, a construction given by Durrande et al. [12].
Let $\mathcal{X} = \mathcal{X}_1 \times \ldots \times \mathcal{X}_d$ be a compact subset of $\mathbb{R}^d$. For each coordinate $a \in \{1, \ldots, d\}$, we choose a RKHS $\mathcal{H}_a$ and its associated kernel $k_a$, defined on the set $\mathcal{X}_a \subset \mathbb{R}$, such that the two following properties are satisfied. The RKHS $\mathcal{H}_a$ may be decomposed as $\mathcal{H}_a = \mathbb{1} \oplus \mathcal{H}_{0a}$, the kernel $k_{0a}$ associated with the RKHS $\mathcal{H}_{0a}$ being defined as follows (see Berlinet and Thomas-Agnan [4]):
$$k_{0a}(x_a, x'_a) = k_a(x_a, x'_a) - \frac{\int k_a(x_a, u)\,dP_a(u)\;\int k_a(x'_a, u)\,dP_a(u)}{\iint k_a(u, u')\,dP_a(u)\,dP_a(u')}.$$
The ANOVA kernel is defined as
$$k(x, x') = \prod_{a=1}^{d}\bigl(1 + k_{0a}(x_a, x'_a)\bigr) = 1 + \sum_{v \in \mathcal{P}} \prod_{a \in v} k_{0a}(x_a, x'_a),$$
and the corresponding RKHS is $\mathcal{H} = \mathbb{1} \oplus \bigoplus_{v \in \mathcal{P}} \mathcal{H}_v$, where the RKHS $\mathcal{H}_v$ is associated with the kernel $k_v(x, x') = \prod_{a \in v} k_{0a}(x_a, x'_a)$. According to this construction, any function $f \in \mathcal{H}$ satisfies $f = f_0 + \sum_{v \in \mathcal{P}} f_v$ with $f_v \in \mathcal{H}_v$, the $f_v$ being centered and orthogonal in $L^2(P_X)$. We thus get the Hoeffding decomposition of f.
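A minimal numerical sketch of this construction, assuming a uniform distribution $P_a$ on [0, 1] and an illustrative Gaussian base kernel (neither is prescribed by the paper):

```python
import numpy as np

# Construction of a zero-mean kernel k0 from a base kernel k by projecting out
# the constants, with the integrals under P_a (uniform on [0, 1] here)
# approximated on a quadrature grid. The Gaussian kernel and its bandwidth are
# illustrative choices, not the paper's.
def k_gauss(s, t, h=0.2):
    return np.exp(-(s[:, None] - t[None, :]) ** 2 / (2 * h * h))

u = (np.arange(1000) + 0.5) / 1000            # quadrature nodes for P_a

def k0(s, t):
    mean_s = k_gauss(s, u).mean(axis=1)       # int k(s, y) dP(y)
    mean_t = k_gauss(t, u).mean(axis=1)       # int k(t, y) dP(y)
    denom = k_gauss(u, u).mean()              # double integral of k under P x P
    return k_gauss(s, t) - np.outer(mean_s, mean_t) / denom

# The functions k0(x, .) are centered: their integral under P_a vanishes.
x = np.linspace(0, 1, 7)
K0 = k0(x, u)
centering_error = np.abs(K0.mean(axis=1)).max()

# ANOVA Gram matrix on a 2-dimensional sample: product over coordinates of
# (1 + K0a), which expands as 1 plus the sum of the group Gram matrices K_v.
X = np.random.default_rng(1).uniform(size=(20, 2))
K = np.ones((20, 20))
for a in range(2):
    K *= 1.0 + k0(X[:, a], X[:, a])
print(centering_error)
```

The centering property is what makes the terms of the decomposition orthogonal to the constants, and the product form of the ANOVA kernel generates all the interaction kernels $k_v$ at once.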

Approximating the Hoeffding decomposition of m
Let $f^*$ be the minimizer of $\|m - f\|^2_{L^2(P_X)}$ over functions $f \in \mathcal{H}$. This $f^*$ can be viewed as an approximation of m; more specifically, its Hoeffding decomposition is an approximation of the Hoeffding decomposition of m. Therefore, if the Hoeffding decomposition of m is written as in Equation (2), each function $f^*_v$ approximates the function $m_v$. The idea is to propose an estimator of $f^*$ as an estimator of m.

Selection step
Since $\mathcal{P}$ is the set of parts of $\{1, \ldots, d\}$, the number of functions $f^*_v$ is related to the cardinality of $\mathcal{P}$, equal to $2^d - 1$, which may be huge. Our construction is thus associated with a selection strategy.
The selection of the $f^*_v$ in $f^*$ is based on a ridge-group-sparse type procedure which minimizes a penalized least-squares criterion over functions $f \in \mathcal{H}$. The least-squares criterion is penalized in order both to select few terms in the additive decomposition of f over the sets $v \in \mathcal{P}$, and to favour smoothness of the estimated $f_v$. The ridge regularization is ensured by controlling the norm of $f_v$ in the Hilbert space $\mathcal{H}_v$ for all v, and the group-sparse regularization is enforced by controlling the empirical norm of $f_v$, defined as
$$\|f_v\|_n = \Bigl(\frac{1}{n}\sum_{i=1}^{n} f_v^2(X_{v,i})\Bigr)^{1/2}.$$
For any $f \in \mathcal{H}$ such that $f = f_0 + \sum_{v \in \mathcal{P}} f_v$, and for some tuning parameters $(\mu_v, \gamma_v, v \in \mathcal{P})$, let $\mathcal{L}(f)$ be defined as
$$\mathcal{L}(f) = \frac{1}{n}\sum_{i=1}^{n}\bigl(Y_i - f(X_i)\bigr)^2 + \sum_{v \in \mathcal{P}}\bigl(\gamma_v\|f_v\|_n + \mu_v\|f_v\|_{\mathcal{H}_v}\bigr).$$
Let $\mathcal{F}$ be the subset of $\mathcal{H}$ defined by the constraint (16). Then the estimator $\hat{f}$ is defined as
$$\hat{f} = \mathop{\arg\min}_{f \in \mathcal{F}} \mathcal{L}(f). \qquad (6)$$
Remark 2.1. The construction of the RKHS spaces described above allows considering functional spaces well suited to the smoothness of the function m, irrespective of the distribution of X. Indeed, the kernels $k_{0a}$ depend on the distribution of X only for calculating the projection onto the space of constant functions. In comparison, the decomposition based on the truncated polynomial Chaos expansion used for sensitivity analysis (see Blatman and Sudret [5]) is based on the distribution of X, and only the choice of the truncation handles the smoothness of the approximation.

Sobol indices
Let us go back to the Hoeffding decomposition in Equation (2). The orthogonality between the terms of this decomposition leads to the additive decomposition of the variance of m(x):
$$\mathrm{Var}(m(x)) = \sum_{v \in \mathcal{P}} \mathrm{Var}(m_v(x_v)).$$
Each of these variance terms is related to a Sobol index [36]. For example, the Sobol index linked with the interaction between the variables $x_v$ is defined as
$$S_v = \frac{\mathrm{Var}(m_v(x_v))}{\mathrm{Var}(m(x))},$$
and the global Sobol index for the variable $x_a$, $a \in \{1, \ldots, d\}$, is
$$S_a^{\mathrm{glob}} = \sum_{v \,:\, a \in v} S_v.$$
These Sobol indices and global Sobol indices quantify the contribution of a subset of variables $x_v$ to the output m(x). As said in the introduction, the direct estimation of these Sobol indices may require a lot of calculations. We consider here methods based on metamodels in order to obtain the sensitivity indices directly.

Estimation of Sobol indices
Thanks to the orthogonality property of the functions in $\mathcal{H}$, the variance of m(x) will be estimated by $\sum_{v \in \mathcal{P}} \mathrm{Var}(\hat{f}_v(X_v))$. In practice, in order to avoid calculating the variance of $\hat{f}_v(X_v)$, one may use an estimator based on the empirical variances of the functions $\hat{f}_v$. Precisely, if $\hat{f}_{v,\cdot}$ is the mean of the $\hat{f}_v(X_{v,i})$ for $i = 1, \ldots, n$, then
$$\widehat{\mathrm{Var}}\bigl(\hat{f}_v(X_v)\bigr) = \frac{1}{n}\sum_{i=1}^{n}\bigl(\hat{f}_v(X_{v,i}) - \hat{f}_{v,\cdot}\bigr)^2. \qquad (8)$$
One of the main contributions of this approach is to allow the estimation of Sobol indices of any order, whereas classical methods generally only deal with small orders, usually less than two.
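A small sketch of this empirical-variance estimator, with hypothetical fitted components $\hat{f}_v$ standing in for the output of the estimation procedure:

```python
import numpy as np

# Sobol index estimates from the empirical variances of fitted components,
# as in Equation (8); the fitted values below are hypothetical placeholders
# for the output of the estimation procedure.
rng = np.random.default_rng(2)
n = 500
X = rng.uniform(size=(n, 2))
fhat = {
    (0,): 1.5 * (X[:, 0] - 0.5),                  # main effect of x1
    (1,): X[:, 1] ** 2 - 1.0 / 3.0,               # main effect of x2
    (0, 1): (X[:, 0] - 0.5) * (X[:, 1] - 0.5),    # order-2 interaction
}
var_hat = {v: np.mean((vals - vals.mean()) ** 2) for v, vals in fhat.items()}
total = sum(var_hat.values())
S_hat = {v: var_hat[v] / total for v in var_hat}  # estimated Sobol indices
print(S_hat)
```

By construction the estimated indices sum to one over the selected groups, whatever their order.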

Theoretical result: oracle inequality for metamodel
In this section we state the oracle inequality for the estimated metamodel $\hat{f}$, which approximates the Hoeffding decomposition of the unknown function m.

Notations and Assumptions
For a function $f \in \mathcal{H}$, $f = f_0 + \sum_{v \in \mathcal{P}} f_v$, we denote by $S_f$ its support and by $|S_f|$ its cardinality; more precisely, $S_f = \{v \in \mathcal{P} : f_v \not\equiv 0\}$. We consider RKHS spaces $\mathcal{H}_v$, $v \in \mathcal{P}$, satisfying the following assumptions. For each $v \in \mathcal{P}$, let $\omega_{v,k}$, $k \geq 1$, be the eigenvalues of the operator associated with the reproducing kernel $k_v$, arranged in decreasing order. Let us define the function $Q_{n,v}(t)$ for positive t, and, for some $\Delta > 0$, let $\nu_{n,v}$ be defined accordingly. For each $v \in \mathcal{P}$, $\nu_{n,v}$ refers to the so-called critical univariate rate, the minimax-optimal rate for the $L^2(P_X)$-estimation of a single univariate function in the Hilbert space $\mathcal{H}_v$ (see e.g. Mendelson [28]). Our choices of regularization parameters and rates are specified in terms of these quantities. Theorem 4.1. Consider the regression model defined at Equation (1), let $(Y_i, X_i)$, $i = 1, \ldots, n$, be an n-sample with the same law as (Y, X), and let $\hat{f}$ be defined by (6). If there exist constants $C_l$, $l = 1, 2, 3$, with $C_1 \geq 1$, and $0 < \eta < 1$ such that the conditions stated below are satisfied, then, with probability greater than $1 - \eta$, the oracle inequality below holds. The result can be easily generalized to the minimization of $\mathcal{L}(f)$ over the space $\mathcal{G}$ defined, for some positive constants $r_v$, $v \in \mathcal{P}$, by a bound $r_v$ on each component; indeed, minimizing $\mathcal{L}(f)$ over $\mathcal{G}$ is equivalent to minimizing $\mathcal{L}(f)$ over the set $\mathcal{F}_r$.
Let us now comment on the theorem.
• The term $\|m - \hat{f}\|^2_n$ refers to the usual bias term, quantifying the approximation properties of the Hilbert space $\mathcal{H}$ as a distance between the true m and its approximation in $\mathcal{H}$.
• Koltchinskii and Yuan [22] considered what is called the multiple kernel learning problem, where the functions in $\mathcal{H}$ have an additive representation over kernel spaces. They assume neither that the variables X are independent, nor that the kernel spaces satisfy an orthogonality condition. In return, they assume that some decomposability-type properties are satisfied, and they introduce some characteristics related to the degree of "dependence" of the kernel spaces.
• In the particular case where the decomposition is limited to the main effects of the variables, the problem comes back to the classical nonparametric additive model. The theoretical properties of the estimator based on a ridge-group-sparse type procedure have already been established (see for example Meier et al. [27] and Raskutti et al. [31]).
• Weights in the penalty terms may be of interest for applications. The theoretical result highlights that the tuning parameters $(\mu_v, \gamma_v)$ should depend on the decrease of the eigenvalues of the kernel defining $\mathcal{H}_v$. Besides, we may be interested in introducing weights that favour small-order interaction terms.
• Because we aim to approximate the Hoeffding decomposition of m, we need orthogonality between the spaces $\mathcal{H}_v$, $v \in \mathcal{P}$. This condition, required by our objective, is also a key point in the proof of the theorem, when the problem is to compare the Euclidean norm of functions in $\mathcal{H}$ with the norm in $L^2(P_X)$. At this step, we need Assumption (15) to conclude.
• We do not assume that the functions in $\mathcal{F}$ are uniformly globally bounded, that is, that $\sup\{|f(x)|, x \in \mathcal{X}\}$ is bounded by a constant that does not depend on f. Instead, we assume that each function within the unit ball of the Hilbert space $\mathcal{H}_v$ is uniformly bounded by a constant multiple of its Hilbert norm. In fact, the functions f in the space $\mathcal{F}$, written as $f_0 + \sum_v f_v$, satisfy that for each group v, $f_v$ is uniformly bounded. This assumption is easily satisfied as soon as the kernel $k_v$ is bounded on the compact set $\mathcal{X}$. We refer to [31] for a discussion of this subject and a comparison with the work of Koltchinskii and Yuan [22].
• Assuming that $n\lambda^2_{n,v} \geq -C_2 \log \lambda_{n,v}$ allows controlling the probability of the union of $|\mathcal{P}|$ events. This is a mild condition, satisfied for $\lambda_{n,v} = K_n/\sqrt{n}$ with $K_n$ of order $\sqrt{\log n}$.
The following corollary gives an upper bound on the risk with respect to the $L^2(P_X)$ norm. It is mainly a consequence of Theorem 4.1 (its proof is given in Section 8.2, page 21).
From this corollary we can compare the estimator $\widehat{\mathrm{Var}}(m_v(x_v))$ (see Equation (7)) with the variance of $m_v(x_v)$. Thanks to the following inequality and to Corollary 4.1, we get the following result.

Rate of convergence
The result is non-asymptotic, in the sense that it holds for any (n, d). Nevertheless, the upper bound is relevant when the infimum is reached for functions f whose decomposition in $\mathcal{H}$ is sparse, and when d is small compared to n. In fact, the coefficient d occurring in the rate $d|S_f|/n$ comes from the logarithm of the cardinality of $\mathcal{P}$, equal to $\log(2^d - 1)$. When d is large, it may be judicious to limit the decomposition of functions in $\mathcal{H}$ to interactions of limited order, so that the logarithm of the number of terms in the decomposition stays of the order of $\log(d)$.
Let us discuss the rate of convergence given by $\sum_{v \in S_f} \nu^2_{n,v}$. For the sake of simplicity, let us consider the case where the variables $X_1, \ldots, X_d$ have the same distribution $P_1$ on $\mathcal{X}_1 \subset \mathbb{R}$, and where the one-dimensional kernels $k_{0a}$ are all identical, equal to some kernel $k_0$. The kernel $k_0$ admits the eigen-expansion
$$k_0(x, y) = \sum_{\ell \geq 1} \omega_{0,\ell}\, \zeta_\ell(x)\, \zeta_\ell(y),$$
where the eigenvalues $\omega_{0,\ell}$ are non-negative and arranged in decreasing order, and where the $\zeta_\ell$ are the associated eigenfunctions, orthonormal in $L^2(P_1)$. Therefore the kernel $k_v$ admits the expansion
$$k_v(x_v, x'_v) = \prod_{a \in v} k_0(x_a, x'_a) = \sum_{(\ell_a)_{a \in v}} \Bigl(\prod_{a \in v} \omega_{0,\ell_a}\Bigr) \prod_{a \in v} \zeta_{\ell_a}(x_a)\, \zeta_{\ell_a}(x'_a).$$
Consider now the case where the eigenvalues $\omega_{0,\ell}$ decrease at the rate $\ell^{-2\alpha}$ for some $\alpha > 1/2$. It can be shown, see Section 8.3, that the rate $\nu_{n,v}$ defined at Equation (12) is bounded above by a term of order $n^{-\alpha/(2\alpha+1)} (\log n)^\gamma$, where $\gamma \geq (|v| - 1)\alpha/(2\alpha - 1)$. Note that in this particular case, the rate of convergence depends on $|v|$ only through the logarithmic term, and that up to this logarithmic term, the rate of convergence is of the same order as the usual nonparametric rate for one-dimensional functions. It follows that the RKHS $\mathcal{H}$ should be chosen such that the unknown function m is well approximated by sparse functions in $\mathcal{H}$ with low-order interactions.

Calculation of the estimator
The functional minimization problem described at Equation (6) is equivalent to a parametric minimization problem. Indeed, according to the representer theorem (see Kimeldorf and Wahba [20]), the solution to our minimization problem can be written as $\hat{f} = \hat{f}_0 + \sum_{v \in \mathcal{P}} \hat{f}_v$ with
$$\hat{f}_v(\cdot) = \sum_{i=1}^{n} \theta_{v,i}\, k_v(X_i, \cdot)$$
for coefficient vectors $\theta_v \in \mathbb{R}^n$. Let $\|\cdot\|$ denote the usual Euclidean norm in $\mathbb{R}^n$. For each $v \in \mathcal{P}$, let $K_v$ be the $n \times n$ matrix with entries $(K_v)_{i,j} = k_v(X_i, X_j)$; with this representation, $\|f_v\|^2_{\mathcal{H}_v} = \theta_v^\top K_v \theta_v$ and $\|f_v\|_n = \|K_v \theta_v\|/\sqrt{n}$. Let $\hat{f}_0$ and $\hat{\theta}$ be the minimizers of the penalized least-squares criterion $C(f_0, \theta)$ obtained by substituting this representation into $\mathcal{L}(f)$ (Equation (17)). Then the estimator $\hat{f}$ defined at Equation (6) satisfies $\hat{f} = \hat{f}_0 + \sum_{v \in \mathcal{P}} \hat{f}_v$ with the coefficients $\hat{\theta}$. Because $C(f_0, \theta)$ is a convex and separable criterion, we propose to calculate $\hat{\theta}$ using a block coordinate descent algorithm described in the following section.
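A sketch of this finite-dimensional reformulation, under illustrative choices of kernels, data, and tuning parameters (none of them the paper's):

```python
import numpy as np

# Evaluation of the parametric criterion: with f_v(.) = sum_i theta_{v,i} k_v(X_i, .),
# the RKHS norm of f_v is sqrt(theta_v' K_v theta_v) and its empirical norm is
# ||K_v theta_v|| / sqrt(n). Kernels, data and tuning parameters are illustrative.
rng = np.random.default_rng(3)
n = 40
X = rng.uniform(size=(n, 2))
Y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.1, size=n)

def gram(x, h=0.3):
    return np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * h * h))

grams = {(0,): gram(X[:, 0]), (1,): gram(X[:, 1])}   # main effects only, for brevity

def criterion(f0, theta, mu=0.05, gamma=0.05):
    fit = f0 + sum(K @ theta[v] for v, K in grams.items())
    pen = sum(
        gamma * np.linalg.norm(grams[v] @ theta[v]) / np.sqrt(n)  # empirical norm
        + mu * np.sqrt(theta[v] @ grams[v] @ theta[v])            # RKHS norm
        for v in grams
    )
    return np.mean((Y - fit) ** 2) + pen

theta0 = {v: np.zeros(n) for v in grams}
print(criterion(Y.mean(), theta0))     # at theta = 0 this is just Var(Y)
```

The criterion is convex in $(f_0, \theta)$ and separable across the groups, which is what the block coordinate descent of the next section exploits.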
Note that the estimator $\hat{f}$ defined at Equation (6) should satisfy (16). Usually we have no idea of the value of this upper bound in practice, and we propose to remove this constraint in the optimization procedure. Nevertheless, if one wants to consider such an additional constraint, the problem can be solved at the price of some additional complication, considering a Lagrangian method for example; see Section 8.7 for more details.

Algorithm
We will assume that for all $v \in \mathcal{P}$ the matrices $K_v$ are positive definite. If this is not the case, one replaces $K_v$ with $K_v + \xi I_n$, where $\xi$ is a small positive value, in order to ensure positive definiteness.
Using a coordinate descent procedure, we minimize the criterion $C(f_0, \theta)$ along one group v at a time. At each step of the algorithm, the criterion is minimized as a function of the current block's parameters, while the parameters of the other blocks are fixed at their current values. The procedure is repeated until convergence, considering for example that convergence is reached when the norm of the difference between two consecutive solutions is small enough. See for example Boyd et al. [8] for optimization in such a context.
For the sake of simplicity, we consider the minimization of the criterion defined at Equation (18); up to the rescaling of the tuning parameters, it is exactly the same criterion as the one defined at Equation (17).
Let us begin with the constant term $f_0$. Because the penalty function does not depend on $f_0$, minimizing the criterion with respect to $f_0$, the other parameters being fixed, leads to Equation (19), where $Y_\cdot$ denotes the mean of the $Y_i$. In what follows, we consider a group v and fix the values of the parameters for all the other groups. We describe the algorithm and postpone the proofs to Section 8.7.
Let us first consider the case where both $\mu'_v \neq 0$ and $\gamma'_v \neq 0$. The first task is to obtain necessary and sufficient conditions under which the solution $\theta_v = 0$ is the optimal one. Let $J^*$ be defined as in Equation (21). Then the solution to Equation (20) is zero if and only if $J^* \leq \gamma'^2_v$. Calculating $J^*$ is a ridge regression problem that can be easily solved (see Propositions 8.2 and 8.3 in Section 8.7).
If the solution to Equation (20) is not $\theta_v = 0$, the problem is to solve the subgradient equation (22). Because $\theta_v$ appears on both sides of the equation, a numerical procedure is needed (see Proposition 8.4).
Other cases:
• If all the $\mu'_v$ are equal to 0, the parameters $\theta_v$ are not identifiable.
• If all the $\gamma'_v$ are equal to 0, then we have to solve a classical group-lasso problem with respect to the parameters $\theta_v$.
Finally, the algorithm is the following:
(1) Start with an initial value $\theta = \theta^0$.
(2) Calculate $\hat{f}_0$ using Equation (19).
(3) For each group v, calculate $R_v$ and determine whether the group v should be excluded, either by solving the problem defined at Equation (21) if $\mu'_v \neq 0$ and $\gamma'_v \neq 0$, or directly if one of them equals 0. If so, set $\theta_v = 0$; if not, solve Equation (22) to obtain $\theta_v$.
(4) Repeat steps (2) and (3) until convergence.
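As an illustration of this block coordinate descent pattern, the sketch below treats a simplified analogue: a linear model with Euclidean block norms and orthonormalized blocks, so that the exclusion test and the block update are available in closed form (unlike the paper's update, which needs the numerical solve of Proposition 8.4):

```python
import numpy as np

# Block coordinate descent on a group-lasso criterion, as a simplified analogue
# of the algorithm above: linear model, Euclidean block norms, orthonormalized
# blocks so that each block update is in closed form. The actual ridge-group-
# sparse update requires a numerical solve (Proposition 8.4).
rng = np.random.default_rng(4)
n, p = 200, 4
blocks = [np.linalg.qr(rng.normal(size=(n, p)))[0] for _ in range(3)]  # B' B = I
beta = [3.0 * np.ones(p), np.zeros(p), np.zeros(p)]   # only block 0 is active
Y = sum(B @ b for B, b in zip(blocks, beta)) + rng.normal(scale=0.1, size=n)

lam = 2.0
theta = [np.zeros(p) for _ in blocks]
f0 = Y.mean()
for _ in range(100):
    for v, B in enumerate(blocks):
        # residual with the current block removed
        R = Y - f0 - sum(Bw @ tw for w, (Bw, tw) in enumerate(zip(blocks, theta)) if w != v)
        z = B.T @ R
        nz = np.linalg.norm(z)
        # block excluded iff ||z|| <= lam; otherwise closed-form shrinkage
        theta[v] = np.zeros(p) if nz <= lam else (1.0 - lam / nz) * z
    f0 = (Y - sum(B @ t for B, t in zip(blocks, theta))).mean()

active = [v for v, t in enumerate(theta) if np.linalg.norm(t) > 0]
print(active)
```

The exclusion test in this simplified setting plays the role of the condition $J^* \leq \gamma'^2_v$ above: each sweep first decides whether a group is zero before updating it.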

Choice of the tuning parameters
For each value of the tuning parameters $(\mu'_v, \gamma'_v)$, $v \in \mathcal{P}$, the algorithm provides an estimate of m and of the Sobol indices. The choice of these parameter values is crucial. We propose to restrict this choice by considering tuning parameters proportional to known weights: for all $v \in \mathcal{P}$,
$$\mu'_v = \mu\,\omega_v, \qquad \gamma'_v = \gamma\,\zeta_v,$$
where the weights $\omega_v$ and $\zeta_v$ are fixed. For example, one can take weights that increase with the cardinality of v, in order to favour effects with a small interaction order between variables. Or, according to the theoretical result given in Theorem 4.1, we can choose $\omega_v = \hat{\nu}^2_{n,v}$ and $\zeta_v = \hat{\nu}_{n,v}$, where $\hat{\nu}_{n,v}$ is an estimate of $\nu_{n,v}$ based on the eigenvalues of the matrix $K_v$. Any other choice, depending on the problem of interest, may be relevant.
Once the weights are chosen, we estimate m, on the basis of a learning data set $(Y_i, X_i)$, $i = 1, \ldots, n$, for a grid of values of $(\mu, \gamma)$. We first set $\gamma = 0$ and calculate $\mu_{\max}$, the smallest value of $\mu$ such that the solution to the minimization problem is identically zero. Then we consider $\mu_\ell = \mu_{\max} 2^{-\ell}$, $\ell \in \{1, \ldots, \ell_{\max}\}$, as a grid of values for $\mu$. The grid of values for $\gamma$ may be chosen after a few attempts. For choosing the final estimator, say $\hat{f}$, we suppose that we have at our disposal a testing data set $(Y^T_i, X^T_i)$, $i = 1, \ldots, n_T$, and we propose two procedures.
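The grid construction and PE-based selection can be sketched as follows, with plain ridge regression standing in for the ridge-group-sparse fit (data, dimensions, and the grid depth are illustrative):

```python
import numpy as np

# Dyadic grid lam_max * 2^{-l} and selection of the tuning parameter by the
# test-set prediction error PE, with plain ridge regression standing in for
# the ridge-group-sparse fit (data and dimensions are illustrative).
rng = np.random.default_rng(5)
n, p = 80, 10
X = rng.normal(size=(2 * n, p))
beta = np.zeros(p)
beta[:3] = 2.0
Y = X @ beta + rng.normal(size=2 * n)
Xl, Yl = X[:n], Y[:n]        # learning set
Xt, Yt = X[n:], Y[n:]        # testing set

def fit_ridge(lam):
    return np.linalg.solve(Xl.T @ Xl + lam * np.eye(p), Xl.T @ Yl)

lam_max = np.abs(Xl.T @ Yl).max()
grid = [lam_max * 2.0 ** (-l) for l in range(1, 11)]
PE = {lam: np.mean((Yt - Xt @ fit_ridge(lam)) ** 2) for lam in grid}
lam_hat = min(PE, key=PE.get)
print(lam_hat, PE[lam_hat])
```

The same loop applies verbatim to a two-dimensional grid over $(\mu, \gamma)$, with the fit replaced by the ridge-group-sparse estimator.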
Proc. GS: The first one uses the testing data set for estimating the prediction error. Precisely, for each value of $(\mu, \gamma)$ in the grid, let $\hat{f}_{(\mu,\gamma)}(\cdot)$ be the estimate of m obtained with the learning data set. Then
$$\mathrm{PE}(\mu, \gamma) = \frac{1}{n_T}\sum_{i=1}^{n_T}\bigl(Y^T_i - \hat{f}_{(\mu,\gamma)}(X^T_i)\bigr)^2$$
estimates the prediction error, and we propose to choose the pair $(\mu, \gamma)$ that minimizes $\mathrm{PE}(\mu, \gamma)$, say $(\hat{\mu}, \hat{\gamma})$. The final estimator, denoted $\hat{f}_{GS}$, is defined as $\hat{f}_{GS} = \hat{f}_{(\hat{\mu}, \hat{\gamma})}$. In the following, we refer to this procedure as the Group-Sparse procedure. Proc. rdg: Drawing a parallel with the inconsistency of the lasso for estimating the support of the parameters in the classical regression problem, we propose to choose the tuning parameter that minimizes the risk of the ridge estimator over the support estimated by the ridge-group-sparse procedure. Indeed, if the tuning parameter is chosen to minimize the prediction error, the lasso is not consistent for support estimation (see [24] for example). One idea to overcome this problem is to choose the tuning parameter that minimizes the risk of the Gauss-lasso estimator, which is calculated in two steps: for a given value of the tuning parameter, the support of the parameter is estimated using a lasso procedure, then the least-squares estimator over this support is calculated. When the objective is support estimation, some numerical simulations [33] and theoretical results [18] suggest that it may be more advisable to apply the selection schemes based on the prediction risk not to the lasso estimators, but rather to the Gauss-lasso estimators. Our procedure, called the ridge procedure, applies the same idea in the framework of sparse nonparametric estimation. Precisely, it considers the collection of supports composed of the different $S_{\hat{f}_{(\mu,\gamma)}}$ obtained when $(\mu, \gamma)$ runs over the grid.
For each support S in this collection, we estimate f using a ridge procedure assuming that the support of f is S: for a given $\lambda$, $\hat{f}^{\mathrm{rdg}}_{\lambda,S}$ is defined as the minimizer of the ridge criterion restricted to the groups in S. We choose a grid of values for $\lambda$, and for each S in the collection and each $\lambda$ in the grid, we estimate the prediction error $\mathrm{PE}(\lambda, S)$. For each S in the collection, let $\hat{\lambda}(S)$ be the minimizer of $\mathrm{PE}(\lambda, S)$ as $\lambda$ varies in the grid, and let $\hat{S}$ be the minimizer of $\mathrm{PE}(\hat{\lambda}(S), S)$; the final estimator, denoted $\hat{f}_{\mathrm{rdg}}$, is defined as $\hat{f}_{\mathrm{rdg}} = \hat{f}^{\mathrm{rdg}}_{\hat{\lambda}(\hat{S}),\hat{S}}$. If a testing data set is not available, we can use classical V-fold cross-validation (see [1] for example), either to estimate $\mathrm{PE}(\mu, \gamma)$ or to estimate $\mathrm{PE}(\lambda, S)$.

Simulation study
In order to evaluate the performance of our method for estimating a metamodel and the sensitivity indices of a function m, we carried out a simulation study. We consider the g-function of Sobol, defined on $[0, 1]^d$ as
$$m(x) = \prod_{a=1}^{d} \frac{|4x_a - 2| + c_a}{1 + c_a}$$
for non-negative coefficients $c_a$. For each simulation, we generate three independent data sets as follows: a Latin hypercube sample of the inputs is simulated to give the matrix X with n rows and d = 5 columns, and an n-sample of independent centered Gaussian variables with variance 1 is simulated. This operation is repeated three times, in order to obtain the learning and testing data sets as well as a third data set for estimating the performance of the estimators. As explained in Section 5.2, we choose optimal values of the tuning parameters $(\mu, \gamma)$ by minimizing a prediction error PE, and we get an estimator $\hat{f}$ of m, as well as estimates of the Sobol indices.
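The g-function is convenient for benchmarking because its Sobol indices of every order are available in closed form. The sketch below uses illustrative coefficients $c_a$ (the paper's values are not reproduced here):

```python
import numpy as np
from itertools import combinations

# The Sobol g-function and its closed-form Sobol indices. The coefficients c_a
# below are illustrative; the paper's values are not reproduced here.
c = np.array([0.0, 1.0, 4.5, 9.0, 99.0])
d = len(c)

def g(X):
    return np.prod((np.abs(4.0 * X - 2.0) + c) / (1.0 + c), axis=1)

V = 1.0 / (3.0 * (1.0 + c) ** 2)        # partial variance of each factor
D = np.prod(1.0 + V) - 1.0              # total variance of g
S = {v: np.prod(V[list(v)]) / D         # Sobol index of any group v
     for r in range(1, d + 1) for v in combinations(range(d), r)}

# Monte Carlo check of the total variance
Xmc = np.random.default_rng(6).uniform(size=(200_000, d))
print(D, g(Xmc).var())
```

Small coefficients $c_a$ make the corresponding variables influential; large coefficients make them nearly inert, which is what allows the selection behaviour of the procedure to be assessed.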
Let $\Omega_v$ be the matrix whose components satisfy the relation given below; the estimator of $\mathrm{Var}(m_v(x_v))$ is then calculated accordingly. We also propose to estimate these quantities by their empirical variances, as in Equation (8).
Table 2. Estimated empirical risk ER for different values of n and $\sigma$ with the Matérn kernel.
Performance indicators. To evaluate the performance of our method for estimating a metamodel, we use the classical coefficient of determination $R^2$, estimated using the third data set $(Y^P_i, X^P_i)$, $i = 1, \ldots, n$. Moreover, we calculate the empirical risk $\mathrm{ER} = \|m - \hat{f}\|^2_n$. For each simulation s, we get $R^2_s$ and $\mathrm{ER}_s$, and we report the means of these quantities over all simulations.
Similarly, for each v and each simulation s, we get $\hat{S}_{v,s}$, and we report its mean $\hat{S}_{v,\cdot}$ and its estimated standard error. To sum up the behaviour of our procedure for estimating the sensitivity indices, we estimate the global error, denoted GE. In order to assess the performance of our procedure for selecting the greatest Sobol indices, precisely those greater than some small quantity $\rho = 10^{-4}$, we calculate, for each group $v \in \mathcal{P}$, the percentage of simulations for which v is in the support of the estimator $\hat{f}$. Then we average these quantities, on one hand over the groups v such that $S_v > \rho$, and on the other hand over the groups v such that $S_v \leq \rho$. Let us denote these quantities $\mathrm{pSel}_{S_v > \rho}$ and $\mathrm{pSel}_{S_v \leq \rho}$, respectively.
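The performance indicators can be computed as sketched below from stand-in vectors; here GE is taken as the sum over groups of the squared errors on the Sobol indices, one plausible form of the global error:

```python
import numpy as np

# Performance indicators of the simulation study computed from stand-in vectors
# (all inputs below are synthetic placeholders, not simulation output). GE is
# taken here as the sum over groups of squared errors on the Sobol indices.
rng = np.random.default_rng(7)
n = 200
m_vals = np.sin(rng.uniform(size=n))              # true model at the design points
f_vals = m_vals + rng.normal(scale=0.05, size=n)  # fitted metamodel values
Y_pred = m_vals + rng.normal(scale=0.2, size=n)   # third (prediction) sample

R2 = 1.0 - np.sum((Y_pred - f_vals) ** 2) / np.sum((Y_pred - Y_pred.mean()) ** 2)
ER = np.mean((m_vals - f_vals) ** 2)              # empirical risk ||m - f||_n^2

S_true = np.array([0.55, 0.30, 0.10, 0.05])       # true Sobol indices
S_hat = np.array([0.50, 0.33, 0.11, 0.04])        # their estimates
GE = np.sum((S_true - S_hat) ** 2)
print(R2, ER, GE)
```

The selection percentages $\mathrm{pSel}$ are then simple averages of indicator variables over the simulations.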
For each (µ, γ) in a grid of values, the estimator f µ,γ is defined as the minimizer of the criterion given at Equation (18), taking ω v = ζ v = 1 for all v ∈ P. In order to save computation time, we restrict the optimisation to sets v such that |v| ≤ 3. Some preliminary simulations showed that the terms corresponding to |v| ≥ 4 are nearly always equal to 0.
Choosing the tuning parameters. Let us begin with a comparison of the two methods proposed for choosing the final estimator (see Section 5.2). The results are given in Tables 1 and 2. The procedure based on the ridge estimator after selection of the groups outperforms the method based on the group-sparse estimator. As expected, both methods perform better when n increases and when σ = 0.
Similarly, the Sobol indices are better estimated, in the sense of the global error, with the procedure based on the ridge estimate of the metamodel; see Table 3. The means of the estimators of the Sobol indices greater than ρ = 10 −4 are given in Tables 4 and 5. It appears that S {1} is over-estimated by the procedure based on the group-sparse estimator, leading to an under-estimation of the Sobol indices associated with interactions of order 2. This tendency is much less pronounced with the procedure based on the ridge estimator.

Table 5. The first line of the table gives the true values of the Sobol indices ×100 greater than 10 −2 , as well as their sum in the last column. The following lines give the means of the estimators as well as their standard errors (in parentheses), calculated over 100 simulations, for n = 100 and σ = 0.2 with the Matérn kernel.

Table 6. The first two columns give respectively pSel Sv≤ρ and pSel Sv>ρ . The last columns give the values of pSel v for each group v such that S v > ρ. Results for n = 50 and σ = 0 with the Matérn kernel.
Let us now consider the performance of the procedure for selecting the non-zero Sobol indices. In Tables 6 and 7 we report the percentage of simulations for which the Sobol indices smaller (respectively greater) than ρ are selected, and for which each Sobol index greater than ρ is selected. From these results, we conclude that the procedure based on the ridge estimator is stricter in selecting non-zero Sobol indices.
Comparing different kernels. Finally, we compare the performance of the procedures for different kernels; see Tables 8 and 9. The means of the estimated empirical risk ER and of the global error GE for estimating the sensitivity indices are calculated for each kernel. It appears that the Matérn kernel gives the best results, except in the case n = 50 and σ = 0, where the empirical risk of f GS is smaller for the Brownian kernel.
In practice, one may want to choose the kernel according to the smallest prediction error. For that purpose, we propose to calculate the estimators f GS and/or f rdg for each kernel, as described in Section 5.2, as well as their associated prediction errors. Then we choose the kernel for which the prediction error is minimized. The results are reported under the column "mixed" in Tables 8 and 9. It appears that the estimated empirical risks for this "mixed" procedure are nearly equal to the minimum estimated empirical risks over the different kernels.

Proc. GS    38  85  100  100  100  100  99  98  18
Proc. rdg    6  58  100  100  100   98  92  77   3

Table 7. The first two columns give respectively pSel Sv≤ρ and pSel Sv>ρ . The last columns give the values of pSel v for each group v such that S v > ρ. Results for n = 100 and σ = 0.2 with the Matérn kernel.
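The "mixed" rule can be sketched generically as follows; `fit` and `predict` are placeholder callables (assumptions of this sketch) standing in for the estimation procedure of Section 5.2 and its prediction step.

```python
import numpy as np

def choose_kernel(fit, predict, kernels, train, test):
    """Fit one estimator per candidate kernel and keep the kernel whose
    estimator minimizes the held-out prediction error (squared loss)."""
    X_train, Y_train = train
    X_test, Y_test = test
    best = None
    for name, kernel in kernels.items():
        model = fit(kernel, X_train, Y_train)
        pe = float(np.mean((Y_test - predict(model, X_test)) ** 2))
        if best is None or pe < best[1]:
            best = (name, pe, model)
    return best  # (kernel name, prediction error, fitted model)
```

This mirrors the "mixed" column: the selected kernel's error is by construction the minimum prediction error over the candidate kernels.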

Sketch of proof of Theorem 4.1
We give here a sketch of the proof; complete statements are postponed to Section 8. In particular, we denote by C constants that may vary from one equation to another, and we assume that σ = 1.
The proof of Theorem 4.1 starts in the same way as the proof of Theorem 1 in Raskutti et al. [31]. Nevertheless, it differs in several points, in particular because the terms occurring in the decomposition of functions in H depend on several variables and thus are not independent. Indeed, f v (X v ) and f v′ (X v′ ) are not independent as soon as the groups v and v′ share some of the variables X a , a = 1, . . . , d. Moreover, we do not assume that the function m is in F .
Starting from the definition of f , a simple calculation (see Equation (28)) gives that for all f ∈ F the main problem is to control the empirical process. For each v, letting λ n,v be as in (13), we state (see Lemma 8.1, page 18) that, with high probability, the corresponding bound holds. Therefore, if for all v, µ v and γ v satisfy Equation (14), we deduce that, with high probability (setting g = f − f ), the inequality holds. Besides, we can express the decomposability property of the penalty as follows (see Lemma 8.2, page 19): with high probability (on the set where the empirical process is controlled as stated above).
Putting things together, and noting again that ‖g v ‖ Hv ≤ 2, we obtain the following upper bound. The last important step consists in comparing the empirical norms ‖g v ‖ n , v ∈ S f , to the corresponding L 2 (P X ) norms ‖g v ‖ 2 . More precisely, it can be shown (see Lemma 8.3, page 19) that for all v ∈ P, with high probability, we have

Using the orthogonality assumption between the spaces H v , it finally remains to consider different cases according to the rankings of ‖f − f ‖ 2 L 2 (PX) , ‖f − f ‖ 2 n and the sum over v ∈ S f of µ v + γ 2 v to get the result of Theorem 4.1.

Proofs
Recall that we consider the regression model defined at Equation (1), where X has distribution P X = P 1 × . . . × P d on X , a compact subset of R d , and ε is distributed as N (0, σ 2 ). We denote by P X,ε the distribution of (X, ε). We observe an n-sample (Y i , X i ), i = 1, . . . , n, with law P X,ε .
The notation and the procedure are given in Sections 2 and 4. Let us add a few notations that will be used throughout the proofs.
For v ∈ P we denote by |v| the cardinality of v. For a function φ : R |v| → R, we denote by V n,ε the empirical process defined below. For the sake of simplicity we assume σ = 1. Moreover, we set R ′ = 1, see (10). Consequently, for any function f ∈ H, ‖f v ‖ Hv ≤ 1 and ‖f v ‖ ∞ ≤ ‖f v ‖ Hv . The proofs can be carried out in exactly the same way in the general case. In the proofs, the ‖ · ‖ L 2 (PX) norm will be denoted by ‖ · ‖ 2 .

Proof of Theorem 4.1
The proof is based on four main lemmas proved in Section 8.5. Other lemmas used throughout the proof are stated in Section 8.4; their proofs are postponed to Section 8.6.
Let us first establish inequalities that will be used in the following. Let f ∈ H and v ∈ S f (see (9)).
Using that for any v ∈ S f , and any norm · in v∈P Combining (24), and (25), to the fact that for any function f ∈ H, L( f ) ≤ L(f ), we obtain that If m − f 2 n ≥ B, we immediately get the result since in that case If m − f 2 n < B, we get that The control of the empirical process |V n,ε f − f | is given by the following lemma (proved in Section 8.5.1, page 26). Lemma 8.1. Let V n,ε be defined in (23). For any f in F , we consider the event T defined as where the quantities λ n,v are defined by Equation (13) and where κ = 10+4∆. Then, for some positive constants Conditionning on T , Inequality (28) becomes which may be decomposed as follows If we choose C 1 ≥ κ in Theorem 4.1, then κλ 2 n,v ≤ µ v /2 and κλ n,v ≤ γ v /2 and the previous inequality becomes Next we use the decomposability property of the penalty expressed in the following lemma (proved in Section 8.5.2 page 28).
Hence, by combining (30) and Lemma 8.2 we obtain the stated bound. For each v, ‖ f v − f v ‖ Hv ≤ 2 (because the functions f v and f v belong to the class F , see (5)), and consequently, for some constant C, the bound holds. To finish the proof, it remains to compare the two quantities over v ∈ S f , up to an additive term coming from concentration results (see the lemma given below). Next, thanks to the orthogonality of the spaces H v with respect to L 2 (P X ), the comparison follows. To conclude, it remains to consider several cases, according to the rankings of the empirical norm and d 2 (f ). This is the subject of the following lemma, whose proof is given in Section 8.5.3, page 29.

Lemma 8.3. For f ∈ H, let A be the event
Then, for some positive constant c 2 , on the set A, Inequality (31) provides that, for all K > 0, the stated bound holds. Inequality (34) uses the bound 2ab ≤ a 2 /K + Kb 2 for all positive K, and Inequality (35) uses the orthogonality with respect to L 2 (P X ).
In the following we have to consider several cases, according to the rankings of the quantities above. More precisely, we consider three cases. Case 1: starting from (35), for any f ∈ H, we get the stated bound. Hence, using that for all K ′ > 0 the elementary inequality holds, we obtain, for a suitable choice of K ′ , say 1 + K ′ < K/C, that for some positive constant C ′ the bound follows. This shows the result in Case 1.
Case 2: Inequality (35) becomes an inequality which gives the expected result. Case 3: Recall the form taken in this case. This case is solved by applying the following lemma (proved in Section 8.5.4, page 29), which states that, with high probability, the bound (36) holds. Let G(f ) be the class of functions written as g = Σ v∈P g v , with ‖g v ‖ Hv ≤ 2, satisfying for all f ∈ F the stated conditions. Because we will apply Lemma 8.4 to g v = f v − f v , this event has probability 0. Therefore the complementary event has probability greater than 1 − η/3 for some 0 < η < 1.
Conditioning on the events T and A (defined by (29) and (32)), and according to Lemma 8.2, Σ v∈P ( f v − f v ) belongs to the set G(f ). According to (35), we conclude in the same way as in Case 1.

Proof of Corollary 4.1
We start from Theorem 4.1, which states that the bound holds with high probability, and use that for all θ > 0 the elementary inequality holds. For d defined by (36), we consider once again the three cases defined on page 20.
Case 1: According to (37) and (39), we get the result. Case 2: We directly obtain the bound. Case 3: Recall the form taken in this case. Apply Lemma 8.4 (page 21) and conclude that, conditioning on the events T and A defined by (29) and (32), Lemma 8.4 applies. Now, conditioning on the event C, we get the result.

Rate of convergence
Recall that we consider the case where the variables X 1 , . . . , X d have the same distribution P 1 on X 1 ⊂ R, and where the one-dimensional kernels k 0a are all identical.
We start from the fact that the kernel k 0 admits an eigenvalue expansion in which the eigenvalues ω 0,ℓ are nonnegative, ranked in decreasing order, and decay at the rate ℓ −2α for some α > 1/2, and where the ζ ℓ are the associated eigenfunctions, orthonormal with respect to L 2 (P 1 ). Therefore the kernel k v admits the corresponding tensorized expansion.
According to this expansion, the eigenvalues ω v,ℓ are of order (Π |v| a=1 ℓ a ) −2α . In order to control the rate ν n,v , defined as the smallest t such that Q n,v (t) ≤ ∆t 2 , we have to calculate an upper bound for Q 2 n,v (t). We start with the following inequalities, which hold up to some constant, for t −1/α > 1. Since α > 1/2, Σ j≥1 1/j 2α is a constant that depends on α. We thus focus on the first term on the right-hand side of Equation (40).
Let u = t −1/α ≥ 1 and let B |v| be defined as follows. Let us prove Inequality (41). Proof of Equation (41): First note the base case. In the same way, the next case follows, and more generally the recursion holds. Let A |v| be defined as follows; then we get B |v| ≤ uA |v|−1 . If we show (42), then Inequality (41) is proved. Proof of Equation (42): In the same way, the recursion gives the result, and Bound (42) is proved. Rate of convergence. Let us come back to the control of the rate ν n,v = inf{t : Q n,v (t) ≤ ∆t 2 }. Thanks to (41) we obtain that, up to some constant that depends on |v| and α, the stated bound holds. It now remains to find t such that, up to a constant, the inequality is satisfied. If t = n −β (log n) γ with β = α/(1 + 2α), γ > 0 and α > 1/2, then log(n) + γ log log(n) ≤ log(n) as soon as log(n) > 1 + 1/(2α).
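The resulting rate t ≍ n −α/(1+2α) can be checked numerically. The sketch below assumes ∆ = 1 and the exact one-variable decay ω ℓ = ℓ −2α (so |v| = 1); it returns the smallest t on a log grid satisfying Q n (t) ≤ ∆t 2 , and estimates the decay exponent from two sample sizes.

```python
import numpy as np

def Q_n(t, n, alpha, L=100_000):
    """Q_n(t) = sqrt((1/n) * sum_l min(t^2, l^{-2 alpha})), truncated at L terms."""
    ell = np.arange(1, L + 1, dtype=float)
    return np.sqrt(np.minimum(t * t, ell ** (-2.0 * alpha)).sum() / n)

def nu_n(n, alpha, Delta=1.0):
    """Smallest t on a fine log grid with Q_n(t) <= Delta * t^2."""
    for t in np.geomspace(1e-4, 1.0, 400):
        if Q_n(t, n, alpha) <= Delta * t * t:
            return float(t)
    return 1.0

alpha = 1.0
rates = [nu_n(n, alpha) for n in (100, 1600)]
# theory predicts nu_n ~ n^{-alpha/(1+2 alpha)}, i.e. exponent 1/3 for alpha = 1
slope = float(np.log(rates[0] / rates[1]) / np.log(16))
```

Up to grid resolution, the estimated exponent is close to α/(1 + 2α) = 1/3, in agreement with the β found above.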

Intermediate Lemmas
For v ∈ P, let H v be the RKHS associated with the reproducing kernel k v . Let Q n,v and ν n,v be defined by Equations (11) and (12). For any function g v ∈ H v , let V n,ε be defined at Equation (23) and consider the following processes. Lemma 8.5. If E X,ε denotes the expectation with respect to the distribution of (X, ε), we have the stated bound for all t > 0. Its proof is given in Section 8.6.1, page 33.
Lemma 8.6. Let b > 0 and let G(t) be the following class of functions: Let Ω v,t be the event defined as Then for any t ≥ ν n,v , the event Ω v,t has probability greater than 1 − exp(−c 2 nt 2 ), for some positive constant c 2 .
Its proof is given in Section 8.6.2, page 34.
Its proof is given in Section 8.6.3, page 35.
Lemma 8.8. If E ε denotes the expectation with respect to the distribution of ε, we have Its proof is given in Section 8.6.4, page 36.
Lemma 8.9. Conditionally on the event Ω v,t defined by (46), we have the two following inequalities. Its proof is given in Section 8.6.5, page 36.
We start by writing the decomposition and consider the two following cases: Case A: ‖g v ‖ n ≤ λ n,v ‖g v ‖ Hv ; Case B: ‖g v ‖ n > λ n,v ‖g v ‖ Hv . Case A: Since ‖g v ‖ n ≤ λ n,v ‖g v ‖ Hv , we apply Lemma 8.10, page 26, and conclude that (51) holds in Case A for each v ∈ P with high probability. Case B: Consider now the case ‖g v ‖ n > λ n,v ‖g v ‖ Hv and let us show that for any v ∈ P, ‖g v ‖ Hv W n,n,v (‖g v ‖ n / ‖g v ‖ Hv ) ≤ κλ n,v ‖g v ‖ n .
Let r v be a deterministic number such that r v > λ n,v . Our first step relies on the study of the process W n,n,v (r v ) for r v > λ n,v . In that case we state two results: R1, for any deterministic r v ≥ λ n,v , with probability greater than 1 − c 1 exp(−c 2 nλ 2 n,v ), the bound (54) holds; R2, Inequality (54) continues to hold for random r v of the form below. Combining these two points implies that, with probability greater than 1 − c 1 exp(−c 2 nλ 2 n,v ), the bound holds. Consequently, in Case B, according to (52), for each v, Inequality (51) holds. This completes the proof of Lemma 8.1.
Proof of R1. Taking t = r v and δ = λ n,v in (47), with probability greater than 1 − 4 exp(−nλ 2 n,v ), the bound holds. Next we prove that for some positive r v , with probability greater than 1 − exp(−ncλ 2 n,v ), the bound holds. Let ν n,v be defined as the smallest solution of E ε [W n,n,v (t)] ≤ κt 2 . For W n,n,v defined by (43), we write the decomposition. Besides, Lemma 8.10 states that on the event Ω v,λn,v , E ε W n,n,v (λ n,v ) ≤ κλ 2 n,v . It follows from the definition of ν n,v and Lemma 8.6 that ν n,v ≤ λ n,v for all v ∈ P with probability greater than 1 − exp(−nc 2 Σ v∈P λ 2 n,v ). Consequently, for any deterministic r v such that r v ≥ λ n,v , (54) is satisfied with high probability.
Proof of R2. Let us prove R2 by using a peeling-type argument. Our aim is to prove that (54) holds for any r v of the form below. Since ‖g v ‖ ∞ / ‖g v ‖ Hv ≤ 1, we have ‖g v ‖ n / ‖g v ‖ Hv ≤ 1. We thus restrict ourselves to r v satisfying r v = ‖g v ‖ n / ‖g v ‖ Hv with ‖g v ‖ n / ‖g v ‖ Hv ∈ (λ n,v , 1]. We start by splitting the interval (λ n,v , 1] into M disjoint intervals such that (λ n,v , 1] = ∪ M k=1 (2 k−1 λ n,v , 2 k λ n,v ], for some M that will be chosen later. Consider the event D c defined as follows. We prove that, for some positive constants c 1 , c 2 , the bound holds. For g v ∈ D c , let k be the integer in {1, · · · , M } such that the ratio falls in the corresponding interval. By taking r v = 2 k λ n,v in (54), we get the bound. Now let us write D c as follows. The set D c has probability smaller than c 1 M exp(−c 2 nλ 2 n,v ). If we choose M such that log M ≤ (c 2 /2)nλ 2 n,v , then the probability of the set T is greater than the stated quantity. It follows that R2 is proved, which completes the proof of Lemma 8.1.

Proof of Lemma 8.2 (page 19).
Starting from (27) with B defined by Equation (26), we write On the event T defined in (29) we have Rearranging the terms we obtain that Now, thanks to Assumption (14) with C 1 ≥ κ we have κλ 2 n,v ≤ µ v and 2κλ n,v ≤ γ v and Lemma 8.2 is shown since

Proof of Lemma 8.3 (page 19):
Let us consider the following two cases. In the first case, we apply Lemma 8.6 (page 25) to the corresponding function; it follows that, for some positive c 2 , with probability greater than 1 − exp(−nc 2 γ 2 v ), the bound holds. In the second case, we apply Lemma 8.7 (page 25) to the corresponding function; it follows that, for some positive c 2 , with probability greater than 1 − exp(−nc 2 γ 2 v ),

Proof of Lemma 8.4 (page 21):
Let d(f ) be defined by (36), and let G(f ) and G ′ (f ) be the following sets. Let us consider the two events B and B ′ defined as follows. Let us first remark that B ′ is included in B: if h ∈ B ′ , then h ∈ G, ‖h‖ 2 = d(f ) and ‖h‖ 2 n ≥ d(f ) 2 /2. It follows that ‖h‖ 2 n ≥ ‖h‖ 2 2 /2 and ‖h‖ 2 ≥ d(f ). Therefore Lemma 8.4 is proved if B ′ holds with high probability.
We show that the event Z n (G ′ ) ≤ d(f ) 2 /2 has probability greater than 1 − c 1 exp(−nc 2 d(f ) 2 ). Let us briefly recall the notion of covering numbers for a totally bounded metric space (G, ρ), consisting of a set G and a metric ρ defined from G × G into R + . A δ-covering set of G is a collection of functions f 1 , · · · , f N such that for all f ∈ G there exists k ∈ {1, 2, · · · , N } such that ρ(f, f k ) ≤ δ.
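As a concrete illustration of this definition, for the interval [a, b] under the metric |x − y|, centers spaced 2δ apart suffice, so the δ-covering number (the smallest possible N) is ⌈(b − a)/(2δ)⌉. The helper below is ours, a toy example rather than anything used in the proof.

```python
import math

def covering_number_interval(delta, a=0.0, b=1.0):
    """delta-covering number of [a, b] under |x - y|: each center covers a
    ball of radius delta (length 2*delta), so ceil((b - a)/(2*delta))
    centers suffice, and no fewer can cover the whole interval."""
    return math.ceil((b - a) / (2 * delta))
```

For instance, covering [0, 1] with δ = 0.25 needs the two centers 0.25 and 0.75.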
The δ-covering number N (δ, G, ρ) is the cardinality of the smallest δ-covering set. A proper covering restricts the covering to use only elements of the set G. The proper covering number, denoted N pr (δ, G, ρ), satisfies the stated relation. Let us now consider a d(f )/8-covering of (G ′ , ‖ · ‖ n ), so that for all g in G ′ there exists g k such that ‖g − g k ‖ n ≤ d(f )/8. The associated proper covering number is given at (57). Now, for all g ∈ G ′ , T 1 = ‖g k ‖ 2 n − ‖g‖ 2 n . The proof is split into four steps. Step 1 consists in showing the first bound. Step 2 consists in proving that, for N pr given at Equation (57) and for some constant C, the second bound holds. Step 3 concerns the control of N pr . Step 4 consists in bounding from above the Gaussian complexity. Let us conclude the proof of the lemma before proving these four steps. Putting together Steps 3 and 4, for c 3 < C and C 1 large enough, Step 2 gives the result.
Proof of Step 1: We start by writing that Using that (a + b) 2 ≤ 2a 2 + 2b 2 , g ∈ G ′ , and g satisfies Condition C3, we get Besides, the covering set is constructed such that g k − g n ≤ d(f )/8. It follows that Step 1 is proved.
Proof of Step 2: We prove the bound for some constant C. As g k ∈ G ′ , d(f ) = ‖g k ‖ 2 . Applying Theorem 3.5 in Chung and Lu [10], we have the stated bound for all positive λ. Taking λ = nd(f ) 2 /4 and using that ‖g k ‖ 2 2 = d(f ) 2 , we get

It follows that
It remains to calculate E X g 4 (X) for g ∈ G ′ . Precisely, we show the following result, which comes from the properties of the RKHS H: indeed, g ∈ H is written g = Σ v∈P g v , where the functions g v are centered and orthogonal in L 2 (P X ). Therefore E g 4 (X) is the sum of the following terms. Using the Cauchy-Schwarz inequality and the fact that ‖g v ‖ ∞ ≤ ‖g v ‖ Hv ≤ 2, and after calculation of the terms A i , since d(f ) 2 is assumed to be smaller than one, we get the bound. Step 2 is proved by combining (59) and (60).
Proof of Step 3: Let N pr be defined at Equation (57). We prove that We start from (56) and write that Using the Sudakov minoration (see Pisier [29]) we have that for all positive ω Hence by taking ω = d(f )/16, Step 3 is proved.
Proof of Step 4: The last step consists in bounding from above the Gaussian complexity E ε sup g∈G ′ |V n,ε (g)|. This control is performed by using Lemma 8.5 (page 25), according to Inequality (51), with λ n,v defined by Equation (13) satisfying C 1 λ n,v ≤ γ v and C 1 λ 2 n,v ≤ µ v for all v ∈ P. It follows, according to Condition C1, because ‖g v ‖ Hv ≤ 2. Now, according to Condition C2, and using that 2ab ≤ a 2 + b 2 , we get the last inequality, which comes from the fact that for all g ∈ G ′ , ‖g‖ 2 = d(f ). Finally, thanks to (36), we get the claimed bound.

8.6. Proofs of intermediate Lemmas

8.6.1. Proof of Lemma 8.5: Let us write the kernel k v through its expansion in an orthonormal basis of L 2 (P v ), where P v = Π a∈v P a . Let us consider the class of functions K(t) defined as above. In the following, we set µ k,v (t) = min(t 2 , ω k,v ); hence the bound holds as soon as g v ∈ K(t). Now, let us prove the lemma. The last inequality follows from the Cauchy-Schwarz inequality and Inequality (61). A simple calculation then leads to the result.

8.6.2. Proof of Lemma 8.6 (page 25): The centered process satisfies a concentration inequality given, for example, by Theorem 2.1 in Bartlett et al. [3]: if C is a class of functions f such that ‖f ‖ ∞ ≤ B and Ef (X) = 0, and if there exists γ > 0 such that for every f ∈ C, Var f (X) ≤ γ 2 , then for every x > 0, with probability at least 1 − e −x , the stated bound holds. For any t > 0, for G(t) defined by (45), let us consider the class of functions C(t) defined as follows. Note that if f ∈ C(t), then E X f (X v ) = 0 and ‖f ‖ ∞ ≤ b 2 . We have to study the supremum of this process.
Let ζ i be independent identically distributed Rademacher random variables and let E X,ζ denote the expectation with respect to the law of (X, ζ). By a symmetrization argument, since ‖g v ‖ ∞ ≤ b, applying the contraction principle we get, for Q n,v (t) defined by (11), the stated bound. The last inequality was proved by Mendelson [28], Theorem 41 (see the proof of Lemma 8.5). Now, thanks to (62), we get that for all x > 0, with probability greater than 1 − e −x , the supremum over g v ∈ G(t) is bounded. Taking x = c 2 nt 2 with t ≥ ν n , we get the corresponding bound with probability greater than 1 − e −c 2 nt 2 . The infimum of the right-hand side is reached at α = c 2 b/(16∆) and equals the stated quantity. The constants ∆ and c 2 should be such that this infimum is strictly smaller than b 2 /4. For example, if 16∆ < b/8, it remains to choose c 2 small enough such that

Proof of Lemma 8.7 (page 25):
Let t > ν n,v and h be defined as h = tg v / g v 2 . If g v satisfies the assumptions of the lemma, then h satisfies h 2 = t, h H ≤ 2 and h ∞ ≤ b. Applying Lemma 8.6 (page 25) to the function h, we obtain that for all t ≥ ν n,v , with probability greater than 1 − exp(−c 2 nt 2 ), we have |t − h n | ≤ bt/2 for all h ∈ G(t). This concludes the proof of the lemma.

Proof of Lemma 8.8 (page 25):
The proof of Lemma 8.8 is based on an isoperimetric inequality for Gaussian processes (Borell [6], Cirel'son et al. [11]), as stated in Theorem 3.8, page 61, in Massart [26]. Let us recall this inequality. Lemma 8.11. Let P be the Gaussian probability measure on R n , let φ be a function from R n to R, and let ‖φ‖ L be its Lipschitz semi-norm. Let Φ be the cumulative distribution function of the standard Gaussian distribution. Then the stated bound holds for any u. We apply Lemma 8.11 to φ(ε 1 , . . . , ε n ) = W n,n,v (t). By the Cauchy-Schwarz inequality, ‖φ‖ L = t/ √ n. It follows that Lemma 8.8 is proved.

8.6.5. Proof of Lemma 8.9 (page 26): We start with the proof of (48) in Lemma 8.9 by applying once again Lemma 8.11 above to the function φ(ε) = φ(ε 1 , . . . , ε n ) = W n,2,v (t). On the event Ω v,t defined by (46), we have ‖g v ‖ n ≤ bt/2 + ‖g v ‖ 2 . Besides, if ‖g v ‖ Hv ≤ 2, then ‖g v ‖ ∞ ≤ 2. Therefore, applying Lemma 8.6 with b = 2, we get the bound if ‖g v ‖ 2 ≤ t, leading to ‖φ‖ L = 2t/ √ n. It follows that (48) in Lemma 8.9 is proved. We now come to the proof of (49) in Lemma 8.9, using a Poissonian inequality for self-bounded processes (see Boucheron et al. [7] and Theorem 5.6, page 158, in Massart [26]). Let us recall it in the particular case we are interested in. Theorem 8.1. Let X 1 , · · · , X n be n i.i.d. variables. For i ∈ {1, · · · , n}, let X (−i) = (X 1 , . . . , X i−1 , X i+1 , . . . , X n ). Let Z be a nonnegative and bounded measurable function of X = (X 1 , · · · , X n ). Assume that for all i ∈ {1, · · · , n}, there exists a measurable function satisfying the stated conditions. We apply this result to Z defined as Z = Z(X 1 , · · · , X n ) = nE ε W n,2,v (t) = nE ε sup {|V n,ε (g v )| : ‖g v ‖ 2 ≤ t, ‖g v ‖ Hv ≤ 2} .
The variable Z is positive, and because the distribution of (ε 1 , . . . , ε n ) is symmetric, the stated identity holds. Let τ be the function in H v such that Z = nE ε V n,ε (τ ) (note that τ depends on (X 1 , . . . , X n ) and on (ε 1 , . . . , ε n )), and let Z i = E ε sup gv Σ j≠i ε j g v (X j ).
We show that Z and Z i satisfy the assumptions of Theorem 8.1, where the last inequality comes from the fact that sup x∈X |τ (x)| ≤ ‖τ ‖ Hv ≤ 2. Moreover, Z − Z i ≥ 0. Therefore, following Theorem 8.1, we get that for all positive u, P X,ε E ε W n,2,v (t) − E X,ε W n,2,v (t) ≤ u n ≤ exp − u 2 E X,ε W n,2,v (t) .
As E X,ε W n,2,v (t) ≤ Q n,v (t), see Lemma 8.5 page 25, we get the expected result since for all positive x P X [E ε W n,2,v (t) ≥ E X,ε W n,2,v (t) + x] ≤ exp − nx 2 Q n,v (t) .
The next step consists in comparing W n,n,v (λ n,v ) and W n,2,v (2λ n,v ). Recall that λ n,v ≥ ν n,v , see (13). Let g v such that g v n ≤ λ n,v .
This ends the proof of the lemma by taking κ = 10 + 4∆.

8.7. Algorithm: Propositions 8.1-8.4

We consider the minimization of C ′ (f 0 , θ) given at Equation (18). Because C ′ (f 0 , θ) is convex and separable, we use a block coordinate descent algorithm, group by group. We refer to Bubeck [9] for a review on convex optimization.
In what follows, the group v is fixed, and for given values of f 0 and θ w , w ≠ v, we look for the minimizer of C ′ with respect to θ v . Setting the notation below, we aim at minimizing Q with respect to θ v . In what follows we consider the case where at least one of µ ′ v and γ ′ v is non-zero.
If ∂Q v denotes the subdifferential of Q(θ v ) with respect to θ v , we need to solve 0 ∈ ∂Q v . Let us recall that for all v ∈ P the matrices K v are symmetric and positive definite.
Let us begin with the calculation of the subdifferential of ‖K v θ v ‖ with respect to θ v : if θ v ≠ 0, we have the usual gradient, and if θ v = 0, the subdifferential is the set of x ∈ R n such that ‖K −1 v x‖ ≤ 1. Therefore, if θ v ≠ 0, the optimality condition follows. Let us begin with the case µ ′ v = 0. Proposition 8.1. Let (ρ) + denote the positive part of ρ ∈ R. If µ ′ v = 0, the stated formula holds. Proof of Proposition 8.1. The problem amounts to minimizing U and taking θ v = K −1 v β v . The subdifferential of U is given below, and we get the result. Let µ ′ v > 0; the following proposition gives a necessary and sufficient condition for θ v = 0 to be the minimizer of Q. For the sake of clarity, let us recall the definitions of J and J * given in Section 5.1: for all x ∈ R n , J(x) = 2‖R v − µ ′ v K −1 v x‖ 2 , and J * = min{ J(x) : ‖K −1/2 v x‖ ≤ 1 }.
As already mentioned in Section 5, one may want to minimize C(f 0 , θ) under the additional constraint that ‖K 1/2 v θ v ‖ ≤ r v for some positive constant r v , v ∈ P.
For each group v, we thus have to minimize Q(θ v ) under the constraint ‖K 1/2 v θ v ‖ ≤ r v . This problem is equivalent to minimizing a penalized criterion for some λ that depends on r v . Let us first remark that, for a fixed γ ′ v and λ ≥ 0, if θ v (µ ′ v + λ) minimizes Q(θ v ) + λ‖K 1/2 v θ v ‖ with respect to θ v , then K . This can easily be proved by direct calculation. Therefore, one can proceed as follows: calculate θ v for λ = 0. If ‖K 1/2 v θ v ‖ ≤ r v , then one goes to the next step of the algorithm. If K
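The group-by-group scheme above can be sketched on a simplified surrogate. The code below is not the paper's exact criterion C ′ (f 0 , θ) (its kernel-weighted penalties are elided in this excerpt); it illustrates the same block coordinate descent with a group-lasso-type penalty, assuming orthonormal design blocks so that each block update has the closed soft-thresholding form, in the spirit of Proposition 8.1.

```python
import numpy as np

def block_coordinate_descent(y, blocks, gamma, n_iter=200):
    """Block coordinate descent for the group-lasso-type criterion
        (1/n) ||y - sum_v X_v beta_v||^2 + sum_v gamma_v ||beta_v||_2,
    a simplified stand-in for C'(f_0, theta). Assumes each block X_v
    satisfies X_v' X_v / n = I so the per-block minimizer is explicit."""
    n = len(y)
    beta = [np.zeros(Xv.shape[1]) for Xv in blocks]
    for _ in range(n_iter):
        for v, Xv in enumerate(blocks):
            # partial residual: remove all groups except v
            r = y - sum(B @ b for B, b in zip(blocks, beta)) + Xv @ beta[v]
            z = Xv.T @ r / n
            norm_z = np.linalg.norm(z)
            if norm_z <= gamma[v] / 2:
                beta[v] = np.zeros(Xv.shape[1])      # group v is dropped
            else:
                # group soft-thresholding: shrink z toward 0 by gamma_v/2
                beta[v] = (1.0 - gamma[v] / (2.0 * norm_z)) * z
    return beta
```

As in the text, a whole group is set to zero exactly when the (sub)gradient condition at θ v = 0 is met; otherwise the update shrinks the block, which is the positive-part form (ρ) + of Proposition 8.1 in this simplified setting.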