FROM REGRESSION FUNCTION TO DIFFUSION DRIFT ESTIMATION IN NONPARAMETRIC SETTING

We consider a diffusion model dXt = b(Xt)dt + σ(Xt)dWt, X0 = η, under conditions ensuring existence, stationarity and geometric β-mixing of the solution process. We assume that we observe a sample (Xk∆)0≤k≤n+1. Our aim is to study nonparametric estimators of the drift function b(·) under general conditions. We propose projection estimators based on a least-squares-type contrast and, in order to generalize existing results, we want to allow possibly non compactly supported projection bases and possibly unbounded volatility. To that aim, we relate the model to a simpler regression model, then to a more elaborate heteroscedastic model, plus some residual terms. This allows us to see first the role of heteroscedasticity and then the role of dependence between the variables, and to present the different probabilistic tools used to face each part of the problem. At each step, we try to assess the "price" of each assumption. This is the developed version of the talk given in August 2018 in Dijon, at the Journées MAS.
Fabienne Comte, Université Paris Descartes, Laboratoire MAP5, email: fabienne.comte@parisdescartes.fr

© EDP Sciences, SMAI 2020. Published in ESAIM: Proceedings and Surveys, https://doi.org/10.1051/proc/202068002

Introduction

This is a version of my talk about nonparametric drift estimation of a discretely observed continuous-time diffusion. The results are proved in a series of three papers jointly written with V. Genon-Catalot, [9], [8], [10], and thus proofs are not repeated here. The intention is rather to present an overview of the topic and to explain the link between diffusions and heteroscedastic regression from the nonparametric estimation point of view: "link" of course means both similarities and differences.

The three aforementioned works revisit a topic formerly studied in several other papers: Baraud (2000, 2002) studied nonparametric regression estimation, while Baraud et al. (2001a, 2001b) considered the same type of questions in a dependent context. Then Comte et al. (2007) applied the regression strategy to diffusion models. All these references rely on compactly supported bases and domains of estimation, and assume the density of the random regressors to be lower bounded on the compact estimation set. Moreover, the density lower bound appears in several denominators, so that some constants may explode, since this term can obviously be very small.

When heteroscedasticity is present in these works, the volatility function is also systematically assumed to be upper bounded: the assumption is not strong if the support is compact, but can be undesirable for classical diffusion models on a non compact support: for instance, in the square-root (or Cox-Ingersoll-Ross) model, σ(x) = σ√x is not bounded on R+. This is why we reconsider these models, aiming at extending the results to the non compact support context, which means avoiding several classical lower or upper bound assumptions. This new visit was made possible by new probabilistic tools, such as Tropp's (2012) matrix Chernoff and Bernstein inequalities. The relevance of these powerful results for the regression setting had already been noticed by Cohen et al. (2013), who proposed stability conditions: we reformulate them in a way that allows us to deal with the general regression estimator. Moreover, we redefine the regression estimator in a truncated version involving a matricial random cutoff, and this is new. We emphasize that, in several respects, the regression problem seen in this general setting turns out to have similarities with inverse problems. This is especially the case for the model selection results, where the new penalties and collections of models are rather elaborate and not easy to deal with from a theoretical point of view (while easy to implement in practice). The presentation that follows relies on all the papers mentioned above, which of course do not constitute an exhaustive review of the domain.

1. From diffusion to regression

We observe, with sampling interval ∆, the random variables (X_{i∆})_{0≤i≤n+1} from the diffusion process (X_t)_{t≥0},

    dX_t = b(X_t)\,dt + \sigma(X_t)\,dW_t, \quad X_0 = \eta, \quad t \ge 0,    (1)

where (W_t)_{t≥0} is a standard Brownian motion independent of η. Precise assumptions will be given later on; let us just say that they are such that equation (1) admits a unique solution, and that the initial condition η is chosen such that the process is stationary. Now, we set

    Y_{i\Delta} = \frac{X_{(i+1)\Delta} - X_{i\Delta}}{\Delta}, \qquad Z_{i\Delta} = \frac{1}{\sqrt{\Delta}}\,\sigma(X_{i\Delta})\left(\frac{W_{(i+1)\Delta} - W_{i\Delta}}{\sqrt{\Delta}}\right),

and see that (1) can be rewritten as

    Y_{i\Delta} = \underbrace{b(X_{i\Delta}) + Z_{i\Delta}}_{\text{regression equation}} + R_{i\Delta},    (2)

where R_{i∆} = R_{i∆,1} + R_{i∆,2}, with

    R_{i\Delta,1} = \frac{1}{\Delta}\int_{i\Delta}^{(i+1)\Delta} \big(b(X_s) - b(X_{i\Delta})\big)\,ds, \qquad R_{i\Delta,2} = \frac{1}{\Delta}\int_{i\Delta}^{(i+1)\Delta} \big(\sigma(X_s) - \sigma(X_{i\Delta})\big)\,dW_s.

Equation (2) is almost a regression equation, with nonparametric regression function b(·). In this model, the process Z_{i∆} plays the role of a heteroscedastic noise, and R_{i∆} is an additional residual term that we have to take into account. We note that, among the difficulties, only one process is observed: X_{i∆}, i = 0, 1, …, n+1.

We emphasize that the decomposition proposed in (2) is different from the one in Comte et al. (2007), where the noise term was ∆⁻¹∫ σ(X_s)dW_s over [i∆, (i+1)∆] and the residual term was R_{i∆,1}: this change, which at first sight seems minor, is in fact important because it allows the use of the Talagrand deviation inequality in the model selection part, together with coupling methods, while in the former paper we had to apply martingale deviations and chaining methods. With those probabilistic tools, the assumption that σ is bounded seemed unavoidable, whereas we can avoid it with the new definition of Z_{i∆} in (2).

Under conditions on b(·), σ(·) and η (see Assumptions (A1)-(A4) in section 4), there is a unique strictly stationary solution, with stationary density denoted by π. Moreover, we assume that the process (X_t)_{t≥0} is geometrically β-mixing (it would be interesting to study the influence of arithmetical mixing). It is clear from (2) that we have reduced model (1) to a regression model, involving a heteroscedastic noise Z_{i∆} and dependence between the observations. So, let us start with the simpler regression setting, before studying the impact of the presence of the volatility function and, lastly, the price of dependence between the variables.

2. The origins − Standard regression model

The standard regression model writes

    Y_i = b(X_i) + \varepsilon_i, \quad i = 1, \dots, n,    (3)

with noise variables (ε_i)_{1≤i≤n} i.i.d. centered with variance σ²_ε, explanatory variables (X_i)_{1≤i≤n} i.i.d. with density π, and the assumption that the sequence (X_i)_{1≤i≤n} is independent of (ε_i)_{1≤i≤n}. Here, the observations are the couples (Y_i, X_i)_{1≤i≤n} and our aim is nonparametric estimation of b(·).


2.1. Projection estimators
We intend to build a projection estimator of b(·). For that purpose, we define (φ_j)_{0≤j≤m−1}, an orthonormal basis of L²(A, dx), where A ⊆ R. In other words, the (φ_j)_j satisfy

    \int_A \varphi_j(x)\varphi_k(x)\,dx = \delta_{j,k},

where δ_{j,k} denotes the Kronecker symbol. Then, we look for an estimator that may be written

    \hat b_m = \sum_{j=0}^{m-1} \hat a_j\, \varphi_j,    (4)

where the coefficient estimates (â_j)_{0≤j≤m−1} should be computed from the observations (Y_i, X_i)_{1≤i≤n}.

2.2. Quotient estimators
Let us mention that we do not want to handle Nadaraya-Watson or quotient estimators, because we consider that they are not exactly of projection type as given by (4). The principle of such estimates is to consider the function r = bπ, where b is the regression function of interest and π denotes the density of the design (X_i)_{1≤i≤n}. Indeed, this function is often simple to estimate. Then a quotient estimator is defined by

    \tilde b_{m,m'} = \frac{\hat r_m}{\hat \pi_{m'}}.

Separately, r̂_m and π̂_{m'} are projection estimators and are easy to study. Note that the quotient can also be performed coefficient by coefficient. These estimators can work, and in some cases nothing else can be theoretically justified. However, the consequences are the same for both versions: they do not provide a development in a basis. Their study leads to risk bounds involving the risks of both the numerator and denominator estimators: each one separately is rather easy to handle, but the final rates depend on the regularity of the functions r and π, and not only on the regularity of b. Moreover, taking the ratio requires a careful choice of a cutoff to prevent the denominator from getting too small; lastly, the estimator depends on two dimension parameters which have to be selected. It is not clear whether the best selection of each one separately gives the best final quotient: maybe a joint selection should be studied.
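As a toy illustration, the quotient strategy can be sketched in a few lines. Everything below (the histogram basis on [0, 1], the sample size, the floor on the denominator) is an illustrative choice, not the procedure studied in the papers: r = bπ and π are estimated separately by projection, and the ratio is taken with a cutoff.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
X = rng.uniform(0, 1, n)                 # design with density pi = 1 on [0, 1]
b = lambda x: np.sin(2 * np.pi * x)      # regression function of interest
Y = b(X) + 0.3 * rng.standard_normal(n)

def hist_basis(x, m):
    # histogram basis on [0, 1]: phi_j = sqrt(m) * 1_{[j/m, (j+1)/m)}
    j = np.minimum((x * m).astype(int), m - 1)
    Phi = np.zeros((x.size, m))
    Phi[np.arange(x.size), j] = np.sqrt(m)
    return Phi

m = 20
Phi = hist_basis(X, m)
a_r = Phi.T @ Y / n          # projection coefficients of r = b * pi
a_pi = Phi.mean(axis=0)      # projection coefficients of pi

x0 = np.linspace(0.01, 0.99, 99)
P0 = hist_basis(x0, m)
r_hat, pi_hat = P0 @ a_r, P0 @ a_pi
cutoff = 0.05                # illustrative floor avoiding a small denominator
b_quot = r_hat / np.maximum(pi_hat, cutoff)

err = np.mean((b_quot - b(x0)) ** 2)
print(round(err, 3))
```

The example makes the discussion concrete: two dimension parameters (here both set to m) and a denominator cutoff must all be chosen, which is exactly the drawback pointed out above.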

Least squares estimator
To define our projection estimator, let us consider the m-dimensional linear space

    S_m = \mathrm{span}\{\varphi_0, \dots, \varphi_{m-1}\}.

In the sequel, the least squares estimator is

    \hat b_m = \arg\min_{t \in S_m} \frac 1n \sum_{i=1}^n \big(Y_i - t(X_i)\big)^2.    (5)

To compute the coefficients, it is enough to see that everything works as if a_0, …, a_{m−1} were the parameters in the linear model Y_i ≈ a_0φ_0(X_i) + ⋯ + a_{m−1}φ_{m−1}(X_i) + ε_i, and thus we know how to get the least squares estimator from the classical formula:

    \hat a^{(m)} = ({}^t\Phi_m \Phi_m)^{-1}\, {}^t\Phi_m Y, \quad \text{where } \Phi_m = (\varphi_j(X_i))_{1\le i\le n,\, 0\le j\le m-1} \text{ and } Y = {}^t(Y_1, \dots, Y_n).

We want to study the risk of the estimator for fixed m, to select an adequate model m̂ from the data, and to bound the risk of the final estimator b̂_m̂.
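In matrix form, this classical formula is immediate to implement. Here is a minimal sketch on simulated data with an (illustrative) trigonometric basis on [0, 1]; all numerical choices are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 2000, 5
X = rng.uniform(0, 1, n)                    # design with density pi = 1 on [0, 1]
b = lambda x: np.cos(2 * np.pi * x)         # true regression function, lies in S_m
Y = b(X) + 0.2 * rng.standard_normal(n)     # homoscedastic noise, sigma_eps = 0.2

def trig_basis(x, m):
    # orthonormal trigonometric basis of L^2([0, 1], dx): 1, sqrt(2)cos, sqrt(2)sin, ...
    cols = [np.ones_like(x)]
    k = 1
    while len(cols) < m:
        cols.append(np.sqrt(2) * np.cos(2 * np.pi * k * x))
        cols.append(np.sqrt(2) * np.sin(2 * np.pi * k * x))
        k += 1
    return np.column_stack(cols[:m])

Phi = trig_basis(X, m)                      # design matrix Phi[i, j] = phi_j(X_i)
a_hat, *_ = np.linalg.lstsq(Phi, Y, rcond=None)   # least squares coefficients
x0 = np.linspace(0, 1, 200)
mse = np.mean((trig_basis(x0, m) @ a_hat - b(x0)) ** 2)
print(round(mse, 5))
```

Since b lies in S_m here, the bias term vanishes and the remaining error is of the order of the variance term σ²_ε m/n discussed below.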
Indeed, in several computations, the constant π_min is crucial and, more precisely, 1/π_min is involved in several bounds. Obviously, such an assumption, namely

    \forall x \in A, \quad 0 < \pi_{\min} \le \pi(x) \le \pi_{\max} < +\infty,    (6)

cannot hold on a non compact set A. Assumption (6) is useful to relate conveniently the three norms appearing in the problem:
• the empirical norm ‖t‖²_n = (1/n) Σ_{i=1}^n t²(X_i),
• the L²(A, π(x)dx)-norm ‖t‖²_π = ∫_A t²(x)π(x)dx,
• the L²(A, dx)-norm ‖t‖² = ∫_A t²(x)dx.
Clearly, the empirical norm converges a.s. to the L²(A, π(x)dx)-norm, so the link between the first two norms is always possible, and is specifically improved by Tropp's (2012) results. Then, the L²(A, π(x)dx)-norm and the L²(A, dx)-norm are equivalent for A compact if π satisfies assumption (6): such an equivalence allows us to use indifferently the π-weighted norm, which is natural in the problem, or the usual norm, which is natural given the choice of the basis. The upper-bound part of (6) does not seem a strong constraint in all cases, but the lower-bound part of (6) is crucially related to the compactness of A. A solution may be to consider an L²(A, π(x)dx)-orthonormal basis: this is done in part of the proofs, but it is not possible in practice since π is unknown.

Non compact support, what for?
One may wonder why we are interested in this non compact setting, so let me explain our motivations.
1. First, we aim at generalizing existing results, whenever possible.
2. We also want to better understand how the results depend on π_min, and to see how certain bounds may explode when this term is small.
3. We have at hand simple and convenient non compactly supported bases: the Laguerre (R+-supported) and Hermite (R-supported) bases, which have nice properties. The Hermite basis is especially natural for diffusion models.
4. We also have in mind extending the regression strategy to other models with unknown support of the regressor, such as survival function estimation in presence of interval censoring, hazard rate estimation in presence of right censoring (these two cases can be expressed as univariate regression problems and, since nonnegative random variables are often involved in such models, the Laguerre basis is of natural use), conditional density estimation, etc.
5. Lastly, we believe that the regression strategy may be useful for inverse problems (for instance, to handle noisy observations of X).

The bases: compact support or not
Let us give a quick insight into the bases we concretely have in mind.

Examples of compactly supported bases, A = [0, 1]. Classical compactly supported bases are the histogram basis, the trigonometric basis, and the piecewise polynomial bases of degree r. All these collections satisfy

    \sup_{x \in A} \sum_{j=0}^{m-1} \varphi_j^2(x) \le c_\varphi^2\, m,

with c²_φ = 1 for histograms and the trigonometric basis, and c²_φ = r + 1 for piecewise polynomials. Moreover, the associated spaces S_m are nested (in general, or for m = 2^k with increasing values of k).
Laguerre basis, A = R+. The Laguerre polynomials (L_j) and Laguerre functions (ℓ_j) are given by

    L_j(x) = \sum_{k=0}^{j} \binom{j}{k}(-1)^k \frac{x^k}{k!}, \qquad \ell_j(x) = \sqrt 2\, L_j(2x)\, e^{-x}\, \mathbf 1_{x \ge 0}.

The (ℓ_j)_{j≥0} form an orthonormal basis of L²(R+, dx) and satisfy

    \forall x \ge 0, \quad |\ell_j(x)| \le \sqrt 2.    (7)

The spaces (S_m = span{ℓ_0, …, ℓ_{m−1}})_m are nested, and (7) implies that sup_{x≥0} Σ_{j=0}^{m−1} ℓ_j²(x) ≤ 2m.

Hermite basis, A = R. The Hermite polynomial and the Hermite function of order j are given, for j ≥ 0, by:

    H_j(x) = (-1)^j e^{x^2} \frac{d^j}{dx^j}\big(e^{-x^2}\big), \qquad h_j(x) = c_j H_j(x) e^{-x^2/2}, \quad c_j = \big(2^j j! \sqrt{\pi}\big)^{-1/2}.

We have: (h_j, j ≥ 0) is an orthonormal basis of L²(R, dx) and

    \forall x \in \mathbb{R}, \quad |h_j(x)| \le \Phi_0, \qquad \Phi_0 = \pi^{-1/4},    (8)

and thus the (S_m = span{h_0, …, h_{m−1}})_m are nested and (8) implies that sup_{x∈R} Σ_{j=0}^{m−1} h_j²(x) ≤ Φ₀² m.
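Both bases are easy to evaluate numerically. The following sketch builds ℓ_j and h_j from scipy.special and checks the uniform bounds (7) and (8) on a grid; the grid ranges and the number of functions are arbitrary choices:

```python
import numpy as np
from scipy.special import eval_laguerre, eval_hermite, factorial

def laguerre_fn(j, x):
    # Laguerre function: l_j(x) = sqrt(2) L_j(2x) e^{-x}, x >= 0
    return np.sqrt(2.0) * eval_laguerre(j, 2.0 * x) * np.exp(-x)

def hermite_fn(j, x):
    # Hermite function: h_j(x) = H_j(x) e^{-x^2/2} / sqrt(2^j j! sqrt(pi))
    c = 1.0 / np.sqrt(2.0**j * factorial(j) * np.sqrt(np.pi))
    return c * eval_hermite(j, x) * np.exp(-x * x / 2.0)

xp = np.linspace(0.0, 40.0, 8001)    # grid on R+
xr = np.linspace(-8.0, 8.0, 8001)    # grid on R, includes 0
sup_l = max(np.abs(laguerre_fn(j, xp)).max() for j in range(8))
sup_h = max(np.abs(hermite_fn(j, xr)).max() for j in range(8))
print(round(sup_l, 4), round(sup_h, 4))   # sqrt(2) ~ 1.4142 and pi^{-1/4} ~ 0.7511
```

The suprema are attained at x = 0 (all Laguerre functions equal √2 there, and h_0(0) = π^{−1/4}), which is why the grid values match the theoretical bounds.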

No support condition for the first basic result
To begin with, let us state a very simple and general result, recalled e.g. in Baraud (2000).
Proposition 2.1. Consider the least squares estimator b̂_m of b, given by (5). Then

    \mathbb{E}\big[\|\hat b_m - b\|_n^2\big] \le \inf_{t \in S_m} \int_A \big(b(x) - t(x)\big)^2 \pi(x)\,dx + \sigma_\varepsilon^2\, \frac mn,    (9)

where π denotes the common density of the X_i's.

Proof of Proposition 2.1. Let Π_m be the orthogonal projection (for the scalar product of Rⁿ) onto the subspace {(t(X_1), …, t(X_n))', t ∈ S_m} of Rⁿ, so that (b̂_m(X_1), …, b̂_m(X_n))' = Π_m Y with Y = (Y_1, …, Y_n)'. By Pythagoras,

    \|\hat b_m - b\|_n^2 = \|b - \Pi_m b\|_n^2 + \|\Pi_m \varepsilon\|_n^2,    (10)

where b and ε also denote the vectors (b(X_i))_i and (ε_i)_i. Now we have, thanks to (ε_i)_{1≤i≤n} being independent of (X_i)_{1≤i≤n}, by elementary matrix computation,

    \mathbb{E}\big[\|\Pi_m \varepsilon\|_n^2 \,\big|\, X_1, \dots, X_n\big] = \frac{\sigma_\varepsilon^2}{n}\,\mathrm{Tr}(\Pi_m) \le \sigma_\varepsilon^2\, \frac mn.

Therefore, (9) follows. □

We can check that, in (9), the bias gets small when m grows, while the variance increases. This implies that a compromise will have to be found through a relevant choice of m. But we mainly detail this result to emphasize that:
• the bound (9) is general, almost exact, and holds for any basis support;
• as soon as Π_m has rank m a.s., the variance term is exactly equal to

    \mathbb{E}\big[\|\Pi_m \varepsilon\|_n^2\big] = \sigma_\varepsilon^2\, \frac mn,    (11)

and this does not depend on the basis. This is important and not so obvious.

Comparison with density estimation
Why is it important to notice the equality (11)? This is due to the comparison with density estimation. Recall that, for i.i.d. X_i with density π, the projection estimator is defined by (see section 2.2):

    \hat\pi_m = \sum_{j=0}^{m-1} \tilde a_j\, \varphi_j, \qquad \tilde a_j = \frac 1n \sum_{i=1}^n \varphi_j(X_i).

This estimator satisfies

    \mathbb{E}\big[\|\hat\pi_m - \pi\|^2\big] \le \|\pi - \pi_m\|^2 + \frac 1n\, \mathbb{E}\Big[\sum_{j=0}^{m-1} \varphi_j^2(X_1)\Big],

where π_m denotes the orthogonal projection of π on S_m. For all the bases we described above, we have

    \Big\|\sum_{j=0}^{m-1} \varphi_j^2\Big\|_\infty \le c_\varphi^2\, m.

In some cases, we even have Σ_{j=0}^{m−1} φ_j² ≡ c²_φ m: this holds for histograms and for the trigonometric basis with odd dimension, with c_φ = 1. Thus we obtain a variance bound which is exactly m/n, and thus is sharp for some bases.
However, for the Laguerre basis, it is true that Σ_{j=0}^{m−1} ℓ_j²(0) = 2m and thus sup_{x∈R+} Σ_{j=0}^{m−1} ℓ_j²(x) = 2m: therefore, at first sight, we may draw the same conclusion. However, for the Hermite and Laguerre bases, it is proved in Comte and Genon-Catalot (2018a, Prop. 3.1) that, for some constant c and under mild conditions on π,

    \mathbb{E}\Big[\sum_{j=0}^{m-1} \varphi_j^2(X_1)\Big] = \int \sum_{j=0}^{m-1} \varphi_j^2(x)\, \pi(x)\,dx \le c\,\sqrt m,

so that the variance term is in fact of order √m/n. Thus, for density estimation, the order of the variance depends on the basis. This is why we emphasize, for b̂_m, that the bound stated in Proposition 2.1 is equal to σ²_ε m/n whatever the basis.

Bound on the integrated risk
The first step toward the inverse problem is encountered when looking for a bound on the integrated risk E[‖b̂_m − b‖²_π]. Define the empirical Gram matrix

    \widehat\Psi_m = \frac 1n\, {}^t\Phi_m \Phi_m = \Big(\frac 1n \sum_{i=1}^n \varphi_j(X_i)\varphi_k(X_i)\Big)_{0 \le j,k \le m-1},

which is such that E[Ψ̂_m] = Ψ_m := (∫ φ_j(x)φ_k(x)π(x)dx)_{0≤j,k≤m−1}. Provided that Ψ̂_m is invertible a.s., formula (5) can be rewritten as

    \hat a^{(m)} = \frac 1n\, \widehat\Psi_m^{-1}\, {}^t\Phi_m Y.

To control the estimator, we have to study the distance between Ψ̂_m and its expectation, and more precisely the quantity ‖Ψ_m^{−1/2} Ψ̂_m Ψ_m^{−1/2} − Id_m‖_op. The link between the empirical and integrated-π norms is controlled on the random set defined by

    \Omega_m(\delta) = \Big\{ \big\| \Psi_m^{-1/2}\, \widehat\Psi_m\, \Psi_m^{-1/2} - \mathrm{Id}_m \big\|_{\mathrm{op}} \le \delta \Big\},

since on Ω_m(δ), (1−δ)‖t‖²_π ≤ ‖t‖²_n ≤ (1+δ)‖t‖²_π for all t ∈ S_m. We can relate this set to the aforementioned distance between Ψ̂_m and Ψ_m, and bound the probability of its complement by using the strategy of Theorem 1 in Cohen et al. (2013).

Proposition 2.2. Assume that Ψ_m is invertible and that assumption (13) is satisfied. Then, for all 0 ≤ δ ≤ 1,

    \mathbb{P}\big(\Omega_m(\delta)^c\big) \le 2m \exp\Big( - \frac{c(\delta)\, n}{L(m)\, \|\Psi_m^{-1}\|_{\mathrm{op}}} \Big),

where L(m) = sup_{x∈A} Σ_{j=0}^{m−1} φ_j²(x) and c(δ) is a positive constant depending only on δ.

As a consequence, we choose δ = 1/2, define Ω_m := Ω_m(1/2), and obtain that P(Ω_m^c) is negligible if m is such that

    L(m)\, \big( \|\Psi_m^{-1}\|_{\mathrm{op}} \vee 1 \big) \le c\, \frac{n}{\log(n)}.    (15)

Condition (15) is also called the stability condition. It is worth noting that these results are available for all classical bases, whether compactly supported or not. This explains why we define the trimmed estimator

    \tilde b_m = \hat b_m\, \mathbf 1_{\widehat\Lambda_m}, \qquad \widehat\Lambda_m = \Big\{ L(m)\, \big( \|\widehat\Psi_m^{-1}\|_{\mathrm{op}} \vee 1 \big) \le c\, \frac{n}{\log(n)} \Big\}.    (16)

This truncation, based on an empirical version of the stability condition, is mandatory to obtain a bound on the integrated risk, and is the new point of our generalisation, in particular compared to the results of Baraud (2002) and Cohen et al. (2013). This is also what makes the regression study rely on methods used for inverse problems. Assume moreover that b_A ∈ L⁴(A, π(x)dx). Then, for all m satisfying (15), we have

    \mathbb{E}\big[\|\tilde b_m - b_A\|_\pi^2\big] \le C_1 \inf_{t \in S_m} \|b_A - t\|_\pi^2 + C_2\, \sigma_\varepsilon^2\, \frac mn + \frac{c}{n},    (17)

where c is a constant depending on E(ε₁⁴) and ∫ b⁴_A(x)π(x)dx. Note that the constant C₁ in front of the squared bias term is close to 1, especially for large n.
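The empirical Gram matrix and the stability check are straightforward to compute. The sketch below uses Hermite functions (via their standard three-term recurrence) with Gaussian regressors; the sample size is arbitrary and the constant in the stability condition is set to 1 purely for the demo:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
X = rng.standard_normal(n)          # regressors with standard Gaussian density

def hermite_design(x, m):
    # design matrix of Hermite functions h_0, ..., h_{m-1} (three-term recurrence)
    H = np.zeros((x.size, m))
    H[:, 0] = np.pi ** -0.25 * np.exp(-x * x / 2)
    if m > 1:
        H[:, 1] = np.sqrt(2.0) * x * H[:, 0]
    for j in range(2, m):
        H[:, j] = np.sqrt(2.0 / j) * x * H[:, j - 1] - np.sqrt((j - 1) / j) * H[:, j - 2]
    return H

for m in (5, 15, 25):
    Phi = hermite_design(X, m)
    Psi_hat = Phi.T @ Phi / n                             # empirical Gram matrix
    inv_norm = np.linalg.norm(np.linalg.inv(Psi_hat), 2)  # ||Psi_hat_m^{-1}||_op
    stable = m * inv_norm <= n / np.log(n)                # illustrative stability check (c = 1)
    print(m, f"{inv_norm:.2e}", stable)
```

As m grows, ‖Ψ̂_m⁻¹‖_op blows up and the check eventually fails: this is exactly the situation the trimming in (16) guards against.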

Case of compact support A
If A is compact, one usually assumes that (6) holds and that b_A ∈ L²(A, dx); this last condition is not very strong when A is compact. Under the upper-bound part of (6) (i.e. π(x) ≤ π_max < +∞, ∀x ∈ A) and this integrability condition, we get that

    \|b_A - t\|_\pi^2 \le \pi_{\max}\, \|b_A - t\|^2, \quad \forall t \in S_m.

Therefore, the bias term in bound (17) can be related to standard orders on the regularity spaces associated with the bases (generally Besov spaces). In addition, using the lower-bound part of (6), we can prove

    \|\Psi_m^{-1}\|_{\mathrm{op}} \le \frac{1}{\pi_{\min}}.

This is why, under (6) and for bases such that L(m) ≤ c²_φ m, the stability condition (15) reduces to m ≤ c π_min n/log(n), which is weak and standard. No random cutoff for the estimator is needed, and the problem gets simpler.
Note that we then recover standard rates on Besov spaces (i.e. rates of order n^{−2α/(2α+1)} for b belonging to Besov spaces associated with regularity α).

Laguerre and Hermite bases
As already said, the Laguerre and Hermite bases have non compact support (A = R+ for Laguerre and A = R for Hermite). For these bases, we have the first favorable property that L(m) ≤ c²_φ m, with c²_φ = 2 for Laguerre and c²_φ = Φ₀² for Hermite. In fact, we believe that the order in m of ‖Ψ_m^{−1}‖_op can be much more explosive, but we can only prove an upper bound on it.

3. Heteroscedastic regression model
The next question is: is the risk bound valid for observations from the diffusion model? Not exactly, since the discrete diffusion observations involve two additional problems:
• heteroscedasticity in the noise,
• dependence between the observations.
Let us first deal with heteroscedasticity and consider observations (X_i, Y_i)_{1≤i≤n} from the model

    Y_i = b(X_i) + \sigma(X_i)\,\varepsilon_i, \quad i = 1, \dots, n.    (19)

If σ is bounded on A, by say ‖σ‖_{A,∞}, it is easy to prove that the previous results extend with σ²_ε replaced by ‖σ‖²_{A,∞}. Now, we study the case where we only assume E[σ⁴(X_1)] < +∞. To that aim, we set

    \Psi_{m,\sigma^2} = \Big( \int \varphi_j(x)\varphi_k(x)\, \sigma^2(x)\, \pi(x)\,dx \Big)_{0 \le j,k \le m-1}.

Then we can prove that, for all m satisfying (15), the variance term σ²_ε m/n in the risk bound is replaced by

    \frac 1n\, \mathrm{Tr}\Big( \Psi_m^{-1/2}\, \Psi_{m,\sigma^2}\, \Psi_m^{-1/2} \Big).

If σ(·) ≡ σ is constant, then Ψ_{m,σ²} = σ²Ψ_m, the trace equals σ²m, and we recover the homoscedastic case. But the order of the general variance term is not so obvious. To study this new quantity, the following properties are useful.
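The trace Tr(Ψ_m^{−1/2} Ψ_{m,σ²} Ψ_m^{−1/2}) = Tr(Ψ_m^{−1} Ψ_{m,σ²}) is easy to evaluate on simulated data. The quick check below (illustrative basis and design, empirical versions of the matrices) confirms that a constant σ reduces it exactly to σ²m:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 4000, 6
X = rng.uniform(0, 1, n)

def trig_basis(x, m):
    # orthonormal trigonometric basis of L^2([0, 1], dx)
    cols = [np.ones_like(x)]
    k = 1
    while len(cols) < m:
        cols.append(np.sqrt(2) * np.cos(2 * np.pi * k * x))
        cols.append(np.sqrt(2) * np.sin(2 * np.pi * k * x))
        k += 1
    return np.column_stack(cols[:m])

Phi = trig_basis(X, m)

def variance_trace(sigma2):
    # empirical Psi_m and Psi_{m, sigma^2}, then Tr(Psi^{-1} Psi_sigma)
    Psi = Phi.T @ Phi / n
    Psi_s = (Phi * sigma2[:, None]).T @ Phi / n
    return np.trace(np.linalg.solve(Psi, Psi_s))

print(round(variance_trace(np.full(n, 0.25)), 6))   # constant sigma^2 = 0.25: exactly 0.25 * m = 1.5
print(round(variance_trace(0.25 + X**2), 6))        # heteroscedastic case: basis-dependent value
```

In the heteroscedastic case the trace depends on the interplay between the basis, the design density and σ², which is precisely why estimating it is part of the model selection difficulty discussed below.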

If E[σ²(X_1)] < +∞ and the basis satisfies L(m) ≤ c²_φ m, then

    \mathrm{Tr}\Big( \Psi_m^{-1/2}\, \Psi_{m,\sigma^2}\, \Psi_m^{-1/2} \Big) \le c_\varphi^2\, m\, \mathbb{E}[\sigma^2(X_1)]\, \|\Psi_m^{-1}\|_{\mathrm{op}}.
Moreover, if we intend to provide a bound for the integrated risk, we still have to consider a truncated estimator: the definition of b̃_m is the same as previously, given by (16), under the same stability condition.

Proposition 3.3. Assume that Ψ_m is invertible a.s. and that E(σ⁴(X_1)) < +∞, E(b⁴(X_1)) < +∞. Let m satisfy condition (15) and let b̃_m be given by (16). Then

    \mathbb{E}\big[\|\tilde b_m - b_A\|_\pi^2\big] \le C_1 \inf_{t \in S_m} \|b_A - t\|_\pi^2 + \frac{C_2}{n}\,\mathrm{Tr}\Big( \Psi_m^{-1/2}\, \Psi_{m,\sigma^2}\, \Psi_m^{-1/2} \Big) + \frac cn,

where c is a constant depending on E(ε₁⁴) and ∫ b⁴_A(x)π(x)dx. Since the variance term must be replaced by an estimator in order to propose a model selection criterion (i.e. a choice of m from the data), a new difficulty related to the inverse problem aspect arises.

4. Diffusion model
Now, let us get back to our original problem. The last step consists in managing dependent variables.

The price of dependence
We consider a set of assumptions (A1)-(A4). Assumption (A1) ensures that Equation (1) has a unique strong solution adapted to the filtration (F_t = σ(η, W_s, s ≤ t), t ≥ 0); in particular, the functions b, σ are assumed to have linear growth:

    |b(x)| + |\sigma(x)| \le C\,(1 + |x|).

The additional assumption (A2) implies that Equation (1) admits a unique invariant probability π(x)dx.
The previous results have to be extended by replacing X_i by X_{i∆} and Y_i by Y_{i∆} = (X_{(i+1)∆} − X_{i∆})/∆. The set Ω_m(δ) is defined as previously, to compare the empirical norm with its expectation. We can then prove a deviation bound for P(Ω_m(δ)^c) analogous to that of Proposition 2.2: under geometric mixing, the cost of dependence is a log(n∆) factor in the exponential and a negligible additive term. The extension from the independent to the mixing case is based on the coupling method and Berbee's Lemma, as presented in Viennet (1997).
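To see the regression reduction at work, one can simulate a toy Ornstein-Uhlenbeck diffusion dX_t = −θX_t dt + σdW_t (whose drift b(x) = −θx is linear) by an Euler scheme and run a least squares fit on the pairs (X_{i∆}, Y_{i∆}); all parameter values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
theta, sigma, delta = 1.0, 0.5, 0.05
n = 20000

# Euler scheme for dX = -theta X dt + sigma dW, started from the stationary law
X = np.empty(n + 2)
X[0] = rng.normal(0.0, sigma / np.sqrt(2.0 * theta))
for i in range(n + 1):
    X[i + 1] = X[i] - theta * X[i] * delta + sigma * np.sqrt(delta) * rng.standard_normal()

Xi = X[: n + 1]                       # observed positions X_{i delta}
Yi = (X[1 : n + 2] - Xi) / delta      # normalized increments Y_{i delta}

# least squares fit of the drift on span{1, x} (the true drift is linear here)
A = np.column_stack([np.ones_like(Xi), Xi])
coef, *_ = np.linalg.lstsq(A, Yi, rcond=None)
print(np.round(coef, 2))              # slope should be close to -theta = -1
```

With a projection basis in place of {1, x}, the same pairs (X_{i∆}, Y_{i∆}) feed the least squares contrast of the previous sections.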

Diffusion risk bound
Among the difficulties, we can notice that we have to consider the truncated version b̃_m of b̂_m for both the empirical and the integrated risks. Moreover, the cutoff involves additional log terms and changes in the constants: the truncation set is now defined through the empirical condition

    L(m)\, \big( \|\widehat\Psi_m^{-1}\|_{\mathrm{op}} \vee 1 \big) \le c\, \frac{n\Delta}{\log^2(n\Delta)},    (23)

where c involves a numerical constant C₀, with C₀ ≥ 72. For geometric β-mixing, the price to pay for dependence is an additional log term in the bound, and the fact that the constant c is now unknown. In practice, we take n∆ large enough, a constant equal to 1, and a cutoff equal to n∆/log^{2+ε}(n∆) for some ε > 0. Then, for m satisfying

    L(m)\, \big( \|\Psi_m^{-1}\|_{\mathrm{op}} \vee 1 \big) \le c\, \frac{n\Delta}{\log^2(n\Delta)},    (24)

with c given in (23), we obtain a risk bound analogous to that of Proposition 3.3, with n replaced by n∆ and positive constants c₁, c₂, c₃.
Note that condition (24) is similar to condition (15), with n replaced by n∆ and a log term which becomes log² due to dependency. Still compared to the independent case, there is also a loss in the multiplicative constants of the two bounds.
Let us comment on the diffusion bounds. There are three main differences in comparison with our 2007 result (Comte et al. (2007)):
• the estimator is truncated with a random cutoff,
• the variance order is new and more general,
• the stability condition (24) is new and expressed in terms of the basis at hand; this is what allows us to estimate the constraint defining the truncation in (23).
Of course, we can distinguish special cases. As in the independent case, if A is compact and π(x) ≥ π_min for all x ∈ A, we have ‖Ψ_m^{−1}‖_op ≤ 1/π_min and the restriction (24) on m reduces to the simpler condition m ≤ c n∆/log²(n∆), as in Comte et al. (2007).

5. Model selection and adaptive result
For the sake of brevity, we present the model selection procedure directly in the diffusion context.

Procedure of selection
Now, we aim at selecting an adequate value for the dimension of the projection space, and the procedure should rely on the available data. For this, we consider a collection of models M_{n∆}, gathering the dimensions m that satisfy a reinforced stability condition involving a constant d and a numerical constant C₀.
The stability condition here has to be reinforced: this is due to the fact that the problem is very difficult to handle from the theoretical point of view. Moreover, we have to obtain an additional control of ‖Ψ̂_m − Ψ_m‖_op, and this requires more constraints than the bound on P(Ω_m(δ)^c).
Proposition 5.1 provides, for all u > 0, a deviation bound on ‖Ψ̂_m − Ψ_m‖_op in two settings: (i) the independent case, where X_1, …, X_n are i.i.d. with common density π such that ‖π‖_∞ < ∞; (ii) the diffusion (dependent) case, under (A1)-(A4) (so that (X_{i∆})_i is strictly stationary and geometrically β-mixing, i.e. β(i) = β_X(i∆) ≤ Ke^{−θi∆} for some constants K > 0, θ > 0), assuming in addition that π is upper bounded (i.e. ‖π‖_∞ < +∞).

For model selection, we work with the collection of nested spaces S_m spanned by the basis (φ_0, …, φ_{m−1}). Now, we can define the final estimator. We consider the empirical counterpart of M_{n∆}, namely M̂_{n∆}, a random collection of models in which the theoretical quantities are replaced by their empirical versions. The proposed criterion follows from standard approximations, which make the procedure likely to perform an automatic bias-variance tradeoff:

    \widehat m = \arg\min_{m \in \widehat{\mathcal M}_{n\Delta}} \Big( - \|\hat b_m\|_n^2 + \mathrm{pen}(m) \Big), \qquad \mathrm{pen}(m) = \kappa\, \mathbb{E}[\sigma^2(X_0)]\, \frac{c_\varphi^2\, m\, \|\widehat\Psi_m^{-1}\|_{\mathrm{op}}}{n\Delta},

where pen(m) has the order of an upper bound on the variance term, which can be estimated. In the independent regression cases, the term ‖b̂_m‖²_n is unchanged, n∆ is replaced by n, and the squares on the log terms disappear. In the homoscedastic case, the penalty is equal to σ²_ε m/n, which is simpler (no random matrix) and yet sharper.
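In the simple homoscedastic regression case, where the penalty reduces to a multiple of σ²_ε m/n, the selection step can be sketched as follows (the constant κ and all sizes are illustrative calibration choices, not the calibrated values of the papers):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000
X = rng.uniform(0, 1, n)
b = lambda x: np.sin(2 * np.pi * x)
sigma_eps = 0.5
Y = b(X) + sigma_eps * rng.standard_normal(n)

def trig_basis(x, m):
    # orthonormal trigonometric basis of L^2([0, 1], dx)
    cols = [np.ones_like(x)]
    k = 1
    while len(cols) < m:
        cols.append(np.sqrt(2) * np.cos(2 * np.pi * k * x))
        cols.append(np.sqrt(2) * np.sin(2 * np.pi * k * x))
        k += 1
    return np.column_stack(cols[:m])

kappa = 2.0                                   # illustrative calibration constant
dims = range(1, 30)
crits = []
for m in dims:
    Phi = trig_basis(X, m)
    a_hat, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    resid = np.mean((Y - Phi @ a_hat) ** 2)   # empirical contrast, up to -||Y||_n^2
    crits.append(resid + kappa * sigma_eps**2 * m / n)
m_hat = list(dims)[int(np.argmin(crits))]
print(m_hat)                                  # sin(2 pi x) needs m >= 3 basis functions
```

In the diffusion setting, σ²_ε m/n would be replaced by the random penalty involving ‖Ψ̂_m⁻¹‖_op, and the minimization restricted to the random collection M̂_{n∆}.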

General risk bound
We can prove the following result: up to multiplicative constants and a negligible additive term, the risk of the selected estimator achieves the best bias-variance tradeoff over the random collection of models.
Some practical considerations are in order. In the definition of M̂_{n∆}, the constant d is replaced by 1, and the log²(n∆) by log^{2+ε}(n∆), which preserves the result for n∆ large enough. The term E[σ²(X_0)] is replaced by a residual least squares estimator, computed as the mean of the squared residuals (Y_{i∆} − b̂_{m_n}(X_{i∆}))² for an arbitrary dimension m_n. A theoretical study of this step is done in Baraud (2002). The constant κ is calibrated by preliminary simulation experiments, as usual for model selection methods.
Let us say what is new here. First, we provide a general result with no support constraint. Second, the result requires only moment conditions for σ. Lastly, the collection of models M n∆ is new and random, and contains implicitly the random truncation of the estimator.
Examples of simulation results are given in the papers [9], [8] and [10]. They show that the method works with the Laguerre and Hermite bases, and that in some cases ‖Ψ_m^{−1}‖_op can increase very fast, which considerably reduces the number of models satisfying the stability condition. It seems, however, that such cases are also associated with very good estimators even for small dimensions.

Concluding Remarks
We have presented a generalization of the nonparametric least squares procedure for regression function estimation, which is, from the theoretical point of view, compatible with non compactly supported bases and unbounded volatilities, and remains, from the practical point of view, fast and simple. We have introduced a new random cutoff in the definition of the estimator and a new stability condition. We propose an associated model selection procedure which relies heavily on methods used to handle inverse problems.
Clearly, open questions remain. Optimality results are available for homoscedastic regression, but optimality remains an open problem for heteroscedastic regression under our general assumptions. Gaïffas (2005, 2007) probably initiated a possible way to explore the question. Another question is related to penalization. In the heteroscedastic context, the variance term is proportional to Tr(Ψ_m^{−1/2} Ψ_{m,σ²} Ψ_m^{−1/2}), divided by n in the i.i.d. context and by n∆ in the diffusion setting. This term should provide the correct value of the penalty in the model selection step. For diffusion models, we used instead the bound E[σ²(X_0)] c²_φ m ‖Ψ_m^{−1}‖_op/(n∆), but it is probably too large in practice. Numerical tests show that separating the matrices Ψ_m^{−1} and Ψ_{m,σ²} may not be a good idea, because a kind of compensation seems to happen in their product.
In the i.i.d. heteroscedastic case, further investigations finally led us to risk bounds for a penalty proportional to m ‖Ψ̂_m^{−1} Ψ̂_{m,σ²}‖_op/n, with [Ψ̂_{m,σ²}]_{j,k} = n^{−1} Σ_{i=1}^n φ_j(X_i)φ_k(X_i)σ²(X_i), which is numerically better but still requires the knowledge of σ. In practice, it is computed by replacing σ²(X_i) by (Y_i − b̂_m̃(X_i))², with m̃ defined as M_n − 2, where M_n is the maximal element of M̂_n, the random collection of models considered in the regression setting. This is the beginning of further theoretical questions on the topic.