SOME RESULTS ON STATISTICS AND FUNCTIONAL DATA ANALYSIS

Abstract. This paper presents some recent results on regression and classification in the Functional Data Analysis setting. The first work, devoted to multiple kernel SVM for classifying functional data in Sobolev spaces, focuses on optimally combining the information carried by each derivative of the original functions. The second work proposes an input/output-based aggregation rule for functional data classification. Finally, the last contribution addresses the problem of regression when the functional data live in a finite dimensional submanifold.

Multiple kernel SVM for classifying functional data in Sobolev spaces

Each functional observation $x_i$ can be complemented by its successive derivatives $\{D^j x_i\}$. In order to combine these different views of the same objects, we consider the extension of the multiple kernel SVM framework to FD. In what follows, we recall in Section 1.2 the basic procedure to reconstruct the functional form of an object from its discrete observations, using a fixed set of B-spline functions. Then, in Section 1.3, we recall the SVM technique and adapt its more general framework, the multiple kernel SVM, to FD, where each set of derivatives results in a kernel matrix. Finally, in Section 1.4, we illustrate the interest of our approach on a real-world dataset.

Spline smoothing
For a given function $x_i$, we assume that we only have $p$ measurements $\{y_{ij}\}_{j=1,\ldots,p}$ at discrete time points $\{t_j\}_{j=1,\ldots,p}$ in $[0, T]$. These measurements might have been corrupted by noise $\epsilon_{ij}$:
$$y_{ij} = x_i(t_j) + \epsilon_{ij}, \qquad j = 1, \ldots, p,$$
where the $\epsilon_{ij}$ are assumed to be independent across $i$ and $j$.
From these measurements, we estimate the functional form of the data by spline smoothing. We consider a fixed basis of $q$ B-splines $\{\phi_k\}_{k=1,\ldots,q}$ of order 6, with $q = 6 + p$. In other words, we assume that
$$x_i(t) = c_i^\top \phi(t) = \sum_{k=1}^{q} c_{ik}\, \phi_k(t),$$
where $c_i = (c_{i1}, \ldots, c_{iq})^\top$ and $\phi(t) = (\phi_1(t), \ldots, \phi_q(t))^\top \in \mathbb{R}^q$.
The coefficients $c_i$ are estimated by penalized least squares, where we penalize the $L_2$ norm of the 4th derivative function, since we would like to control the curvature of the 2nd derivative function.
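As an illustration, the following Python sketch shows one way to carry out such a penalized B-spline fit. It is a minimal P-spline-style implementation: the 4th-order difference penalty on the coefficients is used as a discrete stand-in for the integral of the squared 4th derivative, and the function names, the default number of basis functions and the value of the smoothing parameter `lam` are illustrative choices, not the paper's.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(t_obs, knots, degree):
    """Evaluate each B-spline basis function at the observation times."""
    n_basis = len(knots) - degree - 1
    B = np.empty((len(t_obs), n_basis))
    for k in range(n_basis):
        c = np.zeros(n_basis)
        c[k] = 1.0
        B[:, k] = BSpline(knots, c, degree, extrapolate=False)(t_obs)
    return np.nan_to_num(B)

def smooth_curve(y, t_obs, n_basis=20, degree=5, lam=1e-2):
    """Penalized least-squares fit of one discretized curve.
    degree=5 corresponds to B-splines of order 6; the 4th-order difference
    penalty on the coefficients approximates the penalty on the 4th derivative."""
    inner = np.linspace(t_obs[0], t_obs[-1], n_basis - degree + 1)
    knots = np.r_[[t_obs[0]] * degree, inner, [t_obs[-1]] * degree]  # clamped knots
    B = bspline_basis(t_obs, knots, degree)
    D = np.diff(np.eye(n_basis), n=4, axis=0)                        # 4th-order differences
    c = np.linalg.solve(B.T @ B + lam * D.T @ D, B.T @ np.asarray(y, dtype=float))
    return BSpline(knots, c, degree)
```

A curve would then be smoothed with, e.g., `x_hat = smooth_curve(y_i, t_obs)`, and its derivative views obtained with `x_hat.derivative(1)` and `x_hat.derivative(2)`.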

A refresher on SVM
The SVM method was introduced in [Cortes and Vapnik, 1995] for vectorial data and was extended to FD in [Rossi and Villa, 2006]. When seeking a hyperplane that linearly separates functions belonging to two different classes, the SVM method aims to maximize the margin $\delta$, i.e. the distance between the frontier and the closest objects. From a formal point of view, assuming the two classes are linearly separable, the SVM approach solves the following convex optimization problem:
$$\min_{a,\, b} \ \frac{1}{2}\|a\|_{L_2}^2 \qquad \text{s.t.} \quad y_i\big(\langle a, x_i\rangle_{L_2} + b\big) \ge 1, \quad i = 1, \ldots, n,$$
where $a \in L_2$ defines the orientation of the hyperplane. In the non-separable case, the problem is relaxed into
$$\min_{a,\, b,\, \xi} \ \frac{1}{2}\|a\|_{L_2}^2 + C \sum_{i=1}^{n} \xi_i \qquad \text{s.t.} \quad y_i\big(\langle a, x_i\rangle_{L_2} + b\big) \ge 1 - \xi_i, \quad \xi_i \ge 0,$$
where $\{\xi_i\}_{i=1,\ldots,n}$ are slack variables allowing objects to be on the wrong side of the hyperplane and $C$ is a non-negative hyper-parameter balancing the objective function between the margin and the soft error $\sum_{i=1}^{n} \xi_i$. The previous problem being convex, it is equivalent to solving its associated dual problem:
$$\max_{\alpha} \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,i'=1}^{n} \alpha_i \alpha_{i'} y_i y_{i'} \langle x_i, x_{i'}\rangle_{L_2} \qquad \text{s.t.} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0.$$
The hyperplane is then recovered as a linear combination of the support vectors:
$$a = \sum_{i=1}^{n} \alpha_i y_i x_i.$$
Moreover, if $C$ is not too big, the solution $\alpha$ is sparse and only a few support vectors are required to define the hyperplane.
It is important to notice that the objective function of the dual problem is defined through the inner products $\{\langle x_i, x_{i'}\rangle_{L_2}\}_{i,i'=1,\ldots,n}$. This is also the case for the decision function: $\langle a, x\rangle_{L_2} + b = \sum_{i=1}^{n} \alpha_i y_i \langle x_i, x\rangle_{L_2} + b$. One can therefore replace these inner products with any kernel function $k$ and, by the kernel trick, apply the margin maximization principle in a RKHS, where $k : L_2 \times L_2 \to \mathbb{R}$ is a positive definite kernel. Using a kernel function and representing the data in another space can ease the task of linearly separating the two classes.

Multiple kernel SVM for FD
We propose an extension of [Rossi and Villa, 2006] by considering the derivative functions computed from the FD. Indeed, the derivatives $\{D^j x_i\}$ can convey discriminant information that is complementary to, or more relevant than, the original functions $\{x_i\}$. In spectrometry, for example, the classification results obtained with $\{Dx_i\}$ are often better than those obtained with $\{x_i\}$. We thus consider that the sets of derivative functions of successive orders $\{Dx_i\}, \{D^2 x_i\}, \ldots$ present different views of the initial objects $\{x_i\}$. Our goal is to propose a model that combines these distinct views in a complementary manner, with the aim of improving the classification performance.
We use the multiple kernel SVM framework to this end. This amounts to working with a composite kernel function. Without loss of generality, we restrict ourselves to derivatives up to order 2:
$$k(x_i, x_{i'}) = \sum_{s=0}^{2} w_s\, k_s\big(D^s x_i, D^s x_{i'}\big),$$
where:
• each $k_s$ is a kernel acting on the $s$-th derivative functions (with the convention $D^0 x_i = x_i$);
• the weights $w_s$ are non-negative and satisfy $\|w\|_{\ell_2} = \big(\sum_s w_s^2\big)^{1/2} \le 1$.
Since we have a finite sample of $n$ FD, in practice we work with kernel matrices $K^s = (K^s_{ii'})_{i,i'=1,\ldots,n}$, $s = 0, 1, 2$, with $K^s_{ii'} = k_s(D^s x_i, D^s x_{i'})$. It is worth mentioning that the different kernel matrices can have different ranges; thus, in practice, we pre-process each $K^s$ using the cosine normalization $\tilde K^s_{ii'} = K^s_{ii'}/\sqrt{K^s_{ii}\, K^s_{i'i'}}$.
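To make the construction of the per-view kernel matrices concrete, here is a small sketch under the assumption that each smoothed curve is available as a scipy `BSpline` object (as returned by the `smooth_curve` sketch above); the Riemann-sum quadrature, the name `derivative_gram` and the Gaussian variant are illustrative.

```python
import numpy as np

def derivative_gram(curves, t_grid, order, gamma=None):
    """Gram matrix for one 'view': the order-th derivatives of the smoothed curves.
    `curves` is a list of scipy BSpline objects; L2 inner products are approximated
    by a Riemann sum on an equally spaced grid t_grid."""
    dt = t_grid[1] - t_grid[0]
    D = np.vstack([(c.derivative(order) if order > 0 else c)(t_grid) for c in curves])
    if gamma is None:
        return (D @ D.T) * dt                          # linear kernel <D^s x_i, D^s x_j>_{L2}
    sq_norms = np.sum(D ** 2, axis=1) * dt
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2 * dt * (D @ D.T)
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))   # Gaussian kernel on L2 distances

def cosine_normalize(K):
    """Cosine normalization: K_ij / sqrt(K_ii * K_jj)."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)
```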
The model we study is the joint optimization of the SVM dual objective over $\alpha$ and over the kernel weights $w$:
$$\min_{w \ge 0,\ \|w\|_{\ell_2} \le 1}\ \ \max_{0 \le \alpha \le C,\ \sum_i \alpha_i y_i = 0}\ \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,i'=1}^{n} \alpha_i \alpha_{i'} y_i y_{i'} \sum_{s=0}^{2} w_s K^s_{ii'}.$$
The estimation procedure is based on an alternating optimization scheme:
• Fix $w$ and solve with respect to $\alpha$ with a regular SVM solver.
• Fix α and solve with respect to w which is a convex problem with a closed-form solution that is stated below.
Proposition 1. Let $\alpha$ be fixed. Then the following optimization problem:
$$\min_{w \ge 0,\ \|w\|_{\ell_2} \le 1}\ -\frac{1}{2}\sum_{s=0}^{2} w_s\, \alpha^\top \mathrm{diag}(y)\, K^s\, \mathrm{diag}(y)\, \alpha$$
is convex and the optimal solution is given by, $\forall s = 0, 1, 2$:
$$w_s = \frac{\alpha^\top \mathrm{diag}(y)\, K^s\, \mathrm{diag}(y)\, \alpha}{\Big(\sum_{m=0}^{2}\big(\alpha^\top \mathrm{diag}(y)\, K^m\, \mathrm{diag}(y)\, \alpha\big)^2\Big)^{1/2}}. \tag{13}$$
A proof of a more general case of Proposition 1 can be found in [Kloft et al., 2010].
We wrap up the whole procedure in Algorithm 1.

Algorithm 1: Multiple kernel SVM for FD in $H^2$.
  Compute the kernel matrices $K^s = (K^s_{ii'})_{i,i'=1,\ldots,n}$, $\forall s = 0, 1, 2$;
  Normalize the kernel matrices $K^s$, $\forall s = 0, 1, 2$;
  Initialize a uniform weight vector $w$;
  while stopping condition not reached do
    Fix $w$ and apply the SVM algorithm with composite kernel $K = \sum_{s=0}^{2} w_s K^s$ to determine a new $\alpha$;
    Fix $\alpha$ and apply (13) to determine a new $w$;
  end
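The alternating scheme of Algorithm 1 can be sketched in a few lines of Python on top of scikit-learn's `SVC` with a precomputed kernel. The weight update implemented below is the ℓ2-normalized closed form discussed above ($w_s$ proportional to $\alpha^\top \mathrm{diag}(y) K^s \mathrm{diag}(y)\alpha$); the function name, the stopping rule and the iteration budget are illustrative choices, not the paper's implementation.

```python
import numpy as np
from sklearn.svm import SVC

def multiple_kernel_svm(kernels, y, C=1.0, n_iter=20, tol=1e-5):
    """Alternating optimization sketch for the l2-norm multiple kernel SVM.
    `kernels` is a list of pre-normalized kernel matrices (one per derivative order),
    y contains the two class labels."""
    w = np.ones(len(kernels)) / np.sqrt(len(kernels))    # uniform weights, ||w||_2 = 1
    for _ in range(n_iter):
        K = sum(ws * Ks for ws, Ks in zip(w, kernels))   # composite kernel
        svc = SVC(C=C, kernel="precomputed").fit(K, y)
        alpha_y = np.zeros(len(y))                       # alpha_i * y_i on the support vectors
        alpha_y[svc.support_] = svc.dual_coef_.ravel()
        # closed-form update: w_s proportional to alpha^T diag(y) K^s diag(y) alpha
        q = np.array([alpha_y @ Ks @ alpha_y for Ks in kernels])
        w_new = q / np.linalg.norm(q)
        if np.linalg.norm(w_new - w) < tol:
            return w_new, svc
        w = w_new
    return w, svc
```

Prediction on new curves then requires the cross kernel matrices between test and training derivatives, combined with the learned weights $w$, before calling `svc.predict`.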

Illustration with a real-world dataset
In order to illustrate its interest, we apply our method to the Poblenou dataset. It consists of (discretized) curves measuring, every hour of the day, the level of nitrogen oxides, a source of air pollution. These data were recorded by a control station in the Poblenou district of Barcelona, Spain. The classification task is to predict whether a given (discretized) curve corresponds to a working day or not.
In Figure 1, we represent the smoothed functions and their respective 1st and 2nd derivative functions. The two distinct colors mark the two different classes.
We tested different kernel functions: linear and Gaussian (for the latter, the bandwidth hyper-parameter was set according to the median of the distribution of squared Hilbert distances between the functions and their 7th nearest neighbor, an approach inspired by [Zelnik-Manor and Perona, 2005]). We did not mix kernel types across the different derivatives but applied them in a homogeneous fashion.
Furthermore, in order to compare the effect of mixing the views and the different fusion strategies, we experimented with the following single or composite kernel matrices:
• Single views: $K^0$, $K^1$ and $K^2$ taken separately.
• Multiple views with uniform weights: $K^0 + K^1$, denoted 01-a, and $K^0 + K^1 + K^2$, denoted 02-a.
• Multiple views with cosine normalization and optimized weights (Algorithm 1).
In Figure 2, we report, for each setting above, the accuracy rate estimated using a 10-fold cross-validation.
When using the linear kernel and single views (blue bars in the graphic on the left-hand side of Figure 2), it is the 1st derivative function that performs best. We observe that using multiple views can outperform this baseline. Overall, the best result is obtained when mixing the three distinct views with the optimization procedure proposed in Algorithm 1.
The graphic on the right-hand side of Figure 2 demonstrates the impact of using kernel methods since, globally, the Gaussian kernel provides higher accuracy rates than the linear kernel. In this case as well, the best score is reached by mixing all three views and applying the optimization procedure introduced in Algorithm 1.

An input/output-based aggregation rule for functional data classification
Pamela Llop. A joint work with Aurélie Fischer & Mathilde Mougeot.

Introduction
Due to their innumerable applications in different areas of interest, both theoretical and applied, supervised classification techniques are among the most studied and used tools in statistics. As is well known, to classify a given object X is to assign it to one class (or label) to which it belongs. The list of classification rules existing in the literature is vast and, in particular, techniques that aggregate different rules, such as bagging, boosting, and stacking, are nowadays in vogue.
The main goal of aggregation is to gather and combine different techniques in an efficient manner so that the resulting tool is enhanced. In classification or regression, the new rule is obtained by aggregating a collection of basic estimators previously calibrated on a training data set, thereby taking advantage of the abilities of each initial rule.
In this direction, in the finite dimensional setup, the original consensus idea in classification was introduced by [Mojirsheibani, 1999]. For the regression problem, [Biau et al., 2016] built a new nonlinear aggregation strategy called COBRA, which is flexible since it discards a small percentage of those preliminary experts that behave differently from the rest. Using the same ideas, the authors of [Cholaquidis et al., 2016] worked on a nonlinear aggregation rule for functional data. Very recently, in [Fischer and Mougeot, 2019] the authors proposed an alternative procedure for the finite dimensional setting, which adds information about the inputs to the classifier, in order to improve the results by discarding data that are not similar to the point to be classified.
In this work we combine and extend the ideas of [Cholaquidis et al., 2016] and [Fischer and Mougeot, 2019] by creating an aggregation rule for functional data which takes into account both input and output information. The classifier is built from nonparametric regression by plugging the estimate of the regression function into the Bayes rule. As previously, the classifier is built from a collection of M arbitrary training classifiers, taking advantage of the abilities of each expert. Although the method allows combining classifiers of very different natures, it also performs very well when aggregating experts of the same nature, for instance when combining different nearest neighbors classifiers. We prove the consistency of the aggregation rule and we show its good performance via simulation studies as well as real data examples.

Setting and notation
Let $(F, d)$ denote a separable and complete metric space. We assume that we have at hand a training sample $D_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ of independent random elements identically distributed as the pair $(X, Y) \in F \times \bar{G}$, $\bar{G} := \{1, \ldots, G\}$, which follows the model
$$Y = \eta(X) + e,$$
where $\eta(x) = \mathbb{E}(Y \mid X = x)$ is the regression function, the distribution of $X$ is a Borel probability measure denoted by $\mu$, and the error $e$ satisfies $\mathbb{E}(e \mid X) = 0$ and $\mathbb{E}(e^2 \mid X) = \sigma^2 < \infty$.
We split the sample $D_n$ into two subsamples $D_k$ and $D_\ell$, of respective sizes $k$ and $\ell$ with $k + \ell = n$. For simplicity of notation, the elements of $D_\ell$ are renamed $(X_1, Y_1), \ldots, (X_\ell, Y_\ell)$. This abuse of notation should not cause any trouble since the context will be clear throughout. With $D_k$, we build up $M$ classifiers $r_{mk} : F \to \bar{G}$, $m = 1, \ldots, M$, and with $D_\ell$ we construct our aggregated classifier. For ease of exposition, we assume in the sequel that the initial rules built on $D_k$ are given and fixed. Hence, in our classification problem, the asymptotics are in $\ell$. We place the initial rules in the vector $r_k(X) := \big(r_{1k}(X), \ldots, r_{Mk}(X)\big)$. In the sequel, $0/0$ will be taken to be $0$.
Following [Fischer and Mougeot, 2019], for $X \in F$, we define the combined regression estimator as
$$\eta_\ell(X) = \sum_{i=1}^{\ell} \omega_i^{\alpha,\beta}(X)\, Y_i, \tag{15}$$
with the weights $\omega_i^{\alpha,\beta}(X)$ given by
$$\omega_i^{\alpha,\beta}(X) = \frac{K\!\Big(\frac{d_F(X, X_i)}{\alpha},\, \frac{d\big(r_k(X),\, r_k(X_i)\big)}{\beta}\Big)}{\sum_{j=1}^{\ell} K\!\Big(\frac{d_F(X, X_j)}{\alpha},\, \frac{d\big(r_k(X),\, r_k(X_j)\big)}{\beta}\Big)}. \tag{16}$$
Here, $d_F$ stands for a distance in the metric space $F$ and $d$ for a distance in $\mathbb{R}^M$ (in general, the Euclidean distance). Moreover, the function $K$ is a 2-dimensional regular kernel, that is, a nonnegative integrable kernel bounded below by a positive constant on a neighborhood of the origin.
Remark 2.1. Note that the performance of the estimator depends on the distances used when computing the weights, as well as on the kernel. In this paper, in all the numerical experiments, we will use a Gaussian kernel and the $L_2$ distance.
As is well known, the Bayes classifier maximizes the posterior probabilities $p_g(x) = \mathbb{P}(Y = g \mid X = x)$, $g \in \bar{G}$. That is, the Bayes classifier $T^*$ is defined as
$$T^*(x) = \arg\max_{g \in \bar{G}}\ \mathbb{P}(Y = g \mid X = x),$$
and the Bayes error as $L^* = \mathbb{P}\big(T^*(X) \ne Y\big)$. In this context, the posterior probabilities are estimated using (15) applied to the indicator responses $\mathbf{1}_{\{Y_i = g\}}$, and, finally, the empirical classifier $T_n : F \to \bar{G}$ is defined as
$$T_n(x) = \arg\max_{g \in \bar{G}}\ \hat{p}_g(x),$$
with error given by $L_n = \mathbb{P}\big(T_n(X) \ne Y \mid D_n\big)$. We will compare our method with the consensus-based strategy described in [Cholaquidis et al., 2016] (see also [Mojirsheibani, 1999]), which is based on the estimator
$$\eta_\ell^{\gamma}(X) = \sum_{i=1}^{\ell} \omega_i^{\gamma}(X)\, Y_i,$$
with the weights $\omega_i^{\gamma}(X)$ given by
$$\omega_i^{\gamma}(X) = \frac{\mathbf{1}\Big\{\frac{1}{M}\sum_{m=1}^{M} \mathbf{1}\{r_{mk}(X_i) \ne r_{mk}(X)\} \le \gamma\Big\}}{\sum_{j=1}^{\ell} \mathbf{1}\Big\{\frac{1}{M}\sum_{m=1}^{M} \mathbf{1}\{r_{mk}(X_j) \ne r_{mk}(X)\} \le \gamma\Big\}}.$$
Remark 2.2. Observe that in this estimator the flexibility is given by the parameter $\gamma$; in the new definition (15), that flexibility is given by the kernel bandwidths.
Remark 2.3. Observe that for $\gamma = 0$ all the $M$ initial rules have to match and, in this case, the weights are given by
$$\omega_i^{0}(X) = \frac{\mathbf{1}\{r_k(X_i) = r_k(X)\}}{\sum_{j=1}^{\ell} \mathbf{1}\{r_k(X_j) = r_k(X)\}}.$$
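To fix ideas, the following sketch implements the input/output weights and the resulting plug-in classifier. The Gaussian product kernel, the function names and the way the initial classifiers are passed in (as callables returning a predicted label) are illustrative assumptions; the method itself only requires a 2-dimensional regular kernel and arbitrary distances $d_F$ and $d$.

```python
import numpy as np

def combined_weights(x_new, X_cal, preds_new, preds_cal, alpha, beta, d_input, d_output):
    """Input/output-based weights (sketch of the omega^{alpha,beta} weights):
    a Gaussian product kernel on the input distance and on the distance
    between the vectors of initial-classifier predictions."""
    w = np.array([
        np.exp(-0.5 * (d_input(x_new, xi) / alpha) ** 2)
        * np.exp(-0.5 * (d_output(preds_new, pi) / beta) ** 2)
        for xi, pi in zip(X_cal, preds_cal)
    ])
    s = w.sum()
    return w / s if s > 0 else w          # convention 0/0 = 0

def aggregated_classifier(x_new, X_cal, y_cal, classifiers, alpha, beta,
                          d_input, d_output, labels):
    """Plug-in classifier: estimate each posterior probability with the combined
    regression estimator and take the argmax (empirical Bayes rule)."""
    preds_new = np.array([c(x_new) for c in classifiers])
    preds_cal = np.array([[c(xi) for c in classifiers] for xi in X_cal])
    w = combined_weights(x_new, X_cal, preds_new, preds_cal,
                         alpha, beta, d_input, d_output)
    scores = [np.sum(w * (np.asarray(y_cal) == g)) for g in labels]
    return labels[int(np.argmax(scores))]
```

For instance, `d_input` can be the L2 distance between discretized curves and `d_output` the Euclidean distance between prediction vectors, matching the choices used in the numerical experiments.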

A Stone-like theorem
We first state a result in the spirit of Stone's conditions, which will be applied in our context.
Theorem 2.4. Suppose that assumptions H1–H3 hold and that $(W_{ni}(X))_{i=1,\ldots,n}$ is a sequence of weights satisfying the following conditions:
(i) there is a sequence of nonnegative random variables $a_n(X)$ such that $a_n(X) \to 0$ a.s., satisfying …;
(iii) for all $\varepsilon > 0$ there exists $0 < \delta < \varepsilon$ such that, for any bounded and continuous function $\eta^*$ fulfilling …
Then the weighted estimator $\sum_{i=1}^{n} W_{ni}(X)\, Y_i$ is $L^1$-consistent.
More specifically, we will use the next corollary.
Corollary 2.5. Suppose that assumptions H1–H3 hold and that $\eta \in L^2(F, \mu)$. Let $(U_{ni})_i$ be a sequence of probability weights satisfying conditions (i), (ii) and (iii) of Theorem 2.4. If $(W_{ni}(X))_i$ is a sequence of weights such that $\sum_{i=1}^{n} W_{ni}(X) = 1$ and, for each $n \ge 1$, $|W_{ni}| \le M U_{ni}$ for some constant $M \ge 1$, then the estimator $\sum_{i=1}^{n} W_{ni}(X)\, Y_i$ is $L^1$-consistent.

Main result
We then use Theorem 2.4 to prove the consistency of the kernel regression estimate (15) with the weights given in (16). To do so, we need to introduce the Besicovitch condition
$$\lim_{\delta \to 0} \frac{1}{\mu\big(B(X, \delta)\big)} \int_{B(X, \delta)} \big|\eta(z) - \eta(X)\big|\, \mu(dz) = 0 \quad \text{a.s.} \tag{19}$$
Theorem 2.6. Suppose that assumptions H1–H3 hold with $F$ a separable metric space, and let $\eta$ be a function satisfying the Besicovitch condition (19). If $K$ is a regular kernel, then there is a sequence $\alpha_n(X) \to 0$ a.s. for which the kernel estimator is $L^1$-consistent.

Numerical results
In this section we show the performance of the new technique via simulation studies and real data applications in functional data settings. In all cases, we split the whole sample into two parts: the training and testing samples. Following Section 2.2, with the training sample $D_n$ we construct the combined estimator by splitting it into the subsamples $D_k$, used to compute the $M$ initial estimators, and $D_\ell$, used to compute the optimal parameters ($\alpha$ and $\beta$ for the new rule, $\gamma$ for the consensus rule) needed to build up the combined classifier. This is done via leave-one-out cross-validation, by minimizing the misclassification loss (see [Ferraty and Vieu, 2006a]). Finally, with the testing sample (of size $p$) we measure the discriminant power of our method: we evaluate the combined classifier on the testing sample in order to get the estimated classes and compare them with the true ones.
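A possible implementation of this selection step, reusing the `aggregated_classifier` sketch above, is a plain grid search with a leave-one-out loop over $D_\ell$; the grids `alphas` and `betas` and the function name are illustrative.

```python
import numpy as np
from itertools import product

def select_bandwidths(X_cal, y_cal, classifiers, alphas, betas,
                      d_input, d_output, labels):
    """Leave-one-out selection of (alpha, beta) on D_l by minimizing the
    misclassification loss (illustrative grid search)."""
    best, best_err = None, np.inf
    for a, b in product(alphas, betas):
        errors = []
        for i in range(len(X_cal)):
            keep = [j for j in range(len(X_cal)) if j != i]
            pred = aggregated_classifier(X_cal[i],
                                         [X_cal[j] for j in keep],
                                         [y_cal[j] for j in keep],
                                         classifiers, a, b,
                                         d_input, d_output, labels)
            errors.append(pred != y_cal[i])
        err = np.mean(errors)
        if err < best_err:
            best, best_err = (a, b), err
    return best
```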

Simulation studies
To show the performance of the new technique with functional data, we follow the two simulated examples presented in [Cholaquidis et al., 2016] (see also [Delaigle and Hall, 2012]). We assume that the data have the same distribution as the pair $(X, Y) \in F \times \{0, 1\}$. Here $M = 5$ and the initial rules are nearest neighbors estimators with numbers of neighbors in $\{1, 3, 5, 7, 9\}$; in each case, we generate samples of sizes $k = 60$, $\ell = 30$ and $p = 100$.
For both models, we report in Table 1 the mean and standard deviation of the misclassification error rate over 100 replications. In conclusion of these simulated experiments, we can observe the very good performance of the proposed method.

Real data applications
To show how our method behaves in real situations for functional data, we apply our strategy to the following datasets:
• Cancer data: These data contain the mass spectra from blood samples (spectrogram curves) of 216 women, of which 121 suffer from an ovarian cancer condition and the remaining 95 are healthy, so $G = 2$. The idea behind spectrograms is that, when cancer starts to grow, its cells produce a different kind of proteins than those produced by healthy cells; hence, it is necessary to monitor their amount. Following [Cholaquidis et al., 2016], for both groups we take the spectrogram curves corresponding to a mass-to-charge ratio between 7000 and 9500, and we take $k = 60$, $\ell = 30$ and $p = 100$. Then, in order to have all the spectra defined on a common equi-spaced grid, we apply a Nadaraya-Watson kernel smoother with Gaussian kernel and automatic smoothing parameter selection.
• Phoneme data: This dataset, which is available at http://www-stat.stanford.edu/ElemStatLearn and was used in [Ferraty and Vieu, 2006a], consists of log-periodograms corresponding to $G = 5$ different phonemes, measured at 150 instants of time; in total, the data consist of 2000 log-periodograms of length 150 with known phoneme class memberships. In this case we take, for each subsample, 50 observations from each of the 5 groups, which means that $k = \ell = p = 250$.
For these functional datasets, we report in Table 2 the mean and standard deviation of the misclassification error rate over 100 replications.

On kernel regression estimation for functional data with values in a finite dimensional manifold

Introduction
In many data analyses, some observations can be summarized as curves and are then realizations of a random function. This specificity has led to a challenging research area called Functional Data Analysis (FDA), with a large and dynamic literature since the 2000s; the most cited references in this setting are the monographs of [Ramsay and Silverman, 2005], [Bosq, 2000], [Ferraty and Vieu, 2006b] and references therein. In this work, we are interested in kernel regression estimation for FDA when the covariate takes values in a submanifold.
There already exist some theoretical works for data with a manifold structure in a finite dimensional space; we refer, for example, to [Aswani et al., 2011], [Cornea et al., 2017], [Pelletier, 2006], [Loubes and Pelletier, 2008] and references therein. However, only a few were dedicated to regression for functional manifold data. The main idea is to exploit the fact that, if the functional data live in a submanifold embedded in some ambient Hilbert space $H$, for example $L_2(D)$ (for some domain $D \subset \mathbb{R}$), the estimation will be more efficient if one focuses on this submanifold rather than using any classical approach for FDA in $H$. This is the motivation of our work. In this setting, [Chen and Müller, 2012] consider the representation of functional data on a manifold with only one chart, which is restrictive and essentially reduces to a linear manifold, while [Zhou and Pan, 2014] deal with functional principal component analysis on an irregular planar domain, which can also be viewed as a linear manifold. This limitation is overcome by the work of [Lin and Yao, 2021], who consider that the functional data live in an unknown manifold, estimated by a tangent space estimation approach. In this work we are interested in the problem of kernel regression estimation where the response is real and the input is a functional variable which lives in a finite dimensional submanifold of $H$. We present an asymptotic result in this setting as well as a real data application.

Theoretical framework
Let $Z_i = (X_i, Y_i)$, $i = 1, \ldots, n$, be a sequence of random variables defined on a probability space $(\Omega, \mathcal{A}, \mathbb{P})$ such that the $Z_i$'s are independent and identically distributed (i.i.d.) as a variable $Z = (X, Y)$. $Y$ takes values in $\{0, 1\}$ and $X$ takes values in a Riemannian submanifold $(\mathcal{M}, g)$ of a $d$-dimensional subspace $H_d$ of $H$, a Hilbert space endowed with some inner product (semi-)metric. We assume that the link between $Y$ and $X$ is such that, for all $x \in \mathcal{M}$,
$$Y = r(X) + \varepsilon, \tag{20}$$
where $\varepsilon$ is the random error and $r(\cdot)$ is the regression function defined by
$$r(x) = \mathbb{E}(Y \mid X = x).$$
We assume that $(\mathcal{M}, g)$ is endowed with a measure $\mu_g$ and is geodesically complete. Then, by the Hopf-Rinow theorem, $(\mathcal{M}, d_g)$ is a complete metric space, where $d_g$ is the metric induced by $g$. We assume that $X$ belongs to $\mathcal{M}$, a submanifold of dimension $d - 1$, and that $\mathrm{Supp}\{X\} \subseteq \mathcal{M}$ is compact without boundary.

Geometry background, notations and assumptions
For any $x \in \mathcal{M}$, we denote by $T_x\mathcal{M}$ the tangent space to $\mathcal{M}$ at $x$. In the following, we denote by $0_x$ and $\lambda_x$ (or $\lambda$, to ease the reading) respectively the null vector and the Lebesgue measure on $T_x\mathcal{M}$. The inner product on $T_x\mathcal{M}$ is defined by $\langle u, v\rangle = g(u, v)$ for any $u, v \in T_x\mathcal{M}$, and the associated norm by $\|\cdot\|$, while $B(x, h)$ and $B(0_x, h)$ denote the balls of radius $h$ centered respectively at $x$ and $0_x$.
We denote the injectivity radius of $\mathcal{M}$ by $\mathrm{inj}(\mathcal{M})$ and recall that $\mathrm{inj}(S^{d-1}) = \pi$ ($S^{d-1}$ being the unit sphere of dimension $d - 1$). We assume that $\mathrm{inj}(\mathcal{M}) > 0$.
We recall that the completeness of $\mathcal{M}$ allows us to define the exponential map at any $x \in \mathcal{M}$, $\exp_x : T_x\mathcal{M} \to \mathcal{M}$, which is such that for any $v \in T_x\mathcal{M}$, $\exp_x(v) = \gamma_v(1)$, where the function $\gamma_v$ defined by $\gamma_v(t) = \exp_x(tv)$, $t \in \mathbb{R}$, is the unique geodesic with $\gamma_v(0) = x$ and $\dot\gamma_v(0) = v$ ($\dot\gamma_v$ denoting the derivative of $\gamma_v$). We deal with regular balls in $\mathcal{M}$ (recall that a regular, or convex, ball $B$ is a ball in $\mathcal{M}$ such that for any $p, q \in B$ the geodesic segment $\{\exp_x(tv) : t_p \le t \le t_q,\ \|v\| = 1\}$ from $p = \exp_x(t_p v)$ to $q = \exp_x(t_q v)$ lies in $B$). In other words, we will deal with balls $B(x, h)$ such that $h < h^*$, where $h^* = \min\{\mathrm{inj}(\mathcal{M}),\ \frac{\pi}{2\sqrt{\kappa}}\}$ and $\kappa$ is the supremum of the sectional curvatures (see below) of $\mathcal{M}$ if this upper bound is positive, and $\kappa = 0$ otherwise. Then $B(x, h) = \exp_x\big(B(0_x, h)\big)$. We recall that $\kappa = 1$ for $\mathcal{M} = S^{d-1}$. For all $u \in T_x\mathcal{M}$, let $g_x(u) = (g_{ij}(u))$, with $g_{ij}(u) = g(\partial x_i|_u, \partial x_j|_u)$, be the Gram matrix of the local tangent vectors $(\partial x_i|_u,\ i = 1, \ldots, d)$. The volume of the parallelepiped generated by $(\partial x_i|_u,\ i = 1, \ldots, d)$, namely $|g_x(u)|^{1/2}$, is the density of $\exp_x^* \mu_g$ with respect to $\lambda_x$ on $T_x(\mathcal{M})$ (see for example [Chavel, 1993], p. 18, or [Gallot et al., 2004], p. 165).

The kernel regression estimate
We aim to estimate the regression function $r(x) = \mathbb{E}(Y \mid X = x)$ of $Y$ given $X$. To do so, we propose the kernel estimate of $r$ based on the observations of the process $(Z_i,\ i = 1, \ldots, n)$:
$$r_n(x) = \frac{\sum_{i=1}^{n} Y_i\, K\!\Big(\frac{d_g(x, X_i)}{h_n}\Big)}{\sum_{i=1}^{n} K\!\Big(\frac{d_g(x, X_i)}{h_n}\Big)},$$
where $h_n \to 0$ and $K$ is a kernel on $\mathbb{R}$.

Assumptions
H1: We assume that $Y$ is bounded. (As usual, this assumption can be replaced by a moment condition.)
H2: $h_n < h^* = \min\{\mathrm{inj}(\mathcal{M}), \frac{\pi}{2\sqrt{\kappa}}\}$, where $\kappa$ is the least upper bound of the sectional curvatures of $\mathcal{M}$ if this upper bound is positive, and $\kappa = 0$ otherwise (see for example [Gallot et al., 2004] or [Kobayashi and Nomizu, 1996]).
H3: $X$ admits a density $f$ with respect to $\mu_g$, the canonical measure on $\mathcal{M}$; then, for any $x \in \mathcal{M}$, the density of $\log_x(X)$ with respect to the Lebesgue measure on $T_x(\mathcal{M})$ is given by
$$f_{T_x}(s) = f\big(\exp_x(s)\big)\, |g_x(s)|^{1/2} \qquad \text{for all } s \in T_x(\mathcal{M}).$$
H4: $r$ and $f$ satisfy a Lipschitz condition.
HK1: We assume that the kernel $K : \mathbb{R} \to \mathbb{R}_+$ has integral 1 and is such that $\int K^2 < \infty$ and there exist two constants $0 < C_1 < C_2 < \infty$ with $C_1 \mathbf{1}_{[0,1]} \le K \le C_2 \mathbf{1}_{[0,1]}$.

Theorem 3.1. Under the previous assumptions, the estimate $r_n(x)$ converges to $r(x)$.

Applications

In practice, the first main problem concerns the knowledge of $\mathcal{M}$. Here, we assume that we deal with the case of interest where $X$ belongs to a family of shapes (such as spheres) of a $d$-dimensional subspace. More precisely, we consider as ambient space the Hilbert space of functions whose derivative functions $u^{(k)}$ are square integrable, where $u^{(k)}$ denotes the $k$-th derivative function of $u$. The ambient space is endowed with some semi-inner product $\langle \cdot, \cdot\rangle$ and with the projection on a subspace (a classical approach in functional data) $H_d = \mathrm{span}\{e_1, \ldots, e_d\}$, where $(e_i,\ i = 1, \ldots)$ is an orthonormal basis. In fact, if $u = \sum_{i=1}^{d} c_i e_i$ and $v = \sum_{i=1}^{d} b_i e_i$, then $\langle u, v\rangle = \sum_{i=1}^{d} c_i b_i$. More precisely, we will deal with three semi-inner products, based on the derivative orders $k = 0, 1, 2$, obtained by applying this construction to $u^{(k)}$ and $v^{(k)}$.

In our real data application setting, the assumption is that $Y$ only depends on $\tilde X(d) = X(d)/\|X(d)\|$ (the normalized projection of $X$ onto $H_d$), the other information in $X$ consisting of nuisance parameters. Then, we rewrite model (20) as $Y = r(\tilde X(d)) + \varepsilon$, and $\tilde X(d)$ plays the role of $X$ as stated in the theoretical part. The corresponding kernel estimator $r_n$ is then computed with $K\big(d_g(x, y)/h_n\big)$, where $d_g(x, y) = \cos^{-1}\Big(\frac{\langle x, y\rangle}{\|x\|\, \|y\|}\Big)$ and $h_n < \frac{\pi}{2}$. For a fixed $d$ and each function $u$, the representations $u^{(k,d)}$, $k = 0, 1, 2$, are respectively referred to as the Initial, Velocity and Acceleration L2 representations (corresponding to the classical estimator), while their spherical versions in the unit sphere of $H_d$, $u^{(k,d)}/\|u^{(k,d)}\|$ for $k = 0, 1, 2$, are referred to as the Geodesic representations.
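A minimal sketch of the corresponding Nadaraya-Watson estimator is given below, under the assumption that each curve is represented by its vector of coefficients on $(e_1, \ldots, e_d)$, normalized to the unit sphere of $H_d$; the Gaussian kernel used here is illustrative (any kernel satisfying HK1 could be plugged in).

```python
import numpy as np

def geodesic_distance(u, v):
    """Arc-cosine distance on the sphere: d_g(u, v) = arccos(<u, v> / (||u|| ||v||))."""
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(c, -1.0, 1.0))

def kernel_regression(x_new, X_train, y_train, h, dist=geodesic_distance,
                      kernel=lambda t: np.exp(-0.5 * t ** 2)):
    """Nadaraya-Watson estimate r_n(x): kernel-weighted average of the responses,
    with distances computed by `dist` (geodesic or L2) and bandwidth h."""
    d = np.array([dist(x_new, xi) for xi in X_train])
    w = kernel(d / h)
    s = w.sum()
    return float(np.dot(w, np.asarray(y_train, dtype=float)) / s) if s > 0 else float("nan")
```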

A real data application
We have applied our approach to the well-known Tecator dataset, where the aim is to predict the water, fat or protein content of a piece of meat based on its near infrared absorbance spectrum. The dataset concerns 215 meat samples and is described on the website http://lib.stat.cmu.edu/datasets/tecator.

Results
We have run our kernel estimator both on the $\tilde X_i(d)$'s and on the $X_i$'s. In order to evaluate the performance of both estimators, we repeated each experiment 110 times, using a sampling strategy with 60% (= 129 observations) for the training set and 30% for the test set (of course, the same training and test samples were used for both estimators at each step).
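The evaluation protocol can be sketched as follows, reusing `kernel_regression` from the previous section; the random splitting, the fixed bandwidth `h` and the helper names are illustrative (in practice the bandwidth would be selected on the training sample).

```python
import numpy as np

rng = np.random.default_rng(0)

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def repeated_split_rmse(X_geo, X_l2, y, h, n_rep=110, train_frac=0.6, test_frac=0.3):
    """For each repetition, draw a 60% training / 30% test split (the same split for
    both representations) and record the test RMSE of the geodesic and L2 estimators."""
    X_geo, X_l2, y = np.asarray(X_geo), np.asarray(X_l2), np.asarray(y, dtype=float)
    results = []
    for _ in range(n_rep):
        perm = rng.permutation(len(y))
        n_tr, n_te = int(train_frac * len(y)), int(test_frac * len(y))
        tr, te = perm[:n_tr], perm[n_tr:n_tr + n_te]
        geo_pred = [kernel_regression(X_geo[i], X_geo[tr], y[tr], h) for i in te]
        l2_pred = [kernel_regression(X_l2[i], X_l2[tr], y[tr], h,
                                     dist=lambda u, v: np.linalg.norm(u - v)) for i in te]
        results.append((rmse(y[te], geo_pred), rmse(y[te], l2_pred)))
    return np.array(results)   # one (geodesic RMSE, L2 RMSE) pair per repetition
```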
The results on the performance of the estimators, in terms of Root Mean Square Error (RMSE), are given in Figure 5, which represents the distributions of the 110 RMSE values corresponding to the 110 test samples. It shows that the estimator based on the geodesic representation, the $\tilde X_i(d)$'s, most often outperforms the classical estimator.

Figure 1. From top to bottom: $\{x_i\}$, $\{Dx_i\}$ and $\{D^2 x_i\}$.

Figure 2. From left to right: accuracy values for the linear kernel and the Gaussian kernel, with different kernel matrices (single and composite).


Figure 3. The characteristics $Y_i$'s versus the $\tilde X_i(d)$'s.

Table 1. Mean and standard deviation of the misclassification error rate over 100 replications for functional Models I and II. The best results are indicated in bold.

Table 2. Mean and standard deviation of the misclassification error rate over 100 replications for the functional datasets. The best results are indicated in bold.