Statistical data analysis in the Wasserstein space

This paper is concerned with statistical inference problems for data sets whose elements may be modeled as random probability measures, such as multiple histograms or point clouds. We review recent contributions in statistics on the use of Wasserstein distances and tools from optimal transport to analyse such data. In particular, we highlight the benefits of using the notions of barycenter and geodesic PCA in the Wasserstein space for the purpose of learning the principal modes of geometric variation in a dataset. In this setting, we discuss existing works and present some research perspectives related to the emerging field of statistical optimal transport.

Figure 1: A subset of 4 histograms representing the distribution of children born with that name per year in France (left), and the whole dataset of n = 1060 histograms (right), displayed as pdfs over the interval [1900, 2013].

Optimal transport makes it possible to take into account both location and intensity information when comparing elements in a dataset that can be modeled as probability distributions. This approach leads to an intuitive geometric interpretation of the variability in such datasets.
Examples of such data are numerous. As an illustration, one may consider histograms that represent, for a list of first names, the distribution of children born with that name per year in France between years 1900 and 2013. In Figure 1, we display the histograms of four different names, as well as the whole dataset. Another example arises from the statistical analysis of data acquired from flow cytometry, which is a high-throughput technique in biotechnology that can measure a large number of surface and intracellular markers of single cells in a biological sample. With this technique, one can assess individual characteristics (in the form of multivariate data) at a cellular level to determine the type of cells, their functionality and the way they differ. The development of this technology now leads to datasets made of multiple measurements (e.g. up to 18) of millions of individual cells from different subjects. This leads to statistical inference problems from multiple point clouds, such as those displayed in Figure 2, that can be modeled as discrete probability measures supported on R^d.
This paper, whose content is biased by the point of view of a statistician, is an introduction to some recent developments of optimal transport for statistical data analysis. It is focused on the notions of barycenter and geodesic principal component analysis (PCA) in the Wasserstein space and their connection to nonparametric statistics and functional data analysis.

Figure 2: Example of flow cytometry data [42] obtained from a renal transplant retrospective study conducted by the Immune Tolerance Network (ITN) and measured from n = 15 patients (restricted to a bivariate projection). The horizontal axis (resp. vertical axis) represents the values of the forward-scattered light (FSC) (resp. side-scattered light (SSC)) cell marker, after an appropriate scaling through an arcsinh transformation and an initial gating on total lymphocytes to remove artefacts. The number of cells per patient varies from 88 to 2185.

We also report the results of numerical experiments to highlight, in an intuitive manner, the potential benefits of such tools for statistical inference. A recent review on the statistical aspects of Wasserstein distances, with a rich bibliography on the mathematics of optimal transport, can be found in [50]. More generally, computational optimal transport for data science is a rapidly growing research field. For a detailed introduction and examples of applications beyond the prism of statistics, we refer to [25].
In Section 2, we present the notions of Wasserstein space and regularized optimal transport. Section 3 is focused on Wasserstein barycenters and statistical inference problems related to this concept. Finally, geodesic principal component analysis (GPCA) in the Wasserstein space is briefly presented in Section 4. Note that the presentation of the paper is rather informal on the mathematical aspects, and we mainly give intuition on the benefits of statistical optimal transport through various numerical examples. Throughout the paper, we also discuss some research perspectives and open problems related to the statistical aspects of barycenters and GPCA in the Wasserstein space.
2 Wasserstein space and regularized optimal transport

The Wasserstein metric
We denote by P_2(Ω) the space of probability measures with finite second moment and support included in a convex domain Ω ⊂ R^d. This set can be endowed with the Wasserstein metric W_2 defined as

    W_2^2(µ, ν) = min_{π ∈ Γ(µ,ν)} ∫_{Ω×Ω} |x − y|^2 dπ(x, y),   (1)

for µ, ν ∈ P_2(Ω), where Γ(µ, ν) is the set of probability measures (also called transport plans or coupling measures) on the product space Ω × Ω with respective marginals µ and ν, and |·| denotes the usual Euclidean norm on R^d. The metric space (P_2(Ω), W_2) is called the (2-)Wasserstein space.
Remark 2.1. The minimisation problem (1) corresponds to the so-called Kantorovich formulation of optimal transport. Cost functions other than the quadratic one c(x, y) = |x − y|^2 may be used, but we restrict the discussion to this setting to simplify the presentation. For further details on the formulation of optimal transport for probability measures supported on general metric spaces, we refer to [63].
For probability measures supported on the real line (that is, when d = 1), the Wasserstein metric has the closed-form expression

    W_2^2(µ, ν) = ∫_0^1 |F_µ^−(α) − F_ν^−(α)|^2 dα,   (2)

for any µ, ν ∈ P_2(R), where F_µ^− (resp. F_ν^−) denotes the quantile function of the measure µ (resp. ν). Hence, in the one-dimensional setting, using the Wasserstein metric amounts to performing data analysis in the space of quantile functions (supported on [0, 1]) endowed with the usual Hilbert L^2 metric. When µ and ν are absolutely continuous (a.c.) measures with square-integrable probability density functions (pdf) f_µ and f_ν, this remark makes it easy to illustrate the difference between choosing either the Wasserstein metric W_2(µ, ν) or the Hilbert metric ‖f_µ − f_ν‖_2 := (∫_Ω |f_µ(x) − f_ν(x)|^2 dx)^{1/2} for statistical analysis. Indeed, this choice implies two different constructions of the modes of variation in a dataset, as illustrated in Figure 3, where we display the geodesics between two Gaussian distributions in the Hilbert space of densities and in the Wasserstein space of probability measures. For the Wasserstein metric, the measures remain Gaussian along the geodesic. This illustrates the fact that performing statistical data analysis in the Wasserstein space allows to simultaneously take into account variations in location and intensity (or mass) in a dataset of densities (or histograms).
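As a concrete illustration of the quantile formula (2), the squared 1-D Wasserstein distance between two empirical measures with the same number of equally weighted atoms reduces to a mean of squared differences of order statistics. A minimal numpy sketch (the function name and the equal-size assumption are ours):

```python
import numpy as np

def wasserstein2_1d(x, y):
    """Squared 2-Wasserstein distance between two empirical measures on R,
    each with the same number of equally weighted samples, via formula (2):
    the sorted samples are the values of the empirical quantile functions."""
    xs, ys = np.sort(x), np.sort(y)  # order statistics = empirical quantiles
    return float(np.mean((xs - ys) ** 2))
```

For samples of unequal sizes, one would instead evaluate both empirical quantile functions on a common grid of [0, 1] before integrating.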

Regularized optimal transport
A serious limitation for the use of the Wasserstein metric for data analysis is its computational cost between probability measures supported in higher-dimensional spaces (that is, beyond the case d = 1). Indeed, the computational cost to evaluate a Wasserstein metric between two discrete probability distributions with supports of equal size N is generally of order O(N^3 log N). In this discrete setting, computing a Wasserstein distance amounts to solving the linear program (1), whose solution is constrained to belong to the convex set Γ(µ, ν). Hence, the cost of this convex minimization becomes prohibitive for moderate to large values of N. To reduce the complexity of a linear program, a classical approach in optimization [65] is to add a regularizing entropy term. This is the approach followed in [22, 23] by adding an entropy regularization to the transport plan in (1), which yields the strictly convex problem (3) below.
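To make the linear-programming viewpoint concrete, here is a small sketch that solves the discrete Kantorovich problem with a generic LP solver (scipy's linprog is used for illustration; the helper name is ours). The O(N^2) variables and the dense constraint matrix make explicit why this direct approach does not scale:

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein2_lp(r, q, C):
    """Solve the discrete Kantorovich problem as a linear program:
    minimize <C, U> over couplings U >= 0 with marginals r and q.
    Returns the optimal cost (the squared W2 if C holds squared distances)."""
    N = len(r)
    # equality constraints: U @ 1 = r (row sums) and U.T @ 1 = q (column sums)
    A_eq = np.zeros((2 * N, N * N))
    for m in range(N):
        A_eq[m, m * N:(m + 1) * N] = 1.0  # row m of U (row-major raveling)
        A_eq[N + m, m::N] = 1.0           # column m of U
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([r, q]),
                  bounds=(0, None), method="highs")
    return res.fun
```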
To simplify the presentation, we consider discrete probability measures supported on the same finite space Ω_N = {x_1, . . ., x_N} ⊂ R^d of cardinality N. A discrete probability distribution µ (with fixed support included in Ω_N) is identified with a vector of weights in the simplex Σ_N = {r = (r_1, . . ., r_N) ∈ R_+^N with ∑_{k=1}^N r_k = 1} such that µ = ∑_{k=1}^N r_k δ_{x_k}, where δ_x is the Dirac distribution at x. The Wasserstein distance between two discrete distributions µ = ∑_{k=1}^N r_k δ_{x_k} and ν = ∑_{k=1}^N q_k δ_{x_k} (identified with vectors r and q in Σ_N) then becomes

    W_2^2(r, q) := min_{U ∈ U(r,q)} ⟨C, U⟩,

where the set of coupling matrices is defined as U(r, q) := {U ∈ R_+^{N×N} such that U 1_N = r, U^T 1_N = q}, with 1_N the N-dimensional vector with all entries equal to 1, and C the cost matrix given by C_{ml} = |x_m − x_l|^2 for all m, l ∈ {1, . . ., N}. Then, the notion of entropy-regularized optimal transportation [19, 22] leads to the so-called Sinkhorn divergence W_{2,ε}(r, q) between two discrete measures r, q ∈ Σ_N, defined as

    W_{2,ε}^2(r, q) := min_{U ∈ U(r,q)} ⟨C, U⟩ − ε h(U),   (3)

where ε > 0 is a regularization parameter, and the discrete (negative) entropy of a coupling matrix U ∈ U(r, q) is given by h(U) := −∑_{m,l} U_{ml} log U_{ml}. The introduction of the Sinkhorn divergence has been the starting point of the development of computational optimal transport [25] for machine learning, as it makes computationally feasible the use of smoothed Wasserstein metrics for data analysis. Moreover, there now exist various toolboxes and libraries to carry out data analysis using (possibly regularized) optimal transport in Matlab (https://optimaltransport.github.io/), Python (https://pot.readthedocs.io/en/stable/) and the statistical computing environment R (https://cran.r-project.org/package=Barycenter).
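The strictly convex problem (3) can be solved by the classical Sinkhorn-Knopp iterations, which alternately rescale the rows and columns of the Gibbs kernel exp(−C/ε). A minimal numpy sketch (the function name and fixed iteration count are ours; production code would monitor convergence and work in the log domain for small ε):

```python
import numpy as np

def sinkhorn(r, q, C, eps, n_iter=1000):
    """Sinkhorn iterations for the entropy-regularized problem (3).
    Returns the regularized coupling U and the value <C, U> - eps * h(U)."""
    K = np.exp(-C / eps)          # Gibbs kernel
    u = np.ones_like(r)
    for _ in range(n_iter):
        v = q / (K.T @ u)         # enforce column marginals U^T 1_N = q
        u = r / (K @ v)           # enforce row marginals    U 1_N   = r
    U = u[:, None] * K * v[None, :]
    h = -np.sum(U * np.log(U + 1e-300))  # small offset avoids log(0)
    return U, float(np.sum(C * U) - eps * h)
```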

3 Wasserstein barycenters
The most basic statistical inference task from a dataset is certainly to compute first-order statistics, that is, to estimate an average location of the data. In this paper, we focus on the estimation of a population mean measure (or density function) from a dataset whose elements are probability measures. The notion of averaging depends on the metric that is chosen to compare elements in a given dataset. In this paper, we consider the Wasserstein distance W_2 associated to the quadratic cost for the comparison of probability measures. As introduced in [2], an empirical Wasserstein barycenter µ̂_n of a set of n probability measures ν_1, . . ., ν_n (not necessarily random) in P_2(Ω) is defined as a (possibly not unique) minimizer of

    µ ↦ (1/n) ∑_{i=1}^n W_2^2(µ, ν_i), over µ ∈ P_2(Ω).   (4)

The Wasserstein barycenter corresponds to the notion of empirical Fréchet mean [35], which is an extension of the usual Euclidean barycenter to nonlinear metric spaces. For the case d = 1, the quantile formula (2) implies that µ̂_n is the measure with quantile function F_{µ̂_n}^−(α) = (1/n) ∑_{i=1}^n F_{ν_i}^−(α). For n = 2, the Wasserstein barycenter µ̂_2 is the probability measure located at half distance (that is, at "time t = 1/2") along the geodesic between two a.c. measures ν_1 and ν_2. As illustrative examples, we display Wasserstein barycenters µ̂_2 between two probability measures in dimension d = 1 in Figure 3. Computing a Wasserstein barycenter between n ≥ 3 probability measures is generally a delicate optimisation problem in dimension d ≥ 2 that we do not discuss here. We refer to [7, 24, 25] for further details and references. In what follows, we focus on the mathematical and statistical aspects of Wasserstein barycenters of random probability measures and the problem of their estimation from a nonparametric point of view.
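For d = 1, the point at time t on the Wasserstein geodesic between two measures has quantile function (1 − t) F_{ν_1}^− + t F_{ν_2}^−, so the barycenter µ̂_2 (the case t = 1/2) of two equal-size empirical measures can be read off directly from their order statistics. A small sketch (the function name and the equal-size assumption are ours):

```python
import numpy as np

def geodesic_point_1d(x, y, t):
    """Support points of the measure at time t on the W2 geodesic between two
    equal-size 1-D samples: its quantile function is (1-t)*F_x^- + t*F_y^-.
    t = 0.5 gives the Wasserstein barycenter of the two empirical measures."""
    return (1.0 - t) * np.sort(x) + t * np.sort(y)
```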

Statistical models of random probability measures
Let us first introduce the notion of a population Wasserstein barycenter, as well as a two-level sampling model of random probability measures that is appropriate for the problem of its nonparametric estimation. To this end, let P be a distribution on the space of probability measures (P_2(Ω), B(P_2(Ω))), where B(P_2(Ω)) is the Borel σ-algebra generated by the topology induced by W_2. Throughout the paper, it is always assumed that P satisfies the square-integrability condition ∫ W_2^2(µ, ν) dP(ν) < +∞ for some µ ∈ P_2(Ω).
Definition 3.1. A population Wasserstein barycenter µ_P of a distribution P over P_2(Ω) is defined as a (possibly not unique) minimizer of

    µ ↦ ∫ W_2^2(µ, ν) dP(ν), over µ ∈ P_2(Ω).   (5)

For a discrete distribution P_n = (1/n) ∑_{i=1}^n δ_{ν_i} on P_2(Ω), one has that µ_{P_n} corresponds to the empirical Wasserstein barycenter defined as a minimizer of (4). The existence and uniqueness of Wasserstein barycenters have been discussed in various works [2, 3, 15, 43, 47]. As argued in [3], a sufficient condition for the uniqueness of µ_P is that the distribution P gives a strictly positive mass to the subset P_2^{ac}(Ω) of a.c. measures in P_2(Ω). Moreover, if P is supported on the set of measures P_2(Ω) ∩ L^q(Ω) for some q ∈ (1, +∞) (i.e. measures in P_2^{ac}(Ω) with L^q(Ω) densities), it follows that µ_P admits a density in L^q(Ω).
To define an appropriate sampling model, let us now assume that ν_1, . . ., ν_n are independent and identically distributed a.c. random measures sampled from a distribution P satisfying P(P_2^{ac}(Ω) ∩ L^2(Ω)) = 1. Under such assumptions, the population Wasserstein barycenter µ_P is unique and absolutely continuous. As raw data in the form of densities are generally not directly available, we focus on the setting where one has access to a set of random vectors (X_{i,j})_{1≤j≤p_i; 1≤i≤n} in R^d organized in the form of n experimental units, such that, conditionally on ν_i, the observations X_{i,1}, . . ., X_{i,p_i} are iid and sampled from the measure ν_i for each 1 ≤ i ≤ n. The observed data are thus in the form of multiple point clouds X = (X_{i,j})_{1≤i≤n; 1≤j≤p_i}. Such data can be modeled by the n discrete probability measures

    ν_i^{p_i} := (1/p_i) ∑_{j=1}^{p_i} δ_{X_{i,j}}, 1 ≤ i ≤ n,   (6)

from which one may construct another empirical Wasserstein barycenter µ̂_{n,p} defined as a (not necessarily unique) minimizer of

    µ ↦ (1/n) ∑_{i=1}^n W_2^2(µ, ν_i^{p_i}), over µ ∈ P_2(Ω),   (7)

with the convention that p = min_{1≤i≤n} p_i. We shall now discuss the issue of estimating the population Wasserstein barycenter µ_P using either µ_{P_n} (with P_n = (1/n) ∑_{i=1}^n δ_{ν_i}) or µ̂_{n,p} as n → +∞ and p → +∞.
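The two-level sampling model above is easy to simulate; for instance, the random measures ν_i may be taken to be Gaussians with random location and scale (a hypothetical choice of P, for illustration only; all names below are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 50
# first level: draw nu_1, ..., nu_n, here Gaussians with random parameters
means = rng.normal(0.0, 1.0, size=n)     # random locations
sds = rng.uniform(0.5, 1.5, size=n)      # random scales
# second level: conditionally on nu_i, draw p iid observations X_{i,j} ~ nu_i
X = rng.normal(means[:, None], sds[:, None], size=(n, p))
```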

Law of large numbers
By the law of large numbers, we refer to the question of showing that W_2(µ_{P_n}, µ_P) almost surely converges to zero as n → +∞. This problem has been addressed in [15] for a parametric model of random distributions P over P_2(Ω), with Ω assumed to be compact. General results for arbitrary distributions can be found in [47]. Beyond the law of large numbers, it is natural to ask for a (functional) central limit theorem for the empirical Wasserstein barycenter. Nevertheless, this notion is more delicate to formulate as the Wasserstein space is not a Hilbert space. For recent contributions in this direction (but in specific parametric settings), we refer to [3, 45].

Rate of convergence in the one-dimensional case
Let us now restrict to the one-dimensional case of probability measures in P_2(R) and assume that p_1 = . . . = p_n = p. Thanks to the quantile formula (2) for the expression of the Wasserstein metric when d = 1, it follows that the empirical Wasserstein barycenter µ̂_{n,p} minimizing (7) is the unique discrete measure

    µ̂_{n,p} = (1/p) ∑_{j=1}^p δ_{X̄_j},   (8)

where X_{i,(1)} ≤ X_{i,(2)} ≤ . . . ≤ X_{i,(p)} are the order statistics of the i-th sample, and X̄_j = (1/n) ∑_{i=1}^n X_{i,(j)} for all 1 ≤ j ≤ p. Moreover, as shown in [13], the population Wasserstein barycenter µ_P is the unique a.c. measure with quantile function F_0^−(α) = E[F^−(α)], where F^− is the quantile function of the random measure ν with distribution P.
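The explicit barycenter (8) is straightforward to compute: sort each sample and average the order statistics across samples. A minimal sketch (the function name is ours):

```python
import numpy as np

def barycenter_1d(X):
    """Empirical Wasserstein barycenter (8) of n one-dimensional samples of
    equal size p, stored in an (n, p) array: the barycenter is the discrete
    measure whose atoms are the averaged order statistics."""
    order_stats = np.sort(np.asarray(X, dtype=float), axis=1)  # X_{i,(j)}
    return order_stats.mean(axis=0)  # atoms \bar{X}_j, each with mass 1/p
```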
In this setting, the following upper bound on the expected quadratic risk E[W_2^2(µ̂_{n,p}, µ_P)] has been obtained in [13].
Theorem 3.1. Let Y_1, . . ., Y_p be iid variables sampled from µ_P, and let µ_p = (1/p) ∑_{j=1}^p δ_{Y_j} be the associated empirical measure. Then, there exists a constant C > 0 such that

    E[W_2^2(µ̂_{n,p}, µ_P)] ≤ C (1/n + E[W_2^2(µ_p, µ_P)]).

Hence, the rate of convergence of µ̂_{n,p} is decomposed into the sum of two terms. The first one goes to zero at the parametric rate O(1/n), while the second one depends on the rate of convergence of the empirical measure µ_p towards its population counterpart in expected squared Wasserstein distance. For d = 1, the question of deriving upper and lower bounds for E[W_2^2(µ_p, µ_P)] has been thoroughly studied in [16]. In arbitrary dimension d, there is a rich literature in probability on the convergence rate of an empirical measure towards its population version in Wasserstein distance. We refer to [9, 16, 27, 34, 64] for extensive references and the most recent contributions on this topic. For example, using results from [16], one has that

    E[W_2^2(µ_p, µ_P)] ≤ 2/(p + 1) J_2(µ_P), where J_2(µ_P) = ∫_Ω F_0(x)(1 − F_0(x))/f_0(x) dx,

where F_0 is the cumulative distribution function of µ_P and f_0 denotes its density. Hence, if J_2(µ_P) < +∞, it follows that the risk of µ̂_{n,p} converges to zero at the parametric rate O(1/n + 1/p). A detailed discussion on the derivation of upper bounds on the rate of convergence of µ̂_{n,p} is proposed in [13], together with some results on the minimax optimality of such bounds.
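As a sanity check of the bound involving J_2(µ_P), one can evaluate this functional numerically for a simple law. For the uniform distribution on [0, 1], F_0(x) = x and f_0(x) = 1, so J_2 = ∫_0^1 x(1 − x) dx = 1/6 (the function name is ours):

```python
import numpy as np

def J2_uniform(num=1_000_000):
    """Numerical evaluation of J_2 for the uniform law on [0, 1], where
    F_0(x) = x and f_0(x) = 1, so the integrand is x * (1 - x)."""
    x = np.linspace(0.0, 1.0, num)
    y = x * (1.0 - x)
    # composite trapezoidal rule
    return float(np.sum((y[:-1] + y[1:]) * np.diff(x)) / 2.0)
```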

Open problems in dimension d ě 2
Beyond the one-dimensional case d = 1, it appears that deriving the rate of convergence for empirical Wasserstein barycenters, e.g. for the expected quadratic risk E[W_2^2(µ_{P_n}, µ_P)], is a much more involved problem. A few works [1, 45] exist on the convergence rate of µ_{P_n} to µ_P. However, the more general question of deriving the convergence rate of a Wasserstein barycenter µ̂_{n,p} (7) obtained from samples of random measures remains open for d ≥ 2 (to the best of our knowledge).

Regularization of Wasserstein barycenters
An empirical Wasserstein barycenter µ̂_{n,p} of the discrete measures ν_1^{p_1}, . . ., ν_n^{p_n} defined by (6) is generally irregular (and possibly not unique). Irregularity has to be understood in the sense that µ̂_{n,p} is, in most situations, a discrete probability measure, as shown by its expression (8) when d = 1. Therefore, as a discrete measure, it poorly represents the smoothness of the population Wasserstein barycenter of a random a.c. measure ν with distribution P, since µ_P is absolutely continuous in this setting, as discussed previously. Hence, we now explain how to regularize the Wasserstein barycenter µ̂_{n,p} in order to construct a.c. and consistent estimators of an a.c. population barycenter in the asymptotic setting where both n and p = min_{1≤i≤n} p_i tend to infinity. To this end, one may consider two ways of regularizing an empirical barycenter.

Penalized Wasserstein barycenters
A first possibility is to follow the approach proposed in [12]. We thus introduce the penalized Wasserstein barycenter µ_P^γ associated to a distribution P on P_2(Ω), defined as a solution of the minimization problem

    min_{µ ∈ P_2(Ω)} ∫ W_2^2(µ, ν) dP(ν) + γ E(µ),   (9)

where γ > 0 is a penalization parameter and the penalty function is

    E(µ) = ‖f_µ‖²_{H^k(Ω)} if µ admits a density f_µ ≥ α on Ω, and E(µ) = +∞ otherwise,   (10)

where ‖·‖_{H^k(Ω)} denotes the Sobolev norm associated to the L^2(Ω) space, α > 0 is arbitrarily small and k > d − 1. Taking P in (9) to be the empirical distribution of the discrete measures ν_i^{p_i} yields the penalized empirical barycenter µ̂_{n,p}^γ. The choice (10) for the penalty function E is motivated by the discussion in Section 5 of [12]. Its main advantage is to impose that the penalized Wasserstein barycenter is absolutely continuous with a smooth density. Other examples of penalty functions are discussed in [12], including the class of relative G-functionals described in Section 9.4 of [5].
Existence and uniqueness of these barycenters are discussed in [12] for a large class of penalty functions enforcing them to be a.c. One can thus let f_{n,p}^γ (resp. f_P^γ) be the density associated to µ̂_{n,p}^γ (resp. µ_P^γ). The choice (10) for the penalty function E imposes that the densities of the penalized Wasserstein barycenters are bounded from below by a small constant α > 0 over their support, which is equal to Ω. In this setting, the following result (see Section 5 in [12]) gives the convergence rate in expected squared L^2(Ω)-distance between these densities for a fixed value of γ > 0.

Theorem 3.2 (Section 5 in [12]). Assume that Ω ⊂ R^d is compact. Let f_{n,p}^γ and f_P^γ denote the density functions of µ̂_{n,p}^γ and µ_P^γ, induced by the choice (10) of the penalty function E. Then, provided that d < 4, there exists a constant c > 0 depending only on Ω such that

    E[‖f_{n,p}^γ − f_P^γ‖²_{L²(Ω)}] ≤ (c/γ²)(1/n + 1/p),   (13)

where p = min_{1≤i≤n} p_i.
Letting f_P^0 be the density of the population Wasserstein barycenter µ_P, the convergence of the approximation error term ‖f_P^γ − f_P^0‖²_{L²(Ω)} is also discussed in [12], where it is shown to converge to zero as γ → 0. However, no rate of convergence (as a function of γ) has been derived in [12] for this approximation term, and deriving such a result remains an open problem for regularized Wasserstein barycenters. Therefore, only the consistency of f_{n,p}^γ has been derived in [12], in the sense that the quadratic risk E[‖f_{n,p}^γ − f_P^0‖²_{L²(Ω)}] → 0 under the assumption that γ = γ_{n,p} decays to zero as min(n, p) → +∞ at a rate which guarantees that the right-hand side of inequality (13) converges to zero.

Remark 3.1. Consistency of regularized empirical Wasserstein barycenters is discussed in [12] for the Bregman divergence associated to the penalization function E. The specific choice (10) for E leads to the use of the L²(Ω) norm to compare the densities f_{n,p}^γ, f_P^γ and f_P^0. The question of deriving consistency results for µ̂_{n,p}^γ with respect to the Wasserstein metric (that is, in the space of probability measures rather than densities) has not been considered in [12].

Sinkhorn barycenters
Another way to regularize an empirical Wasserstein barycenter is to use the Sinkhorn divergence (3) to compare probability measures. As this divergence is defined for discrete measures supported on the same points, one first needs to choose a fixed grid Ω_N = {x_1, . . ., x_N} made of N points x_k ∈ R^d (bin locations that are typically equally spaced). Then, one performs a binning of the data (6) on this grid, leading to a dataset of discrete measures (with supports included in Ω_N) given by q̂_i = (1/p_i) ∑_{j=1}^{p_i} δ_{X̃_{i,j}}, where X̃_{i,j} = argmin_{x ∈ Ω_N} |X_{i,j} − x|, for 1 ≤ i ≤ n. Binning (i.e. choosing the grid Ω_N) surely incorporates some sort of additional regularization, and a discussion on the influence of the grid size N on the smoothness of the barycenter can be found in [11]. The size N of the grid is also guided by numerical issues related to the computational cost of the algorithms used to approximate a Sinkhorn divergence. Then, another possibility to introduce regularization in the computation of Wasserstein barycenters [11, 23, 24] is to consider the estimator

    r̂_{n,p}^ε = argmin_{r ∈ Σ_N} (1/n) ∑_{i=1}^n W²_{2,ε}(r, q̂_i),

which can be interpreted as a Fréchet mean with respect to the Sinkhorn divergence. We call r̂_{n,p}^ε a Sinkhorn barycenter. From the point of view of statistics, it can be interpreted as an M-estimator defined over the compact set Σ_N. Two key properties of the Sinkhorn divergence have been used in [11] to study the convergence properties of such barycenters: (i) for any q ∈ Σ_N, the function r ↦ W²_{2,ε}(r, q) is ε-strongly convex for the Euclidean 2-norm; (ii) for q ∈ Σ_N and 0 < ρ < 1/N an arbitrarily small constant, this function is Lipschitz continuous on the set Σ_N^ρ of probability vectors in Σ_N whose entries are all bounded from below by ρ. To guarantee this Lipschitz continuity, the analysis of the statistical properties of Sinkhorn barycenters in [11] is restricted to discrete measures belonging to the convex set Σ_N^ρ (thus having non-vanishing entries). Therefore, one has to slightly modify the data at hand so that the observations belong to Σ_N^ρ. This leads to the following definitions of the sampling model and of the empirical/population Sinkhorn barycenters in Σ_N^ρ.
Definition 3.3. Let 0 < ρ < 1/N, and let P be a probability distribution on Σ_N^ρ. Let q_1, . . ., q_n ∈ Σ_N^ρ be an iid sample drawn from the distribution P. Assume that (X̃_{i,j})_{1≤j≤p_i} are iid random variables sampled from q_i for each 1 ≤ i ≤ n, and consider the discrete measures q̂_i^{p_i} = (1/p_i) ∑_{j=1}^{p_i} δ_{X̃_{i,j}} and q̃_i^{p_i} = (1 − ρN) q̂_i^{p_i} + ρ 1_N, where 1_N is the vector of R^N with all entries equal to one. Then, we define

    r^ε = argmin_{r ∈ Σ_N^ρ} E_{q∼P}[W²_{2,ε}(r, q)], the population Sinkhorn barycenter,   (16)

and r̂_{n,p}^ε = argmin_{r ∈ Σ_N^ρ} (1/n) ∑_{i=1}^n W²_{2,ε}(r, q̃_i^{p_i}), its empirical counterpart.

Theorem 3.3 (Section 3 in [11]). Let ε > 0 and p = min_{1≤i≤n} p_i. Then, the expected squared distance E[|r̂_{n,p}^ε − r^ε|²] satisfies an explicit upper bound, referred to as (18), depending on n, p, N, ε and ρ.

The upper bound (18) sheds some light on how the expected squared distance between the empirical Sinkhorn barycenter and its population version is influenced by the number n of observed measures, the minimal number p of samples per measure, and the size N of the grid. However, this upper bound does not converge to zero as min(n, p) → +∞ if ρ remains fixed. The main advantage of this upper bound is to allow the derivation of a data-driven strategy to select ε, which is presented in the following subsection. Finally, one should remark that the behavior of the population Sinkhorn barycenter r^ε as ε → 0 has not been considered in [11].

Remark 3.2. A third way to regularize an empirical Wasserstein barycenter (which might look more natural to statisticians) is to perform a preliminary smoothing step on the discrete measures ν_1^{p_1}, . . ., ν_n^{p_n}. For example, one may first smooth the data using standard kernel density estimation, leading to densities f_i^{h_i} with bandwidths h_i, and then define the smoothed Wasserstein barycenter as

    µ̂_{n,p}^h = argmin_{µ ∈ P_2(Ω)} (1/n) ∑_{i=1}^n W_2^2(µ, ν_i^{h_i}),

where dν_i^{h_i}(x) = f_i^{h_i}(x) dx and h = max_{1≤i≤n} h_i. The consistency of this approach (as h → 0) has been studied in detail in [49, 50] using multiple point processes for the sampling procedure, and it is also discussed in [13].
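On a fixed grid, a Sinkhorn barycenter of the kind discussed above can be approximated by iterative Bregman projections, in the spirit of the algorithms of [24]. A minimal numpy sketch with uniform weights (the function name, the fixed iteration count and the plain geometric-mean update are ours; log-domain stabilization would be needed for small ε):

```python
import numpy as np

def sinkhorn_barycenter(Q, C, eps, n_iter=500):
    """Approximate Sinkhorn barycenter of the n columns of Q (each a weight
    vector in the simplex, on a common grid with symmetric cost matrix C),
    via iterative Bregman projections with uniform weights 1/n."""
    N, n = Q.shape
    K = np.exp(-C / eps)                         # Gibbs kernel
    u = np.ones((N, n))
    for _ in range(n_iter):
        v = Q / (K @ u)                          # match the marginals q_i
        r = np.prod(K @ v, axis=1) ** (1.0 / n)  # geometric mean -> barycenter
        u = r[:, None] / (K @ v)                 # match the common marginal r
    return r / r.sum()
```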

Data-driven regularization in computational optimal transport
In this paper, we report results of numerical experiments from [11] that are based on the approach proposed in [24] to compute Sinkhorn barycenters on a grid Ω_N of equi-spaced points. We do not report results on the computational aspects of penalized Wasserstein barycenters, and we refer to [11] for illustrations of their numerical performances.
As with any nonparametric method, the use of regularized Wasserstein barycenters involves the delicate calibration of a regularization parameter. Choosing the value of ε for the Sinkhorn barycenter r̂_{n,p}^ε may be guided by the classical bias-variance tradeoff principle, as illustrated by the synthetic example displayed in Figure 5. For small values of ε the Sinkhorn barycenter shows many oscillations, while it is much smoother for larger values of ε. In some sense, the regularization parameter ε can be interpreted as the usual bandwidth parameter in kernel density estimation, and its choice greatly influences the shape of r̂_{n,p}^ε. Let us now briefly discuss the question of choosing ε in an automatic way. In [11], the following (un-regularized) population Wasserstein barycenter is considered:

    r^0 = argmin_{r ∈ Σ_N^ρ} E_{q∼P}[W²_{2,0}(r, q)], with W²_{2,0}(r, q) := min_{U ∈ U(r,q)} ⟨C, U⟩.

The term E[|r^ε − r̂_{n,p}^ε|²] can then be interpreted as a "variance term", while |r^ε − r^0| is referred to as a bias term. Ideally, one would like to select a value of ε which gives a good tradeoff between these two terms, whose values change in opposite directions as ε varies. However, the decay of the bias term (as ε → 0) is typically unknown, and only an estimate of the variance term is available. In [11], it is thus proposed to use the upper bound (18) to automatically calibrate the parameter ε > 0 by following the Goldenshluger-Lepski (GL) principle [39], as formulated in [46] for standard kernel density estimation. In [11], an adaptation of the GL principle is used to estimate a bias-variance tradeoff functional that is minimized over a finite set of regularization parameters to obtain an automatic selection of ε for Sinkhorn barycenters. The method consists in comparing estimators pairwise (for a given range of regularization parameters) with respect to a loss function, and we refer to [11] for further details. As an illustration of the usefulness of the GL principle in statistical optimal transport, we report in Figure 6 numerical experiments from [11]
on the computation of a Sinkhorn barycenter (with a data-driven choice of ε) in dimension d = 2 for the flow cytometry dataset displayed in Figure 2(b). This regularized barycenter clearly corrects some mis-alignment issues in these measurements. We also display in Figure 6(a) the Euclidean mean of this dataset (after a preliminary kernel smoothing step). The support of this Euclidean mean is more spread out, due to the presence of a strong variance in spatial location in the data from one individual to another.
4 Geometry of the Wasserstein space and geodesic PCA

Principal Component Analysis (PCA) is certainly the most widely used approach to reduce the dimension of datasets whose elements are high-dimensional vectors or matrices. When such elements can be modeled as probability measures, we explain how to use the Wasserstein metric to compute their modes of variation around their barycenter in the Wasserstein space. We briefly describe the setting where the data can be modeled as histograms supported on the real line, and we mention extensions to the case of probability measures supported on R^d for d ≥ 2.

Geometry of the Wasserstein space
The space P_2(Ω) endowed with the 2-Wasserstein distance is not a Hilbert space. As a consequence, standard PCA in a linear space, based on the diagonalisation of a covariance matrix, cannot be applied directly to compute principal modes of variation with respect to the Wasserstein metric. Nevertheless, a meaningful notion of PCA [14, 20, 58] can still be defined by relying on the pseudo-Riemannian structure of the Wasserstein space, which was extensively studied in [5].
This geometric structure allows to define a notion of geodesic PCA in the Wasserstein space which shares similarities with analogs of PCA for data belonging to a Riemannian manifold. Principal Geodesic Analysis (PGA) [33, 61] is a geometric method for statistical analysis that extends classical PCA in Hilbert spaces [26, 53] to curved Riemannian manifolds, by replacing the Euclidean concepts of vector means, lines and orthogonality by the more general notions of Fréchet means, geodesics in Riemannian manifolds and orthogonality in tangent spaces. For d = 1, the geometric structure of the Wasserstein space is relatively easy to describe, and we refer to [14, 20] for a detailed presentation. The setting d ≥ 2 is more involved. Some elements on the geometry of the Wasserstein space can be found in Section 4.4 of [50], and an extensive study in [5].

Geodesic PCA in the one-dimensional case
Following these principles of statistical analysis on manifolds, a framework for geodesic PCA (GPCA) of probability measures supported on an interval Ω ⊂ R has been introduced in [14]. GPCA is defined as the problem of estimating a principal geodesic subspace (of a given dimension) which maximizes the variance of the projection of the data onto that subspace, whose base point is the Wasserstein barycenter of the observations. In [14], a detailed study of the existence and consistency of GPCA is proposed. It is also shown that GPCA in the Wasserstein space is equivalent to mapping the data to the tangent space at the empirical Wasserstein barycenter (assumed to be a.c.) and then performing a PCA that is constrained to lie in a convex and closed subset of functions. Mapping the data to this tangent space is not difficult in the one-dimensional case, as it amounts to computing a set of optimal maps between the data and their Wasserstein barycenter, for which a closed form is available using their quantile functions. In particular, optimal maps belong to the set of non-decreasing functions. This convex constrained PCA can also be understood from the quantile formula (2) for the Wasserstein metric when d = 1. Indeed, GPCA for probability measures supported on Ω ⊂ R corresponds to a PCA of quantile functions that is constrained to lie in the convex subset of non-decreasing functions in L²([0, 1]). Nevertheless, such a convex PCA leads to the minimization of a non-convex and non-differentiable objective function. This is a delicate optimisation problem (even for d = 1), and numerical methods to compute such a PCA have been proposed in [20].
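To give a feel for the tangent-space point of view when d = 1, here is a deliberately simplified sketch: map each sample to its empirical quantile function, center at the barycenter's quantile function, and run an ordinary (unconstrained) PCA in that space. The monotonicity constraint that distinguishes the actual GPCA of [14] is omitted, and the function name is ours:

```python
import numpy as np

def quantile_pca(samples, n_comp=2):
    """Simplified sketch of 1-D GPCA: PCA of empirical quantile functions
    centered at the barycenter's quantile function. The constraint that
    projected quantile functions remain non-decreasing (as required by the
    true GPCA) is NOT enforced here."""
    Qs = np.sort(np.asarray(samples, dtype=float), axis=1)  # (n, p) quantiles
    bary = Qs.mean(axis=0)           # barycenter quantile function, cf. (8)
    Zc = Qs - bary                   # centering in the tangent space
    _, s, Vt = np.linalg.svd(Zc, full_matrices=False)
    return bary, Vt[:n_comp], (s ** 2) / len(samples)  # + components, variances
```

For a dataset of pure translations, all the variability is captured by a single constant-shift component, as the test below illustrates.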
To shed some light on the benefits of GPCA for data analysis, we report numerical experiments on synthetic data in Figure 7, and for the dataset of children's first names at birth in Figure 8. When the size or locations of significant bins in the observed histograms vary substantially from one histogram to another, standard PCA with respect to the Euclidean metric (in the space of densities) fails to recover geometric variations in location. On the contrary, GPCA in the Wasserstein space is able to capture meaningful sources of variation in a dataset by separating variability in location and scaling in the observed histograms along the first and second geodesic principal components.

Geodesic PCA in dimension d ≥ 2
The study of geodesic PCA of histograms in the Wasserstein space is much more involved in dimension d ≥ 2, both on the theoretical and numerical sides. Indeed, for d ≥ 2, one can always define GPCA as the construction of a principal geodesic subspace maximizing the variance of the projection of the data onto that subspace. However, a rigorous analysis of such a construction in dimension d ≥ 2 has not been proposed so far. There also exist two main differences with the one-dimensional case. First, GPCA cannot be equivalent to mapping the data to the tangent space at their barycenter (assumed to be a.c.) and then performing a standard PCA in that linear space, as the Wasserstein space is a curved manifold for d ≥ 2. Secondly, the optimal maps sending a set of histograms to such a tangent space are more difficult to compute for d ≥ 2. Moreover, it is well known [18,21] that such maps are gradients of convex functions. This implies that adapting the existing algorithms for d = 1 leads to optimisation problems with constraints that are more difficult to handle than a PCA constrained to lie in the convex set of non-decreasing functions. Nevertheless, numerical approximations of GPCA in dimension d = 2 have been proposed in [20,58]. As an illustrative example, we report numerical experiments from [20] on the analysis of data from the MNIST database [48], which contains grayscale images of handwritten digits. All grayscale images in this dataset are of size 28 × 28 pixels. We normalize them so that the pixel grayscale values sum to one. In this manner, the images become histograms supported on a 2D grid of size 28 × 28. The ground metric for the Wasserstein distance is then the squared Euclidean distance between the locations of bins in this 2D grid. In Figure 9, we display the first principal components computed from 1000 images of each digit using the algorithmic approach proposed in [20]. It can be seen that GPCA captures very well the main source of variability of each digit, such as changes in the shape of the '0' or the presence or absence of the lower loop of the '2'.
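The preprocessing described above (normalizing images into histograms and building the ground cost on the pixel grid) amounts to the following sketch. Random arrays stand in for the actual MNIST data here, so this only illustrates the data representation, not the GPCA algorithm of [20]:

```python
import numpy as np

# Random grayscale images stand in for MNIST digits (28 x 28 pixels)
rng = np.random.default_rng(1)
images = rng.random((5, 28, 28))

# Normalize each image so its pixel values sum to one: each image
# then defines a histogram (discrete probability measure) on the grid
hists = images / images.sum(axis=(1, 2), keepdims=True)
hists = hists.reshape(5, -1)                                # shape (5, 784)

# Ground cost for the Wasserstein distance: squared Euclidean distance
# between the (row, col) locations of every pair of bins in the grid
rows, cols = np.meshgrid(np.arange(28), np.arange(28), indexing="ij")
xy = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)  # (784, 2)
C = ((xy[:, None, :] - xy[None, :, :]) ** 2).sum(axis=-1)          # (784, 784)
```

The matrix `C` is exactly the cost matrix fed to discrete optimal transport solvers when computing Wasserstein distances or barycenters between such image histograms.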

Figure 1: Children's first names at birth. This dataset has been provided by the INSEE (French Institute of Statistics and Economic Studies). A subset of 4 histograms representing the distribution of children born with that name per year in France (left), and the whole dataset of n = 1060 histograms (right), displayed as pdfs over the interval [1900, 2013].

Figure 3: First row: densities of two Gaussian measures µ and ν on R. Second row: geodesics between µ and ν with respect to the Hilbert metric and the Wasserstein metric. The dotted curves in (c) correspond to the densities f_t = (1 − t) f_µ + t f_ν, and in (d) to the densities of the measures with quantile functions F_t^- = (1 − t) F_µ^- + t F_ν^-, for 0 < t < 1 (see also the case d = 2 in Figure 4).

Figure 4: A numerical example taken from [17]. Wasserstein barycenter μ̄_2 in (b) between two probability measures ν_1 and ν_2 on R², assumed to be a.c. and uniform over their supports displayed in (a) and (c).

Figure 5: (a) A subset of 8 histograms (out of n = 15) obtained from random variables sampled from one-dimensional Gaussian mixture distributions ν_i (with random means and variances). The population Wasserstein barycenter of ν_1, …, ν_n is expected to be a mixture of two Gaussian distributions. Histograms are constructed by binning the data (X_{i,j})_{1≤i≤n, 1≤j≤p} on a grid Ω_N of size N = 2^8 with p_1 = … = p_n = p = 50. (b) Three Sinkhorn barycenters r̂^ε_{n,p} associated to the parameters ε = 0.18, 1.94, 9.5.

Figure 7: Synthetic dataset: Gaussian probability densities with random means and variances. Euclidean PC: standard PCA with respect to the Euclidean metric between densities; first and second principal components (PC) of the data around their Euclidean mean (blue curve). The yellow-to-red curves encode the progression of the first two main directions of variation in the Hilbert space of square-integrable functions. Wasserstein PC: geodesic PCA in the Wasserstein space; first and second principal components (PC) of the data around their Wasserstein barycenter (black curve). The curves from light blue to violet encode the progression of the pdfs along the first two main geodesic directions of variation in P_2(Ω).

Figure 9: First principal geodesics for 1000 images of each digit from the MNIST database [48]. The middle column represents the pdf of the Wasserstein barycenter of each digit (computed from 1000 images whose pixel values are normalized to form a set of histograms). Each row allows one to visualize the first geodesic direction of variation of each digit around its barycenter in the sense of optimal transport. The cost is quadratic and corresponds to the squared Euclidean distance between pixel locations.