MACHINE LEARNING AND OPTIMAL TRANSPORT: SOME STATISTICAL AND ALGORITHMIC TOOLS. ∗

. In this paper, we focus on the analysis of data that can be described by probability measures supported on a Euclidean space, by way of optimal transport. Our main objective is to present a first and second order statistical analyses in the space of distributions in a concise manner, as a first approach to understand the general modes of variation of a set of observations. In the context of optimal transport, these studies correspond to the barycenter and the decomposition into geodesic principal components in the Wasserstein space. In particular, we aim attention at a regularised estimator of the barycenter, in order to handle the noise coming from the observations. Additionally, we leverage these tools for time series analysis, whose spectral informations are compared using optimal transport


The Wasserstein distance in brief
The Wasserstein distance between probability distributions is a special case of optimal transport introduced by Monge (1781) and generalised by Kantorovich (1940').It was designed to find the most efficient way, i.e. requiring the least possible effort, to transport a pile of sand into a hole of the same volume.In other words, in mathematical language, Monge's problem comes to minimising the cost of transferring mass from one probability measure to another, that is, for probability measures µ and ν supported respectively on X and Y: inf T :T #µ=ν X c(x, T (x))dµ (x) for an arbitrary cost c : where T belongs to the set of measurable functions T : X → Y such that T #µ = ν.The pushforward measure T #µ is a probability measure on Y defined by T #µ(B) = µ{x ∈ X | T (x) ∈ B}, for any measurable set B ⊂ Y (see Figure 1).However, such a map T with T #µ = ν does not always exists.
Figure 1.Mass transfer through the pushforward operator and the map T between two probability distributions µ (in red) and ν (in blue).Figure adapted from Thorpe's book [Thorpe, 2019].
In its modern formulation, Kantorovich's problem is a relaxed version of Monge's that consists in finding a transport plan between a source measure µ and a target measure ν, which minimises the global effort (see e.g.[Villani, 2008] and [Ambrosio et al., 2004]).In this article, we focus on measures with support included in R d and on a Euclidean cost, thus defining the Wasserstein p-distance introduced by Leonid Wasserstein (1969), given for two probability distributions µ, ν of finite p-momentum by ∥x − y∥ p dπ(x, y) where π contains the behaviour of the mass transfer.More precisely Π(µ, ν) is the set of measures supported on R d × R d of respective marginals µ and ν (see Figure 2).This distance has in particular the advantage of characterising the weak convergence of measures on the metric space (P p (X ), W p ) of probabilities admitting a moment of order p (see e.g.Chapter 7 of [Villani, 2003]).et al., 2012].
Additionally, on the real line, the Wasserstein distance is closed-form : let Ω be a (possibly unbounded) interval in R and let ν be a probability measure over (Ω, B(Ω)) where B(Ω) is the σ-algebra of Borel subsets of Ω.The cumulative distribution function (cdf) and the (generalized) quantile function of ν are denoted respectively by F ν and F − ν .Then, the Wasserstein distance W p is defined for probability measures µ and ν in P p (Ω) by Note that if µ ∈ P p (Ω) is absolutely continuous with respect to the Lebesgue measure dx, then T * = F − ν • F µ will be referred to as the optimal mapping to pushforward µ onto ν in the Monge's problem (1).For a detailed analysis of P p (Ω) and its connection with optimal transport theory, we refer to [Villani, 2003].For a computational point of view, including applications, we refer to [Peyré and Cuturi, 2019].
In the following, we discuss the first order statistical analysis of a set of probability distributions, namely the barycenter in the Wasserstein space.In particular, we present an entropy regularised estimator of the barycenter and study its variance.Next, we outline the difficulties of conducting a principal component analysis for a set of probability measures, and present a PCA based on the geodesics in the Wasserstein space.The final section addresses the analysis of time series using the tools presented here.

Wasserstein barycenters
A statistical analysis of order one requires an object equivalent to the Euclidean mean, adapted to non-linear spaces, in this case the Wasserstein space (P 2 (R d ), W 2 ).In this regard, the Fréchet mean [Fréchet, 1948] for the W 2 metric is a natural tool.As introduced by [Agueh and Carlier, 2011], an empirical Wasserstein barycenter νn of a set of n probability measures ν 1 , . . ., ν n in P 2 (R d ) is given by νn ∈ argmin A detailed characterisation of these barycenters in terms of existence, uniqueness and regularity for probability measures whose support is included in R d is available in [Agueh and Carlier, 2011].The notion of Wasserstein barycenter was first generalized in [Le Gouic and Loubes, 2017] for random probability measures (see also [ Álvarez-Esteban et al., 2015] for similar concepts).A probability measure ν in P 2 (R d ) is said to be random if it has distribution P on (P 2 (R d ), B P 2 (R d ) , where B P 2 (R d ) is the σ-Borel algebra generated by the topology induced by the distance W 2 .In other words, when well defined, the Wasserstein barycenter of a random probability measure of law P supported on the space of distributions P 2 (R d ) is given by In the case where ν 1 , . . ., ν n are independent and identically distributed random probability measures (iid ) of law P, the barycenter ν P is referred to as the population counterpart of νn .We can then construct regularised versions of these barycenters, and conduct their statistical analysis.The motivation is twofold: adding an entropy term to the Wasserstein distance not only allows us to take advantage of a fast algorithm to compute the barycenter but also to obtain a smoother estimator.Moreover, we present the first bound on the variance of the proposed regularised estimator, which will allow to choose the regularisation parameters appropriately.

Entropy regularised barycenters
We consider a dataset composed of n random discrete measures ν p1 , . . ., ν pn obtained from random observations X = (X i,j ) 1≤i≤n; 1≤j≤pi organised as n subjects (or experimental units), such that ν pi is defined To better understand these objects, we can lean on the example of the cytometry dataset (Immune Tolerance Network http://bioconductor.org/packages/release/bioc/html/flowStats.html) in Figure 3, which shows the FSC (forward-scattered light, x-axis, ranging from 200 to 600) and SSC (side-scattered light, y-axis, ranging from 0 to 250) marker values of human cells.In that specific example, the population barycenter would be seen as the cell measurements of a super patient from which ν p1 , . . ., ν pn patients are sampled, such that for each of them we have only a finite number p i of observations (i.e.cell's measurements).From such a sample, we are interested in finding the underlying two-dimensional distribution, a priori absolutely continuous, of the patients' FSC and SSC markers.A regularised version (in the form of a entropy penalty term) of the Wasserstein barycenter can be of great advantages since one usually only has access to a dataset of observations X. Therefore the resulting barycenter may suffer from irregularities due to outliers or a lack of observations per measurement.
In the following, we will consider the discrete setting, meaning that the measures are supported on a fixed finite number of points X := {x 1 , . . ., x N }.A probability measure is then identified by a vector of positive weights summing to 1, that is an element of the N -dimensional simplex denoted Σ N .In this framework, the optimal transport corresponds to a linear optimisation problem on the space of transport matrices of size N , with constraints on the marginals.However, the excessive cost of computing such an optimal mass transfer -of the order of O(N 3 log N )-is clearly prohibitive.To alleviate this computational cost, [Cuturi, 2013] proposed to add an entropy regularisation term to the classical linear transport problem, leading to the notion of entropy regularised optimal transport, or Sinkhorn divergence, between discrete probability measures.The Sinkhorn divergence is then defined for a, b ∈ Σ N and regularisation parameter ε > 0 by where h and C is the cost matrix between the points of the support X , i.e.C ij = ∥x i − x j ∥ 2 for i, j ∈ {1, . . ., N }.Note that entropy regularised transport has considerably gained popularity in machine learning and statistics, as it makes it possible to use an approximation of transport distances for high-dimensional data analysis; in particular for generative models, multi-label learning, dictionary learning or image processing, see e.g.[Cuturi andPeyré, 2016, Rabin andPapadakis, 2015], text extraction by keyword comparison and in the averaging of neuroimaging data.Peyré and Cuturi's book [Peyré et al., 2012] presents a large part of the applications specific to optimal transport, and in particular to regularised transport.As previously explained, the initial purpose of this entropy regularisation was to efficiently compute the Wasserstein distance, by way of an iterative algorithm for which each iteration costs O(N 2 ).Singularly, such a regularised transport can be instrumental in handling outliers or smoothing an estimator of barycenter, beyond the purely computational advantage.This approach leads to the Sinkhorn barycenter [Cuturi and Doucet, 2014, Cuturi and Peyré, 2016, Carlier et al., 2017, Benamou et al., 2015].
The following presents the first statistical study of this regularised barycenter.We consider n discrete random measures q 1 , . . ., q n ∈ Σ N generated from a distribution P ∈ Σ N .Additionally, for each i, we assume that the observations (X i,j ) 1≤j≤pi are random variables with distribution q i .We then define for ε > 0 the empirical Sinkhorn barycenter rε n,p , where p depends on (p 1 , . . ., p n ), and its equivalent in population r ε rε n,p = arg min which correspond to Fréchet averages with respect to the Sinkhorn divergence W 2 2,ε , and depend on the regularisation parameter ε that informs on the amount of entropy within the transport problem.We can notice (see Figure 4) that the parameter ε has a smoothing effect on the barycenter rε n,p : the larger the parameter, the more the mass spreads.Thus the entropy penalty is no longer only of computational interest (in order to speed up the calculation time of a transport distance), but becomes a real regularisation tool.
We proved the strong convexity of the Sinkhorn divergence [Bigot et al., 2019], which allowed us to obtain a bound on the variance of the estimator rε n,p of the Sinkhorn barycenter.For this, it is necessary to restrict the analysis to discrete measures belonging to the space as well as constraining the barycenter to belong to this space.This amounts to imposing a constraint on the support of the Sinkhorn barycenter.The bound is given by the following theorem, where all constants are explicit: Theorem 1.1 (Bigot, C., Papadakis, 2018).Let p = min 1≤i≤n p i et ε > 0. So where we recall that C is the cost matrix in the entropy regularised optimal transport problem and N the number of support points of the distributions.

Application and choice of the regularisation parameter ε
Histogram registration problems have applications in many fields.In bioinformatics, for example, researchers aim to automatically normalise large datasets to compare and analyse characteristics within a single population of cells, taking into account phase variability (see the previous cytometry example in Figure 3).Unfortunately, the acquired information is often noisy due to misalignment, caused by technical variations in the environment.The need to take phase variability into account in the statistical analysis of such datasets is a known problem.Examples can be found in the one-dimensional case (d = 1) with biodemographic and genomic studies [Zhang and Müller, 2011], economic studies [Kneip and Utikal, 2001], analysis of neuronal activity in neuroscience [Wu and Srivastava, 2011] or the functional connectivity between brain regions [Petersen et al., 2016].In higher dimension, d ≥ 2, the data registration problem comes for instance from the statistical analysis of spatial point processes [Gervini, 2016, Panaretos andZemel, 2017] or from flow cytometry data [Hahne et al., 2010, Pyne et al., 2014].
Optimal transport allows to correct the effects of misalignments within a dataset, however its usefulness has only been exploited by few authors.Additionally, the noise can be dealt with a smoothing step, that in our case is included in the computation of the regularised Wasserstein barycenter defined in (8).The estimator then fulfils its role in efficiently recovering the structure of a dataset from a small number of observations, as shown in Figure 5.In order to automatically choose the regularisation parameter, the Goldenshluger-Lepski method suggests a solution based on the variance of the estimators, for a choice of parameters guided by the dataset.Therefore the theoretical results on the upper bound of the variance of the estimator in Theorem 1.1 allow us to tackle the problem of histogram registration and especially the automatic choice of the regularisation parameter ε of the barycenter estimator in (8).In Figure 6, we present a toy example for n = 15 mixtures of Gaussian distributions, each with p = 50 observations.The GL bias-variance trade-off function associated to the Sinkhorn barycenters and plotted on the left suggests to choose ε = 2.55 as the optimal regularisation parameter.In Figure 6 (right), we display the associated Sinkhorn barycenter rε n,p .This work [Bigot et al., 2019] is the first to propose an automatic choice of parameters in the context of regularisation related to optimal transport.

Second order statistical analysis : Geodesic PCA
We present in this section a second-order statistical analysis, which is naturally expressed through a principal component analysis (PCA).In the same way as a usual PCA, the aim is to calculate the main modes of variation of one-dimensional histograms around their average element in order to better summarise and represent the information of a dataset.However, as the number, size or locations of significant bins in the histograms of interest may vary from one histogram to another, using standard PCA on histograms (with respect to the Euclidean metric) is bound to fail.The usual (functional) PCA of a set of probability densities (f i ) i=1,...,n seen as functions of L 2 (R) consists in diagonalizing the covariance operator Cov.The eigenvectors of Cov associated with the largest eigenvalues describe the main modes of variability of the data around the Euclidean mean fn .The functional PCA results are very unsatisfactory for several reasons.Firstly, the functions obtained are not probability densities, in particular they take negative values.Secondly, the L 2 metric only takes into account variations in the amplitude of the data.
In order to overcome these two drawbacks, it is essential to work directly on the space of probability measures P 2 (R) endowed with the 2-Wasserstein distance.However, this space is not Hilbertian.Consequently, standard PCA, which involves the calculation of a covariance matrix, cannot be applied directly to compute the principal modes of variation in the Wasserstein sense.Nevertheless, a meaningful notion of PCA can still be defined based on the pseudo-Riemannian structure of the Wasserstein space, which has been extensively studied in [Ambrosio et al., 2004] and [Ambrosio et al., 2005].Following this principle, a structure for the geodesic principal component analysis (GPCA) of probabilities measures supported on an interval Ω ⊂ R has been introduced in [Bigot et al., 2017].GPCA is defined as the problem of estimating a principal geodesic subspace (of a given dimension) that maximises the variance of the projection of the data into this subspace.In this approach, the base point of the subspace is the Wasserstein barycenter fn of the data f i as mentioned in (4).The existence, consistency and a detailed characterisation of the GPCA in P 2 (Ω) have been studied in [Bigot et al., 2017].In particular, the authors showed that this approach is equivalent to projecting the data into the tangent space of P 2 (Ω) at the Fréchet mean, and then performing a PCA in this Hilbert space, while constraining the problem to a convex and closed subset of functions.Projecting the data into this tangent space is not difficult in the one-dimensional case since it boils down to computing a set of optimal mappings, or Monge maps, between the data and their Wasserstein barycenter, for which an explicit form is available (see eq. ( 3)).However, the authors of [Bigot et al., 2017] did not construct an algorithm to solve the GPCA problem, only a numerical approximation of the computation of the geodesic principal components has been proposed.This last approach consists in applying a log-PCA, i.e. a standard PCA of the dataset previously projected in the tangent space of P 2 (Ω) to its Wasserstein barycenter fn .
In our paper [Cazelles et al., 2018], we proposed to compare the log-ACP and GPCA methods as introduced in [Bigot et al., 2017, Seguy andCuturi, 2015].In this setting, histograms are seen as piecewise constant probability densities supported on a given interval Ω.Therefore, the modes of variation of a set of histograms can be studied through the notion of geodesic PCA of probability measures in the Wasserstein space P 2 (Ω) admitting these histograms for density.The results are presented in Figure 7.The components recover well the translation effects (first component, left) and the amplitude effects (second component, right) of the dataset, when a so-called Euclidean functional PCA is not able to do so.However, the computation of the GPCA remains complicated even in the simplest case of R supported probability densities.
We have therefore provided a novel algorithm forward-backward to perform geodesic principal component analysis of measures defined on the real line, by solving the non-convex GPCA optimization problem exactly.This allowed us to present a detailed comparison between log-PCA and geodesic PCA of one-dimensional histograms, for different datasets.We have also extended the results for two-dimensional measures.The codes are available online at https://github.com/ecazelles/2017-GPCA-vs-LogPCA-Wasserstein.

Application to time series analysis
As an object allowing displacements on the support of the measures, the Wasserstein distance can be instrumental for signal processing problems and stationary time series analysis.The previous theoretical studies then naturally led to comparing time series in terms of their normalised Power Spectral Density (PSD) (i.e. of mass 1) for three main reasons: (i) comparing two signals in terms of their PSD is possible even when they do not share the same sampling rate, length, magnitude or phase; (ii) in the very simple case of a cosine of frequency ω, its PSD is given by the sum of Dirac mass at −ω and ω.It is thus reasonable to use the Wasserstein metric to emphasise the location of the support of the PSD, which contains all the information of the cosine; (iii) in order to leverage the vast literature on the Wasserstein distance, which deals with positive functions of mass 1.
Going into details, the PSD of a signal is given by the modulus of its Fourier transform.After normalisation, it is possible to characterise an equivalence class for signals of the same Normalised PSD (NPSD).Once we have these objects in hand, we can define the so-called Wasserstein-Fourier distance between two classes of signals as the Wasserstein distance between their NPSDs, as proposed in [Cazelles et al., 2021].From this construction we can easily deduce basic properties of this distance, on time and frequency translations for example.Similarly, we validated the proposed distance as a measure of interest for time series by relating convergence results between the time and frequency representations when the number of observations tends to infinity.Once this framework is well established, we can apply the known statistical tools defined in the Wasserstein space.We emphasise that for many applications, differentiating and quantifying information across the spectrum of a signal, in the frequency domain, is more coherent then in the time domain.In particular, we have focused on the following applications.The code to reproduce the experiments is available in Python at https://github.com/GAMES-UChile/Wasserstein-Fourier.
Interpolation.As in the intuitive example (ii), an interpolation between two cosines of frequency ω 1 ≤ ω 2 is interpreted as a cosine of frequency ω γ with ω 1 ≤ ω γ ≤ ω 2 .More generally, an interpolation boils down to computing the geodesic in the Wasserstein space between the normalised PSDs associated with the signals.This is summarised in Figure 8 (left).We also present in Figure 8 (right) the results when the signals are Gaussian processes with different kernels, indeed Bochner's Theorem directly relates the PSD and the kernel of a Gaussian process.In this special case, the interpolation appears to be a natural way of encoding the deformation of one signal onto another.As a consequence, interpolation allows us to generate new data along the geodesic whose dynamic content is close to the source and target data, thus performing data augmentation.Note that a GPCA procedure could also be applied to a set of signals for data augmentation.However, in order to properly capture the diversity of the dataset, the data must have a strong common geometric structure.PCA.The counterpart of classical PCA in the space of probability distributions, presented in Section 2, can be directly applied to the NPSDs space.That allows to visualise and identify data that have similar dynamic content as soon as their projections are close on the principal components, and then to deduce groups of signals sharing some behaviour in frequency.
Classification.We propose a simple logistic regression framework, as well as a classifier based on nearest neighbours to classify time series based on their NPSD, which we compare through Wasserstein and Euclidean distances and Kullback-Leibler divergence.

Conclusion
In this paper, we present a small sample of the statistical analyses that can be performed using optimal transport when dealing with a dataset of objects described by probability distributions.The main message is that using an adequate space and metric to process data can be critical in some applications.

Figure 2 .
Figure 2. Representation of an optimal transport plan π in eq.(2) between two probability distributions µ (in red) and ν (in blue) that are discrete (left) and absolutely continuous with respect to Lebesgue measure (right).Figure adapted from Peyré and Cuturi's book[Peyré et al., 2012].

Figure 3 .
Figure 3. Cytometry dataset from the Immune Tolerance Network.Each point cloud represents the FSC and SSC marker values for a set of cells from a patient.

Figure 4 .
Figure 4.A simulated example of n = 2 distributions constructed with p 1 = p 2 = 300 observations generated from mixtures of Gaussians of random means and variances.(Left) The blue and red graphs are histograms of equal and small bins.(Right) 400 Sinkhorn barycenters rε n,p for ε ranging from 0.1 to 5. The colours encode the variation of ε.

Figure 5 .
Figure 5. Sinkhorn barycenter associated to the cytometry dataset in Figure 3.

Figure 7 .
Figure 7. (Right) Data set of n = 100 randomly translated and expanded Gaussian histograms.(Top-left) Usual PCA via the Euclidean metric.The Euclidean barycentre is shown in blue.(Bottom-left) Geodesic PCA with respect to the Wasserstein distance.The black curve represents the density of the Wasserstein barycentre.The colours encode the progression of the densities along the geodesic principal components.