On a few statistical applications of determinantal point processes

Determinantal point processes (DPPs) are a repulsive distribution over configurations of points. The 2016 conference Journées Modélisation Aléatoire et Statistique (MAS) of the French society for applied and industrial mathematics (SMAI) featured a session on statistical applications of DPPs. This paper gathers contributions by the speakers and the organizer of the session. Résumé. Les processus ponctuels déterminantaux (DPP) sont des distributions répulsives sur des configurations de points. Au cours des journées Modélisation Aléatoire et Statistique (MAS) 2016 de la Société française de Mathématiques Appliquées et Industrielles (SMAI), nous avons organisé une session sur les applications statistiques des DPP. Cet article rassemble des contributions des orateurs et de l’organisateur.


Introduction
Determinantal point processes (DPPs) are distributions over configurations of points that encode repulsiveness in a kernel function. Since their formalization in [27] as models for fermions in particle physics, specific instances of DPPs have half-mysteriously appeared in fields such as probability [21], number theory [35], or statistical physics [30]. More recently, they have been used as models for repulsiveness in statistics [24] and machine learning [23]. During the 2016 Journées Modélisation Aléatoire et Statistique (MAS) of the French society for applied and industrial mathematics (SMAI), we organized a session specifically on statistical applications of DPPs. This paper gathers contributions by the speakers (FL, XM, AV) and the organizer (RB), which cover and extend the talks of that session. Jamal Najim also contributed a talk to the session, on characterizing the distribution of eigenvalues of large covariance matrices. The content of his talk has already been the topic of a recent paper in the MAS proceedings [18].
The paper follows the schedule of the MAS session. Section 2 gives a tutorial introduction to DPPs, Section 3 investigates DPPs as a model in spatial statistics. In Section 4, DPPs are shown to have desireable properties in survey sampling design. In Section 5, DPPs are used as a model for the spatial distribution of antennas in cellular networks, and asymptotic results are obtained to justify the observed loss of repulsiveness in the superposition of independent DPPs. Finally, in Section 6, Monte Carlo methods are built on DPPs that provide a stochastic version of Gaussian quadrature.

Determinantal point processes
DPPs can be defined on any abstract locally compact Polish space E, see [21,37]. For statistical applications the two important cases, detailed in this section, are E being a (finite) discrete space, say E = {1, . . . , N }, and E = R d .

DPPs on a finite discrete space
In this section we consider a finite discrete state space E = {1, . . . , N }. In this case, a point process on E is simply a probability measure on the set 2 E of all subsets of E. As demonstrated in the sequel, DPPs on E form a flexible family of processes, that generate subsets of E exhibiting diversity. Most properties of DPPs in this case, including those detailed in the following, can be found in [23,26,39].
We say that X is a DPP on E if there exists a N × N matrix K such that for any A ⊂ E where K A is the sub-matrix with entries (K ij ) i,j∈A , with the convention K ∅ = 1. If K is a Hermitian matrix, then the existence of X is ensured if and only if all eigenvalues of K are in [0,1]. In the remainder of this section, we restrict ourselves to real symmetric matrices. Equation (1) already leads to insightful interpretations. First, the diagonal of K encodes the marginal probability for each element of E to belong to X. Indeed, from (1), for any i ∈ E, P (i ∈ X) = K ii . Second for i = j, (1) implies Thus, if K ij measures similarity between i and j, then X favours diversity. In particular, the correlation between the events {i ∈ X} and {j ∈ X} is always negative, and all the more negative that i and j are similar. In this sense, X generates sets having diverse elements. If all eigenvalues of K are strictly less than one, or equivalently if (I − K) is invertible, then we can deduce from the inclusion-exclusion principle that for any A ⊂ E where L = K(I − K) −1 . As a side note, Equation (2) provides an alternative way to define a DPP: if X satisfies (2) where L is a positive semidefinite matrix, then it is called an L-ensemble.
Note that all L-ensembles are DPPs in the sense of (1) where K = L(L + I) −1 , but the converse is not true (if (I − K) is not invertible). L-ensembles are a popular point of view in the machine learning community, see [23]. It allows to define easily a DPP through a positive semidefinite matrix L, without the constraint that all eigenvalues must be less than or equal to 1, unlike K. The main drawback though is that L does not encode a clear interpretation of the process. Another limitation is the fact that the probability of the empty set is necessarily positive for L-ensembles, which is unrealistic for some applications, see e.g. survey sampling in Section 4.
The use of DPPs on finite sets in machine learning or for survey sampling is motivated by the flexibility of this family of processes, through the choice of K (or L), and also by its many appealing properties, some of which are detailed below.
i) [unicity] Given a symmetric matrix K with all eigenvalues in [0, 1] , X in (1) is unique [42]. However the converse is not true: two symmetric matrices K andK have the same principal minors if and only if K = DKD −1 where D is a diagonal matrix with entries ±1, see [14,33]. Therefore in this case, K andK generate the same DPP. ii) [cardinality] In general, the number of elements in X, denoted by |X|, is random.
Specifically, if we denote by λ 1 , . . . , λ N the eigenvalues of K, the law of |X| corresponds to the sum of N independent Bernoulli random variables with respective mean λ i . In particular where tr means trace. An important particular case occurs when K is an orthogonal projection matrix. Then X is called a determinantal projection process and |X| = rank(K) is deterministic. This property is of particular relevance in survey sampling where fixed-size sampling designs play a major role, see Section 4. iii) [stability by restriction] The restriction of X to a subset F of E remains a DPP with kernel matrix K F . iv) [stability by complement] The complement of X in E, i.e.X = E \ X, is also a DPP and its kernel matrix is I − K. v) [stability by conditioning] If we condition X to contain all elements of a subset F of E and/or to exclude all elements of a subset G of E (where F ∩ G = ∅), then the resulting process on E \ (F ∪ G) remains a DPP with explicit kernel, see [23,26]. vi) [simulation] The last property makes possible exact simulation of DPPs. The procedure, detailed below, basically amounts to generating a first element for X, then a second element given the first one, then a third element given the first two, and so on. A generic algorithm to sample from a DPP is given in [21]. We start from an orthonormal eigendecomposition of K, where v * i denotes the conjugate transpose of v i , that for generality we assume to possibly be complex (v i ∈ C N ). The procedure presented in [21] exploits the fact that a DPP is in fact a mixture of determinantal projection processes. Specifically, the DPP X with kernel K has the same distribution as the DPP with kernel where the B i 's are Bernoulli random variables with mean λ i [21] (whence property ii) above). The algorithm thus consists in first selecting a determinantal projection process composing this mixture, and second sequentially generating a realization of this process using the conditional properties of DPPs. The algorithm is summed up as Algorithm 1.
The projection in the second step of Algorithm 1 is the composition of successive projections, each on the orthogonal complement of V (k), for all element k selected in the previous steps. In other words, at an intermediate step of the algorithm we have .
• Denote by V the matrix whose columns are the selected vectors, by n their number, and by V (k) the vector corresponding to the k-th row of V .
) (the vector space of all linear combinations of V (k) and the elements of H). end Algorithm 1: Sampling from a finite DPP So P H ⊥ can simply be updated at each step by multiplication. However, this leads to some numerical instabilities since after some multiplications P H ⊥ can differ from a proper projection. A more stable alternative is to use a Gram-Schmidt procedure to sequentially construct an orthonormal basis of H, from which P H ⊥ is easily deduced. The details are given for the continuous case in Section 2.2 and its adaptation to the discrete case is straightforward.
Further properties of DPPs on discrete sets are described in Section 4.

DPPs on a continuous space
We assume in this section that E = R d . This is in fact the historical setting where DPPs have been developed, see the seminal paper [27]. The initial motivation was to characterize the probability distribution of the locations of particles in physics, specifically for particles in repulsion (so-called fermions). We will come back to this interpretation later. For a detailed presentation of DPPs in a continuous setting, we refer to [21,24,37].
A general background on point processes on R d can be found in [9,29]. In brief, a point configuration is a subset of R d that has a finite number of elements on any bounded subset of R d . A point process in R d is a probability law on the set of all point configurations in R d . It is said simple if there is at most one point at each location, almost surely.
Let X be a simple point process on R d . For a bounded set D ⊂ R d , denote by X(D) the number of points of X in D. Let µ be a reference measure on R d that we assume for simplicity to be absolutely continuous with respect to the Lebesgue measure. If there exists a function ρ (k) : R dk → R + , for k ≥ 1, such that for any family of mutually disjoint subsets D 1 , . . . , then this function is called the joint intensity of order k of X, with respect to µ. From its definition, the joint intensity of order k is unique up to a µ-nullset. Henceforth, for ease of presentation, we ignore nullsets. In particular we will say that a function is continuous whenever there exists a continuous version of it. The joint intensity ρ (k) (x 1 , . . . , x k ) can be understood as the probability that X contains a point in each neighborhood of x i for i = 1, . . . , k. In this sense joint intensities are continuous analogues of the probabilities P (X ⊃ A) involved in the discrete case in (1).
We say that X is a DPP on R d if there exists a function C : R d × R d → C, called the kernel of X, such that for all k ≥ 1 for every (x 1 , . . . , In view of the interpretation of ρ (k) , this definition is the continuous counterpart of (1).
Let us briefly explain why the particular form in (4) arises naturally for the distribution of fermions in quantum mechanics. A similar description can be found in [22], see also [30,Section 5.4] for the physical aspects. Let us consider for simplicity the case of a bounded set where the number of points (hereafter particles) n is fixed, in which case ρ (n) reduces to the joint distribution of the location of these points, up to a constant. In quantum mechanics, the quantum state of a system of n particles is described by a complex-valued n-body wave function, say ψ(x 1 , . . . , x n ) where x 1 , . . . , x n stand for the position of each particle. The probability distribution of the position of particles then admits the density |ψ| 2 , which corresponds in our notation to ρ (n) (up to a constant). As a naive proposition, we might assume that this n-body wave function is simply the tensorial product of all individual wave functions ψ i (x i ). But since the particles are indistinguishable, this could equally well be ψ i (x π(i) ), where π is any permutation of {1, . . . , n}. Moreover, fermions repel and for this reason, in virtue of the so-called Pauli exclusion principle, the n-body wave function must satisfy the anti-symmetric property ψ(. . . , x i , . . . , x j , . . . ) = −ψ(. . . , x j , . . . , x i , . . . ) for any i, j. These requirements (indistinguishability and repulsiveness) led physicists to construct the wave function as an anti-symmetric version of the tensorial product, namely ψ(x 1 , . . . , x n ) ∝ π sgn(π)ψ 1 (x π(1) ) . . . ψ n (x π(n) ), which is the determinant of the matrix Ψ with entries ψ i (x j ). Finally we get |ψ| 2 proportional to det(Ψ) det(Ψ) = det(Ψ ) det(Ψ) = det(Ψ Ψ) = det C[x 1 , . . . , x n ] where C(x, y) = ψ i (x)ψ i (y).
Beyond quantum mechanics, DPPs surprisingly arise in many examples of probability, the most famous of them being the law of the eigenvalues of fundamental random matrix models [1]. For this reason, they have received a lot of attention from a theoretical point of view. From a statistical perspective, their increasing popularity to encode negative dependencies, as demonstrated in the following sections, is due as in the discrete case to their flexibility through the choice of the kernel C, and to their nice properties, mainly the same as in the discrete case, as discussed now.
A particular case of DPP is the Poisson point process with intensity ρ(x), which is associated to C(x, x) = ρ(x) and C(x, y) = 0 if x = y. This is in some sense the extreme case of a DPP without interaction, whereas a DPP implies in general repulsiveness. Assume that C is Hermitian, i.e. C(x, y) = C(y, x). It is not very restrictive for statistical applications to further assume that C is continuous, which only excludes from useful models the peculiar example of the Poisson point process. On any compact set S, C then admits a spectral decomposition where (φ S i ) forms an orthonormal basis of L 2 (S) with respect to µ. In this setting the existence of X satisfying (4) is equivalent to 0 ≤ λ S i ≤ 1 for any i ≥ 0 and any S [27,42]. In the particular case where µ is the Lebesgue measure and C is invariant by translation, i.e. there exists C 0 such that C(x, y) = C 0 (x − y), leading to a stationary DPP, this condition for existence takes a simpler form. Namely, it boils down to C 0 ∈ L 2 (R d ) and 0 ≤ ϕ ≤ 1, where ϕ denotes the Fourier transform of C 0 , see also Theorem 1 in Section 3. For a general kernel C, existence is ensured if C is a continuous covariance function and there exists C 0 as before such that C 0 (x − y) − C(x, y) remains a covariance function (see e.g. Corollary 3.2.6 in [41]).
As in the discrete case, DPPs on R d enjoy a lot of appealing properties, see [24,37]. Given C and a bounded set S, a DPP with kernel C is unique. The distribution of X(S) is known (this is the sum of Bernoulli variables with mean λ S i ). All joint intensities of X are known by the very definition. The density of X on S with respect to the standard Poisson point process on S is explicitly known. The restriction of a DPP to a subset S remains a DPP and its kernel is simply the restriction of C to S × S. If we condition X to contain a finite set of points, then the resulting process is a DPP with an explicit kernel.
By the last property, we can simulate X on any bounded set S, using the same algorithm by [21] as in the discrete case. The first step consists in sampling the indexes involved in the sum of the spectral decomposition (5) according to Bernoulli variables with mean λ i . Up to a re-ordering, let us denote by {1, . . . , n} the selected indexes and ) . The second step is akin to the Gram-Schmidt procedure, and is presented in Algorithm 2.

Initialize by
• sampling x n according to the distribution with density (||V (x)|| 2 /n), Algorithm 2: Sampling from a continuous DPP: 2nd step. See main text for details about the complete procedure.
The densities involved above have to be understood with respect to µ and simulations from them can be done by rejection sampling.
The algorithm of simulation and some of the properties of DPPs discussed above require to know the spectral decomposition (5). Unlike the finite case where the kernel is simply a matrix, this decomposition is in general unknown in the continuous case. To overcome this issue in practice, we can either define the kernel C directly through its spectral form (see Section 6), or apply some approximation as discussed in Section 3. Note finally that it is possible to do statistical inference on C without this spectral knowledge (see Sections 3 and 5).

DPPs in spatial statistics
In spatial statistics, a common concern is the analysis of spatial point patterns observed on a bounded subset of R d , the most common situation being d = 2. In presence of this kind of data, a first step usually consists in the estimation of the (first order) intensity ρ, to decide if it is constant (or equivalently homogeneous), in which case the points are uniformly distributed in the observation window, or if the intensity is inhomogeneous, in which case the points are more dense in certain regions of the observation window than elsewhere. In a second step, we may be interested in the interaction between the points. Three main situations may occur: independence (the case of the Poisson point process), aggregation (equivalently clustering), or inhibition. Due to their repulsiveness property, DPPs are well adapted models for inhibition. We explain below how parametric DPP models can be constructed, in which extent they constitute a flexible family of models, and how inference can be conducted.
In the spatial point process community, see for instance [29], the two first order moments of a point process are generally summarized by the intensity ρ and by the pair correlation function (pcf) If g(x, y) = 1, there is no (second order) interaction between two points located in a vicinity of x and y. If g(x, y) > 1, there is attraction, meaning that it is more likely than in the independent case to find a point nearby y if a point is already located nearby x. If g(x, y) < 1, there is inhibition. Following Section 2.2, a DPP with a Hermitian kernel C satisfies The expression of g confirms the inhibitive property of a DPP. These formulas also give a clear interpretation of the kernel C, that helps for the specification of C in a modeling purpose. Some conditions for existence are nonetheless necessary.
Let us first assume that the point process is stationary, which implies that ρ is constant and g(x, y) only depends on the difference y − x. For a DPP with a real valued kernel C, stationarity is equivalent to that C(x, y) only depends on y − x. If C is complex valued, the latter condition also implies stationarity of the associated DPP, but this condition is only sufficient, an example being the Ginibre DPP studied in Section 5, which is stationary while its kernel is not invariant by translation. The following result, proved in [24], gives simple sufficient conditions for existence of the DPP when C(x, y) only depends on y − x.
Then the existence of a DPP with kernel C is equivalent to that the Fourier transform of C 0 is less than 1.
The construction of parametric families of DPPs thus becomes straightforward. One only needs to choose a parametric family of covariance functions C 0 and to compute its Fourier transform in order to check the condition for existence. This condition then becomes a constraint on the parameters of the family. Note that the advent of a constraint is expected for repulsive models, as the range of repulsiveness has to be somehow balanced by the mean number of points. As an extreme example, there is an upper bound on the number of hard balls with a given radius that one can include in a bounded region. Many examples of parametric covariance functions having an explicit Fourier transform are already known: Gaussian, Whittle-Matèrn, generalized Cauchy, Bessel-type, ... We refer to [6,24] for the expression of these covariance functions and of their Fourier transform, and the associated constraint on the parameters for existence.
Remark. An alternative parametric family is the β-Ginibre DPP, studied in Section 5, where the kernel takes complex values. Note that this family has exactly the same second order properties (same ρ, same g) as DPPs with a Gaussian kernel. However, due to its central role in random matrix theory, some properties are known for the β-Ginibre DPP that are not available for the Gaussian DPP family, as for instance the expression of the J-function (see Section 5). This opens new possibilities for inference, see Section 5.
As we are interested in parametric DPPs to model point patterns exhibiting inhibition, it is natural to wonder whether DPPs form a flexible class of models. The least repulsive DPP is the Poisson point process, a peculiar case of DPP without interaction, see Section 2. The most repulsive stationary DPP with intensity ρ has been determined in [6], where the repulsiveness is quantified through the behavior of the pair correlation. It corresponds to the DPP whose kernel has a Fourier transform equal to 1 on the ball centered at the origin with volume ρ. In dimension 1, this corresponds to the sinc kernel, C 0 (x) = sin(πρ|x|)/(π|x|), while in dimension 2, it is sometimes called the jinc kernel given by where J 1 denotes the Bessel function of the first kind of order 1. The expression in any dimension can be found in [6]. Figure 1 shows the pcf of DPPs with kernels from the Whittle-Matèrn, the generalized Cauchy and the Bessel-type families. This plot illustrates that the most repulsive DPP within the first two families corresponds to the Gaussian kernel, whose pcf is represented in bold dashed line, and a realisation of which is shown on the top left plot of Figure 1. Even if the point pattern clearly exhibits some regularity (or inhibition), it is less regular than a realisation from the most repulsive DPP shown in the bottom left plot of Figure 1. The pcf of the latter DPP is represented in bold solid line. It is actually part of the Bessel-type family, which appears as a more flexible parametric family than the Whittle-Matèrn and the generalized Cauchy families.
The previous analysis demonstrates that DPPs cover a large range of repulsiveness, from the Poisson point process to the most repulsive DPP with kernel (7), opening promising possibilities for modeling. However, it also reveals that DPPs cannot be extremely repulsive. In particular, a DPP cannot include a hardcore distance δ (this a consequence of [36], Corollary 1.4.13), a situation where no pairs of points can be at a distance less than δ.
Concerning inhomogeneous models, a common practice in spatial statistics is to assume the point process to be inhomogeneous at the first order only, meaning that the intensity ρ is not constant while the pcf g is invariant by translation. For DPPs, this can be done by taking with C 0 (0) = 1. Then the intensity is ρ(x) and the pcf is 1 − |C 0 (x − y)| 2 . The existence of the associated DPP is ensured whenever ρ(.) is bounded byρ > 0 and the DPP with kernel ρ C 0 (x − y) satisfies the condition of Theorem 1. Parametric models can thus be considered by assuming a parametric form for ρ(.), possibly depending on covariates, and to choose a parametric covariance model for C 0 as discussed earlier.
Assuming a parametric form like (8) where ρ(x) = ρ β (x) and C 0 (x) = C 0,θ (x) depend on two different parameters β and θ, we now discuss how to conduct statistical inference from an observation of X on a bounded set W . The standard procedure is to first estimate β by Poisson likelihood, that isβ In the homogeneous case of a constant intensity ρ (in which case β = ρ), this givesρ = X(W )/|W |. Second, the estimation of θ can be carried out by contrast estimating functions or estimating equations, typically based on second order characteristics of X like the pcf g. Note that because of (8), g depends on θ only and not on β. For instance, given a non-parametric estimateĝ of g (see [29] for an expression), we can estimate θ bŷ where r min and r max are user-specified parameters. A standard choice in practice is to set r min = 0.01 and r max as one quarter of the side length of the observation window. The theoretical justifications of this two step procedure to estimate β and θ can be found in [5], in the particular case of stationary DPPs. Other contrast estimating functions are possible, where g is replaced by another characteristic of the point process, see Section 5 for an example with the J function.
As an alternative to contrast estimation, likelihood inference can be conducted in some cases. This method is expected to be more efficient than the other methods of inference, at least asymptotically (that is when W is big enough). Although no theoretical justifications for DPPs support this claim so far (neither any justification of the consistency of likelihood estimation), this has been confirmed by some simulations, see [5,24]. However, to get the expression of the likelihood from (8), one needs to know the spectral representation of C on W , which is rarely available. If W is a rectangular set and ρ(x) is constant in (8), an approximation based on the Fourier transform of C 0 is proposed in [24]. If W is the unit square, this gives for any x, y ∈ W , where ϕ denotes the Fourier transform of C 0 and k.(x − y) is the inner product between k and (x−y). The case of a general rectangular set W can easily been deduced, see [24]. This approximation turns out to be quite accurate when ρ is not too small, see [24] for some justifications and a simulation study.
Several functions are available in the spatstat library [3] of R [32] to manipulate DPP models on a bounded subset of R 2 . They allow to define parametric models of homogeneous and inhomogeneous DPPs, to simulate them (using the spectral approximation presented above and the Gram-Schmidt algorithm described in Section 2.2), and to fit them to a real dataset.

Determinantal survey sampling
Survey sampling considers the following problem. One wants to acquire knowledge of a parameter of interest θ, function of {y k , k ∈ E}, where E = {1, · · · , N } is a finite population. A typical parameter θ is the sum t y (or the mean m y ) of y. Unfortunately, the values y i are inaccessible, except for a small part of the population. One therefore uses an estimator of θ constructed on a randomly chosen subpopulation. This choice is called the sampling design, whose properties are of crucial importance to get "good" estimators. In this section, we show that sampling designs based on DPPs, coined Determinantal Sampling Designs (DSDs) afterwards, enjoy must of the properties usually required for a sampling design. A good design should in particular meet the following requirements : • simplicity of the design and simple algorithmic construction, • control of the size of the sample, • statistical amenability (consistency, central limit theorem,...), • low mean square error of the estimator, see for instance [8].

Determinantal processes as sampling designs
A sampling design can always be defined as the law of a vector of Bernoulli variables. Let E = {1, · · · , N } denote the population. Then a sampling design is the law of B = (B i , 1 ≤ i ≤ N ), vector of Bernoulli variables, and the set X = {i ∈ E|B i = 1} is called the sample. This approach by Bernoulli variables is for instance used in [44]. Usually, except for very simple sampling designs, only the marginal law of each variable B i is known, and other quantities remain unknown, or have to be estimated. This is not the case for determinantal sampling designs, whose description is given below.
Consider X ∼ DP P (K) a DPP with kernel matrix K on E, as defined in Section 2, and let B = (1 i∈X , 1 ≤ i ≤ N ). Then B i = 1 i∈X is a Bernoulli variable, and (the law of) B is a sampling design. We call this design a determinantal sampling design with kernel K. We also note X ∼ DSD(K) to insist on the sampling interpretation. It then follows from the definition (1) of DPPs that P (Π i∈A B i = 1) = det K A .
Therefore, the whole law of the design is known through the principal minors of its kernel. As DPPs are indexed by symmetric matrices with eigenvalues in [0, 1], DSDs form a parametric family of sampling designs. The existence of a perfect simulation algorithm as described in Section 2 allows to implement these designs in practice.
We interpret below some other properties of DPPs described in Section 2 in terms of sampling theory. Let X ∼ DSD(K).
(1) The first order inclusion probabilities are the diagonal terms of K, P (i ∈ X) = K ii .
(3) The size of the sample X is known through the eigenvalues of K. In particular, X is of fixed size n (a property usually required in practice) iff K is a projection matrix of rank n. Also, DSD(K) samples at least one point iff 1 is an eigenvalue of K. (4) The restriction X F of X to a domain F ⊂ E is also a DSD, X F ∼ DSD(K F ).

(5) Recall that a sampling design is stratified if the population E can be decomposed as
It holds that a DSD(K) is stratified if and only if K admits a block matrix decomposition (after reordering of the population by strata). (6) Poisson Sampling is determinantal with K diagonal. But Simple Random Sampling (SRS) of size n, defined by P ((i 1 , ·, i m ) = X) = 1/ N n if m = n and 0 otherwise is not determinantal (unless, n = 0, 1, N − 1 or N ). (7) However, it is shown in [25] that for some couples (n, N ) there exists DSDs with the same first and second order inclusion probabilities as SRS.
We finally address in this section a typical requirement of statisticians regarding sampling designs: the construction of fixed size sampling designs with prescribed unequal first order inclusion probabilities. Let us explain briefly the reason behind this requirement. In survey samplings, one usually knows the values of a variable x (on the whole propulation) correlated with y. If you choose the π i proportional to x i , then one can check that the variance of the Horvitz-Thompson estimator is proportional to the variance of the number of points of the design. Thus a fixed size sampling design with π i ∝ x i will achieve perfect estimation of x, and hopefully produce a low variance estimator of y. The Schur-Horn Theorem [20] allows to derive the following result: Theorem 2. Let Π be a vector of first order inclusion probabilities such that Then there exists X ∼ DSD(K) of fixed size n such that P (i ∈ X) = K ii = Π i .
Moreover, an explicit solution can be constructed. Simpler solutions for equal probability sampling designs (P (i ∈ X) = K ii = n N for all i ∈ E) are also constructed in [25].

Statistical properties
In this section we sum up the statistical properties of the Horvitz-Thompson estimatort y of a total t y = i∈E y i constructed upon those DSDs. The central limit theorem is based on [43], and the concentration inequality on [31]. Let X ∼ DSD(K). We pose (Horvitz-Thompson estimator)t y = i∈X ii and is the Schur-Hadamard (entrywise) product. (6) (Concentration (2)) If DSD(K) is of fixed size n, we can improve the previous inequality in The last three results allow to derive asymptotic as well as finite distance confidence intervals.
We conclude the section by stating three interesting consequences of the knowledge of the exact expression of the variance. As the Horvitz-Thompson estimator is unbiased, the variance of the estimator is a good indicator of the precision of the estimator, and one would tend to choose (if possible) a sampling design with low variance to increase the precision of the estimation.
(1) The quantity y T y − V(t y ) can be interpreted as a ponderated measure of global repulsiveness for point processes on a discrete space (see [6] in the continuous setting). As DPPs are repulsive, we then expect DSDs to achieve small variance within all sampling designs. This is validated by our empirical studies, see also Section 6 for results in the continuous setting. (2) Imagine y is known. Is there a "best" way to estimate its total by a determintal sampling design ? That is, can we minimize (at least theoretically) the variance oft y ? Assume the diagonal diag(K) of K is fixed. Then V(t y ) = y T y − (y/(diag(K)) T K K(y/(diag(K)) and minimizing the variance is equivalent to maximizing z T K Kz with z a fixed vector. This can be interpreted as a quadratic semidefinite optimization problem. Semidefinite optimization has attracted a lot of interest recently (mainly in the linear setting, see for instance [7], [46]). But even in the linear case, problems are challenging. (3) Nevertheless, we can solve the problem of exact estimation (V(t y ) = 0). The following result can be found in [25]: Theorem 3. The total t y is perfectly estimated byt y (t y = t y ) iff DSD(K) is a fixed size stratified determinantal sampling design with K −1 ii y i constant on each stratum.
Equivalently, t y is perfectly estimated byt y iff K is a block diagonal matrix, with each block a projection matrix with diagonal K ii = αy −1 i (the value α depends on the block).

The β-Ginibre point process and design of a cellular network
This section aims to validate the β-Ginibre point process as a model for the distribution of base station locations in a cellular network, from real data collected in Paris, France.

Theoretical model: the β-Ginibre point process
The Ginibre point process (GPP) with intensity ρ = γ π (with γ > 0) is a determinantal point process on C whose kernel C γ is given for any x, y ∈ C by: If β is a real number between 0 and 1, the β-Ginibre point process (β-GPP) with intensity ρ = γ π is a determinantal point process on C whose kernel C γ,β is given for any x, y ∈ C by: A β-GPP may be built by combining two operations on a GPP: a thinning with parameter β (one keeps each point independently with probability β) then a rescaling with parameter √ β, such that we keep the same intensity. Hence, the parameter β provides an information concerning the degree of repulsiveness of the point process: the smaller β is, the less repulsive the β-GPP is. Note that such a point process is not defined for β > 1. One can observe in Figure 2 some realizations of a Poisson point process (PPP) and β-GPPs for different values of β.
These point processes were investigated in the wireless communication field : they were at first introduced by Shirai et al. [38] in quantum physics to model fermion interactions. Works of Miyoshi et al. [28] and Deng et al. [13] have derived a computable representation for the coverage probability in cellular networks -the probability that the signal-to-interference-plus-noise ration (SINR) for a mobile user achieves a target threshold.

Statistical analysis
In this subsection, we use real data from the mobile network in Paris to show that base station locations can be fitted with a β-GPP [17]. We introduce the fitting method that allows to obtain the parameter β and also present the results from the fitting of each deployment and operator. In order to fit the mobile network deployment on the β-Ginibre model, we use the J-function. The J-function of a stationary point process X on R d is defined for any r ∈ R + by: where F is the empty space function of X and G its nearest-neighbor distance distribution function, defined for some u ∈ R d and any r ∈ R + by: The J-function provides both a characterization of the point process and a direct information about its attractiveness or repulsiveness. More precisely, when J < 1, X is attractive, otherwise X is repulsive. The equality J ≡ 1 characterizes the PPP, where there is no interaction between the particles. For the case of the β-GPP, we get from [13] the following proposition.
Proposition 5.1. The J-function of the β-GPP with intensity γ π is given for any r ∈ R + by: Note that for any β this J-function is bigger than one, which confirms that the β-GPP is a repulsive point process. When β tends to 0, this expression tends to 1, which corresponds to the J-function of a PPP.
This J-function allows to validate the β-GPP as a distribution model of the repartition of the base stations for each operator and each technology.
We use the spatstat package to obtain an estimate of the J-function from the raw data. Since we consider only a finite set of antennas, edge effects might appear on the J-function estimate. We then have to keep a subset of the data to perform the estimation. Figure 3 gives the window we considered for extracting data in Paris, France.
This window is chosen such that it covers about 60 % of the city and that its shape matches the geographical borders. The values of the J-function estimate are computed for r ≤ 600 m. Above 600 m, the estimation is not relevant due to the edge-effect. J is then directly fitted on the estimate and the parameter β is deduced. An example of fitting is given in Fig. 4. Visual inspection reveals a clearly repulsive behaviour of the base stations locations and a good fit to the theoretical model, especially when compared to the unit J-function of a PPP.
Numerical values of the parameter β and the intensity ρ from the fitting are given in Table  1 for each operator and each technology, and in Table 2 for each operator. Each intensity ρ is simply computed using the number of corresponding base stations in the window. The  parameter β is then computed by the method of least squares applied to the J-function of the β-GPP and its estimation. Data analysis also shows that the superposition of all sites is tending to a PPP as β is equal to 0.17, which will be confirmed by the asymptotic results mentioned in the part 5.3. We hence show that base stations distribution for an operator and for a technology can be fitted with a β-GPP in the Paris area. The distribution of all base stations of all operators may therefore be considered as an independent superposition of β-GPP, but can actually be fitted with a PPP.

Asymptotics
In order to justify in a theoretical way the Poisson behavior of the independent superposition of the β-GPPs, we put in this subsection the corresponding asymptotic results. Assume that the space E is endowed with its Borel space B(E) and denote by N E the space of configurations on E. For a point process X on E, the application c : E × N E → R + is a Papangelou intensity [16] of X if, for any measurable function u : Intuitively, if x ∈ E and ξ ∈ N E , c(x, ξ) represents the conditional probability of finding a particle in the location x given the configuration ξ. It allows to propose a new definition for repulsiveness: a point process X on E with Papangelou intensity c is said to be repulsive (in the Papangelou sense) if, for any ω, ξ ∈ N E such that ω ⊂ ξ and any x ∈ E, c(x, ξ) ≤ c(x, ω), and weakly repulsive (in the Papangelou sense) if, for any ξ ∈ N E and any x ∈ E, The next proposition gives an explicit expression for the Papangelou intensity of a DPP [16]. If C is the kernel of a DPP, denote by T C the functional operator defined for any f ∈ L 2 (E, µ) and any x ∈ E by Proposition 5.2. Let X be a DPP on E with kernel C such that the operator T H = (Id − T C ) −1 T C is well-defined. Then, its Papangelou intensity c is given for any x 0 , x 1 , . . . , x k by: Moreover, X is repulsive in the Papangelou sense.
We now introduce the topology on point processes which is used in this part [11]. The total variation distance d TV is defined for any measures ν 1 , ν 2 by: we say that a map h : N E → R is 1-Lipschitz according to d TV if for any ω 1 , ω 2 ∈ N E , and denote by Lip 1 (d TV ) the set of all such maps which are measurable.
The Kantorovich-Rubinstein distance d * TV associated to d TV between two point processes X 1 and X 2 is defined as: where the supremum is over all h ∈ Lip 1 (d TV ) that are integrable with respect to the distributions of X 1 and X 2 . This distance provides a strong topology on the space of point processes: in particular, it is strictly stronger than convergence in law. A counterpart of this use is that we can only consider sequences of finite point processes. The following theorem [12] gives an upper bound for this distance taken between a finite PPP and an other point process. Using Laplace transforms, we are able to establish the convergence of a superposition of independent β-GPPs to a PPP and that a β-GPP tends to a PPP as β tends to 0, where these two results are given for convergence in law. The previous upper bound theorem allows to give more precise results [12] provided by the following propositions.
Proposition 5.4. For any n ∈ N, let X n the superposition of n independent, finite and weakly repulsive point processes X n,1 , . . . , X n,n , with respective joint intensities ρ (k) n,1 , . . . , ρ (k) n,n (k ≥ 1) and let Z be a Poisson point process with control measure M (dx) = m(x)µ(dx). Then, Corollary 5.5. Under the assumptions and notations of the Proposition 5.4, and assuming moreover that there exists a real constant A such that for any n ∈ N, one has for any n ∈ N, Proposition 5.6. Let C be the kernel of a stationary determinantal point process X on R d with intensity ρ ∈ R, Λ be a compact subset of R d , (β n ) n∈N ⊂ (0, 1) N and Z Λ,ρ designs the homogeneous Poisson point process with intensity ρ reduced to Λ. For any n ∈ N, X n is the point process on R d obtained by combining a β n -thinning with a β n -rescaling on the point process X that one reduces to Λ. More precisely, X n is the determinantal point process with kernel C n defined by Then,

DPPs for Monte Carlo integration
DPPs have also been used in the design of Monte Carlo numerical integration methods [4]. In this section, we give an informal description of [4], insisting on the link with Gaussian quadrature.
For the purpose of this section, given a probability density π over R d , a numerical integration -or quadrature-method is defined as: • a rule to build nodes x 1 , . . . x N ∈ R d , • a rule to compute weights w i for 1 ≤ i ≤ N , for a large class of functions f . Monte Carlo methods are defined as quadrature methods where the nodes are the realizations of a collection of N random variables. In traditional Monte Carlo approaches [34], nodes (x i ) are drawn independently, such as in importance sampling, or using a Markov chain, such as in Markov chain Monte Carlo (MCMC) algorithms. Laws of large numbers and central limit theorems then guarantee (10), leading to asymptotic confidence intervals for the integral of width O(N −1/2 ). This rate is often dubbed the Monte Carlo rate.
The question naturally arises whether we could obtain a faster rate, i.e. smaller confidence intervals, by introducing more structure in the rule used to sample the nodes (x i ). Intuitively, forcing the nodes to spread evenly across R d should reduce the variance of the LHS of (10), as for each draw of the nodes (x i ) the whole support of π is evenly covered. In contrast, independent or MCMC draws will tend to leave more "holes" in the support of π, and these holes will be different for each draw, resulting in a large variance of the LHS of (10).
Simultaneously, the nodes should concentrate where π puts a lot of mass, as failing to capture rapid variations of f in the modes of π will cost a lot in quadrature error. In a nutshell, a good quadrature method should solve the tradeoff sampling nodes in the modes of π vs. spreading nodes across R d to avoid holes. In this section, we explain how DPPs can satisfyingly solve this tradeoff. To understand how, we first make a detour by a two-century-old quadrature method.

Gaussian quadrature
For deterministic quadrature methods, Gauss [15] first realized that setting a regular grid over R was not the most economical way to integrate polynomials. More precisely, consider for a moment d = 1, and let (p k ) k≥0 be the result of applying Gram-Schmidt orthonormalization in L 2 (π) to the sequence of monomials q k : x → x k . We call (p k ) the orthonormal polynomials with respect to π [45]. Note that in particular, the degree of p k is k. Finally, let be the so-called Christoffel-Darboux kernel [40]. Then Gaussian quadrature is defined by setting (x i ) i=1,...,N to be the N zeros of p N , and the weights w i to be 1/K N (x i , x i ) for each 1 ≤ i ≤ N . Gaussian quadrature has the property to be exact whenever f is a polynomial function of degree up to 2N − 1, that is, (10) is an equality. Gaussian quadrature is thus expected to be accurate on functions that are limits of sequences of polynomials: Jackson's approximation theorem for algebraic polynomials, for instance, says that the error in (10) is O(1/N ) for f continuously differentiable. Although deterministic, Gaussian quadrature has the property we are looking after: nodes are well spread across the support of π, while more densely present in the modes of π. To understand why, denote by c k the leading coefficient of p k . Among all monic polynomials of degree k, p k /c k is the one with the smallest norm in L 2 (π), see e.g. [45,Section 3]. Consequently, the zeros of p k will tend to be far from each other, to make p k as flat and close to zero as possible.
Simultaneously, there will be more zeros where π puts a lot of mass, to make p k even closer to zero in the areas of R d that contribute a lot to squared norms in L 2 (π). These intuitions are demonstrated on Figure 5(a).
The main disadvantages of Gaussian quadrature are that (1) there is no generic way to adapt Gaussian quadrature to d ≥ 2. Outside very specific cases, one has to make the Cartesian product of sets of one-dimensional nodes, thus raising any bound on the quadrature error to the power 1/d [10]. (2) tight bounds on the error are scarce [47], specific to particular choices of π, and do not scale well with the dimension d. (3) orthonormal polynomials are rarely known beforehand and computing them requires computing the moments of π. This limits the applicability of the method, with most applications focusing on Jacobi measures π : x → (1 − x) a (1 + b) b for some a, b > −1.

Monte Carlo with DPPs
Looking back at Definitions (3) and (4), DPPs also have the potential to solve the tradeoff of "sampling in the modes of π but make points spread regularly". Choosing µ = π in (3), any  Figure 5. Comparison of quadrature rules. In each plot, π is the same product Jacobi measure, depicted in green on the marginal plots. The area of each marker is proportional to its weight w i in the corresponding estimator.
kernel K in (4) that makes the corresponding DPP exist, and denoting (x i ) a sample from the corresponding DPP, (4) implies that is an unbiased estimator of the RHS of (10). In particular, choosing K = K N the Christoffel-Darboux kernel (11) yields DPP samples of size N almost surely, and the estimator (12) takes the same form as Gaussian quadrature in Section 6.1, except the nodes are a sample from a DPP and not the zeros of p N . Actually, these two sets of nodes are asymptotically very similar [19], further confirming the intuition that Monte Carlo with a DPP using kernel (11) is a stochastic version of Gaussian quadrature. It turns out that stochasticity partly solves the issues of Gaussian quadrature: (1) the above DPP easily generalizes to general d, as long as π(x) = π 1 (x 1 ) . . . π d (x d ) is separable. For 1 ≤ ≤ d, take (p k ) k to be the orthonormal polynomials with respect to π , fix b : N → N d a bijection that orders d-uplets of integers, and consider the kernel on R d × R d defined by where (k 1 , . . . , k d ) = b(k). Then taking µ = π and K = K N in (3) and (4) leads to a DPP that exists and strictly generalizes the above one-dimensional construction. Figure 5(b) depicts a draw from this DPP. The weighted set of points in the DPP sample can be seen to leave fewer holes than i.i.d. sampling with the same marginals in Figure 5(c). (2) Most importantly, for any dimension d, one can prove a central limit theorem [4, Theorem 2.7] for (12) that leads to confidence intervals of size O(N − 1 technically involved and requires assumptions on π, f , and b. In particular, f should be continuously differentiable. But the reward is an asymptotic quantification of the quadrature error at a higher level of generality than Gaussian quadrature. Furthermore, the limiting variance in the central limit theorem [4, Theorem 2.7] then has a strikingly simple form, and measures how fast the Fourier coefficients of f π decrease. In other terms, non-smoothness of the integrand in (10) is paid in limiting variance. (3) if the moments of π are not known, or π is not separable, one can still draw from an instrumental DPP that satisfies the assumptions, and then change the weights w i accordingly. The same central limit theorem holds for this new estimator [4, Theorem 2.9].