STOCHASTIC APPROXIMATION SCHEMES FOR ECONOMIC CAPITAL AND RISK MARGIN COMPUTATIONS

Abstract. We consider the problem of the numerical computation by an insurance company or a bank of its economic capital, in the form of a value-at-risk or expected shortfall of its loss over a given time horizon. This loss includes the appreciation of the mark-to-model of the liabilities of the firm, which we account for by nested Monte Carlo à la Gordy and Juneja [17] or by regression à la Broadie, Du, and Moallemi [10].


Introduction
The current financial and insurance regulatory trends incentivize investment banks and insurance companies to charge to their clients, on top of a risk-neutral expectation of contractual cash flows, a suitable risk margin (see [11], [25]), meant to be gradually released to shareholders as a return on their capital at risk in the future. This risk margin, sometimes called market value margin (MVM) in insurance and corresponding in banking to a capital valuation adjustment (KVA, see [2]), can be modeled as an expectation of the future economic capital of the firm. Future economic capital is modeled in our paper as the conditional expected shortfall (ES) of the losses of the firm over a one-year horizon. These losses are assessed on a mark-to-model basis, which includes, at any future time point where the conditional expected shortfall is computed, the valuation one year later of the liabilities of the firm, such as variable annuities (VA) in the insurance case or a credit valuation adjustment (CVA) in the banking case, i.e. an expectation, conditionally on the information available one year later, of the future cash flows that the liability is pricing.
As such complex liabilities are typically intractable analytically and because the losses of the firm are specified through dynamic models of the underlying risk factors, in principle, the computation of the risk margin involves a nested and, in fact, a doubly nested simulation, whereby an outer Monte Carlo simulation gathers inner estimates of conditional expected shortfalls at future time points, themselves calling for recursive valuation one year later of the embedded liability. This makes it a challenging problem, both from a practical and from a convergence analysis point of view. In particular, on realistically heavy applications, such computations can only be implemented in parallel, with GPUs as a current hardware paradigm, which poses nontrivial programming optimization issues.
The assumptions made in Gordy and Juneja [17] for establishing the convergence of the simulation-and-sort value-at-risk and expected shortfall nested Monte Carlo estimates are hard to check (and might actually be violated) in practice, especially when considered dynamically in the context of risk margin computations. As the value-at-risk and expected shortfall of a given loss random variable can jointly be represented as zeros of suitable functions that can be written as expectations, an alternative is stochastic approximation (SA). In the base case without embedded liability of the firm, the convergence of the value-at-risk and expected shortfall SA estimates is established in Bardou, Frikha, and Pagès [7,8]. In the present paper this convergence is extended to the case of dependent noise, corresponding to the presence of the nested future liability of the firm in our loss variable. This is then applied to risk margin computations by embedding the resulting inner conditional ES estimates into an outer sample mean.
Moreover we analyze a variant of this approach where the future liabilities are regressed as in Broadie, Du, and Moallemi [10], rather than re-simulated in a nested fashion, resulting in a simply nested procedure for the overall risk margin computation.
The different variants of the method are tested numerically, using GPU programming so that the inner conditional risk measures can be computed in parallel and then averaged out for yielding the outer risk margin estimate.
Beyond the extension of the base result of [7,8] to dependent noise and its economic capital and risk margin application, we refer the reader to the concluding section of the paper regarding the technical contributions of our approach with respect to [17] and [10].
The paper is organized as follows. Section 1 presents our stochastic approximation value-at-risk and expected shortfall algorithms in the presence of dependent noise, with nested Monte Carlo versus regression estimates of the latter in the respective cases of Algorithms 1 and 2 (whereas Algorithm 0 corresponds to the base case without dependent noise). Sections 2 and 3 deal with the convergence analyses of the respective Algorithms 1 and 2. Section 4 casts such estimates in a dynamic setup, integrating out the estimated conditional economic capital in the context of an outer simulation for the corresponding risk margin; this is then illustrated numerically in the context of a KVA case study. Section 5 concludes.
Remark 0.1. In the motivating discussion above and in our application Section 4, for concreteness, we focus on economic capital, modeled as expected shortfall, and on the ensuing risk margin. However, the results of Sections 2 and 3 cover both expected shortfall and value-at-risk (establishing convergence for the latter is in fact a prerequisite for the former). Hence, our results also cover the cases of value-at-risk, conditional value-at-risk, and integration of the latter in the context of an outer expectation. Again, this can be relevant both for banks and for insurance companies, noting that:
• In the insurance case, solvency capital is determined as the 99.5% value-at-risk of the one-year loss of the firm under Solvency II (see [11]), and as the 99% expected shortfall under the Swiss Solvency Test (see [25]);
• In the banking case, Basel II Pillar II defines economic capital as the 99% value-at-risk of the depletion over a one-year period of core equity tier I capital (CET1), where the latter corresponds to the one-year trading loss of the bank as detailed in [2, Section A.2]; but the FRTB required a shift from 99% value-at-risk to 97.5% expected shortfall as the reference risk measure in capital calculations. Moreover, value-at-risk is relevant to banks for the computation of their initial margin (with a time horizon of one or two weeks, as opposed to one year conventionally in the paper) and, in turn, of their dynamic (conditional) initial margin (see [3]) in the context of the computation of their margin valuation adjustment (MVA).

Stochastic Algorithms for Economic Capital Calculations
On some probability space (Ω, A, P), our financial loss L is defined as a real-valued random variable of the form

L = φ + β E_0[ψ | Z_1] − E_0[ψ'],     (1.1)

where (β, φ, ψ, ψ') are four real-valued random variables and (Z_0, Z_1) are two R^q-valued random variables such that, under P_0, the conditional probability measure P given Z_0 (with related expectation and variance denoted by E_0 and Var_0):
• i.i.d. samples from (φ, β, Z_1) given Z_0 are available;
• i.i.d. samples from the conditional distribution of ψ given Z_1, denoted by Π(Z_1, ·), are available;
• i.i.d. samples from the conditional distribution of ψ' given {Z_0 = z}, denoted by Π'(z, ·), are available;
• the discount factor β is bounded: there exists a positive constant c_β such that |β| ≤ c_β.
We denote by P(z, ·) and Q(z, ·) the distributions of Z_1 and L conditionally on Z_0 = z. We also write Ψ(Z_1) := E_0[ψ | Z_1] and Ψ'(z) := E_0[ψ'], so that

L = φ + β Ψ(Z_1) − Ψ'(z).     (1.2)

In the financial application, the second and third terms in (1.1) will be used for modeling the future (at time 1, conventionally one year) and present (time 0) liability valuations, whereas the first term corresponds to the realized loss of the firm on the time interval [0, 1]. The above-listed assumptions allow recovering E_0[ψ | Z_1] by nested Monte Carlo simulation restarting from time 1 (which is the approach in [17]) or by empirical regression at time 1 (which is the approach in [10]), whereas E_0[ψ'] can be obtained by a standard Monte Carlo simulation rooted at (0, z). A value-at-risk ξ at level α of the random variable (loss) L solves the equation

P_0(L > ξ) = 1 − α;     (1.4)

it is uniquely defined if L has an increasing P_0 c.d.f. F, e.g. if it has a nonvanishing P_0 density f. Given a solution ξ to (1.4), the expected shortfall χ at level α solves the equation

χ = ξ + (1 − α)^{−1} E_0[(L − ξ)^+].     (1.5)

We model economic capital (EC) at time 0 (known in the insurance regulation as the solvency capital requirement, SCR) as the expected shortfall at level α ∈ (1/2, 1) of the distribution of L given Z_0 = z, i.e.

ES(z) := ES^α_0[L], where ES^a_0[L] := E_0[L | L ≥ VaR^a_0[L]].     (1.7)
In (1.7), VaR^a_0[L] is a corresponding value-at-risk at level a. Throughout the paper, α is fixed, so the dependence of ES(z) on α is omitted. Likewise, we introduce the notation VaR(z) for the value-at-risk at the (fixed) level α of L.

Stochastic Approximation (SA) With Dependent Noise
We propose two approaches for computing ES(z). Both estimates ÊS(z) are defined as the output of a stochastic approximation (SA) algorithm with K iterations. In the applications targeted in this paper, the expectations in (1.4) and (1.5) are not known analytically, so that the quantities (ξ_*, χ_*) are roots of intractable functions. SA algorithms provide a numerical solution to (1.4)-(1.5) (see e.g. [9, 20]): given a deterministic stepsize sequence {γ_k, k ≥ 1} and a sequence {L_k, k ≥ 1} of i.i.d. random variables with distribution Q(z, ·), we define iteratively, starting from (ξ_0, χ_0),

ξ_k = ξ_{k−1} − γ_k (1 − (1 − α)^{−1} 1_{L_k ≥ ξ_{k−1}}),
χ_k = χ_{k−1} − γ_k (χ_{k−1} − ξ_{k−1} − (1 − α)^{−1} (L_k − ξ_{k−1})^+),     (1.8)

to be compared with the empirical quantile specification ξ̂_k := inf{x ∈ R : F_k(x) ≥ α}, where F_k denotes the empirical c.d.f. of L_1, …, L_k. The (almost-sure) limit (ξ_∞, χ_∞) of any convergent sequence {(ξ_k, χ_k), k ≥ 0} is a solution to

P_0(L > ξ_∞) = 1 − α,   χ_∞ = ξ_∞ + (1 − α)^{−1} E_0[(L − ξ_∞)^+].     (1.6)

Therefore, any limit is a pair of solutions to (1.4)-(1.5). In particular, χ_∞ = ES(z).
However, in our case, i.i.d. samples from the law of L are not available, because of the quantities E 1 [ψ] and E 0 [ψ ] in L, which are not explicit. Therefore, we propose to replace exact sampling of L by approximate sampling. Toward this aim, we introduce two strategies.
Given i.i.d. draws {(φ^k, β^k, Z_1^k), k ≥ 1} with the same distribution as (φ, β, Z_1) conditionally on Z_0 = z, the first strategy consists in replacing the draws {L_k, k ≥ 1} in (1.8) by

L_k := φ^k + β^k M_k^{−1} Σ_{m=1}^{M_k} ψ^{k,m} − (M'_k)^{−1} Σ_{m=1}^{M'_k} ψ'^m,     (1.9)

where the ψ^{k,m} are i.i.d. draws from Π(Z_1^k, ·) and the ψ'^m are i.i.d. draws from Π'(z, ·). Of course, conditionally on Z_0 = z, the second average in (1.9) can be updated at each step k using only the corresponding partial sum at step k − 1 and the samples {ψ'^m, M'_{k−1} < m ≤ M'_k}.
The second strategy consists in replacing the draws {L_k, k ≥ 1} in (1.8) by

L_k := φ^k + β^k Ψ̂(Z_1^k) − (M'_k)^{−1} Σ_{m=1}^{M'_k} ψ'^m,     (1.10)

where the first and last terms are as before and where Ψ̂(·) is a regression-based estimator of the function Ψ(·), computed prior to and independently from the Z_1^k; the required accuracy of Ψ̂ is measured in L²(P(z, ·)) (recall that P(z, ·) denotes the conditional distribution of Z_1 given Z_0 = z).
The advantage of the first approach is that, under sufficiently good convergence hypotheses for the nested averages (see the assumptions of Theorem 2.2), the approximation of ES(z) can be made asymptotically as accurate as desired. On the other hand, the regression-based approach requires prior knowledge of the global behavior of Ψ (as an element of a certain function space) in order to yield approximations with small bias (see Theorem 3.3), which is essential for good asymptotics in our error analysis. Nevertheless, the second strategy has a lower computational cost than the first one (at least for large values of the M_k in (1.9)). This can be a significant advantage if we indeed know which function space can serve to build a good predictor of the function Ψ.
Algorithmic summaries of these two strategies are given in the respective Sections 1.3 and 1.4. In Section 1.2, for pedagogical purposes, we start by recalling essentially known results in the base case where ψ = ψ' = 0.

Base-case Without Present and Future Liabilities
Algorithm 0: Estimates of VaR(z) and ES(z) in the base case without present and future liabilities (ψ = ψ' = 0).
  for k = 1, 2, …, K do
    Sample φ^k with the same distribution as φ conditionally on Z_0 = z, independently from the past draws;
    Set L_k := φ^k;
    /* Update the conditional VaR and ES estimates */
    Compute (ξ_k, χ_k) from (ξ_{k−1}, χ_{k−1}) and L_k via (1.8);
  end

Note that, when ψ ≡ ψ' = 0, the random variables {L_k, k ≥ 1} are i.i.d. with distribution Q(z, ·). Therefore, sufficient conditions on this distribution and on the sequence {γ_k, k ≥ 1} for the almost-sure convergence of {ξ_k, k ≥ 0} to VaR(z) and of {χ_k, k ≥ 0} to ES(z) can be obtained by application of standard results for stochastic approximation algorithms. By application of Theorem A.3 and Lemma A.2 in Appendix A.2, we prove in Theorem 1.2 that the algorithm produces a sequence {(ξ_k, χ_k), k ≥ 1} converging to a solution pair of (1.4)-(1.5), where L ∼ Q(z, ·). Hence, ξ_K is a strongly consistent estimator of a value-at-risk at level α of the distribution Q(z, ·), while χ_K is a (strongly) consistent estimator of the associated expected shortfall.
More precisely, these convergences are established under the following assumptions.
H1. {γ_k, k ≥ 1} is a (0, 1)-valued deterministic sequence such that, for some κ ∈ (0, 1],

Σ_k γ_k = +∞ and Σ_k γ_k^{1+κ} < +∞.

H1 is standard in stochastic approximation and is satisfied, for example, with γ_n ∼ γ_*/n^c and c ∈ (1/2, 1]. The condition H2 essentially makes it possible to characterize the set of the limiting points of the algorithm and to prove that the stochastic approximation algorithm is a perturbation of a discretized ODE with a controlled noise.

Theorem 1.2. Let {(ξ_k, χ_k), k ≥ 0} be the output of Algorithm 0. Assume H1 and H2. Then there exist a bounded random variable ξ_∞ and a real number χ_∞ satisfying (1.6) P_0-a.s. and such that, for any p ∈ (0, 2), ξ_k → ξ_∞ and χ_k → χ_∞, P_0-a.s. and in L^p(P_0).

The proof of this result, which is very close to [7, Theorem 1], is detailed in Appendix A.3. The proof consists in first proving the almost-sure convergence of the sequence {ξ_k, k ≥ 0} toward the set of solutions of (1.4) by applying classical results on the convergence of stochastic approximation schemes; for the sake of completeness, these results are stated and proved as Theorem A.3 in Appendix A.2. We then deduce the convergence of the sequence {χ_k, k ≥ 0} by using the fact that χ_k can be written as a weighted sum of the samples {(L_j, ξ_j), 0 ≤ j ≤ k} (see Lemma A.2 in Appendix A.1).
Remember that although the set of solutions ξ to the equation 1 − α = ∫_ξ^∞ Q(z, dx) might not be a singleton (when VaR(z) is not unique), ES(z) is unique; see Lemma A.1 in Appendix A.3.
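As a concrete illustration, the base-case recursion (1.8) can be sketched in a few lines of Python. This is a hedged sketch only: the toy standard-Gaussian loss, the function name `sa_var_es`, and the tuning constants are our own choices, not taken from the paper.

```python
import numpy as np

def sa_var_es(sample_loss, alpha=0.975, K=300_000, gamma0=0.5, c=0.75, seed=0):
    """Algorithm 0 sketch: joint SA estimation of VaR(z) and ES(z) from
    i.i.d. loss draws L_k ~ Q(z, .), following the recursion (1.8)."""
    rng = np.random.default_rng(seed)
    xi = chi = 0.0
    for k in range(1, K + 1):
        gamma = gamma0 / k**c                 # stepsize, cf. H1: gamma_k ~ gamma*/k^c
        L = sample_loss(rng)
        # VaR update: stochastic root-finding for the quantile equation (1.4)
        xi -= gamma * (1.0 - (L >= xi) / (1.0 - alpha))
        # ES update: chi_k tracks xi + E[(L - xi)^+] / (1 - alpha), cf. (1.5)
        chi -= gamma * (chi - xi - max(L - xi, 0.0) / (1.0 - alpha))
    return xi, chi

# Toy check with a standard Gaussian loss, where VaR and ES are known in
# closed form (VaR ~ 1.960 and ES ~ 2.338 at level alpha = 97.5%).
xi_hat, chi_hat = sa_var_es(lambda rng: rng.standard_normal())
```

With a decaying stepsize satisfying H1, the pair (xi_hat, chi_hat) settles near the unique solution of (1.4)-(1.5) for this toy loss.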

With Future Liability Estimated by Nested Monte Carlo
Algorithm 1: Estimates of VaR(z) and ES(z) with future liability estimated by nested Monte Carlo.
  for k = 1, 2, …, K do
    Sample (φ^k, β^k, Z_1^k) with the same distribution as (φ, β, Z_1) conditionally on Z_0 = z, independently from the past draws;
    Sample ψ^{k,m}, 1 ≤ m ≤ M_k, from Π(Z_1^k, ·) and ψ'^m, M'_{k−1} < m ≤ M'_k, from Π'(z, ·), independently from the past draws;
    Set L_k as in (1.9);
    /* Update the conditional VaR and ES estimates */
    Compute (ξ_k, χ_k) from (ξ_{k−1}, χ_{k−1}) and L_k via (1.8);
  end

Note that the random variables {L_k, k ≥ 1} have the same distribution, but this distribution is not Q(z, ·), the distribution of L given by (1.1): there is a bias which, roughly speaking, can be made as small as desired by choosing M_k and M'_k large enough.
We provide in Section 2.1 sufficient conditions on Q(z, ·) and on the sequences {γ_k, k ≥ 1}, {M_k, k ≥ 1}, {M'_k, k ≥ 1} for the P_0-a.s. convergence of {ξ_k, k ≥ 0} to VaR(z) and of {χ_k, k ≥ 0} to ES(z). We also provide convergence rates in Section 2.2 and show the benefit of considering the averaged outputs K^{−1} Σ_{k=1}^K ξ_k and K^{−1} Σ_{k=1}^K χ_k as estimators of VaR(z) and ES(z).
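A minimal Python sketch of Algorithm 1 follows, under the same toy conventions as in the base-case sketch. The helper names `sample_outer`, `sample_inner_psi`, `sample_psi_prime` and all tuning constants are hypothetical choices of ours: the exact draws L_k are replaced by the nested estimates (1.9), with inner sample sizes M_k = M'_k growing along the iterations and the ψ'-average updated through a running partial sum.

```python
import numpy as np

def sa_var_es_nested(sample_outer, sample_inner_psi, sample_psi_prime,
                     alpha=0.975, K=30_000, gamma0=0.5, c=0.75,
                     m0=5, mu=0.5, seed=0):
    """Algorithm 1 sketch: SA recursion (1.8) driven by the nested
    Monte Carlo draws (1.9), with M_k = M'_k = ceil(m0 * k**mu)."""
    rng = np.random.default_rng(seed)
    xi = chi = 0.0
    psi_p_sum, n_p = 0.0, 0                 # running partial sum for Psi'(z)
    for k in range(1, K + 1):
        M = int(np.ceil(m0 * k**mu))
        phi, beta, Z1 = sample_outer(rng)   # (phi^k, beta^k, Z_1^k) given Z_0 = z
        inner = sample_inner_psi(rng, Z1, M).mean()      # estimates Psi(Z_1^k)
        if M > n_p:                         # only the new psi' samples are drawn
            psi_p_sum += sample_psi_prime(rng, M - n_p).sum()
            n_p = M
        Lk = phi + beta * inner - psi_p_sum / n_p
        gamma = gamma0 / k**c
        xi -= gamma * (1.0 - (Lk >= xi) / (1.0 - alpha))
        chi -= gamma * (chi - xi - max(Lk - xi, 0.0) / (1.0 - alpha))
    return xi, chi
```

On the toy model L = Z_1 − E[ψ'] with Z_1 standard Gaussian, ψ | Z_1 ~ N(Z_1, 1), and ψ' ~ N(1, 1), the scheme should recover approximately VaR(z) ≈ 0.96 and ES(z) ≈ 1.34 at α = 97.5%.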

With Future Liability Estimated by Regression
The regression approach relies on the following observation: the function Ψ in (1.10) satisfies

Ψ = arg min_{g ∈ H} E_0[(ψ − g(Z_1))²],     (1.11)

where H denotes the set of Borel measurable, P(z, ·)-square integrable functions from R^q to R. Since the expectation in (1.11) is not explicit but sampling from the conditional distribution of (ψ, Z_1) given Z_0 = z is possible, we define the estimate Ψ̂(·) as the solution of the empirical criterion associated with (1.11), replacing this expectation by a Monte Carlo average over i.i.d. samples. If furthermore we replace the 'complex' functional space H by a space Ĥ suitable for least squares estimation, typically a finite-dimensional vector space of functions (but not necessarily: e.g., a neural network could also be used), we obtain a version of (1.11) in which Ψ is approximated by the solution to a least squares regression problem. The best choice of Ĥ will depend on the specific problem at hand, typically on regularity assumptions regarding Ψ.
In order to make use of the distribution-free theory of nonparametric regression (as explained, for instance, in [18]), it is better to deal with bounded random variables so as to get nice statistical error estimates (through appropriate measure concentration inequalities). For this reason we consider the projection ψ_B := (ψ ∧ B) ∨ (−B) of the real-valued random variable ψ on the interval [−B, B], where B is a large threshold assumed to be known by the user, and we write Ψ_B(Z_1) := E_0[ψ_B | Z_1]. This gives rise to the following Algorithm 2 for the estimation of ES(z), using the embedded regression Algorithm 3.

Algorithm 2: Estimates of VaR(z) and ES(z) with future liability estimated by regression.
  Compute the regression estimator Ψ̂_B(·) by Algorithm 3;
  for k = 1, 2, …, K do
    Sample (φ^k, β^k, Z_1^k) from the conditional distribution of (φ, β, Z_1) given Z_0 = z, independently from the past draws;
    /* Update the conditional VaR and ES estimates */
    Compute (ξ_k, χ_k) from (ξ_{k−1}, χ_{k−1}) and L_k via (1.8);
  end

The resulting estimators suffer, in the limit (ξ_∞, χ_∞) of (ξ_K, χ_K) as K goes to infinity, from biases (or "deterministic errors") given, up to multiplicative constants, as the respective square and cube roots of

E_0[(Ψ̂_B − Ψ)²(Z_1)].     (1.13)

The control of (1.13) depends on analytic features of the problem at hand, typically on the regularity of Ψ for the choice of Ĥ and on the distribution of ψ for the choice of B (see [18, Chapter 10] for a general discussion). Putting everything together, these results tell us how we should choose the inputs (Ĥ, B, M, M') in order to make the limit (χ_∞, ξ_∞) of the (χ_K, ξ_K) as close as desired to the target values (χ_*, ξ_*).
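The regression step (Algorithm 3) can be sketched as follows, with a polynomial space standing in for Ĥ. This is a hedged illustration: the basis choice, the `degree` parameter, and the toy model ψ | Z_1 ~ N(Z_1, 1) are our assumptions, not the paper's.

```python
import numpy as np

def fit_psi_regression(rng, sample_pair, M=50_000, degree=3, B=10.0):
    """Algorithm 3 sketch: least-squares regression of the truncated
    response psi_B = (psi ^ B) v (-B) on a polynomial basis of Z_1
    (our stand-in for the finite-dimensional space H-hat)."""
    Z1, psi = sample_pair(rng, M)            # i.i.d. (Z_1, psi) given Z_0 = z
    psi_B = np.clip(psi, -B, B)              # truncation at the threshold B
    X = np.vander(Z1, degree + 1)            # basis functions of Z_1
    coef, *_ = np.linalg.lstsq(X, psi_B, rcond=None)
    # Return the truncated predictor g_B = sign(g) (|g| ^ B)
    return lambda z: np.clip(np.vander(np.atleast_1d(z), degree + 1) @ coef,
                             -B, B)

def sample_pair(rng, M):                     # toy model with Psi(Z_1) = Z_1
    Z1 = rng.standard_normal(M)
    return Z1, Z1 + rng.standard_normal(M)

Psi_hat = fit_psi_regression(np.random.default_rng(0), sample_pair)
```

The fitted Ψ̂_B then replaces Ψ in the draws (1.10) fed to the SA recursion (1.8).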

Convergence Analysis of the Economic Capital SA Algorithm 1 (Future Liabilities Estimated by Nested Monte Carlo)

Section 2.1 deals with the almost-sure convergence of Algorithm 1. Section 2.2 addresses the rate of convergence of Algorithm 1 along a converging sequence: a central limit theorem is established, as well as the rate of convergence when an averaging technique is applied to the output of Algorithm 1.

Almost-sure Convergence
The difference between Algorithm 0 and Algorithm 1 is that the pair (ψ, ψ') in (1.1) is nonzero. The expectations are intractable and are approximated by Monte Carlo sums. Hence, the random variables {L_k, k ≥ 1} in Algorithm 1 are no longer i.i.d. with distribution Q(z, ·). Nevertheless, when the number of Monte Carlo points tends to infinity, the Monte Carlo error vanishes, and it is expected that Algorithm 1 inherits the same asymptotic behavior as Algorithm 0, in which the L_k are i.i.d. with distribution Q(z, ·). We provide sufficient conditions for this intuition to hold. H5 strengthens H1 by showing how the stepsize γ_k and the numbers of Monte Carlo points M_k, M'_k have to be balanced; H3 echoes H2. H4 (see also H6) is introduced to control the bias between the distributions of the L_k and Q(z, ·).
We assume:

H3. Under P_0, the random variable L has a density with respect to the Lebesgue measure on R, bounded by C_0(z) > 0.

In addition:

H4. There exists p' ≥ 2 such that a suitable p'-th moment of the inner Monte Carlo noise in (1.9) is finite.
When c = 1, we have to choose c̄ ≤ 1 and µ > 0 (note that the last condition in (2.2) does not allow µ = 0). Therefore, the number of Monte Carlo points has to increase, even if slowly, along the iterations; this comes from the fact that the Monte Carlo bias has to vanish along the iterations to force Algorithm 1 to behave like Algorithm 0.
When c = 1/2, the slowest rate for M_k ∧ M'_k is µ = 1 + 1/p', and in that case, μ̄ > 1 + 1/p' and c̄ > 1. Therefore, the number of Monte Carlo points has to increase more than linearly with k.
When c ∈ (1/2, 1), the slowest rate for M_k ∧ M'_k is µ = 2(1 − c)(1 + 1/p'), and in that case, c̄ > 1/c. The above discussion makes it apparent that either we choose a rapidly decaying stepsize sequence, obtaining the lowest Monte Carlo cost, or we choose a slowly decaying stepsize sequence, in which case the number of Monte Carlo points has to increase more than linearly. It is known that, for implementation efficiency, a slowly decaying rate for γ_k is better during the burn-in phase of the algorithm (while it has not yet reached its asymptotic convergence rate).
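For concreteness, the stepsize / inner-sample-size balancing discussed above can be encoded as a small helper. This is a sketch only: the constants `gamma0` and `m0` are arbitrary, and the logarithmic corrections c̄, μ̄ are omitted.

```python
import math

def schedules(c, p_prime, gamma0=1.0, m0=5):
    """Return the stepsize schedule gamma_k ~ gamma0 / k^c together with
    the slowest admissible polynomial growth M_k = ceil(m0 * k^mu) for
    the inner Monte Carlo sample sizes, per the discussion above (H4
    with moment exponent p_prime; logarithmic factors omitted)."""
    if c == 1.0:
        mu = 0.1                       # any mu > 0 is admissible; pick one
    elif c == 0.5:
        mu = 1.0 + 1.0 / p_prime       # more than linear growth in k
    elif 0.5 < c < 1.0:
        mu = 2.0 * (1.0 - c) * (1.0 + 1.0 / p_prime)
    else:
        raise ValueError("need c in [1/2, 1]")
    gamma = lambda k: gamma0 / k**c
    M = lambda k: math.ceil(m0 * k**mu)
    return gamma, M

# Example: c = 1/2 with p' = 2 forces mu = 1.5, i.e. superlinear M_k.
gamma, M = schedules(0.5, 2.0)
```

The trade-off is visible directly: c = 1 allows an almost-constant inner budget, while c = 1/2 (slower stepsize decay) forces M_k to grow superlinearly.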
If H4 is strengthened into:

H6. There exists C_∞(z) > 0 such that, for any δ > 0 and any integer M, the corresponding exponential concentration bound holds,

then the above discussion on the choice of (c, µ) is essentially modified as follows (the choice of the logarithmic exponents c̄, μ̄ is not detailed): either c = 1 and µ > 0, or c ∈ [1/2, 1) and µ = 2(1 − c).
The following lemma is fundamental in the proof of Theorem 2.2. It makes it possible to control the error induced by drawing the samples L_k under a distribution approximating Q(z, ·) instead of sampling from Q(z, ·) itself. Its proof is postponed to Appendix A.4.

Lemma 2.1. Assume H3 and H4. Let L̃ be a random variable approximating L through the nested estimates (1.9). Then the distance between the distributions of L̃ and L is controlled in terms of M and M', where the universal constant c_{p'} involved depends only on p'. When H4 is replaced with H6, an analogous bound holds for any M, M' ≥ 3.

We can now prove that the output of Algorithm 1 provides strongly consistent estimators of VaR(z) and ES(z). The proof of the next theorem is postponed to Appendix A.4.
Theorem 2.2. Let {(ξ_k, χ_k), k ≥ 0} be the output of Algorithm 1. Assume H3, H4, and H5. Then there exist a bounded random variable ξ_∞ and a real number χ_∞ satisfying (1.6) P_0-a.s. and such that, for any p ∈ (0, 2), ξ_k → ξ_∞ and χ_k → χ_∞, P_0-a.s. and in L^p(P_0).

Rates of Convergence of Algorithm 1
We establish a rate of convergence in L² and a central limit theorem along a sequence {(ξ_k, χ_k), k ≥ 1} converging to (ξ_*, χ_*), where (ξ_*, χ_*) is a solution to (1.6); this solution is fixed throughout this section. These results are derived under the following conditions.
H7. (ξ_*, χ_*) solves (1.6). H3 holds and is strengthened as follows: under P_0, the density f(z, ·) of L is continuously differentiable in a neighborhood of ξ_* and strictly positive at ξ_*. In addition, there exists ν > 0 such that the associated moment condition holds. H4 is strengthened as follows: there exists p' > 2 such that the quantity in H4 is finite.
To make the assumptions simpler, we consider the case where the stepsize sequence {γ k , k ≥ 1} is polynomially decreasing.
where p' is given by H7.
When M_k ∧ M'_k ∼ m k^µ as k → ∞, the condition (2.7) is satisfied with µ > c(1 + 1/p'). In the case where the condition (2.6) is replaced with H6, the condition (2.7) is weakened accordingly.

Theorem 2.4 provides a central limit theorem, proving that, along converging paths, the normalized error γ_k^{−1/2}(θ_k − θ_*) behaves asymptotically as a Gaussian distribution.
Theorem 2.4. Assume H7 and H8. Let {θ_k, k ≥ 1} be the output of Algorithm 1. Then, under the conditional probability P_0(· | lim_q θ_q = θ_*), the sequence {γ_k^{−1/2}(θ_k − θ_*), k ≥ 1} converges in distribution to a centered bivariate normal distribution.

The proof of Theorem 2.4 is postponed to Appendix A.6. Lemma 2.3 is a consequence of [16, Lemma 3.1], applied to the same decomposition of θ_k − θ_* as in the proof of Theorem 2.4; details are omitted.
Theorem 2.4 shows that (i) the maximal rate of convergence is reached with a stepsize γ_k decaying at the rate 1/k as soon as γ_* is large enough (see H8 in the case c = 1); and that (ii) the limiting variance depends on γ_*. In practice, the condition on γ_* is difficult to check, since the quantity f(z, ξ_*) is unknown in many applications; in addition, it is known (see e.g. [9, Lemma 4, Chapter 3, Part I] or [16, Section 3]) that there is an optimal limiting variance Γ_* for an SA algorithm targeting the roots of the mean field. We prove in Theorem 2.5 that the optimal rate O(1/k) and this optimal limiting variance Γ_* can be obtained by a simple post-processing of the output of Algorithm 1 run with γ_k ∼ γ_* k^{−c} for some c ∈ (1/2, 1). The proof of Theorem 2.5 is postponed to Appendix A.6. This post-processing technique is known in the literature as Polyak-Ruppert averaging (see [21, 22]). Set

θ̄_k := k^{−1} Σ_{j=1}^k θ_j.

Theorem 2.5. Let {θ_k, k ≥ 1} be the output of Algorithm 1. Assume H7, γ_k ∼ γ_* k^{−c} with c ∈ (1/2, 1) and γ_* > 0, and the condition (2.8). Then, under the conditional probability P_0(· | lim_q θ_q = θ_*), the sequence {k^{1/2}(θ̄_k − θ_*), k ≥ 1} converges in distribution to the centered bivariate normal distribution with covariance matrix Γ_*.
In the case where the condition (2.6) is replaced with H6 in Theorem 2.5, the condition (2.8) is replaced by a weaker one, satisfied as soon as µ > 2c. Note that these conditions on µ are slightly more restrictive than the ones we obtained for the convergence of the sequence {θ_k, k ≥ 1} in the case c ∈ (1/2, 1).
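The Polyak-Ruppert post-processing of Theorem 2.5 amounts to averaging the SA iterates. A hedged Python sketch follows (toy Gaussian loss as before; we average only after a burn-in phase, a common practical variant of the plain average θ̄_k used in the theorem):

```python
import numpy as np

def sa_var_es_averaged(sample_loss, alpha=0.975, K=200_000,
                       gamma0=0.5, c=0.75, seed=0):
    """SA recursion (1.8) with a slowly decaying stepsize, c in (1/2, 1),
    followed by on-line averaging of the iterates over the second half
    of the run (Polyak-Ruppert averaging, cf. Theorem 2.5)."""
    rng = np.random.default_rng(seed)
    xi = chi = 0.0
    xi_bar = chi_bar = 0.0
    burn_in = K // 2
    for k in range(1, K + 1):
        gamma = gamma0 / k**c
        L = sample_loss(rng)
        xi -= gamma * (1.0 - (L >= xi) / (1.0 - alpha))
        chi -= gamma * (chi - xi - max(L - xi, 0.0) / (1.0 - alpha))
        if k > burn_in:                    # running mean of the iterates
            n = k - burn_in
            xi_bar += (xi - xi_bar) / n
            chi_bar += (chi - chi_bar) / n
    return xi_bar, chi_bar

xi_bar, chi_bar = sa_var_es_averaged(lambda rng: rng.standard_normal())
```

The averaged output is noticeably more stable than the raw final iterate, reflecting the optimal O(1/k) rate of Theorem 2.5.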

Convergence Analysis of the Economic Capital SA Algorithm 2 (Future Liabilities Estimated by Regression)
In order to properly define Ψ_B(Z_1) in L_B as a random variable, we assume that the function space Ĥ is pointwise measurable. We introduce the following objects (cf. (3.1)): for any fixed g ∈ H, we define

L_{g_B} := φ + β g_B(Z_1) − Ψ'(z),     (3.2)

where g_B : R^q → R is the truncation of g by B: g_B := sign(g)(|g| ∧ B). Last, for the approximation of Ψ_B obtained by regression (see Algorithm 3), we write

L_B := φ + β Ψ̂_B(Z_1) − Ψ'(z).     (3.3)

The draws used in Algorithm 2 are independent of the ψ'^m. In addition, we have the square integrability condition E_0[|φ|² + |ψ'|²] < +∞.
Observe that the above assumption ensures in particular that, for any g ∈ H, L_{g_B} is square integrable under P_0. We require an additional condition on L_{g_B}.
H10. For every g ∈ H, L_{g_B} in (3.2) has a continuous cumulative distribution function under P_0.
Lemma 3.1. Assume H1, H9, and H10. Then, for any fixed regression sample D, the output (ξ_K, χ_K) of Algorithm 2 converges, P_0( · | D)-a.s., to a limit (ξ_∞, χ_∞) solving (1.4)-(1.5) for L = L_B.

Proof. Given g ∈ H, H1, H9, and H10 imply that the hypotheses of Theorem 1.2 are verified for every L_{g_B} as in (3.2) under the distribution P_0. Hence, for fixed D, the same is true for L_B in (3.3) under the conditional distribution P_0( · | D). The conclusion follows by application of Theorem 1.2.

Error Analysis With a Given Approximate Model for the Regression Function
The next step is to bound the error between the initial model for L and the truncated and approximate model L_{g_B}, where we use the function g_B (for a given g ∈ H) as a model of Ψ. For this we need Assumption H2 a) on the cumulative distribution function of L in (1.2), and its stronger version:

H11. The distribution of L, with P_0 c.d.f. F, admits a density f under P_0 bounded by C_f; this density is positive and continuous on a neighborhood of the interval delimited by ξ_* and its perturbations of size ζ.

Lemma 3.2. Assume H9, H10, and H2 a), let g ∈ H be given, and let ζ denote the L² regression error E_0[(g_B − Ψ)²(Z_1)]^{1/2}. Then the bound (3.5) holds. If the stronger condition H11 holds (for g), then the sharper bound (3.6) holds.

Proof. We begin by proving (3.6), by an application of Corollary A.11. For this, we first estimate the Kolmogorov distance d_kol(L_{g_B}, L): actually, Corollary A.13 with p = 2 gives a bound in terms of the expectation (3.7). The difference in the expectation (3.7) is bounded using the definitions (1.2) and (3.2), which yield |L_{g_B} − L| ≤ |β| |g_B(Z_1) − Ψ(Z_1)|. Therefore, we deduce a control of d_kol(L_{g_B}, L) in terms of ζ. Consequently, we can apply Corollary A.11 with r = s = ζ to get (3.6). The inequality (3.5) follows in an easier way via (A.16) in Lemma A.10.

Error Analysis for the Randomly Optimal Regression Function
Observe that, by taking formally g_B = Ψ_B, we obtain, as a corollary of the previous lemma, a pathwise control between (ξ_∞, χ_∞) (associated to L_B) and (ξ_*, χ_*) (associated to L), for a given regression sample D. By reintegrating over the learning sample D, we shall obtain an estimate of the corresponding mean L¹ error. This strategy works nicely, in particular, if we allow Assumption H11 to be valid with a value of ζ uniform in the learning sample D. For this, we set ζ_∞, which stands for a (rough) upper bound for ζ. This explains the following new assumption.
Regarding the error analysis of the limits of Algorithm 2 (given by Lemma 3.1), our main result is now the following.

Theorem 3.3. Under the above assumptions, the mean errors E_0|χ_∞ − χ_*| and E_0|ξ_∞ − ξ_*| are bounded, up to multiplicative constants, by a statistical (variance) term of order (1 + ln(M))/M times the complexity of Ĥ, plus a bias term, as summarized in (3.10)-(3.11), where C is the constant that appears in (A.20).

Theorem 3.3 gives a precise and useful guide for tuning the parameters all together. Namely, to make the (asymptotic) errors E_0|χ_∞ − χ_*| and E_0|ξ_∞ − ξ_*| less than some tolerance ε, we can choose Ĥ and B such that the "bias" given by the second line in (3.10) is sufficiently small; then one can choose M and M' large enough so that the right-hand sides in (3.11) are less than ε. Unsurprisingly, when the complexity of Ĥ increases, the bias term (inf_{h ∈ Ĥ} …) goes to 0 while the variance term explodes (VC_Ĥ → +∞); hence one has to find a trade-off between those types of error. When one increases the threshold B, the bias term E_0[((|ψ| − B)^+)²] decreases, but the variance increases (factor C B² …).
Proof. First, by H1, H9, H10, and Lemma 3.1, the limits indeed exist for every fixed D, and they correspond to solutions of (1.4)-(1.5) for L = L_B (see (1.2)) under P_0( · | D). Now apply Lemma 3.2, which is valid for any D since H2 a) holds for all g_B owing to the choice ζ = ζ_∞. As β is bounded, we obtain (3.13)-(3.14). Note that the first expectation of the right-hand side is exactly controlled using Theorem A.14. For the second term, write (3.15). We now easily obtain the desired estimates by taking the expectation in (3.13)-(3.14), applying Theorem A.14, and using (3.15), together with E[|Z|^{1/p}] ≤ (E[|Z|])^{1/p} for any p ≥ 1.

Dynamization of the Setup
Let there be given an R^q-valued process Z = {Z_t, t ≥ 0}, with Z_0 = z, non-homogeneous Markov in its own filtration on our probability space (Ω, A, P). The process Z plays the role of observable risk factors. Conditional probabilities, expectations, value-at-risks, and expected shortfalls at a level a ∈ (0, 1), given Z_t, are denoted by P_t, E_t, VaR^a_t, and ES^a_t. Other sources of randomness arising in (Ω, A, P) may be unobservable factors (like hidden financial variables or private information). We assume that Z can be simulated exactly (in other words, we ignore for the sake of simplicity a vanishing time discretization bias regarding Z, which could be accounted for without major difficulty). We denote by Z̄_t = (t, Z_t) the time-homogenized Markov extension of Z. We write Z_{[s,t]} and Z̄_{[s,t]} for the paths of Z and Z̄ on the interval [s, t]. We define the discount factor

β_t := e^{−∫_0^t r(Z̄_s) ds}

for some bounded-from-below, continuous interest rate function r (hence, in particular, a bounded discount factor). We may then consider a specification of (1.1) where φ and ψ are given as real-valued measurable functions of the paths of Z̄.
Remark 4.1. The functions φ and ψ could depend on variables other than Z; this would not have any significant impact on the analysis. For instance, we could consider a Euro Medium Term Note (EMTN), issued by a bank, with a performance linked to the one-year Euribor rate denoted by Z; then the cash flow for the bank may take the form ϕ(Z_1)1_{τ ≥ 1} = ψ'(Z_{[0,1]}, τ), where τ is the default time of the bank (assumed independent of Z for simplicity).
In the regression setup of Algorithm 2, this flexibility of using "Z smaller than an underlying high-dimensional factor process" allows embedding in our framework the common industry practice of "partial regressions" with respect to reduced sets of factors.
More broadly, let, for t ≥ 0 (cf. (4.1) for t = 0), L^t_{t+1} denote the loss of the firm over the time interval [t, t + 1]. Let VaR^a_t[L^t_{t+1}] denote a value-at-risk at level a ∈ (1/2, 1) of L^t_{t+1} for the conditional distribution of L^t_{t+1} given F_t, i.e. a level-a quantile of that conditional distribution.

Theoretical Risk Margin Estimate
The risk margin RM (called KVA in banking parlance) estimates how much it would cost the firm (bank or insurance) to remunerate its shareholders at a hurdle rate h > 0 (e.g. 10%) for their capital at risk ES(Z̄_t) at any future time t (see Section ). Given the final maturity T of the portfolio, the corresponding formula in [2] reads

RM = E ∫_0^T h e^{−ht} β_t ES(Z̄_t) dt = E[β_ζ ES(Z̄_ζ) 1_{ζ ≤ T}],     (4.4)

where the second equality follows by randomization of the integral with an independent exponential time ζ with parameter h. Accordingly, we propose the risk margin estimator

RM̂ := N^{−1} Σ_{n=1}^N β_{ζ^n} ÊS(Z̄^n_{ζ^n}) 1_{ζ^n ≤ T},     (4.5)

where the {Z̄^n_{ζ^n}, n ≥ 1} are independent random variables with the same distribution as Z̄_ζ and where ÊS(·) is one of the estimators of ES(·) considered in the previous sections, now made conditional on Z̄_t.
The convergence to the risk margin of the ensuing estimator, obtained by sampling an outer expectation of inner conditional expected shortfall estimates, could be established by taking an outer expectation of the errors of the estimates ÊS(Z̄^n_{ζ^n}) of the ES(Z̄^n_{ζ^n}) in (4.5), errors obtained from the conditional version of the results of Sections 2 and 3 (or, more precisely, of the awaited but technical extensions of these results in terms of convergence rates). By contrast, how to "make conditional" the convergence arguments of [17] or [10] and "aggregate them" to establish the convergence of an outer risk margin estimate is far from clear.
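The randomization of the time integral with an Exp(h) time ζ translates directly into code. Below is a hedged sketch: the helper names `simulate_Z_at` and `es_estimator` are hypothetical, the discount factor is taken to be 1 for simplicity, and `es_estimator` stands for any of the conditional ES estimators of the previous sections.

```python
import numpy as np

def risk_margin(simulate_Z_at, es_estimator, h=0.10, T=10.0, N=4_000, seed=0):
    """Outer risk-margin estimator in the spirit of (4.5): draw independent
    exponential times zeta^n with parameter h, simulate the risk factors at
    zeta^n, and average the inner conditional ES estimates over the outer
    draws (the h e^{-h t} dt weight is absorbed by the sampling of zeta)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for n in range(N):
        zeta = rng.exponential(1.0 / h)          # zeta^n ~ Exp(h)
        if zeta <= T:                            # 1_{zeta <= T}: maturity cutoff
            Z = simulate_Z_at(rng, zeta)         # one outer draw of the factors
            total += es_estimator(rng, zeta, Z)  # inner conditional ES estimate
    return total / N
```

With a constant unit economic capital profile, the estimator should return approximately 1 − e^{−hT}, the exact value of h ∫_0^T e^{−ht} dt, which gives a quick sanity check of the randomization.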

KVA Case Study
Our case study is based on the setup of Armenti and Crépey [5], Section 4 (see also Section 4.4 in [1]), which we recall as a starting point. We consider a clearing house (or central counterparty, CCP for short) with a finite number (≥ 2) of clearing members labeled by i. We denote by:
• T: an upper bound on the maturity of all claims in the CCP portfolio, also accounting for a constant time δ > 0 of liquidating the positions of defaulting clearing members;
• D^i_t: the cumulative contractual cash flow process of the CCP portfolio of the member i, cash flows being counted positively when they flow from the clearing member to the CCP;
• MtM^i_t: the mark-to-market of the CCP portfolio of the member i;
• τ_i, τ^δ_i = τ_i + δ and δ_{τ^δ_i}(dt): the default and liquidation times of the member i, and a Dirac measure at time τ^δ_i;
• the cumulative contractual cash flows of the member i, accrued at the OIS rate, over the liquidation period of the clearing member i;
• IM^i_t: the initial margin (IM) posted by the member i as a guarantee in case it defaults, given at time t as a conditional value-at-risk (at a given confidence level a_ma) of β^{-1}(…).
Beyond the first ring of defense provided by initial margin (and, of course, variation margin, which we assume equal to the process MtM^i_t stopped at time τ_i), a CCP maintains an additional resource, known as the default fund, against extreme and systemic risk. The current EMIR regulation sizes the default fund of a CCP by the Cover 2 rule, i.e. enough to cover the joint default of the two clearing members with the greatest CCP exposures; this rule relies purely on market risk.
By contrast, we consider in the setup of this case study a broader risk-based specification, in the form of an economic capital of the CCP, which would be defined as a conditional expected shortfall, at some confidence level a df , of its one-year ahead loss-and-profit if there was no default fund, as it results from the combination of the credit risk of the clearing members and of the market risk of their portfolios. As developed in [5], such a specification can be used for allocating the default fund between the clearing members, after calibration of the quantile level a df to the Cover 2 regulatory rule at time 0.
Specifically, for t ∈ (0, T], we define the loss process L^ccp of a CCP that would be in charge of dealing with member counterparty default losses through a CVA^ccp account (earning the risk-free rate r), starting from some arbitrary initial value, since it is only the fluctuations of L^ccp that matter in what follows; in particular, L^ccp is constant from time T onward. We define the corresponding economic capital process of the CCP as a conditional expected shortfall, where, by (4.6), (4.9) holds. The KVA (or risk margin) of the CCP estimates how much it would cost the CCP to remunerate all clearing members at some hurdle rate h for their capital at risk in the default fund from time 0 onward, assuming a constant interest rate r (cf. (4.4)). For our numerics, we consider the CCP toy model of Section 4 in [5] and Section 4.4 in [1], where nine members are clearing (interest rate or foreign exchange) swaps on a Black-Scholes underlying rate process X, with all the numerical parameters used there. The default times of the clearing members are defined through the common shock model, or dynamic Marshall-Olkin copula (DMO) model, of [12], Chapters 8-10, and [13] (see also [14, 15]).

Mapping with the General Setup
This model, where defaults can happen simultaneously with positive probability, results in a Markovian pair Z = (X, Y) made of, on the one hand, the underlying Black-Scholes rate X and, on the other hand, the vector Y of the default indicator processes of the clearing members. As a consequence, all conditional expectations, values-at-risk (embedded in the IM^i numbers), and expected shortfalls (embedded in the EC^ccp numbers) are functions of the pair (t, Z), so that, with Z = (X, Y) as above, we can make the identification with the general setup. The ensuing KVA can be computed by Algorithm 0 (for validation purposes, building on the explicit CVA^ccp formulas that are available in our stylized setup, cf. [5, Section A]), 1, or 2 for the inner EC^ccp computations, which are then aggregated as explained above. However, for GPU optimization reasons developed in [1, Appendices A and B], we do not rely on the randomized version (given by the right-hand side formulation) of the risk margin in (4.4), i.e. we do not use the unbiased estimator (4.5), resorting instead to a Riemann sum approximation, with step six months, of the time integral that is visible in the left-hand side of (4.4).
Depending on the algorithm that is used, we can further identify β(Z_{[s,u]}) = e^{−r(u−s)}, as well as the relevant cash flows, in the case of Algorithm 0 and in the case of Algorithms 1 or 2, respectively. With respect to the general setup of the previous sections, the methodological assumptions, such as the ones on the sequences γ_k of the SA parameters or the requirement made in H9 of using a regression sample independent from the rest of the simulation in the context of Algorithm 2, can always be met at the implementation stage.
Regarding now the abstract assumptions there, we only make the general comment that they should all hold in our lognormal model for X, combined with randomized sampling at the default times of the counterparties, which are all times with an intensity; recall the modeling assumptions related to Algorithm 0 (SA scheme for the basic case without liabilities), Algorithm 1 (SA scheme with nested simulation of future liabilities) and Algorithm 2 (SA scheme with regression of future liabilities), respectively. Regarding the regression algorithm for CVA^ccp_{t+1} that is required in the context of Algorithm 2, we apply to CVA^ccp_{t+1} = E_{t+1} ψ(Z_{[t+1,T]}) the approach that is used for computing the "CA process" in Section 4.4 of [1], using as a regression basis 1, X_{t+1}, X²_{t+1} (recall X is the underlying Black-Scholes rate) and the default indicator processes at time (t+1) of the clearing members. In the present case of CVA^ccp_{t+1}, the situation is in fact a bit simpler, as no time-stepping is required: we just need one regression for each time (t+1) that occurs via the discretization times t of the integral visible in (4.4), because CVA^ccp_{t+1} is a conditional expectation, as opposed to the above-mentioned CA process, which solves a semilinear BSDE.
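As an illustration of such a regression step, the sketch below works under a hypothetical single-factor toy model (not the paper's CCP setup), with the basis {1, X_{t+1}, X²_{t+1}} quoted above and default indicators omitted:

```python
import numpy as np

# Hypothetical single-factor toy (not the paper's CCP model):
# regress a future cash flow psi on the basis {1, X_{t+1}, X_{t+1}^2}
# to get a regressed conditional expectation on the simulated cloud.
rng = np.random.default_rng(3)
n = 100_000
x1 = np.exp(0.2 * rng.standard_normal(n))                     # toy factor at t+1
psi = np.maximum(x1 * np.exp(0.2 * rng.standard_normal(n)) - 1.0, 0.0)

A = np.column_stack([np.ones(n), x1, x1 ** 2])                # regression basis
coef, *_ = np.linalg.lstsq(A, psi, rcond=None)
cond_exp = A @ coef          # approximates E[psi | X_{t+1}] on the cloud
```

The least-squares fit makes the residuals orthogonal to the basis functions, which is the defining property of the (finite-basis) conditional expectation proxy.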
For an SA scheme launched at time t of the outer KVA simulation, we use γ_k = γ_0 / (100 + k^{0.75}) × (T − t)/T, starting from the initial condition ξ_0 = χ_0 = 0.
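A minimal sketch of an SA recursion of this type at a fixed time, with the step size above (the (T − t)/T rescaling dropped since t is fixed), is given below. It assumes a Robbins-Monro update of the classical (VaR, ES) form; the exact updates of Algorithm 0 are in the paper, and this is only an illustrative stand-in:

```python
import numpy as np

def sa_var_es(losses, a, gamma0=1.0):
    """Robbins-Monro recursion for (xi, chi) ~ (VaR_a, ES_a),
    with step size gamma_k = gamma0 / (100 + k^0.75)."""
    xi = chi = 0.0                      # initial condition xi_0 = chi_0 = 0
    for k, L in enumerate(losses, start=1):
        gamma = gamma0 / (100.0 + k ** 0.75)
        # VaR update: stochastic gradient of xi -> xi + E(L - xi)_+ / (1 - a)
        xi -= gamma * (1.0 - (L >= xi) / (1.0 - a))
        # ES update: running weighted average of xi + (L - xi)_+ / (1 - a)
        chi += gamma * (xi + max(L - xi, 0.0) / (1.0 - a) - chi)
    return xi, chi

rng = np.random.default_rng(2)
xi, chi = sa_var_es(rng.standard_normal(500_000), a=0.975)
# for a standard normal loss, xi ~ 1.96 and chi ~ 2.34
```

Being incremental, the recursion processes one fresh loss draw per step and keeps only the pair (ξ_k, χ_k) in memory.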

Numerical Results
All our simulations are run on a laptop with an Intel i7-7700HQ CPU and a single GeForce GTX 1060 GPU programmed with the CUDA/C application programming interface (API). Table 1 shows the time-0 (unconditional) expected shortfalls over the first year, obtained by four variants of the SA scheme and for three levels of the quantile a_df. In the case a_df = 85%, Figure 1 shows the corresponding (time-discretized) ES processes obtained after K = 10^4 and K = 5 × 10^5 iterations of the SA schemes.

Conclusion and Perspectives
In this paper we propose convergent stochastic approximation estimators for the economic capital of a loss random variable L that entails a future liability (conditional expectation). The latter is estimated either by nested Monte Carlo as in Gordy and Juneja [17], or by regression as in Broadie, Du, and Moallemi [10]. Then we embed conditional versions of the above into outer risk margin (or KVA) computations.
From a practical point of view, an incremental SA scheme uses a limited amount of memory but, being a loop, is less easy to parallelize than a simulation-and-sort algorithm, on which several processors can fruitfully be used (see [1, Appendix C]). On the other hand, SA schemes can be efficiently combined with importance sampling as studied in [7, 8], whereas [17] and [10] introduce respective jackknife and weighted-regression acceleration procedures for the simulation-and-sort schemes. From a theoretical point of view, the stochastic approximation viewpoint leads to stronger convergence results under considerably milder assumptions than [17] and [10] together. In particular, our assumptions (recalled in Section 4.3.1) only bear on the limiting problem, as opposed to unverifiable (not to say implausible) assumptions on the perturbed approximating problems in [17] and [10]: • assumptions on the density of the nested Monte Carlo surrogate of the loss in [17]; • invertibility of the empirical covariance matrix of the regressors and an orthonormal basis of empirical regressors in [10]. By contrast, we do not even need to assume a vector space of theoretical regressors; for instance, our space of theoretical regressors could be given in the form of a neural network.
Regarding now the results: • [17] only shows mean-square convergence, whereas we show almost sure convergence; • [10] considers a very stylized proxy of expected shortfall in the form of E(L − ξ)_+, for a known and fixed ξ, instead of the value-at-risk of L that needs to be estimated in the first place in a genuine expected shortfall perspective. Moreover, their study is asymptotic in the number of simulations for a fixed number of basis functions; they do not address the global convergence problem when the size of the regression basis and the number of simulations jointly go to infinity.
Last, regarding the comparison between the stochastic approximation schemes with nested versus regressed estimation of future liabilities, the assumptions that allow establishing the convergence of either approach are discussed and compared throughout the paper. In order to compare the finer convergence properties of each approach, it would be useful to push the computations so as to obtain the L² errors in both cases, which we leave for future research.

A. Appendix: Technical Developments
We denote by |x| the (Euclidean) norm of x ∈ R^d and by ⟨x, y⟩ the inner product of two vectors x, y ∈ R^d. Vectors x ∈ R^d are column vectors, and A^T denotes the transpose of a matrix A.
Some of the results are general (not specific to the setup of the main body of the paper) and therefore stated in terms of an abstract probability measure Q, with related expectation denoted by E.

A.2. A General Convergence Result for Stochastic Approximation Algorithms
Let H : R^d × R^q → R^d be a measurable function and let {γ_k, k ≥ 1} be a sequence of positive numbers. Let R^q-valued random variables {V_k, k ≥ 0} and θ_0 ∈ R^d be defined on a probability space (Ω, A, P). Theorem A.3 provides sufficient conditions for the almost-sure convergence and the L^p-convergence, p ∈ (0, 2), of the sequence {θ_k, k ≥ 0} given by
θ_{k+1} = θ_k + γ_{k+1} H(θ_k, V_{k+1}).    (A.1)
These conditions are general enough to cover the case when the r.v. {V_k, k ≥ 1} are not i.i.d. but have a distribution converging, in some sense, to the distribution of a r.v. V.
Step 2. Uniform boundedness in L². Let a (deterministic) point θ_* ∈ L be given. By taking expectations in (A.3) and applying again the Robbins-Siegmund lemma with the assumptions (i) and (vi), we deduce that the limit lim_k E|θ_k − θ_*|² exists, and thus sup_k E|θ_k|² < ∞, since L is bounded. This implies sup_k E|θ_k − θ_∞|² < +∞ for any L-valued random variable θ_∞, using again that L is bounded.
Step 3. Convergence in L^p. Let C > 0 and p ∈ (0, 2). We write a decomposition into two terms. The first term converges to zero by the dominated convergence theorem. For the second term, Hölder's and Markov's inequalities give a bound which is smaller than ε for C large enough. This holds true for any ε > 0, thus concluding the proof.
For the results on the sequence {χ_k, k ≥ 1}, we check the assumptions of Lemma A.2 with θ_k ← χ_k (so that d = 1). Set S_0 := 1 and S_k := ∏_{j=1}^k (1 − γ_j)^{-1} for any k ≥ 1, so that S_k (1 − γ_k) = S_{k−1} and S_k − S_{k−1} = γ_k S_k. By H1 and Lemma A.2, lim_k S_k = +∞, so that the first term converges, from the above almost-sure convergence of {ξ_k, k ≥ 1} and from the Cesàro lemma. By H2a, the second term in the RHS of (A.5) is a continuous function of ξ_k; therefore, it converges by similar arguments. Finally, {ẽ_k, k ≥ 1} is a G_k-martingale increment; by using |(a−c)_+ − (b−c)_+| ≤ |a| + |b| and (|a| + |b|)² ≤ 2a² + 2b², and since the {L_k, k ≥ 1} are i.i.d. with distribution Q(z, ·), the required bound follows. By Lemma A.2, we obtain that, P_0-a.s., lim_k χ_k exists and solves the limiting equation. If p < 2, by using (x + y)^{p/2} ≤ x^{p/2} + y^{p/2} for any x, y ≥ 0, the L^p statement follows. This concludes the proof.
(iii) there exists a constant C > 0 such that the stated bound holds. Then, for any positive integer M, the corresponding estimate holds, where c_p only depends on p (see its definition in Lemma A.4). If, in addition, (iv) there exists C_∞ > 0 such that the stated bound holds for any δ > 0 and any positive integer M, then the conclusion holds for any integers M, M′ ≥ 3. By using (i), the first bound follows. By (ii), (iii), Lemma A.4, and the inequality (x + y)^p ≤ 2^{p−1}(x^p + y^p) for any x, y ≥ 0, the second bound follows. The Chebyshev inequality then yields the final estimate, where the last inequality is obtained by choosing δ ← √((log M)/(2 C_∞ M)). This concludes the proof, since √(ln M) ≥ 1 for M ≥ 3.
Proof of Lemma 2.1. We apply Lemma A.5 with P ← P_0, B ← σ(Z_0, Z_1), C_i ← C_i(z) for i ∈ {0, p, ∞}, C ← c_β and p ≥ 2. This yields the inequalities (2.3) and (2.5). Since |a_+ − b_+| ≤ |a − b| and p ≥ 1, we conclude the proof of (2.4). Under H5 and H3, the conditions (i), (ii), (iii) and (v) hold; the proof follows the same lines as that of Theorem 1.2 and is omitted. We establish the condition (vi) (which also implies the condition (iv)). Note that, since the r.v. (φ_{k+1}, β_{k+1}, Z^{k+1}_1) are independent from G_k and, conditionally on G_k, have the same distribution as the processes (φ, β, Z_1), the distribution of L̃_{k+1} given G_k is Q(z, ·). Hence, writing out the corresponding decomposition, by Lemma 2.1 there exists a constant c such that, for any k ≥ 1, the required bound holds P_0-a.s.

A.5. A Central Limit Theorem for Stochastic Approximation Algorithms
We recall in this section sufficient conditions for a central limit theorem (CLT) to hold for random variables {θ_k, k ≥ 0} defined through a stochastic approximation algorithm: given a deterministic sequence {γ_k, k ≥ 1}, a function h : R^d → R^d, θ_0 ∈ R^d and R^d-valued random variables {e_k, k ≥ 1} and {r_k, k ≥ 1} defined on (Ω, A, P), define for k ≥ 0,
θ_{k+1} = θ_k + γ_{k+1} h(θ_k) + γ_{k+1} e_{k+1} + γ_{k+1} r_{k+1}.    (A.10)
Theorem A.6 corresponds to [16, Theorem 2.1]. It provides sufficient conditions for a CLT along a converging sequence {lim_q θ_q = θ_⋆}, where θ_⋆ ∈ R^d is fixed (deterministic). On the mean field h and the limit point θ_⋆, it is assumed: C1. a) The mean field h : R^d → R^d is measurable and twice continuously differentiable in a neighborhood of θ_⋆, where h(θ_⋆) = 0. b) The gradient ∇h(θ_⋆) is a Hurwitz matrix. Denote by −ℓ (ℓ > 0) the largest real part of its eigenvalues.
c) There exists a symmetric positive definite matrix D and a sequence {D_k, k ≥ 1} of R^d-valued random variables such that, P-a.s., the stated convergence holds.
A.6. Proofs of the Results of Section 2.2. Throughout this section, set θ := (ξ, χ). We start with a preliminary lemma.
Proof of Theorem 2.4. The proof consists in applying Theorem A.6. We check its assumptions with Q ← P_0, θ_k ← (ξ_k, χ_k), θ_⋆ ← (ξ_⋆, χ_⋆), and the function h given by (A.11). The random variables e_k, r_k are defined accordingly, where h and H are given by (A.11). With these definitions, note that Algorithm 1 updates the parameter θ_{k+1} by θ_{k+1} = θ_k + γ_{k+1} H(θ_k, L_{k+1}).
Since θ_⋆ satisfies (1.6), we have h(θ_⋆) = 0. By H7, the function h is twice continuously differentiable in a neighborhood of θ_⋆, and the gradient can be computed explicitly, using (1.6) in the last equality. Hence, by H7, the condition C1 is verified.
By H7, since L̃_{k+1} has the same distribution as L under P_0, there exists a constant C′ (depending upon C) such that the required uniform bound holds. Hence, the conditions C2a-b are verified. The condition C2c follows from Lemma A.8. By Lemma 2.1, under H7, the LHS of the corresponding decomposition is upper bounded as required; hence, by H8, the condition C3 is verified. Finally, the condition C4 holds by H8 and (A.14). This concludes the proof of the theorem.
Proof of Theorem 2.5. The proof consists in an application of Theorem A.7. We use the same notations as in the proof of Theorem 2.4; it was already proved that C1 and C2 hold. We check C5. By Lemma 2.1, there exists a constant C such that the stated bound holds for any k ≥ 0; therefore, the condition on γ_k^{−1/2} r_k is satisfied by (2.8). In addition, by Lemma 2.1 again, there exists a constant C such that the bound holds for any δ > 0. Therefore, the condition C5 holds by (2.8).

A.7. Sensitivities of Value-at-Risk and Expected Shortfall to Perturbations of the Input Distribution
We develop in this section some estimates on the perturbation of the value-at-risk and the expected shortfall that arises when different distributions are used for the underlying loss variable Z. We use the notation VaR_α(Z) and ES_α(Z) for the P value-at-risk and expected shortfall of Z, where VaR_α(Z) is taken as the infimum of such values (a left quantile).
Definition A.9. The Kolmogorov distance d_kol(X, Y) between two scalar random variables X and Y is the sup norm of the difference between their cumulative distribution functions. We show that if X, Y are integrable scalar random variables with a continuous density, then, for any fixed α > 0, the differences |VaR_α(X) − VaR_α(Y)| and |ES_α(X) − ES_α(Y)| are bounded, up to a multiplicative constant depending on α and on the density of X, by the L¹ and the Kolmogorov distances between X and Y.
Our first proposition has to do with the relationship between the Kolmogorov distance and the behavior of VaR_β(·) as a function of β.
Lemma A.10. Let X and Y be scalar random variables having a continuous cumulative distribution function. Then for any α ∈ (0, 1) and every VaR_α(Y) we have
VaR_{α−d_kol(X,Y)}(X) ≤ VaR_α(Y) ≤ VaR_{α+d_kol(X,Y)}(X)    (A.15)
for some elements from the respective VaR_α(X) sets, and with the convention VaR_β = −∞ (respectively VaR_β = +∞) if β < 0 (respectively β > 1). If X and Y are also integrable, then (A.16) holds.
Proof. Let α ∈ (0, 1) be given, and let d := d_kol(X, Y). From the definition of the Kolmogorov distance it follows that, for every ξ ∈ R,
P[X ≤ ξ] − d ≤ P[Y ≤ ξ] ≤ P[X ≤ ξ] + d,
so that for every ξ_α such that P[Y ≤ ξ_α] = α (i.e. for every VaR_α element of Y) we obtain (A.17). Now consider the function G(x, z) := x + (z − x)_+ / (1 − α) and note that, for fixed x, the function G(x, ·) is a uniformly Lipschitz function of z with Lipschitz constant 1/(1 − α). By taking Z = X and Z = Y, this implies in particular a bound for every x. Taking the infimum in x and using (A.17), we get (A.16) as desired.
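Lemma A.10 can be checked numerically on a pair with closed-form quantiles and Kolmogorov distance; for X ~ N(0,1) and Y ~ N(μ,1), d_kol(X, Y) = 2Φ(μ/2) − 1, attained at the midpoint μ/2 (the example and its parameters are ours):

```python
from statistics import NormalDist

# Closed-form check of Lemma A.10 for X ~ N(0,1), Y ~ N(mu,1):
# d_kol(X, Y) = 2*Phi(mu/2) - 1, attained at the midpoint mu/2.
mu = 0.1
X, Y = NormalDist(0.0, 1.0), NormalDist(mu, 1.0)
d = 2.0 * X.cdf(mu / 2.0) - 1.0

for alpha in (0.6, 0.8, 0.95):
    lo = X.inv_cdf(alpha - d)        # VaR_{alpha - d}(X)
    mid = Y.inv_cdf(alpha)           # VaR_alpha(Y)
    hi = X.inv_cdf(alpha + d)        # VaR_{alpha + d}(X)
    assert lo <= mid <= hi           # the sandwich (A.15)
```

The check also illustrates the need for the convention VaR_β = ±∞: for α close enough to 1, the perturbed level α + d exceeds 1.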
Inspired by [6], we develop further estimates on the Kolmogorov distance between X and Y that may depend on higher moments of the difference between these random variables. We apply these estimates to the error analysis of Algorithm 2, in which the bias due to fixing an approximation procedure for the samplings of Ψ has to be controlled in order to have useful criteria for the choice of the parameters of the algorithm.
Corollary A.11. Assume that the scalar random variable X has a c.d.f. F which is continuously differentiable and strictly increasing in a neighborhood of VaR_α(X), let f := dF/dλ be the corresponding density (where it exists), and let δ be such that the inverse F^{−1} of F exists in a δ-neighborhood of α. Then, for any scalar random variable Y and any 0 < r, s < δ, the stated difference is controlled, under the given condition, by a bound of the form |f(F^{−1}(x))|^{−1} d_kol(X, Y).
Proof. This follows from the fact that, under the given hypotheses, the corresponding bound holds on the quantile function; the other cases are treated similarly.
In order to pass to controls that depend only on the L¹ distance, we now present two estimates of d_kol(X, Y) that are related to the actual difference between X and Y. These will be combined to estimate the expected error induced by the application of the stochastic approximation procedure to the sequence of samplings produced via regression.
Lemma A.12. Assume that the scalar random variable X admits a density bounded by C_0. Then, for any scalar random variable Y and any δ > 0, the bound (A.19) holds.
Proof. The following argument was already presented in the proof of Lemma A.5, so we give here a summarized version: for δ > 0 given and any ξ ∈ R, we obtain the bound using the hypothesis.
Corollary A.13. Assume that the scalar random variable X admits a density bounded by C_0. Then for any scalar random variable Y and any p > 0 we have the corresponding bound. Proof. In the case where E|X − Y|^p = +∞ the conclusion is trivially true. In the p-integrable case, take δ = (E|X − Y|^p)^{1/(1+p)} in equation (A.19) and apply Markov's inequality.
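Combining the choice δ = (E|X − Y|^p)^{1/(1+p)} with (A.19) and Markov's inequality yields a bound of the form d_kol(X, Y) ≤ (1 + C_0)(E|X − Y|^p)^{1/(1+p)}. The sketch below checks this numerically on a hypothetical pair (X uniform, so C_0 = 1, and Y a small perturbation of X), with d_kol estimated on a grid:

```python
import numpy as np

# X ~ U(0,1) has density bounded by C0 = 1; Y is a small perturbation of X.
rng = np.random.default_rng(4)
n = 400_000
X = rng.uniform(0.0, 1.0, n)
Y = X + 0.02 * rng.standard_normal(n)

p, C0 = 2.0, 1.0
m = float(np.mean(np.abs(X - Y) ** p))                 # E|X - Y|^p
bound = (1.0 + C0) * m ** (1.0 / (1.0 + p))            # (1 + C0) m^{1/(1+p)}

# Kolmogorov distance estimated on a grid of thresholds
grid = np.linspace(-0.5, 1.5, 2001)
Fx = np.searchsorted(np.sort(X), grid, side="right") / n
Fy = np.searchsorted(np.sort(Y), grid, side="right") / n
d_kol = float(np.max(np.abs(Fx - Fy)))
assert d_kol <= bound
```

As expected, the moment-based bound is conservative: the actual Kolmogorov distance is an order of magnitude smaller in this example.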

A.8. A Nonasymptotic Estimate for Regressions
The following result is used to control the error due to the introduction of a regression procedure in Algorithm 2. Theorem A.14 ([18, Theorem 11.5]). Let (X, Y) be a random vector in R^d × R, let F be a pointwise measurable set of functions f : R^d → R with finite Vapnik-Chervonenkis dimension VC_F ≥ 1, and assume that the random variable Y is bounded by B > 0. If D_n = ((X_k, Y_k))_{k=1}^n is any vector of independent copies of (X, Y) and if we define the random function f_{D_n} by the least-squares minimization (A.20)-(A.21) (and therefore m(X′) = E[Y′ | X′], because (X, Y) ∼ (X′, Y′)), and if we apply (A.21) to g_{D_n} := |f_{D_n} − m|², we get that, up to a factor of 2 (which can be improved by looking carefully at the proofs), the accuracy of f_{D_n} as a predictor, constructed from F, of Y as a function of X deviates from the optimal L²-accuracy inf_{f∈F} E|f(X) − m(X)|² by no more than C B² VC_F (1 + ln n)/n, in L²(P)-expectation.