Truncated control variates for weak approximation schemes

In this paper we present an enhancement of the regression-based variance reduction approaches recently proposed in Belomestny et al. This enhancement is based on a truncation of the control variate and allows for a significant reduction of the computing time, while the complexity stays of the same order. The performance of the proposed truncated algorithms is illustrated by a numerical example.


Introduction
Let T > 0 be a fixed time horizon. Consider a d-dimensional diffusion process (X_t^{0,x})_{t∈[0,T]}, started at time 0 in x, defined by the Itô stochastic differential equation

(0.1)  dX_t = μ(X_t) dt + σ(X_t) dW_t,

with mappings μ: R^d → R^d and σ: R^d → R^{d×m}, where W is an m-dimensional Brownian motion. In the standard Monte Carlo (SMC) approach, u(0, x) := E[f(X_T^{0,x})] is approximated by the sample average V_{N_0} := (1/N_0) Σ_{i=1}^{N_0} f(X_T^{(i)}) over N_0 ∈ N independent paths, where X_T is an approximation of X_T^{0,x} constructed via a time discretisation of (0.1) (we refer to [7] for a nice overview of various discretisation schemes). Two types of error are inherent in the computation of u(0, x) by the SMC approach: the (deterministic) discretisation error E[f(X_T^{0,x})] − E[f(X_T)] and the Monte Carlo (statistical) error, which results from substituting E[f(X_T)] with the sample average V_{N_0}. The aim of variance reduction methods is to reduce the latter, statistical error. For example, in the so-called control variate variance reduction approach one looks for a random variable ξ with Eξ = 0, which can be simulated, such that the variance of the difference f(X_T) − ξ is minimised. One then approximates E[f(X_T)] by the sample average of f(X_T^{(i)}) − ξ^{(i)}, i = 1, . . . , N_0, instead of V_{N_0}. The use of control variates for computing expectations of functionals of diffusion processes via Monte Carlo was initiated by Newton [11] and further developed by Milstein and Tretyakov [8]. In Belomestny et al. [1] a novel regression-based approach for the construction of control variates, which reduces the variance of the approximated functional f(X_T), was proposed. As shown in [1], the "Monte Carlo approach with the Regression-based Control Variate" (abbreviated below as "RCV approach") is able to achieve a higher order convergence of the resulting variance to zero, which in turn leads to a significant complexity reduction as compared to the SMC algorithm. Other prominent examples of algorithms with this property are the multilevel Monte Carlo (MLMC) algorithm of [5] and the quadrature-based algorithms of [9] and [10].
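The control variate idea described above can be sketched in a few lines. Everything in the snippet below is a hypothetical one-dimensional toy (the payoff, the control variate ξ, and all numbers), not the paper's diffusion setting:

```python
import numpy as np

# Toy illustration of variance reduction by a control variate:
# estimate E[f(X_T)] by averaging f(X_T) - xi, where E[xi] = 0 and
# xi is strongly correlated with f(X_T).
rng = np.random.default_rng(0)
N0 = 100_000
Z = rng.standard_normal(N0)     # stands in for the simulated randomness
fX = np.exp(0.2 * Z)            # hypothetical payoff f(X_T)
xi = 0.2 * Z                    # zero-mean control variate, E[xi] = 0

plain = fX                      # standard Monte Carlo samples
cv = fX - xi                    # control-variate samples, same expectation

print(plain.mean(), cv.mean())  # both estimate E[f(X_T)] = exp(0.02)
print(plain.var(), cv.var())    # the control variate shrinks the variance
```

Here ξ is simply the linear part of the payoff; keeping only low-order terms of a control variate already removing most of the variance is exactly the flavour of truncation studied below.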
The RCV approach becomes especially simple in the case of so-called weak approximation schemes, i.e. schemes in which simple random variables are used in place of Brownian increments, and which have become quite popular in recent years. However, since a large amount of computation is required to implement the RCV approach, its numerical efficiency is not convincing in higher-dimensional examples. The same applies to the SRCV algorithm of [4]. In this paper we further enhance the performance of the RCV and SRCV algorithms by truncating the control variates, which reduces the number of terms at each time point from 2^m − 1 to m in the case of the weak Euler scheme and from 3^m · 2^{m(m−1)/2} − 1 to m(m + 1) = O(m²) in the case of the second order weak scheme. It turns out that, while the computing time is reduced significantly, a sufficient variance reduction effect remains, so that the complexity is of the same order as for the original RCV and SRCV approaches.
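A quick sanity check of these term counts (the helper function below is ours, not from the paper):

```python
from math import comb

# Number of terms per time point in the full first-order control variate:
# one term for every nonempty subset of the m discrete increments.
def full_first_order(m):
    return sum(comb(m, r) for r in range(1, m + 1))  # equals 2**m - 1

for m in (2, 5, 10):
    full_second = 3**m * 2**(m * (m - 1) // 2) - 1   # second order count
    print(m, full_first_order(m), "->", m,            # truncated: m terms
          "|", full_second, "->", m * (m + 1))        # truncated: m(m+1) terms
```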
The paper is organised as follows. In Section 1 we present a smoothness theorem for a general class of discretisation schemes. Section 2 recalls the construction of control variates for weak schemes of the first and the second order. The main truncation results are derived in Section 3. In Section 4 we describe a generic regression algorithm. Section 5 deals with a complexity analysis for the algorithm that is based on the truncated control variate. Section 6 is devoted to a simulation study. Finally, all proofs are collected in Section 7.

Smoothness theorem for discretisation schemes
In this section we present a technical result for discretisation schemes, which will be very important in the sequel. To begin with, let J ∈ N denote the time discretisation parameter. We set Δ := T/J and consider discretisation schemes defined on the grid {jΔ : j = 0, . . . , J}.
We now define the random functions G_{l,j}(x), for J ≥ l ≥ j ≥ 0 and x ∈ R^d, as the compositions

G_{l,j}(x) := (Φ_{Δ,l} ∘ Φ_{Δ,l−1} ∘ ⋯ ∘ Φ_{Δ,j+1})(x),  G_{j,j}(x) := x,

where Φ_{Δ,l}(x) := Φ_Δ(x, ξ_l) for l = 1, . . . , J. By Φ^k_{Δ,l}, k ∈ {1, . . . , d}, we denote the k-th component of the function Φ_{Δ,l}. Note that it holds G_{l,j}(x) = Φ_{Δ,l}(G_{l−1,j}(x)) for l > j. Let us define the functions q_j(x) := E[f(G_{J,j}(x))], j = 0, . . . , J, and the operator D^α via

D^α g := ∂^{|α|} g / (∂x_1^{α_1} ⋯ ∂x_d^{α_d}),

where g is a real-valued function, α ∈ N_0^d and |α| = α_1 + . . . + α_d (N_0 := N ∪ {0}). In the next theorem we present some smoothness conditions on q_j, which will be used several times in the sections below.
Theorem 1.1. Suppose that f is K times continuously differentiable with bounded partial derivatives up to order K, that Φ_Δ(·, ξ) is K times continuously differentiable (for any fixed ξ), and that, for any n ∈ N, l ≥ j, k ∈ {1, . . . , d} and α ∈ N_0^d with 1 ≤ |α| ≤ K, the corresponding derivative bounds hold with probability one for some constants A_n > 0, B_n > 0. Moreover, suppose that, for any n_1, n_2 ∈ N, the corresponding mixed bounds hold for some constants E_{n_1,n_2} > 0. Then, for all j ∈ {0, . . . , J}, q_j is K times continuously differentiable with bounded partial derivatives up to order K.

Representations for weak approximation schemes
Below we focus on weak schemes of first and second order.
2.1. Weak Euler scheme. In this subsection we treat weak schemes of order 1. Let us consider a scheme in which the d-dimensional approximations X_{Δ,jΔ}, j = 0, . . . , J, satisfy X_{Δ,0} = x_0 and

(2.1)  X_{Δ,jΔ} = Φ_Δ(X_{Δ,(j−1)Δ}, ξ_j),  j = 1, . . . , J,

with ξ_j = (ξ_j^1, . . . , ξ_j^m), j = 1, . . . , J, being m-dimensional i.i.d. random vectors with i.i.d. coordinates such that

P(ξ_j^i = 1) = P(ξ_j^i = −1) = 1/2.

That is, relating to the framework in Section 1, we have m̄ = m and use the discrete increments ξ_j^i, i = 1, . . . , m. A particular case is the weak Euler scheme (also called the simplified weak Euler scheme in [7, Section 14.1]) of order 1, which is given by

(2.2)  Φ_Δ(x, y) = x + μ(x)Δ + σ(x)√Δ y.

The proposition below summarises important representations for the weak Euler scheme, which were derived in [1].
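Before turning to these representations, the simplified weak Euler scheme can be sketched in code; the drift, diffusion and all numerical values below are hypothetical placeholders:

```python
import numpy as np

# Simplified weak Euler step: the Brownian increment is replaced by an
# i.i.d. Rademacher vector xi in {-1,+1}^m, scaled by sqrt(Delta).
def weak_euler(x0, mu, sigma, T, J, rng):
    dt = T / J
    x = np.asarray(x0, dtype=float)
    m = sigma(x).shape[1]
    for _ in range(J):
        xi = rng.choice([-1.0, 1.0], size=m)   # discrete increments
        x = x + mu(x) * dt + sigma(x) @ (np.sqrt(dt) * xi)
    return x

rng = np.random.default_rng(1)
mu = lambda x: -x                        # toy linear drift
sigma = lambda x: 0.2 * np.eye(len(x))   # toy diagonal diffusion
print(weak_euler(np.zeros(2), mu, sigma, 1.0, 50, rng))
```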
Proposition 2.1. The following representation holds:

(2.3)  f(X_{Δ,T}) = E[f(X_{Δ,T})] + Σ_{j=1}^J Σ_{r=1}^m Σ_{1≤s_1<⋯<s_r≤m} a_{j,r,s}(X_{Δ,(j−1)Δ}) ξ_j^{s_1} ⋯ ξ_j^{s_r},

where we use the notation s = (s_1, . . . , s_r). The coefficients a_{j,r,s}: R^d → R can be computed by the formula

(2.4)  a_{j,r,s}(x) = E[ q_j(Φ_Δ(x, ξ_j)) ξ_j^{s_1} ⋯ ξ_j^{s_r} ]

for all j, r and s as in (2.3). Moreover, we have the following recursion formulas, where the coefficients a_{j,r,s}(x) are given by (2.4).

Proposition 2.2. Assume that μ and σ in (0.1) are Lipschitz continuous, with components μ^i, σ^{i,r}: R^d → R, i = 1, . . . , d, r = 1, . . . , m, being 4 times continuously differentiable with their partial derivatives of order up to 4 having polynomial growth. Let f: R^d → R be 4 times continuously differentiable with partial derivatives of order up to 4 having polynomial growth. Provided that (2.2) holds and that, for sufficiently large p ∈ N, the expectations E|X_{Δ,jΔ}|^{2p} are uniformly bounded in J and j = 0, . . . , J, we have for this "simplified weak Euler scheme" a first order bound for the discretisation error, and E[M_{Δ,T}] = 0, which is due to the martingale transform structure¹ in (2.6).

2.2. Second order weak scheme. Now we treat weak schemes of order 2. We consider a scheme in which the d-dimensional approximations X_{Δ,jΔ}, j = 0, . . . , J, satisfy X_{Δ,0} = x_0 and

(2.7)  X_{Δ,jΔ} = Φ_Δ(X_{Δ,(j−1)Δ}, ξ_j, V_j),  j = 1, . . . , J,

where
(S1) ξ_j = (ξ_j^1, . . . , ξ_j^m), j = 1, . . . , J, are m-dimensional random vectors,
(S2) V_j = (V_j^{il})_{i,l=1}^m, j = 1, . . . , J, are random m × m-matrices,
(S3) the pairs (ξ_j, V_j), j = 1, . . . , J, are i.i.d.,
(S4) for each j, the random elements ξ_j and V_j are independent,
(S5) for each j, the random variables ξ_j^i, i = 1, . . . , m, are i.i.d. with P(ξ_j^i = ±√3) = 1/6, P(ξ_j^i = 0) = 2/3,
(S6) for each j, the random variables V_j^{il}, 1 ≤ i < l ≤ m, are i.i.d. with P(V_j^{il} = ±1) = 1/2, while V_j^{li} = −V_j^{il} and V_j^{ii} = −1, for i, l = 1, . . . , m, j = 1, . . . , J.
Hence, the matrices V_j can be generated by means of m(m−1)/2 i.i.d. random variables. That is, relating to the framework in Section 1, we have m̄-dimensional random vectors ξ̄_j consisting of the coordinates ξ_j^i, i = 1, . . . , m, and V_j^{il}, 1 ≤ i < l ≤ m, with m̄ = m + m(m−1)/2.

Remark 2.3. In order to obtain a second order weak scheme in the multidimensional case, we need to incorporate additional random elements V_j into the structure of the scheme. This is the reason why we now consider (2.7) instead of (2.1).
For instance, to get the second order weak scheme (also called the simplified order 2 weak Taylor scheme) of [7, Section 14.2] in the multidimensional case, we need to define the functions Φ_Δ(x, y, z), x ∈ R^d, y ∈ R^m, z ∈ R^{m×m}, as explained below. First we define the function Σ: R^d → R^{d×d} by the formula Σ(x) := σ(x)σ(x)^⊤ and recall that the coordinates of vectors and matrices are denoted by superscripts, e.g. Σ(x) = (Σ^{kl}(x))_{k,l=1}^d, Φ_Δ(x, y, z) = (Φ^k_Δ(x, y, z))_{k=1}^d. Let us introduce the operators L_r, r = 0, . . . , m, that act on sufficiently smooth functions g: R^d → R as follows:

L_0 g := Σ_{k=1}^d μ^k ∂g/∂x^k + (1/2) Σ_{k,l=1}^d Σ^{kl} ∂²g/(∂x^k ∂x^l),  L_r g := Σ_{k=1}^d σ^{kr} ∂g/∂x^k,  r = 1, . . . , m.

The r-th coordinate Φ^r_Δ, r = 1, . . . , d, in the simplified order 2 weak Taylor scheme of [7, Section 14.2] is now given by the formula

(2.8)  Φ^r_Δ(x, y, z) = x^r + μ^r(x)Δ + √Δ Σ_{k=1}^m σ^{rk}(x) y^k + (Δ/2) Σ_{k,l=1}^m L_k σ^{rl}(x) (y^k y^l + z^{kl}) + (Δ^{3/2}/2) Σ_{k=1}^m (L_0 σ^{rk}(x) + L_k μ^r(x)) y^k + (Δ²/2) L_0 μ^r(x),

provided the coefficients μ and σ of (0.1) are sufficiently smooth. We will need to work explicitly with (2.8) at some point, but all results in this subsection assume structure (2.7) only.

¹ This phrase means that the discrete-time process M̃ = (M̃_l)_{l=0,...,J}, where M̃_0 = 0 and M̃_l is defined like the right-hand side of (2.6) but with Σ_{j=1}^J replaced by Σ_{j=1}^l and a_{j,r,s} by ã_{j,r,s}, is a martingale; this follows by a straightforward calculation.
Let us define the index sets I_1 := {1, . . . , m} and I_2 := {(k, l) ∈ I_1² : k < l}, together with the corresponding system of index triples (o, U_1, U_2), where P(I) denotes the set of all subsets of a set I. For any U_1 ⊆ I_1 and o ∈ {1, 2}^{U_1}, we write o as o = (o_r)_{r∈U_1}. Below we use the convention that a product over the empty set is always one.
For k ∈ N_0, H_k: R → R stands for the (normalised) k-th Hermite polynomial, i.e.

H_k(x) := (1/√(k!)) (−1)^k e^{x²/2} (d^k/dx^k) e^{−x²/2}.

We remark that, in particular, H_0(x) = 1, H_1(x) = x and H_2(x) = (x² − 1)/√2. As in Subsection 2.1, we summarise important representations from [1] below.
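These normalised Hermite polynomials satisfy the classical three-term recursion He_{k+1}(x) = x·He_k(x) − k·He_{k−1}(x); a small sketch (the function name is ours):

```python
import math

# Normalised (probabilists') Hermite polynomial H_k = He_k / sqrt(k!),
# computed via the recursion He_{k+1}(x) = x*He_k(x) - k*He_{k-1}(x),
# so that E[H_j(Z) H_k(Z)] = delta_{jk} for Z ~ N(0, 1).
def hermite_normalized(k, x):
    if k == 0:
        return 1.0
    he_prev, he = 1.0, x                 # He_0 and He_1
    for n in range(1, k):
        he_prev, he = he, x * he - n * he_prev
    return he / math.sqrt(math.factorial(k))

print(hermite_normalized(2, 1.0))        # (x^2 - 1)/sqrt(2) at x = 1 -> 0.0
```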
Theorem 2.4. The following representation holds:

(2.9)  f(X_{Δ,T}) = E[f(X_{Δ,T})] + Σ_{j=1}^J Σ_{(o,U_1,U_2)} a_{j,o,U_1,U_2}(X_{Δ,(j−1)Δ}) Π_{r∈U_1} H_{o_r}(ξ_j^r) Π_{(k,l)∈U_2} V_j^{kl},

where the sum runs over U_1 ∈ P(I_1), U_2 ∈ P(I_2) and o ∈ {1, 2}^{U_1} with (U_1, U_2) ≠ (∅, ∅), and the coefficients a_{j,o,U_1,U_2}: R^d → R can be computed by the formula

(2.10)  a_{j,o,U_1,U_2}(x) = E[ q_j(Φ_Δ(x, ξ_j, V_j)) Π_{r∈U_1} H_{o_r}(ξ_j^r) Π_{(k,l)∈U_2} V_j^{kl} ].

Moreover, we have corresponding recursion formulas for each j ∈ {1, . . . , J} and for all admissible (o, U_1, U_2).

Using Theorem 2.4, we obtain the following result (see Proposition 3.6 in [1]), which provides a bound for the discretisation error and a perfect control variate for the discretised quantity.
Proposition 2.5. Assume that μ and σ in (0.1) are Lipschitz continuous, with components μ^i, σ^{i,r}: R^d → R, i = 1, . . . , d, r = 1, . . . , m, being 6 times continuously differentiable with their partial derivatives of order up to 6 having polynomial growth. Let f: R^d → R be 6 times continuously differentiable with partial derivatives of order up to 6 having polynomial growth. Provided that (2.8) holds and that, for sufficiently large p ∈ N, the expectations E|X_{Δ,jΔ}|^{2p} are uniformly bounded in J and j = 0, . . . , J, we have for this "simplified second order weak Taylor scheme" a second order bound for the discretisation error and E[M_{Δ,T}] = 0 for the perfect control variate

(2.12)  M_{Δ,T} = Σ_{j=1}^J Σ_{(o,U_1,U_2)} a_{j,o,U_1,U_2}(X_{Δ,(j−1)Δ}) Π_{r∈U_1} H_{o_r}(ξ_j^r) Π_{(k,l)∈U_2} V_j^{kl},

where the coefficients a_{j,o,U_1,U_2}(x) are defined in (2.10).

Truncated control variates for weak approximation schemes
Below we recall the assumptions from [1], give sufficient conditions for them in terms of the functions f, μ, σ, and then formulate some stronger conditions that justify the use of truncated control variates.
3.1. Weak Euler scheme. Next we suggest some stronger conditions that give us somewhat more than (A2). Let us define the "truncated control variate"

(3.1)  M^{trunc}_{Δ,T} := Σ_{j=1}^J Σ_{i=1}^m a_{j,1,e_i}(X_{Δ,(j−1)Δ}) ξ_j^i,

where e_i ∈ R^m denotes the i-th unit vector in R^m and a_{j,1,e_i} is given by (cf. (2.4))

a_{j,1,e_i}(x) = E[ q_j(Φ_Δ(x, ξ_j)) ξ_j^i ].

The superscript "trunc" comes from "truncated": we keep in M_{Δ,T} only the terms for which r = 1 (cf. (2.6)). Next we study the truncation error that arises from replacing M_{Δ,T} with M^{trunc}_{Δ,T}. Notice that under Assumption (A2) alone the variance in (3.2) would have been O(1).
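On a single time step, this truncation can be illustrated by exact enumeration; the payoff below and all names are hypothetical stand-ins for q_j ∘ Φ_Δ:

```python
import itertools
import numpy as np

# One-step toy: with m Rademacher increments xi in {-1,+1}^m, the full
# control variate carries one coefficient per nonempty subset S of
# {1,...,m} (the 2^m - 1 products prod_{i in S} xi^i), while the
# truncated version keeps only the m singleton terms a_{j,1,e_i} * xi^i.
m = 3
points = np.array(list(itertools.product([-1.0, 1.0], repeat=m)))  # all 2^m outcomes
g = np.exp(0.5 * points.sum(axis=1))       # toy stand-in for q_j(Phi_Delta(x, xi))

subsets = [s for r in range(1, m + 1)
           for s in itertools.combinations(range(m), r)]
coef = {s: np.mean(g * points[:, list(s)].prod(axis=1)) for s in subsets}

full_cv = sum(coef[s] * points[:, list(s)].prod(axis=1) for s in subsets)
trunc_cv = sum(coef[(i,)] * points[:, i] for i in range(m))

# The full control variate reproduces g - E[g] exactly; the truncated
# one removes only the first-order (r = 1) part of the variance.
print(np.var(g), np.var(g - trunc_cv), np.var(g - full_cv))
```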

3.2. Second order weak scheme. First we recall some of the required assumptions from [1]: let us fix some j ∈ {1, . . . , J}. We assume that, for some positive constants Σ, A, the bounds (B1) and (B2) hold. Below we verify these assumptions.

Theorem 3.5. (ii) Let all the functions μ^k and σ^{ki}, k ∈ {1, . . . , d}, i ∈ {1, . . . , m}, be bounded, the function f be continuously differentiable with bounded partial derivatives, and all the functions μ^k, σ^{ki} be three times continuously differentiable with bounded partial derivatives up to order 3. Then (B2) holds.
Let us define the corresponding index sets. In the following theorem we provide some stronger conditions that give us more than (B2).
Theorem 3.6. (i) Let all the functions μ^k and σ^{ki}, k ∈ {1, . . . , d}, i ∈ {1, . . . , m}, be bounded, the function f be twice continuously differentiable with bounded partial derivatives up to order 2, and all the functions μ^k, σ^{ki} be four times continuously differentiable with bounded partial derivatives up to order 4. Then (B3) holds.
(ii) Let all the functions μ^k and σ^{ki}, k ∈ {1, . . . , d}, i ∈ {1, . . . , m}, be bounded, the function f be three times continuously differentiable with bounded partial derivatives up to order 3, and all the functions μ^k, σ^{ki} be five times continuously differentiable with bounded partial derivatives up to order 5. Then (B4) holds.

An equivalent reformulation of assumptions (B2)-(B4) is as follows: there exists some positive constant Ã such that the corresponding bounds hold with Ã in place of the separate constants.

Similarly to Section 3.1, let us define a truncated control variate through

(3.5)  M^{trunc}_{Δ,T} := Σ_{j=1}^J Σ_{(o,U_1,U_2): Σ_{r∈U_1} o_r + 2|U_2| ≤ 2} a_{j,o,U_1,U_2}(X_{Δ,(j−1)Δ}) Π_{r∈U_1} H_{o_r}(ξ_j^r) Π_{(k,l)∈U_2} V_j^{kl},

which keeps m(m + 1) terms at each time point. Next we derive the truncation error that arises from replacing M_{Δ,T} with M^{trunc}_{Δ,T}.

Theorem 3.8. Suppose that all the functions μ^k and σ^{ki}, k ∈ {1, . . . , d}, i ∈ {1, . . . , m}, are bounded, the function f is three times continuously differentiable with bounded partial derivatives up to order 3, and all the functions μ^k, σ^{ki} are five times continuously differentiable with bounded partial derivatives up to order 5. Then the variance bound (3.6) for the truncation error holds (cf. Proposition 2.5).

Generic regression algorithm
In the previous sections we have given several representations for control variates. Now we discuss how to compute the coefficients in these representations via regression. For the sake of clarity, we focus on second order schemes and control variate (3.5) with coefficients given by (2.10).
Let α̂_{j,o,U_1,U_2} = (α̂_{j,o,U_1,U_2,q})_{q=1}^Q be a solution of the corresponding least squares optimisation problem over the training paths. Define an estimate for the coefficient function a_{j,o,U_1,U_2} via

â_{j,o,U_1,U_2}(·) := Σ_{q=1}^Q α̂_{j,o,U_1,U_2,q} ψ_q(·).

The hat in the above formula emphasises that the estimates â_{j,o,U_1,U_2} of the functions a_{j,o,U_1,U_2} are random in that they depend on the simulated training paths. The cost of computing α̂_{j,o,U_1,U_2} is of order O(N Q²), since each α̂_{j,o,U_1,U_2} solves a linear least squares problem whose normal equations involve a Q × Q matrix with entries indexed by k, l ∈ {1, . . . , Q}. The cost of approximating the whole family of coefficient functions a_{j,o,U_1,U_2} is then obtained by multiplying by the number of coefficients per time point and by J.

Summary of the algorithm. The algorithm consists of two phases: the training phase and the testing phase. In the training phase, we simulate N independent training paths D^tr_N and construct regression estimates â_{j,o,U_1,U_2}(·, D^tr_N) for the coefficients a_{j,o,U_1,U_2}(·). In the testing phase, independently from D^tr_N, we simulate N_0 independent testing paths and average f(X_{Δ,T}) minus the estimated control variate over them. Notice that the result of (4.4) indeed requires the computations above and cannot be stated right from the outset, because the summands in (4.2) are dependent (through D^tr_N). This concludes the description of the generic regression algorithm for constructing the control variate. Further details, such as bounds for the right-hand side of (4.4), depend on the particular implementation, i.e. on the quality of the chosen basis functions.
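One such regression step can be sketched as follows; the state distribution, the response, and the target function sin are synthetic stand-ins, and the basis is the monomial basis (1, x, x²):

```python
import numpy as np

# Least squares estimation of a single coefficient function
# a(x) = E[response | state = x] over a basis psi_1,...,psi_Q:
# cost O(N * Q^2) via the (Q x Q) normal equations / lstsq.
rng = np.random.default_rng(2)
N, Q = 5000, 3
X = rng.uniform(-1.0, 1.0, size=N)         # training states X_{Delta,(j-1)Delta}
xi = rng.choice([-1.0, 1.0], size=N)       # discrete increments on the paths
Y = np.sin(X) * xi + 0.1 * rng.standard_normal(N)   # synthetic responses

Psi = np.vander(X, Q, increasing=True)     # basis matrix: columns 1, x, x^2
alpha, *_ = np.linalg.lstsq(Psi, Y * xi, rcond=None)
a_hat = lambda x: np.vander(np.atleast_1d(x), Q, increasing=True) @ alpha

print(alpha)   # close to the L2 projection of sin onto span{1, x, x^2}
```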

Complexity analysis
In this section we extend the complexity analysis presented in [1] to the case of the "TRCV" (truncated RCV) algorithm. Below we only sketch the main results for the second order schemes. We make the following assumption (cf. [2] and [4]):

(B5) The functions a_{j,o,U_1,U_2}(x) can be well approximated by functions from Ψ_Q := span({ψ_1, . . . , ψ_Q}), in the sense that there are constants κ > 0 and C_κ > 0 such that

inf_{g∈Ψ_Q} ∫ |a_{j,o,U_1,U_2}(x) − g(x)|² P_{Δ,j−1}(dx) ≤ C_κ / Q^κ,

where P_{Δ,j−1} denotes the distribution of X_{Δ,(j−1)Δ}.
Remark 5.1. Note that (B5) is a natural condition to be satisfied for good choices of Ψ_Q. For instance, under appropriate assumptions, in the case of piecewise polynomial regression as described in [1], (B5) is satisfied with κ = 2ν(p+1)/(2d(p+1) + dν), where the parameters p and ν are explained in [1].
In Lemma 5.2 below we present an L²-upper bound for the estimation error of the TRCV algorithm. To this end, we need to describe more precisely how exactly the regression-based approximations of a_{j,o,U_1,U_2} are constructed: let the functions â_{j,o,U_1,U_2}(x) be obtained by regression onto the set of basis functions {ψ_1, . . . , ψ_Q}, while the approximations ã_{j,o,U_1,U_2}(x) of the TRCV algorithm are the truncated estimates

(5.1)  ã_{j,o,U_1,U_2}(x) := T_{Δ_{U_1,U_2} Ã} â_{j,o,U_1,U_2}(x),

where T_L z := (−L) ∨ (z ∧ L) denotes truncation at the level L > 0, and Δ_{U_1,U_2} and Ã are given in (3.3) and (3.4). Lemma 5.2 then provides bound (5.2), where c̃ is a universal constant.

Notice that the expectation on the left-hand side of (5.2) means averaging over the randomness in D^tr_N. Let (X_{Δ,jΔ})_{j=0,...,J} be a testing path, which is independent of the training paths D^tr_N. We then obtain the variance bound (see Theorem 5.3)

Var[f(X_{Δ,T}) − M̃^{trunc}_{Δ,T}] ≤ Δ² + J m(m+1) [ c̃ (Σ + Ã² Δ (log N + 1)) Q/N + 8 C_κ / Q^κ ],

where M̃^{trunc}_{Δ,T} denotes the truncated control variate built from the estimated coefficients ã_{j,o,U_1,U_2}.

5.1. Complexity of the TRCV approach. Let us study the complexity of the TRCV approach. The overall cost is of order J Q max{N Q, N_0}, provided that we only track the quantities which tend to ∞ as ε ↓ 0, with ε being the accuracy to be achieved; that is, constants such as d, m, κ, C_κ are ignored. We have the constraints (5.4), where the first one comes from the squared bias of the estimator and the remaining three come from the variance of the estimator (see Theorem 5.3 as well as footnote 3). We get the following result.
Theorem 5.4. For the TRCV approach with the second order weak schemes under (B1)-(B5), it is optimal to choose the orders of the parameters as follows (cf. [4]),³ provided that κ > 1.⁴ Thus, we have for the complexity the bound (5.5).

³ Notice that the variance of the TRCV estimate is then of order 1, provided that κ > 1.

Remark 5.5. (i) For the sake of comparison with the SMC and MLMC approaches, we recall at this point that their complexities are C_SMC ≍ ε^{−2.5} and C_MLMC ≍ ε^{−2} at best (we are considering the second order scheme).
(ii) Complexity estimate (5.5) shows that one can go beyond the complexity order ε −2 , provided that κ > 9, and that we can achieve the complexity order ε −1.75−δ , for arbitrarily small δ > 0, provided κ is large enough.
(iii) The complexity of the TRCV approach is the same as that of the RCV approach (where the "complete" control variate (2.12) is estimated), since the second constraint in (5.4), which does not arise for the RCV approach, is the only inactive one in this case. That is why we truncated M_{Δ,T} only as far as we did: with a more aggressive truncation, the bound for the variance in (3.6) would have been of order Δ and, due to the resulting constraint 1/(J N_0) ≲ ε², we would have obtained worse complexities than ε^{−2}, since C_TRCV ≳ J N_0.

Numerical results
The results below are based on program code written and vectorised in MATLAB, running on a 64-bit Linux operating system.
Let us consider the following SDE for d = m = 5 (cf. [1]), whose solution is available in closed form for t ∈ [0, 1]. Further, we consider a functional f of X_1. Here we consider weak schemes of the second order and compare the numerical performance of the SMC, MLMC, RCV, TRCV and TSRCV approaches. The latter is the truncated version of the SRCV approach of [4]. Like the RCV algorithm, the SRCV one is based on (2.12); the difference lies only in how the approximations of the coefficients a_{j,o,U_1,U_2} are implemented in practice (while the RCV algorithm is a direct Monte Carlo regression, in the SRCV algorithm the regression is combined with a kind of "stratification"; see [4] for more detail). Therefore, the idea of truncation (i.e. replacing (2.12) with (3.5)) applies also to the SRCV approach and gives us the TSRCV one.

⁴ Performing the full complexity analysis via Lagrange multipliers, one can see that these parameter values are not optimal if κ ≤ 1 (a Lagrange multiplier corresponding to a "≤ 0" constraint is negative). Recall that in the case of piecewise polynomial regression (see [1] and recall Remark 5.1) we have κ = 2ν(p+1)/(2d(p+1) + dν). Let us note that in [1] it is required to choose the parameters p and ν according to p > (d−2)/2 and ν > 2d(p+1)/(2(p+1)−d), which implies that κ > 1 for κ expressed via p and ν by the above formula.
For simplicity we implemented a global regression for the RCV, TRCV and TSRCV approaches (i.e. one without the truncation operator in (5.1), as part of the general description in Section 4). More precisely, we use the quadratic polynomials Π_{i=1}^5 x_i^{l_i}, where l_1, . . . , l_5 ∈ {0, 1, 2} and Σ_{i=1}^5 l_i ≤ 2, as well as f itself as basis functions; hence Ψ_Q consists of Q = 21 + 1 = 22 basis functions. Note that we do not need to consider the random variables V_j^{kl} in the second order weak scheme, since L_k σ^{rl}(x) = 0 for k ≠ l (see (2.8)). This leaves fewer terms for the RCV approach, since the factor 2^{m(m−1)/2} = 1024 is no longer present. As for the TRCV and TSRCV approaches, this gives us only m(m+3)/2 = 20 compared to m(m+1) = 30 terms in (3.5).
We choose κ = 1.2, which corresponds to piecewise polynomial regression with polynomial degree p = 2 (comparable to our setting) in the limiting case ν → ∞ (see footnote 4). Moreover, for each ε = 2^{−i}, i ∈ {2, 3, 4, 5, 6}, we set the parameters J, N and N_0 for the RCV, TRCV and TSRCV approaches according to the formulas in Subsection 5.1, with the prefactors 512 and 2048 included for stability purposes. For the TRCV and SMC algorithms we additionally consider ε = 2^{−7}, which produces a picture with approximately equal maximal computational time (that is, the time corresponding to the best accuracy) for all algorithms. Next we estimate the numerical complexity of the RCV, TRCV and TSRCV approaches by means of 100 independent simulations and compare it with that of the SMC and MLMC approaches, for which we use the same output as in [1]. As can be seen from Figure 1, the estimated numerical complexity is about RMSE^{−1.85} for the RCV approach, RMSE^{−1.83} for the TRCV approach, RMSE^{−1.53} for the TSRCV approach, RMSE^{−2.67} for the SMC approach and RMSE^{−2.01} for the MLMC approach; these exponents are obtained by regressing the log-time (logarithm of the computing time of the whole algorithm in seconds) on the log-RMSE.
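The slope extraction just described (log-time regressed on log-RMSE) can be sketched as follows, with synthetic timings standing in for the measured ones:

```python
import numpy as np

# Estimate a numerical complexity exponent from (RMSE, time) pairs by
# fitting a line to log(time) vs log(RMSE); synthetic data with a true
# exponent of -2 are used here in place of real measurements.
rng = np.random.default_rng(3)
rmse = 2.0 ** -np.arange(2, 8)                    # accuracies eps = 2^-i
time = rmse ** -2.0 * np.exp(0.05 * rng.standard_normal(rmse.size))

slope, intercept = np.polyfit(np.log(rmse), np.log(time), 1)
print(f"estimated complexity: RMSE^{slope:.2f}")  # slope close to -2
```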
Beyond the numerical complexities, we observe that the effect of truncating the RCV algorithm is huge. While the RCV approach performs poorly (as in [1]), i.e. in this region of ε-values it is numerically outperformed by the other approaches, the TRCV and TSRCV approaches work best (even better than the SMC and MLMC approaches).
7.5. Proof of Theorem 3.6. (i) The proof works similarly to the one of Theorem 3.2; that is, we consider a Taylor expansion for q_j(Φ_Δ(x, y, z)) of order 1, around the same point μ_Δ(x) as in the proof of Theorem 3.5. Then we use the corresponding bound whenever |U_2| + |K_2| + |K_1|/2 ≥ 1 (where again Φ̄^k_Δ(x, y, z) = Φ^k_Δ(x, y, z) − μ^k_Δ(x)). Then we apply Theorem 1.1 (case K = 2), which gives us that q_j is twice continuously differentiable with bounded partial derivatives up to order 2 under the assumptions that all the functions μ^k and σ^{ki} are bounded and all the functions f, μ^k, σ^{ki} are four times continuously differentiable with bounded partial derivatives up to order 4. Finally, we get that all the functions a_{j,o,U_1,U_2} are of order Δ, since each product Φ̄^k_Δ(x, y, z) Φ̄^l_Δ(x, y, z), k, l ∈ {1, . . . , d}, is of order Δ under the above assumptions.

7.8. Proof of Theorem 5.3. Using the martingale transform structure in (2.12) and (3.5) (recall footnote 1) together with the orthonormality of the system Π_{r∈U_1} H_{o_r}(ξ_j^r) Π_{(k,l)∈U_2} V_j^{kl}, we get by (3.6) and (5.2) the bound

Δ² + J m(m+1) [ c̃ (Σ + Ã² Δ (log N + 1)) Q/N + 8 C_κ / Q^κ ],

since Δ²_{U_1,U_2} ≤ Δ.

7.9. Proof of Theorem 5.4. The proof is similar to the complexity analysis performed in [3].