A Class of Finite-Dimensional Numerically Solvable McKean-Vlasov Control Problems

We address a class of McKean-Vlasov (MKV) control problems with common noise, called polynomial conditional MKV, which extends the known class of linear-quadratic stochastic MKV control problems. We show how this polynomial class can be reduced, by a suitable Markov embedding, to finite-dimensional stochastic control problems, and we provide a discussion and comparison of three probabilistic numerical methods for solving the reduced control problem: quantization, regression by control randomization, and regress-later methods. Our numerical results are illustrated on various examples from portfolio selection and liquidation under drift uncertainty, and on a model of interbank systemic risk with partial observation.


Introduction
The optimal control of McKean-Vlasov (also called mean-field) dynamics is a rather new topic in the area of stochastic control and applied probability, which has seen a surge of interest with the emergence of mean-field game theory. It is motivated on the one hand by the asymptotic formulation of cooperative equilibria for a large population of particles (players) in mean-field interaction, and on the other hand by control problems whose cost functionals involve a nonlinear function of the law of the state process (e.g., the mean-variance portfolio selection problem or risk measures in finance).
In this paper, we are interested in McKean-Vlasov (MKV) control problems under partial observation and common noise, formulated as follows. On a probability space (Ω, F, P) equipped with two independent Brownian motions B and W^0, consider the controlled stochastic MKV dynamics in R^n:

dX_s = b(X_s, P^{W^0}_{X_s}, α_s) ds + σ(X_s, P^{W^0}_{X_s}, α_s) dB_s + σ_0(X_s, P^{W^0}_{X_s}, α_s) dW^0_s,   X_0 = x_0 ∈ R^n,   (1.1)

where P^{W^0}_{X_s} denotes the conditional distribution of X_s given W^0 (or equivalently given F^0_s, where F^0 = (F^0_t)_t is the natural filtration generated by W^0), and the control α is F^0-progressive, valued in some Polish space A. This measurability condition on the control means that the controller has a partial observation of the state, in the sense that she can only observe the common noise. We make the standard Lipschitz condition on the coefficients b(x, µ, a), σ(x, µ, a), σ_0(x, µ, a) with respect to (x, µ) in R^n × P_2(R^n), uniformly in a ∈ A, where P_2(R^n) is the set of all probability measures on (R^n, B(R^n)) with a finite second-order moment, endowed with the 2-Wasserstein metric W_2. This ensures the well-posedness of the controlled MKV stochastic differential equation (SDE) (1.1). The cost functional over a finite horizon T associated to the stochastic MKV equation (1.1) (sometimes called the conditional MKV equation) for a control process α is

J(α) = E[ ∫_0^T f̂(P^{W^0}_{X_s}, α_s) ds + ĝ(P^{W^0}_{X_T}) ],   (1.2)

and the objective is to maximize J over an admissible set A of control processes. The set A of admissible controls usually requires some integrability conditions, depending on the growth of f̂ and ĝ, so that J(α) is well defined for α ∈ A (more details are given in the examples; see Section 4). Notice that the classical partial observation control problem (without MKV dependence in the coefficients) arises as a particular case of (1.1)-(1.2). We refer to the introduction of [19] for details.
Let us recall from [19] the dynamic programming equation associated to the conditional MKV control problem (1.2). We start by defining a suitable dynamic version of this problem. Consider a sub-σ-algebra F_0 of F independent of B and W^0. It is assumed w.l.o.g. that F_0 is rich enough, in the sense that P_2(R^n) = {L(ξ) : ξ ∈ L^2(F_0; R^n)}, where L(ξ) denotes the law of ξ. Given a control α ∈ A, we consider the dynamic version of (1.1) starting from ξ ∈ L^2(F_0; R^n) at time t ∈ [0, T], written for t ≤ s, with µ = L(ξ) ∈ P_2(R^n).
It is shown in [19] that the dynamic programming principle (DPP) holds for the conditional MKV control problem (1.5): for (t, µ) ∈ [0, T] × P_2(R^n) and for any F^0-stopping time θ valued in [t, T]. Next, relying on the notion of differentiability with respect to probability measures introduced by P. L. Lions [13] (see also the lecture notes [4]) and the chain rule (Itô's formula) along a flow of probability measures (see [3], [6]), we derive the HJB equation for v:

∂_t v(t, µ) + sup_{a∈A} [ f̂(µ, a) + µ( L^a v(t, µ) ) + µ⊗µ( M^a v(t, µ) ) ] = 0,   (t, µ) ∈ [0, T) × P_2(R^n),
v(T, µ) = ĝ(µ),   µ ∈ P_2(R^n),   (1.6)

where, for φ ∈ C^2_b(P_2(R^n)), a ∈ A and µ ∈ P_2(R^n), L^a φ(µ) is the function R^n → R defined by

L^a φ(µ)(x) = ∂_µ φ(µ)(x)·b(x, µ, a) + (1/2) tr( ∂_x ∂_µ φ(µ)(x) (σσ^⊺ + σ_0 σ_0^⊺)(x, µ, a) ),   (1.7)

and M^a φ(µ) is the function R^n × R^n → R defined by

M^a φ(µ)(x, x′) = (1/2) tr( ∂²_µ φ(µ)(x, x′) σ_0(x, µ, a) σ_0^⊺(x′, µ, a) ).   (1.8)

The HJB equation (1.6) is a fully nonlinear partial differential equation (PDE) on the infinite-dimensional Wasserstein space. In general, this PDE does not admit an explicit solution, except in the notable and important class of linear-quadratic MKV control problems. Numerical resolution of the MKV control problem, or equivalently of the associated HJB equation, is challenging due to the nonlinearity of the optimization problem and the infinite-dimensional nature of the Wasserstein space. In this work, our purpose is to investigate a class of MKV control problems that can be reduced to finite-dimensional problems with a view to numerical resolution.

Main assumptions
We make two kinds of assumptions on the coefficients of the model: one on the dependence on x and the other on the dependence on µ.

Numerical methods
In this section, we introduce our numerical methods for the resolution of the reduced problem (2.3)-(2.4).
Let us introduce the process Z^α, valued in R^d, controlled by an adapted process α taking values in A, solution to the SDE (3.1), and the associated performance measure J(t, z, α), which assesses the average performance of the control. Introduce now a time discretization t_n = n∆t, n = 0, ..., N, with ∆t = T/N, and denote by A^∆t the space of discrete processes (α_{t_n})_{n=0}^{N−1} such that, for all n = 0, ..., N−1, α_{t_n} is F^0_{t_n}-measurable. We write the Euler approximation of the SDE governing the process Z = Z^α, with α ∈ A^∆t (to alleviate notation, we sometimes omit the dependence on α when there is no ambiguity, and keep the same notation Z for the discrete-time and continuous-time processes), as

Z_{t_{n+1}} = G_∆t(Z_{t_n}, α_{t_n}, ∆W^0_{t_n}),   (3.3)

where ∆W^0_{t_n} ∼ N(0, ∆t) is an increment of W^0. The discrete-time approximation of J(t_n, z, α), for α ∈ A^∆t, is denoted J_∆t(t_n, z, α).
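The Euler step above can be sketched in code. The drift, volatility and feedback control below are toy placeholder choices (not taken from the paper); the point is only the structure of the discrete-time transition Z_{t_{n+1}} = G_∆t(Z_{t_n}, α_{t_n}, ∆W^0_{t_n}).

```python
import numpy as np

def euler_path(z0, control, b, sigma0, T=1.0, N=50, rng=None):
    """Simulate one Euler path of dZ = b(Z, a) dt + sigma0(Z, a) dW^0
    under a feedback control a = control(t, z). All model ingredients
    are illustrative."""
    rng = np.random.default_rng(rng)
    dt = T / N
    z = np.empty(N + 1)
    z[0] = z0
    for n in range(N):
        a = control(n * dt, z[n])
        dW = rng.normal(0.0, np.sqrt(dt))   # increment of the common noise W^0
        # one step of the transition map G_dt(z, a, dW)
        z[n + 1] = z[n] + b(z[n], a) * dt + sigma0(z[n], a) * dW
    return z

# Example with a mean-reverting toy drift and constant volatility.
path = euler_path(z0=0.0,
                  control=lambda t, z: 1.0,
                  b=lambda z, a: a - z,
                  sigma0=lambda z, a: 0.2,
                  rng=0)
```
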

Value and Performance iteration
For n " 0, . . . , N , consider V ∆t pt n , zq " sup αPA ∆t J ∆t pt n , z, αq, the discrete time approximation of the value function at time t n : V`t n , z˘" sup αPA Jpt n , z, αq. The dynamic programming principle states that`V ∆t pt n ,¨q˘0 ďnďN is solution to the Bellman equation: where E a n,z r¨s denotes the expectation conditioned on the event tZ tn " zu and when using the control α tn " a at time t n . Observe that for n " 0, . . . , N´1, the equation (3.5) provides a backward procedure to recursively compute the V ∆t pt n ,¨q if we know how to analytically compute the conditional expectations E a n,z rV ∆t pt n`1 , Z tn`1 qs for all z P R d and all control a P A. We refer to the procedure in (3.5) as value iteration. An alternative approach to compute V ∆t pt n ,¨q, for n " 0, . . . , N´1, is to notice that once again by the dynamic programming principle, it holds that`V ∆t pt n ,¨q˘0 ďnďN is solution to the backward equation where for k " n`1, . . . , N´1, the control αt k is the optimal control at time t k defined as follows: and where`Zt k˘nďkďN is the process Z controlled by the following control α from time t n to t N : (3.8) For n " 0, . . . , N´1, the scheme (3.6) provides once again a backward procedure to compute V ∆t pt n ,¨q, assuming that we know how to analytically compute the conditional expectations E a n,z for all z P R d and all control a P A. We refer to the procedure in (3.6) as performance iteration.
Except in trivial cases, closed-form formulas for the conditional expectations appearing in the value and performance iteration procedures are not available, and they have to be approximated; this is the main difficulty when implementing either approach. In the next section, we discuss different ways to approximate conditional expectations and derive the corresponding estimates of the value functions V_∆t(t_n, ·) for n = 0, ..., N−1.

Approximation of conditional expectations
In this subsection, we present three numerical methods that we later apply to conditional MKV problems. Two of these methods belong to the class of Regression Monte Carlo techniques, a family of algorithms whose effectiveness relies heavily on the choice of the basis functions used to approximate conditional expectations; the third, Quantization, approximates the controlled process Z^α_{t_n} by a finite-state Markov chain for which expectations can be computed quickly.

Regression Monte Carlo
In the simpler uncontrolled case, the family of Regression Monte Carlo algorithms is based on the idea of approximating the conditional expectation E[V_∆t(t_{n+1}, Z_{t_{n+1}}) | Z_{t_n}], for n = 0, ..., N−1, by the orthogonal projection of V_∆t(t_{n+1}, Z_{t_{n+1}}) onto the space spanned by a finite family {φ_k(Z_{t_n})}_{k≥1}, where (φ_k)_{k≥1} is a family of basis functions, i.e., measurable real-valued functions on R^d such that (φ_k(Z_{t_n}))_{k≥1} is total in L^2(σ(Z_{t_n})) and such that, for all scalars β_k and all K ≥ 1, if Σ_{k=1}^K β_k φ_k(Z_{t_n}) = 0 a.s. then β_k = 0 for k = 1, ..., K. The expectation E[V_∆t(t_{n+1}, Z_{t_{n+1}}) | Z_{t_n}] is then approximated as in (3.9), where K ≥ 1 is fixed and β^n = (β^n_1, ..., β^n_K)^⊺ is defined by (3.10). Notice that β^n is defined in (3.10) as the minimizer of a quadratic function, and straightforward calculations then give

β^n = E[ φ(Z_{t_n}) φ(Z_{t_n})^⊺ ]^{−1} E[ φ(Z_{t_n}) V_∆t(t_{n+1}, Z_{t_{n+1}}) ],   (3.11)

where we use the notation φ = (φ_1, ..., φ_K)^⊺, and where we assume that E[φ(Z_{t_n}) φ(Z_{t_n})^⊺] is invertible.
To estimate the solution of (3.11), we rely on Monte Carlo simulations to approximate expectations by finite sums. Consider the training set {(Z^m_{t_n}, Z^m_{t_{n+1}})}_{m=1}^M at time t_n, obtained by running M ≥ 1 forward simulations of the process Z from time t_0 = 0 to t_{n+1}. The coefficient β^n defined in (3.11) can then be estimated as in (3.12), where we denote by Â^M_n the estimator (1/M) Σ_{m=1}^M φ(Z^m_{t_n}) φ(Z^m_{t_n})^⊺ of the covariance matrix A_n = E[φ(Z_{t_n}) φ(Z_{t_n})^⊺]. The procedure presented above offers a convenient means to approximate conditional expectations when the dynamics of the process Z are uncontrolled. In the controlled case, however, one has to account for the effect of the control on the conditional expectations: either explicitly, via Control Randomization, or implicitly, via Regress-Later.
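The regression step can be sketched for a toy uncontrolled transition. The monomial basis and the dynamics below are illustrative choices; solving the normal equations via least squares is numerically equivalent to inverting the empirical covariance matrix as in (3.12), but more stable.

```python
import numpy as np

def regress_now(Z_n, V_next_vals, basis):
    """Least-squares projection of sampled values of V(t_{n+1}, .) onto
    basis functions evaluated at Z_{t_n}."""
    Phi = np.column_stack([phi(Z_n) for phi in basis])   # M x K design matrix
    beta, *_ = np.linalg.lstsq(Phi, V_next_vals, rcond=None)
    return beta

rng = np.random.default_rng(0)
M = 5000
Z_n = rng.normal(size=M)
Z_np1 = Z_n + 0.1 * rng.normal(size=M)     # toy transition, noise std 0.1
V_next = Z_np1**2                          # toy value function at t_{n+1}
basis = [lambda z: np.ones_like(z), lambda z: z, lambda z: z**2]
beta = regress_now(Z_n, V_next, basis)
# Here E[Z_{n+1}^2 | Z_n = z] = z^2 + 0.01, so beta should approach (0.01, 0, 1).
```
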

Control Randomization
In order to account explicitly for the effect of the control, one can introduce a dependence on the control in the basis functions. The basic idea of Control Randomization, as introduced in [11], consists in replacing, in the dynamics of Z, the endogenous control by an exogenous control (I_{t_n})_{0≤n≤N}, whose trajectories can then be simulated from time t_0 to time t_N. Consider the training set {Z^m_{t_n}, I^m_{t_n}}_{n=0,m=1}^{N,M}, with M ≥ 1, where the I^m_{t_n} are i.i.d. samples from a "training distribution" µ_n with support in A. The training set is used to estimate the optimal β^n coefficients for n = 0, ..., N−1. In the case of value iteration, {V_∆t(t_{n+1}, Z^m_{t_{n+1}})}_{m=1}^M is regressed against basis functions (which, in this context, are functions of the state and the control) evaluated at the points {Z^m_{t_n}, I^m_{t_n}}_{m=1}^M, where φ = (φ_1, ..., φ_K)^⊺ and Â is an estimator of the covariance matrix A_n = E[φ(Z_{t_n}, I_{t_n}) φ(Z_{t_n}, I_{t_n})^⊺]. Notice that in Control Randomization-based methods the basis functions take state and action variables as input, i.e., their domain is R^d × A. Observe also that the estimated conditional expectation depends strongly on the choice of the randomization of the control. An optimal feedback control at time t_n given Z_{t_n} = z is approximated by the expression (3.16) (see Subsection 3.4 for more practical details on the computation of the argmax). In particular, different randomized controls may drive the process Z to very different locations, and the estimates will suffer from inaccuracy in the states that have been rarely visited.
The value function at time t_n is then estimated via the Control Randomization method and the value iteration procedure. Notice that Control Randomization can easily be employed in a performance iteration procedure by computing the controls (3.16), keeping in mind that at each time t_n we need to re-simulate new trajectories {Ẑ^m_{t_k}}_{k=n,m=1}^{N,M} iteratively from the initial condition Ẑ^m_{t_n} = z, using the estimated optimal strategies (α̂_{t_k})_{k=n+1}^{N−1} to compute the required quantities.
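The state-and-control regression at the heart of Control Randomization can be sketched as follows. The dynamics, continuation value and quadratic basis in (z, a) are all toy choices for illustration; the dummy controls are drawn uniformly, and the fitted surface is then maximized over a grid of controls.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 8000
Z_n = rng.normal(size=M)                 # training states at t_n
I_n = rng.uniform(-1.0, 1.0, size=M)     # randomized ("dummy") controls
dt, sig = 0.1, 0.3
Z_np1 = Z_n + I_n * dt + sig * np.sqrt(dt) * rng.normal(size=M)
V_next = -Z_np1**2                       # toy continuation value at t_{n+1}

def phi(z, a):
    """Quadratic basis in (state, control): {1, z, a, z^2, z a, a^2}."""
    return np.column_stack([np.ones_like(z), z, a, z**2, z * a, a**2])

beta, *_ = np.linalg.lstsq(phi(Z_n, I_n), V_next, rcond=None)

def q_hat(z, a):
    """Estimated E[V(t_{n+1}, Z_{t_{n+1}}) | Z_{t_n} = z, control a]."""
    return (phi(np.atleast_1d(z), np.atleast_1d(a)) @ beta).item()

# Approximate optimal feedback at z = 0.5 by grid search over the controls.
a_grid = np.linspace(-1.0, 1.0, 101)
a_star = a_grid[np.argmax([q_hat(0.5, a) for a in a_grid])]
```

For this toy model the exact continuation value at z = 0.5 is decreasing in a on [−1, 1], so the estimated feedback should sit near the left endpoint.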

Regress-Later
We now present the Regress-Later idea, in which the conditional expectation with respect to Z_{t_n} is computed in two stages. First, a conditional expectation with respect to Z_{t_{n+1}} is approximated in a regression step by a linear combination of basis functions of Z_{t_{n+1}}. Then, analytical formulas are applied to condition this linear combination of functions of future values on the present value Z_{t_n}. For further details see [8], [15] or [2]. With this approach, the effect of the control is factored in implicitly, through its effect on the distribution of Z_{t_{n+1}} conditioned on Z_{t_n}. Unlike the traditional Regress-Now method for approximating conditional expectations (which we have discussed so far, in the uncontrolled case and in Control Randomization), the Regress-Later approach, as studied in [2], imposes a condition on the basis functions: their conditional expectations given Z_{t_n} must be computable analytically.
Using the Regress-Later approximation of the conditional expectation and recalling Assumption 3.1, we obtain the optimal control α^m_{t_n} corresponding to the point Z^m_{t_n}, sampled independently from a "training distribution" µ_n (see Subsection 3.3 for further details). Notice that we are able to exploit the linearity of conditional expectations because β̂ is a constant once the training sets at times t_k, k = n+1, ..., N, are fixed.
The value function at time t_n is then estimated via the Regress-Later method and the value iteration procedure. Notice that, contrary to Control Randomization, Regress-Later does not require the training points to be distributed as Z_{t_{n+1}} conditioned on Z_{t_n}, because the projection (3.17) approximates an expectation under the measure µ_n, which can be chosen freely to optimize the precision of the sample estimate. On the other hand, Regress-Later, like Control Randomization, can easily be employed in a performance iteration procedure by generating forward trajectories at each time step.
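The two-stage Regress-Later idea can be sketched on a toy model chosen so that the second, analytic stage is explicit: a Gaussian Euler step and a monomial basis, whose conditional moments given (Z_{t_n} = z, control a) are known in closed form. All ingredients below are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
M, dt, sig = 4000, 0.1, 0.3
Z_np1 = rng.normal(size=M)               # training points at t_{n+1}
V_np1 = np.cos(Z_np1)                    # sampled values of V(t_{n+1}, .), toy choice

# Stage 1: regress V(t_{n+1}, .) itself on the basis {1, z, z^2} of Z_{t_{n+1}}.
Phi = np.column_stack([np.ones(M), Z_np1, Z_np1**2])
beta, *_ = np.linalg.lstsq(Phi, V_np1, rcond=None)

# Stage 2: for the Euler step Z_{n+1} = z + a dt + sig sqrt(dt) eps, each basis
# function has an exact conditional expectation:
#   E[1 | z, a] = 1,  E[Z_{n+1} | z, a] = m,  E[Z_{n+1}^2 | z, a] = m^2 + sig^2 dt,
# with m = z + a dt, so conditioning the fitted combination is analytic.
def cond_exp(z, a):
    m = z + a * dt
    return beta[0] + beta[1] * m + beta[2] * (m**2 + sig**2 * dt)
```

Because V here is even and the training points are roughly symmetric, the fitted linear coefficient should be close to zero, which gives a small sanity check.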

Quantization
We propose in this section a Quantization-based algorithm to numerically solve control problems; we also refer to it as the Q-algorithm, or simply Q, in the numerical examples considered in Section 4, where Q stands for Quantization. Let us first introduce some ingredients of Quantization, and then propose different ways of using them to approximate conditional expectations. Let (E, |·|) be a Euclidean space. We denote by ε̂ an L-quantizer of an E-valued random variable ε, that is, a discrete random variable on a grid Γ = {e_1, ..., e_L} ⊂ E^L defined by ε̂ = Proj_Γ(ε), where C_1(Γ), ..., C_L(Γ) are the Voronoi cells corresponding to Γ, i.e., they form a Borel partition of E such that each C_l(Γ) is contained in the set of points closer to e_l than to any other grid point, and where Proj_Γ stands for the Euclidean projection onto Γ.
The discrete law of ε̂ is then characterized by the weights p_l = P(ε̂ = e_l) = P(ε ∈ C_l(Γ)), l = 1, ..., L. The grid of points (e_l)_{l=1}^L which minimizes the L^2-quantization error ||ε − ε̂||_2 leads to the so-called optimal L-quantizer, and can be obtained by a stochastic gradient descent method, known as the Kohonen algorithm or competitive learning vector quantization (CLVQ) algorithm, which also provides as a byproduct an estimate of the discrete law (p_l)_{l=1}^L. We refer to [16] for a description of the algorithm, and mention that, for the normal distribution, the optimal grids and the weights of the Voronoi tessellations are precomputed for dimensions up to 10 and are available at http://www.quantize.maths-fi.com. In practice, optimal grids of the one-dimensional Gaussian random variable N(0, 1) with 25 to 50 points have been used to solve the control problems considered in Section 4. We now propose two ways to approximate conditional expectations. The first belongs to the family of piecewise constant approximations; the second improves on the first by preserving the continuity of the approximation w.r.t. the control variable. In the sequel, assume that for n = 0, ..., N−1 we have a set Γ_n of points in R^d, which should be thought of as a training set used for estimating V(t_n, ·). See Subsection 3.3 for more details on how to build Γ_n.
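As an illustration of the quantization step, the following sketch builds an L-point quantizer of a standard Gaussian via Lloyd's fixed-point iteration on a large sample. This is a deterministic batch alternative to the stochastic CLVQ algorithm mentioned above (not the paper's method, and in practice one would simply download the precomputed optimal grids); it also returns the empirical weights of the Voronoi cells.

```python
import numpy as np

def lloyd_quantizer(samples, L=25, n_iter=30):
    """Lloyd's iteration: alternate nearest-point assignment (Voronoi cells)
    and centroid updates, then estimate the cell weights p_l empirically."""
    grid = np.quantile(samples, (np.arange(L) + 0.5) / L)   # initial grid
    for _ in range(n_iter):
        # assign each sample to its nearest grid point
        idx = np.argmin(np.abs(samples[:, None] - grid[None, :]), axis=1)
        # move each grid point to the mean of its cell
        for l in range(L):
            cell = samples[idx == l]
            if cell.size:
                grid[l] = cell.mean()
    idx = np.argmin(np.abs(samples[:, None] - grid[None, :]), axis=1)
    weights = np.bincount(idx, minlength=L) / samples.size
    order = np.argsort(grid)
    return grid[order], weights[order]

rng = np.random.default_rng(0)
grid, p = lloyd_quantizer(rng.standard_normal(100_000), L=25)
```

A quick check: the weights sum to one, the quantizer is roughly centered, and its second moment sits just below 1 (quantization always loses a little variance).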

Piecewise constant interpolation
We assume here that we already have an estimate of V_∆t(t_{n+1}, ·), the value function at time t_{n+1}, for n ∈ {0, ..., N−1}, which we denote by V̂^Q_∆t(t_{n+1}, ·). The conditional expectation E^a_{n,z} is then approximated as

Ê^a_{n,z}[ V̂^Q_∆t(t_{n+1}, ·) ] = Σ_{l=1}^L p_l V̂^Q_∆t( t_{n+1}, Proj_{Γ_{n+1}}(G_∆t(z, a, e_l)) ),

where:
• G_∆t is defined using the notation introduced in (3.3);
• Proj_{Γ_n}(·) stands for the Euclidean projection onto Γ_n;
• Γ = {e_1, ..., e_L} and (p_l)_{1≤l≤L} are respectively the optimal L-quantizer and its associated sequence of weights for the exogenous noise ε (see above for more details).
An optimal feedback control at time t_n and point z ∈ Γ_n is approximated by maximizing this quantity over a ∈ A (see Subsection 3.4 for more practical details on the computation of the argmax), and the value function at time t_n is then estimated, using the piecewise constant approximation and the value iteration procedure, as the corresponding maximum. Note, however, that this approximation is piecewise constant, hence discontinuous, w.r.t. the control variable (see Figure 1 p.21). As a result, the optimization over the control space suffers from high instability and inaccuracy, which implies a poor estimate of the value function V(t_n, ·).

Semi-linear interpolation
Once again, we assume here that we already have V̂^Q_∆t(t_{n+1}, ·), an estimate of the value function at time t_{n+1}, with n ∈ {0, ..., N−1}, and we wish to estimate the conditional expectation in the particular case where the controlled process lies in dimension d = 1. Consider the following piecewise linear approximation of the conditional expectation, which is continuous w.r.t. the control variable a:

Ê^a_{n,z}[ V̂^Q_∆t(t_{n+1}, ·) ] = Σ_{l=1}^L p_l [ λ^{e_l,z}_a V̂^Q_∆t(t_{n+1}, z^+) + (1 − λ^{e_l,z}_a) V̂^Q_∆t(t_{n+1}, z^−) ],   (3.21)

where for all l = 1, ..., L, z^− and z^+ are defined as follows:
• z^− and z^+ are the two closest states in Γ_{n+1} to G_∆t(z, a, e_l) such that z^− < G_∆t(z, a, e_l) < z^+, if such states exist; in this case we define λ^{e_l,z}_a = ( G_∆t(z, a, e_l) − z^− ) / ( z^+ − z^− ).
• Otherwise, z^− and z^+ are equal, defined as the closest state in Γ_{n+1} to G_∆t(z, a, e_l), and we define λ^{e_l,z}_a = 1.

Remark 3.3.
This second approximation is continuous w.r.t. the control variable, which brings stability and accuracy to the optimal control task (see Subsection 3.4), and also ensures an accurate estimate of the value function at time t n . We will mainly use this approximation in the numerical tests (see Section 4).

Remark 3.4.
Although the dimension d = 1 plays a central role in clearly defining the states z^− and z^+ in (3.21), the semi-linear approximation can actually be generalized to a certain class of control problems of dimension greater than 1, using multi-dimensional Quantization (see, e.g., the comments on the Q-algorithm designed to solve the Portfolio Optimization example in Subsection 4.1.2). However, it is not well suited to solving general control problems in dimension greater than 1. For those cases, other interpolation methods, such as Gaussian processes, are more appropriate (see, e.g., [14] for an introduction to the use of Gaussian processes in Regression Monte Carlo).
The optimal feedback control at time t_n and point z ∈ Γ_n is approximated as the argmax over a ∈ A of the right-hand side of (3.21) (see Subsection 3.4 for more practical details on the computation of the argmax). The value function at time t_n is then estimated, using the semi-linear approximation and the value iteration procedure, as the corresponding maximum, where z^+ and z^− are defined using the control α̂^Q_{t_n}(z); see (3.21) for their definitions.
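The semi-linear (quantized) approximation can be sketched in the one-dimensional case as follows. For each quantizer point e_l, the successor G_∆t(z, a, e_l) is linearly interpolated between its two grid neighbours z^− and z^+, which keeps the map a ↦ Ê^a_{n,z} continuous; the toy dynamics in the example are illustrative.

```python
import numpy as np

def semilinear_cond_exp(z, a, grid_next, V_next, e, p, G):
    """grid_next: sorted training points at t_{n+1}; V_next: values of the
    estimated value function on grid_next; (e, p): quantizer points and weights."""
    total = 0.0
    for e_l, p_l in zip(e, p):
        g = G(z, a, e_l)
        j = np.searchsorted(grid_next, g)
        if j == 0:                          # left of the grid: nearest point
            total += p_l * V_next[0]
        elif j == len(grid_next):           # right of the grid: nearest point
            total += p_l * V_next[-1]
        else:
            zm, zp = grid_next[j - 1], grid_next[j]
            lam = (g - zm) / (zp - zm)      # interpolation weight as in (3.21)
            total += p_l * ((1.0 - lam) * V_next[j - 1] + lam * V_next[j])
    return total

# On a linear value function the interpolation is exact, which makes a good check.
grid_next = np.linspace(-2.0, 2.0, 9)
V_next = 3.0 * grid_next + 1.0
est = semilinear_cond_exp(0.0, 0.3, grid_next, V_next,
                          e=[-1.0, 0.0, 1.0], p=[0.25, 0.5, 0.25],
                          G=lambda z, a, el: z + a + 0.5 * el)
```
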

Training points design
We discuss here the choice of the training measure µ and the sets pΓ n q n"0,...,N´1 used to compute the numerical approximations in Regression Monte Carlo and Quantization. Two cases are considered in this section. The first one is a knowledge-based selection, relevant when the controller knows with a certain degree of confidence where the process has to be driven in order to optimize her reward functional. The second case, on the other hand, is when the controller has no idea where or how to drive the process to optimize the reward functional.

Exploitation only strategy
In the knowledge-based setting there is no need for exhaustive and expensive (in time mainly) exploration of the state space, and the controller can directly choose training sets Γ constructed from distributions µ that assign more points to the parts of the state space where the optimal process is likely to be driven.
In practice, at time t_n, if we know that the optimal process is likely to stay in a ball centered at the point m_n with radius r_n, we choose a training measure µ_n centered at m_n, for example N(m_n, r_n^2), and build the training set as a sample from it. In the Regress-Later setting this can be done straightforwardly, while Control Randomization requires selecting a measure for the random control such that the controlled process Z is driven to that area of the state space.
Taking samples from µ to build the grids makes them random. Another choice, which we used in the Quantization-based algorithm, is to take the (deterministic) optimal grid of N(m_n, σ_n^2) with reduced size (typically 50 points for a problem in dimension 1, or 250 for one in dimension 2, when σ_n^2 = 1), available at www.quantize.maths-fi.com, in order to reduce the size of the training set and alleviate the complexity of the algorithms.
Remark 3.5. As the reader will see, we chose the training sets based on the "exploitation only" procedure, i.e., by guessing where to drive the process optimally, when solving the Liquidation Problem introduced in Subsection 4.1.
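The knowledge-based design can be sketched in a few lines: at each date t_n the training set is sampled from a Gaussian µ_n = N(m_n, r_n^2) centred where the optimally controlled process is expected to live. The centres and radii below are illustrative placeholders.

```python
import numpy as np

def training_sets(m, r, size=500, rng=0):
    """Build one training set per time step by sampling mu_n = N(m_n, r_n^2).
    m, r: lists of centres and radii, one entry per time step."""
    rng = np.random.default_rng(rng)
    return [rng.normal(m_n, r_n, size=size) for m_n, r_n in zip(m, r)]

# Example: the anticipated region drifts upward and widens over time.
grids = training_sets(m=[0.0, 0.5, 1.0], r=[1.0, 1.0, 1.5])
```
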

Explore first, exploit later
Explore first: If the agent has no idea where to drive the process to receive large rewards, she can always begin with an exploration step to discover favorable subsets of the state space. To do so, the Γ_n, for n = 0, ..., N−1, can be built as uniform grids covering a large part of the state space, or µ can be chosen uniform on such a domain. It is essential to explore far enough to gain a good understanding of where to drive the process, and where not to.

Exploit later:
The estimates of the optimal controls at time t_n, n = 0, ..., N−1, that come out of the Explore first step are relatively good, in the sense that they manage to steer the process away from unfavorable areas of the state space. However, the training sets used to compute these estimates are too sparse for the estimation to be accurate. To improve accuracy, the natural idea is to build new training sets by simulating the process M times using the optimal strategy estimated in the Explore first step, and then to re-estimate the optimal strategies on the new training sets. This trick can be seen as a two-step algorithm that improves the estimate of the optimal control.
Remark 3.6. In Control Randomization, multiple runs of the method are often needed to obtain precise estimates, because the initial choice of the dummy control may drive the training points far from where the optimal control would have driven them. In practice, after an approximate policy has been computed backward in time, it is used to drive M forward simulations of the process, which in turn produce control paths that can be fed as random controls into a new backward procedure, leading to more accurate results.
Remark 3.7. We applied the "explore first, exploit later" idea to solve the Portfolio Optimization problem introduced in Subsection 4.1.

Optimal control searching
Assume in this section that we already have the estimates V̂_∆t(t_k, ·) of the value function at time t_k, for k = n+1, ..., N, and want to estimate the value function V(t_n, ·) at time t_n. The optimal control searching task consists in optimizing, over the control space A and for each z ∈ Γ_n, the function a ↦ f(z, a)∆t + Ê^a_{n,z}[ V̂_∆t(t_{n+1}, Z_{t_{n+1}}) ], where we denote by Ê^a_{n,z}[ V̂_∆t(t_{n+1}, Z_{t_{n+1}}) ] an approximation of E^a_{n,z}[ V̂_∆t(t_{n+1}, Z_{t_{n+1}}) ] obtained by the Regress-Later, Control Randomization or Quantization-based methods (see Subsection 3.2). Once again, we recall that the importance of this task is motivated by the dynamic programming principle, which states that for all n = 0, ..., N−1 the value function at time n can be approximated as the supremum of this function over a ∈ A, where V̂_∆t(t_n, ·) is our desired estimate of the value function at time n.

Low cardinality control set
In the case where the control space A is discrete (with a relatively small cardinality), one can solve the optimization problem by an exhaustive search over all the available controls without compromising the computational speed.

Remark 3.8.
Note that when the control space is continuous, one can always discretize it in order to rely on the effectiveness of exhaustive search to solve the optimal control problem. However, the discretization of the control space introduces an error, so the discretization might have to include a large number of points to keep this error small, thereby causing a considerable slowdown of the computations.

High cardinality/continuous control space
If we assume differentiability almost everywhere, which holds for the semi-linear approximation in Quantization and for most choices of basis functions in Regression Monte Carlo, we can carry out the optimization step using standard algorithms for the optimization of differentiable functions. Many such algorithms (Brent, Golden-section search, Newton gradient descent, ...) are already implemented in the standard libraries of most programming languages, such as Python (see, e.g., the package scipy.optimize), Julia (see, e.g., the package Optim.jl), and C/C++ (see, e.g., the package NLopt). Concretely, in all the examples considered in Section 4, we used the Golden-section search or Brent methods when testing the Quantization-based algorithm to find the optimal controls at each point of the grids. These algorithms were very accurate in finding the optimal controls, and we made use of Remark 3.9 to find the optimal controls with the Regress-Later-based algorithm.
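A minimal standalone golden-section search illustrates the mechanism used at each grid point (maximizing a ↦ Q(z, a) over an interval); scipy.optimize.minimize_scalar offers the same functionality with the "golden" and "brent" methods. This simplified version re-evaluates both interior points at each iteration, whereas the classic implementation reuses one evaluation per step.

```python
import math

def golden_section_max(q, lo, hi, tol=1e-8):
    """Maximize a unimodal function q on [lo, hi] by golden-section search:
    the bracket shrinks by the factor 1/phi ~ 0.618 at every iteration."""
    invphi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - invphi * (b - a)
        d = a + invphi * (b - a)
        if q(c) >= q(d):
            b = d       # maximum lies in [a, d]
        else:
            a = c       # maximum lies in [c, b]
    return 0.5 * (a + b)

# Example on a concave toy objective whose maximizer is a* = 0.3.
a_star = golden_section_max(lambda a: -(a - 0.3) ** 2, -1.0, 1.0)
```
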

Upper and lower bounds
After completing the backward procedure, we can compute an unbiased estimate of the value of the obtained control policy using Monte Carlo simulations and a sample average. Assuming the matrix of regression coefficients (in the case of Regression Monte Carlo) or the discrete probability law p̂ (in the case of Quantization) has already been computed, or is simply available, we can use this information to compute the control implicitly and simulate forward many trajectories of the controlled process from a common initial condition. We can then evaluate the average performance by computing the sample mean of the rewards collected along each trajectory. Denoting this approximation by V̄_∆t(0, z), and recalling that by definition J_∆t(0, z, α) ≤ J_∆t(0, z, α*) for all α ∈ A^∆t, where α* represents the optimal control process, it holds that V̄_∆t(0, z) ≤ V_∆t(0, z) for z ∈ R^d.
The argument above implies that, neglecting the time-discretization error, evaluating the estimated policy yields a lower bound for V_∆t(0, ·). An upper bound for the value function can be obtained via duality; see [10], which builds on [20].
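The lower-bound evaluation can be sketched as follows: simulate many forward paths under an estimated feedback policy and average the collected rewards. Since the policy is admissible but generally suboptimal, the sample mean estimates J_∆t(0, z, α̂) ≤ V_∆t(0, z), up to Monte Carlo error. The dynamics, rewards and policy below are toy choices for illustration.

```python
import numpy as np

def policy_value(z0, policy, b, sig, f, g, T=1.0, N=20, M=10_000, rng=0):
    """Sample-average estimate of the value of a fixed feedback policy:
    M Euler paths from the common initial condition z0, rewards accumulated
    along each path plus the terminal reward."""
    rng = np.random.default_rng(rng)
    dt = T / N
    z = np.full(M, float(z0))
    total = np.zeros(M)
    for n in range(N):
        a = policy(n * dt, z)
        total += f(z, a) * dt
        z = z + b(z, a) * dt + sig * np.sqrt(dt) * rng.normal(size=M)
    total += g(z)
    return total.mean()

# Toy example: the policy drives the state toward 0, where rewards are largest.
lb = policy_value(z0=0.0,
                  policy=lambda t, z: -z,
                  b=lambda z, a: a, sig=0.2,
                  f=lambda z, a: -z**2, g=lambda z: -z**2)
```

With these nonpositive rewards, the estimate is a small negative number: the noise keeps the state slightly away from 0, and a better policy could only raise the value toward it.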

Pseudo-codes
In this section, we give pseudo-codes for the three approaches described in the previous sections. For simplicity, we only show the algorithms designed with the value iteration procedure; the performance iteration update rule can be substituted in the codes below, provided forward simulations are run to obtain pathwise realizations of the controlled process and the associated rewards.

Pseudo-code for a Regress-Later-based algorithm
We present in Algorithm 1 a pseudo-code to estimate V_∆t(t_n, ·), for n = 0, ..., N−1, using value iteration and the Regress-Later method. For n = 0, ..., N−1, we denote by V̂^{RL}_∆t(t_n, ·) the derived estimate of V_∆t(t_n, ·), and refer to it as the RLMC algorithm in the numerical tests presented in Section 4.
Note that we use the same training measure µ at each time step, so that there is only one covariance matrix to estimate (since A_{t_n} is the same for all n = 0, ..., N−1); we denote by Â^M its estimator, as defined in (3.15).

Pseudo-code for a Control Randomization-based algorithm
We present in Algorithm 2 a pseudo-code to estimate V_∆t(t_n, ·), for n = 0, ..., N−1, using value iteration and the Control Randomization method. For n = 0, ..., N−1, we denote by V̂^{CR}_∆t(t_n, ·) the derived estimate of V_∆t(t_n, ·), and refer to it as the CR algorithm in the numerical tests presented in Section 4.

Pseudo-code for a Quantization-based algorithm
We present in Algorithm 3 a pseudo-code to estimate V_∆t(t_n, ·), for n = 0, ..., N−1, using the value iteration procedure and the Quantization method. For n = 0, ..., N−1, we denote by V̂^Q_∆t(t_n, ·) the derived estimate of V_∆t(t_n, ·), and refer to it as the Q-algorithm in the numerical tests presented in Section 4.
Note that we use a piecewise constant approximation of the conditional expectations to compute V̂^Q_∆t(t_n, ·), in order to keep the algorithm simple. Also note that, as said previously, in most of the numerical tests run in Section 4 we use the optimal grids available at www.quantize.maths-fi.com and take L = 25 to 50 points for the size of the optimal grid of the Gaussian noise ε.
[Algorithm 1 outputs: {β̂^n_k}_{n,k=1}^{N,K} and V̂^{RL}_∆t(0, z) for z ∈ R^d. Algorithm 2 outputs: {β̂^n_k}_{n=0,k=1}^{N,K} and V̂^{CR}_∆t(0, z) for z ∈ R^d.]
Algorithm 3 (sketch):
1: Initialize the estimated value function at time N: V̂^Q_∆t(t_N, z) = g(z), for all z ∈ Γ_N.
2: for n = N−1 to 0 do
3:   Estimate the value function at time t_n as
     V̂^Q_∆t(t_n, z) = sup_{a∈A} [ f(z, a)∆t + Σ_{l=1}^L p_l V̂^Q_∆t( t_{n+1}, Proj_{Γ_{n+1}}(G_∆t(z, a, e_l)) ) ],  for all z ∈ Γ_n.   (3.24)

The model
We consider a financial market model with one risk-free asset, assumed constant equal to one, and d risky assets whose price process S = (S^1, ..., S^d) is driven by a d-dimensional Brownian motion B^0 on a filtered probability space (Ω, F, F, P_0), with a d×d invertible volatility matrix σ, assumed known and constant. However, the drift (β_t) of the assets (which is typically a diffusion process governed by another independent Brownian motion B) is unknown and unobservable, like the Brownian motion B^0. The agent can actually only observe the stock prices S, and we denote by F^S the filtration generated by the price process S, which should be viewed as the available information.
In this context, we shall consider two important classes of optimization problems in finance:
(1) Portfolio Liquidation. We consider the problem of an agent (trader) who has to liquidate a large number y_0 of shares of some asset (we consider one stock, d = 1) within a finite time T, and who faces execution costs and market price impact. In contrast with the frictionless Merton problem, we do not consider the mark-to-market value of the portfolio; instead, we track separately the amount on the cash account and the inventory Y, i.e., the position or number of shares held at any time. The strategy of the agent is then described by a real-valued F^S-adapted process α, representing the speed at which she buys (α_t > 0) or sells (α_t < 0) the asset; the inventory thus evolves as dY_t = α_t dt. The objective of the trader is to minimize over α the total liquidation cost, where f(·) is an increasing function with f(0) = 0, representing a temporary price impact, and the terminal term involves a loss function, i.e., a convex function vanishing at zero, penalizing the trader when she does not succeed in liquidating all her shares.
(2) Portfolio Selection. The set A of portfolio strategies, representing the amounts invested in the assets, consists of all F^S-adapted processes α valued in some set A of R^d and satisfying ∫_0^T |α_t|^2 dt < ∞. The dynamics of the wealth process X = X^α associated to a portfolio strategy α is then governed by the self-financing equation, and, as in the Merton portfolio selection problem, the objective of the agent is to maximize over portfolio strategies the expected utility of terminal wealth E[U(X^α_T)], where U is a utility function on R, e.g., the CARA function U(x) = −exp(−px), p > 0.
Let us show how one can reformulate the above problems as McKean-Vlasov problems under partial observation and common noise, as described in Section 1. We first introduce the so-called reference probability measure P, which makes the observed price process a martingale. We define the process Z, which is a (P_0, F)-martingale (under suitable integrability conditions on β) and defines a probability measure P, equivalent to P_0, through its density dP/dP_0 restricted to F_t equal to Z_t; under P, by Girsanov's theorem, the process W^0 is a (P, F)-Brownian motion driving the dynamics of S. Notice that F^S = F^0, the filtration generated by W^0. We also denote by L_t = 1/Z_t the corresponding (P, F)-martingale. Next, using the Bayes formula, the tower property of conditional expectations, and the fact that S, X and Y are F^0-adapted, we rewrite the gain (resp. cost) functionals of our two portfolio optimization problems as expectations under P. This formulation of the functional J_1 (resp. J_2) fits into the MKV framework of Section 1 with state variables (X, L, β) (resp. (Y, S, L, β)).

We now consider the particular case where β is an F_0-measurable random variable distributed according to some probability distribution ν(db): this corresponds to a Bayesian point of view, in which the agent's belief about the drift is modeled by a prior distribution. In this case, the partial observation problem can be embedded into a finite-dimensional, fully observed Markov control problem. Indeed, since β is independent of the Brownian motion W^0 under P, the functionals J_1 and J_2 can be rewritten accordingly, and we are reduced to a (P, F^0)-control problem with state variables (W^0, X) for problem (1) and (W^0, S, Y) for problem (2). Remark 4.1. Another example of partial observation of the drift β is the case where it is modeled by a linear Gaussian process. This leads to the well-known Kalman-Bucy filter, and hence again to a finite-dimensional control problem.
However, for a general unobserved drift process β, we fall into an infinite-dimensional control problem involving the filter process.

Numerical results
Let us now illustrate numerically the impact of an uncertain Bayesian drift on the Portfolio Liquidation and Portfolio Selection problems in dimension d = 1, considering a Gaussian prior distribution β ~ ν = N(b_0, γ_0²), with b_0 ∈ R and γ_0 > 0. In this case, F admits an explicit expression.

Portfolio Liquidation.
Let us first consider the Portfolio Liquidation problem (1) with a linear price impact function f(a) = γa, γ > 0, and a quadratic loss function ℓ(y) = ηy², η > 0. The optimal trading rate α* is given in [18], where Y* is the associated inventory with feedback control α*: dY*_t = α*_t dt, Y*_0 = y_0. Since we consider a Gaussian prior N(b_0, γ_0²) for β, the optimal trading rate admits an explicit expression involving the imaginary error function erfi, defined as erfi(x) = (2/√π) ∫_0^x e^{u²} du.

Remark 4.2.
In the particular case where the price process is a martingale, i.e., b_0 = 0, and in the limiting case where the penalty parameter η goes to infinity, corresponding to the final constraint Y_T = 0, we see that α*_t converges to −Y*_t/(T − t); it thus becomes independent of the price process, and leads to the explicit optimal inventory Y*_t = y_0 (T − t)/T with constant trading rate α*_t = −y_0/T. We retrieve the well-known VWAP strategy obtained in [1].
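As a sanity check on this limiting regime, the constant-rate strategy can be simulated directly. A minimal sketch, assuming for illustration that the temporary-impact part of the cost takes the form ∫_0^T α_t f(α_t) dt with f(a) = γa (the precise cost functional is stated earlier in the paper):

```python
import numpy as np

# Sketch of the VWAP limit of Remark 4.2 (b0 = 0, eta -> infinity).
# Assumption: the temporary-impact execution cost is int_0^T alpha_t f(alpha_t) dt
# with f(a) = gamma*a, so the constant-rate strategy alpha = -y0/T costs gamma*y0^2/T.
gamma, y0, T, N = 5.0, 1.0, 1.0, 100
dt = T / N
alpha = -y0 / T * np.ones(N)          # constant trading rate
Y = y0 + np.cumsum(alpha * dt)        # inventory path, ends at Y_T = 0
cost = np.sum(gamma * alpha**2 * dt)  # discretized execution cost

assert abs(Y[-1]) < 1e-10             # full liquidation
assert abs(cost - gamma * y0**2 / T) < 1e-10
```

The discretized cost of the constant-rate strategy matches the closed-form value γ y_0²/T exactly, since the integrand is constant in time.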
We solve the problem numerically, taking N = 100 time steps and fixing the other parameters as follows: γ = 5, S_0 = 6, Y_0 = 1, η = 100 and σ = 0.4. We run two sets of forward Monte Carlo simulations, for b_0 = 0.1, T = 1 and for b_0 = −0.1, T = 0.5, varying the value of γ_0. We tested the Regress-Later Monte Carlo (RLMC), Control Randomization (CR) and Quantization (Q) algorithms. In particular, we wanted to compare the performance of these algorithms with that of (α*_{t_n})_{n=0,...,N−1}, where α*, defined above, is the optimal strategy of the continuous-time Portfolio Liquidation problem. We refer to this discrete-time strategy as α* (re-using the same notation), and write Opt, or continuous-time optimal strategy, when we want to stress that this strategy is optimal for the continuous-time control problem, not for the discrete-time one. We also tested a benchmark strategy (Bench), which consists in liquidating the inventory at the constant rate −y_0/T. The test consists in computing the estimates V̂^{Δt}(t_0 = 0, S_0 = 6, Y_0 = 1) associated with the different algorithms.
We display the results obtained by the different algorithms in Table 1 and plot them in Figure 2. One can observe in Figure 2 that, for Δt = 1/100, the estimates V̂^{Δt}(t_0 = 0, S_0 = 6, Y_0 = 1) of the value function V^{Δt}(t_0 = 0, S_0 = 6, Y_0 = 1) provided by the RLMC, CR or Q-based methods sometimes satisfy V̂^{Δt}(t_0 = 0, S_0 = 6, Y_0 = 1) ≤ Ĵ^{Δt}(t_0 = 0, S_0 = 6, Y_0 = 1, α*), where Ĵ^{Δt}(·,·,·, α*) is a Monte Carlo estimate of J^{Δt}(·,·,·, α*) obtained by applying the strategy (α*_{t_n})_{n=0,...,N−1} (see the curve Opt in Figure 2). This means that RLMC, CR or Q sometimes provide better estimates of the optimal strategy than α* for the discrete-time control problem. However, under suitable conditions (see, e.g., [12]), the optimal strategy α^{Δt} of the discrete-time control problem converges to α* as Δt → 0. Figure 3 shows a sample of the inventory (Y_t)_{t∈[0,T]} when the agent follows α* and the strategy estimated by the Quantization algorithm. One can notice that, given the chosen penalization parameters, it is optimal to short some stocks at the terminal time. Finally, the concavity of the curves comes from the fact that the running cost does not penalize the inventory; if it did, we would expect the inventory curves to be convex in time, see, e.g., [7].

Details on the RL and CR algorithms implementation
The implementation of the Regression Monte Carlo algorithms required intense tuning and the use of the performance iteration technique introduced in Subsection 3.3.2 in order to obtain satisfactory results. Paramount, in addition, is the distribution chosen for the training points in Regress-Later and for the initial control in Control Randomization. The problem of finding the best set of data to feed to the backward procedure is similar in the two Regression Monte Carlo algorithms, but little study of it is available in the literature; for more details on this problem in the Regress-Later setting, see [15] and [2]. For the RL algorithm, the training measure μ_n was chosen so as to sufficiently explore the state space in the Y dimension; in particular, we considered μ_n = U[−0.5, 0.5 + (T − t_n)/T]. Similarly, for CR we seek a distribution of the random control such that the controlled process Y has a distribution similar to μ_n. To achieve this, we follow the "explore first, exploit later" approach presented in Subsection 3.3.2 and use a perturbed version of the empirical distribution of the control obtained at the previous iteration of the method (to avoid concentration of the training points) to determine the random control at the next iteration.
To choose the basis functions, we used the fact that we expect the value function to be convex in the Y dimension, with a minimum around the optimal inventory level, and monotone in the S dimension. For the RL algorithm we therefore chose the following set of basis functions: {s, y, y², sy, sy²}, where the square function y² serves as a general approximator of convex functions around their minima (where we expect the measure μ_n to be concentrated). CR, on the other hand, requires guessing the functional form of the conditional expectation of the value function with respect to the control process. In line with our argument that the square function approximates general convex functions, we add the set {α, α², αy, αs} to the basis functions used by RL.
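The regression step that these basis functions feed into can be sketched as an ordinary least-squares fit. The target below is synthetic (a convex-in-y surface standing in for the value-function samples at the next time step, not the paper's data), and a constant is added to the quoted basis:

```python
import numpy as np

# Sketch of one Regress-Later fit with the basis {1, s, y, y^2, s*y, s*y^2}.
# `v_next` is a synthetic stand-in for the value-function samples at t_{n+1}:
# convex in y with a minimum, mildly dependent on s, plus noise.
rng = np.random.default_rng(0)
M = 10_000
s = 6.0 + 0.4 * rng.standard_normal(M)                  # training prices
y = rng.uniform(-0.5, 0.5, M)                            # training inventories
v_next = 100.0 * (y - 0.2) ** 2 + 0.1 * s * y + rng.standard_normal(M)

phi = np.column_stack([np.ones(M), s, y, y**2, s * y, s * y**2])
coef, *_ = np.linalg.lstsq(phi, v_next, rcond=None)

# The fitted surface recovers the convex-in-y shape motivating the basis.
assert coef[3] > 50.0        # coefficient of y^2 is large and positive (true value: 100)
```

The fitted coefficient of y² recovers the curvature of the target, which is exactly the property exploited when searching for the minimizing control.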
Note that with such a choice of basis functions there is no need for a time-consuming optimal control search, as explained in Remark 3.9.
Finally, note that we observed very high variability in the quality of the policy estimated by Control Randomization. For this reason we estimated the policy 50 times, and report in Table 1 the results provided by the best-performing one; increasing the number of training points affects this variability only marginally.

Details on the Q algorithm implementation
To numerically solve this example, we used the optimal grid of the Gaussian random variable with L = 50 points, denoted by Γ^ε_L, to define the grid Γ^W_n = √t_n Γ^ε_L that discretizes W_{t_n}, the Brownian motion at time t_n, and the grid Γ^Y_n = Y_0 − t_n/T + √t_n Γ^ε_L that discretizes Y_{t_n}, the inventory at time t_n, for n = 0, ..., N. Note that Γ^Y_n, for n = 0, ..., N, is centered at the point Y_0 − t_n/T because we guessed that the optimal liquidation rate is close to Y_0/T (see Figure 3, which confirms this guess). We then considered the product grid Γ_n = Γ^W_n × Γ^Y_n to discretize Z_{t_n} = (W_{t_n}, Y_{t_n}), n = 0, ..., N.
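The grid construction can be sketched as follows. A quantile-midpoint grid stands in for the optimal L-point quantizer Γ^ε_L (which would normally be precomputed), and the width chosen for the Y grid below is a guess, not the paper's choice:

```python
import numpy as np
from scipy.stats import norm

# Sketch of the product grids used by the Q algorithm.  Assumptions: the
# L-point grid of N(0,1) is approximated by quantile midpoints (a true optimal
# quantizer would be precomputed); W_{t_n} is quantized on sqrt(t_n)*grid and
# Y_{t_n} on a grid re-centred at the anticipated inventory Y_0 - t_n/T.
L, N, T, y0 = 50, 100, 1.0, 1.0
gauss_grid = norm.ppf((np.arange(L) + 0.5) / L)      # stand-in for Gamma^eps_L

def grids(n):
    t = n * T / N
    gw = np.sqrt(t) * gauss_grid                      # grid for W_{t_n}
    gy = y0 - t / T + np.sqrt(t) * gauss_grid         # grid for Y_{t_n}
    return gw, gy

def project(z, grid):
    """Nearest-neighbour (Voronoi) projection of samples z on a 1-d grid."""
    return grid[np.argmin(np.abs(z[:, None] - grid[None, :]), axis=1)]

gw, gy = grids(50)
w = np.random.default_rng(1).normal(0.0, np.sqrt(0.5), 1000)
assert np.all(np.isin(project(w, gw), gw))            # projection lands on the grid
```

The `project` helper is the component-wise Euclidean projection Proj_Γ used in the conditional expectation approximations below.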
We first tried to design a quantization algorithm using a conditional expectation approximation in which the first and second components of the process Z = (W, Y) are projected on the grids Γ^W_n and Γ^Y_n respectively, where Proj_{Γ^W_n} (resp. Proj_{Γ^Y_n}) stands for the Euclidean projection of the first (resp. second) component of Z = (W, Y) on Γ^W_n (resp. Γ^Y_n). This approximation belongs to the family of piecewise-constant approximations, and is in the spirit of the multidimensional component-wise quantization methods already studied in the literature (see, e.g., [17]). Unfortunately, as can be seen in Figure 1, approximation (4.6) is discontinuous w.r.t. the control variable a, so that the optimal control search suffered from instability and inaccuracy, which implied bad value function estimates at times n = 0, ..., N−1. We thus had to use a better approximation of the conditional expectation. Figure 1. Plot of the quantization-based piecewise-constant approximation of the conditional expectation, at n = N−1, w = 0 and y = −0.18. Observe that the approximation is discontinuous w.r.t. the control variable a, which makes the search for the minimizer of this function very difficult for the usual (gradient descent-based) algorithms. Observe also that the minimum of the function, which equals the estimate of the value function at time N−1 at the point (w = 0, y = −0.18), suffers from inaccuracy.
We then decided to smooth the previous approximation of the conditional expectation w.r.t. the control variable, in the spirit of the semi-linear approximation presented in Subsection 3.2, where, for all ℓ = 1, ..., L:
- G^w_{Δt}((w, y), a, e_ℓ) and G^y_{Δt}((w, y), a, e_ℓ) respectively stand for the first and second components of G_{Δt}((w, y), a, e_ℓ), i.e., G_{Δt}((w, y), a, e_ℓ) = (G^w_{Δt}((w, y), a, e_ℓ), G^y_{Δt}((w, y), a, e_ℓ)). See (3.19) for the definition of G_{Δt}.
- y⁻ and y⁺ are the two closest states in Γ^Y_{n+1} to G^y_{Δt}((w, y), a, e_ℓ) such that y⁻ < G^y_{Δt}((w, y), a, e_ℓ) < y⁺, if such points exist; otherwise both y⁻ and y⁺ are set to the closest state in Γ^Y_{n+1} to G^y_{Δt}((w, y), a, e_ℓ).
- λ^{e_ℓ,(w,y)}_a = (G^y_{Δt}((w, y), a, e_ℓ) − y⁻)/(y⁺ − y⁻) in the first case of the definition of y⁻ and y⁺ above; λ^{e_ℓ,(w,y)}_a = 1 otherwise.
This approximation is a slight generalization (to dimension d = 2) of the semi-linear approximation developed in (3.21). Its main interest lies in its continuity w.r.t. the control variable a, which provides stability and accuracy to the usual (gradient descent-based) algorithms for the optimal control search, as can be seen in the numerical results (see, e.g., Table 1).
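The mechanism behind this continuity is ordinary piecewise-linear interpolation in the y-coordinate of the transition point, with the weight λ playing the role described above. A minimal one-dimensional sketch:

```python
import numpy as np

# Sketch of the semi-linear approximation: the y-coordinate of the one-step
# transition is interpolated linearly between its two neighbours y- and y+ on
# the next grid, making the approximate conditional expectation continuous in a.
def interp_on_grid(y_new, grid, values):
    """Piecewise-linear interpolation of `values` (given on the sorted `grid`)
    at the point y_new, with constant extrapolation outside the grid."""
    if y_new <= grid[0]:
        return values[0]
    if y_new >= grid[-1]:
        return values[-1]
    i = np.searchsorted(grid, y_new)                  # grid[i-1] < y_new <= grid[i]
    lam = (y_new - grid[i - 1]) / (grid[i] - grid[i - 1])   # the weight lambda
    return (1 - lam) * values[i - 1] + lam * values[i]

grid = np.linspace(-0.5, 0.5, 11)
vals = grid**2                                         # any value-function samples
# Continuity in the transition point: nearby inputs give nearby outputs,
# unlike the piecewise-constant (Voronoi) projection.
assert abs(interp_on_grid(0.249, grid, vals) - interp_on_grid(0.251, grid, vals)) < 1e-2
```

By contrast, a nearest-neighbour projection would jump by a full grid-cell value at each Voronoi boundary, which is the discontinuity visible in Figure 1.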

Portfolio Selection.
The optimal portfolio strategy is explicit, involving the posterior mean of the drift (Bayesian learning on the drift), as is the optimal performance. The Portfolio Selection problem, though in many respects similar to the Portfolio Liquidation problem, is interesting in its own right because the control acts only on the variance of the controlled wealth process. We tested the Regress-Later Monte Carlo (RLMC), Control Randomization (CR) and Quantization (Q) algorithms on the Portfolio Selection problem. As for the Portfolio Liquidation problem, we discretized time with N = 100 steps and solved the associated discrete-time problem. We considered two sets of experiments, b_0 = 0.1, T = 1 and b_0 = −0.1, T = 0.5, for different values of γ_0 ∈ [0, 1], with p = 1 and σ = 0.4. For these parameters, we compared the performance of the algorithms with that of the optimal strategy α* of the continuous-time problem (Opt). The general test consists in running a forward Monte Carlo with 500,000 samples, following the optimal strategy estimated by each method, to provide estimates of V(t_0 = 0, X_0 = 0, W_0 = 0), the value function at time 0.
We present the results of our numerical experiments in Table 2. One can see that the Quantization algorithm performs similarly to the theoretically optimal strategy (Opt) for the continuous-time problem, which can be interpreted as stability and accuracy of the Q algorithm, and also shows that the time discretization error is almost zero here.
We also present in Figure 4 a sample of the wealth of the agent following the optimal strategy and the one estimated by (Q). One can see that the two strategies differ slightly when the drift is high, and coincide when the drift is low. The small difference can be explained by the fact that the optimal strategy (Opt) is not optimal for the discrete-time version of the problem.
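The forward evaluation used to fill Table 2 can be sketched as follows. The wealth dynamics below (drift drawn once per path from the Gaussian prior) and the constant candidate strategy are illustrative assumptions, not the estimated policies of the tested algorithms, and the sample is smaller than the paper's 500,000 for speed:

```python
import numpy as np

# Sketch of a forward Monte Carlo evaluation of a candidate strategy.
# Assumptions (illustrative only): wealth follows dX = alpha*(beta dt + sigma dW)
# with beta drawn once per path from the prior N(b0, gamma0^2), and the candidate
# strategy is a plain constant-amount rule.
rng = np.random.default_rng(2)
M, N, T = 100_000, 100, 1.0
b0, gamma0, sigma, p = 0.1, 0.5, 0.4, 1.0
dt = T / N

beta = rng.normal(b0, gamma0, M)          # one drift per path (Bayesian prior)
X = np.zeros(M)
alpha = 0.5                                # illustrative constant strategy
for _ in range(N):
    dW = rng.normal(0.0, np.sqrt(dt), M)
    X += alpha * (beta * dt + sigma * dW)

V_hat = np.mean(-np.exp(-p * X))          # CARA utility U(x) = -exp(-p x)
assert -10.0 < V_hat < 0.0                # utility estimate is finite and negative
```

Ranking several candidate strategies by their estimated V_hat is exactly the comparison reported between RLMC, CR, Q and Opt.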

Details on the Q algorithm implementation
We designed the same Quantization algorithm as the one built to solve the Portfolio Liquidation problem. We nevertheless had to take a larger number of points in the grids, to reduce the back-propagation of errors from the borders of the grids, and to use the "explore first, exploit later" idea (see Subsection 3.3.2) to improve the results.

Details on the RL and CR algorithms implementation
When implementing the Regression Monte Carlo algorithms and choosing basis functions, the fact that the control acts through the variance implies that low-order polynomials cannot be used alone, as they easily cause the control to be bang-bang between the boundaries of its domain. Similarly, piecewise approximations are not very effective: the dependence on the control is very weak, so a high number of local supports would be required, making the computational complexity overwhelming. We tested both value and performance iteration and tried different kinds of basis functions and training points. Unfortunately, neither Regress-Later nor Control Randomization copes well with controlling the dynamics of a process through the variance only. A tailor-made implementation of Regression Monte Carlo for this kind of problem is outside the scope of this paper, and further investigation will follow in future work. For now, we chose not to provide results based on the RL and CR methods. Figure 4. Three simulations of the agent's wealth (X_t)_{t∈[0,T]} when the agent follows the continuous-time optimal strategy (Opt) and the (Q)-estimated optimal strategy for the Portfolio Selection problem. We took σ = 0.4, T = 1, p = 0.1, γ_0 = 5, b_0 = 0.1. One can see that the two strategies coincide when the drift is low; Q performs slightly better than Opt when the drift is high, which is a time-discretization effect.

The model
We consider the following model of systemic risk, inspired by the model in [5]. The log-monetary reserves of N banks lending to and borrowing from each other are governed by a system in which W^i, i = 1, ..., N, are independent Brownian motions representing the idiosyncratic risk of each bank, W^0 is a common noise independent of the W^i, σ > 0 is a given real parameter, ρ ∈ [−1, 1], and the initial states X^i_0, i = 1, ..., N, are i.i.d. The mean-reversion coefficient κ > 0 models the strength of interaction between the banks: bank i can lend to and borrow from bank j an amount proportional to the difference between their reserves. In the asymptotic regime N → ∞, the theory of propagation of chaos implies that the reserve states X^i of the individual banks become independent and identically distributed conditionally on the common noise W^0, with a state governed by dX_t = κ(E[X_t|W^0] − X_t) dt + σ X_t (√(1 − ρ²) dB_t + ρ dW^0_t) for some Brownian motion B independent of W^0. Let us now consider a central bank, viewed as a social planner, who only observes the common noise (and not the reserves of each bank) and can influence the strength of the interaction between the individual banks through an F^0-adapted control process α_t. The reserve of the representative bank in the asymptotic regime is then driven by the analogous dynamics with the interaction rate given by the control, and the objective of the central bank is to minimize a cost functional in which η > 0 and c > 0 penalize the departure of the reserve from the average. This is a MKV control problem under partial observation. Notice that it does not belong to the class of linear-quadratic (LQ) MKV problems, since the control α appears in multiplicative form with the state; however, it fits into our class of polynomial MKV problems, and can be embedded into a standard control problem as follows: we set X̄_t = E[X_t|W^0] and Y_t = E[(X_t − X̄_t)²|W^0].
The cost functional can then be written in terms of X̄ and Y, whose dynamics are driven by the common noise W^0. We have thus reduced the problem to a (P, F^0)-control problem in dimension two with state variables (X̄, Y), which is still not LQ, but can be solved numerically.
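The embedding can be checked with a crude particle approximation: simulate many conditionally i.i.d. banks under a single common-noise path and read off the empirical conditional mean and variance. The dynamics and the constant control below are illustrative assumptions, not the paper's estimated policy:

```python
import numpy as np

# Particle sketch of the embedding Xbar = E[X|W0], Y = E[(X - Xbar)^2|W0].
# Assumption: representative-bank dynamics dX = alpha*(Xbar - X) dt
#             + sigma*X*(sqrt(1-rho^2) dB + rho dW0), with a constant
#             illustrative control alpha.
rng = np.random.default_rng(3)
M, N, T = 50_000, 100, 1.0
sigma, rho, alpha, x0 = 0.1, 0.5, 0.5, 10.0
dt = T / N

X = np.full(M, x0)
for _ in range(N):
    dW0 = rng.normal(0.0, np.sqrt(dt))       # common noise: one draw for all banks
    dB = rng.normal(0.0, np.sqrt(dt), M)     # idiosyncratic noises
    Xbar = X.mean()                          # empirical stand-in for E[X|W0]
    X += alpha * (Xbar - X) * dt + sigma * X * (np.sqrt(1 - rho**2) * dB + rho * dW0)

Y = np.var(X)                                # empirical E[(X - Xbar)^2 | W0]
assert Y > 0.0 and np.isfinite(X.mean())
```

The pair (X.mean(), Y) traces exactly the two-dimensional state (X̄, Y) of the reduced problem along the sampled common-noise path.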

Numerical results
For this problem, in the absence of an analytical solution, we compare the estimates of the value function at time 0 provided by our algorithms with a numerical approximation, based on a finite difference scheme implemented in Mathematica, of the solution to the two-dimensional HJB equation (4.7) associated with the systemic risk problem. We refer to the solution of this partial differential equation (obtained with Mathematica using finite differences, as explained below) as the Benchmark (or simply Bench) in the sequel.
We computed V̂^{Δt}(t_0 = 0, x_0 = 10, y_0 = 0) using the RL, CR and Q methods, considering a sample of size 500,000 and using the parameters T = 1, σ = 0.1, κ = 0.5 and X_0 = 10. We recall that V̂^{Δt}(t_0 = 0, x_0, y_0) is an estimate of V(0, x_0 = 10, y_0 = 0), the value function at the point (x_0, y_0) at time 0 (see its definition in the last step of each pseudo-code presented in Subsection 3.6).
In Table 3 we display the numerical results of experiments run in two settings: in the first we took η = 10, c = 100 and varied the value of ρ; in the second we took η = 100, ρ = 0.5 and varied the value of c. Plots of the two tables are also available in Figure 5. One can observe that the algorithms perform well; mainly, Bench and Q provide slightly better results than the Regression Monte Carlo-based algorithms (the curves of Bench and Q lie below those of the other two). Figure 6 shows two examples of paths (X_t)_{t∈[0,T]} controlled by RLMC (curve "RLMC"), paths naively controlled by α = 0 (curve "uncontrolled"), and the conditional expectation X̄ of X (curve "E(X|W)"). One can see in these two examples that the (RLMC-estimated) optimal control is as follows: do nothing (i.e., take α = 0) while the terminal time is far, so as not to pay any running cost; then catch up with X̄ when the terminal time gets close, to minimize the terminal cost.
We finally present in Figure 7 a sample of paths (Y_t)_{t∈[0,T]} controlled by the decisions given by Q. One can see that the (Q-estimated) optimal strategy first minimizes the running cost by letting Y grow, and deals with the terminal cost later by making Y small as the terminal time approaches.

Details on the RL and CR algorithms implementation
For the implementation of the RL algorithm, we decided to use polynomial basis functions up to degree 2. This choice allows us to compute the optimal control analytically as a function of the regression coefficients (see Remark 3.9); compared to other optimization techniques, an explicit expression allows much faster and error-free computations. For CR, we used basis functions up to degree 3 in all dimensions to obtain more stable results. Regarding the choice of the training measure in RL, we employed marginal normal distributions in each dimension. As the dimension Y represents the conditional variance of the original process X, we centered the training distribution μ_n at zero but kept only training points Y^m_n ≥ 0. In CR, on the other hand, we need to choose the distribution of the random control carefully so that the process Y does not become negative. Notice in fact that the Euler approximation, unlike the original SDE describing Y, does not remain positive, so we would need to choose the control carefully to avoid driving Y negative. To achieve this, without having to worry too much about the control, we modified the Euler approximation of the dynamics of Y to feature a reflecting boundary at zero. This allows the estimated control policy to be trained not to overshoot when driving the process Y towards zero, without Y becoming negative.
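The reflecting boundary can be implemented by taking the absolute value after each Euler step. The drift and diffusion coefficients below are placeholders, not the paper's coefficients for Y; only the reflection at zero is the point:

```python
import numpy as np

# Sketch of a reflected Euler step keeping the discretized variance process Y
# nonnegative.  The drift b and diffusion s are illustrative placeholders.
def euler_step_reflected(y, dt, dW,
                         b=lambda y: -2.0 * y + 0.01,   # placeholder drift
                         s=lambda y: 0.2 * y):          # placeholder diffusion
    # Reflect at zero instead of letting the scheme go negative.
    return np.abs(y + b(y) * dt + s(y) * dW)

rng = np.random.default_rng(4)
y, dt = 0.005, 0.01
for _ in range(1000):
    y = euler_step_reflected(y, dt, rng.normal(0.0, np.sqrt(dt)))
assert y >= 0.0                       # the reflected scheme never leaves [0, inf)
```

Without the absolute value, a single large Gaussian increment could push the Euler iterate below zero even though the exact process Y stays positive.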

Details on the Q algorithm implementation
As stated above, it is straightforward that Y > 0 on (0, T]. However, the Euler scheme used to approximate the dynamics of Y does not prevent the associated discrete process (Y_{t_i})_{0<i≤N} from becoming non-positive. When implementing the Q algorithm for the systemic risk problem, we forced (Proj_{Γ^Y_i}(Y_{t_i}))_{0<i≤N} to remain positive simply by choosing positive points for the grids Γ^Y_i that quantize the states of Y_{t_i} at time t_i, for i = 0, ..., N. Also, given the expressions of the instantaneous and terminal rewards, one can expect Y to stay close to 0, but we have no a priori idea of how small Y should stay for the strategy to be optimal (cf. Figure 7 to see a posteriori where Y lies). To deal with this, we adopted the "explore first, exploit later" procedure. First, we chose random grids with many points near 0, computed the optimal strategy on these grids, and ran forward Monte Carlo simulations to generate an empirical distribution of the quantized Y. Second, we built new quantization grids for Y by generating points according to the empirical distribution obtained in the previous step. Finally, we computed new (hopefully better) estimates of the optimal strategy by running the Q algorithm on the new grids. The Q strategy performed better after this step, but not significantly, since our first naive guess for the grids (i.e., before bootstrapping) was already good enough.
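The grid bootstrap can be sketched in a few lines: start from a guessed positive grid concentrated near 0, simulate forward, and rebuild the grid from empirical quantiles. The simulated samples below are synthetic stand-ins for the quantized Y paths:

```python
import numpy as np

# Sketch of the "explore first, exploit later" grid bootstrap: a naive positive
# grid near 0 is replaced by the empirical quantiles of forward-simulated data.
rng = np.random.default_rng(5)
L = 50
grid0 = np.sort(np.abs(rng.normal(0.0, 0.01, L)))        # naive positive grid near 0

samples = np.abs(rng.normal(0.05, 0.02, 10_000))         # stand-in for simulated Y paths
grid1 = np.quantile(samples, (np.arange(L) + 0.5) / L)   # bootstrapped grid

# The new grid is positive, sorted, and concentrated where Y actually lives.
assert np.all(grid1 >= 0.0) and np.all(np.diff(grid1) >= 0.0)
```

Quantile midpoints place grid points proportionally to the empirical mass, which is precisely the "many points where the process goes" heuristic described above.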

Details on the implementation of the deterministic algorithm for the resolution of the HJB
We use the NDSolve function in Mathematica, based on a finite difference method, to solve (4.7). Note that terminal and boundary conditions are usually required to obtain numerical results. The terminal condition V(T, x, y) = (c/2) y is already given by (4.7). However, the boundary conditions on V(t, 0, y) and V(t, x, 0) are missing, except for the trivial condition V(t, 0, 0) = 0. We therefore provided the HJB equation without boundary conditions to the Mathematica function NDSolve, and let it add artificial boundary conditions by itself to produce results.
Table 3. Results for the systemic risk problem: estimates of the value function at the point x_0 = 10 at time 0 provided by the different strategies. We took T = 1, N = 100, σ = 0.1, κ = 0.5, X_0 = 10 (and ρ = 0.5, η = 100 in the second set of experiments).

Conclusion
In this work, we have investigated how to apply probabilistic numerical methods to some classes of mean-field control problems via Markovian embedding. We focused on two types of Regression Monte Carlo methods (namely Regress-Later and Control Randomization) and on Quantization, and presented three different examples of application. We found that the Regression Monte Carlo algorithms perform well on problems where the control acts on the drift; on such problems, they are much faster than Quantization for similar precision. In particular, we noticed that Regress-Later is usually more reliable than Control Randomization: often the choice of a uniform distribution of the training points on an appropriate interval suffices to obtain high-quality estimates. Control Randomization, on the other hand, is very sensitive to the choice of the distribution of the randomized control, and often a few iterations are necessary before finding a good control distribution. We also tried the performance iteration (or path recomputation) method, but on the examples we considered it was very time consuming and did not help much in terms of accuracy. Despite the success of Regression Monte Carlo methods on problems with control on the drift, the Portfolio Selection example highlighted a possible weakness of these algorithms: when the control acts on the variance only, we found it difficult to make the numerical scheme converge to sensible results within the computational resources available, and a thorough study of such problems via Regression Monte Carlo is outside the scope of this paper. This is probably related to another limitation of this family of methods, namely the choice of the basis functions for the regression: for some problems, a good basis might be very large or might require several steps of trial and error. Quantization-based methods, on the other hand, provided very stable and accurate results.
A first interesting and practical feature of the Q algorithm is that the quantization-based regression of the value function is local. First, it can therefore easily be parallelized to provide fast results; second, it is easy to check at which points of the grids the estimates suffer from instability, and to fix the problem by changing the grid (basically, by adding points where the estimates need to be improved). Another interesting feature of quantization methods is that one can choose the grids on which to learn the value function. This can be exploited when one has an a priori rough idea of where the controlled process should be driven by the optimal strategy (see, e.g., the liquidation problem): in that case, one should build grids with many points located where the process is expected to go. When one has no guess of where the optimal process goes, it is always possible to use bootstrapping methods to build better grids iteratively, starting from a random guess for the grid (see, e.g., the systemic risk problem). In both cases, one has to be particularly careful with the borders of the grids: the decisions computed by quantization-based methods at the borders can easily be wrong if the grids do not have a "good shape" there. Unfortunately, the shape of the grid that should be used depends heavily on the problem under consideration. Except in special cases, it seems impossible to avoid the use of deterministic algorithms (such as gradient descent methods or extensive search) to find the optimal action at each point of the grid, and a smooth expression of the conditional expectations of the quantized processes is necessary for these algorithms to find the optimal strategy efficiently. Once again, parallel computing can alleviate the time-consuming task of searching for the optimal control at each point of the grid.
Figure 6. Two realizations of (X_t)_{t∈[0,T]} controlled by RLMC (curve "RLMC"), of (X_t)_{t∈[0,T]} naively controlled with α = 0 (curve "uncontrolled"), and of X̄ (curve "E(X|W)"). The optimal control for the systemic risk problem (computed by RLMC) is to do nothing at first, and to catch up with X̄ when the terminal time gets close.
Figure 7. Sample of (Y_t)_{t∈[0,T]} controlled by Q. The (Q)-estimated optimal control for the systemic risk problem is to initially let Y become large, and then to reduce it when approaching the terminal time.