Deep Learning-based Schemes for Singularly Perturbed Convection-Diffusion Problems

Deep learning-based numerical schemes such as Physically Informed Neural Networks (PINNs) have recently emerged as an alternative to classical numerical schemes for solving Partial Differential Equations (PDEs). They are very appealing at first sight because implementing vanilla versions of PINNs based on strong residual forms is easy, and neural networks offer very high approximation capabilities. However, when the PDE solutions are low regular, an expert insight is required to build deep learning formulations that do not incur in variational crimes. Optimization solvers are also significantly challenged, and can potentially spoil the final quality of the approximated solution due to the convergence to bad local minima, and bad generalization capabilities. In this paper, we present an exhaustive numerical study of the merits and limitations of these schemes when solutions exhibit low-regularity, and compare performance with respect to more benign cases when solutions are very smooth. As a support for our study, we consider singularly perturbed convection-diffusion problems where the regularity of solutions typically degrades as certain multiscale parameters go to zero.


Scientific context and goals
Singularly perturbed differential equations are typically characterized by a small parameter ε > 0 multiplying some of the highest order terms in the differential equation.In general, the solutions to such equations exhibit multiscale phenomena, and this raises significant challenges to classical numerical methods such as finite elements or finite volumes.To build accurate and robust approximations with these methods as ε decreases, it is necessary to develop elaborate numerical discretizations.In addition to the mathematical difficulties of the formulation, the resulting numerical schemes are often not entirely trivial to implement: they often require mesh adaptation, and working on complicated geometries is challenging.These difficulties motivate the search for new discretization schemes, hopefully mesh-free, with potential to deliver good quality approximations with easier implementation techniques.In this work, we explore this research direction, and consider strategies based on deep learning techniques.Our main goal is to test various neural network-based schemes, so as to design a strategy which should be robust when ε → 0, easily implementable even for complicated geometries, and with potential to scale in high dimension.
The idea of working with neural network functions to solve PDEs is by far not novel, and countless contributions have been proposed on this front in recent years.The strategies can be roughly classified into two categories: (1) In the first category, deep neural networks are employed to assist classical numerical methods by improving some limitations, or accelerating certain steps (see, e.g., [1][2][3]).(2) In the second category, neural networks are used to directly approximate the solution of PDEs.The solution schemes become in this case an optimization problem where it is crucial to design appropriate loss functions.The loss functions are mostly based on residuals of the equations, and yield to different methods depending on the specific choice: (a) Physics-informed neural networks (PINNs, [4]) is a collocation-based method.One finds the coefficients of the neural network solution by minimizing a discretized version of the L 2 norm of the strong form of the residual of the PDE.This method is very easily implementable but it implicitly assumes that solutions are very regular.(b) Other strategies leverage weak variational formulations where less regular solutions are allowed.
On this front, most of the classical methods originally formulated for piecewise polynomial functions have by now been tested with trial and test spaces of neural network functions.In this respect, the deep Galerkin method (DGM, [5]) is based on a least-squares formulation, and the variational PINNS (VPINNs, [6,7]) is based on the Galerkin method.The main drawback of this approach is that the approximation quality depends on the architecture of both the trial and the test neural network classes.In addition, numerous evaluations for multiple test functions need to be performed.Also, strategies involving the minimization of weak-form residuals are usually not trivial to implement because they involve the computation of norms in very weak spaces which necessitate extra discretization steps.(c) Another approach based on weak variational formulations is the so-called deep Ritz method (DRM, [8]).It leverages the fact that the solution of certain PDEs are the unique minimizer to a certain energy functional.When possible, this approach seems the most appealing: the loss functions is naturally given by the problem, it can accommodate low regular solutions, and the computational cost is moderate in the sense that it only requires to handle test functions (no trial functions).It also carries potential to address high dimensional problem as illustrated in [8][9][10].

Contribution
The goal of this work is to compare and develop several neural network schemes for singularly perturbed problems when ε → 0. We focus more particularly on convection-diffusion (or stationnary Fokker-Planck) problems with vanishing diffusion for which we explore schemes from the second category according to the above distinction.In other words, we approximate solutions of singular PDEs with feedforward neural network functions.When ε → 0, the regularity of the solutions is deteriorated because of local or boundary thin layers.Therefore the vanilla PINNs method is expected to perform poorly for small values of ε because it commits a variational crime (and this is actually confirmed in our numerical experiments).Methods based on weak variational formulations seem better adapted, and on that front, it is desirable to work with the deep Ritz method.However, finding energy formulations is not straightforward due to the non-symmetric nature of convective effects.We show how this method can be applied in this context thanks to a change of variable.We compare its numerical robustness with respect to the PINNs method, and a naive finite element discretization with a uniform grid.In the present study, our tests are performed on a 1D example.Despite its simplicity, the example exhibits all the features that are challenging for numerical schemes.For our purposes, the example also presents the important advantage of having analytic solutions which we can leverage in our error analysis, and our validations.Higher-dimensional tests involving also more elaborate sampling strategies are left for future work.
The paper is organized as follows.In Section 2, various formulations of the convection-diffusion problem we are interested in are introduced.In Section 3, we introduce various neural networks-based schemes which are inspired from the various formulations introduced in Section 2. The reader is encouraged to observe that an expert mathematical insight is required in order to build formulations that do not incur in variational crimes.Lastly, in Section 4, these various schemes are compared for one-dimensional problem.We comment on their respective merits and limitations as ε goes to 0.

A singularly perturbed convection-diffusion equation
The aim of this section is to introduce the singularly perturbed convection-diffusion equation we consider in this work, and various formulations of the problem which will be used in Section 3 so as to design various neural networks-based schemes for its numerical solution.

Problem definition
As a prototypical example, we consider the following singularly perturbed convection-diffusion equation on a given domain Ω ⊂ R d , with d ∈ N * .Let F : Ω → R d be a given force field, 0 < ε 1 a small parameter, and f : Ω → R is a given right-hand side function.Our goal is to find a solution u : together with Robin boundary conditions where n refers to the outward unit vector of ∂Ω, α, κ ≥ 0 and g is a real-valued function defined on ∂Ω.In the following, we assume that the force field F derives from a potential function V : Ω → R, in the sense that Under appropriate assumptions on F (or V ), f and g, which are assumed to be smooth functions for the sake of simplicity, problem (1)-( 2) can be proved to have a unique solution [11][12][13].Note that, more generally, α and κ could also be given as real-valued functions defined on ∂Ω, instead of constants, and our subsequent developments could be easily adapted.
The equation represents the change in the concentration u of a quantity in a given medium, and in presence of convective and diffusive effects.The force field F represents the drag force while the singular perturbation parameter ε represents the diffusivity of the medium.In the limit of an inviscid medium as ε → 0, the equation changes from elliptic to hyperbolic nature, and from second to first order.For Dirichlet boundary conditions u = 0 on ∂Ω, the solution can develop sharp boundary layers of width ε near the outflow.We refer the reader to [14] for general references on this equation regarding its analysis and numerical methods.
Classical numerical methods are challenged by problem (1) when ε is small.In the case of the Galerkin finite element method, the poor performance for this problem is reflected in the bound on the error in the finite element solution.For Ω = (0, 1) and Dirichlet boundary conditions, a standard Galerkin method with a uniform grid of size h delivers a solution u h on a finite element space P h that satisfies where C(ε) ∼ ε −1 , so that the constant blows up as ε → 0 (see [14,Theorem 2.49]).The dependence of C on ε is usually referred to as a loss of robustness in the sense that, as ε decreases, the Galerkin method is bounded more and more loosely by the best approximation error.As a consequence, on a coarse mesh and for small values ε, the Galerkin approximation develops spurious oscillations everywhere in the domain.This very well-known behavior will actually be observed later on in our numerical tests.Numerous methods have been proposed in order to address this loss of robustness in finite element methods.An important family of methods is based on using residual-based stabilization techniques.Given some variational form, the problem is modified by adding to the bilinear form the strong form of the residual, weighted by a test function and scaled by a stabilization constant τ .The most well-known example of this technique is the streamline upwind Petrov-Galerkin (SUPG) method (see [15]).The addition of the residual-based stabilization term, can be interpreted as a modification of the test functions which means that these methods seek stabilization by changing the test space, and motivates to search for optimal test spaces in the spirit of [16,17].
Other classical discretization methods such as finite volumes suffer from similar issues, and strategies involving layer-adaptive grids such as Shishkin meshes have been proposed (see, e.g., [18]).
The aim of this work is to explore the potential of approximating solutions of such problems with neural network functions, and the next section presents several options for this, with a discussion on their merits and limitations.

General formulation
Any neural-network based numerical scheme for the solution of ( 1)-( 2) relies on the use of a variational formulation of this problem which enables to write u (or another function defined from u) as a minimizer of a problem of the form where V is a particular set of real-valued functions defined on Ω.The loss function J : V → R is usually of the form where for every v ∈ V, R(v) and S(v) are real-valued functions defined on Ω and ∂Ω respectively.They are assumed to be measurable with respect to the measures ρ and τ , which are defined on Ω and ∂Ω respectively.The aim of the next sections is to introduce various formulations of problem ( 1)-( 2) under the form (4)-( 5).This requires to define appropriate definitions of the set V, the functions R(v) and S(v) for any v ∈ V and the unknown function solution of (4).Unless otherwise stated, the measures ρ and τ will be defined as the Lebesgue and Lebesgue surfacic measures respectively.

Vanilla (V) formulation
We begin by introducing the most classical formulation used in neural network-based numerical schemes such as PINNs.For the reasons that we outline next, different aspects of this formulation can be improved, therefore we refer to it as vanilla (V) formulation in the following.
The formulation consists in interpreting the solution u of ( 1)-( 2) as the unique solution of a minimization problem of the form (4) with V = H 2 (Ω) and to define for all v ∈ V, for some λ ∈ (0, 1).In this approach, the parameter λ enables to tune the respective weight of the contributions of the bulk and boundary terms in the total functional J to be minimized.In practice, in the numerical tests presented in Section 4, λ will always be chosen to be equal to 1  2 .Note that such an approach requires the solution u to belong to H 2 (Ω), which implies that the solution has to be sufficiently regular.When ε → 0, this assumption becomes less and less realistic due to the formation of boundary layers.This raises the question as to whether it is possible to introduce another formulation of problem (1)-( 2) which would allow for less regular solutions.The goal of the next section is to introduce such an alternative formulation.

Weak variational (W) formulation
In this section we develop an avenue based on an energy minimization approach which requires less regularity in the solutions than the vanilla formulation.To this aim, we introduce the change of variable where c ∈ R is a constant yet to be determined.Taking first and second derivatives in (7) yield that for all x ∈ Ω, Now, setting the value of c to be and inserting the change of variable into (1), we conclude that u is a solution to (1) if and only if z is a solution to the elliptic problem with Robin boundary conditions At this stage, one could of course apply the vanilla formulation to solve (8)-( 9) and compute z solution of a minimization problem of the form (4) with V = H 2 (Ω) and the functionals R and S defined by for all v ∈ V = H 2 (Ω) and some λ ∈ (0, 1).The value of λ chosen in our numerical tests is λ = 0.5.We will refer to this approach as the vanilla-z (V z) formulation.
Note that this method does not fully exploit the change of variables since the elliptic nature of problem (8) allows us to easily build a weak formulation of this equation.Testing against a smooth test function v and integrating by parts we obtain the weak formulation Using equality (9), we get Therefore the weak formulation of problem ( 8) is to find z ∈ H 1 (Ω) such that with To ensure that the symmetric bilinear form a is continuous and coercive, we assume in the sequel that the following conditions are satisfied: In that case, all the hypothesis of the Lax-Milgram theorem are satisfied for a and .Thus, z can be equivalently rewritten as the unique solution of a minimization problem of the form This implies that z can equivalently be recast as the unique solution of a minimization problem of the form (4) with V = H 1 (Ω) and We will refer to this approach as the weak-z (Wz) formulation.
Moreover, using (7), we can equivalently express u as a solution of a minimization problem of the form (4) with and for all v ∈ V with v := ve − V 2ε .We will refer to this formulation as the weak (W) formulation.

Rescaled formulation
In this section, we introduce another formulation based on a change of scale in the original problem.More precisely, introducing Ω ε := 1  ε Ω, we introduce auxiliary functions u : Notice that if u and z satisfy (7), then Denoting by F (y) := −∇ V (y) = F (εy) for all y ∈ Ω ε , it holds that u is solution to (1)-( 2) if and only if u is solution to where f (y) := f (εy) for all y ∈ Ω ε with boundary conditions with g(y) := g(εy) for all y ∈ Ω ε .
Using similar calculations to the ones done in Section 2.4, the Lax-Milgram theorem guarantees that z is the unique solution in H 1 (Ω ε ) of the following variational problem: for all v ∈ H 1 (Ω ε ), The result is valid provided that the following assumptions on the coefficients hold The above conditions are equivalent to the assumptions (12) stated in Section 2.4.
As in the previous section, the function z is then the unique solution of a minimization problem of the form (4) with We will refer to this approach as the rescaled-weak-z (RWz) formulation.

Summary of the methods
For the sake of clarity, we summarize here the main features of each method.

Neural networks based numerical schemes
In this section we describe the numerical approach used in order to compute an approximation of the solution of a minimization problem of the form (4) by means of a neural-network based method.We first present in Section 3.1 the general principle of such approaches.The main ingredients to design a neural-network based method consist in the choice of a class of neural network functions and of sampling schemes in order to approximate the integrals involved in the definition of the loss function J defined by (5).These two ingredients are detailed respectively in Section 3.2 and Section 3.3 respectively.Finally, some details on the numerical implementation are given in Section 4.2.

General principle
The numerical solution of a minimization problem of the form (4) usually requires to consider alternatives to V and J that are amenable for practical implementation.The strategy thus consists in formulating a related problem of the form where • K ⊂ V is a set of functions parametrized by a finite number of scalar coefficients.A classical class of functions are finite elements.Here, we consider neural networks (see Section 3.2 below); • J is an approximation of the loss function J where the integrals are approximated using some particular quadrature or sampling schemes.More precisely, for given integers K, M ∈ N * , given sets of points (x k ) 1≤k≤K ⊂ Ω, (y m ) 1≤m≤M ⊂ ∂Ω, and given sets of weights (ρ k ) 1≤k≤K ⊂ R + and (τ m ) 1≤m≤M ⊂ R + , for all v ∈ K, the functional J (v) is defined by As a consequence, the definition of a neural-network based numerical scheme for the approximation of a problem of the form (4) requires the definition of two ingredients: • the class K ⊂ V of neural network functions; • the sampling scheme, i.e. the choice of K, M , (x k ) 1≤k≤K , (y m ) 1≤m≤M , (ρ k ) 1≤k≤K and (τ m ) 1≤m≤M in order to define the approximate functional J given by (23).The set of neural network functions K we consider in our numerical experiments is presented in Section 3.2.The various sampling schemes tested here are given in Section 3.3.

Neural Network classes of functions
In this work, we only consider classes of functions defined by means of feedforward neural networks whose definition we recall next (see [19] for general references).
Let X ⊂ R d X and Y ⊂ R d Y be some input and output sets of finite dimensions d X , d Y ∈ N * .A feedforward neural network is a function ψ : X → Y which reads as ψ(x) = T L (σ(T L−1 (σ(. . .σ(T 0 (x))))), ∀x ∈ X .
(24) For every ∈ {0, . . ., L}, is an affine function which can be expressed through a matrix A ∈ R p +1 ×p , and an offset vector b ∈ R p +1 , and σ : R → R is called the (nonlinear) activation function.By a slight abuse of notation, for all p ∈ N * and for any vector w := (w i ) 1≤i≤p ⊂ R p , the notation σ(w) actually denotes the vector of R p with entries σ(w i ), that is, σ(w) = (σ(w i )) p i=1 .Note that since ψ maps X onto Y, it is necessary that p 0 = d X and p L+1 = d Y .The layers numbered from 1 to L are usually called the hidden layers of the neural network.
To define a class of feedforward neural networks, we fix an architecture by prescribing a given activation function σ, depth L ∈ N, and layer widths p = (p 0 , . . ., p L+1 ) ∈ (N * ) L+2 .Once the values of σ, L and p have been chosen, we view the coefficients (A , b ) 0≤ ≤L of the affine mappings T 0 , • • • , T L as parameters.We gather these coefficients in the vector of parameters and assume that θ takes values in a set For any θ ∈ Θ, we define by ψ θ : X → Y the function ψ defined by ( 24) with θ = {(A , b )} L =0 ∈ Θ.The class of neural network functions with architecture (σ, L, p) and coefficient sets Θ is then defined as N (σ, L, p, Θ) := {ψ θ : θ ∈ Θ} .
In our context, the input and output sets X and Y are respectively given by Note that the set K is then a subset of V for all the formulations of the convection-diffusion problem we introduced in Section 2.Moreover, the solution of the approximate problem ( 22) is equivalent to finding a minimizer θ * ∈ Θ solution to min θ∈Θ J (ψ θ ).
Remark: In many machine learning applications, the choice of relu activation functions is very common due to its low computational cost when performing evaluation or first order differentiation.However, in our problem, second order derivatives are needed to calculate the loss function.If relu activation functions were used, then the second order derivative terms would be 0, and no good approximation could be learned.This reason motivates our choice of tanh as the activation function.

Sampling schemes
We detail in this section the various sampling schemes we considered in our numerical tests in order to define the approximate loss function J .

S(v) dτ for the RWz formulation).
We consider three different sampling schemes for the approximation of the bulk term Ω R(v) dρ: (1) The first choice is a simple uniform sampling scheme (labeled −u in our tests).For a given K ∈ N * , we set ρ k = 1 K and (x k ) 1≤k≤K as the centers of the intervals given by a uniform discretization grid of the interval (0, 1).
(2) The second sampling scheme, called random (−r) scheme, consists in choosing ρ k = 1 K and the points (x k ) 1≤k≤K as a collection of random points, identically independently distributed according to the uniform distribution on (0, 1).
(3) We lastly consider a third sampling scheme, called exponential (-e) scheme, which is specific to the Wz formulation.Recall that in this case, for all v ∈ K, the expression of R(v) is given by ( 14), namely

The thus view the bulk integral term as
and we approximate each component separately as follows.For the first term, we draw a collection of k ) 1≤k≤K1 from the uniform distribution on (0, 1) and for all 1 ≤ k ≤ K 1 , the weights ρ (1) k are chosen to be equal to 1 K1 .For the second term, we draw k ) 1≤k≤K2 following the probability density k ) .
In the following, we use the notation −u (respectively −r and −e), after the name of a formulation, in order to refer to the numerical method obtained by using this formulation, together with a uniform (respectively random or exponential) sampling scheme.For instance, the V − u method refers to the vanilla formulation used in conjunction with a uniform sampling scheme.

Comparison with finite element schemes
One important point in the investigation of the merits and limitations of deep learning-based numerical schemes is to understand how they compare with respect to other existing schemes.In our tests, we provide a numerical comparison with a vanilla finite element Galerkin scheme involving a uniform mesh.For the sake of completeness, we briefly recall the main steps of our finite element Galerkin approach.
Integrating the original equation ( 1) against a sufficiently smooth function v ∈ C ∞ (Ω), and integrating by parts, it follows that a weak formulation of problem (1) We numerically solve this problem by Galerkin projection.For this, we consider a mesh (T n ) N n=1 of Ω and define the associated P 1 finite element space We next take as a basis of V N the set of tent functions defined as and we express the solution as u N = N i=1 c i ϕ i .Gathering the expansion coefficients in the vector c = (c i ) N i=1 , and injecting the expansion of u N in the variational formulation, we are led to the system of equations M c = q where M = (M i,j ) 1≤i,j≤N , M i,j := a(ϕ i , ϕ j ), q = (q i ) N i=1 , q i := l(ϕ i ).

Test case and comparison criteria
In this section we show the results obtained by approximating the exact solution of the problem described in equation (1) using the methods introduced above.We work on the one dimensional domain Ω = (0, 1) with F = 1 and f = 1.We choose Robin boundary conditions that mimic Dirichlet conditions and we set α = 10 −3 , κ = 1, g 0 = g 1 = 0. Note that we cannot take α = 0 since all variational methods are not well defined for pure Dirchlet boundary conditions.With these choices, the equation reads Since we work in dimension 1, we can benefit from the fact that the exact solution u has an analytic form as shown in the Appendix A. We can thus easily compare the approximation quality of the output functions û from our methods by computing a discrete version of their L 2 (Ω) error norm with respect to the exact solution.
The points x k are sampled uniformly as defined in 3.3.We use 10 times more points than the ones used for training, so K = 10K.Similarly, we also compute the error with respect to the H 1 (Ω) semi-norm.
Note that one can obtain the H 1 error by adding the above error components.
We study the impact on the errors of the following parameters: • The values of ε.They range from 5.10 −3 to 10.0 with a logarithmic spacing.
• The choice of the sampling method for the training points (uniformly spaced or uniformly random, labelled as −u and −r).• The impact of the machine precision (Float16, Float32, Float64).Due to the randomness in the initialization of weights on the neural networks, for each combination of parameters (ε, K, sampling type, and machine precision), we perform 10 repetitions with different initializations.Since we didn't notice a big difference between the l 2 error and the h 1 error, we keep just the second one for clarity and put in the Appendix B the plots in l 2 error.

Our code and practical implementation details
All our neural network based numerical tests were performed in Python 3.6 and using the TensorFlow 1.13.1 library [20].The code provided in the original paper on PINNs [21] was used as the starting point for our own code developments, and we have followed similar guidelines to generalize and enlarge it where needed.In the same way, for each numerical method, derivatives of functions v ∈ K are computed using automatic differentiation.The numerical optimization procedure used in order to compute an approximation of θ * a minimizer of problem (26) is given by the quasi-Newton L-BFGS algorithm [22].The code used to generate the examples shown here is available at https://github.com/agussomacal/ConDiPINNThe interested reader can reproduce our results and test the impact of the variations of certain parameters such as ε, K, the sampling method, and the machine precision.

Impact of the number K of training points
In this section we discuss the impact of the number K of training points.We fix the machine precision to Float32, and the uniform sampling −u.
Figure 1 shows the best result obtained in the tests, i.e., the minimum value of the h 1 norm obtained in the 10 different simulations, plotted against the values of ε.In Figure 2, we fix = 10, and plot statistics on the accuracy e h 1 (left plot) and computation runtimes for different K (right plot), and for the different methods.
From these figures, we first notice that the approximation of FEM degrades when ε decreases.However, the quality globally improves when the number of training point increases (see, e.g., Figure 1 -left plot).The rate of improvement is quadratic as we can see from the right plot in Figure 2. In addition, when looking at the runtimes (Figure 2 -right) we observe the expected quadratic increase with respect to the number K of training points.
We can next study the behavior of Vanilla PINN and compare to FEM.We observe that it performs at around constant accuracy for any number of training points until around ε = 0.027 where stops producing reliable approximations (see Figure 1).One remarkable observation is that the Vanilla PINN error for large values of ε and small number of training points K = 10 is comparable to the FEM errors with a much larger number of degrees of freedom K > 10 3 (see left plot in Figure 2).Regarding the runtime to fit the neural network, we see that it is roughly constant for all values of K and it is comparable to the runtime of the FEM method with K = 100 (right plot in Figure 2).
We next comment on the other PINN-based variational methods.For ε large enough, we observe that all the variational based methods follow the same error trend as FEM both with respect to ε and K and for K < 10 4 they even perform marginally better.With respect to the computing time, all the methods perform with almost constant time with respect to K and similarly to a FEM method with K = 100 degrees of freedom.However, for ε < 0.63, the methods W − z, W − z − e and V − z blow up and lose completely their approximation capabilities.We conjecture that this is due to the fact that the neural network is used to approximate the solution z from the transformed problem, and there is an exponential term to go back from z to u (see equation (7)).This may lead to machine precision overflows (in the exponential computation) and underflows (the neural network has to learn very small values of z which also are in the limits of precision).To address this issue, we have explored two possible strategies: one was by directly minimizing over u while maintaining the weak formulation which accounts for the method W .The second approach is to perform the re-scaling of the domain RW − z.In both cases the blow up caused by the exponential is solved although the re-scaling method RW − z doesn't perform as good as others in the region with large ε values.
We finish this section by plotting in Figure 3 the best approximated solution for each model, and different values of ε.The interested reader may experiment other configurations in our provided code.The most striking observation is that only FEM and the vanilla PINN method recover the final shape of the exact solution when ε is small.The other variational PINN methods fail despite that some of them exhibit comparable values to FEM in the generalization errors as Figure 1 illustrates.This observations suggests that perhaps other types of error metrics should be introduced in order to be able to better distinguish between "good solution shapes" and "bad ones".

Impact of Machine Precision
Figure 4 shows the h 1 -error of the different approximated solution by changing the machine precision in the parameters of the neural networks for the different values of ε: Float16, Float32 and Float64.There is an improvement when going from Float16 to Float32 in all methods.Interestingly, we did not obtain very satisfactory results when working with Float64 precision.This precision seems to difficult the convergence to good quality minima: even after 10 repetitions, we failed to find good results.However, as the plots show, when a good minimum is found, it delivers slightly better approximation than lower machine precisions.For these reasons we have performed our experiments using the Float32 which seemed the most stable choice.

Impact of Sampling Strategy
Figure 5 shows the h 1 -error of the different approximated solution by changing the sampling strategy.For all models, the unif orm strategy is found to be either as good as the random or slightly better.For this reason we performed all the experiments using the unif orm strategy.

Conclusions from the numerical experiments
The above numerical experiments depict a contrasted landscape concerning the merits and limitations of deep learning-based approaches when the solutions become low regular: • For large values of ε when solutions are rather regular, some PINNs perform clearly better than FEM regarding the generalization errors.The superiority is particularly remarkable for very small number K of training points.However, the shapes of PINN solutions are sometimes not as satisfactory as the ones given by FEM.• For the challenging case where ε becomes small and solutions become less regular (which was the main motivation of our study), the accuracy of the variational neural-network methods is essentially comparable or worse to the one given by FEM in terms of generalization errors.Some PINN variational approaches become too unstable and the errors blow up.Only FEM and the vanilla PINN approach seem to be able to recover the correct shape of the exact function.The latter one has however the risk of sometimes falling into local minima with bad shapes.• The runtimes are clearly in favor to PINN methods as Figure 2 illustrates, and the simplicity of implementation is also in favor to all PINN methods.

Future research directions and extensions
One important point to explore in future works concerns the choice of the loss function for the training, and also the metric to evaluate generalization errors.It will also be interesting to explore if adaptive sampling strategies during the training could help to recover good solutions in a more stable manner.Finally, the impact of the machine precision in some steps involving exponential transformations seems also to be an important obstacle to retrieving stable solutions.It would be interesting to develop strategies that circumvent this issue.All these developments will play a crucial role in order to address higher dimensional problems with similar characteristics as the one considered here.

A. Analytic solution in dimension 1
The aim of this section is to give the analytic expression of the solution of ( 1)-(2) in the case when d = 1, Ω = (0, 1), and F and f are assumed to be equal to some constant real numbers.Thus, in this section, using a slight abuse of notation, we assume that F, f, g ∈ R. Let us also introduce g 0 , g 1 ∈ R so that g(0) = g 0 and g(1) = g 1 .The problem then reads as follows: find u : (0, 1) → R solution to    −εu (x) + F u (x) = f, ∀x ∈ (0, 1), −αu (0) + κu(0) = g 0 , αu (1) + κu(1) = g 1 . (30) Then, it can be easily checked that the solution to this equation reads as In the following, we assume that the values of κ and α do not take these values, and the above system is invertible.

Figure 1 .
Figure 1.Comparison of the behavior of the h 1 error for the different methods and different number of sampling points in training.From top to bottom and from left to right, the first figure is produced for K = 10, the second for K = 100, the third for K = 1000 and the last one for K = 10000.The set of points to train and test have been chosen with the uniform sampling method.The precision has been fixed to Float32 for all the tests.

Figure 2 .
Figure 2.For ε = 10 (region where all methods work well), we look at the comparison between methods and the difference with respect to the number of training points K.The h 1 error (right) and the computation times (left).The set of points to train and test have been chosen with the uniform sampling method.The precision has been chosen as Float32 for all the tests.

Figure 3 .
Figure 3.The best approximated solution out of 10 repetitions, for each model, and with K = 100 training samples.From left to right: ε = 0.039, 0.18, 10.The interest reader may experiment other configurations in our provided code.

Figure 4 .
Figure 4. Here, the comparison of the behavior of the model for different float precision.The tests have been performed for K = 100 and uniform sampling.

Figure 5 .
Figure 5. Here, the comparison of the behavior of the model for the two different sampling strategies.The tests have been performed for K = 100 and the float precision equal to Float32.

Figure 6 .
Figure 6.The comparison of the behavior of the l 2 error for the different methods and different number of sampling points in training.From up to down and from left to right, the first figure is produced for K = 10, the second for K = 100, the third for K = 1000 and the last one for K = 10000.The set of points to train and test have been chosen with the uniform sampling method.The precision has been chosen as Float32 for all the tests.