On basis set optimisation in quantum chemistry

Abstract. In this article, we propose general criteria to construct optimal atom-centered basis sets in quantum chemistry. We focus in particular on two criteria, one based on the ground-state one-body density matrix of the system and the other based on the ground-state energy. The performance of these two criteria is then numerically tested and compared on a parametrized eigenvalue problem, which corresponds to a one-dimensional toy version of the ground-state dissociation of a diatomic molecule.


Introduction
In quantum chemistry, a central problem is the computation of the electronic ground state (GS) of a given molecular system. For many-electron systems, it is not possible to solve the N-body Schrödinger equation directly, and most calculations are thus based on variational (e.g. Hartree-Fock) or non-variational (e.g. coupled cluster) approximations of the latter, or on Kohn-Sham density functional theory (DFT). For all these models, the continuous equations (e.g. a nonlinear elliptic eigenvalue problem in the Hartree-Fock or Kohn-Sham settings) are discretized in a finite-dimensional approximation space. Approximation spaces constructed from atomic orbital (AO) basis sets [15,23] have many advantages and are therefore the most common choice in the quantum chemistry community. An AO basis set consists of a collection of functions χ = (χ^z_μ)_{z∈CE, 1⩽μ⩽n_z}, where CE is a set of atomic numbers (e.g. CE = {1, ..., 92} for the natural chemical elements of the periodic table), n_z a positive integer depending on the electronic shell structure of the chemical element with atomic number z, and χ^z_μ ∈ H¹(R³) a fast-decaying function centered at the origin, called an atomic orbital. Consider an atomic configuration ω consisting of M nuclei with nuclear charges z_1, ..., z_M (in atomic units) and positions R_1, ..., R_M in the three-dimensional physical space. If the AO basis set χ is chosen by the user, the (spatial component of the) one-electron finite-dimensional space in which the chosen electronic structure model of a molecular system with atomic configuration ω is discretized is

X_ω = Span{ χ^{z_m}_μ(· − R_m), 1 ⩽ m ⩽ M, 1 ⩽ μ ⩽ n_{z_m} }.

The accuracy of the approximation therefore crucially depends on the quality of the AO basis set. The main advantage of AO basis sets is that only a small number of AOs per atom (typically a dozen) are necessary to obtain a relatively accurate result on most quantities of interest. This is in sharp contrast with standard discretization methods used in the simulation of partial differential equations, such as finite-element methods.
To make a connection with discretization methods used in mechanical and electrical engineering, AO basis set discretization methods can be considered as spectral methods [9], and share common features with the modal synthesis method [11, Chapter 7], [10]. A drawback of AO basis sets is that their conditioning quickly blows up when the size of the basis is increased by including polarization and diffuse basis functions, a problem known as overcompleteness [18]. The numerical errors due to this large condition number can deteriorate the accuracy of the computed solutions and/or significantly increase computational times. AO basis sets can therefore not be systematically improved in a straightforward way.
In the early days, AOs were Slater functions [29], with exponential decay and a cusp at the origin. It was then realized by Boys [7] in 1950 that it was much more efficient from a computational viewpoint to use Gaussian-type orbitals (GTOs), that is, linear combinations of polynomials times Gaussian functions. Indeed, the multi-center overlap, kinetic, and Coulomb integrals arising in discretized electronic structure models can then be computed analytically by means of explicit calculations and recursion formulas. However, individual Gaussian functions poorly describe the cusps of the bound-state electronic wavefunctions at the nuclear positions. Contracted Gaussians [20], that is, linear combinations of Gaussians with different variances, were quickly introduced as they overcome this deficiency. Several classes of GTO basis sets have been proposed since the 1950s: STO-nG basis sets [14] were built as contractions of n Gaussians that fit Clementi STO SCF AOs in an L² least-squares sense [30]. It was quickly realized that better GTO basis sets could be obtained by minimizing the atomic Hartree-Fock ground-state energy. This approach led to the split-valence basis sets (e.g. 6-31G), with core and valence orbitals being approximated differently, developed by Pople et al. [5]. Basis sets better suited for correlated electronic structure methods were then introduced, notably Atomic Natural Orbitals (ANO) [2] and Dunning basis sets [13]. ANO basis sets are built by selecting occupied and virtual orbitals from Hartree-Fock and natural orbitals from correlated computations of atomic systems. Dunning bases provide a (finite) hierarchy of bases obtained by consistently increasing the number of basis functions corresponding to different angular momenta. This optimization strategy yields the so-called correlation-consistent cc-pVXZ basis sets, which are, together with their augmented versions, still commonly used nowadays.
Mathematical studies proving convergence rates or proposing systematic enrichment of GTO basis sets are so far quite limited. The approximability of solutions to electronic structure problems by Gaussian functions was studied in [16], and later on in [27,28]. An a priori error estimate on the approximation of Slater-type functions by Hermite and even-tempered Gaussian functions was derived in [3]. A construction of Gaussian bases combined with wavelets was proposed on a one-dimensional toy model in [25].
Commonly used Pople and Dunning GTO basis sets were optimized from atomic configuration energies and Hartree-Fock (and/or natural) atomic orbitals. Let us also mention [12,26], where system-specific optimization of AO bases has been investigated, however focusing on specific models (e.g. one-electron periodic Hamiltonians) or optimization criteria. In this article, we propose a different approach, which is adaptable to any criterion one might be interested in, and involves molecular configurations. In Section 2, we introduce an abstract mathematical framework for the construction of optimal AO basis sets, based on the choices of (1) a set of admissible atomic configurations Ω; (2) a probability measure P on Ω; (3) a set of admissible AO basis sets B; (4) a criterion j(χ, ω) quantifying the error between the exact values of the quantities of interest when the system has atomic configuration ω ∈ Ω (for the continuous model under consideration) and the ones obtained after discretization in the basis set χ ∈ B. We also provide examples of possible choices of Ω, P, B, and j. As a proof of concept (Section 3), we apply this strategy to a simple toy model of a 1D homonuclear diatomic "molecule" with two 1D non-interacting spinless "electrons", which allows for extremely accurate reference calculations. Finally, we present numerical results in Section 4, where we compare the efficiency of the so-optimized AO bases with that of AO bases constructed from the occupied and unoccupied orbitals of the isolated "atom".

Abstract framework
We start by formulating the problem of basis set optimization in an abstract setting. The procedure can be divided into four steps.
First, we select the set Ω of all possible atomic configurations we are a priori interested in. For instance, depending on the foreseen applications, one can consider the set of all possible finite atomic configurations containing only hydrogen, nitrogen, carbon, and oxygen atoms, or the set of all possible periodic arrangements of chemical elements with less than 20 atoms per unit cell.
Second, we equip Ω with a probability measure P in order to allow for different configurations to have different weights in the optimization procedure. We will see later that our method requires the computation of very accurate reference solutions for all ω's in the support of P. For practical reasons, we therefore need to choose P of the form

P = Σ_{n=1}^{N_c} β_n δ_{ω_n},    (1)

where {ω_1, ..., ω_{N_c}} is a finite (not too large) subset of Ω, δ_{ω_n} the Dirac mass at ω_n, and {β_1, ..., β_{N_c}} positive weights such that Σ_{n=1}^{N_c} β_n = 1. Assume for instance that we are solely interested in reproducing accurately the dissociation curve of the HF (hydrogen fluoride) diatomic molecule. Then the set Ω should be identified with the interval (0, +∞), a configuration ω ∈ Ω with the H−F interatomic distance R ∈ (0, +∞), and P should be a probability measure on the interval (0, +∞). The selection of the ω_n's and β_n's can be done in various ways. An option is to i) choose a continuous probability distribution on (0, +∞) on the basis of chemical arguments, putting little weight on the usually unimportant very small interatomic distances, more weight on interatomic distances close to the equilibrium distance (d ≃ 0.92 Å), sufficient cumulated weight on very large interatomic distances to correctly reproduce the dissociation energy, and more or less weight on intermediate interatomic distances in the range 2−8 Å, depending on their importance for the targeted application; ii) fix the number N_c according to the available computational means; iii) compute the ω_n's and β_n's using e.g. quantization algorithms [21], possibly based on optimal transport, or clustering algorithms [24].
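Step iii) can be sketched numerically. The following is a minimal illustration using plain 1D Lloyd (k-means) iterations; the sampled distribution is a purely illustrative stand-in for a chemically motivated law on interatomic distances, and [21,24] describe more sophisticated algorithms.

```python
import numpy as np

def quantize(samples, n_centers, n_iter=100):
    """Approximate a continuous law by sum_n beta_n * delta_{omega_n}:
    the centers play the role of the omega_n, cluster fractions of the beta_n."""
    # initialize the centers on quantiles of the empirical distribution
    centers = np.quantile(samples, (np.arange(n_centers) + 0.5) / n_centers)
    for _ in range(n_iter):
        # assign every sample to its nearest center, then recenter
        labels = np.argmin(np.abs(samples[:, None] - centers[None, :]), axis=1)
        for k in range(n_centers):
            if np.any(labels == k):
                centers[k] = samples[labels == k].mean()
    weights = np.bincount(labels, minlength=n_centers) / len(samples)
    return centers, weights

rng = np.random.default_rng(0)
# illustrative stand-in distribution: mass near a short "equilibrium"
# distance with a tail towards dissociation
samples = 0.5 + rng.gamma(shape=4.0, scale=0.5, size=10_000)
omega, beta = quantize(samples, n_centers=5)
```

The returned `omega` and `beta` then define the discrete measure (1).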
Third, we select the set B of admissible AO basis sets. Restricting ourselves to the framework of GTOs, this can be done by choosing, for each chemical element arising in Ω, the number, symmetries, and contraction patterns of the Gaussian polynomials of the AOs associated with this particular element. In this case, B has the geometry of a convex polyhedron of R^d.
Given an atomic configuration ω ∈ Ω and an AO basis set χ ∈ B, we denote by X_ω the one-electron finite-dimensional space obtained by using the AO basis set χ to describe the electronic structure of a molecular system with atomic configuration ω and an arbitrary number N of electrons.
The fourth and final step consists in choosing a criterion j(χ, ω) quantifying the quality of the results obtained when using the basis set χ ∈ B to compute the electronic structure of a molecular system with atomic configuration ω. The choice of the function j : B × Ω → R_+ depends on the quantities of interest (QoI) to the user, and on the respective weights of these quantities in the case of multicriteria analyses. For instance, if one focuses on the ground-state energy of the electrically neutral molecular system, a natural criterion is

j_E(χ, ω) := |E_ω − E^χ_ω|²,    (2)

where E_ω is the exact ground-state energy of the neutral system with atomic configuration ω for the chosen continuous model (e.g. Hartree-Fock, MCSCF, Kohn-Sham B3LYP...) and E^χ_ω the ground-state energy obtained with the model discretized in the AO basis set χ. Note that the absolute value of the difference is squared to make j_E differentiable. Another possible choice is to use a criterion based on the one-body reduced density matrix (1-RDM), for instance

j_A(χ, ω) := Tr(A^{1/2} (γ_ω − Π^A_{X_ω} γ_ω Π^A_{X_ω}) A^{1/2}),    (3)

where A is a given self-adjoint, positive definite operator on the one-particle state space H with form domain Q(A), γ_ω the exact ground-state 1-RDM of the neutral system with atomic configuration ω for the chosen continuous model, and Π^A_{X_ω} the orthogonal projector on X_ω for the Q(A) inner product. If A = (1 − Δ), then Q(A) is the Sobolev space H¹(R³), and Π^A_{X_ω} is the orthogonal projector on X_ω for the H¹ inner product. Diagonalizing γ_ω as

γ_ω = Σ_j n_{ω,j} |ψ_{ω,j}⟩⟨ψ_{ω,j}|,

where the n_{ω,j}'s are the natural occupation numbers (NON) and the ψ_{ω,j}'s the natural orbitals (NO) for the chosen continuous model of the neutral system with atomic configuration ω, it holds

j_A(χ, ω) = Σ_j n_{ω,j} ∥(1 − Π^A_{X_ω}) ψ_{ω,j}∥²_{Q(A)}.

Minimizing j_A(χ, ω) thus amounts to maximizing the NON-weighted sum of the squared Q(A)-norms of the Q(A)-orthogonal projections of the natural orbitals on the discretization space X_ω. Other criteria may include errors on molecular properties, or a weighted sum of several elementary criteria, each of them targeting a specific property. The criterion should be chosen according to the intended application.
The aggregated criterion to be optimized then reads as an integral over the configuration space Ω with respect to the probability measure P,

J(χ) := ∫_Ω j(χ, ω) dP(ω),    (4)

and the problem of basis set optimization can be formulated as: find χ_opt ∈ argmin_{χ∈B} J(χ). In the following, J_E and J_A denote the evaluation of the criterion (4) with j = j_E and j = j_A, respectively.
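To make the density-matrix criterion concrete, here is a minimal discrete sketch of j_A with A = Id (so the Q(A)-norms reduce to L² norms); the grid, the "exact" natural orbitals, and the AO basis below are illustrative stand-ins, not the paper's data.

```python
import numpy as np

# Minimal discrete sketch of j_A with A = Id; everything is illustrative.
x = np.linspace(-8.0, 8.0, 801)
dx = x[1] - x[0]

def normalize(v):
    return v / np.sqrt(dx * np.dot(v, v))

# "exact" natural orbitals psi_1, psi_2, orthonormalized by Gram-Schmidt;
# the exact 1-RDM is gamma = |psi_1><psi_1| + |psi_2><psi_2|
psi1 = normalize(np.exp(-0.5 * (x - 1.0) ** 2))
psi2 = np.exp(-0.5 * (x + 1.0) ** 2)
psi2 = normalize(psi2 - dx * np.dot(psi1, psi2) * psi1)

# AO basis: two Gaussians centred at +/-1, gathered as columns of X
X = np.stack([np.exp(-((x - 1.0) ** 2)), np.exp(-((x + 1.0) ** 2))], axis=1)
S = dx * X.T @ X                       # overlap matrix
P = X @ np.linalg.solve(S, dx * X.T)   # L2-orthogonal projector onto span(X)

# j_A = sum_j n_j ||(1 - P) psi_j||^2, with occupation numbers n_j = 1 here
jA = sum(dx * np.dot(v - P @ v, v - P @ v) for v in (psi1, psi2))
```

The aggregated criterion (4) would then simply be a β-weighted sum of such values over the sampled configurations.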
Remark 1 (Reference solutions). The evaluation of the criteria J_E and J_A hinges on the knowledge of the exact GS energy E_ω or 1-RDM γ_ω for ω in the support of P. In practice, these data can be approximated by very accurate off-line reference computations for a small, wisely chosen sample of configurations ω. This is the reason why the probability measure P can only be a finite weighted sum of Dirac distributions, as defined in (1).

Application to a 1D toy model
In this section, we focus on a linear one-dimensional toy model, mimicking a homonuclear diatomic molecule.

Description of the model
Let us consider a system of two 1D point-like "nuclei" and two 1D spinless non-interacting quantum "electrons". The one-particle state space is then H = L²(R) and the configuration space is Ω = (0, +∞). In this section, a configuration of Ω will be labelled by the positive real number a > 0 such that the nuclei are located at −a and a. The one-particle Hamiltonian at configuration a then is

H_a = −(1/2) d²/dx² + V_a,    (5)

where V_a models the nuclei-electron interaction. We choose V_a to be a quartic double-well potential with minima at −a and +a. Several considerations led us to define the potential as such. First, V_a is designed so that i) each H_a admits a non-degenerate ground state of energy E_a, and ii) the function a ↦ E_a has the shape of the ground-state dissociation curve of a homonuclear diatomic molecule with atoms at −a and +a. Since the two "electrons" do not interact, the ground-state energy E_a and density matrix γ_a ∈ G_2 are given by the minimum value and a minimizer of

E_a = min { Tr(H_a γ) : γ ∈ G_2 },  G_2 := { γ ∈ S(L²(R)) : γ² = γ, Tr(γ) = 2 },    (7)

S(L²(R)) denoting the space of bounded self-adjoint operators on L²(R). The existence and uniqueness of the solution to problem (7) can be shown by elementary arguments of functional analysis and spectral theory that we do not detail here. Second, V_0(x) = x⁴/4, so that (5) corresponds to the quartic oscillator, for which we have reference numerical solutions (e.g. [6]). Third, V_a behaves like a harmonic well of curvature close to one around ±a for large values of a, and V_a(0) ∼ a⁴/8 → +∞ when a → +∞. Therefore, in the limit a → +∞, problem (7) is similar to two decoupled quantum harmonic oscillators centered at −a and +a, whose bound states are all explicitly known. For the sake of illustration, we display in Figure 1 the potential V_a for two different values of a.
In practice, it is convenient to compute γ_a and E_a from the lowest two eigenvalues λ_{a,1} < λ_{a,2} of H_a and an associated pair (φ_{a,1}, φ_{a,2}) of orthonormal eigenvectors,

H_a φ_{a,i} = λ_{a,i} φ_{a,i},  ⟨φ_{a,i}|φ_{a,j}⟩ = δ_{ij},    (8)

⟨·|·⟩ denoting the L² inner product. We indeed have

E_a = λ_{a,1} + λ_{a,2},  γ_a = |φ_{a,1}⟩⟨φ_{a,1}| + |φ_{a,2}⟩⟨φ_{a,2}|.

The evaluation of the criteria J_A and J_E requires the computation of reference ground-state density matrices or energies, which amounts to finding very accurate solutions of (8) for the configurations a_k in the support of the chosen atomic probability measure

P = Σ_{k=1}^{N_c} β_k δ_{a_k}.    (10)

We chose to compute these reference data using a 3-point finite-difference (FD) scheme on a large enough interval [−x_max, x_max], discretized into a uniform grid with N_g grid points x_1 < ... < x_{N_g} of step δx. We then impose homogeneous Dirichlet boundary conditions at −x_max and x_max. The parameter x_max is chosen such that x_max = a_max + r_max, where a_max = max(supp(P)) and r_max > 0 is the radius beyond which atomic densities are zero at machine (double) precision. Note that this numerical scheme is independent of the configuration a. The FD discretization of problem (8) gives rise to the eigenvalue problem

H^FD_a U_{a,i} = λ^FD_{a,i} U_{a,i},

where H^FD_a is a real symmetric matrix of size N_g × N_g, and the reference data are obtained as

E^FD_a = λ^FD_{a,1} + λ^FD_{a,2}  and  P^FD_a = U_{a,1} U_{a,1}^T + U_{a,2} U_{a,2}^T

(with the normalization δx U_{a,i}^T U_{a,j} = δ_{ij}), where P^FD_a can be interpreted as an approximation of the matrix (γ_a(x_j, x_{j′})) containing the values of the (integral kernel of the) density matrix γ_a at the grid points.
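This reference computation can be sketched in a few lines. The quartic double-well potential below is a stand-in chosen to satisfy the stated properties (V_0(x) = x⁴/4, near-harmonic wells at ±a); the paper's exact V_a may differ.

```python
import numpy as np
from scipy.linalg import eigh_tridiagonal

# Sketch of the 3-point finite-difference reference computation; the
# potential is an illustrative stand-in, not necessarily the paper's V_a.
def reference_data(a, x_max=20.0, n_g=1999):
    x = np.linspace(-x_max, x_max, n_g)   # homogeneous Dirichlet outside
    dx = x[1] - x[0]
    v = (x**2 - a**2) ** 2 / (4.0 * (1.0 + 2.0 * a**2))   # stand-in V_a
    diag = 1.0 / dx**2 + v                # FD of -(1/2) d^2/dx^2 + V_a(x)
    off = np.full(n_g - 1, -0.5 / dx**2)
    # lowest two eigenpairs of the tridiagonal FD Hamiltonian
    lam, u = eigh_tridiagonal(diag, off, select="i", select_range=(0, 1))
    e_a = lam[0] + lam[1]                 # two non-interacting "electrons"
    p_a = u[:, [0]] @ u[:, [0]].T + u[:, [1]] @ u[:, [1]].T  # up to a 1/dx factor
    return e_a, p_a

e5, p5 = reference_data(5.0)   # far-apart wells: nearly two harmonic oscillators
```

For large a the two wells decouple, so e5 should be close to twice the harmonic ground-state energy 1/2.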

Variational approximation in AO basis sets
For any given configuration a ∈ R_+ and basis χ = {χ_μ}_{1⩽μ⩽N_b} ∈ B, problem (8) is solved using a Galerkin method with the basis χ_a = {χ_{a,μ}}_{1⩽μ⩽2N_b} composed of two copies of the basis χ, the first one translated to a and the second one to −a:

χ_{a,μ} = χ_μ(· − a),  χ_{a,N_b+μ} = χ_μ(· + a),  1 ⩽ μ ⩽ N_b.

Defining the Hamiltonian matrix H^χ_a = (⟨χ_{a,μ}|H_a|χ_{a,ν}⟩)_{1⩽μ,ν⩽2N_b} and the overlap matrix S^χ_a = (⟨χ_{a,μ}|χ_{a,ν}⟩)_{1⩽μ,ν⩽2N_b}, the discretization of problem (8) in the AO basis set χ then reads as the generalized eigenvalue problem: find (λ^χ_{a,i}, C_{a,i}) ∈ R × R^{2N_b} such that

H^χ_a C_{a,i} = λ^χ_{a,i} S^χ_a C_{a,i},  C_{a,i}^T S^χ_a C_{a,j} = δ_{ij}.    (13)

The approximation φ^χ_{a,i} of φ_{a,i} in the AO basis set χ can then be recovered as the linear combination of atomic orbitals (LCAO)

φ^χ_{a,i} = Σ_{μ=1}^{2N_b} [C_{a,i}]_μ χ_{a,μ}.

One way to compare the LCAO ground-state 1-RDM to the reference FD solution P^FD_a is to simply evaluate the former at the grid points x_j, which gives rise to the matrix P^χ_a ∈ R^{N_g×N_g}_sym with entries

[P^χ_a]_{jj′} = Σ_{i=1,2} φ^χ_{a,i}(x_j) φ^χ_{a,i}(x_{j′}).

Due to numerical errors, the matrix P^χ_a is however not a rank-2 orthogonal projector. We therefore chose to follow a slightly different route (leading to very similar results). The finite-difference grid gives a reference discrete setting in which any quantity of interest, for any configuration and AO basis set, can be expressed. For all a's, the basis χ_a is represented by a matrix X_a ∈ R^{N_g×2N_b}. For any vectors Y_1, Y_2 ∈ R^{N_g}, the discrete A inner product simply reads δx Y_1^T A Y_2, where the notation A stands for both the continuous inner product and its finite-difference discretization matrix. We denote by ∥·∥_A the associated norm on R^{N_g}. Solutions to (13) are then obtained in the discrete setting by approximating the Hamiltonian and overlap matrices respectively by

H^X_a = δx X_a^T H^FD_a X_a  and  S(X_a) = δx X_a^T X_a,

from which we get the discrete approximations (λ^X_{a,i}, C^X_{a,i}). Let us gather the coefficients C^X_{a,i} into the matrix C_a ∈ R^{2N_b×2}. The ground-state density matrix in the basis χ_a is then approximated by

P^X_a = X_a C_a C_a^T X_a^T.
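The Galerkin discretization above can be sketched with one illustrative Gaussian orbital per "atom", translated to ±a, and matrix elements computed by quadrature on a grid; the potential is again a stand-in quartic double well, not the paper's V_a.

```python
import numpy as np
from scipy.linalg import eigh

# Illustrative LCAO discretization with one Gaussian per "atom".
a = 2.0
x = np.linspace(-20.0, 20.0, 4001)
dx = x[1] - x[0]
v = (x**2 - a**2) ** 2 / (4.0 * (1.0 + 2.0 * a**2))   # stand-in potential

# basis chi_a = {chi(.-a), chi(.+a)} with chi a width-one Gaussian
chi = np.stack([np.exp(-0.5 * (x - a) ** 2), np.exp(-0.5 * (x + a) ** 2)], axis=1)
dchi = np.gradient(chi, dx, axis=0)

S = dx * chi.T @ chi                                         # overlap matrix
H = dx * (0.5 * dchi.T @ dchi + chi.T @ (v[:, None] * chi))  # kinetic + potential

lam, C = eigh(H, S)        # generalized eigenvalue problem  H C = lam S C
E_lcao = lam[0] + lam[1]   # LCAO ground-state energy of the two-electron system
```

By the variational principle, E_lcao lies above the exact ground-state energy of this toy Hamiltonian.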

Overcompleteness of Hermite Basis Sets
Before getting into basis set optimization, we introduce the following standard Hermite basis set (HBS), constructed from eigenfunctions of the quantum harmonic oscillator. Those functions are solutions to the eigenvalue problem

−(1/2) h_n″(x) + (x²/2) h_n(x) = (n + 1/2) h_n(x),  h_n(x) = c_n p_n(x) e^{−x²/2},  n ∈ N,    (18)

where p_n is the Hermite polynomial of degree n (with the same parity as n) and c_n a normalization constant such that (h_n)_{n∈N} forms an orthonormal basis of L²(R). The h_n's are the analogues of the standard atomic orbitals obtained by solving atomic electronic structure problems. Let us first consider the AO basis set made of the first N_b Hermite functions, χ^HBS = {h_0, ..., h_{N_b−1}}. The overlap matrix for the configuration a then is of the form

S^{χHBS}_a = ( I_{N_b}  Σ_a ; Σ_a^T  I_{N_b} ),

where the matrix Σ_a corresponds to the overlap of functions that are localized at different atomic positions. It satisfies Σ_a ≃ 0 when a is large and Σ_a ≃ I_{N_b} when a is close to 0, therefore causing conditioning issues on the overlap matrix S^{χHBS}_a, a phenomenon known as overcompleteness: when a is too small, the basis functions centered at ±a are almost equal, hence almost linearly dependent. We illustrate this problem by plotting the condition number of the overlap matrix S^{χHBS}_a for different values of a in Figure 2; it indeed blows up for small values of a. This is a well-known issue, and several methods have been proposed in the literature to cure this phenomenon, such as the standard canonical orthonormalization procedure [19] or more recent work based on a Cholesky decomposition of the overlap matrix [17]. Such methods are however not directly related to the optimization procedure presented in this paper.
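The blow-up of the condition number can be reproduced with a few lines of code; the grid and the translation-by-rolling shortcut below are implementation conveniences, not the paper's setup.

```python
import numpy as np

# Condition number of the overlap matrix of a Hermite basis duplicated
# at +/-a, for shrinking a (overcompleteness illustration).
x = np.linspace(-15.0, 15.0, 3001)
dx = x[1] - x[0]

def hermite_functions(n_b):
    """First n_b L2-normalized Hermite functions h_n, built with the stable
    recursion h_{n+1} = sqrt(2/(n+1)) x h_n - sqrt(n/(n+1)) h_{n-1}."""
    h = [np.pi ** (-0.25) * np.exp(-0.5 * x**2)]
    if n_b > 1:
        h.append(np.sqrt(2.0) * x * h[0])
    for n in range(1, n_b - 1):
        h.append(np.sqrt(2.0 / (n + 1)) * x * h[n] - np.sqrt(n / (n + 1)) * h[n - 1])
    return np.stack(h, axis=1)

h = hermite_functions(4)
conds = []
for a in (2.0, 1.0, 0.5, 0.25):
    shift = int(round(a / dx))   # translate by rolling the grid (tails are ~0)
    basis = np.concatenate([np.roll(h, shift, axis=0),    # copy at +a
                            np.roll(h, -shift, axis=0)],  # copy at -a
                           axis=1)
    S = dx * basis.T @ basis
    conds.append(np.linalg.cond(S))
```

As a shrinks, the two copies become nearly linearly dependent and the condition number grows sharply.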

Practical computation of the criteria J_A and J_E
The rest of this section is dedicated to the rewriting and computation of the criteria J_A and J_E for our 1D model in the discrete setting.

Reference orthonormal basis
In order to avoid potential numerical stability issues, each of the N_b atomic orbitals χ_μ is decomposed on a given truncated orthonormal basis of L²(R) of size N, with N_b ≪ N ≪ N_g. We choose here the Hermite basis introduced in (18). Hence, the matrix X_a ∈ R^{N_g×2N_b} is written as

X_a = B_a I_R,  with  I_R = ( R  0 ; 0  R ) ∈ R^{2N×2N_b},

where B_a ∈ R^{N_g×2N} contains the values at the grid points of the N Hermite functions translated to +a and to −a, and R ∈ R^{N×N_b} gathers the coefficients of the atomic orbitals χ_μ in the truncated HBS orthonormal basis. Note that we have duplicated R in I_R as we consider the same basis at each position ±a, but everything that follows can easily be adapted to the case where we would like to optimize the bases at each position separately (to deal with heteronuclear molecular systems for instance). We moreover impose that R^T R = I_{N_b}, so that the overlap matrix of X_a, denoted by S(X_a), has the same form as in Section 3.3, that is

S(X_a) = ( I_{N_b}  Σ_a ; Σ_a^T  I_{N_b} ),

where Σ_a is the overlap between functions localized at +a and functions localized at −a. To avoid any issues arising from the conditioning of S(X_a), the minimal sampled distance a_min should not be taken too small.
In the following, we detail the computation of each of the two criteria using the matrix R as the main variable. We will subsequently optimize the criteria J_A and J_E with respect to R to obtain optimal AO basis sets. In order to ease the reading of the following computations, every vector of R^{N_g} is rescaled by a factor √δx, so that for any given Y_1, Y_2 ∈ R^{N_g} the discrete A inner product simply reads Y_1^T A Y_2. The same holds for overlap matrices: with this convention, S(X_a) = X_a^T X_a. The output of the optimization is then scaled back to its original normalization by a factor 1/√δx.

Criterion J A
Let a ∈ R_+ be fixed, and let S_A(Y) = Y^T A Y denote the overlap matrix for the A inner product of any rectangular matrix Y ∈ R^{N_g×d}. Since the columns of X_a [S_A(X_a)]^{−1/2} are orthonormal for the A inner product, the discrete A-orthogonal projector on the range of X_a takes the simple form

Π^A_{X_a} = X_a [S_A(X_a)]^{−1} X_a^T A.

Hence, using the cyclicity of the trace and definitions (3), (12) and (22), one has

j_A(R, a) = Tr(A P^FD_a) − Tr([S_A(X_a)]^{−1} I_R^T M^offline_A(a) I_R),

where we have collected in the last expression all matrices independent of R into the matrix

M^offline_A(a) = B_a^T A P^FD_a A B_a.

Then, using the probability measure P in (10), we get

J_A(R) = Σ_{n=1}^{N_c} β_n j_A(R, a_n),

and the optimization problem finally writes, with unknown R ∈ R^{N×N_b} and for a given inner product A:

Find R_opt ∈ argmin { J_A(R) : R ∈ R^{N×N_b}, R^T R = I_{N_b} }.    (24)

Criterion J E
In this paragraph, we denote by C_2 := { C ∈ R^{2N_b×2} : C^T S(X_a) C = I_2 } the discrete counterpart of the Grassmann manifold G_2, and write E^R_a (resp. H^R_a) instead of E^χ_a (resp. H^χ_a), so that the dependence on the matrix R appears explicitly. Equation (7) reads in the discrete setting

E^R_a = Tr(P^R_a H^R_a) = min { Tr(C^T H^R_a C) : C ∈ C_2 },  with  H^R_a = I_R^T M^offline_E(a) I_R,    (25)

where, as in the previous case, all matrices independent of R have been gathered in the matrix

M^offline_E(a) = B_a^T H^FD_a B_a,

and the matrix C^R_a solving this minimization problem is given in practice by C^R_a = (u_{a,1} | u_{a,2}), where u_{a,1} and u_{a,2} are orthonormal eigenvectors associated with the lowest two eigenvalues of the generalized eigenvalue problem H^R_a u = λ S(X_a) u. From (10) and (25), one can compute

J_E(R) = Σ_{n=1}^{N_c} β_n |E^FD_{a_n} − E^R_{a_n}|²,

and the optimization problem reads:

Find R_opt ∈ argmin { J_E(R) : R ∈ R^{N×N_b}, R^T R = I_{N_b} }.    (28)

Numerical results

Numerical setting and first results
Problems (24) and (28) are solved by direct minimization algorithms over the Stiefel manifold [1]

M := { R ∈ R^{N×N_b} : R^T R = I_{N_b} }.

The explicit computation of the gradients of J_A and J_E with respect to R is detailed in the Appendix. We used an L-BFGS algorithm (with tolerance 10⁻⁷ on the norm of the projected gradient), as implemented in the Optim.jl package [22] in the Julia language [4]. As initial guess, we picked the first N_b Hermite functions introduced in Section 3.3.
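For illustration, here is a minimal first-order analogue of this constrained minimization: plain gradient descent on the Stiefel manifold with tangent-space projection and QR retraction, applied to a toy quadratic criterion. The paper itself uses L-BFGS via Optim.jl on the criteria J_A and J_E; everything below is a simplified stand-in.

```python
import numpy as np

# Gradient descent over the Stiefel manifold {R : R^T R = I}, with
# tangent-space projection and QR retraction; toy quadratic criterion.
rng = np.random.default_rng(1)
N, N_b = 10, 3
W = rng.standard_normal((N, N))
M = W @ W.T                          # SPD matrix defining the toy criterion

def J(R):
    return -np.trace(R.T @ M @ R)    # minimized when R spans the top eigenspace

R, _ = np.linalg.qr(rng.standard_normal((N, N_b)))   # feasible starting point
J0 = J(R)
for _ in range(500):
    G = -2.0 * M @ R                                 # Euclidean gradient of J
    sym = 0.5 * (R.T @ G + G.T @ R)
    xi = G - R @ sym                                 # project on tangent space at R
    R, _ = np.linalg.qr(R - 0.005 * xi)              # retract back onto the manifold
```

The QR retraction keeps the iterates exactly orthonormal, mirroring the constraint R^T R = I_{N_b} imposed in the paper.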
In this subsection, we choose a probability distribution P supported in the interval I = [1.5, 5], so as to retain the physics of interest, which takes place around the equilibrium configuration a_0 ≃ 1.925 and all the way to dissociation. In particular, a_min = 1.5 is taken sufficiently large to avoid the conditioning issues on the overlap matrices described in Section 3.
The quantities M^offline_A(a_n) and M^offline_E(a_n) are computed offline beforehand. We will discuss this choice and consider other probability measures P in Sections 4.2.2 and 4.2.3.
The finite-difference grid is a uniform grid on the interval [−20, 20] discretized into N_g = 1999 points (δx = 0.02). Finally, we decompose the basis functions to be optimized in the HBS {h_n}_{0⩽n⩽N−1} of L²(R) of size N = 10. Regarding the choice of the inner product for the first criterion J_A, we used the standard L²(R) and H¹(R) inner products, and denote by J_{L²} and J_{H¹} the corresponding criteria. This translates at the discrete level into choosing A = I_{N_g} for J_{L²} and A = I_{N_g} − Δ for J_{H¹}, where Δ is the 3-point finite-difference discretization matrix of the 1D Laplace operator. Once obtained, the optimal bases are used to solve the variational problem (15) on a much finer sampling of I, and their accuracy is compared to that of the HBS. The code performing the simulations and plotting the results is available online. For the sake of clarity in the plots, E_a (resp. ρ_a) denotes the GS energy (resp. the density) in the configuration a with a given basis (specified by the context), and E^FD_a (resp. ρ^FD_a) stands for the reference energy (resp. density) on the finite-difference grid. Note that we write HBS for the (non-optimized) Hermite basis set, and L²-OBS, H¹-OBS, or E-OBS for the basis sets optimized with respect to the criterion J_{L²}, J_{H¹}, or J_E. Figure 3 displays the dissociation curve and the energy difference on the interval I for different values of N_b, the size of the AO basis set. For N_b = 1, i.e. only one basis function at ±a, criterion J_E shows better performance than criterion J_A, regardless of the choice of the norm used to perform the projections. It also very closely matches the accuracy of the standard HBS. When N_b becomes larger, however, the different criteria behave in a similar fashion, and we observe that they approach the dissociation curve better than the Hermite basis. Comparing the values of criterion J_E for all bases, which directly measures the distance to the dissociation curve, we see in Table 1 that all optimized bases give an increased accuracy of roughly four orders of magnitude over the interval I for N_b = 4.
In Figure 4, we plot the density for a given value of a and the error on the density in different norms, for varying values of N_b. The error is plotted with respect to three different distances: the L¹-norm of the error on the density, which corresponds to the L²-norm on eigenvectors; the H¹-norm of the error on the density, as it is common to compute the forces ∫_R ρ ∇_a V_a with good estimates on the H⁻¹-norm of ∇_a V_a (see e.g. [8]); and the distance between the square roots of the densities (recall that the von Weizsäcker kinetic energy reads (1/2) ∫_R |∇√ρ|²). We observe similar behaviors for these different distances. For N_b = 1, both bases obtained with the first criterion behave slightly better than the standard Hermite basis and the basis computed with the second criterion. For N_b = 3, we observe again that all optimal bases yield better accuracy than the Hermite basis. Table 1 confirms that, for a given criterion, the basis optimized for that criterion indeed performs better than the other bases. As for the dissociation curves, we read from the values of J_{L²} and J_{H¹} that the optimized bases yield similar results for large N_b, all of them giving lower values than the HBS. Note that the optimal bases for the criteria J_{L²} and J_{H¹} give similar results for any number of basis functions N_b, so that the L² and H¹ norm optimizations seem equivalent.
In terms of computational time, first note that criterion J_{H¹} is always more expensive to compute than J_{L²}, as it requires additional matrix-vector products with the matrix A, which has a noticeable impact on the computational time. Second, criterion J_E requires less off-line data, as it only needs the reference eigenvalues, while criterion J_A requires the reference GS eigenvectors (or density matrices). In addition, the use of orthonormality constraints, as detailed in the Appendix, allows one to compute the gradient of J_E at very low cost. As a result, criterion J_E is more than twice as fast to minimize as criterion J_{L²} in our implementation.
Finally, for the sake of completeness, we plot in Figure 5 the basis functions built with each criterion for different values of N_b, confirming the previous observation that the optimal basis functions are quite close to the standard Hermite basis functions.
The main conclusion of these observations is that, for N_b large enough, there is no real difference between the proposed criteria. Still, even if the bases we built do not seem to be very different from the standard Hermite basis (Figure 5), building optimal bases increases the accuracy on the quantities of interest we focused on by one order of magnitude on average.

Value of J_{L²} for the different basis sets.
Value of J_{H¹} for the different basis sets.

        N_b = 1   N_b = 2   N_b = 3   N_b = 4
HBS     −7.40829  −7.70051  −7.74312  −7.77138
L²-OBS  −7.43954  −7.76479  −7.77725  −7.77773
H¹-OBS  −7.43928  −7.76466  −7.77724  −7.77772
E-OBS   −7.39410  −7.76425  −7.77720  −7.77772

In Section 4.1, we used the first N_b Hermite functions as a starting point for the optimization procedures. We obtain the same solutions if we start from a random matrix R on the Stiefel manifold, in the sense that the optimal values reached for each criterion are the same, as well as the error plots. However, the L-BFGS algorithm requires more iterations to converge. The basis functions obtained from the optimization algorithms are different from those observed in Figure 5, but still span the same space, as the variational solutions are equal.

Extrapolating the parameter space I
In Section 4.1, we chose a probability measure P supported in the interval [1.5, 5] in order to avoid conditioning issues. Indeed, taking smaller values of a results in convergence problems for the L-BFGS algorithm as N_b increases. This phenomenon was already observed for N_b = 3 or N_b = 4 when including a = 1 in the support of P. In practice, this problem can be solved by preconditioning, or by removing overcompleteness through a pre-processing of the basis χ_a (e.g. selecting a smaller basis by filtering out the very small singular values of the original overlap matrix), but for brevity we will not elaborate further in this direction.
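The singular-value filtering just mentioned is essentially canonical orthogonalization; here is a minimal sketch (the threshold, grid, and basis are illustrative, not the paper's data).

```python
import numpy as np

# Sketch of canonical orthogonalization: drop overlap-matrix directions
# with very small eigenvalues to tame overcompleteness.
def canonical_orthogonalization(X, dx, tol=1e-8):
    """Return a well-conditioned basis spanning (almost) the same space."""
    S = dx * X.T @ X                 # overlap matrix of the columns of X
    s, U = np.linalg.eigh(S)
    keep = s > tol * s.max()         # filter out near-null directions
    return X @ (U[:, keep] / np.sqrt(s[keep]))   # columns are orthonormal

x = np.linspace(-10.0, 10.0, 2001)
dx = x[1] - x[0]
a = 0.1   # nearly coincident centres: strong near-linear dependence
X = np.stack([np.exp(-0.5 * (x - a) ** 2), np.exp(-0.5 * (x + a) ** 2)], axis=1)
Y = canonical_orthogonalization(X, dx)
S_new = dx * Y.T @ Y                 # identity up to round-off
```

The transformed basis has a perfectly conditioned overlap matrix while spanning (essentially) the same discretization space.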
However, once we have computed optimal bases for a reasonable interval I, it is possible to use these bases to solve the variational problem (13) and extrapolate the energy and the density to smaller values of a that are not in I. The results are plotted in Figure 6. We notice that the quantities of interest are better approximated on I = [1.5, 5], but for smaller a's there is no further gain in accuracy with respect to the standard HBS.

Choice of the probability P
The major drawback of our AO basis optimization lies in the necessity to compute very accurate reference solutions for all configurations in the support of P. This is not an issue for our 1D toy model, but it can be very time-consuming for real systems if the support of P is too large. It is therefore crucial to reduce the support of P as much as possible.
In this section, we study the influence of the probability measure P on the quality of the optimized bases. For simplicity, we restrict ourselves to uniform samplings of the interval I = [1.5, 5]. Numerical tests show that increasing the sample size beyond the reference sampling with N_c = 10 points used in Section 4.1 (see Eq. (29)) brings no significant accuracy improvement. We therefore chose to investigate the performance of the optimal AO basis sets obtained with very sparse samplings. Figure 7 shows the approximation error on the dissociation curve and on the densities for three samplings: first, the two extreme points of the interval I = [1.5, 5]; second, two points around the equilibrium distance a_0 ≃ 1.925; third, a single point near the equilibrium distance. All curves are plotted for a fixed number of basis functions N_b = 3.
It appears that the last sampling already provides satisfactory accuracy. The criteria J_L2 and J_H1 are equal to −5 × 10^{−6} for the optimized bases, to be compared with −1.8 × 10^{−3} for the standard HBS. Hence they provide a gain in energy accuracy of three orders of magnitude over the whole dissociation curve.

Number of Hilbert basis functions
We now take the same setting as in Section 4.1, except that we set N = 5 instead of N = 10. This yields results similar to those collected in Table 1; see Table 2. However, the values of the criteria J_A and J_E are higher than for N = 10, in particular for N_b = 4, where the criterion J_A cannot be optimized below −10^{−5}; this makes sense since the space over which the optimization is performed is smaller. Calculations with N = 15 were also performed: for N_b = 1, 2, 3 the criteria are slightly improved, but for N_b = 4 convergence issues were observed, due to ill-conditioning of the overlap matrices for a = 1.5 as the number N of functions used to describe the optimal bases grows.

We now detail the computation of the two gradients of E_a(R) := E_a(R, C_a(R)), with C_a(R) defined in Section 3, namely ∇_R E_a and ∇_C E_a.

Computation of the first gradient ∇_R E_a. Using notation (32), we introduce

Figure 1. x → V_a(x) for a = 1 and a = 5.

Figure 2. Condition number of the overlap matrix S_{χ_a^HBS} of the HBS for different values of a, in log-log scale. The larger the basis set, the faster the condition number blows up for small values of a.
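The blow-up of the condition number for small a can be reproduced on a minimal example: for two normalized 1D Gaussians exp(−(x ∓ a)^2), the overlap has the closed form s(a) = exp(−2a^2), so the 2×2 overlap matrix [[1, s], [s, 1]] has condition number (1 + s)/(1 − s), which diverges as the two centers merge. This is only an illustrative sketch, not the actual HBS of the paper.

```python
import numpy as np

def cond_overlap(a):
    """Condition number of the overlap matrix of two normalized 1D
    Gaussians exp(-(x - a)**2) and exp(-(x + a)**2); their mutual
    overlap has the closed form s(a) = exp(-2 * a**2)."""
    s = np.exp(-2.0 * a**2)
    S = np.array([[1.0, s], [s, 1.0]])
    return np.linalg.cond(S)   # equals (1 + s) / (1 - s) for this S

# The overlap matrix becomes ill-conditioned as the two centers merge.
for a in (2.0, 1.0, 0.5, 0.1):
    print(f"a = {a:>4}: cond(S) = {cond_overlap(a):.3e}")
```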

More precisely, all the results in this subsection are obtained with the probability

Figure 3. Energies for the optimal bases obtained with the different criteria. (Top) Dissociation curve. (Bottom) Errors on the energy over the range of configurations I = [1.5, 5].

Figure 4. (Top) Densities for the optimal bases with the different criteria. (Middle) Errors on the density in different norms with N_b = 1. (Bottom) Errors on the density in different norms with N_b = 3.

Figure 6. Errors on the energy and on the density with extrapolation down to a = 0.5, for basis functions optimized on I = [1.5, 5].

Figure 7. Error plots for probability measures P corresponding to very sparse samplings of the interval I = [1.5, 5]: (i) the two endpoints of I; (ii) two points near the equilibrium distance; (iii) one point near the equilibrium distance. (Top row) Error on the energy. (Bottom row) Error on the density in L^1 norm. (Left) OBS for J_H1. (Middle) OBS for J_L2. (Right) OBS for J_E. The values of a in the legends are the sampled configurations.

\[
M_a := M^{\mathrm{offline}}_E(a)
\quad\text{and}\quad
\Sigma(H) := I_H^T M_a I_R
= \begin{pmatrix}
H^T M_a^{++} R & H^T M_a^{+-} R \\
H^T M_a^{-+} R & H^T M_a^{--} R
\end{pmatrix}
\in \mathbb{R}^{(2N_b)\times(2N_b)},
\]
so that, with P = CC^T,
\[
\mathrm{Tr}\,\big(P\,[dH_a(R)\cdot H]\big)
= \mathrm{Tr}\big(P\,[\Sigma(H)+\Sigma(H)^T]\big)
= 2\,\mathrm{Tr}\big(P\,\Sigma(H)\big)
= 2\,\mathrm{Tr}\Big(H^T\big[M_a^{++}RP^{++} + M_a^{-+}RP^{+-} + M_a^{+-}RP^{-+} + M_a^{--}RP^{--}\big]\Big).
\]
In the end,
\[
\nabla_R E_a(R,C) = 2\big(M_a^{++}R(CC^T)^{++} + M_a^{+-}R(CC^T)^{-+} + M_a^{-+}R(CC^T)^{+-} + M_a^{--}R(CC^T)^{--}\big) \in \mathbb{R}^{N\times N_b}.
\]
Computation of the second gradient ∇_C E_a. The Euler-Lagrange equation of the minimization problem (25) yields that there exists a symmetric matrix Λ_a(R) ∈ R^{2×2} such that
\[
\nabla_C E_a(R, C_a(R)) = 2H_a(R)\,C_a(R) = 2S(B_a I_R)\,C_a(R)\,\Lambda_a(R),
\]
where Λ_a(R) is in fact a diagonal matrix whose diagonal entries are the two lowest eigenvalues of H_a(R). Moreover, differentiating the constraint C_a(R)^T S(B_a I_R) C_a(R) = Id_2, we get
\[
C_a(R)^T S(B_a I_R)\,(dC_a(R)\cdot H) + (dC_a(R)\cdot H)^T S(B_a I_R)\,C_a(R)
= -\,C_a(R)^T \big(dS(B_a I_R)\cdot H\big)\,C_a(R),
\]
so that
\[
\nabla_C E_a(R, C_a(R)) \cdot (dC_a(R)\cdot H)
= 2\,\mathrm{Tr}\Big(\big(S(B_a I_R)C_a(R)\Lambda_a(R)\big)^T (dC_a(R)\cdot H)\Big)
= -\,\mathrm{Tr}\Big(\big(dS(B_a I_R)\cdot H\big)\,C_a(R)\Lambda_a(R)C_a(R)^T\Big).
\]
Now, let us recall that dS(B_a I_R)·H = I_H^T S(B_a) I_R + I_R^T S(B_a) I_H. Thus, denoting Q_a(R) = C_a(R)Λ_a(R)C_a(R)^T, we get
\[
\nabla_C E_a(R, C_a(R)) \cdot (dC_a(R)\cdot H)
= -2\,\mathrm{Tr}\Big(H^T\big[S(B_a)^{++}RQ_a(R)^{++} + S(B_a)^{+-}RQ_a(R)^{-+} + S(B_a)^{-+}RQ_a(R)^{+-} + S(B_a)^{--}RQ_a(R)^{--}\big]\Big),
\]
which ends the computation of the second gradient.

Final gradient. Compiling the computations of the two previous paragraphs, we obtain
\[
\nabla_R E_a(R) = 2\big(M_a^{++}RP_a(R)^{++} + M_a^{+-}RP_a(R)^{-+} + M_a^{-+}RP_a(R)^{+-} + M_a^{--}RP_a(R)^{--}\big)
- 2\big(S_a^{++}RQ_a(R)^{++} + S_a^{+-}RQ_a(R)^{-+} + S_a^{-+}RQ_a(R)^{+-} + S_a^{--}RQ_a(R)^{--}\big),
\tag{37}
\]
where P_a(R) = C_a(R)C_a(R)^T, M_a = M^{offline}_A(a), S_a = S(B_a) and Q_a(R) = C_a(R)Λ_a(R)C_a(R)^T, and the gradient of J_E is computed with (35).
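Gradient formulas of this type are easy to get wrong, and a standard sanity check is to compare them against finite differences. The sketch below does this on a simplified analogue of E_a: the sum of the two lowest eigenvalues of a symmetric matrix H(t) = H0 + t·H1, whose derivative is Tr(P dH/dt) with P = CC^T built from the two lowest eigenvectors, i.e. the same Hellmann-Feynman structure as in the derivation above, with the overlap matrix fixed to the identity. All matrices here are random stand-ins, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
H0 = rng.standard_normal((6, 6)); H0 = H0 + H0.T  # symmetrize
H1 = rng.standard_normal((6, 6)); H1 = H1 + H1.T

def energy(t):
    """Analogue of E_a(R): sum of the two lowest eigenvalues of H(t)."""
    return np.linalg.eigvalsh(H0 + t * H1)[:2].sum()

def grad(t):
    """Analytic derivative Tr(P dH/dt) with P = C C^T, where the columns
    of C are the two lowest eigenvectors (Hellmann-Feynman)."""
    _, U = np.linalg.eigh(H0 + t * H1)
    C = U[:, :2]
    return np.trace(C.T @ H1 @ C)

# Central finite differences should match the analytic gradient.
h = 1e-5
fd = (energy(0.3 + h) - energy(0.3 - h)) / (2.0 * h)
print(abs(fd - grad(0.3)))
```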

Table 1. (Top & Middle) Values of the different criteria for the HBS and the optimal bases, for increasing values of N_b. (Bottom) Number of L-BFGS iterations required for each criterion to reach convergence at the requested tolerance (10^{−7} on the ℓ^2-norm of the gradient).
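The iteration counts in the bottom of Table 1 correspond to a gradient-norm stopping criterion. A minimal sketch of such an optimization loop is given below, with SciPy's L-BFGS-B standing in for the actual optimizer and the Rosenbrock function standing in for the criteria J_A and J_E; note that SciPy's `gtol` tests the ∞-norm of the projected gradient rather than the ℓ^2-norm quoted above.

```python
import numpy as np
from scipy.optimize import minimize

def J(x):
    """Rosenbrock function, a stand-in for the criterion being minimized."""
    return (1.0 - x[0])**2 + 100.0 * (x[1] - x[0]**2)**2

def grad_J(x):
    """Analytic gradient of J, supplied so L-BFGS avoids finite differences."""
    return np.array([
        -2.0 * (1.0 - x[0]) - 400.0 * x[0] * (x[1] - x[0]**2),
        200.0 * (x[1] - x[0]**2),
    ])

# Stop when the (inf-norm of the) gradient drops below 1e-7.
res = minimize(J, x0=np.zeros(2), jac=grad_J, method="L-BFGS-B",
               options={"gtol": 1e-7, "maxiter": 500})
print(res.nit, res.x)  # iteration count and minimizer (close to [1, 1])
```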