SURVEY OF SEQUENTIAL CONVEX PROGRAMMING AND GENERALIZED GAUSS-NEWTON METHODS

Abstract. We provide an overview of a class of iterative convex approximation methods for nonlinear optimization problems with convex-over-nonlinear substructure. These problems are characterized by outer convexities on the one hand, and nonlinear, generally nonconvex, but differentiable functions on the other hand. All methods from this class use only first order derivatives of the nonlinear functions and sequentially solve convex optimization problems. All of them are different generalizations of the classical Gauss-Newton (GN) method. We focus on the smooth constrained case and on three methods to address it: Sequential Convex Programming (SCP), Sequential Convex Quadratic Programming (SCQP), and Sequential Quadratically Constrained Quadratic Programming (SQCQP). While the first two methods were previously known, the last is newly proposed and investigated in this paper. We show under mild assumptions that SCP, SCQP and SQCQP have exactly the same local linear convergence (or divergence) rate. We then discuss the special case in which the solution is fully determined by the active constraints, and show that for this case the KKT conditions are sufficient for local optimality and that SCP, SCQP and SQCQP even converge quadratically. In the context of parameter estimation with symmetric convex loss functions, the possible divergence of the methods can in fact be an advantage that helps them to avoid some undesirable local minima: generalizing existing results, we show that the presented methods converge to a local minimum if and only if this local minimum is stable against a mirroring operation applied to the measurement data of the estimation problem. All results are illustrated by numerical experiments on a tutorial example.


Introduction
Throughout this paper we consider nonlinear optimization problems of the form

$$\min_{w \in \mathbb{R}^n} \; \phi_0(F_0(w)) \quad \text{s.t.} \quad F_i(w) \in \Omega_i, \; i = 1, \dots, q, \quad g(w) = 0, \tag{1}$$

with nonlinear functions $\phi_0 : \mathbb{R}^{m_0} \to \mathbb{R}$, $v \mapsto \phi_0(v)$, $F_i : \mathbb{R}^n \to \mathbb{R}^{m_i}$ and $g : \mathbb{R}^n \to \mathbb{R}^p$. Critically, the function $\phi_0(v)$ and the sets $\Omega_i$ are assumed to be convex. The problem is thus characterized by "convex-over-nonlinear" substructures $\phi_0(F_0(w))$ and $F_i(w) \in \Omega_i$. We call $\phi_0$ and the $\Omega_i$ "outer convexities" and the $F_i$ "inner nonlinearities". A special, but still quite general, case is when the sets $\Omega_i$ can be described by smooth convex functions $\phi_i : \mathbb{R}^{m_i} \to \mathbb{R}$, i.e., $\Omega_i = \{ v \in \mathbb{R}^{m_i} \mid \phi_i(v) \le 0 \}$ for $i = 1, \dots, q$. The inequality constraints can then be expressed as $\phi_i(F_i(w)) \le 0$. We define $f_i(w) := \phi_i(F_i(w))$ as shorthand for this composition, for $i = 0, \dots, q$.

[Figure 1: overview of the presented methods. An arrow $M_1 \to M_2$ means that method $M_2$ is a special case of method $M_1$. SSDP is only a special case of SCP if the SSDP variant with a zero Hessian is used.]

In this paper, we give an overview of several methods that aim to exploit this convex-over-nonlinear structure. They do so by sequentially solving a convex approximation to (1), in which the nonlinear functions $F_i$ and $g$ have been linearized. The methods differ in the way they handle the outer convexities. In the case of nonlinear least-squares, that is with $\phi_0(v) = \frac{1}{2}\|v\|_2^2$ and no constraints, all of these methods simplify to the classical Gauss-Newton method (GN). Thus, all of the presented approaches can be seen as generalizations of Gauss-Newton, though only one of these methods will be called the Generalized Gauss-Newton method (GGN) in the following. To avoid confusion it is important to be aware that so far there has been no generally used naming convention, and the same name may refer to different methods. For example, GGN in [33] refers to a generalization of the Gauss-Newton method to general smooth outer convexities $\phi_0$, whereas GGN in [5] means the generalization to constrained nonlinear least-squares problems. In this paper we keep the name GGN for the method in [33] and call the method in [5] the Constrained Gauss-Newton method (CGN).

Outline
In this paper we distinguish between three classes of methods, each of them addressing a special case of the general problem (1): smooth unconstrained NLP, smooth constrained NLP and constrained problems with non-differentiable convex structure. Figure 1 gives an overview of the presented methods and the problem class which they address. The most central method is Sequential Convex Programming (SCP), which can address problems from all three classes. It obtains a convex approximation to (1) by linearizing the equality constraints $g(w)$ and the inner nonlinearities $F_i(w)$, but keeping the outer convexities $\phi_0(v)$ and $\Omega_i$. In its most general form, it can be stated as

$$w_{k+1} \in \arg\min_{w \in \mathbb{R}^n} \; \phi_0(F_0^{\mathrm{lin}}(w; w_k)) \quad \text{s.t.} \quad F_i^{\mathrm{lin}}(w; w_k) \in \Omega_i, \; i = 1, \dots, q, \quad g^{\mathrm{lin}}(w; w_k) = 0. \tag{2}$$

Besides SCP, the first class, i.e., methods for smooth unconstrained NLP, contains the original Gauss-Newton method (GN) by Gauss [17] and the Generalized Gauss-Newton method (GGN) as introduced by Schraudolph [33]. Section 2 gives a detailed introduction to the algorithms from this class.

In Section 3, we focus on methods for smooth constrained NLP, which is the second problem class. Probably the oldest method in this sector is Sequential Linear Programming (SLP), which Griffith and Stewart introduced in 1961 under the name "method of approximation programming" [19]. This sector also contains the classical Constrained Gauss-Newton method (CGN) proposed by Bock [5] for nonlinear least-squares problems with nonlinear constraints. We present several methods that can be seen as generalizations of the CGN method: we first present the smooth constrained variant of SCP, as well as Sequential Convex Quadratic Programming (SCQP) [40], and Sequential Quadratically Constrained Quadratic Programming (SQCQP), a novel method which can be seen as an intermediary between SCP and SCQP. These three methods share some intriguing properties and are the main focus of this paper. To complete the picture, we also introduce the Constrained Generalized Gauss-Newton (CGGN) method, a trivial combination of the CGN and GGN methods. We provide a local convergence analysis of three methods (SCP, SCQP, SQCQP) in Section 4. We show under mild assumptions that all of them share the same asymptotic linear contraction rate, which is exactly determined. This analysis trivially extends to the methods for smooth unconstrained NLP. Furthermore, we analyze a special case in which the methods converge quadratically. In Section 5, we examine the interesting phenomenon of "mirroring" and desirable divergence, as first described by Bock et al. in the context of $L_2$ and $L_1$ estimation [3,4]. We extend it to estimation problems with general smooth convex negative log-likelihood functions.

Algorithms of the third sector, i.e., constrained optimization problems with non-differentiable convex structures, are mentioned here for completeness, but are not discussed any further in the following. Besides SCP, this category comprises one variant of the sequential semidefinite programming (SSDP) algorithm proposed and analyzed by Fares et al. [12]. While different, possibly indefinite Hessian approximations were discussed in the original SSDP paper, only the variant with a zero Hessian falls into the class discussed in this paper. In our notation, this SSDP variant corresponds to the SCP algorithm in Eq. (2) for the case in which there is no convex structure in the objective, $\phi_0(v) = v$, and there is only one structured convex constraint, i.e., $q = 1$, with the corresponding feasible set given by the positive-semidefinite cone, $\Omega_1 = \mathbb{S}_+$ of appropriate dimension. The SSDP algorithm was also used and investigated under the name SSP in [15], and a special SSDP variant exploiting convex-concave decompositions of nonconvex bilinear matrix inequalities was proposed in [36]. For an overview of nonlinear SDP algorithms like SSDP and its variants we refer to the overview paper by Yamashita and Yabe [43]. Another relevant subclass in the non-differentiable sector is sequential second order cone programming (SSOCP), which can be regarded as a special case of SSDP.
All of the methods discussed here formulate and solve convex subproblems in each iteration, and in particular use only positive semidefinite Hessian approximations. While this is desirable for both theoretical and practical reasons, there is a price to pay for convexity of the subproblems: it generally limits the convergence speed to only linear, even if arbitrary positive semidefinite (but bounded) Hessian approximations are used, as shown in [10] for a simple example problem with a nonconvex objective.

The discussed methods can be seen as Newton-type methods if classified according to [14] or [21]. For example, the methods solving QP subproblems (GN, GGN, CGN, CGGN, SCQP) are straightforward to recognize as variants of Sequential Quadratic Programming (SQP) using a specific positive semidefinite Hessian approximation, whereas SCP can be interpreted as a perturbed version of the Josephy-Newton method, cf. [21,37]. As there are Newton-type methods which converge superlinearly or even quadratically, this raises the question of what advantages are gained by paying the price of the restriction to a linear convergence rate. The first is the convexity of the subproblems. This makes the subproblems well behaved in the sense that, while their solutions are not necessarily unique, the solution set is convex, so distinct solutions are at least not isolated from each other. Further, this means that convex solvers can be utilized, as well as all of the appropriate theory and results from the field of convex optimization [6]. Second, the subproblems are constructed mainly from first-order derivatives, so that almost no second-order derivatives, which tend to be expensive, need to be computed. This is especially attractive when problems have to be solved in real time, for example in feedback control, or in the context of large scale optimization problems, where the cost of computing second-order derivatives might be prohibitively high, such as in the training of deep neural networks. Third, most, though not all, of the methods are multiplier-free in the sense that the construction of the subproblems depends only on the primal variables of the current iterate. Therefore the dual variables are not part of the memory.

Notation and preliminaries
We denote by $\frac{\partial f}{\partial w}(w) \in \mathbb{R}^{m \times n}$ the Jacobian of a function $f : \mathbb{R}^n \to \mathbb{R}^m$, $w \mapsto f(w)$, and use the convention that the gradient symbol denotes its transpose, $\nabla f(w) = \frac{\partial f}{\partial w}(w)^\top$. Thus, for scalar $f$, the gradient is a column vector. The gradient operator $\nabla$ always refers to the first argument of a function, unless the argument is explicitly written as subscript. For the Jacobian of $F_i(w)$ we use the shortcut $J_i(w)$. If $f(\cdot)$ is scalar valued, its Hessian is denoted by $\nabla^2 f(w)$. Linearizations are referred to as $f^{\mathrm{lin}}(w; \bar w) := f(\bar w) + \nabla f(\bar w)^\top (w - \bar w)$. The notation $f_S(w)$, for a set of indices $S \subset \{0, \dots, q\}$, means the vertical concatenation of the $f_i(w)$, $i \in S$. Similarly, for a vector $\mu \in \mathbb{R}^{q+1}$, $\mu_S$ is the vector slice of the corresponding indices. Slightly less conventional, $\phi_S(F_S(w))$ denotes the concatenation of the corresponding $\phi_i(F_i(w))$. The cardinality of set $S$ is $|S|$. For a more lightweight notation, the vertical concatenation $[x^\top, y^\top]^\top$ of two vectors $x \in \mathbb{R}^n$, $y \in \mathbb{R}^m$ is denoted by $(x, y)$. $\|\cdot\|$ denotes a general vector norm. More specifically, $\|\cdot\|_2$ is the Euclidean norm and $\|\cdot\|_\infty$ the maximum norm. For two vectors $x, y \in \mathbb{R}^n$ we use $x \ge y$ and similar to denote elementwise inequality. For two symmetric matrices $A, B \in \mathbb{R}^{n \times n}$, the matrix inequality $A \succeq B$ means $A - B \succeq 0$, i.e., that $A - B$ is positive semidefinite, and accordingly for similar inequalities. The identity matrix in $\mathbb{R}^{n \times n}$ is denoted by $I_n$, and the subscript $n$ may be dropped if the dimension is clear from the context. The set of non-negative reals is $\mathbb{R}_+$. Concepts from the field of numerical optimization, such as strict complementarity, the linear independence constraint qualification (LICQ) and the Karush-Kuhn-Tucker (KKT) conditions, are defined as in textbooks such as [30] unless otherwise stated. For several important results of this paper we require the functions defining problem (1) to be smooth. Technically, requiring the $F_i(w)$ and $g(w)$ to be twice, and the $\phi_i(v)$ three times, continuously differentiable would be sufficient for all results. Since this difference is often irrelevant in practice, and for a simpler text flow, we use "smooth" as the requirement, but by this we actually mean the weaker requirements just stated.

Illustrative Example
We now introduce an example that will accompany us throughout this article. In this example we use the pseudo Huber loss function $\varphi_\delta$, parameterized by $\delta \in \mathbb{R}_+$. For $v$ close to 0, it approximates a least-squares loss function, whereas for $v$ far from 0, it has linear behavior similar to the $L_1$-norm. The size of the quadratic region is controlled by the Huber parameter $\delta$: the larger $\delta$, the larger the quadratic region. In the limit case of $\delta \to 0$, the pseudo Huber loss approaches the $L_1$-norm. A visualization of this behavior is given in Figure 2 for a single component, $N = 1$. The example problems are implemented via the Python interface of CasADi [1], using the solver Gurobi [20] for second order cone programs (SOCP), and qpOASES for quadratic programs (QP) [13].

We have noisy measurements $\eta_i$ of this output, obtained at times $x_i$, but they are associated with some unknown time delay $w$, i.e., $x_i = t_i - w$. Our aim is to identify this true time delay $w$ from $N$ input-output pairs $(x_i, \eta_i)$, $i = 1, \dots, N$, such that we obtain an estimate of the true time $t_i = x_i + w$ at which the output $\psi(t_i)$ occurred. We model the measurements as $\eta_i = \psi(x_i + w) + \nu_i$, where $\nu_i$ is unknown noise. The $\eta_i$ are collected in $\eta \in \mathbb{R}^N$ and the model predictions in $M(w)$, $M : \mathbb{R} \to \mathbb{R}^N$, with $M_i(w) := \psi(x_i + w)$. If we choose the pseudo Huber loss (3) as penalty of the model-measurement mismatch, we obtain

$$\min_{w \in \mathbb{R}} \; \varphi_\delta(\eta - M(w)) \tag{5}$$

as our identification problem. This has convex-over-nonlinear structure, with outer convexity $\phi_0(v) = \varphi_\delta(v)$ and inner nonlinearity $F_0(w) = \eta - M(w)$. For the purpose of a clean demonstration of the concepts presented in this paper, we assume $N = 3$ with the data given as $x = (-0.5, 0, 0.5)$ and $\eta = (0, 0, 1)$. Unless otherwise stated, the Huber parameter is chosen as $\delta = 0.1$.
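The limiting behavior of the loss described above can be checked numerically. The sketch below assumes the pseudo Huber loss has the form $\sum_i (\sqrt{\delta^2 + v_i^2} - \delta)$; this specific formula is an assumption chosen to match the stated limits (approximately quadratic near zero, $L_1$-like far from zero and for $\delta \to 0$), and the function name `pseudo_huber` is hypothetical.

```python
import math

def pseudo_huber(v, delta):
    """Assumed form of the loss: sum_i sqrt(delta^2 + v_i^2) - delta."""
    return sum(math.sqrt(delta**2 + vi**2) - delta for vi in v)

delta = 0.1  # Huber parameter used in the example

# Small residual: behaves like the scaled least-squares loss v^2 / (2 * delta).
small = pseudo_huber([1e-3], delta)

# Large residual: behaves like the L1-norm |v|, shifted by delta.
large = pseudo_huber([10.0], delta)

# Shrinking delta moves the loss towards the L1-norm of the residual vector.
near_l1 = pseudo_huber([0.3, -0.7], 1e-8)
```

Under this assumed form, the quadratic region has a width of roughly $\delta$, which is consistent with the role of the Huber parameter described in the text.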

Methods for smooth unconstrained NLP
Let us first regard only the unconstrained case,

$$\min_{w \in \mathbb{R}^n} \; \phi_0(F_0(w)) \tag{6}$$

with $\phi_0(v)$ and $F_0(w)$ smooth, and introduce two convexity exploiting methods, the unconstrained version of SCP and the Generalized Gauss-Newton method (GGN). Both of them are generalizations of the classical Gauss-Newton method to a general smooth outer convexity $\phi_0(v)$. At the end of the section we state a theorem that exactly characterizes their linear local convergence rate.

Sequential Convex Programming
Sequential Convex Programming (SCP) obtains a convex approximation of (6) by linearizing the inner nonlinearity $F_0(w)$ at the current iterate $w_k$, but keeping the full outer convexity $\phi_0(v)$. It thus iterates as

$$w_{k+1} \in \arg\min_{w \in \mathbb{R}^n} \; \underbrace{\phi_0(F_0^{\mathrm{lin}}(w; w_k))}_{=: f_0^{\mathrm{SCP}}(w; w_k)}, \tag{7}$$

solving a convex, but generally nonlinear, subproblem at every iteration. In case the minimizer of (7) is not uniquely defined, one has to take extra measures, such as picking the minimum norm solution, in order to obtain a well defined algorithm. If we apply SCP to the nonlinear least-squares problem, i.e., if $\phi_0(v) = \frac{1}{2}\|v\|_2^2$, we recover the classical Gauss-Newton method, as in this case $f_0^{\mathrm{SCP}}(w; w_k) = \frac{1}{2}\|F_0^{\mathrm{lin}}(w; w_k)\|_2^2$.
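To make the relation between SCP and Gauss-Newton concrete, the following minimal sketch performs one SCP step for a scalar decision variable by minimizing the convex subproblem with a bisection on its (monotone) slope, and compares the result to the closed-form Gauss-Newton step for the least-squares loss. The model `sin(x + w)` is a hypothetical stand-in and not the model of the paper; the function names and the bisection approach are likewise illustrative assumptions. The sketch assumes the subproblem has a finite minimizer, so that the bracket search terminates.

```python
import math

X = [-0.5, 0.0, 0.5]      # measurement times from the example
ETA = [0.0, 0.0, 1.0]     # measurements from the example

def F(w):                 # residual eta - M(w) with a hypothetical model
    return [e - math.sin(x + w) for x, e in zip(X, ETA)]

def J(w):                 # derivative of the residual w.r.t. scalar w
    return [-math.cos(x + w) for x in X]

def scp_step(w_k, dphi, lo=-1.0, hi=1.0, iters=200):
    """One SCP step: minimize phi(F(w_k) + J(w_k) * d) over the step d."""
    Fk, Jk = F(w_k), J(w_k)

    def slope(d):         # derivative of the convex subproblem w.r.t. d
        g = dphi([f + j * d for f, j in zip(Fk, Jk)])
        return sum(gi * ji for gi, ji in zip(g, Jk))

    while slope(lo) > 0.0:    # grow the bracket until the slope changes sign
        lo *= 2.0
    while slope(hi) < 0.0:
        hi *= 2.0
    for _ in range(iters):    # bisection on the monotone slope
        mid = 0.5 * (lo + hi)
        if slope(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return w_k + 0.5 * (lo + hi)

w0 = 0.2
# Least-squares outer convexity: gradient of 0.5 * ||v||^2 is v itself.
w_scp = scp_step(w0, dphi=lambda v: v)
# Closed-form Gauss-Newton step for comparison.
Fk, Jk = F(w0), J(w0)
w_gn = w0 - sum(j * f for j, f in zip(Jk, Fk)) / sum(j * j for j in Jk)
```

Passing the gradient of a different convex loss as `dphi` yields the SCP step for that loss, whereas the closed-form expression is specific to least squares.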

Generalized Gauss-Newton
Generalized Gauss-Newton (GGN) iterates by solving a convex quadratic approximation of (6) at every iteration [33]. As in the classical Gauss-Newton method, this approximation is obtained by splitting the Hessian of the objective $f_0(w)$ into two terms, one of them positive semidefinite, and the other generally indefinite. The Hessian approximation is then obtained by neglecting the indefinite term. For the objective function $f_0(w)$ this split is given by

$$\nabla^2 f_0(w) = \underbrace{J_0(w)^\top \nabla^2 \phi_0(F_0(w)) \, J_0(w)}_{=: B_{\mathrm{GGN}}(w)} + \underbrace{\sum_{i=1}^{m_0} \frac{\partial \phi_0}{\partial v_i}(F_0(w)) \, \nabla^2 F_{0,i}(w)}_{=: E_{\mathrm{GGN}}(w)}, \tag{8}$$

where $\nabla^2 F_{0,i}(w)$ denotes the Hessian of the $i$-th component of $F_0(w)$. The first term $B_{\mathrm{GGN}}(w)$ is the Generalized Gauss-Newton Hessian approximation and contains the curvature of the outer convexity $\phi_0(v)$. Due to convexity of $\phi_0$, we have $B_{\mathrm{GGN}}(w) \succeq 0$ for all $w$. Note that apart from $\nabla^2 \phi_0(v)$ one only needs first-order derivatives to evaluate $B_{\mathrm{GGN}}(w)$. The "error matrix" $E_{\mathrm{GGN}}(w)$ comprises the components we neglect by choosing $B_{\mathrm{GGN}}(w)$ as our Hessian approximation. In particular, it contains the curvature of the inner nonlinearity $F_0(w)$. This error is small if the neglected curvature $\nabla^2 F_{0,i}(w)$ or the derivatives $\frac{\partial \phi_0}{\partial v_i}(F_0(w))$ are small. The second condition in particular includes the case that the residuals $F_0(w)$ are close to a minimizer of $\phi_0(v)$ (if it exists). Intuitively, the smaller the approximation error, and therefore the closer $B_{\mathrm{GGN}}(w)$ is to the true Hessian, the better the convergence behavior we would expect. We will exactly quantify this intuition in Theorem 2.2.

The GGN iterations are defined via the unconstrained Quadratic Program (QP)

$$w_{k+1} \in \arg\min_{w \in \mathbb{R}^n} \; \underbrace{f_0(w_k) + \nabla f_0(w_k)^\top (w - w_k) + \tfrac{1}{2} (w - w_k)^\top B_{\mathrm{GGN}}(w_k) (w - w_k)}_{=: f_0^{\mathrm{GGN}}(w; w_k)}. \tag{9}$$

Since $B_{\mathrm{GGN}}(w) \succeq 0$ for all $w$, this QP is always convex. As for SCP, in case the solution to (9) is not unique, we need a procedure for picking exactly one solution, such that $w_{k+1}$ is uniquely defined. A definition equivalent to (9) is obtained by noting that (9) in effect amounts to solving the linear system

$$B_{\mathrm{GGN}}(w_k) \, (w_{k+1} - w_k) = -\nabla f_0(w_k) \tag{10}$$

at every iteration. This is, in general, considerably faster than solving a full nonlinear optimization problem at every iteration, as is necessary in SCP. If the SCP subproblem is solved with a Newton-type algorithm, this in fact means several iterations with the structure of (10) per iteration of SCP. For the special case of nonlinear least-squares, i.e., $\phi_0(v) = \frac{1}{2}\|v\|_2^2$, we have $\nabla^2 \phi_0(v) = I$. We therefore obtain $B_{\mathrm{GGN}}(w) = J(w)^\top J(w)$. This is the well-known Gauss-Newton Hessian approximation.

Example 2.1. Recall the guiding Example 1.1. We compute the SCP and GGN approximations to the objective function, i.e., $f_0^{\mathrm{SCP}}(w; \bar w)$ resp. $f_0^{\mathrm{GGN}}(w; \bar w)$ as defined in (7) and (9). We do so for two distinct linearization points, $\bar w_1 = 0.1$ and $\bar w_2 = 0.3$. The results are illustrated in Figure 3. It can be seen that SCP manages a closer approximation to the objective function $f_0(w)$, since it keeps the information about the specific shape of the outer convexity $\phi_0(v)$. GGN on the other hand always approximates the objective as a quadratic function.
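As a minimal sketch of a single GGN step, the code below assembles the Hessian approximation $J^\top \nabla^2\phi_0 \, J$ and solves the resulting linear system with NumPy. The inner function `F`, its Jacobian `jac`, and the pseudo Huber derivatives (based on an assumed loss of the form $\sum_i \sqrt{\delta^2 + v_i^2} - \delta$) are hypothetical illustrations and not taken from the paper.

```python
import numpy as np

DELTA = 0.1  # Huber parameter, value as in the example of the text

def grad_pseudo_huber(v):
    # Gradient of the assumed loss sum_i sqrt(DELTA^2 + v_i^2) - DELTA.
    return v / np.sqrt(DELTA**2 + v**2)

def hess_pseudo_huber(v):
    # Hessian of the assumed loss: diagonal and positive definite.
    return np.diag(DELTA**2 / (DELTA**2 + v**2) ** 1.5)

def F(w):    # hypothetical inner nonlinearity R^2 -> R^3
    return np.array([w[0] ** 2 - 1.0, w[0] * w[1], w[1] - 2.0])

def jac(w):  # Jacobian of the hypothetical F
    return np.array([[2.0 * w[0], 0.0], [w[1], w[0]], [0.0, 1.0]])

def ggn_step(w, grad_phi, hess_phi):
    """Return the next iterate, the GGN Hessian B and the gradient g."""
    Fk, Jk = F(w), jac(w)
    B = Jk.T @ hess_phi(Fk) @ Jk      # only first derivatives of F needed
    g = Jk.T @ grad_phi(Fk)           # gradient of phi(F(w))
    return w + np.linalg.solve(B, -g), B, g

w0 = np.array([1.5, 1.0])
w1, B, g = ggn_step(w0, grad_pseudo_huber, hess_pseudo_huber)

# Least-squares special case: the outer Hessian is the identity.
_, B_ls, _ = ggn_step(w0, lambda v: v, lambda v: np.eye(v.size))
```

The call to `np.linalg.solve` presumes that `B` is nonsingular at the current iterate; where it is only positive semidefinite, a rule for selecting one solution would be needed, as noted in the text.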

Local Convergence Analysis
We already state here a theorem on the linear local convergence rate of unconstrained SCP and GGN. This is actually a special case of Theorem 4.5, which will be proven later for the smooth constrained case. We therefore refrain from giving the proof here.

Figure 3. Illustration of the SCP and GGN approximations to the nonlinear objective $f_0(w)$ for two values of $\bar w$. GGN approximates $f_0(w)$ only quadratically, whereas SCP is able to match the characteristic shape of the outer convexity $\phi_0(v)$. We can also see that for $\bar w_2$, which is close to a local minimum, this difference does not seem to be too important, since locally both methods provide a good approximation of the true objective.

Theorem 2.2 (Linear local convergence of SCP and GGN [11]). Regard a local minimizer $w^*$ of $f_0$ that satisfies $\nabla f_0(w^*) = 0$ and $B_{\mathrm{GGN}}(w^*) \succ 0$. Then $w^*$ is a fixed point for both the SCP and GGN iterations, the iterates of both methods are well-defined in a neighborhood of $w^*$, and the local linear contraction (or divergence) rates of SCP and GGN are equal to each other and given by the smallest $\alpha \ge 0$ that satisfies the linear matrix inequalities (LMI)

$$-\alpha B_{\mathrm{GGN}}(w^*) \preceq E_{\mathrm{GGN}}(w^*) \preceq \alpha B_{\mathrm{GGN}}(w^*). \tag{11}$$

As a consequence, a sufficient condition for Q-linear local convergence with contraction rate $\alpha < 1$ is that the LMI (11) holds for this $\alpha$. Also, a necessary condition for local convergence is given by $B_{\mathrm{GGN}}(w^*) \succeq \frac{1}{2} \nabla^2 f_0(w^*)$.

Example 2.3. We return to our example problem defined in (5). Since $w \in \mathbb{R}$, the LMI in (11) simplify to scalar inequalities. We can thus explicitly compute the smallest $\alpha$ satisfying (11) as

$$\alpha(w) = \frac{|E_{\mathrm{GGN}}(w)|}{B_{\mathrm{GGN}}(w)}.$$

The interpretation of $\alpha(w^*)$ as linear local convergence rate is valid only for a local minimizer $w^*$, but it is still interesting to visualize $\alpha(w)$ for any $w \in \mathbb{R}$. In Figure 4, the objective function $f_0(w)$ as well as $\alpha(w)$ are illustrated for the example problem. For the local minimum at $w^*_{\mathrm{good}} \approx 0.1$, which is actually the global minimum, we compute the theoretical contraction rate $\alpha(w^*_{\mathrm{good}}) \approx 0.02$. Furthermore, we hope to awaken the reader's curiosity by pointing to an interesting observation: there is also a second, worse, local minimum at $w^*_{\mathrm{bad}} \approx 3.7$, for which $\alpha(w^*_{\mathrm{bad}}) \approx 3200 \gg 1$ holds. This means that SCP and GGN would strongly diverge from this undesirable local minimum. This is actually not a coincidence, and later in this paper we dedicate a full section to this behavior.
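The link between the error-to-Hessian ratio and the observed contraction can be illustrated on a small least-squares toy problem. For scalar $w$ and $B > 0$, an LMI of the form $-\alpha B \le E \le \alpha B$ reduces to $\alpha = |E| / B$; the sketch below assumes this form. The residual $F(w) = (w, \, c + w^2)$ is a hypothetical example (not the problem of the paper) with a stationary point at $w^* = 0$, where $B = 1$ and $E = 2c$, so the predicted rate is $2|c|$.

```python
def predicted_rate(c):
    """alpha = |E| / B at w* = 0 for the residual F(w) = (w, c + w**2)."""
    w = 0.0
    F = [w, c + w**2]
    dF = [1.0, 2.0 * w]        # first derivatives of F
    d2F = [0.0, 2.0]           # second derivatives of F
    B = sum(j * j for j in dF)                  # Gauss-Newton Hessian J^T J
    E = sum(f * h for f, h in zip(F, d2F))      # neglected curvature term
    return abs(E) / B

def gn_step(w, c):
    """One Gauss-Newton step for 0.5 * ||F(w)||^2."""
    F = [w, c + w**2]
    dF = [1.0, 2.0 * w]
    return w - sum(j * f for j, f in zip(dF, F)) / sum(j * j for j in dF)

w_start = 1e-3  # start close to the stationary point w* = 0

# c = 0.25: predicted rate 0.5, so the iteration contracts.
observed_conv = abs(gn_step(w_start, 0.25) / w_start)

# c = 1.0: w* = 0 is still a local minimum (f'' = 1 + 2c > 0), but the
# predicted rate is 2, so Gauss-Newton moves away from it.
observed_div = abs(gn_step(w_start, 1.0) / w_start)
```

In this toy problem the second case shows a local minimizer that repels the iteration, which is the kind of divergence the example in the text points to.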

Methods for smooth constrained NLP
We will now return to the constrained case and consider methods that can be applied to NLP of the form

$$\min_{w \in \mathbb{R}^n} \; \phi_0(F_0(w)) \quad \text{s.t.} \quad \phi_i(F_i(w)) \le 0, \; i = 1, \dots, q, \quad g(w) = 0, \tag{14}$$

composed of only smooth functions, and with $\phi_i(F_i(w)) =: f_i(w)$, $i = 0, \dots, q$. We define its Lagrangian as

$$\mathcal{L}(w, \mu, \lambda) := f_0(w) + \sum_{i=1}^{q} \mu_i f_i(w) + \lambda^\top g(w), \tag{15}$$

with Lagrange multipliers (or dual variables) $\mu \in \mathbb{R}^q$ and $\lambda \in \mathbb{R}^p$. In this section, we discuss four methods that exploit the convex substructure of (14), namely the smooth constrained version of SCP, Constrained Generalized Gauss-Newton (CGGN), Sequential Convex Quadratic Programming (SCQP) and Sequential Quadratically Constrained Quadratic Programming (SQCQP).

Sequential Convex Programming
In a straightforward generalization of the unconstrained case, SCP approximates (14) by linearizing the inner nonlinearities $F_i(w)$, while keeping the outer convexities $\phi_i(v)$. In every iteration, SCP thus solves a nonlinear but convex optimization problem of the following form:

$$w_{k+1} \in \arg\min_{w \in \mathbb{R}^n} \; \phi_0(F_0^{\mathrm{lin}}(w; w_k)) \quad \text{s.t.} \quad \phi_i(F_i^{\mathrm{lin}}(w; w_k)) \le 0, \; i = 1, \dots, q, \quad g^{\mathrm{lin}}(w; w_k) = 0. \tag{16}$$

In accordance with the definition of $f_0^{\mathrm{SCP}}(w; w_k)$, we introduce $f_i^{\mathrm{SCP}}(w; w_k) := \phi_i(F_i^{\mathrm{lin}}(w; w_k))$ for $i = 0, \dots, q$ as shorthand for the approximation functions. We point out that, though every solution to (16) is also associated with Lagrange multipliers $\mu_{k+1} \in \mathbb{R}^q$ and $\lambda_{k+1} \in \mathbb{R}^p$, the iterations only depend on the primal variable $w_k$. In other words, SCP is a multiplier-free method. For constrained nonlinear least-squares with no (or ignored) convex substructure in the constraints, i.e., $\phi_0(v) = \frac{1}{2}\|v\|_2^2$ and $\phi_i(v) = v$ for $i = 1, \dots, q$, SCP recovers the constrained Gauss-Newton (CGN) method [5], in which the objective function is approximated as $\frac{1}{2}\|F_0^{\mathrm{lin}}(w; \bar w)\|_2^2$ and the constraints are completely linearized. Another special case occurs if additionally there is no convex substructure in the objective, $\phi_0(v) = v$. Then the objective function is also completely linearized and SCP simplifies to Sequential Linear Programming (SLP) [19].

Constrained Generalized Gauss-Newton
As for the unconstrained case, SCP has to solve a generally nonlinear program at every iteration, which can be quite expensive. We therefore turn our attention to methods that approximate (14) by cheaper subproblems, similar to GGN. Unlike SCP, the generalization of GGN to constrained NLP can go in several directions, none of which can be considered the most straightforward. Maybe the simplest idea is to use the GGN Hessian approximation $B_{\mathrm{GGN}}(\bar w)$, as defined in (8), for a quadratic approximation of the objective function and to linearize the constraints in order to obtain a constrained QP. This yields a method that we call the Constrained Generalized Gauss-Newton method (CGGN), with the subproblem defined as

$$w_{k+1} \in \arg\min_{w \in \mathbb{R}^n} \; f_0^{\mathrm{GGN}}(w; w_k) \quad \text{s.t.} \quad f_i^{\mathrm{lin}}(w; w_k) \le 0, \; i = 1, \dots, q, \quad g^{\mathrm{lin}}(w; w_k) = 0.$$

To the authors' knowledge, CGGN has never been explicitly discussed before. Given the straightforwardness of the approach, this seems surprising. It might be due to the fact that GGN originates from the neural network community, which usually deals with unconstrained optimization problems [26,33]. Note that CGGN only uses information about the curvature of the objective function in its approximation, as $B_{\mathrm{GGN}}(\bar w)$ only depends on $f_0(w)$. Possible convex curvature information of the inequality constraints is thrown away. This method is therefore mainly interesting for problems without exploitable convex substructure in the constraints, i.e., when $\phi_i(v) = v$ for $i \ge 1$. In the case of nonlinear least-squares, $\phi_0(v) = \frac{1}{2}\|v\|_2^2$, the CGGN method simplifies to CGN [5]. CGGN can therefore be seen both as a generalization of GGN to constrained NLP and as a generalization of CGN to general smooth convexities in the objective. When there is no convexity in the objective function, $\phi_0(v) = v$, the Hessian approximation is $B_{\mathrm{GGN}}(\bar w) = 0$, and CGGN recovers SLP [19].
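For the special case without inequality constraints, a QP subproblem of this kind reduces to a single linear KKT system. The sketch below solves that system with NumPy for given data; it deliberately ignores inequality constraints, which would require a proper QP solver, and all names and numerical values are hypothetical illustrations rather than data from the paper.

```python
import numpy as np

def eq_constrained_qp_step(B, grad, G, c):
    """Solve min_d grad^T d + 0.5 d^T B d  s.t.  c + G d = 0.

    B    : (n, n) positive semidefinite Hessian approximation
    grad : (n,)   gradient of the objective at the current iterate
    G    : (p, n) Jacobian of the equality constraints
    c    : (p,)   value of the equality constraints at the current iterate
    Returns the step d and the multiplier estimate lam.
    """
    n, p = B.shape[0], G.shape[0]
    K = np.block([[B, G.T], [G, np.zeros((p, p))]])   # KKT matrix
    sol = np.linalg.solve(K, -np.concatenate([grad, c]))
    return sol[:n], sol[n:]

# Hypothetical data for one iteration.
B = np.array([[2.0, 0.0], [0.0, 1.0]])
grad = np.array([1.0, -1.0])
G = np.array([[1.0, 1.0]])
c = np.array([0.5])

d, lam = eq_constrained_qp_step(B, grad, G, c)
```

The linear solve presumes a nonsingular KKT matrix, which holds for example when `G` has full row rank and `B` is positive definite on the null space of `G`.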

Sequential Convex Quadratic Programming
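We will now try to improve upon CGGN by exploiting our knowledge of the convex-over-nonlinear structure in the constraints, while still only solving a QP at every iteration. This leads us to Sequential Convex Quadratic Programming (SCQP), which was introduced by Verschueren et al. in [40]. Consider the Hessian of the Lagrangian (15) of the original problem, which is given by

$$\nabla_w^2 \mathcal{L}(w, \mu, \lambda) = \underbrace{J_0(w)^\top \nabla^2 \phi_0(F_0(w)) J_0(w) + \sum_{i=1}^{q} \mu_i \, J_i(w)^\top \nabla^2 \phi_i(F_i(w)) J_i(w)}_{=: B_{\mathrm{SCQP}}(w, \mu)} + \underbrace{\sum_{j=1}^{m_0} \frac{\partial \phi_0}{\partial v_j}(F_0(w)) \nabla^2 F_{0,j}(w) + \sum_{i=1}^{q} \mu_i \sum_{j=1}^{m_i} \frac{\partial \phi_i}{\partial v_j}(F_i(w)) \nabla^2 F_{i,j}(w) + \sum_{l=1}^{p} \lambda_l \nabla^2 g_l(w)}_{=: E_{\mathrm{SCQP}}(w, \mu, \lambda)},$$

where $F_{i,j}$ denotes the $j$-th entry of $F_i$. In the spirit of Gauss-Newton, we split the true Hessian into the SCQP Hessian approximation $B_{\mathrm{SCQP}}(w, \mu)$, containing the curvature of all outer convexities $\phi_i(v)$, and the SCQP Hessian approximation error $E_{\mathrm{SCQP}}(w, \mu, \lambda)$, which contains the curvature of the inner nonlinearities $F_i(w)$ and the equality constraints $g(w)$. We introduce the shorthands

$$B_i(w) := J_i(w)^\top \nabla^2 \phi_i(F_i(w)) \, J_i(w), \quad i = 0, \dots, q, \tag{19}$$

such that $B_{\mathrm{SCQP}}(w, \mu) = B_0(w) + \sum_{i=1}^{q} \mu_i B_i(w)$, and see that $B_0(w)$ is exactly the GGN Hessian as defined in (8), i.e., $B_0(w) = B_{\mathrm{GGN}}(w)$. An immediate consequence is the following proposition.

Proposition 3.1. Assume $(w^*, \lambda^*, \mu^*)$ is a KKT point of (14). Then it holds that $B_{\mathrm{SCQP}}(w^*, \mu^*) \succeq B_{\mathrm{GGN}}(w^*)$.

Proof. At a KKT point we have $\mu^* \ge 0$, and $B_i(w^*) \succeq 0$ for all $i$ by convexity of the $\phi_i$. Hence $B_{\mathrm{SCQP}}(w^*, \mu^*) - B_{\mathrm{GGN}}(w^*) = \sum_{i=1}^{q} \mu_i^* B_i(w^*) \succeq 0$.

We can now use $B_{\mathrm{SCQP}}(w, \mu)$ as Hessian in a QP approximation to (14), and define the SCQP subproblem as

$$w_{k+1} \in \arg\min_{w \in \mathbb{R}^n} \; \underbrace{f_0(w_k) + \nabla f_0(w_k)^\top (w - w_k) + \tfrac{1}{2} (w - w_k)^\top B_{\mathrm{SCQP}}(w_k, \mu_k) (w - w_k)}_{=: f_0^{\mathrm{SCQP}}(w; w_k, \mu_k)} \quad \text{s.t.} \quad f_i^{\mathrm{lin}}(w; w_k) \le 0, \; i = 1, \dots, q, \quad g^{\mathrm{lin}}(w; w_k) = 0, \tag{21}$$

with $f_0^{\mathrm{SCQP}}$ the shorthand for the objective function. Here, we point to a subtle, but important, difference that should be kept in mind: all other convex approximation schemes introduced so far, and also all that will follow, only depend on the current primal iterate $w_k$ as linearization point. To construct $B_{\mathrm{SCQP}}(w_k, \mu_k)$ on the other hand, we also need the dual iterate $\mu_k$. The full SCQP iteration scheme thus needs to extract $\mu_{k+1}$ from the multipliers satisfying the KKT conditions of (21). An immediate consequence is that all components of $\mu_{k+1}$ are nonnegative. It follows that, similar to Proposition 3.1, we have $B_{\mathrm{SCQP}}(w_{k+1}, \mu_{k+1}) \succeq 0$ for all $k \ge 0$. Therefore we only need to pick an initialization $\mu_0 \ge 0$ to ensure convexity of (21) at all iterations. If $\phi_i(v) = v$ for all $i \ge 1$, we recover CGGN, since then $\nabla^2 \phi_i(v) = 0$ for all $i \ge 1$.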
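The role of the multipliers in the SCQP Hessian can be illustrated numerically. The sketch below assembles a Hessian of the form $B_0 + \sum_i \mu_i B_i$ from outer-curvature blocks $J_i^\top \nabla^2\phi_i \, J_i$ and checks positive semidefiniteness for nonnegative multipliers. All matrices and the helper names are hypothetical illustrations.

```python
import numpy as np

def outer_curvature(J, H):
    """Block J^T H J; positive semidefinite whenever H is."""
    return J.T @ H @ J

def scqp_hessian(B0, B_constraints, mu):
    """Assemble B0 + sum_i mu_i * B_i for given multipliers mu."""
    B = B0.copy()
    for mu_i, B_i in zip(mu, B_constraints):
        B = B + mu_i * B_i
    return B

# Hypothetical Jacobians and outer Hessians at some iterate.
J0 = np.array([[1.0, 2.0], [0.0, 1.0], [1.0, 0.0]])
H0 = np.diag([1.0, 0.5, 2.0])
J1 = np.array([[1.0, -1.0]])
H1 = np.array([[3.0]])

B0 = outer_curvature(J0, H0)      # plays the role of the GGN Hessian
B1 = outer_curvature(J1, H1)      # curvature of one convex constraint
mu = np.array([0.7])              # nonnegative multiplier

B_scqp = scqp_hessian(B0, [B1], mu)
gap_eigs = np.linalg.eigvalsh(B_scqp - B0)   # eigenvalues of mu_1 * B1
```

With a negative entry in `mu`, the same assembly could produce an indefinite matrix, which is why a nonnegative initialization of the multipliers matters for convexity of the QP.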

Sequential Quadratically Constrained Quadratic Programming
A different approach to exploit the known curvature information would be to approximate both the objective function and the inequality constraints by convex quadratic functions. This leads to the Sequential Quadratically Constrained Quadratic Programming method (SQCQP). To our knowledge this method, using generalized Gauss-Newton Hessian approximations to ensure convexity of the subproblems, has not yet been presented in the literature. An exact-Hessian variant has been proposed and discussed in [16]. SQCQP solves a Quadratically Constrained Quadratic Program (QCQP) at every iteration:

$$w_{k+1} \in \arg\min_{w \in \mathbb{R}^n} \; f_0(w_k) + \nabla f_0(w_k)^\top (w - w_k) + \tfrac{1}{2} (w - w_k)^\top B_0(w_k) (w - w_k) \quad \text{s.t.} \quad f_i(w_k) + \nabla f_i(w_k)^\top (w - w_k) + \tfrac{1}{2} (w - w_k)^\top B_i(w_k) (w - w_k) \le 0, \; i = 1, \dots, q, \quad g^{\mathrm{lin}}(w; w_k) = 0, \tag{22}$$

with the Hessian approximations $B_i(w)$ as defined in (19). Compared to SCQP, this method should lead to a slightly closer approximation of the original NLP, as the constraints are approximated quadratically instead of linearized. On the other hand, each iteration is in general more expensive, as a QCQP has to be solved instead of a QP. Compared to SCP, the QCQP approximation is generally worse, while each iteration of SQCQP is possibly cheaper. We might say that SQCQP is in between SCP and SCQP. As for SCQP, for the case that $\phi_i(v) = v$ for $i = 1, \dots, q$, we get $\nabla^2 \phi_i(v) = 0$ for all $i \ge 1$ and therefore recover the CGGN method.

Example 3.2. We now illustrate the global convergence behavior of SCP, SCQP and SQCQP. To this end, we revisit Example 1.1. We reformulate the parameter estimation problem (5) as the constrained problem (23), in which we introduce slack variables $s \in \mathbb{R}^N$ and subsume the model-measurement residual in $F_i(w) = \eta_i - M_i(w)$. Note that the positivity constraint on $s$ is not strictly necessary, as it is implicitly enforced by the first constraint. For the SCQP and SQCQP approximations to (23) though, this is not the case, and including the constraint significantly improves their global convergence behavior. Note that the SCP subproblem is a second order cone program (SOCP) and is consequently solved as such. We initialize all three methods at 1000 different values of $w_0$, linearly spaced between $-1.1$ and $1.5$. The slack variables are initialized at zero and, in the case of SCQP, all multipliers at one. The resulting number of iterations is shown in Figure 5. As heuristically predicted, SCP shows the best global convergence behavior, both in terms of the number of iterations and the fraction of initializations for which it converges. Furthermore, it has the steadiest behavior, in the sense that the number of iterations does not jump around wildly for small changes in $w_0$. SCQP shows the worst behavior for these three measures, whereas SQCQP is in between the other two methods. Nonetheless we point out that the take-away message of this example is not that SCP should always be the method of choice. This would ignore the fact that the iterations of SCQP are significantly cheaper than those of SCP. A globalized SCQP method might outperform full step SCP on the considered properties.
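The three ways of approximating a convex-over-nonlinear constraint (linearization, Gauss-Newton-type quadratic model, and SCP model) can be compared on a scalar toy constraint. The sketch below uses the hypothetical constraint $\phi(F(w)) \le 0$ with $\phi(v) = v^2 - 0.25$ and $F(w) = \sin(w)$, neither of which is taken from the paper. Because this $\phi$ is itself quadratic, the quadratic model and the SCP model coincide here; for a general convex $\phi$ they would differ.

```python
import math

def constraint_models(w_bar):
    """Return the linear, quadratic and SCP models of phi(F(w)) at w_bar."""
    Fk, Jk = math.sin(w_bar), math.cos(w_bar)   # F and its derivative
    phi = lambda v: v * v - 0.25                # convex outer function
    dphi, d2phi = 2.0 * Fk, 2.0                 # phi' and phi'' at F(w_bar)

    def f_lin(w):     # full linearization of the composition
        return phi(Fk) + dphi * Jk * (w - w_bar)

    def f_quad(w):    # linearization plus the J * phi'' * J curvature term
        return f_lin(w) + 0.5 * Jk * d2phi * Jk * (w - w_bar) ** 2

    def f_scp(w):     # outer convexity applied to the linearized inner map
        return phi(Fk + Jk * (w - w_bar))

    return f_lin, f_quad, f_scp

f_lin, f_quad, f_scp = constraint_models(0.4)
grid = [0.4 + 0.1 * k for k in range(-10, 11)]
lin_vals = [f_lin(w) for w in grid]
quad_vals = [f_quad(w) for w in grid]
scp_vals = [f_scp(w) for w in grid]
```

Since the added curvature term is nonnegative, the quadratic model never lies below the linearization, so the feasible set of the quadratic model is contained in that of the linearized constraint.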

Local Convergence Analysis
We will now investigate the local convergence behavior of the introduced methods. The most general are SCP, SCQP and SQCQP, so we will focus our attention on them. All other methods, namely GGN and CGGN, and of course also GN and CGN, are special cases in the absence of convex structure in the constraints. Therefore the results trivially extend to them as well. For unconstrained SCP and GGN, the results on linear convergence have already been obtained in [11]. Tran-Dinh et al. proved linear convergence of general SCP methods under mild assumptions, but without a tight characterization of the rate [35]. For SCQP a tight characterization has been obtained in [40], and for constrained SCP in [28], while SQCQP has not been investigated before.

We start by establishing stationarity of the methods at a solution to (14). Collecting both primal and dual variables in $z_k = (w_k, \mu_k, \lambda_k)$, the subproblems of each method, i.e., (16), (21), resp. (22), define iteration maps

$$z_{k+1} = z^{\mathrm{sol}}_{\mathrm{SCP}}(z_k), \qquad z_{k+1} = z^{\mathrm{sol}}_{\mathrm{SCQP}}(z_k), \qquad z_{k+1} = z^{\mathrm{sol}}_{\mathrm{SQCQP}}(z_k). \tag{24}$$

We assume uniqueness of the next iterate has been assured by taking the solution closest to $z_k$. More specifically, if a subproblem has a set of solutions $Z$, we define $z_{k+1} := \arg\min_{z \in Z} \|z - z_k\|$. Note that in order to obtain a simpler notation, the iteration maps have been defined as taking the full $z_k$ as input. This hides the fact that $z^{\mathrm{sol}}_{\mathrm{SCP}}(z_k)$ and $z^{\mathrm{sol}}_{\mathrm{SQCQP}}(z_k)$ only depend on $w_k$, since they are multiplier-free, and that $z^{\mathrm{sol}}_{\mathrm{SCQP}}(z_k)$ additionally depends on $\mu_k$, but none of the methods depends on $\lambda_k$.

We denote the set of KKT points of (14) as $Z^*$. The active set $A(z) \subseteq \{1, \dots, q\}$ contains the indices of all inequality constraints which are active at $z$. For a specific $z^* \in Z^*$, we use the shorthand $A^* := A(z^*)$, with corresponding multipliers $\mu_{A^*}$. Accordingly, we have the inactive set $I(z)$ with $I^* := I(z^*)$ and $\mu_{I^*}$.

Lemma 4.1. Let $z^*$ be a feasible point of (14) at which LICQ holds. Then $z^*$ is a fixed point of all three iteration maps in (24) if and only if $z^* \in Z^*$. Furthermore, strict complementarity of the SCP, SCQP and SQCQP subproblems (16), (21) and (22) holds at $z^* \in Z^*$ if and only if it also holds for (14) at $z^*$. More specifically, the active set $A^*$ and the corresponding multipliers are identical.

Proof. To increase readability we only state the proof for SCP. The proofs for SCQP and SQCQP are completely analogous. Assume $\bar z$ is a fixed point of $z^{\mathrm{sol}}_{\mathrm{SCP}}$, i.e., $z^{\mathrm{sol}}_{\mathrm{SCP}}(\bar z) = \bar z$. This means that $\bar z$ solves the SCP subproblem (16) and, due to LICQ, is also a KKT point. Then, by substituting $z_{k+1} = z_k = \bar z$ into the KKT conditions of (16), which we know to hold in this case, they collapse to the KKT conditions of (14), which therefore also hold. It follows that $\bar z \in Z^*$. Vice versa, assume that $z^* \in Z^*$. Writing down the KKT conditions of (16), we can see that they hold at $z_{k+1} = z_k = z^*$. Therefore, due to convexity, $z^*$ solves (16) at $z^*$. Denote by $Z$ the set of all solutions to (16) at $z^*$. We know that $z^* \in Z$. From $z_{k+1} = \arg\min_{z \in Z} \|z - z^*\| = z^*$ it follows that $z^*$ is stationary w.r.t. the SCP iteration defined in (24). For the statement on strict complementarity we point again to the fact that the KKT conditions of (14) and (16) are identical at $z_{k+1} = z_k = z^* \in Z^*$.
In order to prepare the statement and proof of the main theorem of this paper, Theorem 4.5, we start by pointing out that for a KKT point $z^*$ the inactive constraints have no influence on its position (apart from necessitating $\mu_{\mathcal{I}^*} = 0$). The active inequality constraints, on the other hand, can be treated exactly like equality constraints. In fact, if the active set $\mathcal{A}(w^*)$ of a local minimizer $w^*$ is known, finding $w^*$ simplifies to the solution of an equality constrained problem. Furthermore, if the active set is stable, i.e., if it does not change close to $w^*$, then the inactive inequality constraints will also not influence the local convergence behavior, and for the purpose of local convergence analysis we can completely disregard them. With this in mind, we need to set up some notation and intermediate results.

We consider a local minimizer $w^*$ of (14) with active set $\mathcal{A}^*$ and define the corresponding extended equality constraints $G(w)$, which stack the active inequality constraints and the equality constraints, with Jacobian $\Gamma(w)$. At the linearization point $\bar{w}$, the Jacobians of the corresponding constraints of all three subproblems coincide with $\Gamma(\bar{w})$, see (26). This is not surprising, given that at the linearization point all three approximations match the original NLP up to first order. We also regard the Hessians of the Lagrangians of the subproblems (16), (21) resp. (22) at $z$, which are

$$B_{\mathrm{SCP}}(w, \mu; \bar{w}) := J_0(\bar{w})^\top \nabla^2 \phi_0\big(F_0^{\mathrm{lin}}(w; \bar{w})\big) J_0(\bar{w}) + \sum_{i=1}^{q} \mu_i \, J_i(\bar{w})^\top \nabla^2 \phi_i\big(F_i^{\mathrm{lin}}(w; \bar{w})\big) J_i(\bar{w}), \tag{27}$$

$$B_{\mathrm{SCQP}}(\bar{w}, \bar{\mu}) := J_0(\bar{w})^\top \nabla^2 \phi_0\big(F_0(\bar{w})\big) J_0(\bar{w}) + \sum_{i=1}^{q} \bar{\mu}_i \, J_i(\bar{w})^\top \nabla^2 \phi_i\big(F_i(\bar{w})\big) J_i(\bar{w}), \tag{28}$$

$$B_{\mathrm{SQCQP}}(\mu; \bar{w}) := J_0(\bar{w})^\top \nabla^2 \phi_0\big(F_0(\bar{w})\big) J_0(\bar{w}) + \sum_{i=1}^{q} \mu_i \, J_i(\bar{w})^\top \nabla^2 \phi_i\big(F_i(\bar{w})\big) J_i(\bar{w}). \tag{29}$$

As the expressions look confusingly similar, we explicitly point out the differences. First, recall that $w$ and $\mu$ are variables of the subproblems, whereas $\bar{w}$ and $\bar{\mu}$ are parameters defining the point of approximation. SCP and SQCQP are multiplier-free methods, so their Hessians do not depend on $\bar{\mu}$. $B_{\mathrm{SCP}}$ depends on $w$ via the functions $F_i^{\mathrm{lin}}(w; \bar{w})$. As these are the arguments of the $\nabla^2 \phi_i(\cdot)$, SCP keeps some curvature information of $\phi_i$ even when moving away from $\bar{w}$. The SCQP subproblem is a QP. Therefore its Hessian is constant with respect to the decision variables, and only depends on the linearization point $(\bar{w}, \bar{\mu})$. The expressions for SCQP and SQCQP are almost identical, with the slight difference that $\bar{\mu}$ is substituted by $\mu$. All three methods linearize the equality constraints $g(w)$ without keeping any curvature information, so none of the Hessians depends on the corresponding multiplier $\lambda$. An important property is that for $z = (w, \lambda, \mu) = (\bar{w}, \bar{\lambda}, \bar{\mu}) = \bar{z}$, all three Hessians are identical, i.e.,

$$B_{\mathrm{SCP}}(\bar{w}, \bar{\mu}; \bar{w}) = B_{\mathrm{SCQP}}(\bar{w}, \bar{\mu}) = B_{\mathrm{SQCQP}}(\bar{\mu}; \bar{w}) =: B(\bar{z}), \tag{30}$$

which follows directly from $F_i^{\mathrm{lin}}(\bar{w}; \bar{w}) = F_i(\bar{w})$.

Finally, we define $B_* := B(z^*)$, $\Gamma_* := \Gamma(w^*) \in \mathbb{R}^{r \times n}$ and $Z$ as a basis of the null space of $\Gamma_*$. The reduced Hessian approximation is denoted by $\tilde{B}_* := Z^\top B_* Z$. One way to obtain $Z$ would be via the QR factorization

$$\Gamma_*^\top = [Y \; Z] \, R, \tag{31}$$

where $Y \in \mathbb{R}^{n \times r}$ and $Z \in \mathbb{R}^{n \times (n-r)}$. The non-zero block of $R \in \mathbb{R}^{n \times r}$ is the upper triangular $\bar{R} \in \mathbb{R}^{r \times r}$. Note that $\Gamma_* Z = 0$ and $\Gamma_* Y = \bar{R}^\top$, which is invertible if $\Gamma_*$ has full rank (LICQ holds).

We can now state the following lemma on stability of the active set (Lemma 4.2). This has in fact been proven by Robinson for a more general class of iterative methods for solving NLPs [32]. We thus refrain from giving a proof here. The reader interested in a proof for our specific case and notation is referred to [28].
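The QR-based construction of the null space basis can be sketched numerically as follows. This is a minimal illustration, assuming NumPy's full QR factorization applied to the transposed Jacobian; the function name `null_space_basis_qr` and the small matrix `Gamma` are hypothetical and chosen only for the example.

```python
import numpy as np

def null_space_basis_qr(Gamma):
    """Return (Y, Z, R_bar) from a full QR factorization of Gamma^T.

    Gamma has shape (r, n) with r <= n and is assumed to have full row rank
    (LICQ). The columns of Z span null(Gamma), the columns of Y span
    range(Gamma^T), and R_bar is the upper triangular r x r block of R.
    """
    r, n = Gamma.shape
    Q, R = np.linalg.qr(Gamma.T, mode="complete")  # Q: (n, n), R: (n, r)
    Y, Z = Q[:, :r], Q[:, r:]
    R_bar = R[:r, :]
    return Y, Z, R_bar

# Hypothetical constraint Jacobian with r = 2 rows and n = 4 columns.
Gamma = np.array([[1.0, 2.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0, 3.0]])
Y, Z, R_bar = null_space_basis_qr(Gamma)

# The two identities noted in the text: Gamma Z = 0 and Gamma Y = R_bar^T.
print(np.allclose(Gamma @ Z, 0.0), np.allclose(Gamma @ Y, R_bar.T))
```

For a given Hessian approximation `B`, the reduced Hessian would then be `Z.T @ B @ Z`.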
Since SCQP approximates (14) by a QP, its residual map is affine in $y$ and can be written as in (36). For SCP and SQCQP this map is generally nonlinear. All KKT points $y^*$ of the reduced problem are characterized by $R(y^*) = 0$, and each method iterates by solving the root-finding problem $R_M(y_{k+1}; y_k) = 0$, for $M$ = SCP, SCQP, SQCQP, until a stationary point is found. Before taking a closer look at these iterations, we state a lemma on the residual maps, which is central to our later theorem on the local convergence rates.
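but match the residual map of the original problem (14) only up to first order,

Proof. The proof proceeds by showing identity of the residuals and their partial derivatives at $y = \bar{y}$. From (26) we already know that $\nabla G_{\mathrm{SCP}}(\bar{w}; \bar{w}) = \nabla G_{\mathrm{SCQP}}(\bar{w}; \bar{w}) = \nabla G_{\mathrm{SQCQP}}(\bar{w}; \bar{w}) = \Gamma(\bar{w})^\top$. From $g^{\mathrm{lin}}(\bar{w}; \bar{w}) = g(\bar{w})$ and the corresponding identities for the approximated inequality constraints, the residuals coincide at $y = \bar{y}$. The partial derivative w.r.t. $w$ of the first row of each residual map is exactly the Hessian of the Lagrangian of the corresponding subproblem. They are given in (27) to (29) and from (30) we know them to be identical at $y = \bar{y}$. In consequence, the partial derivatives $\partial R_M(y; \bar{y}) / \partial y$ of all three residual maps coincide at $y = \bar{y}$.

Before moving to the next theorem, we consider the special case in which a local minimizer $w^*$ is fully determined by the active constraints.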

Definition 4.4 ([22]). We say a feasible point $w$ of NLP (14) is fully determined by the active constraints if LICQ holds at $w$ and $|\mathcal{A}(w)| = n - p$.
We introduce this special case for two reasons. First, if a local minimizer $w^*$ is fully determined by the active constraints, the constraint Jacobian $\Gamma_* \in \mathbb{R}^{(|\mathcal{A}^*| + p) \times n}$ is a square matrix, since for the fully determined case $|\mathcal{A}^*| + p = n$. Due to LICQ, $\Gamma_*$ is invertible and its null space contains only the zero vector. Therefore the null space basis $Z$ would be empty, $Z \in \mathbb{R}^{0 \times 0}$, and in consequence the reduced Hessian approximation $\tilde{B} := Z^\top B Z$, which plays a central role in the following theorem, would be ill-defined, $\tilde{B}_* \in \mathbb{R}^{0 \times 0}$. Second, this case distinction is not only technical, but interesting in its own right. In a later theorem, we will actually obtain quadratic convergence for this special case. For now, we only focus on the case that $w^*$ is not fully determined by the active constraints.
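Definition 4.4 can be checked numerically once the gradients of the active inequality constraints and of the equality constraints at a candidate point are available. The following is a minimal sketch, assuming these gradients are passed as the rows of two arrays; the function name `is_fully_determined` and the toy data are hypothetical.

```python
import numpy as np

def is_fully_determined(J_active, J_eq):
    """Check Definition 4.4 at a point.

    J_active: (|A|, n) array, rows are gradients of the active inequalities.
    J_eq:     (p, n) array, rows are gradients of the equality constraints.
    Returns True if LICQ holds (full row rank) and |A| + p = n.
    """
    Gamma = np.vstack([np.atleast_2d(J_active), np.atleast_2d(J_eq)])
    n = Gamma.shape[1]
    licq = np.linalg.matrix_rank(Gamma) == Gamma.shape[0]
    return bool(licq and Gamma.shape[0] == n)

# n = 2, one active inequality and one equality: the constraint Jacobian is
# square and invertible, so the point is fully determined.
case_square = is_fully_determined(np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]]))

# n = 2, no active inequality: a one-dimensional null space remains.
case_underdetermined = is_fully_determined(np.zeros((0, 2)), np.array([[0.0, 1.0]]))

print(case_square, case_underdetermined)
```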

Linear Convergence Rate
We are now ready to state the main theorem of this paper. Similarly to the reduced Hessian approximation $\tilde{B}_*$, we define $\tilde{\Lambda}_* := Z^\top \nabla^2 L(w^*, \mu^*, \lambda^*) Z$ for the true Hessian of the original problem (14), and $E_* := E_{\mathrm{SCQP}}(w^*, \mu^*, \lambda^*)$ as well as $\tilde{E}_* := Z^\top E_* Z$ for the corresponding Hessian approximation error. Note that $\tilde{\Lambda}_* = \tilde{B}_* + \tilde{E}_*$.

Theorem 4.5. Assume $z^* = (w^*, \mu^*, \lambda^*)$ is a local minimizer of (14), at which LICQ and strict complementarity hold. Then $z^*$ is a fixed point of SCP, SCQP and SQCQP. If $w^*$ is not fully determined by the active constraints and $\tilde{B}_* \succ 0$ holds, then all three methods have the same asymptotic local linear contraction (or divergence) rate. This asymptotic contraction rate is given by the smallest $\alpha$ that fulfills the condition

$$-\alpha \tilde{B}_* \preceq \tilde{E}_* \preceq \alpha \tilde{B}_*. \tag{41}$$

In consequence, if

$$\frac{1}{1+\alpha} \tilde{\Lambda}_* \preceq \tilde{B}_* \preceq \frac{1}{1-\alpha} \tilde{\Lambda}_* \tag{42}$$

holds for some $\alpha < 1$, the methods converge Q-linearly with contraction rate $\alpha$, and a necessary condition for local convergence is given by $\tilde{B}_* \succeq \frac{1}{2} \tilde{\Lambda}_*$.

Proof. Stationarity follows from Lemma 4.1. From Lemma 4.2 we know that the active set $\mathcal{A}^*$ of all three methods is stable close to $z^*$. It follows that the multipliers of the inactive constraints are zero, $\mu_{\mathcal{I}^*} = \mu^*_{\mathcal{I}^*} = 0$, for $z$ sufficiently close to $z^*$. We can therefore disregard the inactive constraints and characterize the convergence behavior from the residual maps $R_M(y; \bar{y})$ as defined in (33) to (35) for $M$ = SCP, SCQP, SQCQP. As a consequence of Lemma 4.3 their partial derivatives are identical at $y = \bar{y} = y^*$ and given by

$$K_* := \frac{\partial R_M}{\partial y}(y^*; y^*) = \begin{bmatrix} B_* & \Gamma_*^\top \\ \Gamma_* & 0 \end{bmatrix}, \qquad L_* := \frac{\partial R_M}{\partial \bar{y}}(y^*; y^*) = \begin{bmatrix} E_* & 0 \\ 0 & 0 \end{bmatrix}$$

for $M$ = SCP, SCQP, SQCQP. We have $R_M(y^*; y^*) = 0$, and $R_M(\cdot\,; \cdot)$ is continuously differentiable with respect to both arguments, due to the assumptions made on the composing functions. Furthermore, $K_*$ is invertible due to the KKT Lemma [30, Lem. 16.1], which holds because of LICQ and $\tilde{B}_* \succ 0$.
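We thus know from the implicit function theorem that there exists a neighborhood of $y^*$ in which the continuously differentiable functions $y^{\mathrm{sol}}_M(\bar{y})$ are uniquely defined by $R_M(y^{\mathrm{sol}}_M(\bar{y}); \bar{y}) = 0$, with $M$ = SCP, SCQP, SQCQP. For $\bar{y}$ sufficiently close to $y^*$, their Jacobians are given by

$$\frac{\partial y^{\mathrm{sol}}_M}{\partial \bar{y}}(\bar{y}) = -\left( \frac{\partial R_M}{\partial y}\big(y^{\mathrm{sol}}_M(\bar{y}); \bar{y}\big) \right)^{-1} \frac{\partial R_M}{\partial \bar{y}}\big(y^{\mathrm{sol}}_M(\bar{y}); \bar{y}\big).$$

At the fixed point $y^*$, with $y^{\mathrm{sol}}_M(y^*) = y^*$, this evaluates to

$$\frac{\partial y^{\mathrm{sol}}_M}{\partial \bar{y}}(y^*) = -K_*^{-1} L_* =: -A_* \tag{45}$$

for $M$ = SCP, SCQP, SQCQP. Now consider the iteration $y_{k+1} = y^{\mathrm{sol}}_M(y_k)$, with $y_0$ sufficiently close to $y^*$. Taylor expansion around $y^*$ yields

$$y_{k+1} - y^* = -A_* (y_k - y^*) + O(\|y_k - y^*\|^2). \tag{46}$$

As is a standard result of linear stability analysis of nonlinear systems, convergence of $y_k$ to $y^*$ is determined by the spectral radius $\rho(A_*)$ [31, Chap. 22]. If $0 < \rho(A_*) < 1$, the sequence converges linearly with asymptotic contraction rate $\rho(A_*)$. When $\rho(A_*) = 0$, it converges faster than linearly. If $\rho(A_*) = 1$, we cannot decide about convergence by considering only $\rho(A_*)$, and if $\rho(A_*) > 1$, then $y^*$ is an unstable stationary point.

We now derive an explicit expression of $A_*$, to be able to characterize its spectral radius $\rho(A_*)$. Note that to compute $A_*$, due to the special structure of $L_*$, we only need the first block column of

$$K_*^{-1} = \begin{bmatrix} D & \cdot \\ C & \cdot \end{bmatrix}, \tag{47}$$

which we obtain by solving the linear system

$$\begin{bmatrix} B_* & \Gamma_*^\top \\ \Gamma_* & 0 \end{bmatrix} \begin{bmatrix} D \\ C \end{bmatrix} = \begin{bmatrix} I \\ 0 \end{bmatrix} \tag{48}$$

via the null space method [30, Chap. 16.2]. We split $D$ into two components, $D = Y D_Y + Z D_Z$, with $Z$ an orthonormal basis of the null space of $\Gamma_*$ and $Y$ such that $[Y \; Z]^\top [Y \; Z] = I$, see (31). The second block row of (48) gives $D_Y = 0$, since $\Gamma_* Z = 0$ and $\Gamma_* Y$ has full rank due to LICQ. Left-multiplying the first block row of (48) by $Z^\top$ and using $Z^\top \Gamma_*^\top = 0$ gives $D = Z \tilde{B}_*^{-1} Z^\top$. Similarly, left-multiplying the first block row of (48) by $Y^\top$ and substituting the above result for $D$ yields

$$C = (Y^\top \Gamma_*^\top)^{-1} Y^\top \big( I - B_* Z \tilde{B}_*^{-1} Z^\top \big).$$

Finally, we can explicitly write $A_*$ as

$$A_* = K_*^{-1} L_* = \begin{bmatrix} D E_* & 0 \\ C E_* & 0 \end{bmatrix}.$$

The next step is to find the spectral radius $\rho(A_*)$. Due to the zero blocks on the right-hand side, the non-zero eigenvalues of $A_*$ coincide with those of its upper left block, $\hat{A}_* := D E_* = Z \tilde{B}_*^{-1} Z^\top E_*$. Now $\hat{A}_*$ is similar to

$$[Y \; Z]^\top \hat{A}_* [Y \; Z] = \begin{bmatrix} 0 & 0 \\ \tilde{B}_*^{-1} Z^\top E_* Y & \tilde{B}_*^{-1} \tilde{E}_* \end{bmatrix},$$

whose non-zero eigenvalues are those of its lower right block $\tilde{A}_* := \tilde{B}_*^{-1} \tilde{E}_*$. Since $\tilde{B}_* \succ 0$ has a unique symmetric positive definite square root, the matrix

$$\check{A}_* := \tilde{B}_*^{1/2} \tilde{A}_* \tilde{B}_*^{-1/2} = \tilde{B}_*^{-1/2} \tilde{E}_* \tilde{B}_*^{-1/2} \tag{57}$$

is similar to $\tilde{A}_*$ and they share the same eigenvalues. From the rightmost expression in (57) we can see that $\check{A}_*$ is symmetric, so its eigenvalues are real. Its spectral radius $\rho(\check{A}_*)$ is therefore the smallest $\alpha \geq 0$ such that $-\alpha I \preceq \check{A}_* \preceq \alpha I$. These two LMI are equivalent to non-negativity of $\xi^\top (\alpha I \pm \check{A}_*) \xi$ for all $\xi$, and substituting $\xi = \tilde{B}_*^{1/2} p$ yields (41). Condition (42) is a direct consequence, using $\tilde{E}_* = \tilde{\Lambda}_* - \tilde{B}_*$.

Substituting $\alpha = 1$ into the left-hand side LMI of (42) we can see that $\tilde{B}_* \succeq \frac{1}{2} \tilde{\Lambda}_*$ is a necessary condition for convergence: if we had $\tilde{B}_* \prec \frac{1}{2} \tilde{\Lambda}_*$, the LMI could only hold for $\alpha > 1$. Finally, if $\tilde{\Lambda}_* \succ 0$, we know that the right-hand side LMI of (42) has to hold for some $\alpha < 1$, and the same goes for the left-hand side LMI if $\tilde{B}_* \succ \frac{1}{2} \tilde{\Lambda}_*$ holds. Therefore $\tilde{B}_* \succ \frac{1}{2} \tilde{\Lambda}_*$ is sufficient for linear convergence.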
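The contraction rate of Theorem 4.5 can be evaluated numerically as the spectral radius of the symmetric matrix in (57). The sketch below is only an illustration under stated assumptions: the full-space matrices `B` and `Lam` and the null space basis `Z` are made-up toy data, and the function name `contraction_rate` is hypothetical.

```python
import numpy as np

def contraction_rate(B, Lam, Z):
    """Smallest alpha with -alpha*B_red <= E_red <= alpha*B_red (as LMIs).

    B:   Hessian approximation at the solution (n x n, symmetric).
    Lam: exact Hessian of the Lagrangian at the solution (n x n, symmetric).
    Z:   orthonormal basis of the null space of the constraint Jacobian.
    """
    B_red = Z.T @ B @ Z
    E_red = Z.T @ (Lam - B) @ Z          # reduced approximation error
    evals, evecs = np.linalg.eigh(B_red)
    if evals.min() <= 0.0:
        raise ValueError("reduced Hessian approximation must be positive definite")
    B_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    S = B_inv_sqrt @ E_red @ B_inv_sqrt  # symmetric, same spectrum as B_red^{-1} E_red
    S = 0.5 * (S + S.T)                  # remove round-off asymmetry
    return float(np.max(np.abs(np.linalg.eigvalsh(S))))

# Toy data: n = 3 with a single constraint whose gradient is e_3.
B = np.diag([2.0, 1.0, 1.0])
Lam = np.diag([3.0, 0.5, 1.0])
Z = np.eye(3)[:, :2]
alpha = contraction_rate(B, Lam, Z)
print(alpha)
```

In this toy case the reduced error is `diag(1, -0.5)` and the reduced approximation is `diag(2, 1)`, so both generalized eigenvalues have magnitude 0.5 and the predicted rate is below one.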
Example 4.6. To illustrate the local convergence behavior, we revisit the problem formulation introduced in Example 3.2. We initialize the three methods at $w_0 = 0$, $s_0 = 0$, and, in the case of SCQP, $\mu_0 = 1$. For the obtained iteration sequences we compute the empirical contraction rate $\kappa_k$, where $\omega_k = (w_k, s_k)$. The theoretical asymptotic rate $\alpha(w^*)$ is computed as defined by the LMI in (41). The results are shown in Figure 6. Note how the empirical rates approach the theoretically predicted rate in the final iterations.
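An empirical contraction rate of this kind can be computed from a stored iterate sequence. The sketch below assumes one common definition, the ratio of successive distances to the solution; this choice, the function name `empirical_contraction_rates`, and the synthetic linear iteration are assumptions made for illustration and are not taken from the example itself.

```python
import numpy as np

def empirical_contraction_rates(iterates, solution):
    """Ratios ||omega_{k+1} - omega*|| / ||omega_k - omega*|| (assumed definition)."""
    errors = [np.linalg.norm(np.asarray(it) - solution) for it in iterates]
    return [e1 / e0 for e0, e1 in zip(errors[:-1], errors[1:]) if e0 > 0.0]

# Synthetic linear iteration omega_{k+1} - omega* = A (omega_k - omega*)
# with spectral radius 0.5, standing in for an actual solver run.
A = np.diag([0.5, 0.1])
omega_star = np.array([1.0, -2.0])
omega = omega_star + np.array([1.0, 1.0])
iterates = [omega]
for _ in range(12):
    omega = omega_star + A @ (omega - omega_star)
    iterates.append(omega)

kappa = empirical_contraction_rates(iterates, omega_star)
print(kappa[0], kappa[-1])
```

The early ratios are influenced by the faster-decaying component, while the final ratios settle at the spectral radius, which mirrors the behavior described for Figure 6.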

Analysis of the Fully Determined Case
In this subsection, we consider the case in which the solution is fully determined by the active constraints, as defined in Definition 4.4. As we will show, this is an interesting special case in which the KKT conditions are sufficient for local optimality and SCP, SCQP, and SQCQP converge quadratically. We motivate the analysis with the following example.
Example 4.7. We continue with the just introduced slack reformulation (23). This time we want to investigate the convergence behavior when varying the Huber parameter $\delta$. Recalling that for $\delta \to 0$ the pseudo Huber penalty approaches the $L_1$-norm, we also consider a variation of our example in which the residuals are penalized by the $L_1$-norm. This leads us to the problem $\min_{w \in \mathbb{R}} \|\eta - M(w)\|_1$, for which the smooth epigraph reformulation is

$$\min_{w, s} \; \sum_{i} s_i \quad \text{s.t.} \quad -s \leq \eta - M(w) \leq s. \tag{60}$$

We use SCP to solve the above problem, as well as the problem given in (23) for values of $\delta \in [10^{-6}, 10^{2}]$. Note that applying SCP to (60) actually simplifies to SLP [19]. For each $\delta$ we compute the theoretical contraction rate. The results are visualized in Figure 7 on the left side. For $\delta \gg 1$ the contraction rate flatlines at $\alpha \approx 0.04$. This happens when $\delta$ is so large that all residuals are penalized quadratically, i.e., in a least-squares fashion. For $\delta \to 0$ something much more interesting happens: it seems that $\alpha \to 0$ as $\delta$ approaches 0, i.e., in the limit we would obtain convergence faster than linear. As in Example 4.6 we also compute the empirical contraction rate $\kappa_k$ of SCP for a few values of $\delta$ and the $L_1$-norm, initializing at $w_0 = 2$. The resulting contraction rates are compared to the theoretically predicted rate $\alpha$ in Figure 7 on the right side. What happens here is that as $\delta \to 0$, we approach the SLP iterations defined by the $L_1$-norm. For a linear program, the solution, if it exists and is unique, always lies in a vertex of the feasible set. This corresponds to the fully determined case and therefore the convergence rate is quadratic, as we will see in the following.
Intuitively, the quadratic convergence can be explained as follows: If the active set $\mathcal{A}^*$ is known, i.e., we know $G(w^*)$, and $w^*$ is fully determined by the active constraints, i.e., $G(w^*)$ has $n$ components and $\Gamma_*$ is of full rank, then solving (14) simplifies to solving the feasibility problem $G(w) = 0$, which is independent of the objective function. Solving this nonlinear root-finding problem with the classical Newton's method would yield locally quadratic convergence [8]. Now SCQP linearizes all constraints, and thus in effect solves this special case with Newton's method, and should therefore converge quadratically. SCP and SQCQP keep some curvature in the constraints, so this reasoning is not fully applicable, but we will see that nonetheless they converge quadratically as well. A second consequence is the following proposition, which we might call the first order sufficient conditions (FOSC).

Proposition 4.8. Suppose $z^* = (w^*, \lambda^*, \mu^*)$ is a KKT point of (14). If $w^*$ is fully determined by the active constraints and LICQ and strict complementarity hold at $z^*$, then $w^*$ is a strict local minimizer of (14), i.e., the KKT conditions are necessary and sufficient conditions for optimality.
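Proof. LICQ and strict complementarity hold at $z^*$, so the critical cone $\mathcal{C}(w^*, \mu^*)$ is given by the null space of the Jacobian of the extended equality constraints $G(w^*)$, i.e., $\mathcal{C}(w^*, \mu^*) = \mathrm{null}(\Gamma_*)$. As $w^*$ is fully determined by the active constraints, $\Gamma_*$ is an $n \times n$ matrix and due to LICQ has full rank. Thus, the null space of $\Gamma_*$ contains only the zero vector, $\mathcal{C}(w^*, \mu^*) = \{0\}$. For any non-zero feasible direction $d \in \mathcal{F}(w^*)$, with tangent cone

$$\mathcal{F}(w^*) = \{ d \in \mathbb{R}^n \mid d^\top \nabla g_i(w^*) = 0, \; i = 1, \dots, p, \;\; d^\top \nabla f_j(w^*) \leq 0, \; j \in \mathcal{A}^* \},$$

there exists at least one index $j \in \mathcal{A}^*$ such that $d^\top \nabla f_j(w^*) < 0$. From $\nabla L(w^*, \lambda^*, \mu^*) = 0$ and $\mu^*_j > 0$ for all $j \in \mathcal{A}^*$, we can conclude that

$$d^\top \nabla f_0(w^*) = -\sum_{j \in \mathcal{A}^*} \mu^*_j \, d^\top \nabla f_j(w^*) - \sum_{i=1}^{p} \lambda^*_i \, d^\top \nabla g_i(w^*) > 0.$$

Thus, any non-zero feasible direction $d$ is an ascent direction, which implies that $w^*$ is a strict local minimizer of (14).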
Theorem 4.9. Assume $z^* = (w^*, \mu^*, \lambda^*) = (w^*, \vartheta^*)$ is a KKT point of (14), at which LICQ and strict complementarity hold. If $w^*$ is fully determined by the active constraints, then $w^*$ is a local minimizer and SCP, SCQP and SQCQP converge Q-quadratically in the primal variable $w$, and R-quadratically in the dual variable $\vartheta = (\mu, \lambda)$, i.e., there are constants $c_1, c_2 \in \mathbb{R}_+$ such that

$$\|w_{k+1} - w^*\| \leq c_1 \|w_k - w^*\|^2, \qquad \|\vartheta_{k+1} - \vartheta^*\| \leq c_2 \|w_k - w^*\|.$$

Proof. Proposition 4.8 implies that $w^*$ is a local minimizer of (14). For the convergence analysis, we follow the proof of Theorem 4.5: We disregard the inactive multipliers $\mu_{\mathcal{I}^*} = 0$ and focus the analysis on $y = (w, \mu_{\mathcal{A}^*}, \lambda) = (w, \gamma)$. Consider again the iterations defined by the solution operators $y_{k+1} = y^{\mathrm{sol}}_M(y_k)$, for $M$ = SCP, SCQP, SQCQP, and their Jacobian at $y^*$, as given in (45) as $-A_* = -K_*^{-1} L_*$. For the fully determined case, $\Gamma_*$ is a square matrix and due to LICQ invertible. We can therefore obtain the first block column of $K_*^{-1}$ as $D = 0$ and $C = \Gamma_*^{-\top}$, cf. (47), and subsequently

$$A_* = \begin{bmatrix} 0 & 0 \\ \Gamma_*^{-\top} E_* & 0 \end{bmatrix}.$$

The only non-zero block of $A_*$ is in its lower left, from which $\rho(A_*) = 0$ and therefore at least superlinear convergence follows. Now consider again the Taylor expanded iteration map (46),

$$\begin{bmatrix} w_{k+1} - w^* \\ \gamma_{k+1} - \gamma^* \end{bmatrix} = -\begin{bmatrix} 0 & 0 \\ \Gamma_*^{-\top} E_* & 0 \end{bmatrix} \begin{bmatrix} w_k - w^* \\ \gamma_k - \gamma^* \end{bmatrix} + O(\|y_k - y^*\|^2). \tag{66}$$

We now make two observations: (a) both SCP and SQCQP are multiplier-free methods, i.e., their solution operators depend only on $w_k$, but not on $\gamma_k$; (b) while SCQP is in general not multiplier-free, for the fully determined case we can explicitly obtain the primal part of its solution operator as $w_{k+1} = w_k - \Gamma(w_k)^{-1} G(w_k)$ from the second block row of (36), which does not depend on $\gamma_k$. Combining these observations with the first block row of (66), we can write $\|w_{k+1} - w^*\| = O(\|w_k - w^*\|^2)$, from which Q-quadratic convergence of the primal variable $w$ follows. The second block row is given as

$$\gamma_{k+1} - \gamma^* = -\Gamma_*^{-\top} E_* (w_k - w^*) + O(\|y_k - y^*\|^2).$$

For $y_k$ sufficiently close to $y^*$, we can find a constant $c_2 \in \mathbb{R}_+$ such that $\|\gamma_{k+1} - \gamma^*\| \leq c_2 \|w_k - w^*\|$. Since the sequence $\|w_k - w^*\|$ converges Q-quadratically to zero, R-quadratic convergence of the sequence $\gamma_k$ to $\gamma^*$ follows.
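The primal iteration $w_{k+1} = w_k - \Gamma(w_k)^{-1} G(w_k)$ used in the proof is Newton's method on the square system $G(w) = 0$. The sketch below illustrates the resulting quadratic error decay on a hypothetical two-dimensional system; the functions `G` and `Gamma`, the solution `w_star` and the starting point are made-up toy data and not taken from the tutorial example of the paper.

```python
import numpy as np

def G(w):
    """Toy extended constraints: unit circle and the diagonal w_1 = w_2."""
    return np.array([w[0] ** 2 + w[1] ** 2 - 1.0, w[0] - w[1]])

def Gamma(w):
    """Jacobian of G; square and invertible near the solution."""
    return np.array([[2.0 * w[0], 2.0 * w[1]],
                     [1.0, -1.0]])

w_star = np.array([1.0, 1.0]) / np.sqrt(2.0)
w = np.array([1.0, 0.5])
errors = [np.linalg.norm(w - w_star)]
for _ in range(6):
    # Newton step: solve Gamma(w) d = G(w) instead of forming the inverse.
    w = w - np.linalg.solve(Gamma(w), G(w))
    errors.append(np.linalg.norm(w - w_star))

for k, e in enumerate(errors):
    print(k, e)
```

The printed errors roughly square from one iteration to the next until they reach machine precision, independent of any objective function.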
Figure 7. Left: Theoretical linear contraction rate for varying the pseudo Huber loss parameter $\delta$. Right: Empirical convergence rate of SCP for several $\delta$ and the limit case of the $L_1$-norm (which corresponds to $\delta = 0$). The dotted lines indicate the theoretically predicted rates. For the $L_1$-norm this rate is 0, and therefore not visualizable on a log-scale.

Mirroring and desirable divergence
We now turn our attention to an interesting phenomenon that arises in the context of parameter estimation. As we have seen in the previous section, the discussed methods do not always converge locally. At first glance this might seem like a weakness. In this section we argue why, in the context of maximum likelihood estimation, this is actually a strength. This idea was first presented by Bock in the context of $L_2$ estimation [3] and was later extended to $L_1$ estimation [4]. In [11] we generalized it to unconstrained NLP with general convex symmetric penalty functions, and now generalize it further to problems with equality constraints.
Imagine we have a model function $m(x; w)$ that maps an input $x \in \mathbb{R}^{n_x}$ to an output $m(x; w) \in \mathbb{R}$, depending on a parameter $w \in \mathbb{R}^n$. Of this output we only have noisy measurements $\eta_i = m(x_i; w) + \nu_i$, where $\nu_i \in \mathbb{R}$ is some form of noise. Our aim is to identify this parameter from input-output pairs $(x_i, \eta_i)$, $i = 1, \dots, N$. In the framework of maximum-likelihood estimation this leads to an objective function of the form $f(w) = \sum_{i=1}^{N} \varphi(\eta_i - m(x_i; w))$ with a cost function $\varphi : \mathbb{R} \to \mathbb{R}$ determined by the assumed probability distribution of $\nu_i$. More generally we may phrase this as

$$\min_{w \in \mathbb{R}^n} \; \phi_0(\eta - M(w)) \quad \text{s.t.} \quad g(w) = 0. \tag{68}$$

Here $M(w)$ is the vertical concatenation of the $m(x_i; w)$ and correspondingly for $\eta$. The constraint $g(w)$ encodes further prior knowledge on the parameter, which might be derived from physical insight into the problem. As should be straightforward to see, (68) includes scenarios more general than the one described, e.g., it also includes scenarios with multi-output models or models with time dependency. Central to our results here is the observation that most commonly used noise distributions are radially symmetric. That is, if a measurement is perturbed by some noise realization $\nu$, its negative version $-\nu$ should be equally likely. In consequence, the cost function defined by the corresponding distribution is symmetric, $\phi_0(y) = \phi_0(-y)$. With these properties in mind we define the "mirror problem" (69) in Definition 5.1.

Example 5.2. We return to our example as defined in (5). For the two local minima, $w^*_{\mathrm{good}}$ and $w^*_{\mathrm{bad}}$, we compute the mirrored measurements $\breve{\eta}(w^*_{\mathrm{good}})$ resp. $\breve{\eta}(w^*_{\mathrm{bad}})$ as explained in Definition 5.1. This is visualized in Figure 8 alongside the measurement model for the respective minimizer. In this context the model is $m(x_i; w) := \psi(x_i + w)$ as defined in (4). Note that varying the model parameter $w$ corresponds to shifting $\psi$ horizontally.

Proof. We start by defining the Lagrangian of (68) as $L(w, \lambda) = \phi_0(\eta - M(w)) + \lambda^\top g(w)$, and correspondingly for its mirror problem (69). Now consider the KKT conditions (71) of the mirror problem. In these, first substitute $\breve{z} = (w^*, -\lambda^*)$ for $z$. Then note that due to symmetry $\nabla \phi_0(v) = -\nabla \phi_0(-v)$, and therefore $\nabla \phi_0(\breve{\eta}(w^*) - M(w^*)) = -\nabla \phi_0(\eta - M(w^*))$. The resulting conditions are exactly the KKT conditions of (68) at $z^*$, which hold by assumption. Therefore (71) holds at $\breve{z}$ and $\breve{z}$ is a KKT point of the mirror problem. For the reverse direction note that the mirror problem of the mirror problem is the original problem.

We are now ready to state the main theorem of this section. But before we do so, we would like the reader to note that the estimation problem (68) has no inequality constraints. As a result, both SCQP and SQCQP simplify to CGGN in this case. As in the previous section, a tilde is used to denote the reduced Hessians, e.g., $\tilde{B}_* = Z^\top B_* Z$.

Theorem 5.5 (Divergence from undesirable local minima). Assume $z^*$ is a local minimizer of (68), at which LICQ, strict complementarity, $\tilde{\Lambda}_* \succeq 0$ and $\tilde{B}_* \succ 0$ hold. Then $z^*$ is a KKT point of (68) and $\breve{z}$ as defined in Lemma 5.3 is a KKT point of its mirror problem. Both are therefore stationary points of SCP and CGGN applied to the respective problem. But if these methods applied to (68) diverge from $z^*$, then $\breve{z}$ is not a local minimizer of the mirror problem.
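Proof. $z^*$ is a KKT point by definition, and that $\breve{z}$ is a KKT point follows from Lemma 5.3. Stationarity of the methods at $z^*$ resp. $\breve{z}$ follows from Lemma 4.1. From Theorem 4.5 we know that local divergence means that $-\alpha \tilde{B}_* \preceq \tilde{E}_* \preceq \alpha \tilde{B}_*$ only holds for some $\alpha > 1$. Therefore there exists some $\alpha' \in (1, \alpha)$ such that one of the two LMI is violated. The left-hand side can be reformulated as $\alpha' \tilde{B}_* + \tilde{E}_* \succeq 0$, and due to $\tilde{B}_* + \tilde{E}_* = \tilde{\Lambda}_* \succeq 0$ and $\tilde{B}_* \succ 0$ we see that this can never be violated for $\alpha' > 1$. It follows that the right-hand side must be violated. This means that there exists some $p \in \mathbb{R}^{n_Z} \setminus \{0\}$, with $n_Z$ the dimension of the null space, such that $p^\top (\alpha' \tilde{B}_* - \tilde{E}_*) p < 0$, and hence, because $\alpha' > 1$ and $\tilde{B}_* \succ 0$, also $p^\top (\tilde{B}_* - \tilde{E}_*) p < 0$. Since $\breve{\Lambda} = \tilde{B}_* - \tilde{E}_*$ is the reduced Hessian of the mirror problem evaluated at $\breve{z}$, it follows that the second order necessary conditions do not hold and $\breve{z}$ cannot be a minimizer of the mirror problem.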
Example 5.6. Continuing directly on Example 5.2, we now investigate the mirror problem as defined in (69). From Example 2.3 we recall that for the asymptotic linear contraction rates we have $\alpha(w^*_{\mathrm{good}}) \approx 0.02$ and $\alpha(w^*_{\mathrm{bad}}) \approx 3200 \gg 1$. In Figure 9, we visualize the objective functions of the mirror problems at both local minima and see how $w^*_{\mathrm{good}}$ remains a clear minimizer for the mirror problem, whereas the local minimum at $w^*_{\mathrm{bad}}$ is transformed into a local maximum of the mirror problem.
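The mirroring operation can be illustrated on a small synthetic problem. The sketch below assumes that the mirrored measurements are obtained by reflecting the data at the model output, $\breve{\eta}(w^*) = 2 M(w^*) - \eta$, which is how the caption of Figure 8 describes them. The exponential model, the least-squares loss, the absence of constraints, and all names in the code are hypothetical choices for illustration and differ from the example of the paper.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0])

def model(w):
    """Stacked model outputs M(w) with m(x_i; w) = exp(w * x_i) (toy model)."""
    return np.exp(w * x)

def model_jac(w):
    """Derivative dM/dw for the scalar parameter w."""
    return x * np.exp(w * x)

def grad_objective(w, eta):
    """Gradient of f(w) = 0.5 * ||eta - M(w)||^2."""
    return -model_jac(w) @ (eta - model(w))

def hess_objective(w, eta):
    """Exact second derivative: Gauss-Newton term plus curvature term."""
    J = model_jac(w)
    d2M = x ** 2 * np.exp(w * x)
    return J @ J - d2M @ (eta - model(w))

# Construct data for which w_star = 0 is stationary: the residual is chosen
# orthogonal to the Jacobian (0, 1, 2) at w_star.
w_star = 0.0
residual = np.array([1.0, 2.0, -1.0])
eta = model(w_star) + residual
eta_mirror = 2.0 * model(w_star) - eta   # measurements mirrored at the model

grad_orig = grad_objective(w_star, eta)
grad_mirror = grad_objective(w_star, eta_mirror)
hess_orig = hess_objective(w_star, eta)
hess_mirror = hess_objective(w_star, eta_mirror)
print(grad_orig, grad_mirror, hess_orig, hess_mirror)
```

Here the Gauss-Newton Hessian is $B = 5$ and the neglected curvature is $E = 2$, so the original Hessian is $B + E = 7$ and the mirrored one is $B - E = 3$. Both are positive, which is consistent with a contraction rate $|E| / B = 0.4 < 1$ at this minimizer.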

Conclusions
We provided an overview of methods that exploit convex-over-nonlinear substructure in optimization problems. These were divided into three sectors: smooth unconstrained NLP, smooth constrained NLP, and optimization problems with non-smooth convex substructures. For all three sectors we discussed methods for the most general problem formulation as well as special cases, of which the simplest is nonlinear least-squares. A special focus lay on methods for smooth constrained problems, and for these we provided an analysis of local convergence. We proved that under mild assumptions SCP, SCQP and SQCQP have the same local linear contraction rate. For a fully determined solution, they even converge quadratically. For a simple numerical example we have seen that SCP has better global convergence behavior, whereas SCQP has cheaper iterations. SQCQP can be seen to provide a trade-off between these two cases. It would be interesting to further validate this on a broad set of example problems and with a rigorous comparison of the computational cost. As we have only considered the full step methods here, an analysis of globalized versions might also yield important results. Furthermore, we have shown that the generalized CGN and SCP methods locally converge to a local minimizer for equality constrained estimation problems with symmetric convex penalties if and only if this minimizer is stable under a mirroring operation. Thus, the methods are only attracted by those local minima that satisfy the desirable property of stability under mirroring.

Figure 1. Overview of the methods discussed in this paper. Arrows indicate specializations, i.e., $M_1 \to M_2$ means method $M_2$ is a special case of method $M_1$. SSDP is only a special case of SCP if the SSDP variant with a zero Hessian is used.

Figure 4. Visualization of the objective function and $\alpha(w)$. Note that $\alpha(w)$ attains its meaning as local contraction rate only at local minima.

Figure 5. Number of iterations until convergence to $w^* \approx 0.1$ depending on the initial guess $w_0$. SCP converged to $w^*$ in 100.0% of the cases in 100 iterations or less, SQCQP in 95.7%, and SCQP in 90.3%.
with $Z \in \mathbb{R}^{n \times (n-r)}$ an orthonormal basis of the null space of $\Gamma_*$ and $Y \in \mathbb{R}^{n \times r}$ such that $[Y \; Z]^\top [Y \; Z] = I$. $Y$ and $Z$ might be obtained via QR factorization of $\Gamma_*^\top$, see (31). From the second block row of (48) we have

$$\Gamma_* Y D_Y + \Gamma_* Z D_Z = 0 \;\Leftrightarrow\; D_Y = 0 \tag{50}$$

since $\Gamma_* Z = 0$ and $\Gamma_* Y \in \mathbb{R}^{m \times m}$ has full rank due to LICQ. Left-multiplying the first block row of (48) by $Z^\top$ we get

$$Z^\top B_* D + Z^\top \Gamma_*^\top C = Z^\top. \tag{51}$$

Substituting $D = Z D_Z$ and noting that $Z^\top \Gamma_*^\top = 0$ and that $Z^\top B_* Z \succ 0$ is invertible by assumption, $D_Z = (Z^\top B_* Z)^{-1} Z^\top$ follows, and therefore

where we used $Y^\top Z = 0$, $Z^\top Z = I$, $\tilde{B}_* = Z^\top B_* Z$ and $\tilde{E}_* = Z^\top E_* Z$. It follows that $\hat{A}_*$ and its transformed version have the same eigenvalues. The non-zero eigenvalues of the latter are given by the eigenvalues of its lower right block $\tilde{A}_* := \tilde{B}_*^{-1} \tilde{E}_*$. Due to $\tilde{B}_* \succ 0$, the symmetric positive definite square root $\tilde{B}_*^{1/2}$ exists and is uniquely defined. Therefore

Figure 6. Convergence to the local minimum at $w^* \approx 0.1$. The observed contraction rate of the three methods is compared to the theoretically predicted rate $\alpha(w^*)$. During the final iterations, all three methods approach this rate. Also note how SCP has the fastest rate during the initial iterations, whereas SCQP is the slowest.

Figure 8. Illustration of the mirror problem for two local minima, $w^*_{\mathrm{good}}$ and $w^*_{\mathrm{bad}}$. The mirrored measurements $\breve{\eta}$ are obtained by mirroring the original measurements $\eta$ vertically at the model function.

Figure 9. Comparison of the original objective function $f_0(w)$ to the objective functions $\breve{f}(w; w^*)$ of the mirror problems for the two local minima $w^*_{\mathrm{good}}$ and $w^*_{\mathrm{bad}}$. The bad local minimum at $w^*_{\mathrm{bad}}$ turns into a maximum of the corresponding mirror problem.