Multilevel Minimization for Deep Residual Networks

We present a new multilevel minimization framework for the training of deep residual networks (ResNets), which has the potential to significantly reduce training time and effort. Our framework is based on the dynamical system's viewpoint, which formulates a ResNet as the discretization of an initial value problem. The training process is then formulated as a time-dependent optimal control problem, which we discretize using different time-discretization parameters, eventually generating a multilevel hierarchy of auxiliary networks with different resolutions. The training of the original ResNet is then enhanced by training the auxiliary networks with reduced resolutions. By design, our framework is conveniently independent of the training strategy chosen on each level of the multilevel hierarchy. By means of numerical examples, we analyze the convergence behavior of the proposed method and demonstrate its robustness. For our examples, we employ multilevel gradient-based methods. Comparisons with standard single level methods show a speedup of more than a factor of three while achieving the same validation accuracy.


Introduction
Deep residual networks or ResNets are widely used architectures that demonstrate state-of-the-art performance in complex statistical learning tasks with applications in various fields, such as computer vision (Jung et al., 2017;Chen et al., 2017), or speech recognition (Wu et al., 2016;Xiong et al., 2018). The popularity of ResNets originates from their remarkable performance in the ImageNet (Russakovsky et al., 2015) and the MS COCO (Lin et al., 2014) image recognition competitions.
A major drawback of very deep ResNets is their long training time. To mitigate this issue, different strategies have been proposed, for example networks with stochastic depth (Huang et al., 2016), mollifying networks (Gulcehre et al., 2016), spatially adaptive architectures (Figurnov et al., 2017), or multilevel parameter initialization strategies (Haber et al., 2018;Chang et al., 2017).
In this work, we propose to accelerate the training of ResNets using multilevel minimization. Our work is motivated by the fact that the network depth is of paramount importance for achieving the necessary approximation properties (Håstad & Goldmann, 1991;Simonyan & Zisserman, 2014). However, very deep networks are computationally expensive to train, as the cost of forward-backward propagation scales linearly with respect to the number of parameters. In contrast, shallower networks might not show the necessary approximation properties, but their training cost is relatively low. Our multilevel framework exploits a multilevel hierarchy of auxiliary networks with different depths. The training of the deepest network is then accelerated by internally training the shallower networks.
The proposed multilevel framework is inspired by multigrid methods (Briggs et al., 2000;Hackbusch, 1985), which have originally been developed for the solution of elliptic partial differential equations. An extension of linear multigrid methods to nonlinear problems, called full approximation scheme (FAS), can be found in (Brandt, 1977). Later, several nonlinear multilevel minimization techniques have emerged, for example the multilevel linesearch method (MG/OPT) (Nash, 2000), the recursive multilevel trust region method (RMTR) (Gratton et al., 2008;Groß & Krause, 2009;Kopaničáková et al., 2019;Kopaničáková & Krause, 2020), or higher-order multilevel optimization strategies (Calandra et al., 2019). Our multilevel minimization method can be seen as a variant of an MG/OPT framework which is tailored for training ResNets.
The main challenge in designing an efficient multilevel minimization framework is to construct a suitable multilevel hierarchy. Here, we leverage the emerging dynamical system's viewpoint (Haber et al., 2018;Weinan, 2017), which casts a ResNet as the discretization of an initial value problem. The training process is then formulated as the minimization of a time-dependent optimal control problem. As a consequence, we can obtain a hierarchy of ResNets with different depths by discretizing the same optimal control problem with different discretization parameters.
A dynamical system's viewpoint was first used in a multilevel context in (Chang et al., 2017), where the authors trained shallow networks to initialize parameters of a deep network. The same parameter initialization strategy was recently extended for layer-parallel training of ResNets (Cyr et al., 2019). Our method differs from the methods proposed in (Chang et al., 2017) and (Cyr et al., 2019), as we take advantage of a multilevel hierarchy during the whole training process, not only in the beginning. Nevertheless, it is possible to incorporate a multilevel initialization strategy into our multilevel minimization framework. We do not exploit this possibility in the presented paper, as aim of this work is to test the proposed multilevel training framework by itself.
This work makes the following contributions:
• We present an abstract nonlinear multilevel minimization framework for training deep residual networks.
• Using our multilevel framework, we propose multilevel variants of gradient and mini-batch gradient methods.
• We numerically analyze the convergence behavior of our multilevel training strategies using two different datasets and ResNets with more than 2,000 layers. In addition, comparisons with a standard single level method are made, which demonstrate a speed-up of more than a factor of three.

Deep Residual Networks
This section provides a brief overview of deep residual networks (ResNets) in the context of supervised classification. Throughout the following, we consider a dataset D = {(x_j, c_j)}_{j=1}^p of p samples. Each sample is a pair consisting of an input feature x_j ∈ R^q and its corresponding label c_j ∈ R^m. The size of the label vector c_j is determined by the number of output classes m, as the i-th component of the vector c_j corresponds to the probability of the example x_j belonging to the i-th class.

Classification
The main idea behind supervised learning is to construct a model which describes the relationship between input and output for a labeled dataset D. The model function f_m : R^q × R^n → R^m is parametrized by a set of parameters θ ∈ R^n. The process of finding suitable parameters θ is called training, and it usually requires solving the following minimization problem:

    min_θ  (1/p) Σ_{j=1}^p ℓ(f_m(x_j, θ), c_j) + R(θ),    (1)

where the loss function ℓ : R^m × R^m → R measures the deviation of the predicted output from the known label. The regularizer R : R^n → R in (1) is chosen such that it ensures the existence and regularity of the parameters θ. A common choice for the regularizer is Tikhonov regularization (Engl et al., 1996); however, other possibilities have also been used, see for example (Ng, 2004).
In the context of classification, the model function f_m is constructed by composing the forward propagation f : R^q × R^r → R^v with the hypothesis function P : R^m → R^m. The forward propagation filters input features in a nonlinear manner, while the hypothesis function predicts the class label probabilities using the output of the forward propagation. In abstract form, the model function f_m is defined as

    f_m(x, θ) := P(W_K f(x, θ) + b_K),    (2)

where we split the model parameters θ into the classification parameters θ_K and the forward propagation parameters θ, thus θ = {θ, θ_K}. The classification parameters θ_K := {W_K, b_K} map the output of the forward propagation to the m class scores. For multinomial classification problems, it is common to employ a cross-entropy loss function together with the softmax hypothesis function. For alternative choices, we refer interested readers to (Goodfellow et al., 2016).
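To make the softmax hypothesis function and cross-entropy loss concrete, the following is a minimal NumPy sketch (our own illustration, not the implementation used in this work):

```python
import numpy as np

def softmax(z):
    # Shift by the maximum for numerical stability before exponentiating.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(pred, label):
    # label is a one-hot probability vector; the epsilon guards log(0).
    return -float(np.sum(label * np.log(pred + 1e-12)))
```

Since the softmax output is a valid probability vector, the cross-entropy of a one-hot label reduces to the negative log-probability of the true class.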

FORWARD PROPAGATION VIA RESNET
In deep learning, a neural network constitutes the forward propagation function f. The parametric function f is created by concatenating many functions, called layers. Each layer k is usually composed of an affine linear and a point-wise nonlinear transformation, parametrized by the layer parameters θ_k ∈ R^d.
In this work, we consider residual networks with identity shortcut connections (He et al., 2016b). The propagation of the input sample x through a network with K residual layers can then be expressed as

    y_{k+1} = y_k + F(y_k, θ_k),   k = 0, ..., K − 1,    (3)

where y_k ∈ R^v denotes the state of layer k. For simplicity, Equation (3) assumes a constant network width v. Hence, we map an input sample x ∈ R^q into the feature space with the help of the linear operator Q ∈ R^{v×q}, i.e. y_0 := Qx. The elements of the matrix Q can be fixed or learned during the training process.
The transformation F : R^v × R^d → R^v from (3) describes the residual module, c.f. (He et al., 2016a). Here, we assume that F takes the form of a simple one-layer perceptron,

    F(y_k, θ_k) := σ(W_k y_k + b_k),    (4)

where σ : R^v → R^v is the nonlinear activation function, for example the rectified linear unit (ReLU), defined as σ(z) := max{0, z}. For alternatives, such as the logistic sigmoid or the hyperbolic tangent, see (Goodfellow et al., 2016). The affine transformation in (4) is defined by a set of layer parameters θ_k := {W_k, b_k}, consisting of weights W_k ∈ R^{v×v} and biases b_k ∈ R^v. The linear operator W_k can be a dense matrix, or sparse, e.g. in the case of a convolutional neural network, where it expresses the convolutional operator, see (Goodfellow et al., 2016).
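The propagation rule (3) with the one-layer perceptron residual module (4) can be sketched in a few lines of NumPy (a hypothetical illustration; names such as `forward` are ours):

```python
import numpy as np

def relu(z):
    # ReLU activation: sigma(z) = max{0, z}, applied element-wise.
    return np.maximum(0.0, z)

def forward(x, Q, params):
    # Map the input into the feature space of width v: y_0 = Q x.
    y = Q @ x
    # Residual layers: y_{k+1} = y_k + sigma(W_k y_k + b_k).
    for W, b in params:
        y = y + relu(W @ y + b)
    return y
```

Note that with all weights and biases set to zero, each residual update vanishes and the network reduces to the identity on the feature space, which is the property that makes very deep ResNets trainable.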

Classification as optimal control problem
Following (Haber et al., 2018), Equation (3) can be seen as a simplification of the more generic formula for a one-step method,

    y_{k+1} = y_k + ∆_t F(y_k, θ_k),   k = 0, ..., K − 1,    (5)

with ∆_t = 1. Now, the forward propagation through the network (5) can be interpreted as a forward Euler discretization of the initial value problem

    ∂_t y(t) = F(y(t), θ(t)),  for t ∈ (0, T],   y(0) = Qx.    (6)

The dynamical system above then continuously transforms the initial state y(0) into the network output y(T), while the time-dependent control variables θ(t) define the behavior of the system. The classification problem is now formulated as the following continuous optimal control problem (Haber & Ruthotto, 2017):

    min_{θ, θ_K}  (1/p) Σ_{j=1}^p ℓ(P(W_K y_j(T) + b_K), c_j) + R(θ, θ_K)
    subject to  ∂_t y_j(t) = F(y_j(t), θ(t)),  y_j(0) = Q x_j,  for j = 1, ..., p,    (7)

where y_j(T) denotes the output of the network for the data sample x_j. The continuous formulation (7) opens the door to many new developments, for example the design of stable network architectures (Haber & Ruthotto, 2017; Benning et al., 2019), parallel approaches to training (Günther et al., 2018; Parpas & Muir, 2019), or novel solution strategies (Li et al., 2017). In this work, we leverage the continuous formulation in order to design an efficient multilevel training strategy, see Section 3.

DISCRETIZATION
To solve the continuous optimal control problem (7) numerically, we discretize (7) in time. Thus, we consider the time-grid 0 = τ_0 < ... < τ_K = T of K + 1 uniformly distributed time points τ_k := k ∆_t, where ∆_t := T/K represents the time-step. The discretized control variables θ_k ≈ θ(τ_k) and state variables y_{j,k} ≈ y_j(τ_k) then correspond to the parameters and the state of the k-th layer of the ResNet, respectively.

Figure 1. An example of time grids used for the multilevel discretization. On the fine level, we consider 10 time-steps, while on the coarse level, we use 5 larger time-steps.
In the discrete setting, we obtain the following constrained minimization problem:

    min_{θ, θ_K}  L(θ, y_K) := (1/p) Σ_{j=1}^p ℓ(P(W_K y_{j,K} + b_K), c_j) + R(θ, θ_K)
    subject to  y_{j,k+1} = y_{j,k} + ∆_t F(y_{j,k}, θ_k),   y_{j,0} = Q x_j,    (8)

where we have used an explicit Euler scheme to discretize the time derivative ∂_t y(t) in (7). This choice of discretization is what imposes the particular ResNet architecture. However, other, possibly more stable, discretization schemes can be considered, see for instance (Haber & Ruthotto, 2017). Employing an explicit Euler method, we can ensure the stability of the forward propagation by ensuring that the time-step ∆_t is sufficiently small (Haber & Ruthotto, 2017).
Multilevel discretization We can discretize (7) using different discretization parameters. This allows us to construct a multilevel hierarchy of auxiliary networks with different resolutions. We consider a hierarchy of L levels, denoted by l = 1, ..., L. The finest level, l = L, represents the discretization of the optimal control problem (7) with satisfactory resolution/representation capacity. This means that the time-step ∆_t^L is sufficiently small and that the network has sufficiently many layers to ensure desirable approximation properties of the model. In order to obtain coarser level networks, we discretize the time interval [0, T] with larger time-steps. For instance, if we assume uniform coarsening in time by a factor of two, the following relation holds for the time-steps of subsequent levels: ∆_t^{l−1} = 2 ∆_t^l. As a consequence, the number of layers is halved between the networks on levels l and l − 1. Figure 1 demonstrates the process for a simple 2-level example. Since the networks on the coarser levels of the multilevel hierarchy are constructed with fewer layers, they have fewer trainable parameters. Therefore, they are computationally cheaper to optimize, due to the fact that the cost of the forward-backward propagation used during the training grows linearly with respect to the number of parameters (Hecht-Nielsen, 1992). As a consequence, it is roughly twice as fast to perform one forward-backward propagation on a coarser level than on the subsequent finer level.
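Under the uniform coarsening-by-two just described, the level hierarchy is fully determined by the finest depth K and the final time T. A small sketch (our own helper, assuming K is divisible by 2^(L−1)):

```python
def build_hierarchy(K_fine, T, num_levels):
    # Level num_levels is the finest; each coarser level halves the
    # number of residual blocks and doubles the time-step dt = T / K.
    levels = []
    K = K_fine
    for l in range(num_levels, 0, -1):
        levels.append({"level": l, "layers": K, "dt": T / K})
        K //= 2
    return levels
```

For example, K = 2048, T = 1 and four levels yield networks with 2048, 1024, 512, and 256 residual blocks, with time-steps 1/2048 up to 1/256.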

Multilevel Training for ResNets
In this section, we introduce a nonlinear multilevel minimization framework for training ResNets. The presented framework can be seen as a variant of the MG/OPT framework (Nash, 2000) originally developed for solving the large scale problems arising from the discretization of partial differential equations. Our variant of MG/OPT is tailored to the minimization of the discrete optimal control problem (8). In particular, we employ a hierarchy of auxiliary networks with different depths, see Figure 2, which are used to accelerate the training of the original network. Each auxiliary network is trained by approximately minimizing the associated level-dependent optimal control problem, see Section 3.1.1 for the details. The minimization of the level-dependent optimal control problem is carried out using an optimizer associated with a given level.
Throughout the following, we use a pair (l, µ^l) of superscripts to denote the quantities related to level l and iteration µ^l. If no subscript is used, we refer to quantities on all layers of the network simultaneously. Otherwise, the subscript identifies the quantities associated with a given layer. For example, θ_k^{1,µ^1} denotes the parameters related to the k-th layer of the coarsest network, l = 1, after µ^1 update steps.
Transfer operators The multilevel training framework requires transferring data between subsequent levels of the multilevel hierarchy. For this reason, we employ two types of transfer operators. The interpolation operator I^l : R^{n_l} → R^{n_{l+1}} transfers weights and biases from level l to level l + 1. Here, we consider piecewise constant interpolation in time. Other choices of transfer operators, such as linear interpolation, are also possible and may even be preferable; we plan to incorporate them into our multilevel training framework in future work. In addition to the interpolation operator, the multilevel method also uses a restriction operator R^l : R^{n_{l+1}} → R^{n_l} in order to transfer data, such as gradients, from level l + 1 to level l. As is common in the multigrid literature (Hackbusch, 1985), we choose the restriction operator as R^l := (I^l)^T.
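With a coarsening factor of two, piecewise constant interpolation in time simply copies each coarse-level layer's parameters to the two fine-level layers it covers, and its transpose sums the two corresponding fine-level quantities. A minimal sketch (treating each layer's parameters as one entry; whether the restriction is additionally scaled, e.g. averaged, is an implementation choice not fixed by the text):

```python
def interpolate(coarse_params):
    # Piecewise constant interpolation in time: copy each coarse-level
    # layer parameter to the two fine-level layers it covers.
    return [theta for theta in coarse_params for _ in range(2)]

def restrict(fine_params):
    # Transpose of piecewise constant interpolation: sum the two
    # fine-level entries covered by each coarse-level layer.
    return [fine_params[2 * k] + fine_params[2 * k + 1]
            for k in range(len(fine_params) // 2)]
```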
The downward phase starts on the finest level, l = L, with initial weights θ^{L,0} and passes through all levels until the coarsest level is reached. On each level, we perform µ_1^l level-optimizer steps in order to find an approximate solution of the level-dependent optimal control problem. The approximate solution, i.e. the updated network parameters θ^{l,µ_1^l}, is then used to initialize the weights on the subsequent coarser level, i.e. θ^{l−1,0} = R^{l−1} θ^{l,µ_1^l}. This process is repeated until we reach the coarsest level, l = 1.
Once the coarsest level is reached and we have performed µ^1 level-optimizer steps, yielding the parameters θ^{1,µ^1}, we initiate the upward phase. During the upward phase, we return to the finest level, while passing through all levels of the multilevel hierarchy. Starting on the coarsest level, we compute the coarse grid correction e^l = θ^{l,µ^l} − θ^{l,0}, which characterizes the difference between the initial and the updated parameters on a given level. This correction is then transferred to the next finer level using the interpolation operator as e^{l+1} = I^l e^l. Once we have the interpolated correction, we use it to update the parameters of the finer network, thus θ^{l+1,µ_1^{l+1}+1} = θ^{l+1,µ_1^{l+1}} + e^{l+1}. Finally, we perform µ_2^{l+1} steps of the level-optimizer in order to improve the current approximation of the parameters on that level. The whole process is summarized in Algorithm 1.

LEVEL-DEPENDENT MINIMIZATION PROBLEMS
On each level of the multilevel hierarchy, we look for an approximate solution of a level-dependent optimal control problem. As is common for nonlinear multilevel (minimization) schemes, such as FAS (Brandt, 1977) or RMTR (Gratton et al., 2008), we define the level-dependent optimal control problems as

    min_{θ^l, y^l}  H^l(θ^l, y_K^l) := L^l(θ^l, y_K^l) + ⟨δg^l, θ^l⟩
    subject to  y_{k+1}^l = y_k^l + ∆_t^l F(y_k^l, θ_k^l),    (9)

where δg^l is given by

    δg^l := R^l ∇H^{l+1}(θ^{l+1,µ_1^{l+1}}, y_K^{l+1,µ_1^{l+1}}) − ∇L^l(θ^{l,0}, y_K^{l,0}),    (10)

for all levels l < L. For the finest level, l = L, we assume that δg^L := 0, and therefore the functional H^L : R^{n_L} → R coincides with the loss functional L^L defined in (8).
On the coarser levels, the functional H^l consists of two terms: the loss functional L^l and the so-called coupling term ⟨δg^l, θ^l⟩. The coupling term creates a connection between two subsequent levels of the multilevel hierarchy. This is accomplished through δg^l, which measures the deviation between the restricted fine-level gradient R^l ∇H^{l+1}(θ^{l+1,µ_1^{l+1}}, y_K^{l+1,µ_1^{l+1}}) and the initial coarse-level gradient ∇L^l(θ^{l,0}, y_K^{l,0}). The use of this coupling term is of major importance, as it enforces the relationship

    ∇H^l(θ^{l,0}, y_K^{l,0}) = R^l ∇H^{l+1}(θ^{l+1,µ_1^{l+1}}, y_K^{l+1,µ_1^{l+1}})

for the first optimizer step on a given level. In addition, it guarantees that the minimization on the coarse level is guided by the restricted fine-level gradient and that the prolongated coarse-level correction will be a descent direction on the fine level (Nash, 2000).
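The first-order coherence induced by the coupling term can be checked directly: at the initial coarse iterate, adding δg^l to the coarse gradient reproduces the restricted fine-level gradient. A small sketch with NumPy arrays standing in for the gradients (hypothetical helper names, our own illustration):

```python
import numpy as np

def delta_g(restricted_fine_grad, coarse_grad_init):
    # delta_g^l: deviation of the initial coarse-level gradient from
    # the restricted fine-level gradient, cf. (10).
    return restricted_fine_grad - coarse_grad_init

def coupled_gradient(coarse_grad, dg):
    # grad H^l = grad L^l + delta_g^l; the coupling term is linear in
    # theta^l, so its gradient is simply delta_g^l.
    return coarse_grad + dg
```

Evaluating `coupled_gradient` at the initial coarse gradient returns the restricted fine-level gradient exactly, which is the relationship enforced for the first optimizer step on each level.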

MULTILEVEL GRADIENT-BASED METHODS
Algorithm 1 employs an auxiliary optimizer on every level of the multilevel hierarchy. By design, the framework is independent of the choice of the optimizer on each level. It does not even require employing the same type of optimizer on all levels. One can, for example, utilize computationally expensive optimizers on the coarser levels, while employing computationally cheaper optimizers on the finer levels. In the multilevel community, it is quite popular to employ a second-order optimizer on the coarsest level and gradient-based optimizers on all finer levels.
The easiest way to construct a multilevel training algorithm is to employ a gradient method on all levels. One level-optimizer iteration then consists of a simple gradient step computed using the whole dataset, see Algorithm 2. Although the choice of other gradient-based algorithms, such as L-BFGS (Le et al., 2011) or RMSprop (Tieleman & Hinton, 2012), might be more beneficial, using a vanilla gradient descent method allows for plain testing of our multilevel framework without introducing additional hyper-parameters.
Algorithm 1 V-cycle of MG/OPT(L^l, l, θ^{l,0}, δg^l)
Constants: µ_1^l, µ_2^l, µ^1 ∈ N
1. Downward phase
   Construct H^l by means of (9)
   [θ^{l,µ_1^l}] = LevelOptimizer(H^l, θ^{l,0}, µ_1^l)
   θ^{l−1,0} ← R^{l−1} θ^{l,µ_1^l}
   Evaluate δg^{l−1}
2. Recursion, or call to the optimizer on the coarsest level
   if l = 2 then
      [θ^{1,µ^1}] = LevelOptimizer(H^1, θ^{1,0}, µ^1)
   else
      [θ^{l−1,µ^{l−1}}] = MG/OPT(L^{l−1}, l − 1, θ^{l−1,0}, δg^{l−1})
   end if
3. Upward phase
   e^l = I^{l−1}(θ^{l−1,µ^{l−1}} − θ^{l−1,0})
   θ^{l,µ_1^l+1} = θ^{l,µ_1^l} + e^l
   [θ^{l,µ^l}] = LevelOptimizer(H^l, θ^{l,µ_1^l+1}, µ_2^l)

Algorithm 2 LevelOptimizer(H^l, θ^{l,0}, max_it)
Constants: α ∈ R^+
1: for i = 1, ..., max_it do
2:    θ^{l,i} = θ^{l,i−1} − α ∇H^l(θ^{l,i−1}, y_K^{l,i−1})
3: end for
return: θ^{l,max_it}

Multilevel mini-batch gradient descent Since mini-batch gradient descent (SGD) is typically the algorithm of choice when training a neural network, we also propose its multilevel variant, Algorithm 3. Similarly to the single level SGD algorithm, we split the dataset into nb mini-batches. The algorithm then iterates through all mini-batches. For each mini-batch, the algorithm invokes a multilevel gradient descent step, thus a V-cycle of MG/OPT configured with a gradient descent optimizer on all levels.

Computational complexity One V-cycle of the multilevel training strategy is computationally more expensive than one iteration of a single level optimizer. For gradient-based optimizers, the computational cost is associated with the evaluation of the gradient, thus with the cost of a forward-backward pass. To provide a fair comparison between multilevel and single level methods, we introduce the notion of work units. One work unit U_L represents the cost of a gradient evaluation on the finest level. Assuming a coarsening factor of two, the cost related to a gradient evaluation on a coarser level l is U_l = 2^{l−L} U_L. The computational cost of one V-cycle, denoted by U_c, can be obtained by summing over the cost required on each level, thus

    U_c = µ^1 U_1 + Σ_{l=2}^{L} (µ_1^l + µ_2^l + 1) U_l.

The cost on the coarsest level is related to the µ^1 level-optimizer steps. On all other levels, we have to take into account the gradient evaluation that is required for computing the coupling term δg^l, in addition to the µ_1^l and µ_2^l level-optimizer steps.
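The per-V-cycle cost in work units can be evaluated for a given level-optimizer setup; a small sketch (our own helper, measuring cost in units of U_L, assuming coarsening by two):

```python
def vcycle_cost(L, mu1, mu2, mu_coarsest):
    # Work units: a gradient evaluation on level l costs 2**(l - L)
    # relative to the finest level. Coarsest level: mu_coarsest steps.
    cost = mu_coarsest * 2.0 ** (1 - L)
    # Levels 2..L: mu1[l] + mu2[l] optimizer steps, plus one extra
    # gradient evaluation for the coupling term delta_g^l.
    for l in range(2, L + 1):
        cost += (mu1[l] + mu2[l] + 1) * 2.0 ** (l - L)
    return cost
```

For instance, a 2-level setup with µ_1^2 = 1, µ_2^2 = 0 and µ^1 = 2 costs 3 work units per V-cycle, compared to 1 work unit for a single gradient step on the fine level.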
The overall computational cost of the multilevel training U is then simply computed as U = (#V-cycles) U c , where (#V-cycles) denotes the number of V-cycles required to achieve a prescribed tolerance.

Numerical Experiments
We analyze the performance of the proposed multilevel optimizers using two classification problems:
• Co-centric circles: This simple example was proposed in (Lin & Jegelka, 2018) and requires the classification of particles into two distinct classes. The input features x_j ∈ [−3, 3]^2 describe the position of a particle in a two-dimensional plane, while the output vector c_j ∈ R^2 prescribes the affiliation to a given class. In particular, the i-th element of the label c_j equals 1 if the sample x_j belongs to the i-th class and 0 otherwise. The dataset consists of 3,000 samples, where 2,000 are used for training and 1,000 for testing.
• MNIST: Our second classification task considers the database of handwritten digits (LeCun et al., 1998). The dataset contains greyscale images of size 28 × 28 pixels that are uniformly divided into ten classes. As preprocessing, we standardize the images so that the pixel values lie in the range [0, 1], and perform centering by subtracting the mean from each pixel. The data is split into 60,000 samples for training and 10,000 samples for testing.

Implementation and testing environment Our implementation of deep residual networks and of the nonlinear multilevel training framework uses the Keras (Chollet et al., 2015) and TensorFlow (Abadi et al., 2015) libraries.
The classification is performed using a ResNet architecture as described in (3). We employ a simple variant of residual blocks, i.e. a one layer perceptron with ReLu activation function, see (4). Each layer consists of 3 nodes for the co-centric circles example and 10 nodes for the MNIST example. The operator Q, which maps input features into network width, is learned during training. All layers are fully-connected.
Unless specified differently, the deep residual network consists of 2,048 residual blocks. In the case of multilevel training, the multilevel hierarchy of auxiliary networks with different resolutions is created by coarsening in time with a factor of two, see Section 2.2.1 for more details. On each level of the multilevel hierarchy, we consider the final time T = 1. As commonly used, we employ Tikhonov regularization, thus R(θ) := β ‖θ‖_F^2, where ‖·‖_F denotes the Frobenius norm. On the finest level (deepest network), we prescribe the regularization parameters β = 10^{−4} and β = 10^{−5} for the co-centric circles and MNIST examples, respectively. On the coarser levels, the value of the regularization parameter β is scaled by the coarsening factor 2^{L−l}, where l denotes the given level.
All presented experiments were performed on our local cluster consisting of 42 compute nodes, each equipped with two Intel Xeon E5-2650 v3 processors with a clock frequency of 2.60 GHz. The memory per node is 64 GB.
Algorithmic setup We trained both test examples using gradient-based multilevel optimizers with a constant learning rate of 0.1 for the co-centric circles example and 0.01 for the MNIST dataset. During all numerical tests, the weights are initialized randomly, while the biases are set to zero. The co-centric circles example is trained using the full dataset, giving rise to a multilevel gradient descent. We terminate training once a validation accuracy of 1 is achieved. The training of the MNIST example is performed using multilevel mini-batch gradient descent with a mini-batch size of 1,000. As a termination criterion, we require the validation accuracy to be higher than 0.93.
During multilevel training, the level-optimizers perform several steps while completing the downward and upward phases of the V-cycle. To describe our particular level-optimizer setup, we introduce a list notation, such as [(1), 2, 1, 3, {4}] for a 5-level training strategy. Each entry of the list indicates the number of optimizer steps used on a given level. The list is ordered from the finest to the coarsest level. If no bracket is used, we assume µ_1^l = µ_2^l. The use of a round bracket implies that the level-optimizer was not called during the upward phase, thus µ_2^l = 0. The curly bracket indicates the number of optimizer steps on the coarsest grid, i.e. µ^1.

Numerical results
Convergence behavior with respect to levels We analyze the convergence behavior of the proposed multilevel training methods with respect to varying numbers of levels. The presented results also include the single-level versions of the algorithms, thus vanilla gradient descent and mini-batch gradient descent. During the experiments, we employ one iteration of the level-optimizer on all levels l ∈ {L/2 + 1, ..., L}. On the coarser levels, that is l ∈ {1, ..., L/2}, we perform 2 level-optimizer iterations. The described setup is used for both the downward and the upward phase of a V-cycle, except on the finest level, where we skip the call to the optimizer during the upward phase. Thus, for 6-level training, we use the following setup: [(1), 1, 1, 2, 2, {2}]. Figure 4 demonstrates the obtained results for both test problems. As we can see, adding more levels reduces the number of required V-cycles significantly. In particular, using the 2-level method already leads to a decrease in the number of iterations by a factor of 3, compared to the single level method. For the 4-level method, we obtain a reduction in the number of V-cycles by a factor of 8, while for the 8-level method, the number of V-cycles is approximately reduced by a factor of 15.
We also compare the total computational cost of the multilevel training methods. Following the analysis presented in Section 3.1.2, we report the computational cost in Table 1 for both datasets. For more than four levels, the computational cost does not increase substantially anymore, as the cost of numerical operations on those levels is negligible compared to the cost of the same operations performed on the finest level. The results also demonstrate that the total computational cost U decreases as the number of levels increases. This is not surprising, as the number of V-cycles required for convergence decreases. In particular, the 8-level method is approximately 3.4 times more computationally efficient than its single level counterpart.
Convergence behavior with respect to the number of residual blocks Further, we analyze the convergence behavior of the proposed multilevel training strategy for varying numbers of residual blocks. We keep the same parameters as described at the beginning of Section 4, but alter the number of residual blocks. Figure 5 illustrates the obtained results for both datasets. As we can see, the method exhibits the same asymptotic convergence behavior independently of the number of residual blocks. These results are very promising, as they suggest that the convergence rate of our multilevel training strategy does not deteriorate with network depth.
Influence of hyper-parameters Finally, we investigate the sensitivity of the multilevel methods with respect to the choice of hyper-parameters. Firstly, we demonstrate how different setups of the level-optimizers influence the convergence properties of the multilevel training strategy. Table 2 reports the results obtained for different setups of the 8-level method trained on the co-centric circles example. As expected, increasing the number of optimizer calls on the lower levels decreases the total computational cost of the multilevel method. This is due to the fact that additional calls to the coarse level optimizers lower the number of required V-cycles, while at the same time not considerably increasing the cost of one V-cycle. The most expensive part of the V-cycle is the optimizer call on the finest level. Therefore, it is beneficial to skip it during the upward phase. This has no substantial impact on the performance of the method, as the upward optimizer step is immediately followed by the downward optimizer step of the next V-cycle.
Secondly, we study the sensitivity with respect to the choice of the learning rate and the regularization parameter. Here, we consider the co-centric circles example and three different values of the learning rate, α ∈ {0.05, 0.1, 0.5}, and of the regularization parameter, β ∈ {10^{−3}, 10^{−4}, 10^{−5}}. This yields nine different hyper-parameter setups, which we tested using the single level and 8-level ([(1), 1, 1, 1, 2, 2, 2, {2}]) methods. On average, the computational cost of the 8-level method is 3.54 times lower than the cost required by the single level method. The relative standard deviation of our results is 8.02%.

Conclusion
In this work, we proposed a nonlinear multilevel minimization framework for training deep residual networks. Our multilevel framework is based on the MG/OPT framework (Nash, 2000) and utilizes a hierarchy of auxiliary networks with different depths to speed up the training process of the original network. Using our novel training framework, we proposed multilevel gradient and mini-batch gradient methods. The performed numerical experiments demonstrated the convergence behavior of the multilevel training methods and showed a significant decrease in computational cost compared to their single level variants.