BOX-CONSTRAINED OPTIMIZATION FOR MINIMAX SUPERVISED LEARNING

In this paper, we present the optimization procedure for computing the discrete box-constrained minimax classifier introduced in [1, 2]. Our approach processes discrete or beforehand discretized features. A box-constrained region defines bounds for each class proportion independently. The box-constrained minimax classifier is obtained by computing the least favorable prior, which maximizes the minimum empirical risk of error over the box-constrained region. After studying the discrete empirical Bayes risk over the probabilistic simplex, we consider a projected subgradient algorithm which computes the prior maximizing this concave multivariate piecewise affine function over a polyhedral domain. The convergence of our algorithm is established.

Résumé. In this article, we present the optimization problem associated with the computation of a box-constrained minimax classifier, together with the algorithm for calibrating this classifier, which we introduced in [1, 2]. Our approach considers discrete or beforehand discretized descriptive variables. The box constraints define bounds on each class proportion. The classifier is calibrated by computing the prior distribution that maximizes the minimum risk of error over the box-constrained simplex. After showing that this minimum risk of error is a concave piecewise affine function with a finite number of faces over the simplex, we consider a projected subgradient algorithm to compute the prior distribution that maximizes this discrete Bayes risk over a polyhedral domain. The convergence of the algorithm is proved.


Introduction
Supervised classification is becoming essential in several real applications such as medical diagnosis, condition monitoring, or fraud detection. However, in such applications, we often have to face the following difficulties: imbalanced class proportions, prior probability shifts, presence of both numeric and categorical features (mixed attributes), and dependencies between some features.
Context and notation. Given $K \ge 2$ classes and a set $S = \{(Y_i, X_i),\ i \in I\}$ of $m$ labeled training samples, the objective in fitting a supervised classifier [3,4] is to learn a decision rule $\delta : \mathcal{X} \to \mathcal{Y} := \{1, \dots, K\}$ which assigns each sample $i \in I$ to a class $\delta(X_i) \in \mathcal{Y}$ from its feature vector $X_i := [X_{i1}, \dots, X_{id}] \in \mathcal{X}$ composed of $d$ observed attributes, and such that $\delta$ minimizes the empirical risk of classification errors

$$\hat{r}(\delta) = \frac{1}{m} \sum_{i \in I} L(Y_i, \delta(X_i)), \tag{1}$$

where $L : \mathcal{Y} \times \hat{\mathcal{Y}} \to [0, +\infty)$ is the loss function such that, for all $(k, l) \in \mathcal{Y} \times \hat{\mathcal{Y}}$, with $\hat{\mathcal{Y}} = \{1, \dots, K\}$ the set of predicted labels, $L(k, l) := L_{kl}$ corresponds to the loss, or the cost, of predicting the class $l$ when the real class is $k$. Let $\Delta := \{\delta : \mathcal{X} \to \hat{\mathcal{Y}}\}$ be the set of all possible classifiers.
Dealing with imbalanced datasets. The risk of classification errors (1) can be written as (see [5])

$$\hat{r}(\delta) = \sum_{k \in \mathcal{Y}} \hat{\pi}_k \hat{R}_k(\delta), \tag{2}$$

where $\hat{\pi} := [\hat{\pi}_1, \dots, \hat{\pi}_K]$ corresponds to the class proportions of the training set, such that for all $k \in \mathcal{Y}$, $\hat{\pi}_k := \frac{1}{m} \sum_{i \in I} \mathbb{1}_{\{Y_i = k\}}$, and where $\hat{R}_k(\delta)$ corresponds to the empirical class-conditional risk associated with class $k$, defined by

$$\hat{R}_k(\delta) := \sum_{l \in \hat{\mathcal{Y}}} L_{kl}\, \hat{P}_S(\delta(X_i) = l \mid Y_i = k). \tag{3}$$

In (3), $\hat{P}_S(\delta(X_i) = l \mid Y_i = k)$ corresponds to the empirical probability for the decision rule $\delta$ to predict the class $l$ given that the true class is $k$ on the set $S$ of samples. When the class proportions $\hat{\pi}$ are imbalanced (that is, when the classes are not equally represented), and as a consequence of (2), most classifiers essentially focus on the dominating classes containing the largest number of training samples, and neglect the least represented ones [6,7,8,9]. Hence, correctly classifying the instances from the smallest classes is difficult, which leads the minority classes to have a large conditional risk (3).
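The decomposition of the global risk into class proportions and class-conditional risks can be checked numerically. The following is a minimal sketch in plain Python on a toy sample; the function names (`class_proportions`, `class_conditional_risks`, `global_risk`) and the toy labels are illustrative assumptions, not part of the original work.

```python
from collections import Counter

def class_proportions(y_true, K):
    """Empirical class proportions pi_hat_k, classes labelled 1..K."""
    counts = Counter(y_true)
    m = len(y_true)
    return [counts.get(k, 0) / m for k in range(1, K + 1)]

def class_conditional_risks(y_true, y_pred, L):
    """R_hat_k: average loss L[k][l] over the samples whose true class is k."""
    K = len(L)
    risks = []
    for k in range(1, K + 1):
        preds_k = [p for t, p in zip(y_true, y_pred) if t == k]
        if not preds_k:
            risks.append(0.0)  # convention for an empty class (assumption)
            continue
        risks.append(sum(L[k - 1][p - 1] for p in preds_k) / len(preds_k))
    return risks

def global_risk(y_true, y_pred, L):
    """Empirical risk: average loss over all samples."""
    return sum(L[t - 1][p - 1] for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy imbalanced sample (hypothetical data): 8 samples of class 1, 2 of class 2.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 2, 2]
y_pred = [1, 1, 1, 1, 1, 1, 1, 2, 1, 2]
L01 = [[0, 1], [1, 0]]  # 0-1 loss

pi_hat = class_proportions(y_true, 2)
R_hat = class_conditional_risks(y_true, y_pred, L01)
r_hat = global_risk(y_true, y_pred, L01)
decomposed = sum(p * r for p, r in zip(pi_hat, R_hat))
print(pi_hat, R_hat, r_hat, decomposed)
```

In this toy case the global risk looks small while the minority class carries a much larger conditional risk, which is the imbalance effect described above.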
Dealing with prior probability shifts. Prior probability shift [10,11] characterizes a change in the distribution of the priors between the training set and the test samples. We will use the notation $\delta_\pi$ to specify that the classifier $\delta$ was fitted with the prior distribution $\pi$ in the $K$-dimensional probabilistic simplex $\mathbb{S} := \{\pi \in [0, 1]^K : \sum_{k=1}^{K} \pi_k = 1\}$. Let $S' = \{(Y'_i, X'_i),\ i \in I'\}$ be a test dataset containing $m'$ test instances and for which the class proportions $\pi' = [\pi'_1, \dots, \pi'_K]$ are unknown. The classifier $\delta_{\hat{\pi}}$ fitted on the training set $S$ is then used to predict the classes $Y'_i$ of the test samples $i \in I'$ from their associated attributes $X'_i \in \mathcal{X}$. To simplify our explanation, we assume that $\hat{P}_{S'}$ coincides with $\hat{P}_S$, i.e., there is no probability shift between the training set and the test set. Hence, our attention is focused on the prior shift, i.e., $\pi'$ is different from $\hat{\pi}$. As explained in [5], the risk of classification errors, with respect to the fitted classifier $\delta_{\hat{\pi}}$, as a function of $\pi'$, is

$$\hat{r}(\pi', \delta_{\hat{\pi}}) = \sum_{k \in \mathcal{Y}} \pi'_k \hat{R}_k(\delta_{\hat{\pi}}). \tag{4}$$
Since the $\hat{R}_k(\delta_{\hat{\pi}})$'s do not depend on $\pi'$, the risk (4) is clearly a linear function of $\pi'$. Hence, when the distribution of the priors is uncertain and changes over time, the risk of classification errors is expected to evolve linearly. This is an important issue since we generally do not know when and why prior probability shifts may occur. Note that we have $\hat{r}(\delta_{\hat{\pi}}) = \hat{r}(\hat{\pi}, \delta_{\hat{\pi}})$. The maximum value of $\hat{r}(\pi', \delta_{\hat{\pi}})$ that can be reached is $M(\delta_{\hat{\pi}}) := \max_{k \in \mathcal{Y}} \hat{R}_k(\delta_{\hat{\pi}})$. This issue is therefore especially pronounced when the class-conditional risks are imbalanced. An illustration of this issue is given in Figure 2 in Appendix A. Learning a classifier that is robust with respect to uncertain prior distributions is therefore necessary. This task belongs to the field of Bayesian Robustness [12] for Machine Learning.
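To make the linearity of the risk in the test priors concrete, here is a small sketch that evaluates the risk of a fixed classifier along a grid of two-class priors. The class-conditional risks used below are hypothetical values chosen for illustration, and `shifted_risk` is an illustrative name.

```python
def shifted_risk(pi_prime, class_risks):
    """Risk of a fixed classifier under test priors pi_prime: sum_k pi'_k * R_hat_k."""
    if abs(sum(pi_prime) - 1.0) > 1e-9:
        raise ValueError("pi_prime must lie on the probability simplex")
    return sum(p * r for p, r in zip(pi_prime, class_risks))

# Hypothetical class-conditional risks of a classifier fitted on imbalanced data.
R_hat = [0.125, 0.5]

# Let the proportion of class 1 in the test set range over [0, 1].
grid = [i / 4 for i in range(5)]
risks = [shifted_risk([p1, 1.0 - p1], R_hat) for p1 in grid]

# The worst case over the simplex is reached at a vertex: the largest R_hat_k.
worst_case = max(R_hat)
for p1, r in zip(grid, risks):
    print(f"pi'_1 = {p1:.2f} -> risk = {r:.5f}")
print("worst case:", worst_case)
```

Because the map is linear, the risk at any intermediate prior is the corresponding weighted average of the risks at the vertices of the simplex.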
Reminder on the minimax criterion. Training with imbalanced datasets and dealing with prior probability shifts share a common trait, namely the sensitivity to imbalanced class-conditional risks. In order to address this issue, a legitimate solution is to learn a minimax classifier [5,12,13,14]. Instead of minimizing the global risk of classification errors (1), the objective of this approach is to minimize $\max_{k \in \mathcal{Y}} \hat{R}_k(\delta_{\hat{\pi}})$. In other words, the minimax criterion tends to balance the class-conditional risks (3) as much as possible and, according to equation (4), the resulting decision rule becomes robust when $\pi'$ differs from $\hat{\pi}$. As explained in [5], learning a minimax classifier $\delta^M$ is equivalent to solving the following problem:

$$\delta^M = \arg\min_{\delta \in \Delta} \max_{\pi \in \mathbb{S}} \hat{r}(\pi, \delta). \tag{5}$$

Introduction of the box-constrained minimax classifier. Although the minimax classifier is adequate for addressing the issues regarding the class proportions, such a decision rule can sometimes appear too pessimistic, as discussed in [12,15]. This drawback essentially occurs when prior probability shifts can happen only over a subset of the simplex $\mathbb{S}$ and the global risk of classification errors associated with $\delta^M$ becomes too high. In order to alleviate this drawback, a solution is to shrink the constraint on the priors. In the literature, this task is called Γ-minimax classification [12], where Γ corresponds to a set containing only the acceptable prior distributions.
In this paper, we consider Γ to be a box-constraint $\mathbb{B}$ defined by

$$\mathbb{B} := \{\pi \in \mathbb{R}^K : \forall k = 1, \dots, K,\ 0 \le a_k \le \pi_k \le b_k \le 1\}, \tag{6}$$

which bounds each class proportion $\pi_k$ independently in the interval $[a_k, b_k]$ for all $k \in \mathcal{Y}$. The main asset of considering such a box-constraint stems from the fact that the experts of the application domain can easily and rationally build it, by providing independent bounds $[a_k, b_k]$ on each class proportion. When considering the new constraint (6) on the priors, we therefore set up the box-constrained simplex

$$\mathbb{U} := \mathbb{S} \cap \mathbb{B}. \tag{7}$$

Hence, to compute the box-constrained minimax classifier $\delta^C$ with respect to $\mathbb{U}$, the problem (5) becomes

$$\delta^C = \arg\min_{\delta \in \Delta} \max_{\pi \in \mathbb{U}} \hat{r}(\pi, \delta). \tag{8}$$

Let us note that the minimax classifier (5) is a particular case of the box-constrained minimax classifier (8). Indeed, the minimax classifier is recovered when considering $\mathbb{B} = [0, 1]^K$, so that $\mathbb{U} = \mathbb{S}$.
Dealing with both numeric and categorical features. Handling both numeric and categorical attributes makes it difficult to reach optimal results. To compute a minimax classifier, we need a good estimate of the joint distribution of the input features in each class. However, in the presence of mixed attributes, and due to the curse of dimensionality (as noted in [13,16]), this estimation is quite difficult. In such a case, a relevant solution is to discretize the numeric attributes in order to model the joint distribution of features with a probability mass function. Hence, since the number of values taken by the joint distribution is finite, we can estimate their probabilities of occurrence without making any assumptions of independence between the attributes. Many works have shown that the discretization of the numeric features generally leads to accurate results [17,18,19,20,21], with favorable statistical properties. For example, the true error rate of the histogram rule which minimizes the risk of error on a discrete training set can be calculated exactly as in [22,23,24]. In the following, we consider that all the features are discrete or beforehand discretized with a finite number of values.
Contributions. In this paper, we provide a new algorithm to compute the box-constrained minimax classifier (8) in the context of discrete or beforehand discretized features. In section 2, we develop the procedure for solving the minimax optimization problem (8). This procedure is based on a projected subgradient algorithm, which computes the least favorable prior over the polyhedral constraint (7). The convergence of this algorithm is established. In section 3, we illustrate on a real public database the performance of the box-constrained minimax classifier with respect to the box-constraint bounds. Finally, section 4 concludes the paper.

Computation of the Discrete Box-Constrained Minimax Classifier
In the following, we consider that all the features are discrete or beforehand discretized. In this section, given a box-constraint B, we present the optimization procedure for solving the minimax optimization problem (8).

Reasoning to compute our discrete Box-constrained minimax classifier
Dealing only with discrete or beforehand discretized features, it follows that each attribute $X_{ij}$ can take on a finite number of values $t_j$. Hence, the feature vector $X_i := [X_{i1}, \dots, X_{id}]$ takes its values in a finite set $\mathcal{X} = \{x_1, \dots, x_T\}$, where $T = \prod_{j=1}^{d} t_j$. Each vector $x_t$ can be interpreted as a "profile vector" which characterizes the samples. Let $\mathcal{T} = \{1, \dots, T\}$ be the set of indices.
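One simple way to enumerate the profiles is a mixed-radix encoding of the discrete feature vector. The sketch below is an illustrative assumption about how profiles could be indexed (the source does not specify an encoding); it uses 0-based feature values and 0-based profile indices, whereas the text indexes profiles from 1 to T. The name `profile_index` is hypothetical.

```python
from itertools import product
from math import prod

def profile_index(x, levels):
    """Map a discrete feature vector x, with x[j] in 0..levels[j]-1,
    to a single profile index t in 0..T-1 (mixed-radix encoding)."""
    t = 0
    for value, n_levels in zip(x, levels):
        if not 0 <= value < n_levels:
            raise ValueError("feature value outside its discretization range")
        t = t * n_levels + value
    return t

# Hypothetical example: d = 3 features with 2, 3 and 2 possible values.
levels = [2, 3, 2]
T = prod(levels)  # number of distinct profiles

# Every combination of feature values receives its own index.
all_indices = [profile_index(x, levels)
               for x in product(*(range(n) for n in levels))]
print(T, sorted(all_indices) == list(range(T)))
```

With this encoding, a training sample is fully described by its class label and a single integer, which is all the estimation step that follows requires.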
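Since $|\mathcal{X}| = T$ is finite, it follows that $|\Delta| = |\hat{\mathcal{Y}}|^{|\mathcal{X}|} = K^T$ is finite. When the set of classifiers $\Delta$ is finite, the famous Minimax Theorem [25] establishes that

$$\min_{\delta \in \Delta} \max_{\pi \in \mathbb{U}} \hat{r}(\pi, \delta) = \max_{\pi \in \mathbb{U}} \min_{\delta \in \Delta} \hat{r}(\pi, \delta). \tag{9}$$

Let us define $\delta^B_\pi$ as the optimal Bayes classifier associated with the given priors $\pi \in \mathbb{S}$, such that

$$\delta^B_\pi := \arg\min_{\delta \in \Delta} \hat{r}(\pi, \delta). \tag{10}$$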
Let $V(\pi) = \hat{r}(\pi, \delta^B_\pi)$ denote the Bayes risk for a given $\pi$: it is the minimum risk for a given $\pi$. Hence, according to (9), and provided that we can calculate $\delta^B_\pi$ and its minimum Bayes risk $V(\pi)$ for any prior $\pi \in \mathbb{U}$, the optimization problem (8) is equivalent to computing the least favorable priors

$$\pi^\star := \arg\max_{\pi \in \mathbb{U}} V(\pi), \tag{11}$$

so that the solution $\delta^C$ of (8) is the Bayes classifier given by (10) with the priors (11). The least favorable priors are generally difficult to compute, as underlined in [12,26,27,28]. Subsection 2.2 is devoted to the calculation of the minimum Bayes risk $V(\pi)$ over the simplex. Subsection 2.3 is devoted to computing the least favorable priors $\pi^\star$ solution of (11).

Calculation of the minimum empirical risk over the simplex
Dealing only with discrete or beforehand discretized features, we can estimate from the labeled learning instances $S = \{(Y_i, X_i),\ i \in I\}$ the probabilities $\hat{p}_{kt}$ of observing the feature profile $x_t \in \mathcal{X}$ given that the class label is $k$, for all $t \in \mathcal{T}$ and for all $k \in \mathcal{Y}$, such that

$$\hat{p}_{kt} := \frac{1}{m_k} \sum_{i \in I_k} \mathbb{1}_{\{X_i = x_t\}}. \tag{12}$$

In (12), for all $k \in \mathcal{Y}$, $I_k = \{i \in I : Y_i = k\}$ denotes the set of learning samples from the class $k$, and $m_k = |I_k|$ corresponds to the number of instances in $I_k$. Since we can only consider the instances from the training set, the probabilities $\hat{p}_{kt}$ defined in (12) are estimated once and for all. Indeed, statistical estimation theory [29] has established that the estimates $\hat{p}_{kt}$ correspond to the maximum likelihood estimates of the true probabilities $p_{kt}$ for all couples $(k, t) \in \mathcal{Y} \times \mathcal{T}$. By estimating these probabilities with the full training set, we get the best unbiased estimate with the smallest variance. This paper assumes that these class-conditional probabilities are representative of the test set, i.e., that the test samples follow the same theoretical class-conditional probabilities as the training samples.
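The estimation of the class-conditional profile probabilities amounts to normalized counting. The following sketch assumes that each training sample has already been reduced to a class label in 1..K and a 0-based profile index; the function name `estimate_profile_probabilities` and the toy data are illustrative.

```python
def estimate_profile_probabilities(y, t_idx, K, T):
    """Return p_hat with p_hat[k-1][t] = fraction of the class-k training
    samples whose profile index equals t."""
    counts = [[0] * T for _ in range(K)]
    for label, t in zip(y, t_idx):
        counts[label - 1][t] += 1
    p_hat = []
    for row in counts:
        m_k = sum(row)  # number of training samples in this class
        # An empty class gets an all-zero row (convention, an assumption).
        p_hat.append([c / m_k if m_k else 0.0 for c in row])
    return p_hat

# Hypothetical toy sample: 4 samples of class 1, 2 samples of class 2, T = 3 profiles.
y = [1, 1, 1, 1, 2, 2]
t_idx = [0, 0, 1, 2, 2, 2]
p_hat = estimate_profile_probabilities(y, t_idx, K=2, T=3)
print(p_hat)
```

Each row of `p_hat` sums to one whenever the corresponding class is represented in the training set.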
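The following theorem provides the analytic formula of the discrete Bayes classifier (10) associated with the class proportions $\pi$, and its associated risk.

Theorem 1. The empirical Bayes classifier $\delta^B_\pi$, which minimizes the empirical risk (10) over $\Delta$, is given by

$$\delta^B_\pi : X_i \mapsto \arg\min_{l \in \hat{\mathcal{Y}}} \sum_{t \in \mathcal{T}} \sum_{k \in \mathcal{Y}} L_{kl}\, \pi_k\, \hat{p}_{kt}\, \mathbb{1}_{\{X_i = x_t\}}. \tag{13}$$

Its associated empirical risk is

$$V(\pi) = \hat{r}(\pi, \delta^B_\pi) = \sum_{k \in \mathcal{Y}} \pi_k \hat{R}_k(\delta^B_\pi), \tag{14}$$

where, for all $k \in \mathcal{Y}$,

$$\hat{R}_k(\delta^B_\pi) = \sum_{t \in \mathcal{T}} \sum_{l \in \hat{\mathcal{Y}}} L_{kl}\, \hat{p}_{kt}\, \mathbb{1}_{\{\lambda_{lt} = \min_{q \in \hat{\mathcal{Y}}} \lambda_{qt}\}}, \tag{15}$$

with, for all $l \in \hat{\mathcal{Y}}$ and all $t \in \mathcal{T}$, $\lambda_{lt} = \sum_{k \in \mathcal{Y}} L_{kl}\, \pi_k\, \hat{p}_{kt}$.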
Proof. The proof is established in Theorem 1 in [1].
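The formulas of Theorem 1 translate directly into a few loops over profiles and classes. The sketch below is a minimal illustration with 0-based labels and profile indices; the function name `bayes_rule_and_risks` is hypothetical, and ties between labels are broken by taking the smallest label, which is an assumption of this sketch.

```python
def bayes_rule_and_risks(pi, p_hat, L):
    """Discrete Bayes rule for the prior pi.

    Returns (decisions, R, V) where decisions[t] is the label predicted for
    profile t, R[k] is the class-conditional risk of class k and V is the
    Bayes risk sum_k pi[k] * R[k]."""
    K, T = len(p_hat), len(p_hat[0])
    decisions = []
    R = [0.0] * K
    for t in range(T):
        # lam[l] = sum_k L[k][l] * pi[k] * p_hat[k][t]
        lam = [sum(L[k][l] * pi[k] * p_hat[k][t] for k in range(K))
               for l in range(K)]
        l_star = lam.index(min(lam))  # ties broken by smallest label (assumption)
        decisions.append(l_star)
        for k in range(K):
            R[k] += L[k][l_star] * p_hat[k][t]
    V = sum(pi[k] * R[k] for k in range(K))
    return decisions, R, V

# Hypothetical profile probabilities for K = 2 classes and T = 3 profiles.
p_hat = [[0.6, 0.3, 0.1],
         [0.1, 0.3, 0.6]]
L01 = [[0, 1], [1, 0]]  # 0-1 loss

decisions, R, V = bayes_rule_and_risks([0.5, 0.5], p_hat, L01)
print(decisions, R, V)

# Risk of the two constant classifiers under the same prior, for comparison.
constant_risks = [sum(0.5 * L01[k][l] for k in range(2)) for l in range(2)]
```

Since the Bayes rule minimizes the risk for the given prior, its risk cannot exceed that of any other classifier, including the constant ones.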
In other words, the function $V : \pi \in \mathbb{S} \mapsto V(\pi)$ gives the minimum value of the empirical risk when the class proportions are $\pi$ and the class-conditional probabilities $\hat{p}_{kt}$ remain unchanged. The following proposition studies the function $V$ over $\mathbb{S}$.

Proposition 1. The empirical Bayes risk $V : \pi \mapsto V(\pi)$ is a concave multivariate piecewise affine function over the simplex $\mathbb{S}$ with a finite number of pieces. Moreover, if the following condition is satisfied,

$$\exists\, (\pi, \pi') \in \mathbb{S} \times \mathbb{S},\ \exists\, k \in \mathcal{Y} : \quad \hat{R}_k(\delta^B_\pi) \ne \hat{R}_k(\delta^B_{\pi'}), \tag{16}$$

then $V$ is non-differentiable over the simplex $\mathbb{S}$.
Proof. The proof is established in Proposition 1, Proposition 2 and Corollary 1 in [1].
Note that the condition (16) is almost always satisfied. Otherwise, it would mean that each class-conditional risk $\hat{R}_k(\delta^B_\pi)$ would remain the same whatever the prior $\pi \in \mathbb{S}$, even at the vertices of the simplex, and the empirical Bayes risk $V$ would be an affine function over $\mathbb{S}$.

Maximization of the minimum empirical risk V over U
In order to compute our box-constrained minimax classifier, according to (11) and when considering (14), our objective is to solve the following optimization problem

$$\pi^\star = \arg\max_{\pi \in \mathbb{U}} V(\pi) = \arg\max_{\pi \in \mathbb{U}} \sum_{k \in \mathcal{Y}} \pi_k \hat{R}_k(\delta^B_\pi). \tag{17}$$

Since $V : \pi \mapsto V(\pi)$ is non-differentiable as soon as the condition (16) is satisfied, it is necessary to develop an optimization algorithm adapted to both the non-differentiability of $V$ and the domain $\mathbb{U}$. To this aim, we propose to use a projected subgradient algorithm based on [30] that follows the scheme

$$\pi^{(n+1)} = P_{\mathbb{U}}\!\left(\pi^{(n)} + \frac{\gamma_n}{\eta_n}\, g^{(n)}\right). \tag{18}$$
In (18), at each iteration $n \ge 1$, $g^{(n)}$ denotes a subgradient of $V$ at the point $\pi^{(n)}$, $\gamma_n$ denotes the subgradient step, $\eta_n = \max\{1, \|g^{(n)}\|_2\}$, and $P_{\mathbb{U}}$ denotes the exact projection onto the box-constrained simplex $\mathbb{U}$. Let us note that this algorithm remains applicable in the particular case where the condition (16) is not satisfied, i.e. when the function $V$ is affine over $\mathbb{U}$. The following lemma gives a subgradient of the target function $V$.

Lemma 1. Given $\pi \in \mathbb{U}$, the vector composed of all the class-conditional risks $\hat{R}(\delta^B_\pi) := [\hat{R}_1(\delta^B_\pi), \dots, \hat{R}_K(\delta^B_\pi)]$ is a subgradient of $V$ at the point $\pi$.
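Proof. Let us recall that, for a concave function $f$ defined on a convex set $C \subset \mathbb{R}^K$, a vector $g \in \mathbb{R}^K$ is a subgradient of $f$ at $x \in C$ if, for all $y \in C$, $f(y) \le f(x) + g^\top (y - x)$. In our case, given $\pi \in \mathbb{U}$, let us consider $\pi' \in \mathbb{U}$. Denoting by $\hat{R}(\delta^B_\pi) := [\hat{R}_1(\delta^B_\pi), \dots, \hat{R}_K(\delta^B_\pi)]$ the vector of all class-conditional risks, we get:

$$V(\pi') = \sum_{k \in \mathcal{Y}} \pi'_k \hat{R}_k(\delta^B_{\pi'}) \le \sum_{k \in \mathcal{Y}} \pi'_k \hat{R}_k(\delta^B_\pi) = V(\pi) + \hat{R}(\delta^B_\pi)^\top (\pi' - \pi).$$

The inequality follows from the fact that $\delta^B_{\pi'}$ minimizes $\hat{r}(\pi', \cdot)$ over $\Delta$, and the last equality uses $V(\pi) = \sum_{k \in \mathcal{Y}} \pi_k \hat{R}_k(\delta^B_\pi)$ from (14). This inequality holds for any $\pi' \in \mathbb{U}$, hence the result.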
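The subgradient inequality of Lemma 1 can also be checked numerically on random pairs of priors. The sketch below is only a sanity check under assumed toy profile probabilities for K = 3 classes; `bayes_risks` and `random_prior` are illustrative names, and labels are 0-based.

```python
import random

def bayes_risks(pi, p_hat, L):
    """Class-conditional risks R and Bayes risk V of the discrete Bayes rule for prior pi."""
    K, T = len(p_hat), len(p_hat[0])
    R = [0.0] * K
    for t in range(T):
        lam = [sum(L[k][l] * pi[k] * p_hat[k][t] for k in range(K))
               for l in range(K)]
        l_star = lam.index(min(lam))  # ties broken by smallest label (assumption)
        for k in range(K):
            R[k] += L[k][l_star] * p_hat[k][t]
    return R, sum(p * r for p, r in zip(pi, R))

def random_prior(K, rng):
    """Draw a random point of the probability simplex."""
    w = [rng.random() + 1e-9 for _ in range(K)]
    total = sum(w)
    return [x / total for x in w]

rng = random.Random(0)
p_hat = [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6], [0.2, 0.5, 0.3]]  # hypothetical values
L01 = [[0 if k == l else 1 for l in range(3)] for k in range(3)]

violations = 0
for _ in range(1000):
    pi, pi_prime = random_prior(3, rng), random_prior(3, rng)
    g, V_pi = bayes_risks(pi, p_hat, L01)          # g plays the role of the subgradient
    _, V_prime = bayes_risks(pi_prime, p_hat, L01)
    upper = V_pi + sum(gk * (q - p) for gk, p, q in zip(g, pi, pi_prime))
    if V_prime > upper + 1e-12:                    # small tolerance for rounding
        violations += 1
print("violations:", violations)
```

No violation is expected, since the inequality holds for every pair of priors by the argument of the proof.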
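In the following, we choose $g^{(n)} = \hat{R}(\delta^B_{\pi^{(n)}})$ at each iteration $n \ge 1$ in (18). The following theorem establishes the convergence of the iterates (18) to $\pi^\star$.

Theorem 2. When considering $g^{(n)} = \hat{R}(\delta^B_{\pi^{(n)}})$ and any sequence of steps $(\gamma_n)_{n \ge 1}$ satisfying

$$\sum_{n=1}^{+\infty} \gamma_n = +\infty \quad \text{and} \quad \sum_{n=1}^{+\infty} \gamma_n^2 < +\infty, \tag{19}$$

the sequence of iterates (18) converges strongly to a solution $\pi^\star$ of (17), whatever the initialization $\pi^{(1)} \in \mathbb{S}$.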
Proof. The proof is a consequence of Theorem 1 in [30]. Here we have strong convergence since $\pi^{(n)}$ belongs to a finite-dimensional space.

Remark 1.
In the general case where the empirical Bayes risk $V$ is not constantly equal to zero over $\mathbb{S}$, the subgradient $\hat{R}(\delta^B_{\pi^\star})$ at the box-constrained minimax optimum is not null. Otherwise, the associated risk $V(\pi^\star)$ would vanish due to (14). This would contradict the fact that $\pi^\star$ is a solution of (17).
According to Remark 1, in the general case where the empirical Bayes risk V is not constantly equal to zero over S, the sequence of iterates (18) is infinite, and we need to consider a stopping criterion. To this aim, we propose to follow the reasoning in [31] which leads to the following corollary.
Proof. The proof is detailed in Appendix B.
In practice we can choose $\rho^2 = K$ since all the proportions belong to the probabilistic simplex. Since the bound in (20) converges to 0 as $N \to \infty$, we can choose a small tolerance $\varepsilon > 0$ as a stopping criterion: we fix $\varepsilon$ and then compute $N = N_\varepsilon$ such that the bound in (20) is smaller than $\varepsilon$.

Exact projection onto the box constrained region
When considering the sequence of iterates (18), we need to compute the exact projection onto the box-constrained probabilistic simplex $\mathbb{U}$ at each iteration $n$. Let us recall that $\mathbb{U} = \mathbb{S} \cap \mathbb{B}$, where $\mathbb{B} := \{\pi \in \mathbb{R}^K : \forall k = 1, \dots, K,\ 0 \le a_k \le \pi_k \le b_k \le 1\}$. The set $\mathbb{U}$ can be described by $2K + 2$ linear inequalities, indexed by $i \in \{1, \dots, 2K + 2\}$ and built from the vectors $e_k$ and $\mathbb{1}_K$, where, for all $k \in \{1, \dots, K\}$, $e_k \in \mathbb{R}^K$ is the indicator vector with 1 in coordinate $k$, and $\mathbb{1}_K \in \mathbb{R}^K$ is the vector fully composed of ones. We can therefore write $\mathbb{U}$ as the intersection of the corresponding half-spaces. In other words, our box-constrained simplex $\mathbb{U}$ is a polyhedral set. Thus, in order to compute the exact projection onto $\mathbb{U}$, we propose to use the algorithm provided in [32], which computes the exact projection onto polyhedral sets in Hilbert spaces. Let us note that in the case where we are interested in computing the minimax classifier (5), we have $\mathbb{U} = \mathbb{S}$, and we can perform the projection onto $\mathbb{S}$ using the algorithm provided in [33], for which the complexity is lower.
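For readers who want to experiment without implementing the general polyhedral projection of [32], the Euclidean projection onto this particular set can also be obtained by a one-dimensional search. The sketch below is a swapped-in technique, not the algorithm of [32]: it uses the fact that the projection of $v$ onto $\{\pi : \sum_k \pi_k = 1,\ a \le \pi \le b\}$ has the form $\pi_k = \min(\max(v_k - \tau, a_k), b_k)$ for a scalar $\tau$, and finds $\tau$ by bisection. The name `project_box_simplex` and the tolerance are illustrative assumptions.

```python
def project_box_simplex(v, a, b, tol=1e-12, max_iter=200):
    """Euclidean projection of v onto {pi : sum(pi) = 1, a <= pi <= b},
    computed by bisection on a scalar shift tau."""
    if sum(a) > 1.0 or sum(b) < 1.0:
        raise ValueError("the box does not intersect the simplex")

    def clipped(tau):
        return [min(max(vk - tau, ak), bk) for vk, ak, bk in zip(v, a, b)]

    # At tau = lo every coordinate sits at its upper bound (sum >= 1);
    # at tau = hi every coordinate sits at its lower bound (sum <= 1).
    lo = min(vk - bk for vk, bk in zip(v, b))
    hi = max(vk - ak for vk, ak in zip(v, a))
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if sum(clipped(mid)) > 1.0:
            lo = mid  # the sum decreases with tau, so tau must grow
        else:
            hi = mid
        if hi - lo < tol:
            break
    return clipped(0.5 * (lo + hi))

# Hypothetical two-class box: pi_1 in [0.7, 0.95], pi_2 in [0.1, 0.3].
a, b = [0.7, 0.1], [0.95, 0.3]
inside = project_box_simplex([0.9, 0.4], a, b)    # plain simplex projection is feasible
outside = project_box_simplex([0.2, 0.9], a, b)   # the box bounds become active
print(inside, outside)
```

The returned point satisfies the sum constraint only up to the bisection tolerance, so this sketch approximates the exact projection that the convergence analysis assumes.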

Box-constrained minimax classifier Algorithm
The procedure for computing the box-constrained minimax classifier $\delta^B_{\pi^\star}$ is summarized in Algorithm 1. In practice, we choose the sequence of steps $\gamma_n = 1/n$, $n \ge 1$, which satisfies (19).
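The overall procedure can be sketched end to end as follows. This is a minimal illustration under stated assumptions rather than a transcription of Algorithm 1: it runs a fixed number of iterations instead of the stopping rule derived from (20), it replaces the projection of [32] by a bisection-based projection, it keeps the best iterate seen, and all function names and toy values are hypothetical.

```python
import math

def bayes_risks(pi, p_hat, L):
    """Class-conditional risks R and Bayes risk V of the discrete Bayes rule for prior pi."""
    K, T = len(p_hat), len(p_hat[0])
    R = [0.0] * K
    for t in range(T):
        lam = [sum(L[k][l] * pi[k] * p_hat[k][t] for k in range(K))
               for l in range(K)]
        l_star = lam.index(min(lam))  # ties broken by smallest label (assumption)
        for k in range(K):
            R[k] += L[k][l_star] * p_hat[k][t]
    return R, sum(p * r for p, r in zip(pi, R))

def project_box_simplex(v, a, b, tol=1e-12, max_iter=200):
    """Projection onto {sum(pi) = 1, a <= pi <= b} by bisection (swapped-in technique)."""
    def clipped(tau):
        return [min(max(vk - tau, ak), bk) for vk, ak, bk in zip(v, a, b)]
    lo = min(vk - bk for vk, bk in zip(v, b))
    hi = max(vk - ak for vk, ak in zip(v, a))
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if sum(clipped(mid)) > 1.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return clipped(0.5 * (lo + hi))

def box_minimax_prior(p_hat, L, a, b, pi_init, n_iter=2000):
    """Projected subgradient ascent on V with steps gamma_n = 1/n."""
    pi = project_box_simplex(pi_init, a, b)  # make sure the first iterate is feasible
    best_pi, best_V = pi, -math.inf
    for n in range(1, n_iter + 1):
        g, V = bayes_risks(pi, p_hat, L)     # g is a subgradient of V at pi
        if V > best_V:
            best_pi, best_V = pi, V
        eta = max(1.0, math.sqrt(sum(gk * gk for gk in g)))
        step = (1.0 / n) / eta
        pi = project_box_simplex([p + step * gk for p, gk in zip(pi, g)], a, b)
    return best_pi, best_V

# Hypothetical profile probabilities for K = 2 classes and T = 3 profiles.
p_hat = [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]]
L01 = [[0, 1], [1, 0]]
pi_free, V_free = box_minimax_prior(p_hat, L01, [0.0, 0.0], [1.0, 1.0], [0.85, 0.15])
pi_box, V_box = box_minimax_prior(p_hat, L01, [0.7, 0.1], [0.95, 0.3], [0.85, 0.15])
print(pi_free, V_free)
print(pi_box, V_box)
```

On this toy problem the unconstrained run approaches the balanced prior, while the box-constrained run stops at the edge of the box, with a smaller worst-case risk over the restricted set of priors.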

Numerical experiments
Database description. To illustrate the interest of our box-constrained minimax classifier, we applied our algorithm to the real public Framingham database [34]. The objective of the Framingham study is to predict the development of a Coronary Heart Disease (CHD) within 10 years based on $d = 15$ clinical and biological features (7 categorical and 8 numeric). In this paper, we do not study the effects of the discretization of continuous features, and we consider the discretized attributes as built in [1,2]. This database contains $K = 2$ classes, with class 2 corresponding to individuals who have developed a CHD, and class 1 corresponding to the others. For this database, 3,658 patients have been followed for 10 years. Among these patients, 85% did not develop a CHD, while 15% developed a CHD within 10 years. In other words, the dataset is imbalanced: $\hat{\pi} = [0.85, 0.15]$, which complicates the task of correctly predicting a CHD based on the labeled learning observations. For this experiment, let us consider the $L_{0\text{-}1}$ loss function, such that $L_{11} = L_{22} = 0$ and $L_{12} = L_{21} = 1$.
Procedure of the experiment and results. In the following, let $\bar{\pi} := \arg\max_{\pi \in \mathbb{S}} V(\pi)$ be the least favorable priors over the simplex $\mathbb{S}$, and thus let $\delta^B_{\bar{\pi}}$ be the minimax classifier $\delta^M$ solution of (5). The box-constrained minimax classifier $\delta^B_{\pi^\star}$ solution of (8) aims to find a trade-off between achieving an acceptable global risk and balancing the class-conditional risks with respect to the box-constraint (6). In other words, the box-constrained minimax classifier $\delta^B_{\pi^\star}$ is designed to find a trade-off between the discrete Bayes classifier $\delta^B_{\hat{\pi}}$ (13) associated with the class proportions $\hat{\pi}$ of the training set, and the minimax classifier $\delta^B_{\bar{\pi}}$. These results depend on the box-constraint bounds. In practice, the box-constraint can be established by the experts of the application field by bounding some or all of the prior probabilities independently. If the results are not satisfactory enough, the experts can easily tighten or widen the box-constraint bounds in order to find an acceptable trade-off between balancing the class-conditional risks and achieving an acceptable global risk of error.
Note that the minimax classifier $\delta^B_{\bar{\pi}}$ was trained using our Algorithm 1 when considering $\mathbb{U} = \mathbb{S}$, and for this particular case the projection onto the simplex $\mathbb{S}$ was performed using the algorithm provided in [33].
The procedure of our experiment is the following: we performed a 10-fold cross-validation, that is, we randomly split the main dataset so that 90% of the instances formed the training set and the remaining 10% formed the test set. We repeated this splitting ten times and, at each repetition of this cross-validation, we ranged $\beta$ from 0 to 1 in (22) so as to increase the box-constraint radius $\rho_\beta$, and we measured $V(\pi^\star)$ and $\psi(\delta^B_{\pi^\star})$, where $\psi : \Delta \to \mathbb{R}_+$ is a criterion that measures how well a given classifier $\delta \in \Delta$ balances the class-conditional risks.
The results of the experiment are presented in Figure 1. We can observe that as $\beta$ increases, and thus as the radius $\rho_\beta$ increases, $\delta^B_{\pi^\star}$ balances the class-conditional risks better, and therefore predicts better the patients who tend to develop a CHD. However, as $\beta$ increases, and thus as the radius $\rho_\beta$ increases, $\delta^B_{\pi^\star}$ becomes more pessimistic, since $V(\pi^\star)$ converges to $V(\bar{\pi})$, which is the maximum value of $V$. Concerning the computing time, at each iteration of the cross-validation procedure, the time to train our Discrete Box-constrained Minimax Classifier $\delta^B_{\pi^\star}$ was around 0.67s for each parameter $\beta$, against 0.01s for the Discrete Bayes Classifier $\delta^B_{\hat{\pi}}$ (13). Note that the computing time associated with the Discrete Minimax Classifier $\delta^B_{\bar{\pi}}$ was around 0.35s using the algorithm provided in [33] to project onto the simplex $\mathbb{S}$, which is faster than the training time associated with $\delta^B_{\pi^\star}$.

Figure 1. Impact of the box-constraint radius on $\delta^B_{\pi^\star}$ when $\beta$ increases from 0 to 1 in (22), after a 10-fold cross-validation procedure. The results are presented as [mean ± std].

Conclusion
This paper presents the optimization procedure for computing a box-constrained minimax classifier in the context of discrete or discretized features with multiple classes, a positive loss function, and some dependencies between the features. This minimax classifier aims to address the issues of imbalanced datasets and prior probability shifts. Our method is in the field of Γ-minimaxity and Bayesian Robustness for Machine Learning. Our approach is designed for considering independent bounds on the class proportions, which can be easily and rationally provided by the experts from the application domain, and which allow us to find a trade-off between minimizing the maximum of the class conditional risks, and achieving an acceptable global risk of errors, based on the interest or the knowledge of the experts.
The computation of the box-constrained minimax classifier results from the computation of the least favorable prior which maximizes the minimum empirical risk of classification errors over the box-constrained probabilistic simplex, using a projected subgradient algorithm. The convergence of our algorithm is established.
An important direction for future work would be to reduce the computation time of the exact projection onto the box-constrained simplex, which would be essential for dealing with databases containing a large number of classes.
A. Illustration of prior probability shifts issue

Figure 2. For this experiment we generated a training dataset (top left) containing $m = 5{,}000$ instances described by $d = 2$ features and clustered into $K = 2$ classes, with class proportions $\hat{\pi} = [0.90, 0.10]$. We then trained the Logistic Regression $\delta^{LR}_{\hat{\pi}}$ on this training set and applied it to 5 different test sets containing $m' = 1{,}000$ instances each. Each dataset was generated using the make_blobs function provided by Scikit-Learn [35] from the same feature distributions in each class, but the test sets differ in their class proportions $\pi'$, which range over the simplex $\mathbb{S}$. The last subfigure describes the global risk associated with each dataset. Since we have $K = 2$ classes, the global risk (4) associated with $\delta^{LR}_{\hat{\pi}}$ can be written as

$$\hat{r}(\pi', \delta^{LR}_{\hat{\pi}}) = \pi'_1 \left[\hat{R}_1(\delta^{LR}_{\hat{\pi}}) - \hat{R}_2(\delta^{LR}_{\hat{\pi}})\right] + \hat{R}_2(\delta^{LR}_{\hat{\pi}}).$$

It follows that, Let us recall that for all $i \in \{1, \dots, n\}$, $\eta_i = \max\{1, \|g^{(i)}\|_2\}$. We can therefore distinguish two cases: where for all $k \in \mathcal{Y}$, $\hat{R}$ Finally, coming back to equation (25), and since $\|\pi^{(1)} - \pi^\star\|_2^2 \le K$, it follows that