ROBUST CLASSIFICATION WITH FEATURE SELECTION USING AN APPLICATION OF THE DOUGLAS-RACHFORD SPLITTING ALGORITHM

This paper deals with supervised classification and feature selection in the context of high-dimensional features. A classical approach leads to an optimization problem minimizing the within-cluster sum of squares (ℓ2 norm) with an ℓ1 penalty in order to promote sparsity. It has been known for decades that the ℓ1 norm is more robust than the ℓ2 norm to outliers. In this paper, we address this issue using a new proximal splitting method for the minimization of a criterion that uses the ℓ1 norm both for the constraint and for the loss function. Since the ℓ1 criterion is convex but not gradient-Lipschitz, we advocate the use of a Douglas-Rachford minimization scheme. We take advantage of the particular form of the cost and, using a change of variables, we provide a new, efficient, tailored primal Douglas-Rachford splitting algorithm which is very effective on high-dimensional datasets. We also provide an efficient classifier in the projected space based on medoid modeling. Experiments on two biological datasets and a computer vision dataset show that our method significantly improves the results compared to those obtained using a quadratic loss function.


Introduction
In this paper we consider methods in which feature selection is embedded in a classification process [23,25]. It is well known that classification in high dimension suffers from the curse of dimensionality: as the dimension increases, vectors become indiscernible and the predictive power of the aforementioned methods is drastically reduced [1,36]. In order to overcome this issue, the main idea of the following methods is to project the data onto a low-dimensional space. A popular approach for high-dimensional data is to perform Principal Component Analysis (PCA) prior to classification. However, in general this approach is not relevant [44]. The Partial Least Squares (PLS) method, closely related to principal component regression, is designed to deal with high-dimensional correlated features [37,45]. An alternative approach is to perform dimension reduction by means of Linear Discriminant Analysis (LDA) [14,17]. The authors of [2] and [21] propose a convex relaxation of this approach in terms of a suitable semi-definite program (SDP), at the cost of an increased computational complexity. Nie et al. proposed a feature selection method based on ℓ2,1 norm minimization [34]. A popular approach for selecting sparse features in supervised classification or regression is the Least Absolute Shrinkage and Selection Operator (LASSO) formulation [22,28,33,42,46]. The LASSO formulation uses the ℓ1 norm [8,15,18,19] as an added penalty term instead of an ℓ0 term. In this paper, we propose to minimize an ℓ1 norm both for the penalty term and for the loss function. In this case, the criterion is convex but not gradient-Lipschitz. Thus, we propose to use the Douglas-Rachford splitting method for the minimization of our criterion. This splitting has been successfully used in signal processing [4,9,10,11,20,39,41].
However, for classification, we cannot apply the Douglas-Rachford method straightforwardly as in [7], since the proximal operator of the affine transform involved in the criterion is not available (see Equation (1)). We take advantage of the particular form of the criterion: the sum of two ℓ1 norms is equal to a single ℓ1 norm after a change of variables, and the linear constraint can be integrated into the cost. We also provide an efficient classifier in the projected space based on medoid modeling. The paper is organized as follows. Section 2 deals with the criterion and state-of-the-art methods. In Section 3 we develop the proposed solution for supervised classification. In Section 4 experimental results are provided on real biological datasets, and finally in Section 5 we conclude the paper.

A robust framework
Let X be the nonzero m × d matrix made of m row samples x_1, …, x_m belonging to the d-dimensional space of features. Let Y ∈ {0,1}^{m×k} be the label matrix, where k ≥ 2 is the number of clusters. Each row of Y has exactly one nonzero element, equal to one, y_ij = 1 indicating that the sample x_i belongs to the j-th cluster, see [2,21]. Let W ∈ R^{d×k} be the projection matrix, with k ≪ d. The classical quadratic loss criterion in the projected space is

  min_W ‖Y − XW‖_F²,    (1)

where ‖·‖_F stands for the Frobenius norm. Projecting the data in lower dimension is crucial for efficient feature selection [36]. Besides, it is well known that the quadratic Frobenius loss criterion is not robust to outliers [27,29], so we propose to minimize instead the ℓ1 loss, with an ℓ1 penalty regularization to promote sparsity and induce feature selection. So, given the matrix of labels Y, we consider the following convex supervised classification problem:

  min_W ‖Y − XW‖_1 + λ‖W‖_1.    (2)

Here, the ℓ1 norm of an m × n matrix A denotes the ℓ1 norm of its vectorization: ‖A‖_1 = Σ_{i,j} |a_ij|.
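The robustness argument above can be illustrated with a minimal numerical sketch (the residual matrix and its values are made up for illustration): a single outlying residual inflates the squared Frobenius loss quadratically, while the ℓ1 loss grows only linearly.

```python
import numpy as np

# Toy residual matrix R = Y - XW with one outlying entry.
R = np.array([[0.1, -0.2],
              [0.1,  0.1],
              [10.0, 0.0]])   # the 10.0 entry is the outlier

frobenius_sq = (R ** 2).sum()   # squared Frobenius loss
l1 = np.abs(R).sum()            # l1 loss (sum of absolute entries)

# The outlier contributes 100.0 to the quadratic loss but only 10.0
# to the l1 loss, so it dominates the former far more than the latter.
print(frobenius_sq, l1)
```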

State of the art methods based on ADMM
Problem (2) falls into the following class:

  min_{x,y} f(x) + g(y)  under the affine constraint y = Ax,

where x and y are vectors of finite-dimensional spaces. State-of-the-art methods based on ADMM (alternating direction method of multipliers, see [6,12,24,35]) consider the augmented Lagrangian (γ > 0 is a scalar parameter)

  L_γ(x, y, z) = f(x) + g(y) + ⟨z, Ax − y⟩ + (γ/2)‖Ax − y‖².

The Lagrangian is minimized over x, then over y, and a third step of the method updates the Lagrange multiplier z using a proximal maximization step. Note that the ADMM algorithm can be derived from an application of the Douglas-Rachford algorithm to the dual. In our case, we take advantage of the fact that the two functions f and g are simple convex functions (ℓ1 norms) whose sum is again a known convex function (yet another ℓ1 norm). This eliminates one function and allows us to penalize the affine constraint and integrate it into the cost. The resulting proximal operator is a simple projection onto a linear subspace, which can be computed explicitly (see, e.g., Lemma 7). As explained in the next paragraph, it is thus possible to apply a Douglas-Rachford algorithm in the primal.

An equivalent formulation
The cost in Problem (2) is the sum of two ℓ1 norms, which suggests the use of a splitting algorithm [12]. Such methods are very efficient for minimizing sums of convex functions and do not require differentiability. Note indeed that, having replaced the squared Frobenius norm by the ℓ1 norm of Y − XW, it is not possible to use forward-backward splitting [12], as neither of the functions is differentiable. In order to use the more general Douglas-Rachford scheme (see next subsection), which is able to cope with merely convex functions, one still has to be able to compute their proximity operators. Now, while the prox of the ℓ1 norm is well known and expressed in terms of soft thresholding, there is no explicit expression for the prox of the ℓ1 norm of the affine transform Y − XW. We propose to introduce the auxiliary variable ζ := (Y − XW)/λ ∈ R^{m×k} and to minimize ‖W‖_1 + ‖ζ‖_1 with respect to W and ζ under the affine constraint XW + λζ = Y. The sum of the two ℓ1 norms is equal to the single ℓ1 norm of the augmented variable W̃ := (W, ζ) ∈ R^{(d+m)×k}. Let C be the affine subset of (d+m) × k matrices such that X̃W̃ = Y, where

  X̃ := [X  λI_m]

and I_m is the m × m identity matrix. The problem can be recast as

  min_{W̃} ‖W̃‖_1  under the affine constraint X̃W̃ = Y.

Let moreover i_C be the indicator function of the set C, vanishing on C and equal to +∞ outside C. Then Problem (2) is equivalent to

  min_{W̃} ‖W̃‖_1 + i_C(W̃).    (3)

This new problem is readily equivalent to the previous one, since ‖W̃‖_1 = ‖W‖_1 + ‖ζ‖_1. (Note that the cost has been rescaled by a factor λ.) Problem (3) involves two convex functions whose proximity operators can be efficiently computed (see, e.g., Lemma 7), which allows a Douglas-Rachford scheme to be used.
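The change of variables can be checked numerically. The following sketch (random data, illustrative sizes) verifies that the augmented matrix X̃ = [X  λI_m] satisfies the affine constraint by construction, and that the two ℓ1 terms collapse into a single ℓ1 norm of W̃:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, k, lam = 4, 6, 2, 0.5
X = rng.standard_normal((m, d))
W = rng.standard_normal((d, k))
Y = rng.standard_normal((m, k))

zeta = (Y - X @ W) / lam                   # auxiliary variable
W_tilde = np.vstack([W, zeta])             # augmented (d+m) x k variable
X_tilde = np.hstack([X, lam * np.eye(m)])  # [X | lam * I_m]

# The affine constraint X~ W~ = Y holds by construction ...
assert np.allclose(X_tilde @ W_tilde, Y)
# ... and the two l1 terms collapse into a single l1 norm.
assert np.isclose(np.abs(W_tilde).sum(), np.abs(W).sum() + np.abs(zeta).sum())
```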
We recall [12,31] that the proximity operator prox_{τf}(x̄) of a lower semi-continuous convex function f at point x̄, with parameter τ > 0, is the unique minimizer of

  x ↦ τ f(x) + (1/2)‖x − x̄‖².

In our case, F is the ℓ1 norm, so prox_{τF} is given by soft thresholding with parameter τ. Indeed, one can separate the variables and use the prox of the scaled absolute value in dimension one:

  prox_{τ|·|}(t) = sign(t) max(|t| − τ, 0).

Since G is an indicator function, whatever τ, the associated prox is simply the projection operator onto the affine subspace C of (d+m) × k matrices. For the sake of completeness, we recall the expression of this projection.
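A minimal sketch of the soft-thresholding prox, applied entrywise (the input values are illustrative): entries with magnitude below τ are set to zero, the others are shrunk toward zero by τ.

```python
import numpy as np

def prox_l1(x, tau):
    """Proximity operator of tau * ||.||_1 (soft thresholding),
    applied entrywise to an array x."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

x = np.array([3.0, -0.5, 1.2, -2.0])
p = prox_l1(x, 1.0)   # the -0.5 entry is zeroed, the others shrink by 1
```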
Lemma. Let A be an m × n matrix of rank m < n, and let b be a vector in R^m. The orthogonal projection of z ∈ R^n onto the affine subspace {x ∈ R^n | Ax = b} is

  P(z) = z − A^T(AA^T)^{−1}(Az − b).

This Lemma can readily be applied in R^{(d+m)×k} to obtain the prox of G, independently of τ, and (4-5) can be translated into Algorithm 1.
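Putting the two proximity operators together, here is a minimal primal Douglas-Rachford sketch for a toy vector instance of the recast problem min ‖w‖_1 subject to Aw = b (the matrix A, vector b, step τ and iteration count are all illustrative choices, not the paper's settings):

```python
import numpy as np

def soft(z, tau):
    """Soft thresholding: prox of tau * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def project_affine(z, A, b, pinv):
    """Projection onto {x : A x = b} (formula of the Lemma);
    pinv = A^T (A A^T)^{-1} is precomputed once."""
    return z - pinv @ (A @ z - b)

def douglas_rachford_l1(A, b, tau=0.5, n_iter=5000):
    """Primal Douglas-Rachford sketch for min ||w||_1 s.t. A w = b."""
    pinv = A.T @ np.linalg.inv(A @ A.T)
    z = np.zeros(A.shape[1])
    for _ in range(n_iter):
        w = soft(z, tau)                                     # prox of F
        z = z + project_affine(2.0 * w - z, A, b, pinv) - w  # reflected prox of G
    return soft(z, tau)

# Tiny instance: the l1-minimal solution of w1+w2 = 1, w2+w3 = 1 is (0, 1, 0).
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 1.0])
w = douglas_rachford_l1(A, b)
```

The two prox calls per iteration are exactly the soft thresholding and the affine projection recalled above; precomputing A^T(AA^T)^{-1} once is what makes the projection cheap.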

Feature selection and scalability.
In applications, particularly biological ones, the issue of feature selection may be even more important than classification itself. This selection is achieved by means of the ℓ1 sparsity-inducing penalty in the cost of Problem (2). A feature i ∈ {1, …, d} is selected if the corresponding row of the weight matrix W is not zero (W(i, :) ≠ 0). We precompute X^T(XX^T)^{−1} only once. Note that XX^T is a small (m × m) matrix, so a fast dedicated algorithm can be used to compute its inverse. The complexity of the resulting iterations is O((d+m) × k) for the proximal (soft-thresholding) part, plus O(m × (d+m) × k) for the projection.
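A minimal sketch of the selection rule (the weight matrix below is made up, with d = 5 features and k = 2 clusters): a feature is kept exactly when its row of W is nonzero.

```python
import numpy as np

# Made-up sparse weight matrix W (d = 5 features, k = 2 clusters).
W = np.array([[0.0,  0.0],
              [1.3,  0.0],
              [0.0,  0.0],
              [0.2, -0.7],
              [0.0,  0.0]])

# A feature is selected when its row has at least one nonzero entry.
selected = np.flatnonzero(np.abs(W).sum(axis=1) > 0)
print(selected.tolist())   # rows 1 and 3 are the selected features
```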

Classification using medoid
In order to build a robust classifier, we compute a medoid for each cluster. In contrast to the k-means algorithm, k-medoids chooses actual data points as centers. For the j-th cluster, a medoid µ_j is any member of the class minimizing the average dissimilarity inside the class in the projected space. In general, the k-medoids problem is NP-hard to solve exactly. Heuristic solutions such as PAM (Partitioning Around Medoids) or Voronoi iteration methods have been proposed [38]. Let us define µ, the medoid matrix of the clusters. The medoid computation can be reformulated as the following ℓ1 minimization within each class:

  µ_j = argmin_{µ ∈ {x_i W : y_ij = 1}} Σ_{i : y_ij = 1} ‖x_i W − µ‖_1.

Then, a new query x (a d-dimensional row vector) is classified according to the following rule: it belongs to the (supposedly unique) class j̃ such that j̃ = argmin_{j=1,…,k} ‖xW − µ_j‖_1.
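The medoid computation and the classification rule can be sketched as follows (the two toy clusters of projected points are made-up values):

```python
import numpy as np

def medoid(points):
    """Medoid of a set of projected samples (rows): the member
    minimizing the total l1 dissimilarity to the other members."""
    dists = np.abs(points[:, None, :] - points[None, :, :]).sum(axis=(1, 2))
    return points[np.argmin(dists)]

def classify(x_proj, medoids):
    """Classification rule: the class whose medoid is l1-closest."""
    return int(np.argmin([np.abs(x_proj - mu).sum() for mu in medoids]))

# Toy projected data: two well-separated clusters (made-up values).
cluster0 = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]])
cluster1 = np.array([[5.0, 5.0], [5.1, 5.0], [5.0, 5.2]])
medoids = [medoid(cluster0), medoid(cluster1)]
label = classify(np.array([0.2, 0.2]), medoids)   # query near cluster 0
```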
The cost of medoid computation is expected to be O(m × k) on average [3].

Alternating Minimization
Let us now define the global problem (10), which couples the weight matrix W and the medoid matrix µ. We propose an alternating (Gauss-Seidel) scheme to solve Problem (10). Given the medoid matrix µ, the first subproblem, in W, is solved using our primal Douglas-Rachford algorithm. Conversely, given the matrix of weights W, the second subproblem, in µ, is solved by medoid computation on the projected data, as explained in the previous section. Algorithm 2 summarizes the resulting alternating minimization: at each of the L outer iterations, W is updated by the Douglas-Rachford algorithm, then µ ← medoids(Y, XW); the output is the pair (W, µ). Proposition. Since each step of the alternating minimization scheme decreases the norm ‖Y − XW‖_1, which is nonnegative, the following readily holds: the norm ‖Y − XW‖_1 converges as the number of iterations L in Algorithm 2 goes to infinity. Note that criterion (8) is convex in (µ, W), so a primal-dual method with convergence guarantees was proposed recently [5,9].

Application to real datasets
Experimental settings. We compare the labels obtained from our classification with the true labels to compute the misclassification error (MCE, i.e. the number of misclassified observations divided by the total number of observations) on the test set. We compare our ADRS method with state-of-the-art methods: the filter method using the T-test (we rank the features according to their p-values and use a PLS (partial least squares) classifier [45]), PLS with a sequential selection of features [25], and the Frobenius-norm criterion minimized with the classical forward-backward algorithm [12]. The algorithms are evaluated and compared on three public datasets (Ovarian, FaceDrive and Cancer RNA-Seq), available at https://archive.ics.uci.edu/ml/datasets. In our experiments, we set γ = 1 and N = 40. Results are reported after 4-fold cross-validation.
Dataset: Ovarian [26]. The data, available on the UCI database, were obtained from two sources: the National Cancer Institute (NCI) and the Eastern Virginia Medical School (EVMS). The samples include patients with cancer (ovarian or prostate cancer) and healthy or control patients, for a total of 216 samples with 15,000 features. The only reasonable contender would be the popular ADMM algorithm, first proposed by Gabay & Mercier in 1976 [24] and extensively developed in [6,12,35].

Conclusion
In this paper, we propose to minimize the ℓ1 norm of the data term with an ℓ1 regularization. Our contribution is twofold. First, we provide a new primal Douglas-Rachford splitting algorithm adapted to high-dimensional datasets. Second, we propose a new efficient classification algorithm using medoids and alternating minimization. Experiments on biological datasets and a computer vision dataset show that our method significantly improves the results of methods using a quadratic loss function. The authors thank intern Yuxiang Zhou for running the simulations and Professor Jean-Baptiste Caillau for fruitful discussions.

Appendix: Minimize Frobenius norm using a gradient-projection splitting method
The criterion is given by

  min_W (1/2)‖Y − XW‖_F² + λ‖W‖_1.

To solve this problem, we use a gradient-proximal method, which belongs to the class of splitting methods [12,13,30,32,40]. We use the following forward-backward scheme to generate the sequence of iterates:

  W_{n+1} = prox_{γλ‖·‖_1}(W_n + γ X^T(Y − XW_n)),

i.e. a gradient step on the smooth quadratic term followed by soft thresholding. For any fixed step γ ∈ (0, 2/σ²_max(X)), the forward-backward scheme applied to the above criterion converges towards a solution.
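The forward-backward scheme above can be sketched as follows on a small made-up instance (random X and Y, illustrative λ and iteration count); the step γ = 1/σ_max(X)² lies in the admissible range (0, 2/σ_max(X)²):

```python
import numpy as np

def soft(z, tau):
    """Soft thresholding: prox of tau * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def forward_backward(X, Y, lam, n_iter=500):
    """Forward-backward sketch for min (1/2)||Y - XW||_F^2 + lam ||W||_1:
    a gradient step on the smooth term, then soft thresholding."""
    gamma = 1.0 / np.linalg.norm(X, 2) ** 2   # step in (0, 2/sigma_max^2)
    W = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        W = soft(W + gamma * X.T @ (Y - X @ W), gamma * lam)
    return W

# Tiny made-up instance; the objective at the output should not exceed
# its value at the zero initialization.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 8))
Y = rng.standard_normal((20, 3))
W = forward_backward(X, Y, lam=0.1)
obj = 0.5 * ((Y - X @ W) ** 2).sum() + 0.1 * np.abs(W).sum()
```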