Bayes in action in deep learning and dictionary learning

. This article summarizes some recent works and associated challenges in the field of Bayesian statistics that were presented during the Journées MAS 2020. The goal of the session was to give an overview of the many aspects of Bayesian statistics investigated by young researchers of the community. Résumé. Cet article résume quelques travaux récents et leurs défis dans le domaine de la Statistique Bayésienne. Ces travaux ont été présentés aux Journées MAS en 2020. Le but de la session était de donner un aperçu de tous les aspects de la Statistique Bayésienne investigués par de jeunes chercheuses et chercheurs de la communauté.


Introduction
On Thursday, 26th of August 2021 the "Bayesian statistics" session was held at the Journées MAS 2020 in Orléans.This article describes some of the works presented during this session.
The Bayesian interpretation of probabilities stipulates that probabilities must be understood as a measure of the degree of belief in the event.The main philosophy of Bayesian inference is based on this idea.In short, a Bayesian statistician starts with an initial belief on the object of interest (the prior belief) and updates this belief on the basis of the observations at hand (the posterior belief).The prior belief may be based on anterior studies, knowledge, or personal belief.This differs from the frequentist paradigm, which is based on the other main interpretation of probabilities, in which the probability of an event is viewed as the limiting value of its relative frequency after an infinite repetition of independent trials.
Formally, a Bayesian model is the joint distribution Π of a random variable (X, θ) taking values in X × Θ, where X is the observation and θ ∈ Θ the "random" parameter.That is conditional on θ it is assumed that X has law P θ .Prior information about θ is incorporated in the model via the marginal distribution ν(•) := Π(X × •), the so-called prior distribution.Having observed X = x, Bayesian inference is made on the basis of the law of θ | X = x, written here Π(θ ∈ • | X = x), known as the posterior distribution.In contrast, a frequentist model is a family {P θ : θ ∈ Θ} of distributions over X , with the usual modeling assumption that there exists θ 0 ∈ Θ unknown such that X as law P θ0 .This last modeling assumption is not required for Bayesian inference, and the most subjective Bayesians reject its validity [1].
Besides the philosophical differences, it is of interest to understand what the Bayesian framework can offer to a pragmatic statistician.Arguably one of the most appealing features of the Bayesian framework is the ability to incorporate prior knowledge in the model in a most natural manner via ν.We can also mention that the approach gives a systematic and natural way to handle uncertainty quantification through credible sets that are sets C x such that Π(θ ∈ C x | X = x) ≥ 1 − α, where 1 − α is a credibility level.In nonparametric statistics, it is also often the case that Bayes' estimators are intrinsically adaptive to the smoothness of the true parameter -assuming it exists -while frequentist adaptative procedures can be tricky to build [2,Chapter 10].Somewhat anecdotally, even the most frequentist statisticians may find some comfort in the use of Bayes' estimators, using them as a powerful tool to construct admissible (even minimax) frequentist procedures [3].
Yet, as appealing as the Bayesian framework can seem, there is still a lot of research going on to understand and address its limitations.We may establish the following taxonomy of fruitful research topics in Bayesian statistics.We believe these are the current main research areas in the community, but we do not exclude the possibility that this list may be biased by the authors' centers of interest.
(A) Design of prior distributions and complex models with meaningful interpretation.As many modern applications involve complex data and/or sampling mechanisms, it is often challenging to design interpretable and tractable Bayesian models.We note that this issue is not specific to the Bayesian paradigm.Yet, it is an important field of research in the community and deserved to be mentioned here.(B) Efficient posterior computations.The posterior distribution is rarely available in closed-form, neither numerically computable.The traditional way to overcome this difficulty is to obtain representative samples from the posterior Π(θ ∈ • | X = x).A powerful and popular technique involves building a Markov chain whose stationary distribution is the posterior [4].However, this method is intrinsically sequential and often does not scale well with a large number of data.Many efforts have been put in recent years to try and solve this issue.Among the most popular solutions are methods based on subsampling the data [5], or building coresets [6], or optimizing for the best approximant of the posterior distribution in a given family [7], or splitting the data and perform computation in parallel [8].(C) Frequentist validation of Bayes procedures.Of interest to the pragmatic Bayesian are the frequentist properties of Bayes estimates.That is, under the frequentist modeling assumption that X has law P θ0 for a fixed "true" θ 0 ∈ Θ, what is the behaviour of Π(θ ∈ • | X) (as a measure-valued random variable).Questions of interest regard its concentration on neighborhoods of θ 0 , the rate of this contraction, the limiting distribution shape, etc.Although these questions are rather well understood for parametric models [9, Chapter 10], they are much more challenging in nonparametric models [2].(D) Bayesian Machine Learning.The impressive successes of Machine Learning (ML) algorithms in recent years have attracted the attention of many statisticians.This is especially the case of learning algorithms, which consists on learning, either from labeled data (X 1 , Y 1 ), . . ., (X n , Y n ) ∈ X × Y (supervised learning setting), or from unlabeled data X 1 , . . ., X n ∈ X (unsupervised learning setting), a function f : X → Y that maps features to labels.The popularity of learning in the statistical community is certainly due to the emergence of Statistical Learning Theory [10,11], a popular framework for theoretical analysis of learning algorithms.Since the goal of learning is reminiscent to prediction, without surprise many statistical model-based approaches (including Bayesian!) have been proposed over the last decade to design learning algorithms.
The goal of the "Bayesian statistics" session was to present works by young researchers in the field, with the aim to cover as much as possible the whole spectrum of topics described above.Let us briefly summarize the content of the session, listing the talks in their order of appearance.The first talk was given by Maxime Vono and was concerned with the research topic (B).He presented a generic method to perform distributed (ie.parallel) posterior computations.The second talk was given by Hong-Phuong Dang and was concerned with topics (A), (B) and (D).She has presented the advantages of a Bayesian approach for image denoising, together with an algorithm for efficient posterior sampling.The third talk, given by Thibault Randrianarisoa, was concerned with topic (C).He presented frequentist guarantees for point-estimation and confidence sets in density estimation for a Bayesian model based on Pólya trees.The last talk, given by Mariia Vladimirova, was concerned with topics (A) and (D).She explained how we can understand and interpret the internals of somewhat complicated models such as Bayesian neural networks.This article presents in more detail some of the works that have been presented during the session, organised and chaired by Zacharie Naulet who wrote this introduction.The focus of the article is put on topic (D), in particular on two popular Bayesian supervised learning algorithms that are Bayesian neural networks and sparse linear models.Section 2, written by Mariia Vladimirova and Julyan Arbel, focuses on distributional properties of Bayesian neural networks.Section 3, written by Hong-Phuong Dang, Clément Elivra, and Cédric Herzet, describes an approximate posterior sampling scheme for dictionary learning.

Notations
We use the following notations throughout the paper.Matrices and vectors are respectively denoted by uppercase (e.g., W, D) and lowercase (e.g., g, h, d) bold letters.I L is the identity matrix of dimension L. The nth column of matrix D is written as d n and element (k, n) as d kn .⊙ denotes element-wise product between two vectors.∥ • ∥ F is the Frobenius norm of a matrix; ∥ • ∥ 0 returns the number of nonzero elements in its argument.I{•} is the indicator function which is equal to 1 when the statement between braces is true and to 0 otherwise.The symbol ∝ refers to equality up to a normalization constant.Finally, the notations N , IG, Ber, Beta, and GWT R stand respectively for the Normal, Inverse Gamma, Bernoulli, Beta, and generalized Weibull-tail distributions.We use the following notations for neural networks.The number of hidden layers, called depth, is denoted L. Each layer following the input layer consists of units which are linear combinations of previous layer units transformed by a function (oftentimes nonlinear), referred to as the activation function and denoted by ϕ : R → R. Given an input x ∈ R N (for instance an image made of N pixels) the ℓ-th hidden layer, ℓ = 1, . . ., L, consists of two vectors of size denoted by H ℓ (called the width of layer).The vector of units before application of the non-linearity is called pre-activation, and is denoted by H ℓ ), while the vector obtained after element-wise application of ϕ is called post-activation and is denoted by H ℓ ).More specifically, these vectors are defined as where W (ℓ) is a weight matrix of dimension H ℓ ×H ℓ−1 including a bias vector, with the convention that H 0 = N , the input dimension.

Bayesian neural networks distributional properties
Neural networks suffer from numerous limitations, despite many advances and high interest in the deep learning field.Their adoption in practical and safety-critical applications is still restrained [12].To overcome limitations, many researchers work on understanding the mechanisms behind deep learning models and developing new tools.For instance, the influence of different architecture and training procedure on outputs and their better description can help with the choice of a proper model for a given problem and, in general, with transparency and trustworthiness.In the same way, it is essential to have well-calibrated uncertainty for believing the prediction outputs [13,14].Uncertainty estimates can also serve as a means of transparency as they inform when the model does not know the correct prediction [15].
Bayesian inference is considered as one of the solutions to trustworthiness as it allows to provide some uncertainty for the outputs.Instead of taking into account a single answer of a model, Bayesian methods allow considering an entire distribution of answers.Bayesian neural networks achieve good performance while providing an uncertainty quantification of their outputs.The Bayesian approach does not open the black box of neural network-based models; however, it gives a new perspective to studying internal mechanisms of neural networks.For recent reviews, see [16][17][18].
We suggest studying the internal mechanisms in Bayesian neural networks.More specifically, we focus on prior distributions at the unit level.The flagship result is that function-space priors converge to a Gaussian process when the layers' width tends to infinity.We extend this result to finite-width Bayesian neural networks by providing a characterization of the marginal prior distribution of the units.We provide an accurate characterization of hidden units tails through sub-Weibull and Weibull-tail descriptions.The obtained results illustrate the heavy-tailed nature of hidden units in deep layers for different weight priors.We believe that these characterizations help to understand the internal mechanisms of neural networks and to suggest model improvements.

Sub-Weibull hidden units
The recent work by Bibi et al. [19] provides the expression of the first two moments of the output units of a one-layer neural network.Obtaining moments is a preliminary first step to characterizing a whole distribution.However, the methodology of [19] is also limited to one hidden layer neural networks.
Later work focuses on moments of hidden units and shows that any order moments are finite under mild assumptions on the activation function.More specifically, the sub-Weibull property of distributions is shown, conjecturing that hidden units are heavier-tailed with going deeper in the network [20,21].This is in contrast with their GP limit which is obtained when going wider.To describe this result, we start with the formal definition of a Sub-Weibull random variable: Definition 2.1 (Sub-Weibull random variable).A random variable X satisfying for all x > 0 and for some θ > 0 is called a sub-Weibull random variable with so-called tail parameter θ, which is denoted by X ∼ subW(θ).
Sub-Weibull distributions are characterized by tails lighter than (or equally light as) Weibull distributions; in the same way as sub-Gaussian or sub-exponential distributions correspond to distributions with tails lighter than Gaussian and exponential distributions, respectively.Sub-Weibull distributions are parameterized by a positive tail index θ and are equivalent to sub-Gaussian for θ = 1/2 and sub-exponential for θ = 1.
Given some input x, such prior distribution induces by forward propagation (1) a prior distribution on the pre-nonlinearities and post-nonlinearities, whose tail properties are the focus of this section.To this aim, the activation function ϕ is required to span at least half of the real line as follows.We introduce an extended version of the activation function assumption from Matthews et al. [22]: Definition 2.2 (Extended envelope property).An activation function ϕ : R → R is said to obey the extended envelope property if there exist c 1 , c 2 ≥ 0, d 1 , d 2 > 0 such that the following inequalities hold The interpretation of this property is that ϕ must shoot to infinity at least in one direction (R + or R − , at least linearly (first line of (3)), and also at most linearly (second line of ( 3)).Of course, compactly supported nonlinearities such as sigmoid and tanh do not satisfy the extended envelope property but the majority of other nonlinearities do, including ReLU, ELU, PReLU, and SeLU.Theorem 2.1 (Vladimirova et al. [20]).Consider a Bayesian neural network with centered Gaussian priors and with activation function ϕ satisfying the extended envelope condition of Definition 2.2.Then conditional on the input x, the marginal prior distribution induced by forward propagation (1) on any unit (pre-or post-activation) of the ℓ-th hidden layer is sub-Weibull with tail parameter θ = ℓ/2.That is for any 1 ≤ ℓ ≤ L, and for any where a subW distribution is defined in Definition 2.1, and u

Weibull-tail hidden units
Further, this result is improved by showing that hidden units are Weibull-tail distributed.Weibull-tail distributions are characterized in a different manner than sub-Weibull distributions, not based on moments but on a precise description of their tails.Denote by F X ( • ) and F X ( • ), respectively, the cumulative distribution function and survival function of some random variable X. Definition 2.3 (Generalized Weibull-tail on R).A random variable X is generalized Weibull-tail on R with tail parameter β > 0 if both its right and left tails are upper and lower bounded by some Weibull-tail functions with tail parameter β: , for x < 0 and −x large enough, where l r 1 , l r 2 , l l 1 and l l 2 are slowly-varying functions.We note X ∼ GWT R (β).This tail description reveals the difference between hidden units' distributional properties in finite-and infinite-width Bayesian neural networks, since hidden units are generalized Weibull-tail with a tail parameter depending on those of the weights: Theorem 2.2 (Vladimirova et al. [23]).Consider a Bayesian neural network as described in Equation (1) with ReLU activation function.Let ℓ-th layer weights be independent symmetric generalized Weibull-tail on R with tail parameter β (ℓ) w .Then conditional on the input x, the marginal prior distribution induced by forward propagation (1) on any pre-activation is generalized Weibull-tail on R: for any 1 ≤ ℓ ≤ L, and for any , where a GWT R distribution is defined in Definition 2.3.
Note that the most popular case of weight prior, iid Gaussian [24], corresponds to GWT R (2) weights.This leads to units of layer ℓ which are GWT R ( 2 ℓ ).

Generalized Weibull-tail vs sub-Weibull properties
Some of the commonly used techniques to study the tail behavior is to consider probability tail bounds such as sub-Gaussian, sub-exponential, or their generalization to sub-Weibull distributions [21,25].The sub-Weibull property of Definition 2.1 for a non-negative random variable ensures the existence of the moment-generating function as well as bounds on moments.In contrast, the Weibull-tail property of Definition 2.3 characterizes the survival or density functions without a hand on moments.
Theorem 2.1 shows that hidden units of Bayesian neural networks with iid Gaussian priors are sub-Weibull with tail parameter proportional to the hidden layer number, that is θ = ℓ 2 .It means that the unit distributions
of hidden layer ℓ can be upper-bounded by some Weibull distributions ae −x 2/ℓ for all ℓ.For larger tail parameter θ, Weibull distribution ae −x 1/θ is heavier-tailed but being sub-Weibull does not guarantee the heaviness of the tails.However, this upper bound is optimal in the sense that it is achieved for neural networks with one hidden unit per layer.From Theorem 2.2, for neural networks with independent Gaussian weights, hidden units of ℓ-th layer are generalized Weibull-tail with tail parameter β = 1/θ = 2/ℓ so they have upper and lower bounds of the form e −x 2/ℓ l(x) up to a constant where l is some slowly-varying function.Therefore, it proves that hidden units are heavier-tailed when going deeper for any finite number of hidden units per layer.

Related literature and discussion of the results
Regularization interpretation.It is well-known that performing a maximum a posterior (MAP) estimation in a Bayesian context is akin to employing a penalized maximum likelihood estimation (MLE), where the role of the penalty is played by the negative log-prior.According to this lens, [20] show that the sub-Weibull prior obtained in Theorem 2.1 induces a different regularization at the level of the units U than at the weights level W. As summarized in Table 1, the negative log-prior for some Gaussian prior is nothing but an L 2 penalty (also called weight-decay in the deep learning community), while the negative log-prior for sub-Weibull priors of tail parameter ℓ/2 may take the more elaborate form of L 2/ℓ penalties.
Layer Penalty on W Approximate penalty on U  Gaussian pre-activations.The study of the neural networks' distributional properties through Bayesian analysis, where the weights are assigned some prior distribution, has attracted a lot of attention in recent years.One of the main results in the field states that Bayesian deep neural network units converge in distribution to a Gaussian process when the layers' width goes to infinity.Originally stated by [27] for one-hidden-layer neural networks, this result was recently shown to carry over to deep neural networks by [22,28].In contrast, Theorem 2.2 shows that the non-asymptotic (i.e. for finite-width neural networks) prior distribution of units from the ℓ-th layer induced by some prior on the weights gets heavier when going deeper in the network.This result puts into perspective the infinite-width Gaussian process property which might be far from holding for real-world, often very deep, neural networks.To illustrate this point, we conducted the following experiment.We simulated Bayesian fully connected neural networks according to standard Gaussian weights, with varying depth ℓ ∈ {1, 3, 10} and varying width (but fixed for each architecture) H = H ℓ ∈ {1, 2, . . ., 10}.We propagated those random weights to units conditional on a fixed (once for all randomly sampled) input.For every layer, we then computed the tail index estimator θ proposed in [21] for the tail parameter θ appearing in Equation 2.1.We can see in Figure 2 that the theoretical result of Theorem 2.1 that states that θ = ℓ/2 is well in line with the estimates obtained with networks of width H = 1.When the width increases, the estimates for θ tend to decrease, narrowing the gap to the lower bound of 1/2 corresponding to a Gaussian distribution.A better understanding of this Gaussian is important since it is assumed to hold in a number of subsequent works, for instance, relative to information propagation in the neural network, as described below.
Information propagation and Edge of Chaos.An active line of research focuses on the propagation of deterministic inputs in neural networks [29][30][31].These works build upon the limiting Gaussian process property of neural networks, in order to devise efficient initialization rules for neural networks.The main idea is to explore the covariance between pre-activations for two given different data points.[29] and [30] obtain recurrence relations under the assumption of Gaussian initialization and Gaussian pre-activations.They conclude that there is a critical line, so-called Edge of Chaos, separating signal propagation into two regions.The first one is an ordered phase in which all inputs end up asymptotically correlated.The second one is a chaotic phase in which all inputs end up asymptotically independent.To propagate the information deeper in a neural network, one should choose Gaussian prior variances corresponding to the separating line Edge of Chaos.[31] show that the smoothness of the activation function also plays an important role.Since this line of works considers Gaussian priors not only on the weights but also on the pre-activations, it is closely related to a wide regime where the number of hidden units per layer tends to infinity.Given that hidden units are heavier-tailed with depth, we argue that the impact of the Gaussian pre-activations assumption on the Edge of Chaos should be better understood.This issue is further investigated in [32].
Cold posterior effect and priors.It was recently empirically found that Gaussian priors in neural networks lead to the cold posterior effect in which a tempered "cold" posterior, obtained by exponentiating the posterior to some power greater than one, performs better than an untempered one [33].The performed Bayesian inference is considered sub-optimal due to the need for cold posteriors, and the model is deemed misspecified.From that angle, [33] suggest that Gaussian priors might not be a good choice for Bayesian neural networks.In some works, data augmentation is argued to be the main reason for this effect [34,35] as the increased amount of observed data naturally leads to higher posterior contraction [34].At the same time, even considering the data augmentation for some models, the cold posterior effect is still present.[35] hypothesize that using an appropriate prior incorporating knowledge of data augmentation might provide a solution.Moreover, heavy-tailed priors have been shown to mitigate the cold posterior effect [36].According to Theorem 2.2, heavier-tailed priors lead to even heavier-tailed induced priors in function-space.Thus, the heavy-tail property of distributions in function-space might be a highly beneficial feature.[36] also proposed correlated priors for convolutional neural networks since trained weights are empirically strongly correlated, see also [37].Correlated priors improve overall performance but do not alleviate the cold posterior effect.Another research direction to understanding the cold posterior effect is through the lens of generalization bounds such as PAC-Bayesian ones [38].It is argued that discussions of the cold posterior effect should take into account that approximate Bayesian inference does not readily provide guarantees of performance on out-of-sample data, which are better described through generalization bounds.

Conclusion
Bayesian inference is one of the solutions to trustworthiness as it provides a framework to describe some uncertainty for the outputs.Instead of taking into account a single answer of a model, Bayesian methods allow considering an entire distribution of answers.Bayesian neural networks achieve good performance while providing an uncertainty quantification of their outputs.Though the Bayesian approach does not open the black box of neural network-based models, it opens a new perspective to study the internal mechanisms.We discussed some recent advances in describing Bayesian neural networks at the level of units.We hope that these results help to understand better the underlying working flow.

Problem statement
Most digital acquisition devices have traits that make measurements subject to noise.A standard signal processing task therefore consists of recovering some unknown signal s ∈ R L from its noisy observation where b represents some corrupting noise.It is well-known that reducing the noise variance is only possible if some prior information on s is known, e.g., s ∈ S target ⊂ R L [39, Ch. 11].Since such knowledge is rarely available in practice, many denoising algorithms leverage the construction of a surrogate model S model ⊇ S target from a set of collected data, say In this paper, we consider the so-called "s-sparse" model where each element of {s n } N n=1 is assumed to stem from a noisy linear combination of a few columns of some matrix D ∈ R L×K , that is During the last decade, model (6) has attracted the attention of many researchers.Its success mostly revolves around the combination of two ingredients.First, many natural signals have been shown to lie close to a sparse model, see e.g., [39, Sec.9.3] for a discussion in image processing.Second, a remarkable amount of algorithms (along with their theoretical analyses) have been proposed in the literature [40] to obtain sparse representations.
In particular, many of them have been shown to be robust to additive noise and stable with respect to sparsity defect, see e.g., [40,Th. 4.19].
From the point of view of our denoising problem, a "good" sparse model should satisfy several (often contradictory) objectives.On the one hand, robustness against noise requires S model not to "spread" too much in R L .This implies that s, K and δ should be chosen as small as possible.On the other hand, S model must obviously contain the set of target vectors S target .Hence, s, K and δ should be chosen sufficiently large to obey this requirement.Lastly, the value of s, K and δ is also constrained from a "learning" point of view: since S model has to be learned from a finite number of (noisy) samples {x n } N n=1 , the number of degrees of freedom of S model should not be too large to avoid overfitting and allow for generalization [41].
Many approaches to build sparse models satisfying these requirements have been proposed in the literature.The problem of finding a "good" matrix D from a set of data points {x n } N n=1 is often known as "dictionary learning".Most methods addressing dictionary learning leverage the seminal work by Olshausen & Field [42] and are based on the resolution of an optimization problem where some constraints on s and δ are included explicitly or implicitly, see e.g., [43][44][45][46][47].The most popular example of such an approach is probably the "K-SVD" algorithm proposed by Aharon et al. in [43] where the authors introduced a dictionary learning algorithm searching (heuristically) a solution of the following problem: min for some λ > 0.Here the trade-off between sparsity level and modeling/observation noise is implicitly chosen via the tuning parameter λ.The choice of λ and K has to be done manually using, e.g., cross-validation [41].This operation may be computationally cumbersome since problem (7) needs then to be solved for each new value of K.Moreover, the number and the range of values to test for K often remain a heuristic choice.A few works have elaborated on the K-SVD approach to propose adaptive dictionary learning methods that infer the size of the dictionary within the optimization process, see [48][49][50][51].Nevertheless, these procedures still call for important parameter tuning.For example, most of them need to know the noise or the sparsity level.
Another well-known family of dictionary learning procedures are Bayesian methods, see e.g., [52][53][54][55].Unlike "optimization-based" methods, Bayesian procedures can naturally include the evaluation of model parameters within the estimation process.For example, in [52][53][54] the parameters of a Bernoulli-Gamma-Gaussian model and the variance of the observation noise were embedded within the dictionary learning problem.In [55], Dang et al. went one step further and incorporated the size of the dictionary (K) into the learning task by using an "Indian-Buffet-Process" (IBP) prior.In these papers, a tractable implementation solving the joint "dictionary-parameters" estimation problem is usually achieved by resorting to Monte-Carlo sampling algorithms.Although these procedures are known to converge asymptotically (in the number of samples) to exact a posteriori estimates, their main drawback stands in their computational complexity since a large number of iterations are often needed to attain the target sampling distribution.
Many solutions have been proposed in the literature to overcome the high-computational cost of Monte-Carlo methods, e.g., stochastic methods, sequential Monte Carlo, particle Markov chain Monte-Carlo, stochastic variational inference or variational Bayes methods.In this paper we focus on the so-called "small-variance asymptotics" (SVA) approximation of the Gibbs sampler proposed in [56] and further extended in [57][58][59][60][61][62][63][64].We emphasize that exploiting this type of approximation in the context of dictionary learning can lead to procedures combining the advantages of "optimization-based" and "Bayesian-based" methods, namely reasonable computational cost and integrated estimation of the model parameters.The material of this paper leverages contributions [65][66][67].

Target
Distribution family Parameters definition Expressions of some a posteriori conditional probabilities related to model ( 8)-( 12).Shorthand notation "X | •" refers to random variable X conditioned to all the other variables of the problem.

Bayesian model
This section introduces the Bayesian model used for our dictionary learning problem.We assume that each element of the training set {x n } N n=1 is an independent realization of the following model corresponding (to some extent) to the sparse model ( 6) with δ = 0,1 and where and w obeys some "sparsity-enforcing" distribution.
In this paper, we consider the following option: and p(z) is a distribution on {0, 1} K which favors sparsity.We will consider two particular choices for p(z) herefafter, namely a Bernoulli distribution (Section 3.3.1)and an "Indian-Buffet Process" (Section 3.3.2).We finally assume that variances σ2 b and σ 2 c are distributed as for some positive hyper-parameters c 0 , d 0 , e 0 , f 0 .In the sequel, we will use the notations X, W, C, Z to denote matrices whose columns correspond to realizations {x n } N n=1 , {w n } N n=1 , {c n } N n=1 and {z n } N n=1 , respectively.As a final remark, we note that the variance of each d k in (9) has been fixed to L −1 to address the multiplicative factor indeterminacy in the pair (D, W). 2

Small-variance approximation of Gibbs sampling
A standard approach to compute a point estimate from some a posteriori distribution is to resort to Monte-Carlo approximation and Gibbs sampling [4].The former consists of approximating the expectation of a random quantity by the arithmetic mean of a set of realizations; the latter is a powerful tool to drawn samples from the distribution of interest.Gibbs sampling is a "Markov chain Monte Carlo" method: it generates a sequence of iterates which can be shown to asymptotically converge to realizations of the target distribution.The sequence of iterates is generated by sequentially sampling realizations of the problem's random variables (here σ 2 b , σ 2 c , elements of D, C, Z and potential hyperparameters) from their conditional a posteriori distributions.Some conditional distributions associated to σ 2 b , σ 2 c , D, C are summarized in Table 2.The conditional distribution of Z will be detailed in Sections 3.3.1 and 3.3.2for two particular choices of p(Z).We note that, as far as σ 2 b , σ 2 c , D, C are concerned, the conditional probabilities involved in the implementation of the Gibbs sampler correspond to distributions which can be efficiently sampled by many well-known methods of the literature, see e.g., [4,Chapter 2].
Unfortunately, despite this desirable feature, solving dictionary learning problems via Gibbs sampling often entails a computational complexity superior to "optimization-based" methods by orders of magnitude.In this paper we consider a particular approximation of the Gibbs sampler, known as "small-variance asymptotics" (SVA), to decrease the computational cost of this method.SVA approximation has been proposed in [56] and consists of approximating (some of) the problem's conditional distributions by their limit expression as the noise variance σ 2 b tends to zero.More specifically, SVA methods commonly build on the following two key ingredients:3 i) A Gibbs procedure designed to sample the target posterior probabilities (that is p(C, Z, D, σ 2 b , σ 2 c |X) in our image-denoising framework).The definition of this sampler involves in particular the specification of conditional probabilities on (subsets of) variables C, Z, D, σ 2 b , σ 2 c .ii) The approximation of the above conditional probabilities by single-mass discrete probabilities.The name "SVA" comes from the fact that the construction of the single-mass approximation is based on the asymptotic behavior of the target conditional probability as σ 2 b tends to zero (and for some particular choice for the scaling of hyper-parameters).
We note that, upon the approximation mentioned in ii), the sampling of the conditional probabilities appearing in the Gibbs procedure reduces to deterministic (sometimes closed-form) updates of the variables C, Z, D, σ 2 b , σ 2 c .Hence, in a nutshell, SVA methods can be seen as approximations of Gibbs samplers in which the sampling of the conditional probabilities are replaced by deterministic updates.Although there exist (to the best of our knowledge) no theoretical guarantees on the quality of the posterior estimates obtained from SVA approximations, this framework has been shown in many contributions [56][57][58]68] to drastically reduce the computational complexity of standard Gibbs samplers while achieving nice empirical estimation performance.
In rest of this section, we illustrate how the above two keys ingredients particularize to our image-denoising setups.We consider the cases where p(Z) corresponds to a Beta-Bernoulli distribution in Section 3.3.1 and an "Indian-Buffet Process" in Section 3.3.2.

Beta-Bernoulli model
In this section, we consider the case where We see that the number K of atoms is fixed in advance in model (13).Nevertheless, the probability π k of activation of each atom is left as a degree of freedom to the estimator.The learning procedure can therefore possibly "decrease" the number of atoms in the dictionary by setting some activation parameters to zero.
The conditional posterior probabilities of z kn and π k take the form: SVA approximation consists of considering the limit form of (some of) the conditional probabilities in Table 2 and ( 14)-( 15) for σ 2 b → 0 in the Gibbs sampler.To avoid degeneracy of the problem at stake, the activation probabilities are moreover parameterized as a particular function of the noise variance.More specifically, we let so that π k → 0 as σ 2 b → 0. Considering the parameter of the Bernoulli distribution in ( 14), we thus easily find that lim where Hence, under hypothesis (16) and in the small-variance limit, the realizations of the a posteriori conditional probability on z kn obey the following deterministic rule: This update is tantamount to defining the Gibbs sampling update as the mode of the conditional probabilities.A similar approach (not detailed here) can be followed for the distributions specified in Table 2.When the distribution is Gaussian, the SVA approximation thus corresponds to the limit of the mean as σ 2 b → 0, see [67].

Indian-Buffet-Process model
Let us now assume that the elements of {z n } N n=1 correspond to binary sequences (K = ∞).With a slight abuse of notation, we will stick to matrix notations and let Z denote the column-wise concatenation of sequences {z n } N n=1 and z kn the kth element of sequence z n .We consider the following model on Z (better known as "twoparameter Indian-Buffet 4 Process" (IBP) in the literature): where α > 0, ξ > 0 are two model parameters,5 K + ≜ ∥ N n=1 z n ∥ 0 (that is, the number of atoms that participate to the reconstruction of at least one observation), β is the beta function and γ(Z) ̸ = 0 is a function which basically depends on the "structure" of Z (but unimportant for our exposition here), see [70].
As K = ∞, the number of atoms considered in our model is theoretically infinite.However, only atoms corresponding to "not all-zero columns" of Z (referred to as active atoms) contribute to the reconstruction of the observations.A careful examination of model (19) indicates that the IBP promotes realizations of Z with finite numbers of active atoms (K + ).In that sense, we say that the size of the dictionary is left as a degree of freedom in the IBP model.
Let us concentrate on the SVA approximation of the following conditional probability Similarly to Section 3.3.1,parametrization of α, ξ as functions of σ 2 b must be considered to avoid degeneracy of the conditional probability when σ 2 b → 0. In particular, letting we have that (20) concentrates on its mode(s) when σ 2 b → 0.Moreover, the latter correspond to the minimizers of the following function: We note that, for a given value of K + , the minimization over the first K + components6 of c n and z n reduces to a standard ℓ 0 -penalized sparse representation problem over a dictionary of K + atoms.This problem has been extensively studied in the literature during the last decades and many heuristic methods exist to solve it (exactly or to good accuracy) in many setups, see e.g., [40].
As a final remark, let us note that (22) may admit infinitely many minimizers since for all n and k ≤ K + such that z kn = 0, all choices of the coefficient c kn lead to the same value of the objective function.To resolve this ambiguity, we proceed to an extra sampling step according to the posterior probability lim after solving (22).An SVA analysis of (23) indicates that the value of c kn remains unchanged if z kn = 1 and that c kn has to be drawn according to some normal distribution otherwise (see Table 2).

Numerical experiments
This section reports the main conclusions of several numerical experiments in image processing conducted in [65][66][67].More precisely, a standard method in image processing to assess the relevance of different dictionary learning approaches is to compare their denoising performance.As far as our simulation setup is concerned, the two models presented in this paper lead to denoising performance comparable to the state of the art.Moreover, they benefit from some properties of Bayesian techniques while inducing a computational cost similar to "optimization-based" methods.Figure 3 displays typical denoising results obtained by our proposed procedures.The rest of the section is organized as follows.We describe below our simulation setup.The performance of the dictionary learning procedures obtained from the two considered models are discussed in Sections 3.4.1 and 3.4.2,respectively.
Our simulation setup is as follows.A set of 5 (standard) images of size 512 × 512 is considered, namely "barbara", "hill", "mandrill", "lena" and "peppers".Each image is corrupted with an additive noise whose entries are i.i.d.realizations of a zero-mean Gaussian with standard deviation σ img .We consider the cases σ img = 25 and σ img = 40, respectively referred to as "low-noise scenario" and "high-noise scenario" hereafter.Each image is then decomposed into a set of 8×8 overlapping patches, resulting in N = 16129 (corresponding to 50% overlapping) training vectors {x n } N n=1 ⊂ R 64 .We then apply the Bayesian model described in Section 3.2 to all patches.More precisely, we interpret each patch x n as a deteriorated version of some noise-free patch which obeys the sparse model (6).For the two models, the estimator of the nth (denoised) patch is thus defined as D( c n ⊙ z n ) where the "hat" symbol refers to the SVA approximation of a MAP estimator.More details can be found in [65,67].The reconstructed image is eventually obtained by merging all estimated patches.Since the latter have been designed so that each pixel is involved in many patches, the value of one pixel of the reconstructed image is obtained by averaging its value over all patches where it was involved.The denoising performance is finally obtained by evaluating a peak signal-to-noise ratio (PSNR) between the original image and the estimated one.Hence, the higher the PSNR, the better the denoising performance.

Results for the Beta-Bernoulli model
In this section, we report numerical results obtained with the SVA approximation of the Gibbs sampler presented in Section 3.3.1.We refer to this method as "BBG-sva" hereafter.
As mentioned earlier, the size of the dictionary (namely K) is a parameter of the model for BBG-sva.Figure 4 shows the evolution of the PSNR obtained with BBG-sva seen as a function of K for the two considered noise levels (i.e., σ img = 25 and σ img = 40) and several images.For all considered setups, we observe that the denoising performance stabilizes for some value of K.Such a behavior has already been observed in [55] and suggests the existence of an "optimal" size that depends on both the image and the noise level.Besides, overestimating K seems to only affect the computational cost of the method.For these reasons, we focus on the value of K = 300 in the next experiment.Interestingly, this finding is also coherent with typical setups in image processing where a dictionary of size K = 256 or 512 atoms is typically learned [43,54,71].
An empirical comparison of the denoising performance of BBG-sva with state-of-the-art dictionary learning approaches has been carried out in [67, Table 1] for K = 300.We have observed that BBG-sva achieves denoising performance similar to its competitors for a running time of the same order of magnitude.This indicates that BBG-sva is able to automatically adjust the values of the hyper-parameters {λ k } K k=1 in (18) and still features the flexibility of Bayesian approaches despite the SVA approximation.

Results for the Indian-Buffet-Process model
In this section, we illustrate the performance of the SVA approximation of the Gibbs sampler presented in Section 3.3.2.We refer to this method as "IBPDL-sva" hereafter.In our simulations, we use OMP -a standard greedy procedure, see [40]-to address (22).Details about the implementation of the addition/removal of atoms (that is the update of K + ) can be found in [65].
In constrast to BBG-sva, IBPDL-sva includes the size of the dictionary as a variable of the problem.Such a feature has nevertheless been obtained at the cost of two additional hyperparameters in the model, namely λ 1 and λ 2 (see (21)).In our simulations, these two parameters are tuned by resorting to cross-validation on one image and reused for the others.
Figure 5 shows the evolution of the effective size K + of the dictionary along the iterations of the procedure for several couples of (λ 1 , λ 2 ) and in the low-noise scenario.The "peppers" image is considered in this simulation.The blue curve denotes the couple (λ 1 , λ 2 ) chosen by cross-validation while the orange ones correspond to the rejected couples.Interestingly, one observes some form of stability with respect to the choice of λ 1 and λ 2 .More specifically, all the curves tend to stabilize around a common value of K + for all choices of λ 1 and λ 2 .
An empirical comparison of the denoising performance of IBPDL-sva with state-of-the-art dictionary learning approaches is available in [65,66].Two conclusions can be drawn from these experimental results.First, the denoising performance of IBPDL-sva is similar to that of other state-of-the-art dictionary learning procedures while automatically estimating the size of the dictionary and the noise level corrupting the data.Second, a comparison of the complexity of IBPDL-sva and the Gibbs sampler proposed in [55], 7 reveals that the former enables gain in terms of computational time of (at least) one order of magnitude with respect to the latter.More precisely, we have observed that Gibbs sampler in [55] performs in average 30 iterations per hours while our proposed methods reaches 150 iterations in 30 minutes 8 .

Conclusion
Here we studied the "small-variance asymptotics" approximation of the Gibbs sampler related to two Bayesian models for dictionary learning.Our analysis leverages a carefully-designed coupling between the parameters of the model and the (estimated) noise variance.For the two models, the resulting method gathers both the flexibility of Bayesian modeling and the numerical efficiency of optimization methods.The relevance of the proposed approach was assessed on an image denoising application.The performance of the proposed approach was shown to be comparable with that of existing supervised methods, while automatically tuning the size of the dictionary and the level of the corrupting noise.

m
is either a pre-activation g (ℓ) m or a post-activation h (ℓ) m .

Figure 3 .
Figure 3. Denoising results in the high-noise scenario obtained by using the IBP model.From left to right: noisy, denoised, original images and learned dictionary.

Figure 4 .+Figure 5 .
Figure 4. Evolution of BBG-sva performances (PSNR of the reconstructed image) seen as function of the dictionary size K for several images.Left and right images correspond to the low-noise and high-noise scenarii, respectively.

Table 1 .
Left: Comparison of Bayesian neural network penalties on weights W and units U for varying layer depths.Right: Graphical representation of the unit penalties implied at the unit level for varying layer depths.