Some recent developments in Markov Chain Monte Carlo for cointegrated time series

We consider multivariate time series that exhibit reduced rank cointegration, which means a lower dimensional linear projection of the process becomes stationary. We will review recent suitable Markov Chain Monte Carlo approaches for Bayesian inference such as the Gibbs sampler of [41] and the Geodesic Hamiltonian Monte Carlo method of [3]. Then we will propose extensions that can allow the ideas in both methods to be applied for cointegrated time series with non-Gaussian noise. We illustrate the efficiency and accuracy of these extensions using appropriate numerical experiments.


Introduction
The study of multivariate times series displaying the feature of reduced rank cointegration is an important topic within a spectrum of fields related to econometrics and statistics; [27], [25], [36] and [24]. The concept of cointegration was generally developed in the works of [27] and [10]. Since these early developments, there has been a wide investigation of cointegration in econometrics and finance, see [11] and more recently [16]. Fundamentally, cointegration is a property of multivariate time series whereby a lower dimensional linear transformation of a non-stationary process becomes stationary. The resulting projected time series are often referred to as "spread series" or the "cointegration portfolio".
However, in practice both the basis of projection that yields a stationary process and its dimension are unknown quantities that need to be estimated. Before presenting any models or estimation techniques a basic notion of cointegration can be described by considering a collection of time series, y i,t where i = 1, . . . , n. If for some i, y i,t are not stationary but there exists real valued coefficients, denoted by b, such that the linear combination is stationary, then the series are said to be cointegrated. The linear combination vector b = (b 1 , b 2 , . . . , b n ) T shown in (1) is called a cointegrating relationship; [12]. A cointegrating relationship may be seen as a long-term equilibrium phenomenon which allows the cointegrating variables, y i,t , to deviate from their relationships in the short term, but retains their long term associations. This property is of particular relevance in financial applications as it allows to construct a mean reverting portfolio by taking positions proportional to the cointegration relationships b; for a reference on algorithmic trading applications see [70], [52] and [54]. The main challenge in data analysis for such types of processes lies in the detection and estimation of the cointegration relationships. It is clear that the stationary property is invariant under linear combinations of cointegration relations, i.e. if b and c are valid relations, then so is a 1 b + a 2 c for any a 1 , a 2 ∈ R. Therefore linearly independent relations form a basis for a subspace which is referred to as a cointegration space.
The dimension of the cointegration space will be referred to as a cointegration rank, r, and is actually the number of linearly independent vectors b ∈ R n whose inner product with observable series y t yields stationary process z t . In this paper, we are particularly interested in the estimation of the cointegration space under various modelling assumptions of the underlying processes y t,i . When casting the inference problem for this space, certain likelihood based identification problems can arise ( [39,Section 3]). We will elaborate on this issue later in Sections 2.2 and 2.3.

Contributions and Organization
This paper details the challenges in the estimation of parameters determining cointegration in a time series model. We will present some recent developments on simulation based Bayesian inference for such models and therefore the paper serves as a combination of both a survey on particular aspects of Bayesian estimation and Markov chain Monte Carlo (MCMC) methods for cointegration modelling as well as presenting several novel developments in this setting.
In terms of the Monte Carlo sampler we consider, we note that we have confined our presentation mainly to two MCMC simulation approaches: the Gibbs sampler of [41] and the GMC method of [3]. These methods originate from different research communities in econometrics and statistics, but we believe it is useful for practitioners to combine these different ideas and demonstrate how they can be incorporated for the estimation of Bayesian cointegration models. In addition to this review material, we also propose extensions by combining ideas from both methods. To the best of our knowledge this is the first application of GMC to cointegrated time series. We believe there is great potential in extending GMC for cointegration, in terms of being able to address non-linear, non-Gaussian models and using different specifications for the priors. In the numerical examples we will make comparisons related to the efficiency and accuracy of the basic methodology and proposed extensions.
The contributions developed in this manuscript can be summarised as follows: • we bridge the gap in the MCMC literature between GMC and Bayesian cointegration models. We provide an explanation of how to define a MCMC sampler for cointegration parameters on a Stiefel manifold and in particular to define the class of Hamiltonian Monte Carlo (HMC) and Geodesic Monte Carlo (GMC) samplers in this context.; • we develop a cointegration model extension that incorporates into the Vector Error Correction Model (VECM) the ability to work with more flexible multivariate student t errors and to develop a Bayesian model in such a context. Sampling from the resulting generalised VECM Bayesian model is then achieved in Gaussian error and the Student-t error cases via efficient samplers. In the case of the Student-t errors they exploit the scale mixture structure of a random vector with Student-t law.; • we provide a novel algorithm which we demonstrate equally efficient as current state-of-the art samplers and furthermore can be applied in more generic scenarios for the assumed VECM driving noise random vectors distribution.; • we consider Singular Value Decomposition (SVD) representations and priors on parameters. These are useful from a financial applications perspective to determine jointly the mean reversion and the projection basis parameters. In addition it enables us to apply random scan Gibbs sampling, which can be used to save computation effort by skipping the computation of certain geodesics for the Stiefel manifold.
Details of these contributions will be made explicit throughout the manuscript.
The organization of this paper is as follows: Section 2 introduces the error correction model (ECM) representation that is widely used in cointegration analysis and formulates the Bayesian inference problem. Then in Section 3 we present MCMC simulation techniques. In Section 4, we discuss the issue of cointegration space point estimation and in Section 5 we present various simulation studies. In Section 6 we conclude with a brief discussion.

Notation and Background Ancillary Material
This preliminary section serves the purpose of both introducing notations that we will adopt throughout the manuscript and also providing for a general statistical and econometric audience with a brief set of basic principles and background for key quantities upon which the results of this paper are based. Though mildly technical, it is useful as we focus on a particular sub-set of results that pertain directly to the relevant background in forming our model developments. We believe it will be useful to present this basic background as in the statistics literature there is still a need to link some of the concepts starting to be adopted to these general well known results that arise from topology, algebraic topology and measure theory for practitioners. With this objective in mind, we only really present a core set of key results required in this regard.
We denote a Gaussian random (n×T ) matrix by Y ∼ N n,T (µ, Σ, Ψ) with row dependence in (n×n) covariance matrix Σ and column dependence in (T × T ) matrix Ψ which shall be understood as the covariance between the respective rows or columns of Y . N (µ, Σ) will be a multivariate normal with mean vector µ ∈ R n×1 and covariance matrix Σ ∈ R n×n ; I r is an identity matrix of dimension r × r.
We say that a time series y t is integrated of order 1 and denote it as I(1) if ∆y t is weakly stationary process, i.e. integrated of order 0, I(0). Here we assume the weak notion of stationarity, i.e E(∆y t ) = g, Cov(∆y t , ∆y t−h ) = m(h), where ∆y t = y t − y t−1 and y t is an I(1) time series. For a time series sequence y 1 , y 2 , ..., y T , we will also use the concise notation y 1:T .
V ec(A) denotes the matrix vectorization operator which transforms a matrix A into a column vector in which columns of A are successively stacked. Furthermore, we denote the Kronecker product or tensor product between two matrices by ⊗ and Kronecker sum as ⊕. The space spanned by the columns of any n × r matrix A is denoted as col(A); if A is of full column rank r < n, then A ⊥ denotes n × (n − r) matrix of full column rank satisfying A T ⊥ A = 0. For any square matrix, A, ||A|| F is a Frobenius norm, ||A|| 2 = tr{A T A}, and ρ(A) its spectral radius (that is, the maximal absolute value of the eigenvalues of A). The cardinality of a set B is denoted by |B| and ∇[·] denotes a matrix/vector of partial derivatives of an appropriately defined function.
The next two sections provide core background mathematical details assumed throughout the paper, as such the following two more technical sections in the manuscript may be skipped by practioners as they are not required for applications of our method.
We include these two sections in order to make sure all concepts are clearly and accurately defined and presented from a mathematical perspective. It also serves as a more technical reference to some concepts that are applied in later sections of the manuscript but are often not explained or detailed carefully in other MCMC literature on this topic. In general, in this manuscript we will refer to M as a differentiable manifold of dimension n. We may then define a Riemannian metric G which for every point on the manifold q ∈ M defines the scalar product of tangent vectors in the tangent space, denoted by T q M, smoothly depending on the point q. This means that in every co-ordinate system (x 1 , . . . , x n ) a metric G = g ik dx i dx j is defined by a matrix valued smooth function g ik (x) for i, k ∈ {1, . . . , n} such that for any two vectors A, B the i-th component is given by, which are tangent to the manifold M at the point q with co-ordinates x = (x 1 , . . . , x n ) i.e. for A, B ∈ T q M the scalar product is equal to where the metric will satisfy that Hence, we can always consider a Riemannian metric on M as a family of (positive definite) inner products generically denoted by such that, for all differentiable vector fields X,Y on M one has, which will define a smooth function between M and R. Then we can be sure that when endowed with this metric G, we may consider the differentiable manifold (M, G) as a Riemannian manifold. Furthermore, the ideas we present in this manuscript will in general be restricted to a sub-space topology in R n×d which will correspond to a choice of M given by the compact Stiefel manifold that we shall denote more specifically in this case by the notation V n,r := {V ∈ R n×r : V T V = I r } n ≥ r, and when n = r one obtains the orthogonal group.
We will make more explicit reference to the form of G in the case of a Stiefel manifold (for cointegration model settings) in latter sections. We note that it is well known that the Stiefel manifold will become a Riemannian manifold by introducing an inner product in its tangent spaces. In this regard, we have two natural choices that can be considered for the inner products for tangent spaces of Stiefel manifolds: the Euclidean inner product and the canonical inner product. The choice adopted can affect the computational efficiency of the resulting GMC sampler we design and so we explicitly explain this detail in a latter section. Next we briefly provide a remark on defining curves. Remark 1. For practitioners the aforementioned material, though important for formal setup of our problem may appear a little abstracted. To aid in gaining some basic intuition for this in order to later set up of the Haussdorf measure structure, we first note the following basic properties for measuring distances on a curve embedded on a Riemannian manifold. To provide such basic intuition for aspects of these quantities defined above it will suffice to introduce a simple illustrative example to calculate relevant quantities explicitly. Consider a curve γ : x i = x i (t) for i ∈ {1, . . . , n} with a ≤ t ≤ b on a Riemannian manifold (M, G). This may be a path one wish to follow when proposing a state dynamic in a MCMC sampler for instance. Now, at every point of the curve the velocity vetor (tangent vector) is given by v(t) = dx 1 dt , . . . , dx n dt . Then the length of the velocity vector V ∈ T x M at location x on the manifold M is given by Then the length of a curve is defined by the integral of the length of velocity vector given by Intuitively, these concepts can then be extended to define the unit of space measure change from Lebesgue measure to Hausdorff measure presented later in (12) when working with particular Riemannian manifolds, such as the case we will develop for Stiefel manifolds.
Now we can consider the formal definition of a geodesic curve on a Riemannian manifold. A geodesic is a curve γ : [t 1 , t 2 ] → M which is intuitively considered as a trajectory of motion for a point particle without external force, travelling with constant speed. Locally, the geodesic gives the shortest path that will connect two points and it is therefore naturally related to the Riemannian metric and generally can be described by the Euler-Lagrange equations as the solution to the variational problem given by δ 1 2 L γ = 0 where L γ is given by (7) such that a = t 1 and b = t 2 and the solution geodesic curve γ satisfies appropriate initial conditions, see details in [51].
Finally, we complete this section with some fundamentals that allow us to define accurately a distribution on general metric spaces that also applies in our Stiefel manifold setting. This is important, as in later sections of the paper we develop concepts of Markov chain Monte Carlo for target distributions with support on a manifold, with particular application to Stiefel topological structures that arise from the class of cointegration models we study.

Distributions on Manifolds: the Hausdorff Measure on a Metric Space and its Representative Form for
Riemannian Manifolds.
Consider any metric space denoted generically in this stand alone ancillary sub-section by (X, G). Then for any subset of this space S ⊂ X one may define the diameter according to the metric G as follows: diam S := sup{G(x, y)|x, y ∈ S}, diam ∅ := 0.
If we then consider any subset S ⊂ X and real constants δ > 0 and d ≥ 0, we may construct all countable covers of S by sets satisfying U i ⊂ X and diamU i < δ that can then be used to define the metric outer measure as the infimum over all countable covers given by One can then define the limiting metric outer measure according to This particular definition of the Hausdorff measures is just a special case of a more general construction due to Caratheodory. In this context, if we consider measurable sets in a Caratheodory sense, such that they satisfy Caratheodory's criterion for the Lebesgue outer measure on R n denoted by λ, see further details in [13]. That is, we consider sets E ⊆ R n satisfying that for Lebesgue outer measure, the sets E will be Lebesgue measurable if and only if for every A ⊆ R n , where A is not required to be a measurable set itself. Then it will make sense to consider the restriction of this metric outer measure to the σ-field of Caratheodory-measurable sets and the result will be a well defined measure which is generically referred to as the d-dimensional Hausdorff measure of S. Consequently, as a result, the properties of the metric outer measure in this context will indeed ensure that all Borel subsets of X are H d measurable. It is important to realise that in this set-up the definition of covering sets is somewhat arbitrary. With this in mind one could of course describe different general forms of Haussdorf measures by considering different restrictions on the class of admissible coverings, see details in [47] and [58]. For instance one can use coverings by balls in such case the resulting outer measure is often referred to as the spherical Hausdorff measure or by cylinders and one obtains the cylindrical Hausdorff measure. In this manuscript we will consider the case preferred in [3] corresponding to open sets of X.
Although brief, this basic background suffices for our applications in this manuscript and we may now consider the crux of the formulation relating to developing a probability measure on a manifold with a well defined metric. In particular we may now refer to a family of probability measures that can be defined on the Riemannian manifold via what is known as the Haussdorff outer measure (henceforth measure) see [26]. For the remainder of the manuscript, unless otherwise specified all densities are w.r.t the Lebesgue measure L(·) on the appropriate space and H (·) denotes Hausdorff measure on an appropriate manifold, as detailed above. Now, if we consider a Riemannian manifold, the Hausdorff measure is related to the Lebesque measure according to the expression: which is a Lebesque measure scaled by the volume element G(M ), where G is a Riemannian metric on the manifold.

Bayesian Inference for Cointegrated Time Series
A range of different parametrizations and model constructions have previously been proposed in the literature on cointegration time series modelling, these include: the Triangular form (see [56]); the Common Trends representations (see [59]); and the class of Vector Error Correction Models (VECM) models.
In this manuscript we will focus on the class of VECM model structures. Throughout all this section, we assume that the cointegration rank r is fixed and known. In fact, it is a crucial parameter and needs to be estimated. A brief review on the rank estimation techniques is presented subsequently in Section 4.3.

Vector Error Correction Models for Cointegration
In this section we present one of the most widely utilised characterization of multivariate time series models for cointegration settings. This is based on cointegration time series models that are based on Vector Autoregressive (VAR) structures. Such models have been widely studied in the Econometrics literature, see [10] and [65]. A popular representation is the Error Correction Model (ECM), see [20], [63] and the overview of [39]. Due to the substantial popularity of this particular form of cointegration model specification, the remainder of manuscript will focus on this framework in a Bayesian model estimation context. We will first present the structural form of the ECM form of cointegration.
An ECM is written based on a VAR process of order p, as follows: where y t ∈ R n denote observed returns, Π ∈ R n×n is the long-run multiplier matrix, Γ i ∈ R n×n the i-th lag matrix and D t ∈ R n is an exogenous covariate for the observation y t . The appeal of the ECM formulation is that it combines flexibility in dynamic specification with desirable long-run properties; see [7] for a discussion.
The cointegration properties of the ECM depend on the rank r of the long-run multiplier matrix Π. If r = 0, then the ECM does not exhibit any cointegration relationship and it can be estimated as a stationary process in first differences. If r = n, that is, if the matrix Π is of full rank, then the V AR model itself is stationary and can therefore be estimated via standard stationary process techniques in multivariate settings. If, however, the rank r is intermediate, 0 < r < n, then the ECM process exhibits cointegration.
In the context of cointegration, we can write the matrix Π as the product Π = αβ T where both α,β ∈ R n×r are full rank matrices. The r columns of the matrix β are the cointegrating vectors of the process. In addition, β T y t reflects common trends while α contains their loading factors. For simplicity, we will consider a simplified model, where y t is marginally integrated of order 1, I(1), with r linear cointegration relationships and r is assumed to be known. The observation equation is given by: where t = 1, ..., T and t form an i.i.d. noise sequence. The most commonly used case for the noise distribution is t ∼ N (0, R), however in this paper we will also consider multivariate In what follows more types of noises can be treated similarly, as long as the corresponding densities can be evaluated point-wise up to a proportionality constant and the log densities are differentiable.
Remark 2. It is important to note that the decomposition of the long-run multiplier matrix Π into αβ T (and hence the cointegration relations) are not unique. In fact, for every non-singular matrix Q ∈ R r×r , we can define α = αQ T and β = βQ −1 and get Π = α β T .
If cointegration exists, the ECM representation will generate better forecasts than the corresponding representation in first-differenced form, particularly over medium and long-run horizons. Indeed, under the cointegration property, z t in (1) will maintain a finite forecast error variance, whereas other linear combinations of the forecasts of the individual series in y t could have increasing variance, see [11] for examples.
In the remainder of the paper, we focus on the ECM characterization in the simple form of (13). This model will be sufficient to demonstrate estimation properties related to the cointegration rank and basis. For simplicity we do not include autoregressive lags, however we note that analogously to [71], they can be incorporated in the simulation methodology.

Matrix and Vectorized Representations
Different matrix or vector representations for the time series model in (13) will be used throughout. These expressions will prove particularly useful (later in Section 3) for performing likelihood evaluations, and manipulating posterior densities and their gradients. One possibility is to write (13) compactly as: with Y = [∆y 1 , ∆y 2 , ..., ∆y T ], Z = [y 0 , y 1 , ..., y T −1 ], E = [ 1 , ..., T ]. In addition, one can also use vectorization operations and treat Y and V ec(Y ) as equivalent random variables and choose the form that is most convenient for implementation. (14) can be expressed as: . Note that this does not hold for the Student-t case as diagonal covariance structure (i.e. no correlation) does not imply cross-sectional independence. Henceforth, we need to resort to the computation of likelihood as a product of independent terms. We will return to this point in Section 2.4.2.

Cointegration Spread Series
We may now define the so called "Spread Series" as z t = β T y t . The dynamics of this process can be trivially derived from the standard ECM representation: Therefore, the spread process under the ECM representation (13) is a r−dimensional V AR(1) process. Note that the necessary spectral radius condition for the stability of the ECM can be written as |ρ(I + β T α)| < 1.

Identification Considerations
In the Bayesian modelling context of cointegration models there have been a number of papers developing frameworks to deal with the lack of identification in the likelihood and studying its influence in the posterior with different classes of priors on α and β. We refer the reader to [39] for a thorough review.
For a fixed record of observations, it is clear that we can find a pair of parameters (α, β) = (α , β ), such that p(y 1 , . . . , y t |α, β) = p(y 1 , . . . , y t |α , β ). This is often referred to as non-identifiability in the literature of point estimation (see [44, pages 24, 57]). This is a general issue that appears often in point estimation or Maximum Likelihood methods and is not unique to cointegration models. Since cointegrating vectors are not unique (see Remark 2), identifying restrictions must be imposed to allow their estimation. This can be achieved either by imposing normalizations on particular coefficients or by using an eigenvalue-eigenvector method of identification first developed by [1] for the reduced rank regression model and then used by [33,34] for cointegrating ECM.
One standard approach to globally overcome the identification issue illustrated in Remark 2 is to impose constraints in the form of linear normalizations as mentioned earlier: for instance via a non-unique identification constraint of r 2 restrictions as follows This method has been successfully applied in works such as [66], or more recently [52] and [54]. The implementation of linear restrictions is based on the prior knowledge of which r rows of β will be linearly independent. More generally, one can partition β = (β T 1 , β T 2 ) T , where β 1 ∈ R r×r and impose the normalization by choosing a matrix Q such that Qβ is invertible; then use β(Qβ) −1 instead of β.
A challenge with this approach is that inappropriately specified Q may lead to Qβ being singular ([39, Section 3]). Furthermore, it might also restrict the number of important models to be considered in the cointegration analysis. A second issue that may occur with such approaches is that at the regime where α is close to 0, β does not enter the model ( [38]). This results in a local non-identification and consequently an improper posterior under a diffuse prior for β in the Bayesian setting.
Choosing priors for cointegration is widely studied topic; see [39, Section 4] for a review. In this paper we we will follow a popular approach to dealing with identification issues by imposing priors directly on the cointegration space, which assumes a distribution on a Grassman manifold. This is discussed in [57] and [62], both of which are related to earlier work of [45].

Background on Bayesian Approaches on Manifolds in Cointegration Contexts
In [60], a Bayesian inference procedure is presented which allows for unconditional inference on the structural features of VAR process. The novelty of this paper was the development of a probability measure on a Grassman manifold to elicit uniform priors on subspaces defined by particular structural features of VARs. [63] developed similar uninformative and informative priors for the cointegrating space, which allows one to develop sampling schemes such as MCMC or approximations schemes such as Laplace approximations to perform numerical integration with respect to such cointegration Bayesian models when undertaking estimation.
Furthermore, these authors provide careful elicitation of the prior distribution on the model coefficients from a prior on the cointegrating space. They also provide an outline of the identification restrictions which will then naturally arise or be implied by their specification of model structure. In [71], a Bayesian reference prior is presented with the property of distributing its probability mass uniformly over all cointegration spaces for a given rank and which is invariant to the choice of normalizing variables for the cointegration vectors. Several methods for computing the posterior distribution of the model parameters conditional on the cointegration rank, r, are proposed, whereby all inferences are determined from approximate samples of the posterior.
Following these works, there has been a shift of paradigm from direct prior specifications on the parameter space to the priors on the space spanned by the cointegrating vectors ( [42,43,71]) or [61] for some recent extensions. This approach has benefited from efficient posterior simulation methods, such as the Gibbs sampler [41], which will be presented in more detail in Section 3.1.
Given the recent interest in this re-specification of the class of Bayesian cointegration models, we will focus the remainder of the paper on these new classes of model specification. That is, we will follow this approach of modelling priors on the cointegrating space under the ECM. The cointegration space (or equivalently the span of β, col(β)) is a r-dimensional hyperplane in a n-dimensional space. The identification restriction β T β = I r can be used, where r is the cointegration rank. This restricts β to belong to the following Stiefel manifold: which is compact, so the uniform distribution is a proper prior.
Remark 3. In the linear normalization case with β = [I r , β * ], if β * follows matrix variate t−distribution, then col(β) has uniform distribution on the Grassman manifold. A drawback of this prior choice is that the second and higher moments of the posterior distribution do not exist for r > 1; see [39, Section 5.1] for details.

Prior Model Consideration
In this manuscript we are particular interested in the cointegration parameters (α, β), but we note that of-course the ECM contains additional parameters, θ, such as the parameters related to t . We will use a Gibbs sampling approach alternating simulation of p(α, β|y 1:T , θ) and p(θ|y 1:T , α, β).
When direct simulation is possible this will lead to a standard Gibbs sampler, but when this is not possible one could use instead a small number of iterations from MCMC sampler kernels invariant to p(α, β|y 1:T , θ) and p(θ|y 1:T , α, β) respectively. We will present below independent priors between (α, β) and θ that will be used later for comparisons between different MCMC algorithms.

Priors for α, β
We will choose the prior distribution for β ∈ V n,r to be the matrix angular central Gaussian distribution defined over the Stiefel manifold with respect to the Hausdorff measure: where , with H ∈ V n,r acts as a hyper parameter on the cointegration space. In this class of priors, P τ determines the central location of the distribution on col(β), which in this case is col(H) and τ the amount of the dispersion around the central location. If τ = 1, then P τ = I n and (16) defines uniform prior on the manifold. In turn, the value of hyperparameter τ = 0, expresses prior assigning the cointegration space to be col(H). Note that one can introduce a further hierarchical structure using hyperparameters τ and hyper-priors with support restricted to [0, 1].
For α we will use the shrinkage prior used in [71]: where G is a symmetric positive definite matrix that is often chosen to be the noise covariance matrix R. The shrinkage parameter ν can be fixed or random by defining hierarchical priors. Following, the Bayes rule, the joint posterior density of α, β is given by dp(α, β|y 1: with the density being π(α, β) ∝ p(y 1:T |α, β, θ)p(β)p(α|β). Sampling from p(α, β|y 1:T , θ) is not possible directly, so in Section 3 we present MCMC methods appropriate for this task.

Priors and Conditional Posterior Distributions for the Parameters of t
As presented in the ECM framework, the time series is driven by a i.i.d. white-noise sequence of random vectors. In this paper we consider a two classes of driving error stochastic noises: multivariate Gaussian and multivariate Student-t.
In the Gaussian case, θ = R and standard conjugacy results motivate using an Inverse Wishart distribution for the prior of the covariance matrix R, see details in [52] and [54]. For simplicity in this paper we will use R ∼ W −1 (n + 2, I) for the prior. As a result for the full conditional we have In the Student-t case θ = {R, ω}, since we will express the driving error random vector according to a well known scale mixture form (see examples in [73] and [2]), which lends itself naturally to a sampling scheme based on a standard data augmentation approach.
In this case, one can show that integrating out λ will give t ∼ t ω (0, R).
We can consider then instead the parameterization θ = {R, ω, λ 1:T } and set priors for R, ω and then derive a full conditionals for each R, ω, λ 1:T that should be used in sequence when sampling. Inference for ω whilst possible is challenging and needs careful choice for priors, see [14,15,18] for details. Typically, improper priors are used for ω, but for simplicity we will consider here the case where it is fixed and known.
The methodology presented below can be extended using the priors in [14,15,18], but due to the full conditional being intractable it needs to be replaced with an appropriate MCMC procedure. For R we will use the same prior as in the Gaussian case, so for the full conditionals we get p(R|y 1:T , α, β, λ 1: and p(λ t |y 1:T , α, β, R, ω) = IG( ω + n 2 , ω + y t − I + αβ T y t−1 R −1 y t − I + αβ T y t−1 2 ).

MCMC Approaches to Bayesian Cointegration
In this section we restrict our attention to parameters α, β and present different MCMC approaches for simulating from p(α, β|y 1:T , θ) or other equivalent conditional posterior distributions under different parameterizations. Also we will often drop for simplicity the conditioning on θ in the notation.
Below we present the Gibbs sampler proposed in [41], which can be considered the "state of the art" method for this problem in the Gaussian case. Then we extend the Geodesic Monte Carlo (GMC) sampler of [3] to the ECM Bayesian cointegration model framework. This allows us to significantly generalise the class of models for which we may develop efficient MCMC samplers for that extend well beyond the sampler restrictions to the Gaussian case of [41], whilst maintaining the sampler performance of this state of the art method.
We emphasize that these are not the only options available. The problem of sampling from matrices with rank restrictions, has long attracted interest due to its relevance in principal components analysis and other settings, see discussion in [67]. Recent developments that are relevant to sampling from matrices belonging to a Stiefel manifold are [6,29], where appropriate transformations are combined with column-wise updates in a Gibbs framework. These methods can perform well (see also [3,Section 5]), but in the interest of brevity we will not present them here nor include them in our numerical examples.
Other reasons behind this omission are that the efficiency of the Gibbs sampler in [41] makes the use of columnwise updates in β less desirable in practice and GMC is more generic so has potential to be applied in wider variety of settings.

Gibbs Sampling for Bayesian ECM Posterior Distributions
From a Gibbs sampling perspective the conjugacy of Gaussian distributions is attractive, but it is hard to derive conditionals for β and α (the semi-orthogonality restriction implies that the conditional posterior of β is non-standard).
In [41] the authors exploit instead the polar decomposition ([5, p. 19]), α = Aκ where it is useful to notice that B is unrestricted. The following proposition establishes equivalent priors for (A, B). Lemma 1. Given the hierarchical prior on (α, β) as in (16)-(17). Then the prior for A and B is given by: where The proof can be found in technical appendix of [41].
In Algorithm 1 we present the Gibbs sampler developed in [41] for the Gaussian case when t ∼ N (0, R). To simulate a MCMC transition leaving p(α, κ, B|y 1:T , R) invariant, they alternate sampling between distributions p α and p B defined below in (22)-(23) and update the value for κ deterministically.
In [41] the authors justify the method as a partially collapsed Gibbs sampler, which could explain the resulting efficiency. Given the Lemma 1 and the choice of model Gaussian conjugate priors (16) The respective means and variances of multivariate normals are: and All the technical derivations and based on the simple multivariate Gaussian conjugate properties and can be found in [41]. The algorithm describing this sampler is provided below.
Algorithm 1 One iteration of the Gibbs sampler of [41] for cointegration ECM in (13).

Geodesic and Hamiltonian Monte Carlo
In this section we will begin by explaining how to utilise the formulation of Hamiltonian equations of motion to develop a proposal kernel for a Markov chain sampler that will produce a sampler known in the literature as Hamiltonian Monte Carlo (HMC). We will first explain the development of this sampler in the familiar case of Euclidean space, then we will consider the generalization of this framework to a manifold in a Riemannian metric space, for which we may define and evaluate geodesics to define movements on the manifold and quantify the distance over such geodesics via a well defined Riemannian metric that will allows us to work with distributions on such spaces via the specification of the Hausdorff measure.

Basics of Hamiltonian Monte Carlo (HMC) on Euclidean Space
Hamiltonian dynamics were originally introduced in molecular simulation and later used within a MCMC framework in [8] leading to the so-called Hybrid or Hamiltonian Monte Carlo, which was popularized in Statistics in [48] and [49]. Here we present a very short summary of the method. Following the standard HMC convention, let q be the variable we wish to infer (in our case it is (α, β)). HMC is a a MCMC method to sample from a posterior distribution with density π(q) by introducing an auxiliary variable, p called momentum variable, and then targeting the joint distribution: where P(q) is a positive definite matrix. The log of this joint posterior is interpreted as a the Hamiltonian function, that is a sum of a potential and kinetic energy function given by − log π(q) and 1 2 p T P(q) −1 p respectively. The advantage of this interpretation is that one can design proposals within a MCMC algorithm using artificial Hamiltonian dynamics with respect to a fictitious time τ : When this system of motion equations can be solved exactly (27) the resulting solution will leave H invariant. Furthermore, the solution will possess interesting properties such as volume preservation and time reversibility. These two particular properties make this dynamic framework particularly relevant for MCMC sampling methods as these properties will guarantee the resulting constructed Markov chain will satisfy detailed balance when used within MCMC, see [49] for details. The problem is that typically one cannot solve the system of equations for the Hamiltonian dynamics in (27) exactly. Consequently, one typically resorts to the use a numerical integration method to find a solution.
Consquently, H will no longer be invariant. However, in the context of MCMC sampler proposal design, this can be overcome through the use of a Metropolis accept-reject correction step.
Let (q , p ) be the numerical solution of equation (27) after some chosen time and starting from (q, p), then this numerical solution will be accepted as the new proposed movement for the Markov chain state via a Metropolis-Hastings Accept-Reject probability with acceptance probability min(1, exp(−H(q , p ) + H(q, p)), otherwise the proposed numerical solution movement according to the Hamiltonian motion is rejected and the Markov chain remains at state (q, p).
It is important to note that in order for this MCMC scheme to preserve detailed balance the numerical integration method needs to be time reversible and volume preserving (or symplectic). Different choices of integration method may be considered to achieve this objective such as: the Newmark-beta method [50], Stormers method [22], Verlet's method [69], the Velocity Verlet method or the closely related leapfrog integration framework, see discussions in surveys such as [64].
The most popular of these in MCMC applications involves the class of leapfrog integration methods. A leapfrog integration method is a numerical approach to integrating differential equations of the formẍ = d 2 x dt 2 = f (x) such as the system of Hamiltonian dynamics described in (27). The approach of leapfrog integration takes its name from the fact that the method interleaves or alternates between two solution steps: an updating of positions x(t) and then an updating of velocitiesẋ(t) at interleaved time points. These points of update are set-up in such a manner that the solution systematically "leapfrogs" over each previous solution in pursuit of the next solution point.
Note, in particular, when P is a constant matrix (i.e. not a function of q) it is common in MCMC literature to utilise the leapfrog method and then optimize performance by tuning the choice of P, the number of steps and step size of the leapfrog integration. If however, we consider the more general context in which P is nolonger a constant and in particular we consider P(q) to vary with q we may construct a more general framework for MCMC sampler design in this class of methods, which becomes in some sense locally adaptive. That is, taking into account local behaviour of π brings more flexibility in terms of tuning and is advantageous from a performance point of view; see [19] for a review.
In this case numerical integration requires using splitting techniques ( [23]) that treat the potential and kinetic parts of H as separate Hamiltonians. Let H 1 = − log π(q) and H 2 = 1 2 p T P(q) −1 p. The Hamiltonian equations for H 1 areq = 0,ṗ = −∇ q H 1 (29) and can be solved exactly: Then if there is a symplectic integrator or numerical solver foṙ then one could use the two integrators together in a time-symmetric manner to generate MCMC proposals. Starting from (q τ , p τ ) a single iteration of the composed integrator could be performed as follows: • Compute q τ + 2 = q τ , p τ + 2 = p τ + 2 ∇ q log π(q)| q=qτ , • Solve (30) for a time interval equal to starting with initial condition being (q τ + 2 , p τ + 2 ) to get This could be iterated L times and then one would need to apply a Metropolis accept-reject step to (q τ +2L , p τ +2L ) compared to (q τ , p τ ) as before with a similar acceptance ratio min(1, exp(−H(q τ +2L , p τ +2L , q τ +2L , p τ +2L ) + H(q τ , p τ )).
Remark 5. The presentation so far does not make any reference to q being a point on a manifold except when defining π w.r.t the Hausdorff measure. [19] provide a detailed review for this case when P(q) −1 is a Riemannian metric tensor and discuss on how the Hamiltonian equations in (30) should be solved in detail. In this case, H 2 defines a metric and one can use co-geodesic flows (q τ , p τ ) withq = P(q) −1 p that follow a trajectory that leaves H 2 constant.
Remark 6. If q = (α, B) the variable of interest is not defined on a manifold, a few iterations of the above HMC procedure can be applied within Algorithm 1, each time for π being p α (·|β, y 1:T ) and p B (·|A, y 1:T ). This can be useful when direct Gibbs Sampling is not possible, e.g. when t is not Gaussian.

Geodesic Monte Carlo: Extending HMC to Connected Riemannian Manifolds
In this section we work with the fact that a connected Riemannian manifold carries the structure of a metric space whose distance function is the arc length of a minimizing geodesic. For brevity we note that the arc length is defined for a simple illustration in the preliminaries section in Remark 7 and the geodesic path generally in Section 1.2.1. In particular, we will now concentrate on a class of Monte Carlo sampler, aptly named for the fact that it exploits the aforementioned property of traversing geodesics, and is known as GMC (see [3]). This is nothing more than a Monte Carlo sampler in which the path follows a geodesic flow, such as those defined in earlier optimization contexts in [51] and the references contained therein.
In the context of Monte Carlo methods the GMC framework extends earlier works by authors such as [51] and [19] and HMC for simulation from a distribution defined on a manifold. The idea of GMC is to develop a sampler for distributions defined on a manifold, that will traverse the mass of the distribution via paths through the support of the distribution on the manifold defined via shortest distance paths known as geodesics. In this way, the Markov chain should traverse efficiently by following geodesic paths on the surface of the manifold around the target mass of the distribution on the manifold. The intended purpose of such a construction is to efficiently explore the support of the target distribution defined on the manifold via a Markov chain constructed to move along geodesics on the manifold.
More concretely, let M be a manifold q ∈ M and suppose π is the density of interest w.r.t the to the Hausdorff measure. One can define the Hamiltonian as before H(q, p) = − log π(q) + 1 2 p T P(q) −1 p, and then one needs to choose P and design Hamiltonian flows across M through its tangent space, as defined in the preliminary notations in Section 1.2. To achieve this [3] propose working with special cases that can be formed by embeddings of M in a Euclidean space.
In this case, they assume that one can obtain a bijective map ξ : M → R n that maps every open set of M to an open set in R n and let x = ξ(q). Note these covering sets need not be open sets, but it suffices to restrict to this case.
The embedding proves very useful in characterizing the tangent space at a point q, namely T q and choosing a convenient P, see the example and construct of such a tangent as explained with an illustration in Section 1.
and one can write Hamiltonian dynamics equations for H 1 and H 2 as before. Noting that ∇ x = M T q ∇ q , the solution flow in the embedded phase space corresponding to H 1 is given by As a result the evolution of the velocity term v t can be described only using q t ∈ M. This is an important observation as we have effectively re-parameterised again the problem and work with (q, v = P q p). The solution of the Hamiltonian equations corresponding to H 2 is given by the geodesic flow on the manifold leaving H 2 constant when starting at q 0 with the initial velocity of v 0 . In some cases, this can be written explicitly and more details on this be found in [3].
When this geodesic flow is available, GMC uses a proposal (within MCMC) that is constructed starting from a given q 0 by simulating a new value for v 0 ∼ N (0, P q ) (due to p ∼ N (0, P(q))) and then iterating the following steps: • Compute q 2 = q 0 , v /2 = v 0 + 2 P q0 ∇ q log π(q)| q=q0 . • Solve (q τ , v τ ) according to an appropriate geodesic flow for H 2 for an interval starting with initial condition being (q 2 , v 2 ) leading to (q 3 2 , v 3 2 ).
A Metropolis accept-reject step should then follow as usual.

GMC for Stiefel Manifolds
In order to implement GMC we need to be able to compute the following key quantities: (1) the projection P; (2) the log densities log π together with its gradients; and (3) the geodesic flows related to H 2 . Recall in our previous specification we assumed the existance of a bijective map ξ : M → R n that maps every open set of M to an open set in R n and let x = ξ(q).

Remark 7.
A key observation in this section is that as far as P is concerned often one is able to compute it without requiring knowledge of ξ. Whilst embedding theorems guarantee existence of ξ for smooth or Riemannian manifolds, it is in general unknown and non-trivial to find such mappings. Fortunately in many cases, such as Stiefel manifolds, P can be described explicitly as I − U U T where U is an orthonormal basis of the normal to the tangent space, so knowledge of ξ is not required.
To understand this remark more concretely, we must first recall our previous remark that the Stiefel manifold becomes a Riemannian manifold by introducing an inner product in its tangent spaces. In the context in which we work in this section, it is convenient to work with the choice of metric constructed from the "cannonical inner product", rather than the Euclidean inner product when considering the Riemannian structure of the Stiefel Manifold. Then one may obtain the results discussed in [3].
Precisely, the result can be developed by explicitly utilising the selection of the canonical inner product on the tangent space to define the Riemannian embedding of the Stiefel manifold. In this case, the canonical inner product will "weigh" the coordinates equally, where as explained in [32], the concept of the canonical inner product is to attempt to find a matrix A of some tangent vector say Z and weigh it by 1/2 in the inner product, something akin to what is achieved when the Euclidean inner product is adopted. This is performed in general for a Stiefel manifold by considering first representing the tangent vector by where the matrix A = X T Z and one may write XA = XX T Z. Consequently this produces (I − 1 2 XX T )Z = Z − 1 2 XX T Z = XA + X ⊥ B − 1 2 XA = 1 2 XA + X ⊥ B and one obtains after some substitutions of the above identities the form given by which allows one to then formally define the cannonical inner product according to the expression and the resulting canonical metric Z, Z c which is sometimes referred to as the Killing metric in the case of the orthogonal group (when n = r for the Stiefel manifold). Clearly, this metric product will satisfy the conditions required in the preliminary material in Section 1.2 for properties of metrics required for the set-up. Recall our goal is to be able to define flows on the Stiefel manifold. To achieve this we also need to consider the representation for differentials on this space. For instance, consider X∈ V n,r for a Stiefel manifold and consider a function F from R n×r to R and matrices X, Z ∈ R n×r . Now, denote the differential of F by DF X which is the derivative of F in the Z direction at X and is given by where we denote by matrix D = ∂F ∂Xi,j ∈ R n×r . Hence, given a point X on the Stiefel manifold X∈ V n,r the resulting differential DF X is a linear functional on the tangent space T X ∈ V n,r . In this case, when working with the cannonical inner product, one can consider a vector AX and the resulting action of DF X on the tangent space T X ∈ V n,r is represented by choice A = (DX T − XD T ), see a proof of this result in [32]. In general one can then refer to the vector AX = (DX T − XD T )X by the more classical vector calculus type notation given by ∇ c F , where this choice of notation simply provides an analogy to suggest that it is the gradient of F under the canonical metric. Note, in this construction the matrix A is a skew symmetric n × n matrix.
From this result, we may now return back to the application in our context to construct the GMC framework for the sampler of our Bayesian cointegration model. Hence, we can now be precise when we refer to the Stiefel manifold case. If we let X∈ V n,r be a matrix satisfying the required conditions, i.e X T X = I. Then the above results tell as the following regarding the projection, for an arbitrary matrix W ∈ R n×r , the projection onto V n,r at X which can be represented by see additional discussion in [3].
In addition, given this structure we can then also obtain in this case an explicit formulae for the geodesic flows on V n,r which are given by the following expression where D = X(τ ) T v X (τ ) is constant over the geodesic and S(0) = v X (0) T v X (0); see [9] and [51] for explicit details. So for a given (differentiable w.r.t X) density π(X) one can implement GMC as described in Algorithm 2.
Algorithm 2 One iteration of the GMC of [3] for sampling from a π(X) with X∈ V n,r .
(1) Sample v X ∼ N (0, I n·r ) and apply projection at for τ = 1, ..., T do: (a) Compute v X ← v X + 2 ∇ X log π and apply projection v X ← P X (v X ).

GMC for Cointegration Parameters
To avoid identifiability issues we have restricted β to lie on a Stiefel manifold and follow β T β = I r . We have already presented how to implement GMC for β but have not looked at the densities and their gradients. The other variable of interest, α ∈ R n×r , is unrestricted, which can be interpreted as an element of Euclidean space or flat manifold. In this case, there is no need to perform any projections ξ = I and the geodesic flows are just straight lines: see Section 4.4 of [3] for more details on extending GMC for target distributions products of manifolds. Given expressions for the densities π(α, β) are available it is possible to proceed with GMC implementations targeting the cointegration parameters. We will look at two particular cases below.

GMC Targetting Jointly α, β
When one is interested in sampling directly from p(α, β|y 1:T , R) in (18), then the variable of interest (α, β) lies on the Cartesian product of a Euclidean space and a Stiefel manifold, so (α, β) ∈ R n×r × V n,r , which is itself an embedded manifold. Geodesics (and tangent vectors) on R n×r × V n,r are simply the Cartesian product of the geodesics (and tangent vectors resp.) on R n×r and V n,r and similarly for the orthogonal projections P. A detailed description of the GMC algorithm for p(α, β|y 1:T , R) is presented in supplementary material found in an earlier version of this paper, [46], together with derivations for the gradients for Gaussian t .

Using GMC within a Gibbs Sampler Approach
To pursue greater efficiency than targetting jointly (α, β), we propose using GMC or HMC samplers within a Gibbs approach and perform an HMC sampler for p(α|β, y 1:T ) and GMC for p(β|α, y 1:T ). This is a fairly straightforward extension to what has been presented so far, which in Section 5 shows very good performance.
One could aim to extend using samplers within a Gibbs approach for different parameterizations of Π. The motivation comes from trying to extend the efficiency found in [41] for the cases where conjugate sampling from full conditionals is not possible. The latter can be potentially replaced with HMC or GMC samplers invariant to them. The key in [41] was the particular parameterization used and a partial collapsing step for κ. One can explore various parameterizations, but we emphasize that when a Metropolis sampler replaces a full conditional in a partially collapsed Gibbs sampler, care must be taken, otherwise the algorithm might not be valid anymore; see [68] for details. For example, when plain HMC updates are used in Algorithm 1 instead of direct sampling from p(α|β, y 1:T ) and p(B|A, y 1:T ) then the algorithm is not valid and in numerical results not shown here we confirmed that such an approach does not converge to the right stationary distribution.
We will consider using a singular value decomposition (SVD) for Π ( [5, p. 20]). Let Π = USV T , then β enters the model indirectly through V. A nice feature of SVD is that the singular values can be viewed as mean reversion parameters and can be used to influence the rank when choosing priors. Such priors have been proposed in [37,38], so when combined with priors on the spaces of matrices U, V, then GMC samplers can be particularly useful. In addition, to ensure that this singular value parameterization is unique ([5, p. 20]), we will restrict U to have all its first row elements positive (and denote the corresponding space asṼ n,n ). The parameterization from the SVD should follow: U ∈Ṽ n,n , S = diag(s 1 , . . . , s r , 0, . . . , 0) with s 1 > s 2 > . . . > s r > 0, and V := [β, β ⊥ ] ∈ V n,n .
A MCMC within Gibbs method for this parameterisation will alternate samplers targetting p(β|S, V, y 1:T ), p(U|S, β, y 1:T ) and p(S|U, β, V, y 1:T ). The first two conditionals can be sampled using GMC and for the latter any choice of MCMC sampler on p(s 1 , . . . , s r |U, S, β, y 1:T ) can be used. A single iteration of the algorithm is presented in Algorithm 3, which outputsΠ =Ũdiag(s 1 , . . . ,s r )β T . In step 2 (a) we present a sign-flipping method that ensures that the same unique SVD parameterization is preserved and U has only positive elements in the first row. For the particular implementation in Algorithm 3 this step is optional as the conditional U|S, β, y 1:T does not require a unique parameterization: U is unique once S, β are fixed. We include this step as it can be useful in various extensions, such as a random scan Gibbs sampler. Similarly, the unique SVD decomposition requires s 1 > s 2 > . . . > s r > 0, which in Algorithm 3 is again optional, but if required can be imposed by either the MCMC proposals or by reordering of columns of U, S, V.
Algorithm 3 One iteration of the GMC within Gibbs sampler for cointegration ECM.
(1) Iterate Algorithm 2 for π(β) ∝ p(β|U, V, y 1:T ) to getβ (2) Iterate Algorithm 2 for π(U) ∝ p(U|Ṽ, S,β, y 1:T ) to getŨ. Remark 8. SVD can be employed also only for α instead of Π. Using then α = USV T β T requires using an extra (redundant) parameter, but this can be useful for comparing with other methods or parameterizations, when the priors are specified for α, β and one can derive appropriate prior distributions for U, S, V. We will perform such comparisons later in Section 5.
Remark 9. Another approach could consider the polar decomposition for α, whereby Π = Aκ 1/2 β T . Here κ takes values in the space of positive semidefinite matrices, and A, β ∈ V n,r . A GMC within Gibbs approach could alternate a small number of iterations between GMC samplers targetting p(β|A, κ, y 1:T ), p(κ|β, A, y 1:T ) and p(A|β, κ, y 1:T ). The GMC sampler for p(κ|β, A, y 1:T ) can be designed using a projection P based on eigendecompositions or optimization (see [28,Section 4.2.3]) and geodesics found in [31] together with a similar GMC approach. In the interest of brevity we will not pursue this further.

Practical Estimation of Cointegration Space
In this section, we describe some additional details on estimation methods for the basis of cointegration space, such as how to assess similarity of estimated cointegration spaces and rank estimation. Whilst our main motivation comes from using MCMC in a Bayesian framework, the tools presented here can be used when comparing point estimates from different methods.

Measure of Cointegration Similarity
Suppose we have two estimates for the cointegration matrix, β 1 and β 2 . As we will require a distance for comparison diagnostics, we will use the approach of [43]. Suppose we can decompose β 2 as follows: where β 1⊥ is the orthogonal complement of β 1 , γ 1 ∈ R r×r and γ 2 ∈ R (n−r)×r . These matrices can be explicitly written as γ 1 = β T 1 β 2 and γ 2 = β T 1⊥ β 2 . A distance measure between β 1 and β 2 is Note that a distance between β 1 and β 2 is equivalent to measuring a dissimilarity of col(β 1 ) and col(β 2 ). Also we have, d s (β 1 , β 2 ) ≤ min (r, (n − r)) , which is useful for interpretation.

Bayesian Point Estimation
Obtaining point estimates of the cointegration space needs attention when the posterior distribution is defined on a Stiefel manifold. Following [72], we use the Frobenius norm as the loss function, where β and β * are semi-orthogonal. To provide a Bayesian point estimate of the posterior distribution for col(β), we use the Posterior Mean Cointegration Space Estimator (PMCS) proposed in in [72]. The PMCS estimator is defined asβ def = arg miñ β∈Vn,r E[ l(β,β) y 1:T ].
In [72], it was showed that the PMCS estimator can be computed aŝ where v i is the eigenvector of E[ ββ T y 1:T ] corresponding to the i-th largest eigenvalue. Given the Stiefel manifold is a compact space, all finite moments of the elements of β exist in the orthonormal normalization, which implies existence of E[ ββ T y 1:T ]. A closed form analytic expression for E[ ββ T y 1:T ] is not available, so we resort to MCMC to estimate it. After N iterations of a MCMC procedure one can use 1

Measure of Posterior Variation
As a tool to assess the variation of posterior cointegration space distribution from the output of MCMC sampler, we will use the projective Frobenius span variation introduced in [72]. The projective Frobenius Span , whereβ is the PMCS estimate of β. We can estimate this

Rank Estimation
While we have been considering r known so far, the cointegration rank has be to estimated. This can be done using a variety of approaches: Bayesian, frequentist or using a spectral methods. In a Bayesian setup, estimation of r is performed using Bayes factors, which in the context of ECMs are computed by means of Sevage-Dickey density ratio; see [39], [17] for an overview and [53] and [40] for financial applications. In frequentist settings, choosing r is routinely performed by means of hypothesis testing. Two most commonly used test statistics give rise to the "Trace Test" and "Maximum Eigenvalue Tests"; see [35] for details. In the work [55] the authors also study the effect of cointegration miss-specification of the rank and the role and influence of the identification constraints. They show that being conservative and overestimating an uncertain cointegration rank is often practically more robust in the model estimation and has less influence on Bayesian model parameter estimation compared to under-estimating the rank. Finally, one could use non-parametric or spectral approaches to estimate r (as well as the cointegration space), see [74] for a recent approach.

Numerical results
In the simulation results presented below, we use simulated data with n = 4, r = 3, (so d s (β,β) < 1.) In each case (Gaussian or Student-t), T = 240 the parameter values generating y 1:T are the following: For the Student-t case we shall use ω = 20. For the priors we use ν = 1, P τ = I n , and G = I r . In order to quantify the and accuracy of MCMC samplers, we will compute the diagnostics based on their resulting chain output, e.g.

The Gaussian case
For the Gaussian case, in Tables 1, 2 we present the results comparing the Gibbs Sampler of [41], a Gibbs approach with GMC for β and the full conditional update for α, a Gibbs approach with GMC for β and HMC for α as well as GMC targetting jointly (α, β). The perfect Gibbs sampler and the Gibbs method with GMC for β and a perfect Gibbs update for α, show very good performance. In some cases the slight improvement when   Table 3. IACT for each entry of Π for Student-t case with data augmentation using GMC for β can be attributed to the additional computational effort. In addition, using GMC for α and HMC for β shows good performance that is in par with the Gibbs sampler in terms of the comparison in Table  2 but with a slower mixing as indicated by the IACT in Table 1. Finally, using GMC to target jointly (α, β) has very slow mixing and posterior exploration, but at least it manages to result to accurate point estimates.

The Student-t case
For the Student-t case we will consider two different cases. The first one would be to use the data augmentation (DA) approach outlined in 2.4.2. As the full conditionals of the auxiliary variables λ t are tractable, a Gibbs sampler can be implemented (by alternating Algorithm 1 with p(R|y 1:T , α, β, λ 1:T , ω) and p(λ t |y 1:T , α, β, R, ω) for each t). Also, when GMC is used in this case either as part of a Gibbs update or on its own, a Gaussian likelihood p(y 1:T |α, β, R, ω, λ 1:T ) will be used in the implementation of 2. In the second case, we will apply the GMC samplers implemented with a Student-t distributed likelihoods p(y 1:T |α, β, R, ω) instead. The rest of the updates for the remaining parameters will be as before, so one can view this approach as a partial collapsing scheme.
In Tables 3,4, 5 we present the analogous results as in the Gaussian case. When data augmentation is used we compare the same samplers as in the Gaussian case together with an implementation of GMC for the SVD decomposition of α. The latter is included for comparison purposes as for this choice of prior for α one can derive analytically the prior for the matrices and singular values in the SVD decomposition ([5, Theorem 1.5.4]). As mentioned in Remark 8, using SVD directly for Π is more parsimonious and hence one should expect better results (see below). Not surprisingly, the different methods compared behave similarly for the Student-t case with data augmentation to what has been seen in the Gaussian case. In Figure 1 we also present trace plots for the consecutive MCMC realizations of β and the components of polar decompositions of estimates of α, that iŝ A,κ.
For the case where the Student-t likelihood and partial collapsing is used, we only compare the GMC based methods. The results are similar to the data augmentation case, but they can be used as numerical evidence of validity of the GMC/HMC schemes in the non-Gaussian case. From all the results it is apparent that GMC for β with HMC for α performs best among the more generic samplers (that do not use the full conditionals). InTable 5, it shows similar performance as the Gibbs sampler and it is only inferior in terms of the IACTs, which admit higher values than the Gibbs sampler but are still reasonably good values. In addition, we note that the use of SVD for α results in an improvement in efficiency from targetting jointly (α, β) with GMC, but both samplers are much slower in the exploration of the posterior than the other methods.

Using SVD for Π
In this section, we examine the model reparameterisation based on SVD decomposition imposed directly on the long-run multiplier matrix Π. This case is treated separately as the priors used here are not equivalent with those used before. To our best knowledge, this has not been considered before in the context of Bayesian cointegration, however GMC enables us to pursue this direction. For convenience, we use p(Π) ∝ N n,n (0, I n , I n )1 rank(Π)=r . From [5, Theorem 1.5.4] we obtain the equivalent expression for the priors defined on the matrices and singular values in the SVD decomposition: U, V are uniform distributions onṼ n,n , V n,r resp. and For the MCMC sampler targetting p(S|U, β, V, y 1:T ) we use a Metropolis-Hastings approach, with the proposal being a random walk on the log space of each s l with a step size of 0.1 and the noises being independent standard normals sorted in descending order. For p(β|S, V, y 1:T ), p(U|S, β, y 1:T ) GMC samplers are used. For samplers targetting U we use simpler to compute geodesic flows as presented in [3,9]. The results are presented in Tables  6, 7 for a GMC within Gibbs approach in Algorithm 3 and a random scan implementation that updates β with probability 0.1 and U, S with 0.45 each. From the IACTs we can observe an improvement in the mixing over the sampler that used SVD for α earlier.
The motivation behind using a random scan sampler is to reduce the computational cost. The results in Tables 6, 7 are similar with the the systematic scan. As the update for β is the most expensive step and hence is used less often. [4] show how geodesics for U can be computed more efficiently as opposed to the geodesics on V n,r . On the downside, one ends up with sampler mixing properties, but this can be improved by increasing the probability of update of β.  Table 7. Performance of GMC using SVD for Π for different likelihoods

Discussion
Bayesian methods are favorable to use in situations when estimation of the parameters uncertainty is required, but requires advanced simulation techniques like MCMC. In this paper we presented the Gibbs Sampler of [41] together with the GMC algorithm of [3] that can be used to sample from a wide class of distributions defined on manifolds. We combined these two approaches and presented different approaches for Bayesian estimation of the cointegration space. The Gibbs sampling method is the current state of the art for this problem, but can be implemented only when full conditionals are available. On the other hand, GMC is generic and can be used in wider range of model setups, that also seem more realistic in finance, such as heavy tailed noises, time varying model parameters as well as different choices of prior distributions. In terms of performance in our numerical results we saw GMC performed very well and efficiently when used within a Gibbs update scheme (i.e. GMC for β and HMC for α), but when applied naively to target (α, β) jointly it turned out to be rather inefficient.
A question that may arise is whether Bayesian methods and sampling techniques will be accurate in practical contexts. We believe this is a topic demanding a detailed discussion, but we performed a few numerical experiments to motivate the work presented so far. These are contained in the supplementary material in [46], where we compared the Gibbs sampler of [41] with the spectral method of [74] under various model parameterization setups with Gaussian noise. Both methods seemed accurate overall with the Gibbs sampler performing better, but on the other hand the spectral method of [74] uses only a fraction of the computational cost. In both cases, accuracy seemed to be robust to the level of additive noise, e.g. the trace of R, but more sensitive to a mean reversion parameter of the implied spread process. Using only the spectral method of [74] we found that this sensitivity of the estimated cointegrated space on the mean reversion seemed to hold also when a non-Gaussian noise was used with a variety of features in terms of skewness, kurtosis, or heavy-ness of the tails.
In terms of Bayesian estimation of cointegration, when non-Gaussian (differentiable) densities with heavier tails are used the advanced GMC methods can be very valuable to assess the uncertainty around point estimates. In addition, GMC can be implemented also directly on SVD decompositions of Π. This lead to accurate estimates, but further work is required to improve the mixing of Algorithm 3. Here a simplification choice was made to apply Step 2 (a) of Algorithm 3 after a GMC iteration defined on V n,r , whereas Step 2 (a) should have been emdedded in the GMC procedure (and the implementation of Algorithm 2). More importantly from a modelling perspective SVD is appealing as the singular values act in a similar manner to mean reversion parameters mentioned earlier, so we believe that their accurate estimation will be critical for cointegration space estimation. In addition, given that computational tools like GMC are available and can perform well within a Gibbs scheme, this opens up the possibility of adopting a wider class of priors (and parameterizations such as SVD) than what described in Section 2.4.1. Priors could be defined on parameters resulting from SVD or polar decompositions directly and past works on defining priors in different contexts (e.g. [37,38]) can be very relevant.
We conclude with some more possible aspects for further investigations from a computational perspective. In GMC and HMC we used a simple numerical integration scheme for the Hamiltonian dynamics numerical integrator, so there is plenty of room for improvements such as the recent work of [30]. Finally, interesting questions arise when n is very large. Is it possible to perform estimation efficiently and are the conclusions we reached here still valid? In numerical results not shown we have applied the above methods up to n = 10, and our conclusions seemed to hold. In order to use much higher values of n a significant amount of developments is needed.