Task-based parallelization of an implicit kinetic scheme

In this paper we present and implement the Palindromic Discontinuous Galerkin (PDG) method in dimensions higher than one. The method has already been exposed and tested in [4] in the one-dimensional context. The PDG method is a general implicit high order method for approximating systems of conservation laws. It relies on a kinetic interpretation of the conservation laws containing stiff relaxation terms. The kinetic system is approximated with an asymptotic-preserving high order DG method. We describe the parallel implementation of the method, based on the StarPU runtime library. Then we apply it on preliminary test cases.


Introduction
In this work we consider the time discretization of compressible fluid models that appear in gas dynamics, biology, astrophysics or plasma physics for tokamaks. These models can be unified in the following form where w : R D × [0, T max ] −→ R m is the vector of conservative variables, q k (w) : R m −→ R m is the flux and s : R D × R × R m −→ R m is a source term. D represents the physical space dimension and m the number of unknowns. In many physical applications such as MHD flows, low Mach Euler equations, Shallow-Water with sedimentation, the model presents several time scales associated to the propagation of different waves. When the time scale of fast phenomena, which constrains the explicit CFL condition, is very small compared to the time scale of the most relevant phenomena, it becomes necessary to switch to implicit schemes. However classical implicit schemes are very costly in 2D or 3D because they require the resolution of linear or non-linear systems at each time step. In addition, the matrices associated with the hyperbolic systems are generally ill-conditioned.
In this paper, we propose to follow another approach for avoiding the resolution of complicated linear systems. Instead of solving the full fluid model (1.1) directly, we replace it by a simpler kinetic interpretation made of a set of transport equations coupled through a stiff relaxation term [1,3,7]. See also [4] and included references. The kinetic system is then solved by a splitting method where the transport and relaxation stages are treated separately. The method is then well adapted to parallel optimizations. The method is already presented in [4] in the one-dimensional case. Here we present its implementation in higher dimensions. We particularly focus our presentation of the massive parallelization of the method with the StarPU runtime system [2]. The outlines are as follows. First we recall that it is possible to provide a general kinetic interpretation of any system of conservation laws. The interest of this representation is that the complicated non-linear system is replaced by a (larger) set of scalar linear transport equations that are much easier to solve. The transport equations are coupled through a non-linear source term that is fully local in space. Then, we detail the approximation which allows to solve the transport equation in an efficient way. We adopt a Discontinuous Galerkin (DG) method based on an upwind numerical flux and Gauss-Lobatto quadrature points. Thanks to the upwind flux, the matrix of the discretized transport operator has a block-triangular structure.
The main part of this work is devoted to the task parallelization based on the StarPU library for treating the transport and relaxation steps efficiently. We use the MPI version of StarPU, which allows to address clusters of multicore processors. We also describe the domain decomposition and the macrocell approach that we have used to achieve better performance. Finally, we give some numerical results for measuring the efficiency of the parallelism and the ability of the method to compute realistic problems.

Kinetic model
We consider the following kinetic equation The unknown is a vectorial distribution function f (x, t) ∈ R nv depending on the space variable x = (x 1 . . . x D ) ∈ R D and time t ∈ R. g(x, t, f ) is a vectorial source term, possibly depending on space, time and f . The partial derivatives are noted , coupled through a stiff BGK relaxation, and with an optional additional source term. We denote by V · ∂ = D k=1 V k ∂ k the transport operator, and by Nf = (f eq (f ) − f )/τ the BGK relaxation term (also called the "collision" term). Generally, this kinetic model represents an underlying hyperbolic system of conservation laws. The macroscopic conservative variables w(x, t) ∈ R m are obtained through a linear transformation where P is a m × n v matrix. Generally the number of conservative variables is smaller than the number of kinetic data: m < n v . The equilibrium (or "Maxwellian") distribution f eq (f ) is such that Pf = Pf eq (f ), and which states that the equilibrium actually depends only on the macroscopic data w. We could have used the notation f eq = f eq (w) = f eq (Pf ), but we have decided to respect a well-established tradition. When τ → 0, the kinetic equations provide an approximation of the system of conservation laws where the flux is given by The flux is indeed a function of w only because of (2.3). Similarly the source term is given by System (2.1) has to be supplemented with conditions at the boundary ∂Ω of the computational domain Ω. We denote by n = (n 1 . . . n D ) the outward normal vector on ∂Ω. For simplicity, we shall only consider very simple imposed and time-independent boundary conditions f b . We note A natural boundary condition, which is compatible with the transport operator V · ∂, is It states that for a given velocity v i , the corresponding boundary data f b i is used only at the inflow part of the boundary. Let us point out that the programming optimization that we propose in this paper rely in an essential way on the nature of the boundary condition (2.6). For other boundary conditions, such as periodic or wall conditions, additional investigations are still needed.

Numerical method
3.1. Discontinuous Galerkin approximation. For solving (2.1) we shall treat the transport operator V · ∂ and the collision operator N efficiently, thanks to a splitting approach. This allows to achieve a better parallelism. Let us start with the description of the transport solver. For a simple exposition, we only consider one single scalar transport equation for f (x, t) ∈ R at constant velocity v The general vectorial case is easily deduced. We consider a mesh M of Ω made of open sets, called "cells", In the most general setting, the cells satisfy In each cell L ∈ M we consider a basis of functions (ϕ L,i (x)) i=0...N d −1 constructed from polynomials of order d. We denote by h the maximal diameter of the cells. With an abuse of notation we still denote by f the approximation of f , defined by The DG formulation then reads: find the f L,j 's such that for all cell L and all test function ϕ L,i In this formula (see Figure 4.1): • R denotes the neighboring cell to L along its boundary ∂L ∩ ∂R, or the exterior of Ω on ∂L ∩ ∂Ω. • n = n LR is the unit normal vector on ∂L oriented from L to R. • f R denotes the value of f in the neighboring cell R on ∂L ∩ ∂R.
• If L is a boundary cell, one may have to use the boundary values instead: is the standard upwind numerical flux encountered in many finite volume or DG methods.
In our application, we consider hexahedral cells. We have a reference cell We assume that τ L is invertible and we denote by τ L its (invertible) Jacobian matrix. We also assume that τ L is a direct transformation det τ L > 0.
In our implementation τ L is a quadratic map based on hexahedral curved "H20" finite elements with 20 nodes. The mesh of H20 finite elements is generated by gmsh [6].
On the reference cell, we consider the Gauss-Lobatto points ( They are obtained by tensor products of the (d + 1) one-dimensional Gauss-Lobatto (GL) points on ] − 1, 1[. The reference GL points and weights are then mapped to the physical GL points of cell L by In addition, the six faces of the reference hexahedral cell are denoted by F , = 1 . . . 6 and the corresponding outward normal vectors are denoted byn . A big advantage of choosing the GL points is that the volume and the faces share the same quadrature points. A special attention is necessary for defining the face quadrature weights. If a GL pointx i ∈ F , we denote by µ i the corresponding quadrature weight on face F . We also use the convention that µ i = 0 ifx i does not belong to face F . A given GL pointx i can belong to several faces when it is on an edge or in a corner ofL. Because of symmetry, we observe that if µ i = 0, then the weight µ i does not depend on .
We then consider basis functionsφ i on the reference cell: they are the Lagrange polynomials associated to the Gauss-Lobatto point and thus satisfy the interpolation propertŷ The basis functions on cell L are then defined according to the formula In this way, they also satisfy the interpolation property In this paper, we only consider conformal meshes: the GL points on cell L are supposed to match the GL points of cell R on their common face. Dealing with non-matching cells is the object of a forthcoming work.
Let L and R be two neighboring cells. Let x L,j be a GL point in cell L that is also on the common face between L and R. In the case of conformal meshes, it is possible to define the index j such that Applying a numerical integration to (3.2), using (3.3) and the interpolation property (3.4), we finally obtain We have to detail how the gradients and normal vectors are computed in the above formula. Let A be a square matrix. We recall that the cofactor matrix of A is defined by The gradient of the basis function is computed from the gradients on the reference cell using (3.6) In the same way, the scaled normal vectors n on the faces are computed by the formula We introduce the following notation for the cofactor matrix The DG scheme then reads On boundary GL points, the value of f R,i is given by the boundary condition For practical reasons, it is interesting to also consider f R,i as an artificial unknown in the fictitious cell. The fictitious unknown is then a solution of the differential equation In the end, if we put all the unknowns in a large vector F(t), (3.7), (3.8) read as a large system of coupled differential equations In the following, we call L h the transport matrix. The transport matrix satisfies the following properties: • Let F be such that the components corresponding to the boundary term vanish. Then This dissipation property is a consequence of the choice of an upwind numerical flux [9]. • In many cases, and with a good numbering of the unknowns in F, L h has a triangular structure. This aspect is discussed in Subsection 4.1.
As stated above, we actually have to apply a transport solver for each constant velocity v i . Let L be a cell of the mesh M and x i a GL point in L. As in the scalar case, we denote by f L,i the approximation of f in L at GL point i. In the sequel, with an abuse of notation and according to the context, we may continue to note F(t) the big vector made of all the vectorial values f L,j at all the GL points j in all the (real or fictitious) cells L. We may also continue to denote by L h the matrix made of the assembly of all the transport operators for all velocities v i . With a good numbering of the unknowns it is possible in many cases to suppose that L h is block-triangular. More precisely, because in the transport step the equations are uncoupled, we see that L h can be made block-diagonal, each diagonal block being itself block-triangular. See Section 4.1.

Palindromic time integration.
We can also define an approximation N h of the collision operator N. We define by F eq (F) the big vector made of all the f eq (f L,i ), Similarly we note G h the discrete approximation of the kinetic source term g.
The DG approximation of (2.1) finally reads We use the following Crank-Nicolson second order time integrator for the transport equation: Similarly, for the collision integrator, we use Because during the collision step, the conservative variables w = Pf do not change, the collision integrator is only apparently implicit. We have the explicit formula: The source operator is also approximated by a Crank-Nicolson integrator requiring to solve a nonlinear local equation whenever g depends on f . If τ > 0, we observe that the operators T 2 and C 2 are time-symmetric: if we set This property implies that O 2 is necessarily a second order approximation of the exact integrator [10,8]. When τ = 0, we also remark that and then C 2 does not satisfy (3.13) anymore. For τ > 0, the Strang formula permits us to construct a five steps second order time-symmetric approximation and a three step one in the source-less case. However this formula is no more a second order approximation of (2.1) when τ → 0. Indeed, when As explained in [4] it is better to consider the following method, which remains second order accurate even for infinitely fast relaxation: .
By palindromic compositions of the second order method M kin 2 it is then very easy to achieve any even order of accuracy (see [4]). However, in this paper, we concentrate on the parallel optimization of the method and we shall only present numerical results at second order for the limit system 2.4. To that end it is sufficient to use the method M 2 , as P M 2 (0) properly converges towards identity on the macroscopic variable space when τ → 0.

Optimization of the kinetic solver
In this section, we describe the optimizations that can be applied in the implementation of the previous numerical method. 4.1. Triangular structure of the transport matrix. Because of the upwind structure of the numerical flux, it appears that the transport matrix is often block-triangular. This is very interesting because this allows to apply implicit schemes to (3.9) without the costly inversion of linear systems [11]. We can provide the formal structure of L h through the construction of a directed graph G with a set of vertices V and a set of edges E ⊂ V × V. The vertices of the graph are associated to the (real or fictitious) cells of M. Consider now two cells L and R with a common face F LR . We denote by n LR the normal vector on F LR oriented from L to R. If there is at least then the edge from L to R belongs to the graph: In (3.7) we can distinguish between several kinds of terms. We write with We can use the following convention Γ L←L contains the terms that couple the values of f inside the cell L. They correspond to diagonal blocks of size (d + 1) D × (d + 1) D in the transport matrix L h . Γ L←R contains the terms that couple the values inside cell L with the values in the neighboring upwind cell R. If R is a downwind cell relatively to L then µ i v · C L,in − = 0 and Γ L←R = 0 is indeed compatible with the above convention (4.1). Once the graph G is constructed, we can analyze it with standard tools. If it contains no cycle, then it is called a Directed Acyclic Graph (DAG). Any DAG admits a topological ordering of its nodes. A topological ordering is a numbering of the cells i → L i such that if there is a path from L i to L j in G then j > i. In practice, it is useful to remove the fictitious cells from the topological ordering. In our implementation they are put at the end of the list. Once the new ordering of the graph vertices is constructed, we can construct a numbering of the components of F by first numbering the unknowns in L 0 then the unknowns in L 1 , etc. More precisely, we set Then, with this ordering, the matrix L h is lower block-triangular with diagonal blocks of size (d + 1) D × (d + 1) D . It means that we can apply implicit schemes to (3.9) without inverting large linear systems.
As stated above, we actually have to apply a transport solver for each constant velocity v i . In the sequel, with another abuse of notation and according to the context, we continue to note F the big vector made of all the vectorial values f L,j at all the GL points j in all the (real or fictitious) cells L.
We may also continue to denote by L h the matrix made of the assembly of all the transport operators for all velocities v i . With a good numbering of the unknown it is still possible to suppose that L h is block-triangular. More precisely, as in the transport step the equations are uncoupled, we see that L h can be made a block-diagonal matrix, each diagonal block being itself block-triangular.

4.2.
Parallelization of the implicit solver. In this section, we explain how it is possible to parallelize the transport solver. Here again we consider the single transport equation (3.1) and the associated differential equation (3.9). We apply a second order Crank-Nicolson implicit scheme. We have explained in Section 3.2 how to increase the order of the scheme. We compute an approximation F n of F(n∆t). The implicit scheme reads (4.2) (I − ∆tL h )F n+1 = (I + ∆tL h )F n .
As explained above, the matrices (I − ∆tL h ) and (I + ∆tL h ) are lower triangular. It is thus possible to solve the linear system explicitly cell after cell, assuming that the cells are numbered in a topological order. It is possible to perform further optimization by harnessing the parallelism exhibited by the dependency graph. Indeed, once the values of f in the first cell are computed, it is generally possible to compute in parallel the values of f in neighboring downwind cells. For example, as can be seen on Figure 4.1, once the values in cells 0, 1 and 2 are known, we can compute independently, and in parallel, the values in cells 2, 4 and 6.
We observe that at the beginning and at the end of the time step, the computations are "less parallel" than in the middle of the time step, where the parallelism is maximal. Implementing this algorithm with OpenMP or using pthread is not very difficult. However, it requires to compute the data dependencies between the computational tasks carefully, and to set adequate synchronization points in order to get correct results. In addition, a rough implementation will probably not exhibit optimized memory access. Therefore, we have decided to rely on a more sophisticated tool called StarPU 1 for submitting the parallel tasks to the available computational resources.
StarPU is a runtime system library developed at Inria Bordeaux [2]. It relies on the data-based parallelism paradigm. The user has first to split its whole problem into elementary computational tasks. The elementary tasks are then implemented into codelets, which are simple C functions. The same task can be implemented differently into several codelets. This allows the user to harness special acceleration devices, such as vectorial CPU cores, GPUs or Intel KNL devices, for example. In the StarPU terminology these devices are called workers.
For each task, the user has also to describe precisely what are the input data, in read mode, and the output data, in write or read-write mode. The user then submits the task in a sequential way to the StarPU system. StarPU is able to construct at runtime a task graph from the data dependencies. The task graph is analyzed and the tasks are scheduled automatically to the available workers (CPU cores, GPUs, etc.). If possible, they are executed in parallel into concurrent threads. The data transfer tasks between the threads are automatically generated and managed by StarPU, which greatly simplifies the programming. When a StarPU program is executed, it is possible to choose among several schedulers. The simplest eager scheduler adopts a very simple strategy, where the tasks are executed in the order of submission by the free workers, without optimization. More sophisticated schedulers, such as the dmda scheduler, are able to measure the efficiency of the different codelets and the data transfer times, in order to apply a more efficient allocation of tasks. Recently a new data access mode has been added to StarPU: the commute mode. In a task, a buffer of data can now be accessed in commute mode, in addition to the write or read-write modes. A commute access tells to StarPU that the execution of the corresponding task may be executed before or after other tasks containing commutative access. This allows StarPU to perform additional optimizations. There exists also a MPI version of StarPU. In the MPI version, the user has to decide an initial distribution of data among the MPI nodes. Then the tasks are submitted as usual (using the function starpu_mpi_insert_task instead of starpu_insert_task). Required MPI communications are automatically generated by StarPU. For the moment, this approach does not guarantee a good load balancing. It is the responsibility of the user to migrate data from one MPI node to another for improving the load balancing, if necessary.

4.3.
Macrocell approach. StarPU is quite efficient, but there is an unavoidable overhead due to the task submissions and to the on-the-fly construction and analysis of the task graph. Therefore it is important to ensure that the computational tasks are not too small, in which case the overhead is not amortized, or not too big, in which case some workers are idle. For achieving the right balance, we have decided not to apply directly the above task submission algorithm to the cells but to groups of cells instead. The implementation of the whole kinetic scheme has been made into the schnaps software 2 . schnaps is a C99 software dedicated to the numerical simulation of conservation laws. In schnaps we construct first a macromesh of the computational domain. Then each macrocell of the macromesh is split into subcells. See Figure 4.2. We also arrange the subcells into a regular sub-mesh of the macrocells. In this way, it is possible to apply additional optimizations. For instance, the subcells L of a same macrocell L can now share the same geometrical transformation τ L , which saves memory.
In schnaps we have also defined an interface structure in order to manage data communications between the macrocells. An interface contains the faces that are common to two neighboring macrocells. We do not proceed exactly as in Section 4.1 where the vertices of graph G were associated to cells and the edges to faces. Instead, we construct an upwind graph whose vertices are associated to macrocells, and edges to interfaces. This graph is then sorted, and the macrocells are numbered in a topological order.
For solving one time step of one transport equation (4.2), we split the computations into several elementary operations: for all macrocell L taken in a topological order, we perform the following tasks: (1) Volume residual assembly: this task computes in the macrocell L the part of the right hand side of (4.2) that comes from the values of f inside L; (2) Interface residual assembly: this task computes, in the macrocell L, the part of the right hand side of (4.2) that comes from upwind interface values; (3) Boundary residual assembly: this task computes, in the macrocell L, the part of the right hand side of (4.2) that comes from upwind boundaries values. (4) Volume solve: this task solves the local transport linear system in the macrocell. (5) Extraction: this task copies the boundary data of L to the neighbor downwind interfaces.
Let us point out that in step 4 above, the macrocell local transport solver is reassembled and refactorized at each time step: we have decided not to store a sparse linear system in the macrocell for each velocity v i , in order to save memory. The local sparse linear system is solved thanks to the KLU library [5]. This library is able to detect efficiently sparse triangular matrix structures, which makes the resolution quite efficient. In practice, the factorization and resolution time of the KLU solver is of the same order as the residual assembly time.
In schnaps, we use the MPI version of StarPU. The macromesh is initially split into several subdomains and the subdomains are distributed to the MPI nodes. Then the above tasks are launched asynchronously with the starpu_mpi_insert_task function. MPI communications are managed automatically by StarPU.
It is clear that if we were solving a single transport equation our strategy would be very inefficient. Indeed, the downwind subdomains would have to wait for the end of the computations of the upwind subdomains. We are saved by the fact that we have to solve many transport equations in different directions. This helps the MPI nodes to be equally occupied. Our approach is more efficient if we avoid a domain decomposition with internal subdomains, because these subdomains have to wait the results coming from the boundaries.
In our approach it is also essential to launch the tasks in a completely asynchronous fashion. In this way, if a MPI node is waiting for results of other subdomains for a given velocity v i it is not prevented from starting the computation for another velocity v j .

4.4.
Collisions. In this section we explain how is computed the collision step (3.12). The computations are purely local to each GL point, which makes the collision step embarrassingly parallel. However it is not so obvious to attain high efficiency because of memory access. If the values of F are well arranged in memory in the transport stage, it means that the values of f attached to a given velocity v i are close in memory, for a better data locality. On the contrary, in the collision step at a given GL point, a better locality is achieved if the values of f corresponding to different velocities are close in memory. Additional investigations and tests are needed in order to evaluate the importance of data locality in our algorithm.
In our implementation, we have adopted the following strategy. We have first identified the following task: (1) Reduction task for a velocity v i : this task is associated with one macrocell L. It computes the contribution to w of the components of f that have been transported at velocity v i with formula (2.2). The StarPU access to the buffer containing w is performed in read-write and commute modes. In this way the contribution from each velocity can be added to w as soon as it is available. (2) Relaxation task for a velocity v i : this task is associated to one macrocell L. Once w is known, it computes the components of f eq corresponding to velocity v i . Then it computes the relaxation step (3.12) for the associated component of f .
In step 2 we can separate the computations for each velocity because the collision term (3.10) is diagonal. Some Lattice Boltzmann Methods rely on non-diagonal relaxations. It can be useful for representing more general viscous terms for instance. For non-diagonal relaxation we would have to change a little the algorithm.
We can now make a few comments about the storage cost of the method. In the end, we have to store at each GL point x i and each cell L the values of f L,i and w L,i . We do not have to keep the values of the previous time-step, f L,i and w L,i can be replaced by the new values as soon as they are available. In this sense, our method is "low-storage". As explained in [4] it is also possible to increase the time order of the method, without increasing the storage.

Scaling test.
For all performance tests presented in this section, we used standard models from the family Lattice-Boltzmann-Method (LBM) kinetic models devised for the simulation of Euler/Navier Stokes systems. We will not give a detailed description of their properties from the modeling point of view: we simply take them as good representative of the typical workload of kinetic relaxation schemes. The most relevant feature impacting the performance of our algorithm is the discrete velocity set of the kinetic model, which determines the task graph structure of the transport step when combined with a particular mesh topology. In standard LBM models, velocity sets are usually built-up from a sequence of pairs of opposite velocities with an additional zero velocity node. On Figure 4.3 we show the two representative velocity sets of the 2DQ9 and 3DQ27 LBM models. We first tested the multithread performance of our implementation for the full (transport + relaxation) scheme for the standard D2Q9 model. All tests were performed on a single node of the IRMA-ATLAS cluster, with 24 available cores. We considered several square meshes build-up from 1 to 64 macrocells. The number of geometric degrees of freedom per element has been kept constant with a value of 3375 points per macrocell, so that the workload per macrocell did not change. For each mesh, we allowed StarPu to use from 1 to the full 24 cores of the node and measured the total wall time. The results for this first batch of performance measurements are given in Fig. 4.4. First we verify that for 1 unique macrocell, parallel performance saturates when the number of cores roughly equals the number of velocities in the model. This is to be expected, as no topological parallelism can be exploited in that case. Increasing the number of macro-element allows to take advantage of topological parallelism. For that workload, parallel efficiency saturates at about 80, which is quite good. On Fig. 4.5, we considered on the same cubic mesh three different models differing by the number of velocity values. Those cases exhibit a large amount of potential parallelism, due to the large number of velocities combined with the macrocell decomposition. On an ideal machine, they could in theory scale perfectly up to 24 cores. The observed saturation, still around 80 efficiency, is still quite good and comes from the unavoidable concurrency in memory access between the various cores and the scheduling overhead. It is important to note when considering those results that the bulk of the computational cost occurs in the transport step of the algorithm: though the collision step forces synchronization between all the fields corresponding to individual velocities for the computation of the macroscopic fields, its actual cost is negligible in regard of the transport step.
4.5.2. MPI scaling: D3Q15 in a torus. Having verified the good multithread performance of our code on a single node, we now check whether for larger problem sizes the workload can be distributed among several computing nodes. To that end, accounting for the fact that we aim notably at performing simulations for Tokamak physics, we considered a toroidal mesh subdivided into 720 macrocells. The workload distribution across nodes was made using a standard domain decomposition approach: the mesh was partitioned statically into as many sub-domains as computing nodes, ranging from 1to 4 for our experiment on the IRMA-ATLAS cluster. From an implementation point of view, the transition from a multithreaded code to a hybrid MPI/multithread one is made fairly easy by StarPU. When declaring data to StarPU, one simply has to specify the MPI process owning the data. At runtime, each MPI process hosts a local scheduler instance which acts only on data relevant to the local execution graph. All MPI communications are handled transparently by the local scheduler when inter-node data transfers are necessary. In table 1 we show the wall time for a hundred iterations of the full scheme for the D3Q15 model. The number of available threads per node was set to 14, matching the number of velocities actually participating in transport (there is one null velocity in the model). We observe a super-linear scaling when the load is spread from 1to 4 node. This is not surprising for such an experiment with fixed total problem size as both the memory load and size of the local task graph for each decrease when the number of sub-domains increases.
The conservative variables vector is thus w = [ρ, ρu x , ρu y ] t . The kinetic model is the standard D2Q9 one with nine velocities The equilibrium distribution function is given by with c = λ/ √ 3, and the weights w 0 = 4 9 , w i = 1 9 for i = 1, . . . , 4, w i = 1 36 for i = 5, . . . , 8. The stationary solution for a fluid at rest in the gravity field is For this test case, the numerical scheme is made up of three stages: a transport step (T), a source step (S) where the source is applied on the equilibrium part of the distribution function, and the collision step (C). Due to the absence of explicit time dependency and the linearity in w of the source, the local nonlinear resolution of the source operator converges in one Picard iteration. All steps are implemented as weighted implicit schemes, parametrized by a weight θ and a time step∆t. We compared several 1 st and 2 nd order splitting schemes built up from either fully implicit (θ = 1) first order or Crank-Nicolson (θ = 1 2 ) steps: • Lie first order splitting scheme M s 1 = T 1 (∆t)S 1 (∆t)C 1 (∆t) with first order building blocks. • Lie first order splitting scheme M s 1,2 = T 2 (∆t)S 2 (∆t)C 2 (∆t) with second order building blocks, for which the order loss comes from the splitting error.
• a palindromic second order Strang scheme M s 2 = T 2 ( ∆t 2 )S 2 ( ∆t 2 )C 2 (∆t)S 2 ( ∆t 2 )T 2 ( ∆t 2 ) . • a collapsed version of the second order Strang scheme M s 2 for which the last transport substep of each global step and the first transport substep of the next one are fused in a single T 2 (∆t) substep except obviously for the first and last time steps of the simulation.  We performed time order convergence tests on a 2D square mesh partitioned into 4 × 4 macrocells with 1024 points per macrocell. As shown by Fig. 5.1, we obtain the expected convergence orders for each of the splitting schemes used. with w s = [1.0, 0, 0] t the target fluid state in the "solid" part of the domain and the relaxation frequency K(x) is given by with K s = 300, x c = [−4, 0] t and κ = 40. The net effect is a very stiff relaxation towards a flow with zero velocity and the reference density near the center x c of the frequency mask. The effective diameter of the cylinder for this simulation is about 0.5. The initial state, which is also applied at the duct boundaries for the whole simulation is ρ = 1, u x = 0.03, u y = 0. Accounting for the fact that for this model the sound speed is 1/ √ 3, the Mach number of the unperturbed flow is approximately 0, 017. The simulation was performed on a macromesh with 16 × 16 macrocells stretched with a 1 : 5 aspect ratio to match the domain dimension; each macrocell contains 12 × 60 integration points. On figure 5.2 we show the vorticity norm at t = 83, when turbulence is well developed in the wake of the obstacle.

Conclusion
In this paper, we have presented an optimized implementation of the Palindromic Discontinuous Galerkin Method for solving kinetic equations with stiff relaxation. The method presents the following interesting features: • It can be used for solving any hyperbolic system of conservation laws.
• It is asymptotic-preserving with respect to the stiff relaxation.
• It is implicit and thus is not limited by CFL conditions. • Despite being formally implicit, it requires only explicit computations.
• It is easy to increase the time order with a composition method.
• It presents many opportunities for parallelization and optimization: in this paper we have presented the parallelization of the method with the aid of the MPI version of the StarPU runtime system. In this way we address both shared memory and distributed memory MIMD parallelism.
Our perspectives are now to apply the method for computing MHD instabilities in tokamaks. We will also try to extend the method to more general boundary conditions.