Some problems are defined in terms of linear operations over arrays of data (e.g. vectors and matrices) the elements of which are mostly zeros (sparse arrays). How do you optimize these operations to minimize storage and maximize performance?
Many problems are expressed in terms of operations on vectors and matrices. These problems are diverse and include problems such as solving systems of simultaneous linear equations (e.g. operations research), finding eigenvalues (e.g. structural mechanics), transforming between different representations of a system (image processing), and many other problems too numerous to list.
For an important class of these problems, most of the elements of the system are zero. This arises from the symmetry of the system or due to the fact that different subcomponents of the system are independent of each other. When the fraction of zeros is significantly large, enough so that there are benefits to explicitly take these zeros into account when solving the problem, these problems are called Sparse Linear Algebra problems.
There are two key benefits to a sparse linear algebra problem. First, the amount of storage required is greatly reduced. Only the non-zero elements and their indices need to be stored which can cut down total storage by orders of magnitude. Second, the computational cost is reduced since the results of arithmetic operations with zeros are known without actually doing the computation.
The challenge in sparse linear algebra is to balance storage, computational cost, and stability to create an effective solution. While the approaches are well constrained for dense linear algebra, the case for sparse linear algebra is much more complicated.
Tradeoff the cost of recomputing intermediate results as opposed to storing them and fetching them later.
Overlapping computation with data movement requires software constructed around the details of a particular memory hierarchy. But this is in direct conflict with the need to write portable software.
A linear algebra operation may dictate a distinct layout of data to achieve optimal performance. These data layouts, however, may conflict with those needed by other operations forcing the programmer to balance data copy overheads with potentially sub-optimal algorithms.
Developing software “from scratch” to solve sparse linear algebra problems usually requires in-depth understanding of the algorithms being implemented. Much care must be taken to ensure numerical stability, algorithmic convergence, and a number of other concerns which are far beyond the scope of this pattern. Developing good algorithms is “more art than science” and is not usually the task of a application developer, even though developers may need to understand some numerical issues specific to their problems. When facing a sparse linear algebra problem, a programmer should always try to find standard library solutions to sparse linear algebra problems. Comprehensive lists of publicly available high-quality software are available online, and the References section contains links to some of these lists.
Linear algebra libraries usually contain many routines which solve the same problem (e.g. LU decomposition, or Least-Squares) but are optimized for particular classes of matrices. Selecting the best routine requires a user to understand certain properties of the matrices being used, including the patterns of the location of non-zeroes as well as several numerical characteristics.
Real matrices contain only real-valued entries, as opposed to Complex matrices, whose entries may be complex numbers. A Symmetric matrix is equal to its transpose. That is, each entry ai,j is equal to the element opposite w.r.t. the diagonal, aj,i. Hermitian matrices are the complex analogue of symmetric matrices: ai,j is the complex conjugate of aj,i. Matrices may be banded, with all non-zero entries residing in diagonal bands. A Positive Definite matrix is a symmetric (hermitian, if the matrix is complex) matrix whose eigenvalues are all positive. It also satisfies that vT M v > 0 for all non-zero vectors v. The matrix is Positive Semidefinite if vT M v ≥ 0, for all non-zero v.
As works cited in the References section of this paper provide detailed guides to selecting particular algorithms, we will discuss here only high-level algorithmic concepts.
Probably the most important is to choose between a direct solver and an iterative solver. As the name implies, the direct solver factors the matrix explicitly using a Cholesky, LU, QR, or some other factorization. Effective direct solvers evaluate interactions between rows and columns of the matrix that fill-in the matrix with non-zero values; hence producing a matrix that is less sparse (and hence more expensive to compute and store). The key is to symbolically factor the matrix to understand these interactions and then reorder the problem to minimize fill-in. In some cases, the result is a block factored matrix that can be solved as a sequence of smaller dense linear algebra problems. Direct solvers are relatively safe: they always work when the matrix is well conditioned. Unfortunately, they are also slow.
Hence, most people working with sparse solvers use iterative methods. These methods construct an approximation to the solution, compute an error term (a residual), update the approximation and then continue until the resulting error terms are small enough. The key kernel involved with a sparse iterative solver is a sparse-matrix, dense-vector multiplication (SpMV) routine which is itself expressed as a sequence of vector reduction operations.
Convergence rates vary widely depending on the iterative algorithm used and the properties of the sparse matrix. To accelerate convergence, an approximate inverse is used to transform the matrix into one that is better behaved numerically (a preconditioner). Selecting the right preconditioner is where much of the “art” of working with sparse iterative solvers arises.
For more details about sparse solvers, consult references included with this pattern or the documentation to the sparse solver library you are using.
While most high-performance implementations of most interesting algorithms are freely available, a programmer may still need to understand the data-structures used to implement sparse matrices. For direct solvers, the programmer must only understand how the library expects the matrix to be represented. For iterative solvers, frequently the library requires that the user provide an SpMV routine. The library may provide a default SpMV implementation, but as SpMV is a performance-critical operation the user may wish to optimize it. The rest of this Solution will focus on the implementation of Sparse Matrix-Vector multiplication, and data structures that support its optimization.
High-Level Optimization Approach As the space of optimizations for SpMV is large, it is important to direct the search by understanding which optimizations are likely to be fruitful. SpMV optimizations target three different aspects: memory bandwidth improvement, data-structure size reduction, and instruction-throughput improvement.
As interesting sparse matrices are frequently very large, it is common for the SpMV computation to be memory-bound. If the matrix is larger than the amount of cache present on the target processor, memory bandwidth is likely to be the larger performance bottleneck. In this case, one should focus on data-structure size reduction and memory-bandwidth improvement via cache blocking. However, there are situations in which instruction-throughput is more important. For example, the L2 caches on some processors are tens of megabytes, and can hold the entire working sets of many linear algebra problems. In those cases, memory bandwidth is less important than taking advantage of SIMD and multi-core parallelism to increase instruction throughput.
Sparse Matrix Data Structures Sparse matrix representations do not explicitly store the matrix’s zero-elements. As most of the matrix’s entries are zero, the total memory required for the matrix is reduced significantly, frequently by an order of magnitude. The most basic data structure is Compressed Sparse Row (CSR), illustrated in the next figure. CSR stores the matrix entries in a single linear array (val) (two arrays, if the entries are complex), along with a separate array of integers (col)) which store the column indices for each non-zero. A third array, (row), with one entry per row of the matrix, stores the offset into the val and col arrays at which each row begins.
For banded matrices, it is frequently advantageous to represent the diagonals in linear vectors, obviating the column index vectors and decreasing the size of the overall data structure. If the band-structure of the matrix is well-understood and predictable, it is possible to eliminate the index vectors altogether.
Frequently, sparse matrices contain many locally dense regions. That is, small (e.g. 4×4 or 8×8) sub-matrices scattered throughout the matrix may be entirely (or mostly) non-zero. An optimization (frequently referred to as register blocking in the literature) is to represent the matrix as a collection of small dense matrices, rather than as a collection of individual matrix entries. The transformation improves performance in several ways: the amount of index data required to store the resulting representation is substantially smaller; a smaller fraction of the SpMV computation is spent on manipulating the data-structure; and the operations on the dense sub-matrices are more regular and easier to optimize. However, as the sub-matrices are usually not entirely zero, register blocking requires the explicit storage of some zero values. The resulting increase in data-structure size and computation on zero-values must be balanced against the benefits of the transformation. The two most common approaches to register blocking are Block Coordinate (BCOO) and Block Compressed Sparse Row (BCSR). BCOO explicitly stores the row and column indices of the upper-left element of each densified block, while BCSR compresses the row-index array in the same manner as CSR.
Parallelism in SpMV is somewhat obvious. Each row of the matrix is involved in a sparse dot-product with the dense vector, and independent of each other row’s dot-product. Additionally, the sparse dot-product for each row contains reduction-style parallelism. The reduction may be difficult to implement efficiently, however, due to the irregularity of a matrix’s row structure. In practice, the dot products are frequently serialized, as the parallelism between rows in large matrices is sufficient to saturate most parallel targets. SpMV’s parallelism can be exploited using the techniques described in the Data Parallelism pattern; frequently implementations utilize the SPMD and SIMD implementation patterns.
Load-balancing a parallel SpMV can be important to performance. As each row of the matrix may contain a different number of non-zero elements. Thus when parallelizing among matrix rows, the amount of work assigned to a Unit of Execution depends on the structure of the matrix.
Additionally, on a distributed-memory machine, parallelization can require communication between nodes if the source x vector is distributed: A node with a non-zero in column k will need to load the kth element of x, which may reside on another node. It is desirable to find a domain-decomposition (i.e. assignment of matrix rows to compute nodes) which minimizes communication. This problem is frequently formulated as a graph-partitioning problem. The graph of a Sparse Matrix contains one edge corresponding to each non-zero entry; all edges are undirected and of equal weight. The desired partitioning assigns equal-sized subsets of rows to each Unit of Execution, while minimizing the number of edges between partitions. Graph partitioning is an NP-complete problem, but many efficient heuristics exist for finding approximate solutions. Algorithms and strategies for solving the graph partitioning problem are discussed in the Graph Algorithms pattern.
Cache and TLB Blocking attempt to exploit re-use of the dense source vector, reducing TLB and cache misses. Each element of the matrix itself is only used once, and there is no re-use to exploit. However, each element xj of the source vector is used for each non-zero in the jth column of the matrix. Exploiting re-use of x between rows will also require loading a larger portion of the source vector y at once. The most effective strategy for cache-blocking is to re-organize the matrix into a set of smaller matrices. The approach is similar to register blocking, but the sub-matrices for cache-blocking are still stored sparsely. Each sub-matrix should require portions of the x and y vectors which fit in the cache.
Additionally, it it sometimes worthwhile to incorporate prefetching. As the memory-access pattern of SpMV is dependent upon the pattern of non-zeroes in the matrix’s rows, the prefetchers incorporated into the memory hierarchies of most processors may have trouble predicting accesses to the x vector. Note that the matrix is stored in linear arrays, and accessed linearly
The discrete representation of the problem (that leads to the matrices and vectors) accurately represents the original problem and produces well conditioned matrices.
The final result approximates the mathematical definition to within tolerated round-off error.
Power Method Iteration This example will demonstrate how Sparse Matrix-Vector Multiplication (SpMV), which has been the focus of much of this pattern, arises as an important kernel in Sparse Linear Algebra. Due to space limitations, we must refer the reader to external sources for definitions of basic concepts such as eigenvectors and eigenvalues. Consider the eigenproblem:
Ax = λx
Where A is a matrix, x an eigenvector, and λ an eigenvalue. The simplest method which can approximate λ and x is the power method (specifically, it approximates the largest eigenvalue and its correponding eigenvector). While the power method is too simple to be useful in practice, many realistic algorithms for solving the sparse eigenproblem can be understood as refinements on the idea underlying the power method. The power method:
Given a matrix A and an initial guess for an eigenvector u0,
For i = 0 … do:
u = Aui
ui+1 = u / ||u||2
λi+1 = uTi+1 A ui+1
As i increases, λi converges to the largest eigenvalue of A. The only computation performed by the power method is the matrix-vector multiplication in step 1, the normalization of the u vector in step 2, and in step 3 the dot product between uT and A u. As the vector normalization and dot product are relatively inexpensive operations, essentially all of the time spent in power method iterations is in the matrix-vector multiplications.
In realistic linear algebra problems, there are many requirements that the power method does not address. For example, it is frequently desirable to find several eigenvalues, rather than just the largest. In these cases, it is necessary to re-orthogonalize the eigenvector approximations between iterations. Additionally, other algorithms converge in fewer iterations than the power method.
Sparse linear algebra is at the heart of a most partial differential equation solvers and hence they are extremely common in the computational sciences. Finance problems, structural mechanics, data mining, operations research … the list of problems based on sparse linear algebra is extensive.
Solutions to the sparse linear algebra problem often use the graph algorithm and task parallelism patterns. For matrices represented as hierarchical data structures, the divide and conquer pattern is indicated.
Sparse iterative solver templates/decision-aide: www.netlib.org/templates
Direct solver decision-aide: crd.lbl.gov/~xiaoye/SuperLU/SparseDirectSurvey.pdf
Sparse eigensover: www.cs.utk.edu/~dongarra/etemplates
Jim Demmel and Mark Hoemmen, CS294 lecture on linear algebra.
List of freely available Linear Algebra Software: http://www.netlib.org/utk/people/JackDongarra/la-sw.html
Tim Mattson and Mark Murphy