Contents
Problem
Some problems are defined in terms of linear operations over arrays of data (e.g. vectors and matrices) whose elements are mostly zeros (sparse arrays). How do you optimize these operations to minimize storage and maximize performance?
Context
Many problems are expressed in terms of operations on vectors and matrices. These problems are diverse, and include solving systems of simultaneous linear equations (e.g. operations research), finding eigenvalues (e.g. structural mechanics), transforming between different representations of a system (e.g. image processing), and many others too numerous to list.
For an important class of these problems, most of the elements of the system are zero. This sparsity arises from the symmetry of the system or from the fact that different subcomponents of the system are independent of each other. When the fraction of zeros is large enough that there are benefits to explicitly taking these zeros into account when solving the problem, these problems are called Sparse Linear Algebra problems.
There are two key benefits to treating a problem as sparse. First, the amount of storage required is greatly reduced: only the nonzero elements and their indices need to be stored, which can cut total storage by orders of magnitude. Second, the computational cost is reduced, since the results of arithmetic operations with zeros are known without actually doing the computation.
The challenge in sparse linear algebra is to balance storage, computational cost, and stability to create an effective solution. While the approaches are well constrained for dense linear algebra, the case for sparse linear algebra is much more complicated.
Forces
Trade off the cost of recomputing intermediate results against the cost of storing them and fetching them later.
Overlapping computation with data movement requires software constructed around the details of a particular memory hierarchy. But this is in direct conflict with the need to write portable software.
A linear algebra operation may dictate a distinct layout of data to achieve optimal performance. These data layouts, however, may conflict with those needed by other operations, forcing the programmer to balance data-copy overheads against potentially suboptimal algorithms.
Solution
Developing software “from scratch” to solve sparse linear algebra problems usually requires in-depth understanding of the algorithms being implemented. Much care must be taken to ensure numerical stability, algorithmic convergence, and a number of other concerns which are far beyond the scope of this pattern. Developing good algorithms is “more art than science” and is not usually the task of an application developer, even though developers may need to understand some numerical issues specific to their problems. When facing a sparse linear algebra problem, a programmer should therefore always look first for standard library solutions. Comprehensive lists of publicly available high-quality software are available online, and the References section contains links to some of these lists.
Linear algebra libraries usually contain many routines which solve the same problem (e.g. LU decomposition, or least squares) but are optimized for particular classes of matrices. Selecting the best routine requires a user to understand certain properties of the matrices being used, including the pattern of nonzero locations as well as several numerical characteristics.
Real matrices contain only real-valued entries, as opposed to Complex matrices, whose entries may be complex numbers. A Symmetric matrix is equal to its transpose; that is, each entry a_{i,j} is equal to the element opposite it with respect to the diagonal, a_{j,i}. Hermitian matrices are the complex analogue of symmetric matrices: a_{i,j} is the complex conjugate of a_{j,i}. Matrices may be banded, with all nonzero entries residing in diagonal bands. A Positive Definite matrix is a symmetric (Hermitian, if the matrix is complex) matrix whose eigenvalues are all positive; equivalently, it satisfies v^{T} M v > 0 for all nonzero vectors v. The matrix is Positive Semidefinite if v^{T} M v ≥ 0 for all nonzero v.
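As a concrete illustration of these properties, the following sketch (plain Python, with small dense matrices stored as lists of lists; function names are illustrative, not from any library) tests symmetry directly, and tests positive definiteness by attempting a Cholesky factorization, which succeeds exactly when the matrix is symmetric positive definite:

```python
import math

def is_symmetric(M, tol=1e-12):
    """Check a_{i,j} == a_{j,i} for every entry, up to a small tolerance."""
    n = len(M)
    return all(abs(M[i][j] - M[j][i]) <= tol for i in range(n) for j in range(n))

def is_positive_definite(M):
    """Attempt a Cholesky factorization M = L L^T; it runs to completion
    with positive diagonal pivots iff M is symmetric positive definite."""
    if not is_symmetric(M):
        return False
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = M[i][i] - s
                if d <= 0:          # non-positive pivot: not positive definite
                    return False
                L[i][i] = math.sqrt(d)
            else:
                L[i][j] = (M[i][j] - s) / L[j][j]
    return True
```

For example, [[4, 1], [1, 3]] passes both tests, while the symmetric matrix [[1, 2], [2, 1]] (which has a negative eigenvalue) fails the definiteness test.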
As the works cited in the References section of this paper provide detailed guides to selecting particular algorithms, we will discuss here only high-level algorithmic concepts.
Probably the most important choice is between a direct solver and an iterative solver. As the name implies, a direct solver factors the matrix explicitly using a Cholesky, LU, QR, or some other factorization. Effective direct solvers evaluate interactions between rows and columns of the matrix that fill in the matrix with nonzero values, producing a matrix that is less sparse (and hence more expensive to compute and store). The key is to symbolically factor the matrix to understand these interactions and then reorder the problem to minimize fill-in. In some cases, the result is a block-factored matrix that can be solved as a sequence of smaller dense linear algebra problems. Direct solvers are relatively safe: they always work when the matrix is well conditioned. Unfortunately, they are also slow.
Hence, most people working with sparse solvers use iterative methods. These methods construct an approximation to the solution, compute an error term (a residual), update the approximation, and then continue until the resulting error terms are small enough. The key kernel in a sparse iterative solver is a sparse-matrix, dense-vector multiplication (SpMV) routine, which is itself expressed as a sequence of vector reduction operations.
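To make the approximate/residual/update cycle concrete, here is a minimal sketch of one of the simplest iterative methods, Jacobi iteration, in plain Python. Dense storage is used for brevity; a real sparse solver would express the matrix-vector products inside the loop through an SpMV kernel, and practical codes use more sophisticated methods:

```python
def jacobi(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b by Jacobi iteration (converges for diagonally
    dominant A). Each sweep updates the approximation and then checks
    the residual norm; the work is dominated by matrix-vector products."""
    n = len(A)
    x = [0.0] * n                              # initial guess
    for _ in range(max_iter):
        # update: x_new[i] = (b[i] - sum_{j != i} A[i][j] x[j]) / A[i][i]
        x_new = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i))
                 / A[i][i] for i in range(n)]
        # residual: max_i |b[i] - (A x_new)[i]|
        residual = max(abs(b[i] - sum(A[i][j] * x_new[j] for j in range(n)))
                       for i in range(n))
        x = x_new
        if residual < tol:                     # small enough: stop iterating
            break
    return x
```

For the diagonally dominant system A = [[4, 1], [1, 3]], b = [1, 2], the iteration converges quickly to the exact solution (1/11, 7/11).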
Convergence rates vary widely depending on the iterative algorithm used and the properties of the sparse matrix. To accelerate convergence, an approximate inverse (a preconditioner) is used to transform the matrix into one that is better behaved numerically. Selecting the right preconditioner is where much of the “art” of working with sparse iterative solvers arises.
For more details about sparse solvers, consult references included with this pattern or the documentation to the sparse solver library you are using.
While high-performance implementations of most interesting algorithms are freely available, a programmer may still need to understand the data structures used to implement sparse matrices. For direct solvers, the programmer only needs to understand how the library expects the matrix to be represented. For iterative solvers, the library frequently requires that the user provide an SpMV routine. The library may provide a default SpMV implementation, but as SpMV is a performance-critical operation, the user may wish to optimize it. The rest of this Solution will focus on the implementation of Sparse Matrix-Vector multiplication, and on data structures that support its optimization.
High-Level Optimization Approach
As the space of optimizations for SpMV is large, it is important to direct the search by understanding which optimizations are likely to be fruitful. SpMV optimizations target three different aspects: memory-bandwidth improvement, data-structure size reduction, and instruction-throughput improvement.
As interesting sparse matrices are frequently very large, it is common for the SpMV computation to be memory-bound. If the matrix is larger than the amount of cache present on the target processor, memory bandwidth is likely to be the dominant performance bottleneck. In this case, one should focus on data-structure size reduction and memory-bandwidth improvement via cache blocking. However, there are situations in which instruction throughput is more important. For example, the L2 caches on some processors are tens of megabytes, and can hold the entire working sets of many linear algebra problems. In those cases, memory bandwidth is less important than taking advantage of SIMD and multicore parallelism to increase instruction throughput.
Sparse Matrix Data Structures
Sparse matrix representations do not explicitly store the matrix’s zero elements. As most of the matrix’s entries are zero, the total memory required for the matrix is reduced significantly, frequently by an order of magnitude. The most basic data structure is Compressed Sparse Row (CSR), illustrated in the next figure. CSR stores the matrix entries in a single linear array (val) (two arrays, if the entries are complex), along with a separate array of integers (col) which stores the column index of each nonzero. A third array, (row), with one entry per row of the matrix (plus a final entry marking the end of the last row), stores the offset into the val and col arrays at which each row begins.
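In code, CSR and its SpMV kernel take only a few lines. The following sketch (plain Python; the row array follows the common convention of n+1 entries) multiplies a CSR matrix by a dense vector:

```python
def csr_spmv(val, col, row, x):
    """Compute y = A x for a matrix A stored in CSR format.
    val[k] -- value of the k-th stored nonzero
    col[k] -- column index of the k-th stored nonzero
    row[i] -- offset in val/col where row i begins (len(row) == n + 1)"""
    n = len(row) - 1
    y = [0.0] * n
    for i in range(n):
        acc = 0.0
        # sparse dot product of row i with the dense vector x
        for k in range(row[i], row[i + 1]):
            acc += val[k] * x[col[k]]
        y[i] = acc
    return y
```

For example, the 3×3 matrix [[10, 0, 0], [0, 0, 2], [3, 0, 4]] is stored as val = [10, 2, 3, 4], col = [0, 2, 0, 2], row = [0, 1, 2, 4]; multiplying by x = [1, 1, 1] yields [10, 2, 7].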
For banded matrices, it is frequently advantageous to represent the diagonals as linear vectors, obviating the column-index vector and decreasing the size of the overall data structure. If the band structure of the matrix is well understood and predictable, it is possible to eliminate the index vectors altogether.
Frequently, sparse matrices contain many locally dense regions. That is, small (e.g. 4×4 or 8×8) submatrices scattered throughout the matrix may be entirely (or mostly) nonzero. An optimization (frequently referred to as register blocking in the literature) is to represent the matrix as a collection of small dense matrices, rather than as a collection of individual matrix entries. The transformation improves performance in several ways: the amount of index data required to store the resulting representation is substantially smaller; a smaller fraction of the SpMV computation is spent on manipulating the data structure; and the operations on the dense submatrices are more regular and easier to optimize. However, as the submatrices are usually not entirely nonzero, register blocking requires the explicit storage of some zero values. The resulting increase in data-structure size and computation on zero values must be balanced against the benefits of the transformation. The two most common approaches to register blocking are Block Coordinate (BCOO) and Block Compressed Sparse Row (BCSR). BCOO explicitly stores the row and column indices of the upper-left element of each densified block, while BCSR compresses the row-index array in the same manner as CSR.
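A minimal sketch of the BCSR idea, for a fixed 2×2 block size, follows (plain Python; the names and layout are illustrative, not any particular library's format). Note how the inner block multiply is fully unrolled and regular, which is precisely what makes register blocking attractive:

```python
def bcsr_spmv_2x2(bval, bcol, brow, x):
    """Compute y = A x for a matrix stored in 2x2 Block CSR (BCSR).
    bval[k] -- k-th 2x2 dense block, row-major, as a flat 4-tuple
    bcol[k] -- column index of that block, in units of blocks
    brow[i] -- offset where block-row i begins (len = num block rows + 1)
    Explicit zeros inside a block are stored and multiplied."""
    nbr = len(brow) - 1
    y = [0.0] * (2 * nbr)
    for i in range(nbr):
        y0 = y1 = 0.0
        for k in range(brow[i], brow[i + 1]):
            j = 2 * bcol[k]
            b = bval[k]
            x0, x1 = x[j], x[j + 1]
            # unrolled 2x2 dense block multiply: regular, easy to vectorize
            y0 += b[0] * x0 + b[1] * x1
            y1 += b[2] * x0 + b[3] * x1
        y[2 * i] = y0
        y[2 * i + 1] = y1
    return y
```

Note that the second block in a matrix such as [[5, 0], [0, 6]] carries two explicitly stored zeros; that storage and computation overhead is the cost the text describes.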
Parallelism in SpMV is somewhat obvious. Each row of the matrix is involved in a sparse dot product with the dense vector, independent of every other row’s dot product. Additionally, the sparse dot product for each row contains reduction-style parallelism. The reduction may be difficult to implement efficiently, however, due to the irregularity of a matrix’s row structure. In practice, the dot products are frequently serialized, as the parallelism between rows in large matrices is sufficient to saturate most parallel targets. SpMV’s parallelism can be exploited using the techniques described in the Data Parallelism pattern; implementations frequently utilize the SPMD and SIMD implementation patterns.
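The row-level independence can be sketched as follows, partitioning the rows of a CSR matrix into contiguous chunks handled by separate workers (plain Python threads for illustration only; CPython's global interpreter lock limits real speedup for numeric code, and production implementations would use OpenMP, MPI, or similar):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_csr_spmv(val, col, row, x, nworkers=4):
    """Row-parallel CSR SpMV: rows are independent, so each worker
    computes the sparse dot products for a contiguous chunk of rows,
    writing disjoint slices of y with no synchronization needed beyond
    joining the workers at the end."""
    n = len(row) - 1
    y = [0.0] * n

    def do_rows(lo, hi):
        for i in range(lo, hi):
            y[i] = sum(val[k] * x[col[k]] for k in range(row[i], row[i + 1]))

    chunk = max(1, (n + nworkers - 1) // nworkers)
    with ThreadPoolExecutor(max_workers=nworkers) as pool:
        futures = [pool.submit(do_rows, lo, min(lo + chunk, n))
                   for lo in range(0, n, chunk)]
        for f in futures:
            f.result()  # wait and propagate any worker exceptions
    return y
```

The equal-sized row chunks used here ignore the per-row nonzero counts, which is exactly the load-balancing concern raised below.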
Load-balancing a parallel SpMV can be important to performance. Each row of the matrix may contain a different number of nonzero elements, so when parallelizing among matrix rows, the amount of work assigned to a Unit of Execution depends on the structure of the matrix.
Additionally, on a distributed-memory machine, parallelization can require communication between nodes if the source x vector is distributed: a node with a nonzero in column k will need to load the k^{th} element of x, which may reside on another node. It is desirable to find a domain decomposition (i.e. an assignment of matrix rows to compute nodes) which minimizes communication. This problem is frequently formulated as a graph-partitioning problem. The graph of a sparse matrix contains one edge corresponding to each nonzero entry; all edges are undirected and of equal weight. The desired partitioning assigns equal-sized subsets of rows to each Unit of Execution, while minimizing the number of edges between partitions. Graph partitioning is an NP-complete problem, but many efficient heuristics exist for finding approximate solutions. Algorithms and strategies for solving the graph-partitioning problem are discussed in the Graph Algorithms pattern.
Cache and TLB blocking attempts to exploit reuse of the dense source vector, reducing TLB and cache misses. Each element of the matrix itself is used only once, so there is no reuse of the matrix to exploit. However, each element x_{j} of the source vector is used once for each nonzero in the j^{th} column of the matrix. Exploiting reuse of x between rows also requires holding a larger portion of the destination vector y at once. The most effective strategy for cache blocking is to reorganize the matrix into a set of smaller matrices. The approach is similar to register blocking, but the submatrices for cache blocking are still stored sparsely. Each submatrix should require portions of the x and y vectors which fit in the cache.
Additionally, it is sometimes worthwhile to incorporate prefetching. As the memory-access pattern of SpMV depends on the pattern of nonzeros in the matrix’s rows, the prefetchers incorporated into the memory hierarchies of most processors may have trouble predicting accesses to the x vector. Note that the matrix itself is stored in linear arrays and accessed linearly, so those accesses are handled well by hardware prefetchers.
Invariant
The discrete representation of the problem (that leads to the matrices and vectors) accurately represents the original problem and produces well-conditioned matrices.
The final result approximates the mathematical definition to within tolerated roundoff error.
Examples
Power Method Iteration
This example demonstrates how Sparse Matrix-Vector Multiplication (SpMV), which has been the focus of much of this pattern, arises as an important kernel in Sparse Linear Algebra. Due to space limitations, we must refer the reader to external sources for definitions of basic concepts such as eigenvectors and eigenvalues. Consider the eigenproblem:
Ax = λx
where A is a matrix, x an eigenvector, and λ an eigenvalue. The simplest method which can approximate λ and x is the power method (specifically, it approximates the largest eigenvalue and its corresponding eigenvector). While the power method is too simple to be useful in practice, many realistic algorithms for solving the sparse eigenproblem can be understood as refinements of the idea underlying the power method. The power method:
Given a matrix A and an initial guess for an eigenvector u_{0},
For i = 0, 1, 2, … do:
1. u = Au_{i}
2. u_{i+1} = u / ||u||_{2}
3. λ_{i+1} = u^{T}_{i+1} A u_{i+1}
As i increases, λ_{i} converges to the largest eigenvalue of A. The only computation performed by the power method is the matrix-vector multiplication in step 1, the normalization of the u vector in step 2, and the dot product between u^{T}_{i+1} and A u_{i+1} in step 3. As the vector normalization and dot product are relatively inexpensive operations, essentially all of the time spent in power method iterations is in the matrix-vector multiplications.
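The iteration above maps directly onto an SpMV kernel. A sketch in plain Python follows, where the spmv callable and the dimension n are stand-ins for whatever matrix representation a library provides:

```python
import math

def power_method(spmv, n, iters=100):
    """Approximate the largest eigenvalue and eigenvector of A using only
    an SpMV routine: spmv(x) must return the product A x."""
    u = [1.0] * n                          # initial guess u_0
    lam = 0.0
    for _ in range(iters):
        w = spmv(u)                        # step 1: u = A u_i
        norm = math.sqrt(sum(wi * wi for wi in w))
        u = [wi / norm for wi in w]        # step 2: u_{i+1} = u / ||u||_2
        Au = spmv(u)
        lam = sum(a * b for a, b in zip(u, Au))  # step 3: lambda = u^T A u
    return lam, u
```

For the diagonal matrix diag(2, 1), spmv is just lambda v: [2 * v[0], v[1]], and the returned λ converges to the largest eigenvalue 2. Each iteration costs two SpMV calls here; step 3 could reuse the step-1 product of the next iteration to save one.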
In realistic linear algebra problems, there are many requirements that the power method does not address. For example, it is frequently desirable to find several eigenvalues, rather than just the largest. In these cases, it is necessary to reorthogonalize the eigenvector approximations between iterations. Additionally, other algorithms converge in fewer iterations than the power method.
Known Uses
Sparse linear algebra is at the heart of most partial differential equation solvers and hence extremely common in the computational sciences. Finance problems, structural mechanics, data mining, operations research … the list of problems based on sparse linear algebra is extensive.
Related Patterns
Solutions to the sparse linear algebra problem often use the Graph Algorithms and Task Parallelism patterns. For matrices represented as hierarchical data structures, the Divide and Conquer pattern is indicated.
References
- Sparse iterative solver templates/decision aide: www.netlib.org/templates
- Direct solver decision aide: crd.lbl.gov/~xiaoye/SuperLU/SparseDirectSurvey.pdf
- Sparse eigensolver: www.cs.utk.edu/~dongarra/etemplates
- Jim Demmel and Mark Hoemmen, CS294 lecture on linear algebra.
- List of freely available linear algebra software: http://www.netlib.org/utk/people/JackDongarra/lasw.html
Authors
Tim Mattson and Mark Murphy