13:30 Scalable Implementation Techniques for Sparse Iterative Solvers on GPU Clusters – A. Cevahir (Tokyo Tech)

Motivated by the high computational power and low price-to-performance ratio of GPUs, GPU-accelerated clusters are being built for high-performance scientific computing. We propose scalable implementation techniques for sparse iterative solvers, in this work a Conjugate Gradient (CG) solver in particular, for unstructured matrices on a GPU-extended cluster in which each node has multiple GPUs. The basic computations of the solver are performed on the GPUs, and communication is managed by the CPU. For sparse matrix-vector multiplication, which is the most time-consuming operation, the solver automatically selects the fastest among several high-performance GPU algorithms proposed by NVIDIA and by ourselves. Scalability is harder to obtain in a GPU-extended cluster than in a traditional CPU cluster: because GPUs compute much faster than CPUs, they demand faster communication. To achieve scalability, we adopt hypergraph-partitioning-based models, which are state-of-the-art models for communication reduction and load balancing in parallel sparse iterative solvers. We explain a hierarchical partitioning model that better optimizes for the underlying heterogeneous system. In our experiments, we obtain up to 152 Gflops of CG performance on 16 nodes, each with 2 NVIDIA Tesla GPUs.
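
For readers unfamiliar with the solver, the following is a minimal serial sketch of a Conjugate Gradient iteration built around a sparse matrix-vector product. It is an illustration only, not the GPU implementation described in the talk: the CSR storage format, the function names csr_spmv and conjugate_gradient, the tolerance, and the small 3x3 test matrix are all assumptions made for this example. It shows why the matrix-vector product dominates the cost, since it is the only step in each iteration that touches the matrix.

```python
import math


def csr_spmv(indptr, indices, data, x):
    """Return y = A @ x for a matrix A stored in CSR form (assumed format)."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        # Nonzeros of this row live in data[indptr[row]:indptr[row + 1]].
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]
        y[row] = acc
    return y


def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(indptr, indices, data, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite CSR matrix A.

    Returns the approximate solution and the number of iterations performed.
    """
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x, with x = 0 initially
    p = list(r)            # first search direction
    rs_old = dot(r, r)
    b_norm = math.sqrt(dot(b, b)) or 1.0
    for it in range(max_iter):
        # Stop once the relative residual norm is below the tolerance.
        if math.sqrt(rs_old) <= tol * b_norm:
            return x, it
        Ap = csr_spmv(indptr, indices, data, p)   # the dominant operation
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x, max_iter


# Hypothetical test system: A = [[4, 1, 0], [1, 3, 1], [0, 1, 2]],
# with b chosen so that the exact solution is [1, 2, 3].
indptr = [0, 2, 5, 7]
indices = [0, 1, 0, 1, 2, 1, 2]
data = [4.0, 1.0, 1.0, 3.0, 1.0, 1.0, 2.0]
b = [6.0, 10.0, 8.0]

x, iters = conjugate_gradient(indptr, indices, data, b)
print(x, iters)
```

In a distributed setting of the kind described above, the rows of the matrix and the matching vector entries would be split across processors, so the matrix-vector product requires exchanging vector entries and each inner product requires a global reduction. Those are the communication costs that a partitioning model aims to reduce.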