9:45 Multi-GPU scalability of mesh-based HPC applications – T. Aoki (Tokyo Tech)

Some HPC applications have been successfully accelerated on GPUs. To run large-scale HPC applications whose data exceed the memory of a single GPU, GPU-to-GPU communication across nodes is required, passing through the PCI-Express bus and the interconnect. The cost of this communication is comparable to that of the GPU computation, so overlapping computation with communication is needed to sustain linear strong scaling. We demonstrate two mesh-based applications.
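
Why overlapping matters for strong scaling can be seen in a simple per-step cost model. The sketch below is illustrative only and is not taken from the talk: the function names and all timings are hypothetical, and it assumes that boundary cells are computed first so that their exchange can proceed while the interior is updated.

```python
def step_time_no_overlap(t_interior, t_boundary, t_comm):
    """Compute all cells, then exchange halos: the three costs add up."""
    return t_interior + t_boundary + t_comm


def step_time_overlap(t_interior, t_boundary, t_comm):
    """Compute boundary cells first, then exchange halos while the
    interior is updated: only the longer of the two is paid."""
    return t_boundary + max(t_interior, t_comm)


# Hypothetical timings in milliseconds (illustrative, not measured).
t_interior, t_boundary, t_comm = 8.0, 1.0, 6.0

serial = step_time_no_overlap(t_interior, t_boundary, t_comm)
overlapped = step_time_overlap(t_interior, t_boundary, t_comm)
print(serial, overlapped)
```

In this model the communication is fully hidden only while t_comm does not exceed t_interior. Under strong scaling the interior work per GPU shrinks faster than the halo traffic, so beyond some GPU count the overlap can no longer hide the whole communication cost.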


The first is the lattice Boltzmann method on a 2000x1000x1000 mesh, which demonstrates multi-GPU scalability on the Tokyo Tech TSUBAME grid cluster. The second is a phase-transition study using the phase-field model. Dendritic solidification of a pure metal such as steel can be examined by solving the Allen-Cahn equation coupled with thermal conduction. We have developed a GPU simulation code in CUDA and achieved 171 GFlops (single precision) on a single NVIDIA GeForce GTX 285 GPU. By introducing the overlapping technique, 60 Tesla S1070 GPUs delivered 10 TFlops for a computation on a 2400x2400x2400 mesh.
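
The quoted figures can be put in perspective with some back-of-the-envelope arithmetic. The sketch below uses only the mesh size, GPU count and aggregate rate given above. The storage estimate assumes one single-precision value (4 bytes) per cell per field, and the number of fields the actual code stores is not stated here.

```python
# Mesh size, GPU count and aggregate rate as quoted in the abstract.
nx = ny = nz = 2400
num_gpus = 60
total_flops = 10e12  # 10 TFlops

# Assumption: one single-precision value (4 bytes) per cell per field.
bytes_per_value = 4

cells = nx * ny * nz
gib_per_field = cells * bytes_per_value / 2**30
gib_per_field_per_gpu = gib_per_field / num_gpus
gflops_per_gpu = total_flops / num_gpus / 1e9

print(f"{cells:,} cells")
print(f"{gib_per_field:.1f} GiB per field, "
      f"{gib_per_field_per_gpu:.2f} GiB per field per GPU")
print(f"{gflops_per_gpu:.0f} GFlops per GPU on average")
```

Under that assumption a single field occupies about 51.5 GiB, which is why the domain has to be decomposed across GPUs. The average of about 167 GFlops per GPU is not directly comparable with the 171 GFlops single-GPU figure, because the two were obtained on different GPU models.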