13:15 GPU Acceleration of Weather Prediction Model – T. Shimokawabe (Tokyo Tech)
ASUCA is a next-generation, high-resolution mesoscale atmospheric model being developed by the Japan Meteorological Agency. Numerical simulation of the atmosphere requires solving complicated governing equations within a short run time so that precise weather forecasts can be produced every day. While using hundreds of CPUs is the most common way to perform such large-scale simulations, using a GPU as a massively parallel computing platform is an alternative.
The governing equations for the dynamics in ASUCA are the compressible nonhydrostatic equations written in flux form. ASUCA uses the Lorenz grid in the vertical and the Arakawa C grid in the horizontal, together with a terrain-following vertical coordinate transformation. The third-order Runge-Kutta method given by Wicker (2002) is adopted for time integration. Because the vertical grid spacing is much smaller than the horizontal, the propagation of sound waves in the vertical direction would determine the time step of a fully explicit scheme. To avoid this restriction, ASUCA introduces a horizontally explicit, vertically implicit (HEVI) scheme with a time-splitting method.
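The time-step argument and the Runge-Kutta integration described above can be illustrated with a minimal Python sketch. It is not ASUCA code. The grid spacings (2 km horizontal, 50 m vertical) and the sound speed of 340 m/s are hypothetical values chosen only for illustration, the function names are invented here, and the scalar decay equation stands in for the real governing equations. The three stages follow the commonly quoted form of the Wicker (2002) scheme, which is third-order accurate for linear problems.

```python
import math

def rk3_step(f, phi, dt):
    """Advance phi by one step with the three-stage scheme
    phi*  = phi + dt/3 * f(phi)
    phi** = phi + dt/2 * f(phi*)
    phi+  = phi + dt   * f(phi**)
    """
    phi1 = phi + dt / 3.0 * f(phi)
    phi2 = phi + dt / 2.0 * f(phi1)
    return phi + dt * f(phi2)

def acoustic_dt_limit(spacing_m, sound_speed=340.0):
    """Rough explicit stability bound dt <= spacing / c_s for sound waves.
    The default sound speed is an assumed, approximate value."""
    return spacing_m / sound_speed

# Hypothetical grid spacings in metres (not taken from the abstract).
dx, dz = 2000.0, 50.0
dt_horizontal = acoustic_dt_limit(dx)  # bound set by horizontal spacing
dt_vertical = acoustic_dt_limit(dz)    # much tighter bound from vertical spacing

# Toy problem: d(phi)/dt = -phi, integrated from t = 0 to t = 1.
def decay(phi):
    return -phi

phi, dt = 1.0, 0.1
for _ in range(10):
    phi = rk3_step(decay, phi, dt)
error = abs(phi - math.exp(-1.0))

print(f"explicit dt bound, horizontal: {dt_horizontal:.2f} s")
print(f"explicit dt bound, vertical:   {dt_vertical:.3f} s")
print(f"RK3 error at t = 1: {error:.2e}")
```

With these assumed spacings, the vertical bound is 40 times tighter than the horizontal one. Treating the vertical acoustic terms implicitly, as in a HEVI scheme, removes that tighter bound from the time-step choice.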
We have implemented the dynamics of ASUCA on the NVIDIA Tesla S1070 using CUDA and compared its performance with that of the CPU version. To achieve high performance on the GPU, all variables needed for the dynamics calculations are allocated in GPU memory, which makes it possible to run the simulation without frequent data transfer between the GPU and the host computer. A performance of 44.3 GFlops in single precision has been achieved for a 320 x 256 x 48 mesh on a single GPU. ASUCA on the GPU runs 83.4 times faster than the original serial Fortran code running on a 2.4 GHz AMD Opteron CPU. In double precision, the speedup is 26.3x. For multi-GPU computation, a method that overlaps communication with computation is introduced. Using 120 GPUs on the TSUBAME supercomputer at the Tokyo Institute of Technology, a performance of 3.2 TFlops in single precision has been achieved for a 3164 x 3028 x 48 mesh.
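The reported figures can be put in proportion with a short piece of arithmetic, sketched here in Python. All inputs are numbers quoted in the abstract, while the derived quantities are back-of-the-envelope estimates rather than results reported by the author. In particular, the implied CPU rate assumes the CPU and GPU runs perform the same number of floating-point operations, and the variable names are chosen here for illustration.

```python
# Figures quoted in the abstract (single precision).
single_gpu_gflops = 44.3
single_gpu_mesh = (320, 256, 48)
speedup_single_precision = 83.4
multi_gpu_tflops = 3.2
multi_gpu_count = 120
multi_gpu_mesh = (3164, 3028, 48)

def cells(mesh):
    """Total number of grid cells in an (nx, ny, nz) mesh."""
    nx, ny, nz = mesh
    return nx * ny * nz

# Assumption: same flop count on CPU and GPU, so rate scales with speedup.
implied_cpu_gflops = single_gpu_gflops / speedup_single_precision

# Average sustained rate and workload per GPU in the 120-GPU run.
per_gpu_gflops = multi_gpu_tflops * 1000.0 / multi_gpu_count
cells_per_gpu = cells(multi_gpu_mesh) / multi_gpu_count

# Per-GPU rate in the multi-GPU run relative to the single-GPU run.
relative_per_gpu_rate = per_gpu_gflops / single_gpu_gflops

print(f"implied CPU rate:        {implied_cpu_gflops:.2f} GFlops")
print(f"single-GPU mesh cells:   {cells(single_gpu_mesh):,}")
print(f"cells per GPU (120 GPU): {cells_per_gpu:,.0f}")
print(f"per-GPU rate (120 GPU):  {per_gpu_gflops:.1f} GFlops")
print(f"relative per-GPU rate:   {relative_per_gpu_rate:.2f}")
```

Each GPU in the 120-GPU run handles roughly 3.8 million cells, close to the 3.9 million cells of the single-GPU mesh, so the two runs are comparable per GPU. On that basis each GPU sustains about 60% of the single-GPU rate, with the remainder presumably going to communication and boundary exchange.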