The governing equations for the dynamics in ASUCA are the compressible nonhydrostatic equations written in flux form. Variables are staggered on the Arakawa C grid in the horizontal and on the Lorenz grid in the vertical, and a terrain-following vertical coordinate transformation is applied. The third-order Runge-Kutta scheme given by Wicker (2002) is adopted for time integration. Because the vertical grid spacing is much smaller than the horizontal spacing, sound-wave propagation in the vertical direction would determine the time step of a fully explicit scheme. To avoid this restriction, ASUCA introduces a horizontally explicit, vertically implicit (HEVI) scheme combined with a time-splitting method.
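As an illustration of the time integration mentioned above, the following is a minimal sketch of a three-stage Runge-Kutta step in the form commonly attributed to Wicker's scheme (stage weights 1/3, 1/2, 1). It assumes a scalar state and a user-supplied tendency function. The names `rk3_step` and `tendency` are hypothetical and are not taken from ASUCA, and the HEVI treatment and time splitting are deliberately left out.

```python
def rk3_step(tendency, phi, dt):
    """Advance phi by one time step dt using three forward stages.

    Each stage restarts from the value at the beginning of the step and
    uses the tendency evaluated at the previous stage.
    """
    phi_1 = phi + (dt / 3.0) * tendency(phi)     # first stage: dt/3
    phi_2 = phi + (dt / 2.0) * tendency(phi_1)   # second stage: dt/2
    return phi + dt * tendency(phi_2)            # final stage: full dt


if __name__ == "__main__":
    # Hypothetical test problem: exponential decay d(phi)/dt = -phi.
    phi, dt = 1.0, 0.01
    for _ in range(100):
        phi = rk3_step(lambda p: -p, phi, dt)
    print(phi)
```

For a linear tendency the three stages reproduce the Taylor expansion of the exact solution up to the cubic term, which is why the scheme is third-order accurate in that case.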
We have implemented the ASUCA dynamics on the NVIDIA Tesla S1070 using CUDA and compared its performance with that of a CPU. To achieve high performance on the GPU, all variables needed for the dynamics calculations are allocated in GPU memory, which makes it possible to run the simulation without frequent data transfers between the GPU and the host computer. A performance of 44.3 GFlops in single precision has been achieved for a 320 x 256 x 48 mesh on a single GPU. The GPU implementation of ASUCA runs 83.4 times faster than the original serial Fortran code running on a 2.4 GHz AMD Opteron CPU. In double precision, the speedup is 26.3x. For multi-GPU computation, a method that overlaps communication with computation is introduced. Using 120 GPUs of the TSUBAME supercomputer at the Tokyo Institute of Technology, a performance of 3.2 TFlops in single precision has been achieved for a 3164 x 3028 x 48 mesh.
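The idea of overlapping communication with computation can be sketched on a toy problem. The example below is not the ASUCA implementation: it assumes a one-dimensional three-point smoothing stencil split across two subdomains, and it uses a Python background thread as a stand-in for the asynchronous GPU, host and network transfers. All function names are hypothetical. The halo exchange is started first, the points that do not depend on halo data are updated while it is in flight, and the points next to the subdomain boundary are updated once it has finished.

```python
from concurrent.futures import ThreadPoolExecutor


def stencil(u, i):
    """Three-point smoothing stencil evaluated at index i."""
    return 0.25 * u[i - 1] + 0.5 * u[i] + 0.25 * u[i + 1]


def split(u, m):
    """Split u at index m; each part carries one halo cell from its neighbor."""
    return u[:m + 1], u[m - 1:]


def join(left, right):
    """Drop the halo cells and glue the two subdomains back together."""
    return left[:-1] + right[1:]


def exchange_halos(left, right):
    """Stand-in for the data transfer: refresh each halo from the neighbor."""
    left[-1] = right[1]
    right[0] = left[-2]


def step_overlapped(left, right, pool):
    """One stencil step with the halo exchange hidden behind interior work."""
    new_left, new_right = left[:], right[:]
    comm = pool.submit(exchange_halos, left, right)  # start "communication"
    # Interior points: none of these reads a halo cell.
    for i in range(1, len(left) - 2):
        new_left[i] = stencil(left, i)
    for i in range(2, len(right) - 1):
        new_right[i] = stencil(right, i)
    comm.result()  # wait until the halos are up to date
    # Points adjacent to the subdomain boundary need the fresh halo values.
    new_left[-2] = stencil(left, len(left) - 2)
    new_right[1] = stencil(right, 1)
    return new_left, new_right


def step_reference(u):
    """Same stencil on the undivided domain; the end points stay fixed."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = stencil(u, i)
    return new


if __name__ == "__main__":
    u = [float(i * i % 7) for i in range(16)]
    left, right = split(u, 8)
    with ThreadPoolExecutor(max_workers=1) as pool:
        for _ in range(5):
            left, right = step_overlapped(left, right, pool)
    print(join(left, right))
```

Because the interior loops never read or write the halo cells, the transfer and the interior update can safely run at the same time. On a GPU the corresponding roles would typically be played by separate streams and asynchronous copies.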