9:00 A CUDA implementation of the Himeno benchmark on a cluster with GPUs – M. Fatica (Nvidia)
This talk will illustrate the use of CUDA to accelerate the Himeno benchmark on GPU clusters. The implementation, designed to make the most of the available memory bandwidth, achieves over 50 GFlops per GPU. The optimizations required to reach this level of performance, and the details of using MPI alongside CUDA, will be presented.