9:00 A CUDA implementation of the Himeno benchmark on a cluster with GPUs – M. Fatica (Nvidia)


This talk will illustrate the use of CUDA to accelerate the Himeno benchmark on clusters with GPUs. The implementation, designed to optimize memory bandwidth, achieves over 50 GFlops per GPU. The optimizations required to reach this level of performance, and the implementation details of using MPI alongside CUDA, will be presented.