27/04/2011: Applying software-managed caching and CPU/GPU task scheduling for accelerating dynamic workloads


April 27, 2011, 16:15-17:00, CAB F84, ETH Zurich

Mark Silberstein


 
In this talk we address two problems frequently encountered by GPU developers: optimizing memory access for kernels with complex input-dependent access patterns, and mapping the computations to a GPU or a CPU in composite applications with multiple dependent kernels. Both require dynamic adaptation and tuning of execution policies to allow high performance for a wide range of inputs.

We first describe our methodology for solving the memory optimization problem via software-managed caching, which efficiently exploits the fast scratchpad memory. This technique outperforms the cache-less and texture-memory-based approaches on pre-Fermi GPU architectures, as well as an approach that relies on the Fermi hardware cache alone.

We then present a static scheduling algorithm for minimizing the total running time of a complete application comprising multiple kernels with tree-structured data dependencies. Both a GPU and a CPU can be used to execute the kernels, but their performance varies greatly across inputs. The algorithm takes a graph-based approach to optimizing the running time by evaluating the performance of all the assignments jointly, including the communication overhead due to the data dependencies between the kernels. The algorithm can also be applied to minimizing energy consumption at the expense of higher runtimes, in which case it provides a provably optimal solution.
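To make the joint-assignment idea concrete, here is a minimal sketch in Python. It is not necessarily the algorithm presented in the talk. It assumes that costs are additive (kernels run one at a time, so the total is the sum of per-kernel execution costs plus a transfer cost on every tree edge whose endpoints run on different devices). The function name `best_assignment`, the cost tables, and the example numbers are all hypothetical.

```python
from typing import Dict, List, Tuple

DEVICES = ("cpu", "gpu")


def best_assignment(
    children: Dict[str, List[str]],
    exec_cost: Dict[str, Dict[str, float]],
    transfer_cost: Dict[str, float],
    root: str,
) -> Tuple[float, Dict[str, str]]:
    """Minimize an additive cost over a kernel tree (illustrative assumption).

    children[k]       kernels whose outputs feed kernel k
    exec_cost[k][d]   cost of running kernel k on device d
    transfer_cost[k]  cost of moving k's output to its parent's device,
                      paid only when k and its parent use different devices
    """
    # Visit every node before its descendants, then process in reverse
    # so that children are solved before their parent.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))

    best: Dict[str, Dict[str, float]] = {}        # best[k][d]: subtree cost if k runs on d
    choice: Dict[str, Dict[str, Dict[str, str]]] = {}  # device picked for each child
    for node in reversed(order):
        best[node], choice[node] = {}, {}
        for d in DEVICES:
            total, picks = exec_cost[node][d], {}
            for c in children.get(node, []):
                # Cost of the child's subtree on each device, plus the
                # transfer penalty if it differs from the parent's device.
                options = {
                    cd: best[c][cd] + (transfer_cost[c] if cd != d else 0.0)
                    for cd in DEVICES
                }
                picks[c] = min(options, key=options.get)
                total += options[picks[c]]
            best[node][d], choice[node][d] = total, picks

    # Recover the assignment by walking down from the best root device.
    root_dev = min(DEVICES, key=lambda d: best[root][d])
    assignment, stack2 = {}, [(root, root_dev)]
    while stack2:
        node, d = stack2.pop()
        assignment[node] = d
        stack2.extend(choice[node][d].items())
    return best[root][root_dev], assignment


if __name__ == "__main__":
    # Hypothetical three-kernel tree: "a" and "b" feed "c".
    tree = {"c": ["a", "b"]}
    costs = {
        "a": {"cpu": 4.0, "gpu": 1.0},
        "b": {"cpu": 1.0, "gpu": 6.0},
        "c": {"cpu": 5.0, "gpu": 2.0},
    }
    xfer = {"a": 1.0, "b": 2.0}
    print(best_assignment(tree, costs, xfer, "c"))
```

In this made-up example the cheapest plan runs "a" and "c" on the GPU and "b" on the CPU, accepting one transfer. Because the objective is a sum, a bottom-up pass over the tree is enough to find an exact minimum; this matches the flavor of an additive objective such as energy, whereas a model with overlapping CPU and GPU execution would need a different formulation.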

We demonstrate these techniques by applying them to a real application for computing the probability of evidence in probabilistic networks. The combination of memory optimization and dynamic assignment results in up to a three-fold runtime reduction over the non-optimized version on real inputs, and up to a five-fold reduction over a highly optimized parallel version running on Intel's latest dual quad-core 16-thread Nehalem machine.