Tutorial - GPU 1: Game of Life on GPU

Overview

In this tutorial, we will implement Conway's Game of Life, the algorithm which made cellular automata popular, on graphics processing units (GPUs). This tutorial runs on NVIDIA GPUs with CUDA but can also run on AMD GPUs using HIP; waLBerla fully supports both libraries. For a basic understanding of the GPU support in waLBerla, please read Introduction to GPU Programming with waLBerla first.

This tutorial is an extension of Tutorial - Basics 3: Writing a Simple Cellular Automaton in waLBerla to GPUs.

Creating Fields

To run a simulation on a graphics processing unit (GPU), we have to allocate data on the GPU and write a kernel that operates on this data. In this tutorial we first allocate a field on the GPU and learn about functionality to transfer data between CPU and GPU fields.

Since initialization and output routines are usually not time critical, they are implemented only for CPU fields. In waLBerla we set up the complete simulation using CPU fields, copy the initialized fields over to the GPU, do the complete computation there, and, at the end, copy everything back so that the output can be written from the CPU field. Thus, only the time-critical kernels have to be written for the GPU.

Thus the setup code of the GPU GameOfLife program is very similar to its CPU version, which was implemented in a previous tutorial ( Tutorial - Basics 3: Writing a Simple Cellular Automaton in waLBerla ). One difference is that fields which are frequently transferred from/to the GPU should be allocated with a different field allocator: gpu::HostFieldAllocator . This allocator uses gpuHostAlloc() instead of "new" , so that the memory is marked "pinned", which means it is always held in RAM and cannot be swapped out to disk. Data transfer from pinned memory is faster than from normal memory. Using this allocator is not mandatory; the data transfer functions also work (slightly slower) with normally allocated fields.

auto hostFieldAllocator = make_shared< gpu::HostFieldAllocator<real_t> >();
BlockDataID const cpuFieldID = field::addToStorage< ScalarField >( blocks, "CPU Field", real_c(0.0), field::fzyx, uint_c(1), hostFieldAllocator );
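
These snippets assume type aliases along the following lines for the CPU and GPU field types; the exact definitions are part of the application source and may differ slightly:

using ScalarField = field::GhostLayerField< real_t, 1 >; // CPU field with one ghost layer
using GPUField    = gpu::GPUField< real_t >;             // corresponding field in GPU memory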

Now we initialize the CPU field just like in the previous tutorial (Tutorial - Basics 3). Then two GPU fields are created: a "source" and a "destination" field. The helper function gpu::addGPUFieldToStorage() creates a gpu::GPUField of the same size and layout as the given CPU field:

BlockDataID gpuFieldSrcID = gpu::addGPUFieldToStorage<ScalarField>( blocks, cpuFieldID, "GPU Field Src" );
BlockDataID gpuFieldDstID = gpu::addGPUFieldToStorage<ScalarField>( blocks, cpuFieldID, "GPU Field Dst" );

The contents of the new GPU fields are initialized with the contents of the given CPU field.
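
If the CPU field is re-initialized later, its contents can also be copied to a GPU field explicitly. A call along the following lines performs the transfer on all blocks (a sketch assuming the gpu::fieldCpy helper of the gpu module):

// copy the CPU field into the GPU source field on every block
gpu::fieldCpy< GPUField, ScalarField >( blocks, gpuFieldSrcID, cpuFieldID );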

Writing and calling CUDA kernels

After reading this page you should know what a FieldAccessor is and how to call GPU kernels. So we can now start writing a kernel for the Game of Life algorithm. We place it in a separate file with the ".cu" extension (this is basically the only part that differs between CUDA and HIP). The build system then automatically detects that this file has to be compiled with the CUDA C++ compiler.

The kernel gets two field accessors as arguments, one for the source and one for the destination field. Both accessors have to be configured using the built-in CUDA/HIP variables blockIdx and threadIdx, so that afterwards the get() and getNeighbor() functions of the accessor class work correctly.

__global__ void gameOfLifeKernel( gpu::FieldAccessor<real_t> src, gpu::FieldAccessor<real_t> dst )
{
   src.set( blockIdx, threadIdx );
   dst.set( blockIdx, threadIdx );

   int liveNeighbors = 0;
   if ( src.getNeighbor(  1, 0, 0 ) > 0.5 ) ++liveNeighbors;
   if ( src.getNeighbor( -1, 0, 0 ) > 0.5 ) ++liveNeighbors;
   // normal Game of Life algorithm ...
   // ...
}
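
The elided part of the kernel checks the remaining six in-plane neighbors and applies the usual update rule. A possible completion looks like this (a sketch, assuming a two-dimensional domain stored in a single z-slice, which matches the D2Q9 stencil used for communication below):

// inside gameOfLifeKernel, after the first two neighbor checks:
if ( src.getNeighbor(  0,  1, 0 ) > 0.5 ) ++liveNeighbors;
if ( src.getNeighbor(  0, -1, 0 ) > 0.5 ) ++liveNeighbors;
if ( src.getNeighbor(  1,  1, 0 ) > 0.5 ) ++liveNeighbors;
if ( src.getNeighbor(  1, -1, 0 ) > 0.5 ) ++liveNeighbors;
if ( src.getNeighbor( -1,  1, 0 ) > 0.5 ) ++liveNeighbors;
if ( src.getNeighbor( -1, -1, 0 ) > 0.5 ) ++liveNeighbors;

// Game of Life rules: a dead cell with exactly three live neighbors becomes alive,
// a live cell with two or three live neighbors survives, every other cell dies
if ( liveNeighbors == 3 )
   dst.get() = real_t(1.0);
else if ( liveNeighbors == 2 && src.get() > 0.5 )
   dst.get() = real_t(1.0);
else
   dst.get() = real_t(0.0);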

To call this kernel we create a gpu::FieldIndexing object for each GPU field. From it, the gridDim and blockDim of the kernel launch can be obtained, as well as the accessor objects (via gpuAccess()) that carry the indexing and neighborhood information needed inside the GPU kernel. The kernel is then launched with the usual triple angle bracket syntax.

auto srcCudaField = block->getData< gpu::GPUField<real_t> >( gpuFieldSrcID_ );
auto dstCudaField = block->getData< gpu::GPUField<real_t> >( gpuFieldDstID_ );

auto srcIndexing = gpu::FieldIndexing<real_t>::xyz( *srcCudaField );
auto dstIndexing = gpu::FieldIndexing<real_t>::xyz( *dstCudaField );

auto srcAccess = srcIndexing.gpuAccess();
auto dstAccess = dstIndexing.gpuAccess();

const dim3 gridDim  = srcIndexing.gridDim();
const dim3 blockDim = srcIndexing.blockDim();

gameOfLifeKernel<<< gridDim, blockDim, 0, nullptr >>>( srcAccess, dstAccess );

srcCudaField->swapDataPointers( dstCudaField );

All the computations are done on the GPU. The CPU field is not updated automatically; it was only needed for the setup.
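
In the full program, the launch code above typically lives inside a sweep functor that stores the two BlockDataIDs; it is registered with the time loop together with the communication, as shown in the Communication section below. A sketch (the class and member names here are illustrative, not necessarily the tutorial's exact ones):

class GameOfLifeSweepGPU
{
 public:
   GameOfLifeSweepGPU( const BlockDataID & src, const BlockDataID & dst )
      : gpuFieldSrcID_( src ), gpuFieldDstID_( dst ) {}

   void operator()( IBlock * block )
   {
      // body as in the listing above: get the GPU fields from the block,
      // build the indexing objects, launch gameOfLifeKernel and swap the data pointers
   }

 private:
   BlockDataID gpuFieldSrcID_;
   BlockDataID gpuFieldDstID_;
};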

VTK Output

To see if our kernel works, we create a VTK writer. The VTK writer works on the CPU field, so it is used exactly as in other examples. However, since our data lives on the GPU, we need an addBeforeFunction that copies the data from the device back to the host. This is done using gpu::fieldCpyFunctor. Note that copying data is costly, so we usually do not want to do it in every time step. In this example it is done only every second time step.

const uint_t vtkWriteFrequency = 2;
if (vtkWriteFrequency > 0)
{
   auto vtkOutput = vtk::createVTKOutput_BlockData(*blocks, "vtk", vtkWriteFrequency);
   vtkOutput->addBeforeFunction(gpu::fieldCpyFunctor< ScalarField, GPUField >(blocks, cpuFieldID, gpuFieldDstID));

   auto dataWriter = make_shared< field::VTKWriter< ScalarField > >(cpuFieldID, "output");
   vtkOutput->addCellDataWriter(dataWriter);

   timeloop.addFuncBeforeTimeStep(vtk::writeFiles(vtkOutput), "VTK Output");
}

Communication

For this tutorial we use the gpu::communication::UniformGPUScheme, which first collects all data in a buffer and sends only one message per communication step and neighbor. As PackInfo we use the MemcpyPackInfo. It receives a buffer located on the GPU and fills it using memcpy operations. If MPI is built with GPU support (GPU-aware MPI), this buffer can be sent to other GPUs without copying it to the CPU first; otherwise the copying is done behind the scenes by the communication class.

using CommScheme = gpu::communication::UniformGPUScheme< stencil::D2Q9 >;
using Packing    = gpu::communication::MemcpyPackInfo< GPUField >;

const bool sendDirectlyFromGPU = false;
CommScheme commScheme( blocks, sendDirectlyFromGPU );
commScheme.addPackInfo( make_shared< Packing >( gpuFieldSrcID ) );
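
The scheme is then hooked into the time loop so that the ghost layers of the source field are exchanged before every sweep, for example as a before-function of the sweep (a sketch, using the scheme's startCommunication()/wait() pair and the illustrative GameOfLifeSweepGPU functor from above):

timeloop.add() << BeforeFunction( [&]() { commScheme.startCommunication(); commScheme.wait(); }, "Communication" )
               << Sweep( GameOfLifeSweepGPU( gpuFieldSrcID, gpuFieldDstID ), "GameOfLifeSweep" );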

Running the simulation

Before running the simulation, we would like to point out a few common pitfalls to avoid. Basically, it works very similarly to the CPU equivalent. Since all sweeps and function calls are registered with the timeloop, we can run the simulation using timeloop.run(). However, it is important to note that kernel calls are asynchronous. Thus, for time measurement purposes, we need to make sure that all kernels have finished before stopping the timer. This can be done with gpuDeviceSynchronize(). For good measure, we also call it right before starting the timer.

WcTimer simTimer;
WALBERLA_GPU_CHECK( gpuDeviceSynchronize() )
simTimer.start();
timeloop.run();
WALBERLA_GPU_CHECK( gpuDeviceSynchronize() )
simTimer.end();
WALBERLA_LOG_INFO_ON_ROOT("Simulation finished")
auto time = real_c(simTimer.last());
WALBERLA_LOG_RESULT_ON_ROOT("Game of life tutorial finished. Elapsed time " << time)