Okay, ready? Here's another post in the series about CUDA and GPU acceleration. (And yes, sadly, this is the last one, at least for the moment...)
We've tackled basic CUDA, multithreading, grids, and memory models. Now let's talk about streams!
This diagram shows the use of a GPU with a grid of blocks and threads:
Back in the grids discussion we showed we could run 120,000 blocks of 1,000 threads, vastly accelerating our processing with massive parallelism. So what can be done beyond that?
As you can see in this diagram, the whole grid is invoked as a single kernel with a single call. The Device goes off and does whatever it is asked to do - organized as blocks of threads - and then completes. But what if you want to run more than one kernel at the same time? This isn't really a way to get more parallelism out of the same task; it's more a way to do different things at the same time.
When a kernel is invoked, it is always as part of a stream. There is an implicit "stream zero", and everything we've done so far in our examples has used this stream. If we want to run different kernels at the same time, we can define more streams and launch each kernel into a different one. The operations in each stream are serialized with respect to that stream, but asynchronous with respect to other streams. (There are also ways to synchronize with a stream, as we shall see.)
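To make that concrete, here's a minimal sketch (not from this series' source) of launching two hypothetical kernels into two different streams:

```cuda
#include <cuda_runtime.h>

// Two stand-in kernels representing different tasks.
__global__ void kernelA() { /* work for task A */ }
__global__ void kernelB() { /* work for task B */ }

int main() {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // The fourth launch parameter selects the stream; omitting it, as in
    // all the earlier examples, uses the implicit stream zero.
    kernelA<<<256, 256, 0, s1>>>();
    kernelB<<<256, 256, 0, s2>>>();

    cudaDeviceSynchronize();   // wait for both streams to finish
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}
```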
Here's what using a GPU with multiple streams looks like:
In the Host thread, we make multiple kernel invocations into different streams. Each stream runs in parallel on the Device. The Host can synchronize with all streams or with any one stream. And as we'll see, it's also possible for a Device to signal the Host from within a stream.
Let's break up the processing of our example program still further, by providing for the possibility of multiple streams. Here's hello7.cu which implements this (as before, that's a link to the source code if you want to try this yourself):
The changes are highlighted above. First, at the top, highlighted in yellow, we've added yet another parameter to specify the number of streams, with a default of 1.
In the main() function, also highlighted in yellow, we've added a loop. For each stream we call cudaStreamCreate() which returns a handle of type cudaStream_t. These handles are stored in an array named cs. On the kernel invocation, we've added a new fourth parameter to specify the stream. We didn't do this before, and the default was "stream zero". Specifying a stream means all the processing done for that kernel - the whole grid of blocks of threads - all happens within that stream. Invoking another kernel into a different stream means they will run asynchronously relative to each other, sharing the Device resources.
As before, after all the kernels are invoked into all the streams, we call cudaDeviceSynchronize() to wait for them all to complete. There is also a function cudaStreamSynchronize() which synchronizes the Host to a given stream, but in this case we wanted to wait for all of them. And then finally we have to call cudaStreamDestroy() to clean up the resources for each stream.
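The full program is in the linked source, but the relevant lines of main() look roughly like this sketch, where nstreams stands in for the command-line parameter and domath()'s other arguments are elided:

```cuda
cudaStream_t *cs = new cudaStream_t[nstreams];
for (int i = 0; i < nstreams; i++)
    cudaStreamCreate(&cs[i]);          // one handle per stream

for (int i = 0; i < nstreams; i++)
    // Fourth launch parameter: the stream this grid belongs to.
    domath<<<blocks, threads, 0, cs[i]>>>(/* ..., */ i, nstreams);

cudaDeviceSynchronize();               // wait for all streams...
// cudaStreamSynchronize(cs[i]);       // ...or for just one of them

for (int i = 0; i < nstreams; i++)
    cudaStreamDestroy(cs[i]);          // release each stream's resources
delete[] cs;
```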
Okay, so that's great, we're running multiple streams, but now we have to divide up the work between the streams; otherwise they'll all just do the same thing. These changes are highlighted in green.
We've added two new parameters to the domath() global function, which is the main body of the kernel, to pass the stream index and the number of streams. These values are then used in the calculation of index and stride inside the domath() loop. With this logic, each stream processes only its own portion of the data.
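The exact arithmetic is in the linked hello7.cu, but one plausible sketch of the idea looks like this; the array parameters and the math inside the loop are stand-ins, while the last two parameters are the new ones described above:

```cuda
__global__ void domath(float *x, float *y, size_t n, int stream, int nstreams) {
    // Offset this grid by its stream index, and stride past the grids of
    // all the streams combined, so each stream covers only its own share.
    size_t index = ((size_t)stream * gridDim.x + blockIdx.x) * blockDim.x
                   + threadIdx.x;
    size_t stride = (size_t)blockDim.x * gridDim.x * nstreams;
    for (size_t i = index; i < n; i += stride)
        y[i] = x[i] * 2.0f;   // stand-in for the real computation
}
```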
One other change, not directly related to streams, is highlighted in turquoise at the top of main(). We added a call to cudaGetDeviceProperties(), so we could display some interesting information about our particular GPU, like the kind of Device, amount of Graphics Memory, number of stream processors, and number of cores per processor. This call can also be used to determine if the Device supports Host and Managed memory models.
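As a sketch, the query looks something like this. The fields shown are real cudaDeviceProp members; note the struct doesn't expose a cores-per-processor count directly, so the per-processor thread limit is printed here instead:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("Device: %s\n", prop.name);
    printf("Graphics Memory: %zu GB\n", prop.totalGlobalMem >> 30);
    printf("Stream processors: %d\n", prop.multiProcessorCount);
    printf("Max threads per processor: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Supports mapped Host memory: %d\n", prop.canMapHostMemory);
    printf("Supports Managed memory: %d\n", prop.managedMemory);
    return 0;
}
```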
Okay, well that wasn't too difficult. Let's try running hello7. What will happen?
The first of these runs used one stream, so this is exactly like when we ran hello6 before. The second run broke up the processing into 10 streams. You can see that the number of blocks in each stream is one-tenth of the number in the first run, so we had 10 streams of 55,458 blocks of 1,024 threads. This was faster, by 1.1s.
Once again we can see that the allocation and initialization of the arrays, and copying them back and forth to the Device, take far longer than the computation itself. Marvel at the fact that in the second run, the entire computation on three 5GB arrays took 0.26s. This is GPU acceleration in action!
Also note the information displayed about the device. This particular GPU is an NVIDIA GeForce RTX 3080; it has 16GB of Graphics Memory, 58 stream processors, and 1,536 cores per processor. No wonder it can process so many operations in parallel!
There's one other thing I wanted to share with you. CUDA has the ability to call Host functions inline in a Device stream. This is useful for a bunch of reasons, such as when you want to know that a kernel has completed processing some data, or needs more input.
The following sample program hello8.cu includes a simple example of this:
I haven't included the entire source listing, because the changes are only at the bottom. Highlighted in yellow, we've added a call to cudaLaunchHostFunc(). This call specifies a stream and a Host function to be called, in this case streamdone(). You can also pass a parameter which is in turn passed to the Host function. The streamdone() function simply writes a message, but of course you could do more, such as signaling a Host thread to take some action.
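As a sketch, the pattern looks something like this; streamdone() matches the name above, while the kernel, the stream count, and the message text are stand-ins:

```cuda
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

__global__ void work() { /* stand-in for the real kernel */ }

// Host function called from within a Device stream; the void* parameter
// is whatever was passed to cudaLaunchHostFunc().
void CUDART_CB streamdone(void *data) {
    printf("Stream %d done\n", (int)(intptr_t)data);
}

int main() {
    const int nstreams = 4;                 // assumed count for illustration
    cudaStream_t cs[nstreams];
    for (int i = 0; i < nstreams; i++) {
        cudaStreamCreate(&cs[i]);
        work<<<64, 256, 0, cs[i]>>>();
        // Enqueued after the kernel in the same stream, so it runs once
        // the kernel completes; the stream index is passed through.
        cudaLaunchHostFunc(cs[i], streamdone, (void *)(intptr_t)i);
    }
    cudaDeviceSynchronize();                // wait for everything
    for (int i = 0; i < nstreams; i++)
        cudaStreamDestroy(cs[i]);
    return 0;
}
```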
Now when we run hello8, it looks like this:
As planned, when each stream completed its kernel invocation, the next step in the stream was to call the Host function, which wrote out a message. In this case the streams completed in the order they were started, but that wouldn't necessarily be the case if the workloads were different.
So that's it! At least for now :) Hope you enjoyed this series, and please let me know if you have comments or questions, and especially corrections... good luck parallelizing!