<<< Comments?


comments on comments >>>

Hello CUDA: memory models - GPU series #5

Wednesday,  01/25/23  11:09 PM

Hi all, another post in the series about CUDA and GPU acceleration. (Next post here)

So far we've considered a simple example of doing some math on large arrays, migrating from C++ to CUDA, multithreading, and implementing grids of blocks of threads.  We made this operation a lot faster using the GPU to parallelize the processing.  In all this, we've sort of ignored memory.  This time we're going to dive into the details.

As this diagram shows, the Host has Main Memory, accessed by the CPU cores, and the Device has Graphics Memory, accessed by the GPU SP cores.  As we've seen, CUDA makes it pretty easy to ignore this; we allocate storage with cudaMallocHost() instead of malloc(), and yay it is magically shared between the Host and Device.

There are actually three ways to share information between the Host and the Device.  One way is as we've seen; use "Host pinned" memory, via the CUDA API:

Host memory is allocated in Main Memory on the Host, and simultaneously in Graphics Memory on the Device.  The CUDA driver takes care of synchronizing the data between the Main and Graphics memory.  When a page is accessed, if it is "dirty" on the other side (has been modified), a copy of the page is transferred over the Bus.  The Bus is fast but not that fast, so this can slow down processing.  For our particular test we initialized the arrays on the Host side, processed them on the Device side, and then accessed the results back on the Host; not too much back and forth.

A drawback of using Host memory is that all the memory is allocated on both the Host and the Device, and on the host the memory is "pinned", that is, it cannot be swapped out to disk.  Depending on the workload of the Host this is undesirable, especially if the Device processing is long running.

More recent computers and NVidia GPUs support another mechanism which is nicer: "managed" (aka "unified") memory:

In this model memory is allocated on both the Host and the Device, but it is the same memory.  The address space is configured so both the Host and Device point to the same place.  This is super clean because no copying across the bus is required.  (Of course, memory accesses to the memory do take place over the Bus...)

To make use of managed memory is easy, simply call cudaMallocManaged() to allocate it, and then call cudaFree() to free it.  Operationally everything is the same as using Host memory, and it will run [at least a bit] faster.  Also, it will use less memory, and the memory won't be "pinned".  The main drawback is that this is a newer approach that is not supported on every computer with every GPU.  So in practice you might have to test whether Unified memory is available, and if not use Host memory instead.

There is a third approach to Host / Device memory which is older, harder, and more complicated.  So yay.  But it does have some advantages:

In this model, you explicitly allocate memory on the Host (with malloc()), and separately, explicitly allocate memory on the Device (with cudaMalloc()).  The memory allocated in this way is not at the same address and not cross-addressable.  Explicit data transfers using cudaMemcpy() must be performed to copy data from Host to Device, and vice-versa.  So there is more bookkeeping and more complexity.  But.

The advantages of using separate memory are speed - both the Host and Device are allocating local memory - and all transfers are explicit.  And a big advantage is the Main memory is only needed while the Host is using it, and the Graphics memory is only needed while the Device is using it.  So for long-running processes less memory is used overall, freeing more memory for use by other processes.

Let's take our example program from last time and modify it further to support Host or Managed memory, here's hello5.cu:

As before the changes are highlighted.  There's a new global enum gpumem to hold which kind of memory we want to use, and some logic to take this in as a fourth parameter: "h" for Host memory, and "m" for Managed memory.

The storage allocation function (getarray()) and deallocation function (freearray()) have been changed to test the gpumem enum and make different CUDA calls correspondingly: cudaMallocHost() and cudaFreeHost() for Host memory, and cudaMallocManaged() and cudaFree() for Managed memory.

Now when we compile and run hello5, we can compare:

For this comparison, I increased the scale of the test to 567,891,234, which is 4.3BG per array.  (The test uses three arrays, so that's a total of 13GB.)  You can see the Host allocation took 1.3s second longer - 6.2s vs 4.9s - and the processing was slightly longer too - 1.4s vs .9s.  And once again we see that the allocation takes longer than the processing. 

From this we conclude that yes, Managed memory is better when you can use it.  (When the computer and GPU architecture implement Unified storage.)  It's especially better if the Host has other things to do because the memory is swappable.

So how about the Separate memory model?  Okay ... here's hello6.cu which implements that possibility:

Yep, lots of changes, as highlighted.  Using Separate memory is more complicated.  Let's go through all of them. 

First, at the top, in yellow, we've expanded the gpumem enum for Separate memory, and the corresponding parsing of the fourth parameter to support "s" for separate.  That's the easy part.

We've also expanded the logic of gethostarray()  to use a simple malloc() when allocating on the host, and similarly expanded freehostarray() to use a simple free().  Note these functions have been renamed; before we called them getarray() and freearray() becauase the memory was on both Host and Device, now they are explicitly allocating memory on the Host only.

The green highlighting shows new code for handling the arrays on the device side.  New functions getdevarray() and freedevarray() allocate and deallocate memory on the Device.  And the new functions copyhosttodev() and copydevtohost() use the CUDA runtime function cudaMemcpy to copy data between the Host and Device.  There's also a small change in the main() function to pass the Device array pointers on the kernel invocation - dA, dB, and dC.  Note that with Separate memory the Host and Device memory will be at different locations in their respective address spaces.

The biggest changes are in the two new functions highlighted in turquoise - setupsourcearrays() and setupdestarrays().  They each implement a different logic for Separate memory than for Host or Managed memory.  This example is complicated because we are doing both; in any one program, you might decide to use one specific model and then you only have to have that logic.

Here's the high-level logic for both cases:

Host or Managed Memory
- Allocate source arrays on Host
- Initialize source arrays
- Allocate destination array on Host (which is also on Device)
- Invoke kernel
- Deallocate source arrays on Host (which are also on device)
- (process destination array)
- free destination array on Host

Separate Memory
- Allocate source arrays on Host
- Initialize source arrays
- Allocate source arrays on Device
- Copy source arrays to Device
- Free source arrays on Host
- Invoke kernel
- Allocate destination array on Host
- Copy destination array to Host
- Free all device arrays
- (process destination array)
- free destination array on Host

The parts which are common are in red, so you can see the similarity. The parts which are different are blue, as implemented in the functions highlighted in turquoise above.  Yep it's more involved, for sure.  But as a reminder, there are benefits: local memory on both Host and Device, and not trying up memory on both sides when unnecessary.  To see the practical effect of all this complexity, let's run hello6:

Okay!  The Host case - first run - is a bit worse than the Managed case - second run - but not by much.  We saw this above with hello5.  The Separate case - third run - is however quite a  bit faster.  We have more steps because the arrays have to be copied back and forth, and you can see this copying took longer than the actual computation.  But the memory allocations were faster, the memory initialization was faster, and the actual computation was faster, resulting in an overall net which was over a second in savings.

There would be a further benefit if the Host was doing other things while the Device was doing the computation.  During the computation itself no Host memory is allocated, all the resources used are only on the Device.

The conclusion is - it depends.  If you want simple and want to run anywhere, use the Host memory model.  If you know you can use Unified memory, then use Managed memory model.  And if you don't mind the complexity and want top performance, use the Separate memory model.  As a final thought about this, you probably aren't coding CUDA directly unless you don't mind the complexity and want top performance - otherwise you'd probably just be using a library!

So that's it, right?  No there's more ... please stay tuned for the next installment when we talk about ... streams!