Hello CUDA - GPU series #1

Saturday,  01/21/23  09:28 AM

Hi all, this is the first of a series of posts about CUDA and GPU acceleration.  (Next post here)

For some time I've been aware of GPU acceleration, and NVidia, and CUDA, but it was a bit of a black box.  Recently I've been working on a cool project which has enabled me to double-click on this to understand what's inside the box.

Maybe it would be good to start with an introduction: what is a GPU, why GPU acceleration, who are NVidia, and what is CUDA.



What is a GPU

GPU is an acronym for Graphics Processing Unit.  The diagram at right shows the overall architecture of a modern workstation (aka PC). 

There's a CPU, with several "cores" (maybe 10 or so), and a GPU, with many many cores (maybe 100 or so).  The cores on a GPU are often called Stream Processors, or SPs, for reasons that will be apparent a bit later.

In the parlance of GPUs, the CPU is referred to as the Host, and the GPU is called the Device*.

* A number of terms in the GPU world are "overloaded"; they mean one thing in general, but a specific different thing in this world.  I'll try to call out the specific uses of these terms as we go along.

In addition to the CPU cores the Host has Main Memory (maybe 16GB or so).  This memory is somewhat more complicated than a simple box, but for now we'll treat it as a big blob of storage for data.  The Device also has its own Graphics Memory (maybe 16GB or so, maybe more).  Again, it's more involved than a box, but to start we'll treat it as such.  The Device also has a video interface for connecting one or more monitors.  This was the original reason for the existence of GPUs, but as we'll see more recently they've been used for other purposes.

The CPU and GPU (or as we shall say, Host and Device) communicate over a Bus.  The Bus is fast (currently tens of GB/s; PCIe 4.0 x16 is about 32GB/s), but not nearly as fast as Main Memory (very roughly 50GB/s) or Graphics Memory (which is faster still, in the hundreds of GB/s).
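To make the Host/Device memory split concrete, here's a minimal sketch of my own (not from any diagram above) using the CUDA runtime API: one buffer is allocated in Main Memory, another in Graphics Memory via cudaMalloc, and cudaMemcpy moves the data across the Bus in each direction.

    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main() {
        const int n = 1 << 20;                      // 1M floats, about 4MB
        size_t bytes = n * sizeof(float);

        float *host_buf = (float *)malloc(bytes);   // lives in Main Memory (Host)
        for (int i = 0; i < n; i++) host_buf[i] = 1.0f;

        float *dev_buf = NULL;
        cudaMalloc(&dev_buf, bytes);                // lives in Graphics Memory (Device)

        // Each cudaMemcpy is a trip across the Bus
        cudaMemcpy(dev_buf, host_buf, bytes, cudaMemcpyHostToDevice);
        // ... normally a kernel would run on the Device here ...
        cudaMemcpy(host_buf, dev_buf, bytes, cudaMemcpyDeviceToHost);

        cudaFree(dev_buf);
        free(host_buf);
        return 0;
    }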


Evolution of GPUs

The history goes back to the earliest days of workstations (PCs).

GPUs began as simple graphics adapters.  CPUs had one or a small number of Cores and Main Memory, and Graphics Adapters simply added a Video Interface to display things on a monitor.  The Video Interface accessed the Main Memory over the Bus.


 

Dedicated Graphics Memory was added to Graphics Adapters to offload the CPU.  This enabled much faster display of graphics, and also left the CPU and Main Memory free to perform tasks in parallel with displaying information on a Monitor.


 

GPUs added processors for simple operations, like rendering textures, and implementing "sprites" (small regions of images which moved against a background).  These processors were simple and slow compared to CPUs, but they sped up the graphics experience, and further offloaded CPUs.


 

GPUs added more processors, enabling more complex video operations to be performed on the GPU instead of the CPU.  Unlike CPUs, the emphasis was on performing multiple simple operations in parallel.


 

At the same time, CPUs began adding more cores too, enabling more parallel processing. 

This was partly a reflection of the more complex workloads being performed by workstations - many programs running at the same time - and partly due to computer architecture - it became easier to add more cores than to make individual cores faster.


 

As the complexity of CPU workloads continued to increase, and as GPUs became more capable, GPUs began to be used for non-video processing.  They were good for offloading compute tasks which were highly parallel.

Graphics Adapter companies like NVidia began making software toolkits like CUDA to facilitate this use of GPUs.


 

As the usage of GPUs for non-video tasks increased, GPUs added even more processors.

Current GPUs have 100s of processors which can concurrently process 1,000s of threads, enabling highly parallel compute tasks to be accelerated far beyond what is possible on CPUs.


Why GPUs for computation

GPUs optimized for highly parallel computing were perfect for applications like image and video processing, and AI/ML computation.

This diagram illustrates CPU processing without a GPU.  The green arrows represent units of work, with time going from left to right.  Multiple CPU cores can process tasks concurrently, up to about 20 in parallel.  The CPU threads share Main Memory.


 

This diagram shows CPU processing with a GPU.  The blue lines show the CPU spawning multiple GPU threads - the yellow arrows - which run in parallel to the CPU threads.  Hundreds, thousands, even tens of thousands of parallel threads can run this way.  The GPU threads share Graphics Memory, separate from the Main Memory used by the CPU.

The process of starting GPU processing is called "launching a kernel" (you'll also see "invoking a kernel"), where "kernel" refers to the software program running on the GPU (/Device).
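To give a feel for what that looks like in code (a sketch of my own; the kernel name add_one and the sizes are just for illustration), the Host calls a __global__ function using the <<<blocks, threads>>> launch syntax, and each of the resulting GPU threads handles one element of the data:

    #include <cuda_runtime.h>

    // Device code ("kernel"): each GPU thread increments one element
    __global__ void add_one(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));
        cudaMemset(d_data, 0, n * sizeof(float));

        // Host code: launch the kernel - 4096 blocks of 256 threads each,
        // i.e. about a million GPU threads
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        add_one<<<blocks, threads>>>(d_data, n);
        cudaDeviceSynchronize();    // wait for the Device to finish

        cudaFree(d_data);
        return 0;
    }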


NVidia and CUDA and OpenCL

NVidia was founded in 1993, making Graphics Adapters and chips.  In 1999 they shipped their first "GPU", coining the term, and heralding the evolution to come.  Around 2003 the Brook project at Stanford (led by Ian Buck, who later joined NVidia) provided a way for applications to use GPUs for parallel computing.  Then in 2006 NVidia announced CUDA (first released in 2007), a whole software environment for developing applications which used GPUs.  The initial application was of course gaming, which continues to be an important use case today, but it enabled many other uses as well.  Including, importantly, acceleration of the execution of neural networks, which have revolutionized AI and ML applications.

In 2008 Apple proposed OpenCL to the Khronos Group, working with other GPU manufacturers like AMD, Intel, and NVidia on an "open" approach to GPU computing closely patterned on CUDA (which was and remains NVidia-only).  The first OpenCL implementations shipped in 2009.

Today you can write applications with CUDA for NVidia GPUs [only], or you can write applications with OpenCL which will run on virtually any GPU [including NVidia].  But CUDA is optimized for NVidia, and NVidia remains the leading GPU vendor.  If you've developed an application in CUDA it isn't too difficult to migrate to OpenCL, because of the architectural similarity.

What is CUDA and how do you use it?

CUDA is an environment with four main pieces:

  • C/C++ compiler/preprocessor (nvcc, which drives a host C++ compiler such as gcc for Host code, plus NVidia's own compiler for Device code)
  • Link libraries for many environments (Windows, Linux, MacOS, and NVidia's Tegra-based embedded platforms)
  • Runtime support in GPU drivers
  • Runtime support on GPU device

CUDA programs are written in C/C++, and are compiled/preprocessed with NVidia's compiler driver nvcc.  There are slight extensions to the language, and source files are named .cu instead of .c or .cpp.  A CUDA program contains some logic which runs on the CPU/Host, and some which runs on the GPU/Device.  The Host code is preprocessed by nvcc and then passed through to a host compiler - gcc (the GNU C++ compiler) on Linux, or MSVC on Windows - for code generation.  The Device code is compiled into an NVidia-specific intermediate representation (a kind of portable assembly language) called PTX.

PTX is stored as data inside the generated code.  At execution time the PTX is handed to the NVidia driver, which JIT-compiles it into native machine code for whichever Device is actually present and loads it onto the Device.  CUDA programs are linked with CUDA libraries, which implement the CUDA API.  A single EXE results, which contains the Host code as machine instructions and the Device code as PTX (optionally alongside pre-built native code for specific GPU architectures).  This architecture is simple - only one EXE file contains both the Host and Device logic - and enables support for a wide variety of devices, since the Device code is translated for the actual GPU at execution time.
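As a rough illustration of that flow (my own sketch; the file and kernel names are invented), a single hello.cu holds both kinds of code, and one nvcc invocation produces one executable; you can also ask nvcc to emit the intermediate PTX so you can see what gets embedded:

    // hello.cu - one source file, two compilation targets
    #include <cstdio>

    __global__ void hello_kernel() {        // Device code: compiled by nvcc to PTX
        printf("hello from GPU thread %d\n", threadIdx.x);
    }

    int main() {                            // Host code: handed to the host C++ compiler
        hello_kernel<<<1, 4>>>();           // launch 1 block of 4 threads
        cudaDeviceSynchronize();            // wait for the Device-side printf to flush
        return 0;
    }

    // Typical build steps (on Linux):
    //   nvcc -o hello hello.cu     -> one EXE with Host machine code + embedded Device code
    //   nvcc -ptx hello.cu         -> writes hello.ptx, the PTX for the Device code
    //   ./hello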

NVidia provides great integration of CUDA programs and nvcc with many development tools, including importantly Visual Studio and Xcode.

It's worth mentioning that there are many libraries available which use CUDA "under the covers" for programs written in other languages, including importantly Python (CuPy, Numba, PyCUDA, PyTorch, MXNet, and TensorFlow), C# (e.g. ManagedCuda), and Java (e.g. JCuda).  For many applications it isn't necessary to delve this deeply into CUDA programming; you can just pick the right library, use it, and happily take advantage of GPU acceleration.

 

But let's say you want to write CUDA directly.  How does that work?  Glad you asked ... stay tuned for the next installment of this series, where we'll work through some real-world examples.

 

Comments?