Archive: January 26, 2023



comments on comments

Thursday,  01/26/23  07:42 AM

Thanks for your comments on the "comments?" survey.  Please vote if you haven't already. 

I've dug into this a bit, and while it's possible to use Twitter as comments for a blog, it's not straightforward.  Wish it were, and wish Twitter cared enough to make this easy - it would be a great resource for news websites as well as blogs.  (Imagine if every Wikipedia page had one! :)

The perfect experience is: links to your blog posts appear on Twitter.  Each of these Tweets has a Reply button.  Each blog post has a Reply button, which does exactly what the Reply button on Twitter does.  Each of these Tweets has a link to display the Tweet and all its Replies.  Each blog post has a Comments button, which links back to display the Tweet and all of its Replies.

It appears that to do the parts in bold above, I have to dive into the Twitter API, with OAuth and application IDs and all the rest of it.  So be it.  I have applied to be an authorized Twitter developer.  Stay tuned...

[Update: yep, trying Mastodon...]

 

Hello CUDA: streams - GPU series #6

Thursday,  01/26/23  08:15 PM

Okay, ready?  Here's another post in the series about CUDA and GPU acceleration.  (And yes, sadly, this is the last one, at least for the moment...)

We've tackled basic CUDA, multithreading, grids, and memory models.  Now let's talk about streams!


This diagram shows a GPU running a grid of blocks and threads:

Back in the grids discussion we showed we could run 120,000 blocks of 1,000 threads, vastly accelerating our processing with massive parallelism.  So what can be done beyond that?

As you can see in this diagram, the whole grid is invoked as a single kernel with a single call.  The Device goes off and does whatever it is asked to do - organized as blocks of threads - and then completes.  But what if you want to run more than one kernel at the same time?  This isn't really a way to get more parallelism for the same task, it's more a way to do different things at the same time.

When a kernel is invoked, it is always as part of a stream.  There is an implicit "stream zero", and all the things we've done so far in our examples have used this stream.  If we want to run different kernels at the same time, we can define more streams, and put each kernel in a different stream.  The operations in each stream are serialized with respect to that stream, but asynchronous with respect to other streams.  (There are ways to synchronize to a stream, as we shall see.)
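
To make that concrete, here's a minimal sketch (mine, not from the original post) of invoking two kernels into two different streams; kernelA, kernelB, and the launch dimensions are placeholders:

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // each kernel is serialized within its own stream, but the two
    // streams run asynchronously with respect to each other
    kernelA<<<blocks, threads, 0, s1>>>(dataA);
    kernelB<<<blocks, threads, 0, s2>>>(dataB);

    cudaDeviceSynchronize();   // wait for both streams to finish
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);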

Here's what using a GPU with multiple streams looks like:

In the Host thread, we make multiple kernel invocations into different streams.  Each stream runs in parallel on the Device.  The Host can synchronize with all streams or with any one stream.  And as we'll see, it's also possible for a Device to signal the Host from within a stream.

Let's break up the processing of our example program still further, by providing for the possibility of multiple streams.  Here's hello7.cu which implements this (as before, that's a link to the source code if you want to try this yourself):
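
The listing itself doesn't reproduce here, so below is a rough sketch of what the main() changes might look like, with CHANGE comments standing in for the highlighting.  The names domath and cs come from the post; the array setup, sizes, and argument parsing are assumptions, and the kernel itself is sketched a bit further below.

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    __global__ void domath(float *a, float *b, float *c, int n, int si, int ns);  // sketched below

    int main(int argc, char *argv[])
    {
        const int N = 1 << 24;                     // element count (assumed; the real arrays were 5GB)
        const int nthreads = 1024;
        int ns = (argc > 1) ? atoi(argv[1]) : 1;   // CHANGE: number of streams, default 1

        float *a, *b, *c;
        cudaMallocManaged(&a, N * sizeof(float));  // Managed memory, as in the earlier posts
        cudaMallocManaged(&b, N * sizeof(float));
        cudaMallocManaged(&c, N * sizeof(float));
        for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

        int nblocks = (N / ns + nthreads - 1) / nthreads;   // blocks per stream

        // CHANGE: create a handle for each stream
        cudaStream_t cs[64];
        for (int si = 0; si < ns; si++)
            cudaStreamCreate(&cs[si]);

        // CHANGE: invoke the kernel once per stream; the new fourth launch
        // parameter is the stream (the third is shared-memory size, unused)
        for (int si = 0; si < ns; si++)
            domath<<<nblocks, nthreads, 0, cs[si]>>>(a, b, c, N, si, ns);

        cudaDeviceSynchronize();                   // wait for all streams to complete

        // CHANGE: clean up the per-stream resources
        for (int si = 0; si < ns; si++)
            cudaStreamDestroy(cs[si]);

        printf("c[0] = %f\n", c[0]);
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }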

The changes are highlighted above.  First, at the top, highlighted in yellow, we've added yet another parameter to specify the number of streams, with a default of 1.

In the main() function, also highlighted in yellow, we've added a loop.  For each stream we call cudaStreamCreate() which returns a handle of type cudaStream_t.  These handles are stored in an array named cs.  On the kernel invocation, we've added a new fourth parameter to specify the stream.  We didn't do this before, and the default was "stream zero".  Specifying a stream means all the processing done for that kernel - the whole grid of blocks of threads - all happens within that stream.  Invoking another kernel into a different stream means they will run asynchronously relative to each other, sharing the Device resources.

As before, after all the kernels are invoked into all the streams, we call cudaDeviceSynchronize() to wait for them all to complete.  There is also a function cudaStreamSynchronize(), which synchronizes the Host to a given stream, but in this case we wanted to wait for all of them.  And then finally we have to call cudaStreamDestroy() to clean up the resources for each stream.
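
For reference, synchronizing to the individual streams instead might look like this (a sketch, not in the post):

    // wait for one stream at a time, instead of the whole Device
    for (int si = 0; si < ns; si++)
        cudaStreamSynchronize(cs[si]);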

Okay so that's great, we're running multiple streams, but now we have to divide up the work between the streams, otherwise they'll all just do the same thing.  These changes are highlighted in green.

We've added two new parameters to the domath() global function which is the main line of the kernel, to pass the stream index, and the number of streams.  These values are then used in the calculation of index and stride inside the domath() loop.  With this logic each stream processes only its portion of the blocks.
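
Here's a sketch of what the domath() changes might look like.  The post doesn't reproduce the computation here, so the body below assumes the simple elementwise math of the earlier posts, and the work-splitting scheme (a contiguous slice per stream) is a guess at the logic described:

    __global__ void domath(float *a, float *b, float *c, int n, int si, int ns)
    {
        // CHANGE: si is this kernel's stream index, ns the number of streams;
        // each stream processes only its own contiguous slice of the arrays
        int per = n / ns;                  // elements per stream (assumes ns divides n)
        int base = si * per;               // this stream's starting offset

        // grid-stride loop, as before, but offset into this stream's slice
        int index = base + blockIdx.x * blockDim.x + threadIdx.x;
        int stride = blockDim.x * gridDim.x;
        for (int i = index; i < base + per; i += stride)
            c[i] = a[i] * b[i];            // placeholder for the real math
    }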

One other change, not directly related to streams, is highlighted in turquoise at the top of main().  We added a call to cudaGetDeviceProperties(), so we could display some interesting information about our particular GPU, like the kind of Device, amount of Graphics Memory, number of stream processors, and number of cores per processor.  This call can also be used to determine if the Device supports Host and Managed memory models.
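
A sketch of what that call might look like; these are real cudaDeviceProp fields, though note the struct doesn't report cores per processor directly, so that figure has to be derived separately:

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);      // query device 0

    printf("Device:             %s\n", prop.name);
    printf("Graphics Memory:    %zu GB\n", prop.totalGlobalMem >> 30);
    printf("Stream processors:  %d\n", prop.multiProcessorCount);
    printf("Max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Managed memory:     %s\n", prop.managedMemory ? "yes" : "no");
    printf("Mapped Host memory: %s\n", prop.canMapHostMemory ? "yes" : "no");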

Okay, well that wasn't too difficult.  Let's try running hello7 - what will happen?

The first of these runs used one stream, so this is exactly like when we ran hello6 before.  The second run broke up the processing into 10 streams.  You can see, the number of blocks in each stream is one-tenth of the number of blocks in the first run, so we had 10 streams of 55,458 blocks of 1,024 threads.  This was faster, by 1.1s.

Once again we can see, the allocation and initialization of the arrays, and copying them back and forth to the Device takes far longer than the computation itself.  Marvel at the fact that in the second run, the entire computation on three 5GB arrays took 0.26s.  This is GPU acceleration in action!

Also note the information displayed about the device.  This particular GPU is an NVIDIA GeForce RTX 3080; it has 16GB of Graphics Memory, 58 stream processors, and 1,536 cores per processor.  No wonder it can process so many operations in parallel!


There's one other thing I wanted to share with you.  CUDA has the ability to call Host functions inline in a Device stream.  This is useful for a bunch of reasons, such as if you want to know when a kernel has completed processing some data, or when it needs more input.

The following sample program hello8.cu includes a simple example of this:
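
This listing doesn't reproduce here either, so here's a minimal sketch of the addition; streamdone is the post's name for the Host function, and the message text is assumed:

    // Host function to be called from within a stream; CUDART_CB is the
    // required calling convention, and the void* is the user parameter
    void CUDART_CB streamdone(void *data)
    {
        printf("stream %d done\n", (int)(size_t)data);
    }

    // ... then in main(), after each kernel launch:
    for (int si = 0; si < ns; si++) {
        domath<<<nblocks, nthreads, 0, cs[si]>>>(a, b, c, N, si, ns);

        // enqueue a Host callback behind the kernel in this stream; it runs
        // once everything ahead of it in the stream has completed
        cudaLaunchHostFunc(cs[si], streamdone, (void *)(size_t)si);
    }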

I haven't included the entire source listing, because the changes are only at the bottom.  Highlighted in yellow, we've added a call to cudaLaunchHostFunc().  This call specifies a stream, and a Host function to be called, in this case streamdone().  You can also pass a parameter which is in turn passed to the Host function.  The streamdone() function simply writes a message, but of course you could do more, such as signaling a Host thread to take some action.

Now when we run hello8, it looks like this:

As planned, when each stream completed its kernel invocation, the next step in the stream was to call the Host function, which wrote out a message.  In this case the streams completed in the order they were started, but that wouldn't necessarily be the case if the workloads were different.

So that's it!  At least for now :)  Hope you enjoyed this series, and please let me know if you have comments or questions, and especially corrections...  good luck parallelizing!

 

Thursday,  01/26/23  09:01 PM

This picture is called "straight lines".  Whoa.  Almost feel the pull of local gravity bending the light...

Slashdot: what happens when an AI generates designs for PC cases?  This!  Excellent... 

I wonder about the design constraints ... would these be buildable, and would they work (cooling, power, etc.)?

MarkTechPost: Top innovative AI powered healthcare companies.  Such an explosion of these, some are doing great work and creating value, others will be gone with nary a mark left.  Investor appetite for these is high. 

For example: MIT researchers develop an AI model that can detect future lung cancer risk.  Early detection translates into better outcomes for patients. 


This makes so much sense: Smart TVs as digital health devices.  All you need is a camera+microphone.  The pointing device on modern remotes is more than good enough for input. 

And AI in medical imaging market projected to reach $20.9B by 2030.

I think I mentioned this already: Microsoft expands 'multibillion dollar' deal with OpenAI, creators of ChatGPT.  $10B.  They also just laid off 10,000 people.  That $10B might have paid the salaries of those 10,000 people for 10 years.  But this is clearly strategic, and some of the things Microsoft decided not to keep doing - like AR/VR - are not. 

Fox: California reels from string of mass shootings despite having some of the strictest gun laws in US.  So, we need even more laws, right? 

Jeremy Clarkson: We're in the midst of a coup. Who the hell’s behind it?  Huh. 


Brad Feld reviews Please report your bug here.  "I hope there are a lot more books like this. It balances startup stuff with the cynicism of the experience while placing it in a fictional world. It unexpectedly merges with believable near-term science fiction, which has a delicious parallel universe theme."  Sounds excellent, on the list! 


Dilbert gives relationship advice.  Hmmm... 

Interesting, from the Financial Times: We tried to run a social media site and it was awful.  They put up a Mastodon server and it was a lot of work.  And there were also the legal, security, and reputational risks.  

Yet another reason, if any were needed, why I'd rather not host comments here.

Eric Sink continues his investigation into "native AOT" for .NET.  This time he considers storage allocation and objects.  Once again we see how much easier it is to have managed code and garbage collection :) 

Meanwhile, Tim Bray releases Quamina, an open-source pattern-matching library.  Glad to see he's keeping busy :) 

From the pages of history: Gorbachev and Louis Vuitton.  "Can there be any doubt who won the cold war?"
