Refcard #372

Getting Started With Cross-Platform Heterogeneous Computing

Heterogeneous computing is here to stay, and it is becoming more diverse and increasingly important for all programmers to understand. Fortunately, the key concepts are easy to grasp. In this Refcard, we will walk through a thorough introduction to cross-platform heterogeneous computing, its benefits, and how to get started setting up your environment.


Brought to You By

Intel

Written By

James Reinders
Engineer, Intel
Section 1

Introduction

Heterogeneous computing is here to stay, and it is becoming more diverse and increasingly important for all programmers to understand.

Fortunately, the key concepts are easy to grasp, and we will walk through an introduction to accessing the benefits of heterogeneous computing.

Diagram: Heterogeneous computing has become the norm because specialized devices can offer better performance per watt, but it is more challenging to use because the benefits are more tailored and less general.

 “The next decade will see a Cambrian explosion of novel computer architectures,” conclude industry legends John Hennessy and David Patterson as they describe how we are entering “A New Golden Age for Computer Architecture.” While this trend has been emerging for a decade or more, we are only now at the dawn of this new era in computing.

Section 2

What Is Heterogeneous Computing?

Computers are made up of processing devices. Historically, computers have most often been based on a single processing device called a central processing unit (CPU). While there have always been specialty devices in a computer, they were not used for general computation. Graphics processing units (GPUs) spent decades improving the graphical output of a system, but not its general computation. GPUs were specialized for the mathematics needed to convert images described by vectors into a series of pixels on our screens. That specialization (rasterization) allowed GPUs to be more power-efficient, but so specialized that they are not appropriate for tasks such as running operating systems or implementing a web browser or text editor.

However, the mathematics of rasterization resembles other scientific and engineering problems closely enough that developers found ways to tap into GPUs. This allows GPUs to accelerate computations that have traditionally been done on the CPU, often with better power efficiency. Power efficiency is increasingly a concern for everyone, but most of all when running on batteries (e.g., phones, laptops), in data centers (“the cloud”), or in supercomputer installations.
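The kind of computation in question is easiest to see concretely. Element-wise vector addition, which we will offload to an accelerator later in this Refcard, looks like this as an ordinary serial loop (a plain-C++ sketch, no SYCL involved):

```cpp
#include <cstddef>
#include <vector>

// Serial vector addition: every iteration is independent of the
// others, which is exactly the shape of work that GPUs and other
// accelerators can execute with high efficiency per watt.
std::vector<float> vector_add(const std::vector<float>& a,
                              const std::vector<float>& b) {
    std::vector<float> r(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        r[i] = a[i] + b[i];
    }
    return r;
}
```

A CPU runs these iterations one after another (or a few at a time with vector instructions); an accelerator can map each iteration to its own work-item.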

Today, the number of devices competing for our attention starts with multiple types of GPUs. But it does not end there. We will continue to see numerous innovations in hardware that hope to provide computing power in new ways. Fortunately, there are open, multivendor, multiarchitecture approaches to help us write our code to benefit from all this innovation.

Section 3

Getting Started With a Real C++ Program

A heterogeneous program needs to do two things: (1) locate an accelerator to use and (2) offload work and data to the accelerator. We will program in C++ with SYCL to do both, and we’ll use a SYCL-enabled C++ compiler to compile our application. It is important to note that heterogeneous programming is often best done using libraries designed to help. For our first hands-on experience, we will program directly with SYCL in order to explicitly experience the two critical pieces of heterogeneous computing: locating devices and using devices.

We choose SYCL because it is open, multivendor, and multiarchitecture. Two other open choices are OpenMP and OpenCL. OpenMP is best suited for Fortran and C. SYCL is more advanced than OpenCL; SYCL grew out of the OpenCL effort with the express purpose of embracing C++ strongly. CUDA is similar to OpenCL and SYCL but is designed explicitly for NVIDIA GPUs, and NVIDIA CUDA tools are licensed exclusively for certain NVIDIA GPUs.

First, we’ll discuss options for where to compile and run our program. Second, we’ll discuss the SYCL program itself and look at the results of running it.

DevCloud Option 

While obtaining and setting up a heterogeneous machine is an option (discussed next), we can simply get an account in the cloud, which has both the software and hardware we need ready to use. Intel’s DevCloud is a free place to develop, test, and run workloads on a cluster of the latest Intel hardware and software.

Signing up is easy, and there is a step-by-step video from Intel to help as well. Once there, the default SYCL compiler is DPC++. DPC++ is the Intel-initiated LLVM compiler project to implement SYCL. If you are “in the know,” it is easy to try two other SYCL compilers on DevCloud as well. 

For more information, check out the “Try Multiple SYCL Compilers” blog. The DevCloud is a resource from Intel and is simply one option for getting access to hardware and software quickly to try our sample program.

On Your Own Machine Option 

Almost any Linux or Windows system can compile and run a SYCL program. You’ll need to obtain a SYCL-enabled C++ compiler, and you’ll need to make sure you have device drivers installed to access the hardware (drivers that support OpenCL or Level Zero are the most common).

SYCL compilers to consider are DPC++, ComputeCpp, and hipSYCL. Each project’s website does a good job explaining how to download, build, install, and get started with its SYCL compiler.

A key tip is to make sure you have current device drivers installed that support OpenCL or Level Zero access to the hardware. You don’t have to install an accelerator in your system; drivers will use the CPU when no accelerator is present so that your SYCL code can run. Start by installing the most recent drivers for your hardware (e.g., graphics drivers), and follow instructions for verifying your installation.

Using Either Option 

We’ll assume you have succeeded at installing a SYCL compiler or getting access to a machine (like DevCloud) that has a SYCL compiler already installed.

There are a number of repositories of SYCL code examples available on the web. We’ll use SYCL Academy’s repository for our first code. Specifically, we’ll use a vector addition example known as exercise 8 in their training. It illustrates SYCL well and jumps us past the first seven exercises, which introduce features carefully one by one.

Section 4

Understanding the SYCL Program

SYCL is a Khronos standard that brings support for fully heterogeneous data parallelism to C++. SYCL is a key solution to one aspect of a larger problem: How do we enable programming in the face of the coming explosion of hardware diversity?

Today, SYCL solves this problem in a way that lets us usefully target hardware from many vendors, with many architectures. SYCL supports heterogeneous programming in C++. To get started, we really only need to understand queues and how to submit work to them.

Jump Into a SYCL Program 

We look at SYCL Academy Code Exercise #8 and see that it sets up a queue, allocates memory, and executes a kernel on a device. The net result of this code is to perform r[i] = a[i] + b[i]; for all elements of the arrays. See the example code below:

 
#include <CL/sycl.hpp>
// The heart of the SYCL Academy Code Exercise #8 is shown
// here with added comments

// set up a queue to a device that supports USM
auto usmQueue = sycl::queue{usm_selector{}, asyncHandler};

// add these lines (not on github) if you want to know
// precisely which accelerator our program is using
std::cout << "Device: " <<
  usmQueue.get_device().get_info<sycl::info::device::name>() << std::endl;

// allocate three arrays that are on the device
auto devicePtrA = sycl::malloc_device<float>(dataSize, usmQueue);
auto devicePtrB = sycl::malloc_device<float>(dataSize, usmQueue);
auto devicePtrR = sycl::malloc_device<float>(dataSize, usmQueue);

// copy our local arrays into the device memory
usmQueue.memcpy(devicePtrA, a, sizeof(float) * dataSize).wait();
usmQueue.memcpy(devicePtrB, b, sizeof(float) * dataSize).wait();

// submit a kernel that adds two numbers and returns the sum;
// the parallel_for asks that the kernel be applied to all array elements
usmQueue.parallel_for<vector_add>(
    sycl::range{dataSize},
    [=](sycl::id<1> idx) {
      // these two lines run on the accelerator
      auto globalId = idx[0];
      devicePtrR[globalId] = devicePtrA[globalId] + devicePtrB[globalId];
    }).wait();

// copy the device memory back
usmQueue.memcpy(r, devicePtrR, sizeof(float) * dataSize).wait();

// be kind and free up the memory we allocated
sycl::free(devicePtrA, usmQueue);
sycl::free(devicePtrB, usmQueue);
sycl::free(devicePtrR, usmQueue);

 

Line 1: Includes the SYCL definitions. For clarity in our example, we have not used a “using namespace sycl;” directive, which you will commonly see in SYCL programs.

Line 6: Establishes a queue connected to a device that supports USM. We’ll come back to queue construction options after we walk through this code.

Lines 10-11: Not in the original program; add them to print which device was selected at runtime. For instance, on DevCloud, it printed:

Device: Intel(R) UHD Graphics P630 [0x3e96]

On other nodes, it may report an FPGA emulator, a CPU, or a different GPU, all without changing the code.

Lines 14-16: Allocate memory on the device. If we had declared this memory shared with the CPU, we could eliminate lines 19-20 and 33. Doing it this way shows the control we can have when we want it.

Lines 19-20: Copy our local arrays to device memory (allocated on lines 14-16).

Lines 24-30: Submit work to the device. The parallel_for specifies that we want to apply the kernel a certain number (dataSize) of times. The lambda specifies the actual work to be done by each kernel instance. The kernel works on its assigned item by using the unique index provided to it.

Line 30: Causes the program to wait until the kernel has fully executed. Without this wait, the program would proceed as soon as the submission was made. We need the results to be fully computed before we copy them back on line 33.

Line 33: Copies the results back. There are ways to write the program where this would be done for us, but this program shows that we can also have complete control.

Lines 34-36: Tidy things up by freeing the device memory allocated in lines 14-16.

In the full program (on GitHub), there is also code before and after to initialize the arrays and check the results.
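That surrounding harness is ordinary host C++. A minimal sketch of what such initialization and checking could look like (hypothetical names, not the exact code from the repository):

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t dataSize = 1024;

// Fill the inputs with known values so the result is easy to verify:
// a[i] + b[i] should equal dataSize for every element.
void initialize(std::array<float, dataSize>& a,
                std::array<float, dataSize>& b) {
    for (std::size_t i = 0; i < dataSize; ++i) {
        a[i] = static_cast<float>(i);
        b[i] = static_cast<float>(dataSize - i);
    }
}

// After the device writes r, every element should equal dataSize.
bool resultsCorrect(const std::array<float, dataSize>& r) {
    for (std::size_t i = 0; i < dataSize; ++i) {
        if (r[i] != static_cast<float>(dataSize)) return false;
    }
    return true;
}
```

The same verification runs regardless of whether the addition was done on a CPU, a GPU, or an emulated device, which is exactly the point of the exercise.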

What Is a SYCL Queue? 

When a SYCL queue is constructed, it creates a connection to a single device. Our options for device selection include (a) accepting a default that the runtime picks for us, (b) asking for a certain class of device (like a GPU, a CPU, or an FPGA), or (c) taking full control by examining all available devices and scoring them using any system we choose to program.

 
// use the default selector
queue q1{}; // default_selector
queue q2{default_selector()};
// use a CPU
queue q3{cpu_selector()};
// use a GPU
queue q4{gpu_selector()};
// use complex selection described in a function we write;
// in the example, usm_selector{} is a custom selector
// that only grabs a device that supports USM
queue q5{my_custom_selector(a, b, c)};


Constructing a SYCL queue to connect us to a device can be done with whatever level of precision we desire.
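Option (c), full control, amounts to writing a scoring function over the available devices; the runtime then connects the queue to the highest-scoring device. The idea can be sketched in plain C++ over a hypothetical DeviceInfo struct (a real custom selector would score sycl::device objects; the names here are purely illustrative):

```cpp
#include <string>
#include <vector>

// A stand-in for the properties a real selector would query
// from a sycl::device (hypothetical struct for illustration).
struct DeviceInfo {
    std::string name;
    bool is_gpu;
    bool supports_usm;
};

// Score each device; a negative score means "never pick this one".
int score(const DeviceInfo& d) {
    if (!d.supports_usm) return -1;  // our kernel requires USM
    return d.is_gpu ? 100 : 10;      // prefer GPUs over CPUs
}

// Pick the highest-scoring device, as the SYCL runtime would.
std::string pick_device(const std::vector<DeviceInfo>& devices) {
    int best = -1;
    std::string chosen = "none";
    for (const auto& d : devices) {
        int s = score(d);
        if (s > best) { best = s; chosen = d.name; }
    }
    return chosen;
}
```

The scoring policy is entirely ours to write, which is what makes custom selectors the most flexible of the three options.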

What Is a Submission to a Queue? 

Once we have constructed a queue, we can submit work to it. In our example, we used a parallel_for to request that the kernel be applied across all (dataSize) elements. The actual kernel was specified using a lambda function, which is a common way to specify kernels because it is very readable when it remains inline, as in our example.

The order in which kernels are executed is left to the runtime, provided it does not violate any known dependencies (e.g., data needs to be written before it is read). There is an option to ask for an in-order queue if that programming style suits our needs better.

What About Memory? 

Memory is an interesting topic in more complex systems. Simple CPU-only systems generally have one memory accessible by the entire CPU. We say that all the memory is visible to all the CPU processing cores. Caches may be local to a given core, or collection of cores, on a CPU, but they will maintain a consistent view of memory. We refer to these as being “cache coherent.”

In a heterogeneous machine, the visibility of memory and the coherency of caches may be more interesting. GPUs generally have a local memory that is not directly usable by the CPU, and the GPU may or may not be able to access some of the CPU memory. As we add more GPUs or other accelerators, the situation can become even more complex.

That is why SYCL supports rich methods to utilize memory configurations of all kinds. C++ programmers will feel completely at home with Unified Shared Memory (USM), which we used in the example previously discussed. USM allows for memory to be accessed by the CPU and accelerators just like normal memory. In the example, we added a twist of having the USM memory local to the device — to illustrate our ability to control more precisely when we wish to do so. You can read more about that in other tutorials or the book.

Since not all devices can support USM, or there may be special memories that are local to devices, SYCL also supports buffer-based models. It is good to know that SYCL is versatile in this respect, and all the tutorials and the SYCL book cover these options in detail.

Section 5

Top Free Online Tutorials to Learn More Hands-On

SYCL Book 

Data Parallel C++ is the programming book for SYCL developers. It is offered as a free download and covers all the essentials of learning to be an effective SYCL programmer. All the code examples in the book are freely available on GitHub (you’ll find the book errata there as well).

SYCL Academy 

The SYCL Academy open-source project contains lessons, video tutorials, and hands-on exercises to give a broad introduction to SYCL. These materials have been used to help teach SYCL many times and are an excellent resource for all.

DPC++ Tutorials 

There is an excellent series of tutorials that use the DPC++ compiler for SYCL on DevCloud. DPC++ is preinstalled on the free DevCloud environment with access to CPUs, GPUs, and FPGAs for running SYCL programs.

Section 6

C++ and SYCL

SYCL is based entirely on modern C++ capabilities, including templates and lambdas. SYCL does not require any new keywords or language features — there is no syntax to learn beyond modern C++. Extending C++ compilers with SYCL-awareness allows automatic invocation of multiple backends to create executables for arbitrarily many architectures in a single build and enables optimizations that boost performance.

C++23 is next up in the evolution of C++ in supporting parallel programming. The current direction described for "std::execution" in p2300 aims to provide foundational support for structured asynchronous execution. Understandably, C++23 will not try to solve all the challenges of heterogeneous programming.

Today, SYCL supports critical heterogeneous programming needs (that will almost certainly not be in C++23), including device discovery, device offloading, device synchronization, and management of disjoint memories. SYCL usage will inform future C++ standardization efforts.

Section 7

oneAPI and Libraries

We have focused on directly programming heterogeneous machines using SYCL. This has highlighted the key concerns of locating devices and offloading computation to them.

An industry initiative called oneAPI broadens this discussion to include libraries and developer tools (e.g., debuggers and analysis tools).

Most important in oneAPI are the libraries. Users of SYCL — and other languages — will benefit from libraries that already support heterogeneous computing. The list is continuously growing, and any serious heterogeneous programming effort will want to keep an eye on the available library solutions, as well as the tools. The oneAPI website tracks implementations, tutorials, and other information related to oneAPI.

Section 8

Conclusion and Additional Resources

The future of computing is heterogeneous. Every application will need to take advantage of this, directly or indirectly. Now is a perfect time to get familiar with the concepts, techniques, tools, and challenges by getting hands-on experience directly using heterogeneous capabilities. Nothing beats the hands-on experience you can get with SYCL today, and it is especially easy if you use DevCloud-based training.

To fully appreciate the shift that heterogeneous computing represents, its importance, and inevitability, read the landmark article “A New Golden Age for Computer Architecture,” by John L. Hennessy and David A. Patterson, published in the Communications of the ACM.

Three excellent resources for learning more about heterogeneous programming with SYCL are the Data Parallel C++ book, the SYCL Academy materials, and the DPC++ tutorials described in the tutorials section above.

Portability and performance portability are both important topics that may become even more important in a diverse heterogeneous world. For interested readers, the current state of the art in measuring portability and performance portability is well summarized in “Navigating Performance, Portability, and Productivity.” This topic helps explain why open, multivendor, multiarchitecture approaches such as SYCL and oneAPI are recommended.
