Efficient Heterogeneous Parallel Programming Using OpenMP

A look at the five stages of heterogeneous parallel development and performance tuning.

By Chryste Sullivan · Apr. 26, 22 · Tutorial

In some cases, offloading computations to an accelerator like a GPU leaves the host CPU idle until the offloaded computations finish. Using the CPU and GPU simultaneously, however, can improve application performance. In OpenMP® programs that take advantage of heterogeneous parallelism, the master construct can be combined with the target directive to overlap CPU and GPU execution. In this article, we show how to perform asynchronous CPU+GPU computation using OpenMP.

The SPEC ACCEL® 514.pomriq MRI reconstruction benchmark is written in C and parallelized with OpenMP. It can offload some calculations to accelerators for heterogeneous parallel execution. In this article, we divide the computation between the host CPU and a discrete Intel® GPU so that both processors are kept busy. We also use Intel® VTune™ Profiler to measure CPU and GPU utilization and to analyze performance.

We’ll look at five stages of heterogeneous parallel development and performance tuning:

  1. Identifying appropriate code regions to parallelize
  2. Parallelizing those regions so that both the CPU and GPU are kept busy
  3. Finding the optimal work distribution coefficient
  4. Launching the heterogeneous parallel application with this distribution coefficient
  5. Measuring the performance improvement

Initially, the parallel region runs only on the GPU while the CPU sits idle (Figure 1). As you can see, only the “OMP Primary Thread” is executing on the CPU, while the GPU is fully occupied (GPU Execution Units → EU Array → Active) with the offloaded ComputeQ kernel.


Figure 1. Profile of the initial code using Intel VTune Profiler

After examining the code, we decided to duplicate each array and each compute region so that one copy executes on the GPU and the other on the CPU. The master thread uses the OpenMP target directive to offload work to the GPU. This is shown schematically in Figure 2. The nowait directives avoid unnecessary synchronization between the threads running on the CPU and the GPU, and they improve load balance among the threads.


Figure 2. OpenMP parallelization scheme to keep the CPU and GPU busy

The work distribution between the CPU and GPU is controlled by the part variable, which is read from the command line (Figure 3). This variable is the fraction of the workload offloaded to the GPU; multiplying it by numX gives TILE_SIZE, the number of elements computed on the GPU. The remaining work is done on the CPU. An example of the OpenMP heterogeneous parallel implementation is shown in Figure 4.

float part;
char *end;
part = strtof(argv[2], &end);

int TILE_SIZE = numX * part;

Qrc = (float *) memalign(alignment, numX * sizeof(float));
Qrg = (float *) memalign(alignment, TILE_SIZE * sizeof(float));

Figure 3. The work distribution coefficient between the CPU and GPU


#pragma omp parallel
  {
        #pragma omp master
        #pragma omp target teams distribute parallel for nowait private(expArg, cosArg, sinArg)
        for (indexX = 0; indexX < TILE_SIZE; indexX++) {
            float QrSum = 0.0;
            float QiSum = 0.0;
            #pragma omp simd private(expArg, cosArg, sinArg) reduction(+: QrSum, QiSum)
            for (indexK = 0; indexK < numK; indexK++) {
                expArg = PIX2 * (GkVals[indexK].Kx * xg[indexX] +
                                 GkVals[indexK].Ky * yg[indexX] +
                                 GkVals[indexK].Kz * zg[indexX]);
                cosArg = cosf(expArg);
                sinArg = sinf(expArg);
                float phi = GkVals[indexK].PhiMag;
                QrSum += phi * cosArg;
                QiSum += phi * sinArg;
            }
            Qrg[indexX] += QrSum;
            Qig[indexX] += QiSum;
        }
        #pragma omp for private(expArg, cosArg, sinArg)
        for (indexX = TILE_SIZE; indexX < numX; indexX++) {
            float QrSum = 0.0;
            float QiSum = 0.0;
            #pragma omp simd private(expArg, cosArg, sinArg) reduction(+: QrSum, QiSum)
            for (indexK = 0; indexK < numK; indexK++) {
                expArg = PIX2 * (CkVals[indexK].Kx * xc[indexX] +
                                 CkVals[indexK].Ky * yc[indexX] +
                                 CkVals[indexK].Kz * zc[indexX]);
                cosArg = cosf(expArg);
                sinArg = sinf(expArg);
                float phi = CkVals[indexK].PhiMag;
                QrSum += phi * cosArg;
                QiSum += phi * sinArg;
            }
            Qrc[indexX] += QrSum;
            Qic[indexX] += QiSum;
        }
  }

Figure 4. Example code illustrating the OpenMP implementation that simultaneously utilizes the CPU and GPU
 

The Intel® oneAPI DPC++/C++ Compiler was used with the following command-line options:

-O3 -Ofast -xCORE-AVX512 -mprefer-vector-width=512 -ffast-math
-qopt-multiple-gather-scatter-by-shuffles -fimf-precision=low
-fiopenmp -fopenmp-targets=spir64="-fp-model=precise"


Table 1 shows the performance for different CPU to GPU work ratios (i.e., the part variable described above). For our system and workload, an offload ratio of 0.65 gives the best load balance between the CPU and GPU, and hence the best utilization of processor resources. The profile from Intel VTune Profiler shows that work is more evenly distributed between the CPU and GPU, and that both processors are being effectively utilized (Figure 5). While “OMP Primary Thread” submits the offloaded kernel (main: 237) for execution on the GPU, other “OMP Worker Threads” are active on the CPU.

Offload part | Total time (s) | GPU time (s)
0.00         | 61.2           | 0.0
0.20         | 51.6           | 8.6
0.40         | 41.0           | 16.8
0.60         | 31.5           | 24.7
0.65         | 28.9           | 26.7
0.80         | 34.8           | 32.6
1.00         | 43.4           | 40.7

Table 1. Hotspot times corresponding to different amounts of offloaded work (i.e., the part variable)


Figure 5. Profile of code with 65% GPU offload

Figure 6 shows the run times for different values of part. Keep in mind that a part of zero means that no work is offloaded to the GPU. A part of one means that all work is offloaded. It’s clear that a balanced distribution of work across the CPU and GPU gives better performance than either extreme.


Figure 6. Run times for different values of part (all times in seconds)

OpenMP provides true asynchronous, heterogeneous execution on CPU+GPU systems. It’s clear from our timing results and VTune profiles that keeping the CPU and GPU busy in the OpenMP parallel region gives the best performance. We encourage you to try this approach.


Published at DZone with permission of Chryste Sullivan. See the original article here.

Opinions expressed by DZone contributors are their own.
