OpenMP
We start from the serial code used in the previous chapters, which calculates the integral of a function over a range.
To parallelize the trapezoidal integration with OpenMP, we make use of compiler directives (pragmas) to divide the integration range among multiple threads. Each thread computes the integral over a smaller sub-range, and then the results from all threads are combined to obtain the final integral.
We first recall the serial code:
#include <stdio.h>
#include <math.h>

// Define the function to integrate: f(x) = x^3
double f(double x) {
    return pow(x, 3); // pow(a,b) computes a^b
}

// Trapezoidal rule for numerical integration
double trapezoidal_rule(double (*func)(double), double a, double b, int n) {
    double p = (b - a) / n;                 // Width of each trapezoid
    double sum = 0.5 * (func(a) + func(b)); // End points contribution
    for (int i = 1; i < n; i++) {
        double x = a + i * p;
        sum += func(x);
    }
    return sum * p;
}

int main() {
    double a = 0.0; // Lower limit of integration
    double b = 1.0; // Upper limit of integration
    int n = 1000;   // Number of trapezoids (higher n for better accuracy)

    printf("This program performs numerical integration of f(x) = x^3 from a = %.2f to b = %.2f using %d trapezoids.\n", a, b, n);

    // Check if n is a valid number of trapezoids
    if (n > 0) {
        printf("The number of trapezoids is positive.\n");
    } else if (n < 0) {
        printf("Error: The number of trapezoids is negative.\n");
    } else {
        printf("Error: The number of trapezoids is zero.\n");
    }

    // Perform numerical integration
    double result = trapezoidal_rule(f, a, b, n);
    printf("The integral of f(x) = x^3 from %.2f to %.2f is approximately: %.5f\n", a, b, result);

    return 0;
}
The OpenMP-parallelized version of the code is:
#include <stdio.h>
#include <math.h>
#include <omp.h>

// Define the function to integrate: f(x) = x^3
double f(double x) {
    return pow(x, 3); // pow(a,b) computes a^b
}

// Trapezoidal rule for numerical integration with OpenMP parallelization
double trapezoidal_rule(double (*func)(double), double a, double b, int n) {
    double p = (b - a) / n;                 // Width of each trapezoid
    double sum = 0.5 * (func(a) + func(b)); // End points contribution

    // Parallelize the summation over trapezoids
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i < n; i++) {
        double x = a + i * p;
        sum += func(x);
    }
    return sum * p;
}

int main() {
    double a = 0.0; // Lower limit of integration
    double b = 1.0; // Upper limit of integration
    int n = 1000;   // Number of trapezoids (higher n for better accuracy)

    printf("This program performs numerical integration of f(x) = x^3 from a = %.2f to b = %.2f using %d trapezoids.\n", a, b, n);

    // Check if n is a valid number of trapezoids
    if (n > 0) {
        printf("The number of trapezoids is positive.\n");
    } else if (n < 0) {
        printf("Error: The number of trapezoids is negative.\n");
    } else {
        printf("Error: The number of trapezoids is zero.\n");
    }

    // Perform numerical integration
    double result = trapezoidal_rule(f, a, b, n);
    printf("The integral of f(x) = x^3 from %.2f to %.2f is approximately: %.5f\n", a, b, result);

    return 0;
}
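To check whether the parallel version actually pays off, you can time the call with omp_get_wtime(), one of the OpenMP utility functions described later in this chapter. The following is a minimal sketch of the change, to be placed inside main() in place of the single call above; the names t_start and t_end are ours and not part of the original program:

// Perform numerical integration and measure the elapsed wall-clock time
double t_start = omp_get_wtime();
double result = trapezoidal_rule(f, a, b, n);
double t_end = omp_get_wtime();
printf("The integral of f(x) = x^3 from %.2f to %.2f is approximately: %.5f\n", a, b, result);
printf("Elapsed time: %f seconds\n", t_end - t_start);

Running the program with different values of OMP_NUM_THREADS then gives a first impression of the speedup.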
How To Compile and Run
To compile code that makes use of OpenMP, you need to enable OpenMP support in your compiler. For gcc, use the -fopenmp flag:
Compile
gcc -fopenmp -o my_program my_program.c
This command compiles the code in my_program.c and produces an executable named my_program.
Run
Simply run the executable from the terminal:
./my_program
By default, OpenMP uses as many threads as there are available CPU cores. You can control the number of threads by setting the OMP_NUM_THREADS environment variable before running the program.
For example, to use 4 threads:
export OMP_NUM_THREADS=4
./my_program
Alternatively, you can set the number of threads within the code using the omp_set_num_threads() function:
#include <omp.h>
omp_set_num_threads(4); // Sets the number of threads to 4
#pragma omp parallel for
The directive #pragma omp parallel for reduction(+:sum) is an OpenMP construct that combines parallel processing with a reduction operation. Here’s how it works:
- Parallelization: The #pragma omp parallel for part of the directive tells the compiler to parallelize the following for loop. This means that each iteration of the loop can be executed by a different thread. OpenMP automatically handles the distribution of loop iterations among available CPU threads, allowing them to work independently on their assigned iterations.
- Reduction Operation: The reduction(+:sum) clause specifies a reduction operation for the variable sum. In this case, the + symbol indicates that each thread should add its individual contributions to sum. OpenMP will create a private copy of sum for each thread, allowing threads to accumulate their partial results without interference or race conditions. At the end of the loop, OpenMP combines each thread’s partial sum into a single total sum, storing the final result in the original sum variable.
- Thread Safety: By managing individual partial sums for each thread and combining them at the end, OpenMP ensures thread safety. This approach is efficient and avoids conflicts that could arise if multiple threads attempted to modify sum simultaneously.
This directive is particularly useful in scenarios where each iteration independently contributes to an accumulation (e.g., summing values in numerical integration). The reduction clause simplifies parallelized summation by automatically handling both private copies and final aggregation.
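For intuition, here is a minimal sketch of what the reduction clause achieves for the trapezoidal loop, written out by hand with a private partial sum per thread and a critical section for the final combination. This is a conceptual equivalent for illustration, not the code the compiler actually generates:

double sum = 0.5 * (func(a) + func(b)); // End points contribution, as before

#pragma omp parallel
{
    double local_sum = 0.0; // Private partial sum: no shared data is touched in the loop

    #pragma omp for
    for (int i = 1; i < n; i++) {
        local_sum += func(a + i * p);
    }

    #pragma omp critical
    sum += local_sum; // Combine the partial sums one thread at a time
}

The reduction clause performs all of this bookkeeping automatically.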
[!WARNING]
Race conditions
A race condition is a situation in parallel programming where multiple threads or processes access and manipulate shared data concurrently, and the program's outcome depends on the timing or sequence of their execution. This often leads to unpredictable and incorrect results, as the order of operations is not controlled, allowing threads to "race" to complete their tasks. Race conditions happen because of incomplete synchronization. When threads access shared data without a controlled sequence (e.g., without using locks or other synchronization mechanisms), they can interfere with each other’s operations, leading to inconsistent or unexpected results.
[!TIP] To prevent race conditions, programmers use synchronization techniques to control access to shared data. In OpenMP, constructs like #pragma omp critical, #pragma omp atomic, or reduction (for certain operations) can help manage access to shared variables, ensuring correct results by controlling when threads can read or modify shared data.
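As a small illustration of the atomic construct mentioned in the tip, the following sketch (our own example, not taken from the programs above) counts how many elements of an array are positive; without the #pragma omp atomic line, the increment of count would be exactly the kind of race condition described in the warning:

#include <stdio.h>
#include <omp.h>

int main() {
    double array[8] = { 1.0, -2.0, 3.0, 4.0, -5.0, 6.0, -7.0, 8.0 };
    int count = 0; // Shared counter

    #pragma omp parallel for
    for (int i = 0; i < 8; i++) {
        if (array[i] > 0) {
            #pragma omp atomic
            count++; // Atomic update prevents the race on count
        }
    }

    printf("Positive elements: %d\n", count); // Prints 5
    return 0;
}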
Here is an example of using #pragma omp parallel for to parallelize the operation a[i] = b[i]*c + d[i]:
#include <stdio.h>
#include <omp.h>

#define N 1000 // Size of the arrays

int main() {
    double a[N], b[N], d[N];
    double c = 2.5; // Scalar multiplier

    // Initialize arrays b and d with some values
    for (int i = 0; i < N; i++) {
        b[i] = i * 1.0; // Example values for b
        d[i] = i + 1.0; // Example values for d
    }

    // Parallelized computation
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        a[i] = b[i] * c + d[i];
    }

    return 0;
}
Reduction
The reduction clause in OpenMP supports several operators and custom user-defined reductions. Here’s an overview of the possibilities:
1. Built-in Operators
OpenMP supports a variety of built-in operators for the reduction clause. Each operator combines the values from multiple threads into a single result.
Arithmetic Operators:
- +: Sum of all values.
- *: Product of all values.
- -: Subtraction (less commonly used; may depend on context).

Bitwise Operators:
- &: Bitwise AND of all values.
- |: Bitwise OR of all values.
- ^: Bitwise XOR of all values.

Logical Operators:
- &&: Logical AND (all values must be true).
- ||: Logical OR (at least one value must be true).

Min/Max:
- min: Finds the minimum value.
- max: Finds the maximum value.
2. Example Usage
Sum (addition):

#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < n; i++) {
    sum += array[i];
}

Product:

#pragma omp parallel for reduction(*:product)
for (int i = 1; i <= n; i++) {
    product *= i;
}

Maximum:

#pragma omp parallel for reduction(max:max_val)
for (int i = 0; i < n; i++) {
    if (array[i] > max_val) {
        max_val = array[i];
    }
}

Logical AND:

#pragma omp parallel for reduction(&&:all_positive)
for (int i = 0; i < n; i++) {
    all_positive = all_positive && (array[i] > 0);
}
3. User-Defined Reductions
OpenMP allows you to define custom reductions for more complex operations.
Steps to Define Custom Reductions:
- Define the Operation: Create a function or operator that specifies how the reduction should combine values.
- Specify Initial Value: Provide an identity value for the reduction (e.g., 0 for sum, 1 for product).
- Use omp declare reduction: Declare the reduction in OpenMP using this directive.
Example: Custom String Concatenation
#include <stdio.h>
#include <string.h>
#include <omp.h>

// OpenMP does not allow array types in declare reduction, so wrap the buffer in a struct
typedef struct { char s[256]; } str_t;

// Combiner: append each thread's partial string to the running result.
// Initializer: every private copy starts from an empty string.
#pragma omp declare reduction(concat : str_t : strcat(omp_out.s, omp_in.s)) \
    initializer(omp_priv = (str_t){ "" })

int main() {
    str_t result = { "" };

    #pragma omp parallel for reduction(concat:result)
    for (int i = 0; i < 4; i++) {
        char temp[64];
        sprintf(temp, "Thread %d ", i);
        strcat(result.s, temp); // Each thread appends to its private copy
    }

    // The order of the fragments depends on how the iterations were scheduled
    printf("Concatenated result: %s\n", result.s);
    return 0;
}
4. Reductions with Multiple Variables
You can reduce several variables with the same operator by listing them separated by commas; variables that need different operators go into separate reduction clauses, as in the example below.

#pragma omp parallel for reduction(+:sum) reduction(*:product)
for (int i = 1; i <= n; i++) {
    sum += i;
    product *= i;
}
[!TIP]
- Use the most specific operator for performance. For example, max or min is faster than manually implementing a comparison-based reduction.
- Always initialize reduction variables before entering the parallel region.
- Be aware of precision issues when using floating-point arithmetic in reductions.
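The floating-point remark can be made concrete: because the parallel reduction combines the partial sums in an order that depends on the number of threads and the scheduling, the result may differ from the serial sum in the last bits. A minimal sketch, with array contents chosen only for illustration:

#include <stdio.h>
#include <omp.h>
#define N 1000000

int main() {
    static double x[N];
    for (int i = 0; i < N; i++)
        x[i] = 1.0 / (i + 1.0);

    // Serial sum: fixed left-to-right order
    double serial_sum = 0.0;
    for (int i = 0; i < N; i++)
        serial_sum += x[i];

    // Parallel reduction: partial sums combined in an unspecified order
    double parallel_sum = 0.0; // Initialized before the parallel region
    #pragma omp parallel for reduction(+:parallel_sum)
    for (int i = 0; i < N; i++)
        parallel_sum += x[i];

    printf("serial     = %.15f\n", serial_sum);
    printf("parallel   = %.15f\n", parallel_sum);
    printf("difference = %g\n", serial_sum - parallel_sum);
    return 0;
}

Both results are valid approximations; the point is simply that bitwise reproducibility is not guaranteed.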
Fork-Join Model
The fork-join concept is a parallel programming model used in frameworks like OpenMP to manage the flow of parallel tasks. This model allows a program to create multiple threads to work on different tasks simultaneously (the "fork" part) and then synchronize or join them back together once they complete their tasks.
- Fork: When a program encounters a parallel region (like an OpenMP #pragma omp parallel directive), it "forks" into multiple threads. This means that the main (or "master") thread spawns a team of threads, each working independently on a portion of the task. Each thread executes the same code in parallel, and OpenMP handles the assignment of iterations or tasks to each thread.
- Parallel Work: During the parallel region, each thread performs its assigned work independently. This is where the program achieves parallelism, as multiple threads execute concurrently on different CPU cores.
- Join: After the parallel section is complete, all threads are synchronized, and they "join" back into a single thread. The main thread waits for all spawned threads to finish their work before continuing with the sequential parts of the program. This ensures that all parallel computations are completed before the program moves on to the next steps, maintaining program correctness.
#include <omp.h>
#include <stdio.h>

int main() {
    printf("Sequential code before parallel region.\n");

    // Fork: Start parallel region
    #pragma omp parallel
    {
        int thread_id = omp_get_thread_num();
        printf("Parallel work by thread %d\n", thread_id);
    } // Join: End of parallel region

    printf("Sequential code after parallel region.\n");
    return 0;
}
Some useful OpenMP functions
In the previous example, we used the function omp_get_thread_num(), which is one of several utility functions provided by OpenMP to manage and query information about threads. These functions are helpful in understanding and controlling parallel execution within an OpenMP program.
Here’s a breakdown of omp_get_thread_num() and other common OpenMP functions:
- omp_get_thread_num(): Returns the ID of the current thread within a parallel region. The thread ID is an integer that ranges from 0 up to omp_get_num_threads() - 1. Often used within a #pragma omp parallel block to identify which thread is executing a particular portion of the code.
- omp_get_num_threads(): Returns the total number of threads currently in a parallel region. It is useful for dividing work dynamically based on the total number of threads or for setting up data structures.
- omp_get_max_threads(): Provides the maximum number of threads that could be used in parallel regions, based on the current settings. Often used to predict or set the maximum threads without necessarily entering a parallel region.
- omp_set_num_threads(int num_threads): Sets the number of threads that will be used for subsequent parallel regions. Used to control the number of threads dynamically, usually at the beginning of the program or before a parallel section.
- omp_get_num_procs(): Returns the number of processors (or CPU cores) available on the system. Useful for configuring the number of threads based on system capabilities, often combined with omp_set_num_threads().
- omp_in_parallel(): Returns a non-zero value if currently inside a parallel region; otherwise, returns zero. This function helps determine if a particular section of code is running in a parallel context, which can be useful for managing nested parallelism or debugging.
- omp_get_wtime(): Returns the elapsed wall-clock time (in seconds) since an arbitrary point in the past, used for timing and performance analysis. Commonly used to measure the execution time of code blocks in parallel programs.
Here is an example in which these functions are used:
#include <stdio.h>
#include <omp.h>

int main() {
    // Set number of threads to 4 for demonstration
    omp_set_num_threads(4);
    printf("Number of threads set to: %d\n", 4);

    // Get the maximum number of threads available
    int max_threads = omp_get_max_threads();
    printf("Maximum threads available: %d\n", max_threads);

    // Get the number of processors available
    int num_procs = omp_get_num_procs();
    printf("Number of processors available: %d\n", num_procs);

    // Start timing the parallel section
    double start_time = omp_get_wtime();

    // Parallel region begins
    #pragma omp parallel
    {
        // Check if we are inside a parallel region
        if (omp_in_parallel()) {
            printf("Thread %d: Currently in a parallel region.\n", omp_get_thread_num());
        } else {
            printf("Not in a parallel region.\n");
        }

        // Get the total number of threads in this parallel region
        int total_threads = omp_get_num_threads();
        printf("Thread %d: Total threads in parallel region: %d\n", omp_get_thread_num(), total_threads);

        // Each thread prints its own ID
        int thread_id = omp_get_thread_num();
        printf("Hello from thread %d\n", thread_id);
    }
    // Parallel region ends
    double end_time = omp_get_wtime();

    // Print total time taken for the parallel section
    printf("Time taken for parallel region: %f seconds\n", end_time - start_time);

    return 0;
}
and this is the output (note that the order in which the threads print is not deterministic and may change from run to run):
Number of threads set to: 4
Maximum threads available: 4
Number of processors available: 32
Thread 1: Currently in a parallel region.
Thread 1: Total threads in parallel region: 4
Hello from thread 1
Thread 2: Currently in a parallel region.
Thread 2: Total threads in parallel region: 4
Hello from thread 2
Thread 3: Currently in a parallel region.
Thread 3: Total threads in parallel region: 4
Hello from thread 3
Thread 0: Currently in a parallel region.
Thread 0: Total threads in parallel region: 4
Hello from thread 0
Time taken for parallel region: 0.001045 seconds
Critical regions and single regions
In OpenMP, critical and single are two different constructs used to control how certain code blocks are executed in a parallel region. While both are designed to prevent issues arising from concurrent access to code, they serve distinct purposes and are used in different situations. Here’s a breakdown of the differences:
#pragma omp critical
- Purpose: Ensures that only one thread at a time executes the code within the critical section.
- Behavior: All threads can encounter a critical section, but only one thread is allowed to execute it at any given time. If multiple threads reach the critical section simultaneously, they will wait their turn to execute, effectively serializing access to the code block.
- Usage: Used to protect shared resources (e.g., updating a shared variable or data structure) where only one thread should modify or access the resource at a time.
#pragma omp single
- Purpose: Ensures that only one thread executes the enclosed code block, and only once in total.
- Behavior: The first thread to reach the single section executes the code, while all other threads skip it entirely (rather than waiting to execute). This means that only one thread performs the work inside the single section, and it is done only once across all threads.
- Usage: Often used for initializing shared resources, performing I/O operations, or any task that only needs to be done once, rather than by every thread.
Here is an example:
...
int shared_var = 0;

#pragma omp parallel
{
    #pragma omp single
    {
        printf("Single: This only happens once.\n");
    }

    #pragma omp critical
    {
        shared_var += 1; // Critical: Ensures safe modification of shared_var
    }
}
...
Data sharing attributes
In OpenMP, data sharing attributes are used to control how variables are shared or private among threads in a parallel region. These attributes determine how variables are treated during parallel execution, ensuring that data is correctly handled to avoid race conditions and other issues. The key data sharing attributes are:
1. private
- Purpose: Each thread gets its own private copy of the variable. Changes made to this variable by one thread are not visible to other threads.
- Behavior: The variable is uninitialized in each thread, meaning its value is undefined until it is explicitly set within the thread.
- Usage: Typically used for loop counters, temporary variables, or any data that should not be shared between threads.
- Example:

int i;
#pragma omp parallel for private(i)
for (i = 0; i < 10; i++) {
    printf("Thread %d, i = %d\n", omp_get_thread_num(), i);
}

In this example, i is declared before the directive (so that it can appear in the private clause) and is private to each thread, so each thread gets its own version of i during the execution of the loop.
2. shared
- Purpose: A variable is shared among all threads. All threads in the parallel region see the same value of the variable.
- Behavior: The variable is shared across all threads, meaning any updates made by one thread are visible to all other threads.
- Usage: Typically used for global variables, accumulators, or any data that needs to be shared among threads.
- Example:

int sum = 0;
#pragma omp parallel for shared(sum)
for (int i = 0; i < 10; i++) {
    sum += i; // Unsynchronized update of a shared variable: see the note below
}
printf("Total sum: %d\n", sum);

In this case, sum is shared across all threads: every thread updates the same variable during the loop. Note that this unsynchronized update is exactly the kind of race condition discussed earlier; in real code the update should be protected with #pragma omp atomic or, better, expressed as a reduction.
3. firstprivate
- Purpose: The variable is private to each thread, but it is initialized with the value of the original variable at the start of the parallel region.
- Behavior: Each thread gets its private copy, but the initial value is the same as the value of the variable outside the parallel region.
- Usage: Useful when you want each thread to have its own copy of the variable but retain the initial value from outside the parallel region.
- Example:

int x = 10;
#pragma omp parallel for firstprivate(x)
for (int i = 0; i < 10; i++) {
    printf("Thread %d, x = %d\n", omp_get_thread_num(), x);
}

Here, x is private to each thread, but it is initialized to 10 in each thread, the value it had outside the parallel region.
4. lastprivate
- Purpose: Similar to private, but the value from the sequentially last iteration of the loop (or the last section) is copied back to the original variable after the parallel region ends.
- Behavior: Each thread has its own private copy of the variable, but the value from the last iteration is written back to the original variable after the parallel region completes.
- Usage: Typically used when you want the final value of a variable after the parallel region to reflect the result of the last iteration.
- Example:

int last_value = 0;
#pragma omp parallel for lastprivate(last_value)
for (int i = 0; i < 10; i++) {
    last_value = i; // Will be set to 9 in the end
}
printf("Last value: %d\n", last_value); // Prints 9

In this example, last_value is set to the value of i from the last iteration of the loop, whichever thread happens to execute it.
5. default
- Purpose: Specifies the default sharing behavior for variables in a parallel region. OpenMP supports three default options: shared, none, and private.
- Options:
  - default(shared): All variables are shared by default unless explicitly stated otherwise.
  - default(private): All variables are private by default unless explicitly stated otherwise.
  - default(none): No variables are implicitly shared or private. This forces the programmer to explicitly declare the data-sharing attribute for each variable used in the parallel region.
- Example:

int i, sum = 0;
#pragma omp parallel default(none) shared(sum) private(i)
{
    // With default(none), every variable used here must have an explicit data-sharing attribute
    for (i = 0; i < 10; i++) {
        #pragma omp atomic
        sum += i; // atomic protects the update of the shared variable
    }
}
6. uniform (declare simd)
- Purpose: The uniform clause is not a data-sharing attribute of parallel for; it belongs to #pragma omp declare simd and tells the compiler that an argument has the same value for every concurrent (SIMD) invocation of the function, so the compiler can optimize accordingly.
- Usage: Useful for functions called from SIMD loops that take loop-invariant parameters, such as a scalar coefficient or a base pointer.
- Example:

#pragma omp declare simd uniform(c)
double scale(double x, double c) {
    return x * c; // c has the same value for all SIMD lanes
}
Summary of Data Sharing Attributes:
| Attribute | Description | Example Use Case |
|---|---|---|
| private | Each thread gets its own private copy of the variable. | Loop counters, thread-specific data. |
| shared | All threads share the same variable; any update is visible to all threads. | Accumulators, global data. |
| firstprivate | Variable is private to each thread, initialized with the value from outside. | Initialization of thread-local copies with external values. |
| lastprivate | Like private, but the value from the last iteration is copied back to the original. | Updating a variable based on the result of the last loop iteration. |
| default | Specifies the default data-sharing behavior (e.g., shared, private, none). | Setting the default sharing behavior for all variables in a parallel region. |
| uniform | declare simd clause: an argument has the same value for all SIMD invocations. | Loop-invariant scalars or pointers passed to SIMD functions. |
By carefully choosing the appropriate data-sharing attributes, you can control how variables are handled in parallel regions, ensuring the correct behavior and avoiding data races.
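To close, here is a minimal sketch that combines several of these attributes in one loop; the variable names are ours and chosen only for illustration:

#include <stdio.h>
#include <omp.h>

int main() {
    int offset = 100; // Copied into each thread via firstprivate
    int last_i = -1;  // Receives the value from the last iteration via lastprivate
    int sum = 0;      // Combined across threads via reduction

    #pragma omp parallel for default(none) firstprivate(offset) \
            lastprivate(last_i) reduction(+:sum)
    for (int i = 0; i < 10; i++) {
        sum += i + offset; // Each thread reads its own copy of offset
        last_i = i;        // The value from iteration 9 is copied back after the loop
    }

    printf("sum = %d, last_i = %d\n", sum, last_i); // sum = 1045, last_i = 9
    return 0;
}

With default(none), forgetting any of the explicit clauses becomes a compile-time error rather than a silent data race.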