Sum3 Multi-Core OpenMP Implementation

At this point we have seen two possible algorithms and implementations for the Sum3 problem:

  1. An O(n^3) algorithm that is extremely easy to implement, but that suffers from poor execution time.
  2. An O(n^2) algorithm that is much faster in terms of execution time, but that requires care to handle sequences of repeating values correctly.

The question now is whether we can do better. It turns out that, algorithmically, we cannot... at least not yet: no known algorithm solves the Sum3 problem in truly sub-quadratic time, i.e., in O(n^{2-ε}) time for any ε > 0. In fact, there is a whole class of problems known as Sum3-hard problems [GOM1995CG]. A problem is Sum3-hard if the Sum3 problem can be reduced to it in sub-quadratic time; the implication is that a sub-quadratic time solution to any problem in the Sum3-hard class would provide a sub-quadratic time solution to Sum3 itself.

Introducing Parallelism with Multiple Threads

In terms of running time, we can do better with the Sum3 problem. One way to improve running time is to utilize a multi-core CPU with a multi-threaded program. The general approach is to divide the work that must be completed among multiple cores. Ideally, if we evenly divide the work among two processors, the program should run twice as fast. Another way to express this is to say that the program should have a 2x speedup over the serial version. The concept of speedup is a useful measure to gauge the effectiveness of a particular optimization, and is defined as follows:

Definition

Let t_{old} be the running time of a program and t_{new} be the running time of the same program that has been optimized. The speedup of the new program is equal to the running time of the old program divided by the running time of the new program:

speedup = t_{old} / t_{new}
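For example, if a serial program runs in 10 seconds and an optimized version of the same program runs in 4 seconds, the speedup is 10 / 4 = 2.5.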

Design of a Multi-Threaded Algorithm

Designing a multi-threaded program, or a parallel algorithm for that matter, requires a different thought process than designing a serial program. When designing a serial algorithm, the focus is usually on determining the steps that must be completed in succession to achieve an answer. When developing a multi-threaded program, one must think in terms of how the work can be divided among threads. It is sometimes useful to begin with a serial program and modify it to divide the work; other times it is easier to simply start from scratch.

Concept

How should the work be divided among threads?

In order to effectively parallelize work, each thread must be able to do a portion of the work. Ideally, each thread will take roughly the same amount of time to execute; it is generally undesirable for most threads to complete their work quickly and then sit idle waiting on a single, slow thread to finish.

Let's take the cubic version of the Sum3 algorithm as an example (the core of the program is listed below). The main work of the program consists of the triply nested loops that generate all possible triples of numbers from the input array. A reasonable method of dividing the work is to create two threads such that each thread generates roughly half of the triples that must be tested. We can achieve this by modifying the outer loop so that one thread only computes triples in which the first number of the triple comes from the first half of the array, while the second thread computes triples in which the first number comes from the second half of the array. Such a modification requires only minor changes to the core of the program.

#include <iostream>
using namespace std;

int main( )
{
        int dataSize = 5;
        int* data = new int[ dataSize ];
        data[0] = -1;
        data[1] = -2;
        data[2] = 0;
        data[3] = 2;
        data[4] = 3;
        // do the naive Sum3 computation.  O(n^3)
        int count = 0;
        for (int i = 0; i < dataSize-2; i++)
                for (int j = i+1; j < dataSize-1; j++)
                        for (int k = j+1; k < dataSize; k++)
                                if (data[i] + data[j] + data[k] == 0)
                                        count++;

        cout<< count <<endl;

}
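Concretely, the split boils down to computing per-thread bounds for the outer loop from a thread ID. The sketch below (with id assumed to be 0 or 1) is the bounds computation used in the threaded versions that follow; the inner loops are unchanged:

        // thread id takes i in [start, stop); thread 0 covers the first
        // half of the array in the outer loop, thread 1 the second half
        int start = (dataSize / 2) * id;
        int stop = (dataSize / 2) * (id + 1);
        if ( id == 1 )
                stop = dataSize - 2;    // last useful value of i
        for (int i = start; i < stop; i++)
                for (int j = i+1; j < dataSize-1; j++)
                        for (int k = j+1; k < dataSize; k++)
                                if (data[i] + data[j] + data[k] == 0)
                                        count++;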

Cubic Sum3 Using Pthreads

Pthreads are a low-level threading mechanism that has been around for a long time. Pthreads are provided by a C library, and as such, they do things in a very old-school, C way of thinking. The basic idea is that the programmer defines a function, and then tells the pthread library to create a thread to run that function. Therefore, we have the concept of a master thread, which is the thread started when the program is initially executed, and the concept of worker threads, which are spawned by the master thread. Typically, the master thread will spawn some worker threads, and then wait for them to complete before moving on.

To create a pthread version of the program, we must first put the code that each thread will execute into a function. We then ask the pthread library to launch a thread that runs that function. The master thread can then launch additional threads, do other work, or wait on the launched threads to finish executing. Here are the library calls we will use; you should check your system's man pages for the most up-to-date information.

Pthread Library Calls

int
pthread_create(pthread_t *restrict thread, const pthread_attr_t *restrict attr,
        void *(*start_routine)(void *), void *restrict arg);

Create a thread. The first argument will be assigned a handle for the thread that is created. The second argument is a pointer to a thread attributes struct that can define attributes of the thread; we will simply pass NULL to use the defaults. The third argument is a pointer to the function that this thread will execute. The final argument is a pointer to a single argument that will be passed to that function. Note that to use the argument, the function must cast it to the appropriate type; and if more than one argument is required, you must pack them into a struct.

Returns 0 on success, a non-zero error code on failure.

int
pthread_join(pthread_t thread, void **value_ptr);

Join a thread. The join call causes the calling thread to wait until the thread identified by the handle in the first argument terminates. We will pass NULL as the second argument.

Returns 0 on success, a non-zero error code on failure.
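Before tackling Sum3, here is a minimal, self-contained sketch of the create/join pattern, including passing an argument through the required void pointer. The struct and function names here are just illustrative, not part of the pthread API; compile with g++ -pthread:

#include <pthread.h>
#include <iostream>

// a hypothetical argument struct; pthreads allows only one argument,
// so multiple values get packed into a struct like this
struct workerArg {
        int myID;
};

void* sayHello( void* arg )
{
        // cast the void pointer back to its real type before use
        workerArg* myArg = static_cast<workerArg*>( arg );
        std::cout << "hello from worker " << myArg->myID << std::endl;
        return NULL;
}

int main( )
{
        pthread_t tid;
        workerArg arg;
        arg.myID = 0;
        if ( pthread_create( &tid, NULL, sayHello, &arg ) ) {
                return 1;       // thread creation failed
        }
        // the master blocks here until the worker terminates
        pthread_join( tid, NULL );
}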

Now that we have the ability to create and join threads, we can create a simple pthread program. The following example shows how to create threads to do some work, and how to pass arguments to them. Remember that pthreads require the function that will be executed as a thread to take a single argument, so we must wrap all the arguments we want to pass into a single struct. We define the threadInfo struct just above the thread function, and instantiate two instances of it in main(). Because a pointer to the struct is passed to our partial3SumThread function as a void *, we must cast the pointer back to the correct type at the top of the function before we can use it.

Since each thread also needs the input array, we store a pointer to it in the struct as well.

Note

The function launched as a thread by the pthread library must take a single argument of type void pointer. This argument must be type-cast to a void pointer to pass it into the function, and type-cast back to its original type within the function. Of course, this defeats the compiler's type checking, so be careful and double-check the types yourself!

Here is the code:

#include <iostream>
#include <sstream>
#include <cstdlib>      // rand, srand, exit
#include <pthread.h>

using namespace std;

void getRandInt( int *dataVec, int dataSize )
{
        // load up a vector with random integers
        int num;
        for( int i = 0; i < dataSize; i++ ) {
                // integers will be 1-100
                num = rand() % 100 +1;
                if( rand( ) % 2 == 0 ) {
                        // make some integers negative
                        num *= -1;
                }
                dataVec[i] = num;
        }
}

// define a struct to pass information to the threads
struct threadInfo{
        int myID;
        int *dataPtr;
        int dataSize;
        int count;
};


void* partial3SumThread( void* arg ) {
        // type cast the argument back to a struct
        threadInfo * myInfo = static_cast<threadInfo*>( arg );


        // each thread only works on half the array in the outer loop
        // compute the bounds based on the ID we assigned each thread.
        // remember, we only have 2 threads in this case, so we will hard code a 2
        int start = ((myInfo->dataSize / 2) * myInfo->myID );
        int stop = ((myInfo->dataSize / 2)* (myInfo->myID+1));
        if( myInfo->myID == 1 )
                stop =myInfo->dataSize-2;

        // do the naive Sum3 computation.  O(n^3)
        for (int i = start; i < stop; i++)
                for (int j = i+1; j < myInfo->dataSize-1; j++)
                        for (int k = j+1; k < myInfo->dataSize; k++)
                                if (myInfo->dataPtr[i] + myInfo->dataPtr[j] + myInfo->dataPtr[k] == 0){
                                        myInfo->count++;
                                }
        return NULL;
}

int main(int argc, char * argv[] )
{
        int dataSize = 0;

        if( argc < 2 ) {
                std::cerr << "usage: exe  [num of nums] [optional seed value]" << std::endl;
                exit( -1 );
        }
        {
                std::stringstream ss1;
                ss1 << argv[1];
                ss1 >> dataSize;
        }
        if( argc >= 3 ) {
                std::stringstream ss1;
                int seed;
                ss1 << argv[2];
                ss1 >> seed;
                srand( seed );
        }
        else {
                srand( 0 );
        }

        // create a data vector
        int *data = new int[ dataSize ];

        // load it up with random data
        getRandInt( data, dataSize );

        // allocate thread handles
        pthread_t       worker1tid, worker2tid;

        // allocate and set up structs for 2 threads
        threadInfo info1, info2;
        info1.myID = 0;
        info1.dataPtr = data;
        info1.dataSize = dataSize;
        info1.count = 0;
        info2.myID = 1;
        info2.dataPtr = data;
        info2.dataSize = dataSize;
        info2.count = 0;

        // space for the return value of the pthread calls
        int returnVal;
        // launch the worker threads
        if ( (returnVal = pthread_create( &worker1tid, NULL, partial3SumThread, &info1 )) != 0 ) {
                cerr<<  "pthread_create 1: "<< returnVal <<endl;
                exit( 1 );
        }
        if ( (returnVal = pthread_create( &worker2tid, NULL, partial3SumThread, &info2 )) != 0 ) {
                cerr<<  "pthread_create 2: "<< returnVal <<endl;
                exit( 1 );
        }
        // now wait on the threads to finish
        if ( (returnVal = pthread_join( worker1tid, NULL )) != 0 ) {
                cerr<<  "pthread_join 1: "<< returnVal <<endl;
                exit( 1 );
        }

        if ( (returnVal = pthread_join( worker2tid, NULL )) != 0 ) {
                cerr<<  "pthread_join 2: "<< returnVal <<endl;
                exit( 1 );
        }

        cout<< info1.count + info2.count <<endl;
        delete [] data;
}

The program has a relatively simple structure: the core of the Sum3 computation is done in the partial3SumThread function. Two threads running that function are launched; we assign the first an ID of 0 and the second an ID of 1. These thread IDs are manually assigned and stored in the threadInfo structs, which are created and initialized in main() just before the threads are launched. Based on its ID, each thread determines which half of the array it should use in the outer loop of the Sum3 computation (the start/stop calculation at the top of partial3SumThread). The two threads are launched by the two pthread_create calls. Once the threads are launched, the main thread must wait for them to complete before printing the result, so it calls pthread_join once for each thread. Once all worker threads have joined, we print the result.

Here are some running times on a machine with a two-core CPU:

Array Size   Time (s)
100          0.01
200          0.019
300          0.029
400          0.063
500          0.125
600          0.207
700          0.324
800          0.475
900          0.668
1000         0.912

The pthreads version is clearly faster. As expected, the version with two threads runs roughly twice as fast as the version with one thread. For an array of 500, the speedup is 1.9; for an array of 1000, the speedup is 2.08. A speedup of more than 2 is likely due to timing granularity: the Linux time command was used, which is not the most accurate timekeeping method. Other factors could also have an impact. For example, the two threads could affect the cache, causing more cache hits or misses than the single-threaded version. Cache misses cause the processor to stall, wasting time; more cache hits mean the processor stalls less and finishes the work in a shorter amount of time.

Exercise

The pthreads version of the Sum3 program is hard-coded to use 2 threads. Convert the program so that it takes the number of threads to use as a command line argument, and then generates that many threads. Hint: you will have an easier time if you use vectors to hold all of the per-thread information, for example a vector of threadInfo structs, one for each thread.

Thread Communication

In the pthreads version of Sum3, very little thread communication took place; in fact, the only communication was the master thread providing some data to the worker threads in the form of a function argument. Usually, threads communicate with each other through shared memory. The key to understanding shared memory is understanding scoping rules: a thread has access to all memory locations that are in scope for that thread, and to all memory locations reachable through pointers that are in scope for that thread. If a variable has global scope, then ALL threads can read and write it, so two threads can communicate by reading and writing such a value. In our program above, we declared two instances of the threadInfo struct, info1 and info2, in the main() function. Those structs exist in memory, but the names info1 and info2 are in scope only for main(). We store the information that the threads need in those structs, and to provide access to it, we pass a pointer to one of the structs to each thread during the pthread_create call. Each thread can then access its struct through that pointer. Essentially, the master thread has communicated with the worker threads, letting them know their thread ID, the input array, and so on.

Note that only one copy of the data array exists in the program. It is created in the main() function, and pointers to it are stored in the threadInfo structs so that each thread can access the same array. This does not cause any problems because both threads only read the array; they do not modify it. Also note that each struct has its own memory location to store a count, so each thread keeps track of its own count. This is why all the count values from the threadInfo structs must be summed when the result is printed at the end of main().

Exercise

Change the pthreads Sum3 program so that both threads use the same memory location for their count variable. Don't use any other pthread library features. Run the program multiple times. What happens to the result value? Why?

Doing Things a Little Differently with OpenMP

Pthreads is just one method of creating multi-threaded programs. Most languages have some built-in mechanism or library for launching threads. Another mechanism in C/C++ is OpenMP. One drawback of pthreads is that although the core Sum3 computation changed very little in the pthread program, the program did require some significant structural changes. Recall that we had to put our computation into a function, set up a struct to pass data to that function, introduce create and join calls, and so on. Furthermore, we only implemented 2 threads; using more threads requires even more work. So, there is real programming overhead to converting a serial program into a parallel program using pthreads.

OpenMP takes a compiler-based approach to threading. Instead of inserting function calls and re-organizing code to create threads, OpenMP requires you to place compiler directives near portions of code that will be run in parallel so the compiler can generate a team of threads. These directives are known as pragmas. One of the goals of OpenMP is to provide a mechanism whereby serial code can be parallelized using multiple threads with only minor code modifications. The result is that you can write a serial program, debug it while it is serial (a much easier task than debugging parallel code), and then simply add a pragma to achieve parallelism using multiple threads.

As an example, we will write an OpenMP version of the Sum3 program with a structure similar to the pthread version: we will create 2 threads, and each thread will iterate over only a portion of the outer loop. We will need the following OpenMP pragmas and functions:

OpenMP Interface

#pragma omp parallel
{
        // some code
}

Will create a team of n threads where n is the number of computational cores available on the computer on which this code is executed. Each thread will execute the code in the code block. No code will be executed beyond the code block until all threads have joined.

int omp_get_thread_num();

Returns the unique thread identifier of the thread that executes this function. If only a single thread exists, the function always returns 0. If multiple threads exist, the function returns a number in the range from 0 to omp_get_num_threads()-1.

int omp_get_num_threads();

Returns the number of threads in the current team. If only 1 thread exists (the serial portion of the code), it returns 1.

void omp_set_num_threads( int num_threads );

Sets the default number of threads to create in subsequent parallel sections (for example, sections defined by #pragma omp parallel).

g++ -fopenmp source.cpp

Remember to use the OpenMP compiler flag when compiling programs that use OpenMP.
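Here is a minimal sketch that ties these pieces together (a hello-world, not part of the Sum3 program); each thread in the team reports its ID. printf is used because a single printf call is less likely to interleave its output across threads than a chain of << operators:

#include <omp.h>
#include <cstdio>

int main( )
{
        // request 4 threads for subsequent parallel sections
        omp_set_num_threads( 4 );
#pragma omp parallel
        {
                // every thread in the team executes this block
                printf( "hello from thread %d of %d\n",
                        omp_get_thread_num(), omp_get_num_threads() );
        }
        // no thread executes past the block until all have joined
        printf( "back to %d thread(s)\n", omp_get_num_threads() );
}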

With these pragmas and functions, we can duplicate the pthread version of the program by forcing 2 threads to operate on the data. We call omp_set_num_threads(2) just before the parallel section to force 2 threads, then break up the work just as we did before. Much like pthreads, OpenMP threads communicate through memory locations, and again, scoping rules apply: (1) any variable whose scope is external to a parallel section is shared among all threads, and (2) each thread has its own private copy of any variable declared within the scope of a parallel section. So, external to the parallel section we define one integer per thread (count1 and count2) to keep track of the number of triples summing to 0 that each thread sees; we then add those counts together at the end. Recall that in the pthread code we had to explicitly set the thread identifiers to 0 and 1; OpenMP does this for us, and each thread can find its ID by calling omp_get_thread_num():

#include <iostream>
#include <sstream>
#include <cstdlib>      // rand, srand, exit
#include <omp.h>

using namespace std;

void getRandInt( int *dataVec, int dataSize )
{
        // load up a vector with random integers
        int num;
        for( int i = 0; i < dataSize; i++ ) {
                // integers will be 1-100
                num = rand() % 100 +1;
                if( rand( ) % 2 == 0 ) {
                        // make some integers negative
                        num *= -1;
                }
                dataVec[i] = num;
        }
}


int main(int argc, char * argv[] )
{
        int dataSize = 0;

        if( argc < 2 ) {
                std::cerr << "usage: exe  [num of nums] [optional seed value]" << std::endl;
                exit( -1 );
        }
        {
                std::stringstream ss1;
                ss1 << argv[1];
                ss1 >> dataSize;
        }
        if( argc >= 3 ) {
                std::stringstream ss1;
                int seed;
                ss1 << argv[2];
                ss1 >> seed;
                srand( seed );
        }
        else {
                srand( 0 );
        }

        // create a data vector
        int *data = new int[ dataSize ];

        // load it up with random data
        getRandInt( data, dataSize );

        // do the naive Sum3 computation.  O(n^3)
        int count1 = 0;
        int count2 = 0;


        omp_set_num_threads( 2 );
#pragma omp parallel
        {
                int count = 0;
                int myID = omp_get_thread_num();
                // each thread only works on half the array in the outer loop
                // compute the bounds based on the ID we assigned each thread.
                // remember, we only have 2 threads in this case, so we will hard code a 2
                int start = (( dataSize / 2) * myID );
                int stop = ((dataSize / 2)* (myID+1));
                if( myID == 1 )
                        stop =dataSize-2;
                for (int i = start; i < stop; i++)
                        for (int j = i+1; j < dataSize-1; j++)
                                for (int k = j+1; k < dataSize; k++)
                                        if (data[i] + data[j] + data[k] == 0){
                                                count++;
                                        }
                if( myID == 0 )
                        count1 = count;
                else
                        count2 = count;
        }
        cout<< count1 + count2 <<endl;
        delete [] data;
}

Now, some running times. Again, the speedup is roughly 2 for the OpenMP version, but the OpenMP version required much less code reorganization to implement. The running times are a few thousandths of a second higher than the pthread version's at the larger array sizes, likely because the threading is implemented differently: with OpenMP, the compiler and its runtime generate the threading code, and different OpenMP implementations will have slightly different performance characteristics.

Array Size   Time (s)
100          0.005
200          0.014
300          0.032
400          0.066
500          0.124
600          0.212
700          0.333
800          0.492
900          0.696
1000         0.952

Exercise

Compile the serial cubic version of the program and the OpenMP version with compiler optimizations turned on. How do the running times compare then?

Creating a Lot of Threads

Because OpenMP is integrated into the compiler, the overhead of creating threads is small; thus, it is possible to create many more threads than there are available processor cores. An important OpenMP concept is the idea of worker threads: on a computer with 2 cores, the number of worker threads defaults to 2, and even if many more threads are created, only 2 can run at any given time. Therefore, it is acceptable, although not always optimal, to create many more threads than there are processors available. One easy way to put a lot of threads to work is the omp parallel for pragma. This pragma must be placed directly before a for loop; OpenMP then divides the iterations of that loop among a team of threads. To be safe, make sure that the loop's initialization, stopping condition, and increment all appear in the loop declaration itself!

One problem with having many threads work on the loop is that we would also need many count variables (like count1 and count2 above). One alternative is to create a single shared count variable, but then make sure that only one thread accesses it at a time. To ensure this, we put the access in a critical section using OpenMP's critical pragma. Code in a block directly following a #pragma omp critical is guaranteed to be executed by at most one thread at a time. Using these directives, our multi-threaded code looks very similar to the original serial code:

#include <iostream>
#include <sstream>
#include <cstdlib>      // rand, srand, exit
#include <omp.h>

using namespace std;

void getRandInt( int *dataVec, int dataSize )
{
        // load up a vector with random integers
        int num;
        for( int i = 0; i < dataSize; i++ ) {
                // integers will be 1-100
                num = rand() % 100 +1;
                if( rand( ) % 2 == 0 ) {
                        // make some integers negative
                        num *= -1;
                }
                dataVec[i] = num;
        }
}


int main(int argc, char * argv[] )
{
        int dataSize = 0;

        if( argc < 2 ) {
                std::cerr << "usage: exe  [num of nums] [optional seed value]" << std::endl;
                exit( -1 );
        }
        {
                std::stringstream ss1;
                ss1 << argv[1];
                ss1 >> dataSize;
        }
        if( argc >= 3 ) {
                std::stringstream ss1;
                int seed;
                ss1 << argv[2];
                ss1 >> seed;
                srand( seed );
        }
        else {
                srand( 0 );
        }

        // create a data vector
        int *data = new int[ dataSize ];

        // load it up with random data
        getRandInt( data, dataSize );

        int count = 0;
        // do the naive Sum3 computation.  O(n^3)
        #pragma omp parallel for
        for (int i = 0; i < dataSize-2; i++)
                for (int j = i+1; j < dataSize-1; j++)
                        for (int k = j+1; k < dataSize; k++)
                                if (data[i] + data[j] + data[k] == 0){
                                        #pragma omp critical
                                        {
                                                count++;
                                        }
                                }

        cout<< count <<endl;
        delete [] data;
}

The one drawback of the critical section is that execution may serialize on it: if every thread entered the critical section on every iteration of the loop, the program would behave much like a serial program in terms of running time. Placement of the critical section is therefore important! Note that we placed it inside the if statement. Every triple must be tested to see if it sums to 0, but only a small portion of the triples actually do. If we instead placed the if statement inside the critical section, we would get serial behavior. If every triple summed to 0 (an array of all 0s), we would also get serial behavior, regardless of the placement of the critical section with respect to the if statement. Thus, the previous method has more reliable speedups in these edge cases, but this method gets speedups on the input we are likely to see, requires only a few lines of modification, and needs no logic changes!
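For contrast, here is a sketch of the placement to avoid (only the innermost loop is shown). With the if statement inside the critical section, every triple test funnels through the lock, so the threads make essentially no progress in parallel:

                        for (int k = j+1; k < dataSize; k++) {
                                #pragma omp critical
                                {
                                        // the test itself is now serialized: at most
                                        // one thread can examine a triple at a time
                                        if (data[i] + data[j] + data[k] == 0)
                                                count++;
                                }
                        }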

Here are the running times. Note that the critical section does have an impact, but it is not too bad: at an array size of 500 the speedup is 1.9, the same as the pthreads version, and at an array size of 1000 the speedup is 1.9 versus 2.1 for pthreads:

Array Size   Time (s)
100          0.005
200          0.014
300          0.032
400          0.068
500          0.130
600          0.218
700          0.343
800          0.512
900          0.722
1000         0.989

Exercise

Change the OpenMP program that uses a critical section so that it serializes, in 2 ways: first, use an input array of all 0's; then, move the critical section to contain the if statement. Compare running times to the serial version of the algorithm (the cubic one), and give a list of reasons why one is faster or slower than the other.

Reductions

OpenMP has many other options, constructs, and features; check the official documentation for details. One of the advantages of OpenMP is that many of these constructs make it easy to achieve tasks that must be done manually with pthreads. A good example is the reduction.

A reduction simply means to combine many values into a single value using a specified operator. For example, we can reduce an array of numbers to a single sum of all the numbers in the array (by adding them up). Alternatively, we could reduce an array to its max or min value, by searching for the largest or smallest value, respectively, in the array. We can use a reduction to completely get rid of the critical section in the previous version of the code.
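For instance, here is a minimal sketch of a max reduction over an array (this assumes data and dataSize exist as in our programs; the max reduction operator requires OpenMP 3.1 or newer):

        // each thread finds the max over its share of the iterations;
        // OpenMP then combines the per-thread results using max
        int best = data[0];
#pragma omp parallel for reduction(max:best)
        for (int i = 1; i < dataSize; i++)
                if (data[i] > best)
                        best = data[i];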

When declaring an omp parallel for pragma, we can identify variables that will be reduced after all threads have executed, along with the reduction operator to use. Each thread makes its own local copy of any variable named in a reduction clause, and OpenMP automatically combines all of those local copies into a single value using the specified operation.

For example, to get rid of the critical section above, we tell OpenMP to perform a reduction on count. Every thread then gets its own copy of count. We specify a sum reduction so that at the end of thread execution, all the local copies of count are summed into a single variable. The result is that we only have to add a single pragma to achieve a parallel implementation of Sum3; the reduction(+:count) clause on the pragma below specifies that a sum reduction on the variable count will be computed:

#include <iostream>
#include <sstream>
#include <cstdlib>      // rand, srand, exit
#include <omp.h>

using namespace std;

void getRandInt( int *dataVec, int dataSize )
{
        // load up a vector with random integers
        int num;
        for( int i = 0; i < dataSize; i++ ) {
                // integers will be 1-100
                num = rand() % 100 +1;
                if( rand( ) % 2 == 0 ) {
                        // make some integers negative
                        num *= -1;
                }
                dataVec[i] = num;
        }
}


int main(int argc, char * argv[] )
{
        int dataSize = 0;

        if( argc < 2 ) {
                std::cerr << "usage: exe  [num of nums] [optional seed value]" << std::endl;
                exit( -1 );
        }
        {
                std::stringstream ss1;
                ss1 << argv[1];
                ss1 >> dataSize;
        }
        if( argc >= 3 ) {
                std::stringstream ss1;
                int seed;
                ss1 << argv[2];
                ss1 >> seed;
                srand( seed );
        }
        else {
                srand( 0 );
        }

        // create a data vector
        int *data = new int[ dataSize ];

        // load it up with random data
        getRandInt( data, dataSize );

        int count = 0;
        // do the naive Sum3 computation.  O(n^3)
#pragma omp parallel for reduction(+:count)
        for (int i = 0; i < dataSize-2; i++)
                for (int j = i+1; j < dataSize-1; j++)
                        for (int k = j+1; k < dataSize; k++)
                                if (data[i] + data[j] + data[k] == 0)
                                        count++;

        cout<< count <<endl;
        delete [] data;
}

And finally, here are some running times:

Array Size   Time (s)
100          0.005
200          0.013
300          0.031
400          0.068
500          0.125
600          0.213
700          0.332
800          0.505
900          0.697
1000         0.963

The running times are a bit slower than the pthreads version and a bit faster than the OpenMP version with a critical section. Remember, however, that the pthreads version is hard-coded to 2 threads, and the programmer essentially has to implement their own reduction to use more threads. This version required adding one pragma to an otherwise unchanged serial implementation, and it achieves threading that automatically scales to the number of available processors with NO extra work! The code is easy to debug (simply comment out the pragma and debug in serial), uses advanced features, and adapts easily to different hardware configurations. This should convince you of why OpenMP is a popular choice for threading.

Exercise

Look at the scheduling clauses for OpenMP (the wiki has a concise description). Try dynamic scheduling with various chunk sizes on the above program. Report the effects on execution time, and describe why those effects are occurring.

Convert the O(n^2) version of the algorithm to a parallel algorithm using OpenMP. Try to get the fastest time for 100,000 numbers. Make sure to use a computer with multiple cores (and hyperthreading turned off!)

[GOM1995CG]
  A. Gajentaan and M. H. Overmars, "On a class of O(n^2) problems in computational geometry", Computational Geometry: Theory and Applications, 5(3):165-185, 1995. doi:10.1016/0925-7721(95)00022-2