==========================================================================
Sum3 Multi-Core OpenMP Implementation
==========================================================================

At this point we have seen two possible algorithms and implementations for the
Sum3 problem:

1. A :math:`O(n^3)` algorithm that is extremely easy to implement, but that
   suffers from poor execution time.
2. A :math:`O(n^2)` algorithm that is much faster in terms of execution time,
   but that requires care to handle sequences of repeating values correctly.

The question now is whether or not we can do better. It turns out that,
algorithmically, we cannot...at least not yet; there are currently no known
algorithms that solve the Sum3 problem faster than :math:`O(n^2)` time. In
fact, a class of problems exists called Sum3-hard problems [GOM1995CG]_. Any
problem to which the Sum3 problem is constant time reducible is Sum3-hard; the
implication is that a sub-quadratic time solution to any problem in the
Sum3-hard class provides a sub-quadratic time solution to Sum3.

Introducing Parallelism with Multiple Threads
===============================================

In terms of running time, we can do better with the Sum3 problem. One way to
improve running time is to utilize a multi-core CPU with a multi-threaded
program. The general approach is to divide the work that must be completed
among multiple cores. Ideally, if we evenly divide the work between two
processors, the program should run twice as fast. Another way to express this
is to say that the program should have a 2x **speedup** over the serial
version. The concept of speedup is a useful measure of the effectiveness of a
particular optimization, and is defined as follows:

.. admonition:: Definition

   Let :math:`t_{old}` be the running time of a program and :math:`t_{new}` be
   the running time of the same program after it has been optimized. The
   **speedup** of the new program is the running time of the old program
   divided by the running time of the new program:

   **speedup** :math:`= t_{old} / t_{new}`
Design of a Multi-Threaded Algorithm
-------------------------------------

Designing a multi-threaded program, or a parallel algorithm for that matter,
requires a different thought process than designing a serial program. When
designing a serial algorithm, the focus is usually on determining the steps
that must be completed in succession to achieve an answer. When developing a
multi-threaded program, one must think in terms of how the work can be divided
among threads. It is sometimes useful to begin with a serial program and try
to modify it to divide work, and other times it is easier to simply start from
scratch.

.. admonition:: Concept

   **How should the work be divided among threads?**

   In order to effectively parallelize work, each thread must be able to do a
   portion of the work. Ideally, each thread will take roughly the same amount
   of time to execute; it is generally undesirable for a group of threads to
   complete their work quickly and sit around waiting on a single, slow thread
   to finish.

Let's take the cubic version of the Sum3 algorithm as an example (the core of
the program is listed below). The main work of the program consists of the
triply nested loops that generate all possible triples of numbers from the
input array. A reasonable method of dividing the work is to create two threads
such that each thread generates roughly half of the triples that must be
tested. We can achieve this by modifying the outer loop such that one thread
will only compute triples in which the first number in the triple comes from
the first half of the array, and the second thread will compute triples in
which the first number comes from the second half of the array. Such a
modification requires only minor changes to the core of the program.
.. highlight:: c
   :linenothreshold: 0

::

   #include <iostream>

   using namespace std;

   int main( )
   {
       int dataSize = 5;
       int* data = new int[ dataSize ];
       data[0] = -1;
       data[1] = -2;
       data[2] = 0;
       data[3] = 2;
       data[4] = 3;

       // do the naive Sum3 computation. O(n^3)
       int count = 0;
       for (int i = 0; i < dataSize-2; i++)
           for (int j = i+1; j < dataSize-1; j++)
               for (int k = j+1; k < dataSize; k++)
                   if (data[i] + data[j] + data[k] == 0)
                       count++;

       cout << count << endl;

       delete [] data;
       return 0;
   }

A pthreads implementation of this scheme is shown below. Since a thread entry
function can accept only a single ``void *`` argument, the program packs
everything a thread needs (its ID, a pointer to the data, the data size, and a
slot for its partial count) into a struct:

::

   #include <pthread.h>
   #include <iostream>
   #include <sstream>
   #include <cstdlib>

   using namespace std;

   void getRandInt( int *dataVec, int dataSize )
   {
       // load up a vector with random integers
       int num;
       for( int i = 0; i < dataSize; i++ )
       {
           // integers will be 1-100
           num = rand() % 100 + 1;
           if( rand( ) % 2 == 0 )
           {
               // make some integers negative
               num *= -1;
           }
           dataVec[i] = num;
       }
   }

   // define a struct to pass information to the threads
   struct threadInfo
   {
       int myID;
       int *dataPtr;
       int dataSize;
       int count;
   };

   void* partial3SumThread( void* arg )
   {
       // type cast the argument back to a struct
       threadInfo * myInfo = static_cast< threadInfo * >( arg );

       // each thread only works on half the array in the outer loop.
       // compute the bounds based on the ID we assigned each thread.
       // remember, we only have 2 threads in this case, so we will hard code a 2
       int start = ((myInfo->dataSize / 2) * myInfo->myID );
       int stop = ((myInfo->dataSize / 2) * (myInfo->myID+1));
       if( myInfo->myID == 1 )
           stop = myInfo->dataSize-2;

       // do the naive Sum3 computation. O(n^3)
       for (int i = start; i < stop; i++)
           for (int j = i+1; j < myInfo->dataSize-1; j++)
               for (int k = j+1; k < myInfo->dataSize; k++)
                   if (myInfo->dataPtr[i] + myInfo->dataPtr[j] + myInfo->dataPtr[k] == 0)
                   {
                       myInfo->count++;
                   }
       return NULL;
   }

   int main(int argc, char * argv[] )
   {
       int dataSize = 0;

       if( argc < 2 )
       {
           std::cerr << "usage: exe [num of nums] [optional seed value]" << std::endl;
           exit( -1 );
       }

       {
           std::stringstream ss1;
           ss1 << argv[1];
           ss1 >> dataSize;
       }

       if( argc >= 3 )
       {
           std::stringstream ss1;
           int seed;
           ss1 << argv[2];
           ss1 >> seed;
           srand( seed );
       }
       else
       {
           srand( 0 );
       }

       // create a data vector
       int *data = new int[ dataSize ];

       // load it up with random data
       getRandInt( data, dataSize );

       // allocate thread handles
       pthread_t worker1tid, worker2tid;

       // allocate and set up structs for 2 threads
       threadInfo info1, info2;
       info1.myID = 0;
       info1.dataPtr = data;
       info1.dataSize = dataSize;
       info1.count = 0;
       info2.myID = 1;
       info2.dataPtr = data;
       info2.dataSize = dataSize;
       info2.count = 0;

       // allocate space for a return value
       int returnVal;

       // call the worker threads
       if ( returnVal = pthread_create( &worker1tid, NULL, partial3SumThread, &info1 ) )
       {
           cerr << "pthread_create 1: " << returnVal << endl;
           exit( -1 );
       }
       if ( returnVal = pthread_create( &worker2tid, NULL, partial3SumThread, &info2 ) )
       {
           cerr << "pthread_create 2: " << returnVal << endl;
           exit( -1 );
       }

       // wait for both threads to finish, then combine the partial counts
       pthread_join( worker1tid, NULL );
       pthread_join( worker2tid, NULL );

       cout << info1.count + info2.count << endl;

       delete [] data;
       return 0;
   }

Managing threads by hand with pthreads requires a fair amount of boilerplate:
a struct to pass arguments, explicit creation and joining of threads, and
manual combination of the partial results. The OpenMP standard automates much
of this work through compiler directives. The program below performs the same
manual division of work, but uses an OpenMP parallel region in place of
explicit thread management; each thread discovers its ID with
``omp_get_thread_num()``.

::

   #include <omp.h>
   #include <iostream>
   #include <sstream>
   #include <cstdlib>

   using namespace std;

   void getRandInt( int *dataVec, int dataSize )
   {
       // load up a vector with random integers
       int num;
       for( int i = 0; i < dataSize; i++ )
       {
           // integers will be 1-100
           num = rand() % 100 + 1;
           if( rand( ) % 2 == 0 )
           {
               // make some integers negative
               num *= -1;
           }
           dataVec[i] = num;
       }
   }

   int main(int argc, char * argv[] )
   {
       int dataSize = 0;

       if( argc < 2 )
       {
           std::cerr << "usage: exe [num of nums] [optional seed value]" << std::endl;
           exit( -1 );
       }

       {
           std::stringstream ss1;
           ss1 << argv[1];
           ss1 >> dataSize;
       }

       if( argc >= 3 )
       {
           std::stringstream ss1;
           int seed;
           ss1 << argv[2];
           ss1 >> seed;
           srand( seed );
       }
       else
       {
           srand( 0 );
       }

       // create a data vector
       int *data = new int[ dataSize ];

       // load it up with random data
       getRandInt( data, dataSize );

       // do the naive Sum3 computation. O(n^3)
       int count1 = 0;
       int count2 = 0;

       omp_set_num_threads( 2 );

       #pragma omp parallel
       {
           int count = 0;
           int myID = omp_get_thread_num();

           // each thread only works on half the array in the outer loop.
           // compute the bounds based on the ID we assigned each thread.
           // remember, we only have 2 threads in this case, so we will hard code a 2
           int start = ((dataSize / 2) * myID );
           int stop = ((dataSize / 2) * (myID+1));
           if( myID == 1 )
               stop = dataSize-2;

           for (int i = start; i < stop; i++)
               for (int j = i+1; j < dataSize-1; j++)
                   for (int k = j+1; k < dataSize; k++)
                       if (data[i] + data[j] + data[k] == 0)
                       {
                           count++;
                       }

           if( myID == 0 )
               count1 = count;
           else
               count2 = count;
       }

       cout << count1 + count2 << endl;

       delete [] data;
       return 0;
   }

In the program above we still computed each thread's loop bounds by hand.
OpenMP can also divide loop iterations among threads automatically with the
``parallel for`` directive, shown below. Because every thread now increments a
single shared ``count`` variable, the increment must be protected with a
``critical`` directive; without it, two threads could read, increment, and
write ``count`` at the same time and updates would be lost.

::

   #include <omp.h>
   #include <iostream>
   #include <sstream>
   #include <cstdlib>

   using namespace std;

   void getRandInt( int *dataVec, int dataSize )
   {
       // load up a vector with random integers
       int num;
       for( int i = 0; i < dataSize; i++ )
       {
           // integers will be 1-100
           num = rand() % 100 + 1;
           if( rand( ) % 2 == 0 )
           {
               // make some integers negative
               num *= -1;
           }
           dataVec[i] = num;
       }
   }

   int main(int argc, char * argv[] )
   {
       int dataSize = 0;

       if( argc < 2 )
       {
           std::cerr << "usage: exe [num of nums] [optional seed value]" << std::endl;
           exit( -1 );
       }

       {
           std::stringstream ss1;
           ss1 << argv[1];
           ss1 >> dataSize;
       }

       if( argc >= 3 )
       {
           std::stringstream ss1;
           int seed;
           ss1 << argv[2];
           ss1 >> seed;
           srand( seed );
       }
       else
       {
           srand( 0 );
       }

       // create a data vector
       int *data = new int[ dataSize ];

       // load it up with random data
       getRandInt( data, dataSize );

       int count = 0;

       // do the naive Sum3 computation. O(n^3)
       #pragma omp parallel for
       for (int i = 0; i < dataSize-2; i++)
           for (int j = i+1; j < dataSize-1; j++)
               for (int k = j+1; k < dataSize; k++)
                   if (data[i] + data[j] + data[k] == 0)
                   {
                       #pragma omp critical
                       {
                           count++;
                       }
                   }

       cout << count << endl;

       delete [] data;
       return 0;
   }

A ``critical`` section inside the innermost loop forces the threads to take
turns, which can erase much of the benefit of running in parallel. OpenMP's
``reduction`` clause is a better fit here: each thread gets a private copy of
``count``, and the private copies are summed automatically when the loop ends.

::

   #include <omp.h>
   #include <iostream>
   #include <sstream>
   #include <cstdlib>

   using namespace std;

   void getRandInt( int *dataVec, int dataSize )
   {
       // load up a vector with random integers
       int num;
       for( int i = 0; i < dataSize; i++ )
       {
           // integers will be 1-100
           num = rand() % 100 + 1;
           if( rand( ) % 2 == 0 )
           {
               // make some integers negative
               num *= -1;
           }
           dataVec[i] = num;
       }
   }

   int main(int argc, char * argv[] )
   {
       int dataSize = 0;

       if( argc < 2 )
       {
           std::cerr << "usage: exe [num of nums] [optional seed value]" << std::endl;
           exit( -1 );
       }

       {
           std::stringstream ss1;
           ss1 << argv[1];
           ss1 >> dataSize;
       }

       if( argc >= 3 )
       {
           std::stringstream ss1;
           int seed;
           ss1 << argv[2];
           ss1 >> seed;
           srand( seed );
       }
       else
       {
           srand( 0 );
       }

       // create a data vector
       int *data = new int[ dataSize ];

       // load it up with random data
       getRandInt( data, dataSize );

       int count = 0;

       // do the naive Sum3 computation. O(n^3)
       #pragma omp parallel for reduction(+:count)
       for (int i = 0; i < dataSize-2; i++)
           for (int j = i+1; j < dataSize-1; j++)
               for (int k = j+1; k < dataSize; k++)
                   if (data[i] + data[j] + data[k] == 0)
                   {
                       count++;
                   }

       cout << count << endl;

       delete [] data;
       return 0;
   }
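As a practical note, the listings above can be built with ``g++``: the pthreads version needs ``-pthread`` and the OpenMP versions need ``-fopenmp`` (without it, the ``#pragma omp`` directives are silently ignored and the program runs serially). The file names below are placeholders for wherever the listings are saved:

```shell
# POSIX threads version (hypothetical file name)
g++ -O2 -pthread sum3_pthread.cpp -o sum3_pthread
./sum3_pthread 1000 42

# OpenMP versions (hypothetical file names)
g++ -O2 -fopenmp sum3_omp.cpp -o sum3_omp
./sum3_omp 1000 42
```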