As the final lab assignment for our Concurrent Programming module, we were asked to optimize matrix multiplication. C++ was set as the language and Linux was set as the environment for this task. The tasks for this lab assignment were as follows:

- Write a naive sequential program to perform matrix multiplication on a n x n matrix containing double precision values. The matrices were to be filled with random double values. Only the time taken for the actual multiplication part was to be recorded.
- Identify a C++ library for Linux environment which supports parallel for loops and write a parallelized version of the above program.
- Running the above 2 versions of matrix multiplication for matrix sizes of n = 100 to n = 1000 in increments of 100 and record the time taken for each of these executions.
- We were then supposed to look in to ways to optimize matrix multiplication taking the processor architecture into account as well and come up with a 3rd version of the matrix multiplication making use of both parallel for loops and various optimization techniques.

The full source code of this project can be found here: https://github.com/pubudu91/mm-optimization

## Naive Sequential Version

For this, the standard matrix multiplication algorithm was used. If the matrices are called A, B and C,

for (int i = 0; i < n; i++) | |

for (int j = 0; j < n; j++) | |

for (int k = 0; k < n; k++) | |

C[i][j] += A[i][k] * B[k][j]; |

However, since dynamic 2D array creation is not that straight forward in C++, a 1D array was used, with the use of pointer arithmetic to manipulate it like a 2D array. If a 2D array was to be used, the individual rows could be stored all over the memory, resulting in a higher number of cache misses. A 1D array reduces this number drastically since a 1D array is allocated contiguous memory blocks. However it should be noted that this also could count as an optimization technique since it drastically reduces the no. of caches misses. Anyhow, the C++ code for the naive matrix multiplication is given below.

void multiply1D(Matrix1D *A, Matrix1D *B, Matrix1D *C, int n) { | |

for (int i = 0; i < n; ++i) { | |

double *Arow = &A->matrix[i * n]; | |

double *Crow = &C->matrix[i * n]; | |

for (int j = 0; j < n; ++j) { | |

double sum = 0; | |

for (int k = 0; k < n; ++k) | |

sum += *(Arow + k) * B->matrix[k * n + j]; | |

*(Crow + j) = sum; | |

} | |

} | |

} |

I decided on using 1D arrays for the naive version as well since the use of 2D arrays caused some inconsistent test results.

## Parallelized Version

For parallelizing the above algorithm, I decided on using the OpenMP library. Parallelizing the for loops is easy using OpenMP.

void parallelMultiply1D(Matrix1D *A, Matrix1D *B, Matrix1D *C, int n) { | |

omp_set_num_threads(4); | |

#pragma omp parallel for | |

for (int i = 0; i < n; ++i) { | |

double *Arow = &A->matrix[i * n]; | |

double *Crow = &C->matrix[i * n]; | |

for (int j = 0; j < n; ++j) { | |

double sum = 0; | |

for (int k = 0; k < n; ++k) | |

sum += *(Arow + k) * B->matrix[k * n + j]; | |

*(Crow + j) = sum; | |

} | |

} | |

} |

Note that the above function differs from the previous one by just 2 lines. The omp_set_num_threads() function is called to set the number of threads to be used when executing parallel blocks of code. The omp directive given next marks the for loop immediately following the directive to be executed in parallel. The inner loops aren’t parallelized since it didn’t offer any performance gains.

## Parallelized Vs. Naive Sequential

Given below is a comparison of execution times of the above 2 approaches as the size of the matrix increase.

The speedup was calculated as per the following formula:

Speedup = Time taken by the naive sequential approach / Time taken by the improved approach

The graph given below depicts the change of the speedup offered by the parallelized version as the size of the matrix increases.

## Optimizing the Multiplication

For optimizing matrix multiplication, I considered 3 approaches:

### Transposing

In this technique, matrix B is transposed before doing the matrix multiplication. The reason for this is in the inner loop, a column of matrix B is iterated. Since languages like C/C++ uses row major ordering for laying out a matrix, this causes non-contiguous memory blocks to be accessed, resulting in cache misses, which in turn translates into slower execution times. Transposing matrix B allows us to change the multiplication step to A[i][k] * B T [j][k] which results in iterating through a row of matrix B, reducing the number of cache misses, and thus, decreasing the execution time.

### Loop interchange

Interchanging the middle loop and the inner loop has a similar effect to transposing. Instead of iterating over columns, it allows to iterate over the rows of matrix B. Just as in the transposing technique, this results in lesser cache misses, thus improving the performance.

void multiply1DMatrixIKJ(Matrix1D *A, Matrix1D *B, Matrix1D *C, int n) { | |

for (int i = 0; i < n; ++i) { | |

double *Arow = &A->matrix[i * n]; | |

double *Crow = &C->matrix[i * n]; | |

for (int k = 0; k < n; ++k) { | |

double temp = *(Arow + k); | |

double *Brow = &B->matrix[k * n]; | |

for (int j = 0; j < n; ++j) | |

*(Crow + j) += temp * *(Brow + j); | |

} | |

} | |

} |

### Use of Basic Linear Algebra Subprograms (BLAS) standard

BLAS are a set of routines providing basic building blocks for performing various operations on matrices and vectors. There are 3 levels in BLAS:

- Level 1 – has routines for performing various operations on scalars, scalar-vector and

vector-vector. - Level 2 – has routines for performing matrix-vector operations
- Level 3 – has routines for performing matrix-matrix operationsThe BLAS specification of these routines are the de facto standard low-level routines used in implementing

linear algebra libraries.

The BLAS abstraction allows there to be various implementations. While there exists a reference implementation, there are a variety of BLAS libraries, some of which are optimized to certain architectures. ATLAS, one of the more prominent implementations, allows it to be optimized to whatever system it is running on. BLAS implementations offer great improvements in performance. But they do so mostly by using commonly used optimization techniques. For example, in level 1 functions, they can gain performance boosts by making use of SSE instructions which operate on vectors. And the level 2 functions can then be implemented using these and in turn, level 3 functions can be implemented using level 2 functions.

The BLAS implementations also make use of cache hierarchies and instruction parallelism. The larger matrices broken down into smaller matrices (kernels) for which low level implementations are provided. These kernels can then be executed in parallel in a map-reduce pattern.

To use BLAS in C++, we make use of the C interface provided for BLAS called cblas. More on that here: http://www.netlib.org/blas/blast-forum/cinterface.pdf

To do matrix multiplication using BLAS L1 routines, we can make use of the ddot() routine. This routine computes the dot product of 2 double valued vectors. The cblas interface for this routine is cblas_ddot(). This function is of the following form:

double cblas_ddot(const int N, const double *X, const int incX, const double *Y, const int incY);

- N – size of the matrix
- *X – pointer to a 1D array of type double
- incX
- *Y – pointer to a 1D array of type double
- incY

Following is the multiplication function, making use of cblas_ddot(). Here, what is done is treat a matrix as a collection of vectors and apply the dot product repeatedly.

void blasL1(Matrix1D *A, Matrix1D *B, Matrix1D *C, int n) { | |

double *Cmat = C->matrix; | |

double *Amat = A->matrix; | |

double *Bmat = B->matrix; | |

double *Arv, *Crv; | |

omp_set_num_threads(4); | |

#pragma omp parallel for | |

for (int i = 0; i < n; ++i) { | |

Arv = Amat + i * n; | |

Crv = Cmat + i * n; | |

for (int j = 0; j < n; j++) | |

*(Crv + j) = cblas_ddot(n, Arv, 1, &Bmat[j * n], 1); | |

} | |

} |

To do matrix multiplication using BLAS L3 routines, we can make use of the dgemm() routine. This routine allows us to multiply 2 general matrices of double type. The multiplication is of the following form:

C := alpha * op( A ) * op( B ) + beta * C

where op( X ) is one of op( X ) = X or op( X ) = X**T,

The C interface for this routine is cblas_dgemm() and it takes the following form:

void cblas_dgemm(const enum CBLAS_ORDER Order, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_TRANSPOSE TransB, const int M, const int N, const int K, const double alpha, const double *A, const int lda, const double *B, const int ldb, const double beta, double *C, const int ldc);

- Order – Whether the matrices are row major or column major. The 2 possible values for this parameter are “CblasRowMajor” and “CblasColMajor”
- TransA – Specifies whether to transpose A or not. The 3 possible values for this parameter are “CblasNoTrans”, “CblasTrans” and “CblasConjTrans”
- TransB – Same as above. Specifies whether to transpose B or not
- M – Specifies the number of rows in matrices A and C
- N – Specifies the number of columns in matrices B and C
- K – Specifies the number of columns in A and the number of rows in B
- alpha – Specifies the scalar alpha in the equation given above
- *A – A pointer to matrix A, which is in the form of a 1D array of type double
- lda – If TransA = CblasNoTrans, lda = max(1, m) else lda = max(1,k)
- *B – A pointer to matrix B, which is in the form of a 1D array of type double
- ldb – If TransB = CblasNoTrans, ldb = max(1, k) else lda = max(1,n)
- beta – Specifies the scalar beta
- *C – A pointer to matrix C, which is in the form of a 1D array of type double
- ldc – Must be at least max(1,m)

Following is the multiplication function making use of the cblas_dgemm() function.

void blasL3(Matrix1D *A, Matrix1D *B, Matrix1D *C, int n) { | |

cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n, 1, A->matrix, n, B->matrix, n, 0, C->matrix, n); | |

} |

## Performance Comparison

Given below are graphs comparing the execution times for various optimization techniques and the speedups gained through them.

Given below is the same graph as above, but considering just up to n=500 for clarity (same legend as above graph).

Given below are the speedup graphs for the above techniques.

Given below is the speedup plotted against n, omitting BLAS based optimizations for clarity in comparing the performance of other techniques (same legend as above graph).