
What we're talking about when we talk about shared memory parallelism - Boosting

This time, I brought the first draft of the paper to be submitted for the seminar, which is admittedly a lazy way to write this post.

A Brief Overview of Shared-Memory Parallelism in Parallel Computing

1. Why parallel?

The short answer is speed. Single-core performance is limited by what current transistors can deliver and by rapidly increasing energy consumption, so parallelism has become necessary, and it brings many benefits besides. It is not free, however: you generally have to restructure the program yourself rather than expect an API to distribute a serial program automatically. This leads to the next question: how do we write parallel programs? Truth be told, there are a number of possible answers, but most of them share the idea of partitioning the work among cores. The two most commonly used approaches are task-parallelism and data-parallelism. In task-parallelism we partition the problem into separate tasks that are carried out on different cores, while in data-parallelism each core carries out roughly the same operations on its own part of the data.
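To make the two approaches concrete, here is a minimal OpenMP sketch; the array size and the two printed "tasks" are made up purely for illustration:

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double a[N], b[N], c[N];

        /* Data-parallelism: every thread performs the same operation
           (an element-wise addition) on its own chunk of the data. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        /* Task-parallelism: different threads carry out different tasks. */
        #pragma omp parallel sections
        {
            #pragma omp section
            printf("task 1: e.g. preprocess the input\n");
            #pragma omp section
            printf("task 2: e.g. write a log entry\n");
        }
        return 0;
    }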

2. As mentioned above, when we write explicitly parallel programs we focus on two major types of parallel systems: shared-memory and distributed-memory.

What is the idea behind a shared-memory system? In a shared-memory system the processors are connected to the memory via an interconnection network, so that every core can access each memory location. There are different hardware structures that implement this idea: either all processors are connected directly to the main memory, or each processor has its own block of memory that the other processors can reach through special hardware built into the processors. Whatever the implementation, shared-memory parallel computers have in common that all processors can access all memory as a global address space, which means multiple processors can operate independently while sharing the same memory resources. Furthermore, changes that one processor makes to a memory location are visible to all other processors.

Interconnection topologies

The first type of system, in which all processors are connected directly to main memory, is called a uniform memory access, or UMA, system, while the second type is called a nonuniform memory access, or NUMA, system. In a UMA system the time for every core to access any memory location is the same, while in a NUMA system the access time varies depending on where in the topology the data sits relative to the core.

NUMA systems
In a nonuniform memory access architecture, each CPU has a memory module physically next to it, so accesses to this local module are faster than accesses to a module attached to another CPU.

Advantages: The global address space provides a user-friendly programming perspective on memory. Data sharing between tasks is both fast and uniform due to the proximity of memory to the CPUs.

Disadvantages: The primary disadvantage is the lack of scalability between memory and CPUs. The programmer is responsible for the synchronization constructs that ensure “correct” access to global memory.

Interconnection networks

The interconnect is an important part of the hardware of a shared-memory system: the efficiency of the data exchange between memory and processors has a huge impact on the final execution time.

Shared-memory interconnects

Here we explain the two most widely used interconnects in shared-memory systems: buses and crossbars. The key property of a bus is that the devices connected to it share the communication wires. As long as the number of cores is small, a bus is cheap and flexible. However, the expected performance decreases as the number of cores connected to the bus increases, so in larger shared-memory systems buses are replaced by switched interconnects. As the name suggests, switched interconnects use switches to route the data among the connected devices. (Figure of crossbars) The individual switches can be in one of the states shown in the following figure. (Figure of crossbar states) Crossbar switches are too expensive for large-scale systems, but they are useful in some small systems. We can, however, simplify the idea, which leads to Omega (or Delta) interconnects: similar to crossbars, but with fewer paths. (Figure of Omega interconnects)

Cache Issues

Before we go further on this topic, let us first recall the idea of caching. To reduce the time processors spend accessing data in main memory, we add a block of small, fast memory, the cache, close to each processor and keep recently used data there, so that most accesses do not have to go all the way to main memory.

Cache Coherency

Explanation: To understand these issues, suppose we have a shared-memory system with two cores, each of which has its own private data cache. As long as the two cores only read the shared data, there is no problem. For example, suppose that x is a shared variable that has been initialized to 3. If one core executes an instruction that changes the value of x to 7 in its cache, while the other core is operating on x from its own cache at the same time, which value does the second core see: 3 or 7? (could be a figure illustrating the situation described above)
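As a minimal sketch of that scenario in OpenMP (x is from the example above, `seen` is an illustrative name; the code deliberately contains a race, so the printed value is not predictable):

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        int x = 3;       /* shared variable, initialized to 3 */
        int seen = 0;    /* what the second core observes     */

        #pragma omp parallel num_threads(2) shared(x, seen)
        {
            if (omp_get_thread_num() == 0)
                x = 7;        /* core 0 changes x in its cache          */
            else
                seen = x;     /* core 1 reads x at about the same time:
                                 does it see 3 or 7?                    */
        }
        printf("the second core saw x = %d\n", seen);
        return 0;
    }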

The outcome of the situation above is unpredictable regardless of which write policy the system uses, because the problem arises inside the processors' private caches. An intuitive solution is to “inform” the other processors whenever a cached variable is updated, so that the copies stored in their caches are updated (or invalidated) as well.

How to solve this issue?

Snooping cache coherence. The idea behind snooping comes from bus-based systems: when the cores share a bus, any signal transmitted on the bus can be observed by all the cores connected to it. Thus, when one core updates a variable that is stored in its cache, the other cores “snooping” the bus see the update and mark the corresponding line in their own caches as invalid. Note, however, that the protocol only informs the other processors that the cache line containing the variable, say x, has been updated, not that x itself has been updated.

Directory-based cache coherence. In large networks broadcasts are expensive, because every update can trigger a costly broadcast. In a NUMA system, for example, core 0 can access a variable x stored in core 1's memory, and such remote accesses are already slower than local ones; snooping cache coherence slows the program down further, since an interconnect shared by a large number of processors becomes relatively slow. Directory-based cache coherence solves this problem. In this protocol we introduce a data structure called a directory, which stores the status of each cache line. Each core's memory (and cache) sets aside some space for the part of the directory that records the status of the cache lines in that core's local memory. Thus, when a line is read into, say, core 0's cache, the directory entry corresponding to that line is updated to indicate that core 0 has a copy of the line. When a variable is updated, the directory is consulted, and the copies of that variable's cache line held in other cores' caches are invalidated. Clearly there is substantial additional storage required for the directory, but when a cached variable is updated, only the cores storing that line need to be contacted.
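As a rough sketch of what a directory might track, here is an illustrative C structure; the state names, NUM_CORES, and the on_write helper are assumptions for the sake of the example, not a description of any particular machine:

    #define NUM_CORES 8

    /* Possible states of a cache line, as recorded by the directory. */
    typedef enum { UNCACHED, SHARED, MODIFIED } line_state_t;

    /* One directory entry per cache line of the local memory. */
    typedef struct {
        line_state_t  state;               /* current state of the line          */
        unsigned char sharers[NUM_CORES];  /* sharers[c] != 0: core c has a copy */
    } dir_entry_t;

    /* Conceptually, when core `writer` writes to the line, only the cores
       recorded as sharers are contacted and their copies invalidated. */
    static void on_write(dir_entry_t *entry, int writer) {
        for (int c = 0; c < NUM_CORES; c++) {
            if (c != writer && entry->sharers[c]) {
                /* send an invalidate message to core c (hardware, not shown) */
                entry->sharers[c] = 0;
            }
        }
        entry->sharers[writer] = 1;
        entry->state = MODIFIED;
    }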

These protocols solve the cache coherence issue, but they introduce a new problem: false sharing. We now know that problems can occur when the same variable is cached by different processors and those processors try to modify it, but there can also be a problem when processors access two different variables that happen to be stored in the same cache line.

False Sharing

False sharing occurs when threads on different processors modify variables that reside on the same cache line. When one core updates a variable in one cache line, and another core wants to access another variable in the same cache line, it will have to access main memory, since the unit of cache coherence is the cache line. That is, the second core only “knows” that the line it wants to access has been updated. It does not know that the variable it wants to access has not been changed.

Example: Consider the declaration int x, y; Since x and y are declared adjacently, most compilers will assign them contiguous memory addresses. Thus, unless one of them sits at the boundary of a cache block, when they are cached they will be stored in the same cache line. Now suppose one thread repeatedly writes to x while another thread works only with y: although the two threads never touch the same variable, every write to x invalidates the cache line in the other thread's cache, which must then reload the line to get at y.
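A minimal sketch that provokes this situation (the iteration count is arbitrary; with adjacent ints the two counters will almost certainly share a cache line):

    #include <omp.h>
    #include <stdio.h>

    #define ITERS 100000000L

    int main(void) {
        int x = 0, y = 0;   /* adjacent, so very likely in the same cache line */

        #pragma omp parallel num_threads(2) shared(x, y)
        {
            if (omp_get_thread_num() == 0)
                for (long i = 0; i < ITERS; i++) x++;  /* thread 0 touches only x */
            else
                for (long i = 0; i < ITERS; i++) y++;  /* thread 1 touches only y */
        }
        /* Neither thread ever reads the other's variable, yet every write
           invalidates the shared cache line in the other core's cache.   */
        printf("x = %d, y = %d\n", x, y);
        return 0;
    }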

3. OpenMP

OpenMP is an API designed for shared-memory parallel programming. The “MP” in OpenMP stands for “multiprocessing”. A nice property of OpenMP is that the programmer does not need to specify explicitly how each thread behaves: the programmer simply marks a block of code as something to be executed in parallel, and the decision of which thread executes what is left to the compiler and the run-time system. In other words, OpenMP relies on compiler support rather than working purely as a library of functions. This convenience in writing parallel programs does not come for free: in exchange we give up the power to program virtually any possible thread behaviour.

(give an example of a program in OpenMP, then introduce the basic rules of writing program with OpenMP)
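One candidate for this placeholder is the classic OpenMP “hello, world”, a minimal sketch (compiled with, e.g., gcc -fopenmp):

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        /* The parallel directive forks a team of threads; each thread runs
           the block below, and the runtime decides which thread runs where. */
        #pragma omp parallel
        {
            int my_rank      = omp_get_thread_num();
            int thread_count = omp_get_num_threads();
            printf("Hello from thread %d of %d\n", my_rank, thread_count);
        }
        return 0;
    }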

4. Usage in high performance computing

Parallelism does not always make execution faster; sometimes the more parallelism we add, the slower the program runs, which means we need to reconsider the task before we decide to implement it in parallel.

(table with run times of Dijkstra with 1,000 nodes and different numbers of threads)
(table with run times of Dijkstra with 25,000 nodes and different numbers of threads)

Possible ways to improve performance when writing parallel programs with OpenMP on shared-memory systems:

  1. False sharing could be a problem. As an example, consider a matrix-vector multiplication, $y = A * x$, parallelized with the following loop.
    #pragma omp parallel for num_threads(thread_count) \
            default(none) private(i, j) shared(A, x, y, m, n)
    for (i = 0; i < m; i++) {
        y[i] = 0.0;
        for (j = 0; j < n; j++) {
            y[i] += A[i][j] * x[j];
        }
    }

We will compare the performance for input matrices of dimension 8,000,000 x 8 and 8 x 8,000,000. (efficiency table here) Although we may face much larger problems in high performance computing, these two shapes are already enough to show the effect.
Why can more threads be slower than fewer threads? Suppose for the moment that threads 0 and 1 are assigned to one of the processors and threads 2 and 3 are assigned to the other. Also suppose that for the 8 x 8,000,000 problem all of y is stored in a single cache line. Then every write to some element of y will invalidate the line in the other processor's cache. For example, each time thread 0 updates y[0] in the statement $y[i] += A[i][j] * x[j];$, and thread 2 or 3 is executing this code, it will have to reload y. Each thread updates each of its components 8,000,000 times, so with this assignment of threads to processors and of components of y to cache lines, all the threads have to reload y a great many times. This happens even though only one thread ever accesses any one component of y; for example, only thread 0 accesses $y[0]$.

(Explain why false sharing may not be a problem here)

Possible ways of avoiding false sharing in the matrix-vector multiplication program: One solution is to “pad” the y vector with dummy elements, so that an update by one thread is guaranteed not to touch a cache line used by another thread. An alternative is for each thread to accumulate into its own private storage during the loop and only update the shared storage once its iterations are done.
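A minimal sketch of the second alternative, reusing the names from the loop above (tmp is an illustrative addition): each row is accumulated in a private temporary, so the shared vector y, and with it the contested cache line, is written only once per row.

    #pragma omp parallel for num_threads(thread_count) \
            default(none) private(i, j) shared(A, x, y, m, n)
    for (i = 0; i < m; i++) {
        double tmp = 0.0;             /* private to the executing thread       */
        for (j = 0; j < n; j++)
            tmp += A[i][j] * x[j];
        y[i] = tmp;                   /* single write to the shared cache line */
    }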