0% found this document useful (0 votes)

65 views44 pages

P 1

The document provides an overview of parallel computation and parallel systems. It discusses key topics such as parallel memory models, instruction types, process granularity, connection topologies, and examples of parallel supercomputers. Specific topics covered include distributed memory, shared memory, MIMD and SIMD instruction models, coarse-grained and fine-grained processes, static interconnects like meshes and toruses, and examples of parallel systems from IBM, Cray, SGI and more.

Uploaded by

Bhabanath Sahu

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

65 views44 pages

P 1

Uploaded by

Bhabanath Sahu

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 44

Lecture Notes on Parallel Computation

Stefan Boeriu, Kai-Ping Wang and John C. Bruch Jr. Office of Information Technology and Department of Mechanical and Environmental Engineering University of California Santa Barbara, CA 1 4 4 4 4 4 6 6 6 6 6 6 7 7 7 7 7 8 8 8 8 9 9 9 10 11 12 13

CONTENTS 1. INTRODUCTION 1.1 What is parallel computation? 1.2 Why use parallel computation? 1.3 Performance limits of parallel programs 1.4 Top 500 Supercomputers 2. PARALLEL SYSTEMS 2.1 Memory Distribution 2.1.1 Distributed Memory 2.1.2 Shared Memory 2.1.2 Hybrid Memory 2.1.4 Comparison 2.2 Instruction 2.2.1 MIMD (Multi-Instruction Multi-Data) 2.2.2 SIMD (Single-Instruction Multi-Data) 2.2.3 MISD (Multi-Instruction Single-data) 2.2.4 SISD (Single-Instruction Single-Data) 2.3 Processes and Granularity 2.3.1 Fine-grain 2.3.2 Medium-grain 2.3.3 Course-grain 2.4 Connection Topology 2.4.1 Static Interconnects Line/Ring Mesh Torus Tree Hypercube

2.4.2 Dynamic Interconnects Bus-based Cross bar Multistage switches 2.5 Hardware Specifics Examples 2.5.1 IBM SP2 2.5.2 IBM Blue Horizon 2.5.3 Sun HPC 2.5.4 Cray T3E 2.5.5 SGI O2K 2.5.6 Cluster of workstations 3. PARALLEL PROGRAMMING MODELS 3.1 Implicit Parallelism 3.1.1 Parallelizing Compilers 3.2 Explicit Parallelism 3.2.1 Data Parallel Fortran90 HPF (High Performance Fortran) 3.2.2 Message Passing PV (Parallel Virtual machine) MPI (Message Passing Interface) 3.2.3 Shared variable Power C, F OpenMP 4. TOPICS IN PARALLEL COMPUTATION 4.1 Types of parallelism - two extremes 4.1.1 Data parallel 4.1.2 Task parallel 4.2 Programming Methodologies 4.3 Computation Domain Decomposition and Load Balancing 4.3.1 Domain Decomposition 4.3.2 Load Balancing 4.3.3 Overlapping Subdomains and Non-Overlapping Subdomains 4.3.3.1 Overlapping subdomains 4.3.3.2 Non-overlapping subdomains 4.3.4 Domain Decomposition for Numerical Analysis

14 14 15 16 17 17 18 18 19 20 21 22 22 22 22 22 23 23 23 23 24 24 24 25 25 25 25 25 26 27 27 27 27 27 28 29

4.4 Numerical Solution Methods 4.4.1 Iterative Solution Methods 4.4.1.1 Parallel SOR (Successive Over-Relaxation) Methods 4.4.1.1.1 Parallel SOR Iterative Algorithms for the Finite Difference Method 4.4.1.1.2 Parallel SOR Iterative Algorithms for the Finite Element Method 4.4.1.2 Conjugate Gradient Method 4.4.1.2.1 Conjugate Iterative Procedure 4.4.1.3 Multigrid Method 4.4.1.3.1 First Strategy 4.4.1.3.2 Second Strategy (course grid correction) 4.4.2 Direct Solution Method 4.4.2.1 Gauss Elimination Method 4.4.2.1.1 Gauss elimination procedure 5. REFERENCES

32 32 32 32 38 40 40 41 41 42 43 43 43 44

1. Introduction
1.1 What is Parallel Computation? Computations that use multi-processor computers and/or several independent computers interconnected in some way, working together on a common task. Examples: CRAY T3E, IBM-SP, SGI-3K, Cluster of Workstations. 1.2 Why use Parallel Computation? Computing power (speed, memory) Cost/Performance Scalability Tackle intractable problems 1.3 Performance limits of Parallel Programs Available Parallelism Amdahls Law Load Balance o some processors work while others wait Extra work o management of parallelism o redundant computation Communication 1.4 Top 500 Supercomputers Worldwide Listing of the 500 most powerful computers in the World, available from www.top500.org. Rmax [Gflops/s for the largest problem] - from LINPACK MPP [Massively Parallel Processors] Updated twice a year. Top 13 presented in Table 1.4.

Table 1.4 TOP 10 - June 2003

Rank Manufacturer 1 NEC HewlettPackard Linux Networx Computer Earth-Simulator ASCI Q AlphaServer SC ES45/1.25 GHz MCR Linux Cluster Xeon 2.4 GHz - Quadrics Rmax Installation Site Country Year Earth Simulator 35860 Center Japan/2002 Los Alamos 13880 National Laboratory Lawrence Livermore 7634 National Laboratory Japan # Proc

2002 5120

USA

2002 8192

USA

2002 2304

IBM

Lawrence ASCI White, SP Livermore 7304 Power3 375 MHz National Laboratory SP Power3 375 MHz 16 way xSeries Cluster Xeon 2.4 GHz Quadrics PRIMEPOWER HPC2500 (1.3 GHz) rx2600 Itanium2 1 GHz Cluster Quadrics AlphaServer SC ES45/1 GHz AlphaServer SC ES45/1 GHz 7304 NERSC/LBNL Lawrence Livermore 6586 National Laboratory National Aerospace 5406 Laboratory of Japan

USA

2000 8192

IBM

USA

2002 6656

IBM

USA

2003 1920

Fujitsu

Japan

2002 2304

HewlettPackard HewlettPackard HewlettPackard

Pacific Northwest USA 4881 National Laboratory Pittsburgh 4463 Supercomputing Center USA

2003 1540

2001 3016

Commissariat a France 3980 l'Energie Atomique (CEA)

2001 2560

2. Parallel Systems
2.1 Memory Distribution 2.2.1 Distributed Memory Each processor in a parallel computer has its own memory (local memory); no other processor can access this memory. Data can only be shared by message passing Examples: Cray T3E, IBM SP2 2.2.2 Shared Memory Global memory which can be accessed by all processors of a parallel computer. Data in the global memory can be read/write by any of the processors. Examples: Sun HPC, Cray T90 2.1.3 Hybrid (SMP Cluster) A distributed memory parallel system but has a global memory address space management. Message passing and data sharing are taken care of by the system. Examples: SGI Power Challenge Array 2.1.4 Comparison Shared Memory o Explicit global data structure o Decomposition of work is independent of data layout o Communication is implicit o Explicit synchronization Need to avoid race condition and over writing Message Passing o Implicit global data structure o Decomposition of data determines assignment of work o Communication is explicit o Synchronization is implicit
6

2.2. Instruction Flynns classification of computer architectures (1966): 2.2.1 MIMD (Multi-Instruction Multi-data) All processors in a parallel computer can execute different instructions and operate on different data at the same time. Parallelism achieved by connecting multiple processors together Shared or distributed memory Different programs can be run simultaneously Each processor can perform any operation regardless of what other processors are doing. Examples: Cray T90, Cray T3E, IBM-SP2 2.2.2. SIMD (Single-Instruction Multi-Data) All processors in a parallel computer execute the same instructions but operate on different data at the same time. Only one program can be run at a time. Processors run in synchronous, lockstep function Shared or distributed memory Less flexible in expressing parallel algorithms, usually exploiting parallelism on array operations, e.g. F90 Examples: CM2, MsPar 2.2.3 MISD (Multiple-Instruction Single-Data) Special purpose computer 2.2.4 SISD (Single-Instruction Single-Data) Serial computer

2.3 Processes and Granularity On a parallel computer, user applications are executed as processes, tasks or threads. The traditional definition of process is a program in execution. To achieve an improvement in speed through the use of parallelism, it is necessary to divide the computation into tasks or processes that can be executed simultaneously. The size of a process can be described by its granularity. 2.3.1 Fine-grain In fine granularity, a process might consist of a few instructions, or perhaps even one instruction. 2.3.2. Medium-grain Medium granularity describes the middle ground between fine-grain and course grain. 2.3.3 Course-grain In course granularity, each process contains a large number of sequential instructions and takes a substantial time to execute. Sometimes granularity is defined as the size of the computation between communication or synchronization points. Generally, we want to increase the granularity to reduce the cost of process creation and interprocess communication, but of course this will likely reduce the number of concurrent processes and the amount of parallelism. A suitable compromise has to be made. In general, we would like to design a parallel program in which it is easy to vary granularity: i.e. a scalable program design.

2.4 Connection Topology The best choice would be a fully connected network in which each processor has a direct link to every other processor. Unfortunately, this type of network would be very expensive and difficult to scale. Instead, processors are arranged in some variation of a grid, torus, hypercube, etc. Key issues in network design are the network bandwidth and the network latency. The bandwidth is the number of bits that can be transmitted in unit time, given as bits/sec. The network latency is the time to make a message transfer through the network. 2.4.1 Static Interconnects Consist of point-to-point links between processors Can make parallel system expansion easy Some processors may be closer than others Examples: Line/Ring, Mesh/Torus, Tree, Hypercube Line/Ring. o a line consists of a row of processors with connections limited to the adjacent nodes. o the line can be formed into a ring structure by connecting the free ends.

Fig. 2.4.1.a - Ring

Mesh o processors are connected in rows and columns in a 2 dimensional mesh o example: Intel Paragon

Fig. 2.4.1.b 2D Mesh In a mesh network of dimension D, each nonboundary processor is connected to 2D immediate neighbors. Connections typically consist of two wires, one in each direction.

Torus This architecture extends from the mesh by having wraparound connections. The torus is a symmetric topology, whereas a mesh is not. All added wraparound connections help reduce the torus diameter and restore the symmetry. o one-dimensional torus o two-dimensional torus o three-dimensional torus o example: Cray T3E

Fig. 2.4.1.c 2D Torus

Tree o binary tree first node is called root each node has two links connecting to two nodes below it as the network fans out from the root node At the first level below the root node, there are two nodes. At the next level, there are four nodes, and at the j-th level below the root node there are 2j nodes. o fat tree The number of links is progressively increased toward the root.

Fig. 2.4.1.d Fat tree o universal fat tree number of links between the nodes grows exponentially toward the root, thereby allowing increased traffic toward the root and reducing the communication bottleneck. examples: the Thinking Machines CM5, Meiko CS2

Hypercube each processor connects to 2n neighbors in a n dimension Hypercube examples: iPSC, nCUBE, SGI O2K

Fig. 2.4.1.e Hypercubes Hypercubes of dimension zero through four. The processors in the cubes are labeled with integers, here represented as binary numbers. Two processors are neighbors if and only if their binary labels differ only in one digit place.

2.4.2 Dynamic Interconnects Paths are established as needed between processors System expansion is difficult Processors are usually equidistant Examples: Bus-based, Crossbar, Multistage Networks Bus-based Networks In a bus-based network, processors share a single communication resource [the bus]. A bus is a highly non-scalable architecture, because only one processor can communicate on the bus at a time. Used in shared-memory parallel computers to communicate read and write requests to a shared global memory

Fig. 2.4.2.a Bus-based Networks A bus-based interconnection network, used here to implement a shared-memory parallel computer. Each processor (P) is connected to the bus, which in turn is connected to the global memory. A cache associated with each processor stores recently accessed memory values in an effort to reduce the bus traffic.

Crossbar Switching Network A crossbar switch avoids competition for bandwidth by using O(N2) switches to connect N inputs to N outputs. Although highly non-scalable, crossbar switches are a popular mechanism for connecting a small number of workstations, typically 20 or fewer.

Fig. 2.4.2.b Crossbar Network A 4*4 nonblocking crossbar, used here to connect 4 processors. On the right, two switching elements are expanded: the top one is set to pass messages through and the lower one to switch messages. Each processor is depicted twice. Pairs of processors can communicate without preventing other processor pairs from communicating.

Multistage Interconnection Networks In a multistage interconnection network (MIN), switching elements are distinct from processors. Fewer than O(p2) switches are used to connect p processors. Messages pass through a series of switch stages. In a unidirectional MIN, all messages must traverse the same number of wires, and so the cost of sending a message is independent of processor location in effect, all processors are equidistant. In a bi-directional MIN, the number of wires traversed depends to some extent on processor location, although to a lesser extent than in a mesh or hypercube. Example: IBM SP networks are bi-directional multistage inter-connection networks: o bi-directional, any-to-any inter-node connection: allows all processors to send messages simultaneously. o multistage interconnection: on larger systems (over 80 nodes), additional intermediate switches are added as the system is scaled upward

Fig. 2.4.2.c Multistage interconnection network Shaded circles represent processors and unshaded circles represent crossbar switches.
16

2.5 Hardware Specifics Examples 2.5.1 IBM SP2 Message passing system Cluster of workstations 200 MHz power 3 CPU o Peak 800 MFLOPS o 4-16 MB 2nd-level cache o sustained memory bandwidth 1.6 GB/s Multistage crossbar switch MPI o Latency 21.7 usec o Bandwidth 139 MB/sec I/O hardware

2.5.2 IBM PWR3 SDSC Blue Horizon 222 MHz 888MFLOPS (1152 CPUs, 144 nodes with 8 CPUs (SMP)) 2 Pipes, 1FMA per pipe per clock tick MPI & OpenMP programming 32 KB L1 Cache, 2MB L2 Cache

CPU PU PU PUCPU PU PU PU C C C C C C bus MEMORY bus MEMORY

CPU PU PU PU C C C bus MEMORY

Networ k

2.5.3 Sun HPC 400 MHz ..800 MFLOPS (64 CPUs) MPI or OpenMP Programming 16 KB L1 Cache, 4MB L2 Cache, 64GB total Main memory 2 Pipes, 1 FLOP per pipe per cycle

CP U

CP U bus

CP U

MEMORY

2.5.4 Cray T3E Remote memory access system Single system image 600 MHz DEC Alpha CPU o Peak 1200 MFLOPS o 96 KB 2nd-level cache o Sustained memory bandwidth 600 MB/s 3D torus network MPI o Latency 17 usec o Bandwidth 300 MB/s Shmem o Latency 4 usec o Bandwidth 400 MB/s SCI-based I/O network

2.5.5 SGI O2K Cc-NUMA system Single system image 600250 MHz MIPS R10000 CPU o Peak 500 MFLOPS o 2nd-level data cache 4-8 MB o Sustained memory bandwidth 670 MB/s 4D hypercube MPI o Latency 16 usec o Bandwidth 100 MB/s Remote memory access o Latency 497 usec o Bandwidth 600 MB/s

2.5.6 Cluster of workstations Hierarchical architecture: shared memory in a node, message passing across nodes. PC-based nodes or workstation-based nodes Networks: Myrianet, Scalable Coherent Interface, Gigabit Ethernet

3. PARALLEL PROGRAMMING MODELS A parallel computer system should be flexible and easy to use and should exhibit good programmability in supporting various parallel algorithms. Explicit parallelism means that parallelism is explicitly specified in the source code by the programmer using special language constructs, compiler directives or library function calls. If the programmer does not explicitly specify parallelism, but lets the compiler and the run-time support system automatically exploit it, we have the implicit parallelism. 3.1 Implicit Parallelism 3.1.1 Parallelizing Compilers o Automatic parallelization of sequential programs o Do not exploit functional parallelism o Compiler performs dependence analysis on a sequential programs source data and then using a suite of program transformation techniques converts the sequential code into a native parallel code. o Some performance studies indicate, however, that the parallelizing compilers are not very effective. 3.2 Explicit Parallelism Although many explicit programming models have been proposed, three models have become dominant ones: data parallel, message passing and shared variable. 3.2.1 Data parallel o Execute the same instruction or program segment over different data sets simultaneously on multiple computing nodes. o Has a single thread of control o Parallelism is exploited at data set level o No functional parallelism available

3.2.1.1 Fortran 90 Uses array syntax to express parallelism Implementation on SIMD and MIMD machines Single processor versions are available Communication is transparent 3.2.1.2 High Performance Fortran (HPF) Evolves from Fortran 90, allows for far more detail in expressing parallelism Attempt to standardize data parallel programming Data distribution and alignment can be defined Allows explicit definition of parallelism 3.2.2 Message-passing model o Multithreading a message-passing program consists of multiple processes, each of which has its own thread of control and may execute different code. Both control parallelism (MPMD Multiple-Program-Multiple-Data) and data parallelism (SPMD Single-Program-MultipleData) are supported. o Asynchronous the processes of a message-passing program execute asynchronously. o Separate address space - the processes of a parallel program reside in different address spaces. o Explicit interactions the programmer must solve all the interaction issues, including data mapping, communication and synchronization. o Scales well, especially if data is well distributed. 3.2.2.1 PVM The PVM (Parallel Virtual Machine) is a software package that permits a heterogeneous collection of Unix and/or NT computers hooked together by a network to be used as a single large parallel computer. Thus large computational problems can be solved most cost effectively by using the aggregate power and memory of many computers. The software is very portable. The source, which is available free thru Netlib [www.netlib.org], has been compiled on everything from laptops to CRAYs.
23

PVM enables users to exploit their existing computer hardware to solve much larger problems at minimal additional cost. Hundreds of sites around the world are using PVM to solve important scientific, industrial, and medical problems in addition to PVMs use as an educational tool to teach parallel programming. 3.2.2.2 MPI MPI (Message Passing Interface) is the standard programming interface MPI 1.0 in 1994 MPI 2.0 in 1997 Library interface (Fortran, C, C++) It includes point-to-point communication collective communication barrier synchronization one-sided communication (MPI 2.0) parallel I/O (MPI 2.0) process creation (MPI 2.0) 3.2.3 Shared variable o Similar to data-parallel model, in that it has single address space o Similar to message-passing model, in that it is multithreading and asynchronous o Data reside in a single, shared address space and does not have to be explicitly allocated o Communication is done implicitly through shared reads and writes of variables o Synchronization is explicit 3.2.3.1 SGI Power C Model extension to the sequential C language with compiler directives (pragmas) and library functions supports shared-variable parallel programming similar extended constructs are also provided for Fortran it is structured and relatively simple
24

3.2.3.2 OpenMP: Directive-based SM parallelization OpenMP is a standard shared memory programming interface(1997) directives for Fortran77 and C/C++ fork-join model resulting in global program it includes: o parallel loops o parallel sections o parallel regions o shared and private data o synchronization primitives barrier critical region

4. Topics in Parallel Computation

4.1 Types of parallelism: two extremes 4.1.1 Data parallel Each processor performs the same task on different data Data mapping is critical Programmed with HPF or message passing Example grid problems 4.1.2 Task parallel Each processor performs a different task More difficult to balance load Commonly programmed with message passing Example signal processing Most applications fall somewhere on the continuum between these two extremes

4.2 Programming Methodologies Bulk of program in Fortran, C, or C++ Data and/or tasks are split up onto different processors by: o Distributing the data onto local memory of CPU thus causing CPU to work on its local memory (MPPs, MPI). o Distribute work of each loop to different CPUs (SMP, OpenMP). o Hybrid distribute data onto SMP box and then within the SMP distribute work of each loop to different CPUs within the box (SMP-Cluster, MPI&OpenMP).

4.3 Computation Domain decomposition and Load Balancing

4.3.1 Domain decomposition The computation domain is partitioned into several subdomains and then mapped onto processors of a parallel system. In general, the number of subdomains equals to the number of processors in a parallel system. 4.3.2 Load Balancing The goal of partitioning is to distribute the computation load such that all processors can finish their computation at about the same time. For homogeneous parallel systems, the computation load is distributed as evenly as possible in a parallel computer. For heterogeneous parallel system, the computation load is distributed according to the computing power of each processor. 4.3.3 Overlapping Subdomains and Non-Overlapping Subdomains: 4.3.3.1 Overlapping Subdomains There is a common computation domain between two adjacent subdmains.

Subdomain 1
2

Subdomain 2
1

Mathematical formulations are applied on 1 and 2 Difficult to deal with irregular overlapping areas.

4.3.3.2 Non-overlapping Subdomains There is only an interface between two adjacent subdomains

Subdomain 1

Subdomain 2

Mathematical formulations are applied on . Can handle irregular interfaces easily.

4.3.4 Domain Decomposition for Numerical Analysis Overlapping Subdomains

Subdomain 1
1

Subdomain 2
2

Domain Decomposition

Subdomain 1
21 = f 2

Subdomain 2
1 22 = f

2 = 1
2 1 =

1 = 2
1 2 =

____

Non-overlapping Subdomains

Subdomain 1

Subdomain 2

Domain Splitting

Subdomain 1
21( n ) = f

Subdomain 2
22( n ) = f

1( n ) = g ( n )
g ( n +1) = 2( n ) + (1 ) g ( n )

2( n ) 1( n ) =

(D)

(N)

Interface Relaxation Process Iterative Scheme 1: 1. Solve interior completely. 2. Update the interface data. 3. Repeat 1. and 2. until convergence on the interface. Iterative Scheme 2: 1. One iteration for the interior mesh points of both subdomains. 2. Update the interface mesh points. 3. Continue 1. and 2. until convergence of all mesh points.

4.4 Numerical Solution Methods 4.4.1 Iterative Solution Methods 4.4.1.1 Parallel SOR (successive over-relaxation) 4.4.1.1.1 Parallel SOR Iterative Algorithms for the Finite Difference Method. One dimensional example:
d 2 =1 dx 2

Difference equation:
j+1 2 j + j-1 = x
2

j=2,,N-1

SOR Iterative Scheme:

j j
(n+1/2)

= ( j+1(n) + j-1(n+1) - x2)/2

(n+1)

= j(n+1/2) + (1 - ) j(n)

Expand to Matrix Form:

1 -1 0 0 0 0 0 0

0 2 -1 0 0 0 0 0

0 -1 2 -1 0 0 0 0

0 0 -1 2 -1 0 0 0

0 0 0 -1 2 -1 0 0

0 0 0 0 -1 2 -1 0

0 0 0 0 0 -1 2 0

0 0 0 0 0 0 -1 1

1 2 3 4 5 6 7 8 = x
2

1/ x2 1 1 1 1 1 1 8/ x2

.. 1 2 3 4 5 6 7 8

subdomain 2 subdomain 1 .. 1 2 3 interface

Reorder Equations:

1 -1 0 0 0 0 0 0

0 2 -1 0 0 0 0 0

0 -1 2 -1 0 0 0 0

0 0 -1 2 0 0 0 -1

0 0 0 0 2 -1 0 -1

0 0 0 0 -1 2 0 0

0 0 0 0 0 -1 1 0

0 0 0 -1 -1 0 0 2

1 2 3 4 6 7 8 5 = x
2

1/ x2 1 1 1 1 1 8/ x2 1

Subdomain 1: 2, 3, 4 Interface: 5 Subdomain 2: 6, 7

Two Dimensional Example:

2 2 + =1 x 2 y 2

Difference Equation: c1(i+1,j - 2i,j + i-1,j) + c2(i,j+1 - 2i,j i,j-1) = 1 SOR Iterative Scheme: i,j (n+1/2) = (c1/c3) (i+1,j (n) + i-1,j(n+1)) + (c2/c3) (i,j+1 (n) + i,j-1(n+1) ) 1/c3 i,j (n+1) = i,j (n+1/2) + (1- )i,j(n) where: c1 = 1/x2, c2 = 1/y2 and c3 = 2/x2 + 2/y2 (n) i,j+1

i-1, j (n+1)

i,j

i+1,j (n)

i,j-1

(n+1)

o o o o o

o 3 o 2 o 1 o o

o o

o 9 o 8 o 7 o o

o o o o o

5 o 4 o o

Reorder Equations: o o o o o o 3 o 2 o 1 o o o o o 9 o 8 o 7 o o o o o Subdomain 2: 7, 8, 9 4 o o o o Interface: 4, 5, 6 Column type subdomains: Subdomain 1: 1, 2, 3 5 o

o o o o o o o o o o

o 3 o 2 o 1 o o o 3 o 2 o 1 o o

o o

o 9 o 8 o 7 o o o 9 o 8 o 7 o o

o o o Subdomain 2: 3, 6, 9 o o o Block type subdomains: o o Subdomain 3: 3 o o Subdomain 4: 9 Interface: 2, 8, 4, 5, 6 Subdomain 1: 1 Subdomain 2: 7 Interface: 2, 5, 8 Row type subdomains: Subdomain 1: 1, 4, 7

5 o 4 o o o o

5 o 4 o o

4.4.1.1.2 Parallel SOR Iterative Algorithms for the Finite Element Method. The General Form of a Finite Element System: k11 ki1 0 k1i kii k2i 0 ki2 k22 u1 ui u2 = f1 fi f2

SOR Iterative Scheme: k11u1(n+1/2) = f1 k1i ui(n) u1(n+1) = u1 (n+1/2) + (1-)u1(n) (1)

kiiui(n+1/2) = fi ki1u1(n+1) ki2u2(n) ui(n+1) = ui(n+1/2) + (1-)ui(n) (2)

k22u2(n+1/2) = f2 k2iui(n+1) u2(n+1) = u2(n+1/2) + (1-)u2(n) (3)

Reorder Equations: k11 0 ki1 0 k22 ki2 k1i k2i kii u1 u2 ui = f1 f2 fi

Parallel SOR Iterative Scheme: k11u1(n+1/2) = f1 k1i ui(n) u1(n+1) = u1(n+1/2) + (1-)u1(n) (4)

k22u2(n+1/2) = f2 k2iui(n) u2(n+1) = u2(n+1/2) + (1-)u2(n) (5)

kiiui(n+1/2) = fi ki1u1(n+1) ki2u2(n+1) ui(n+1) = ui(n+1/2) + (1-)ui(n) (6)

4.4.1.2 Conjugate Gradient Method Conjugate Gradient (CG) Method is a popular iterative method for solving large systems of linear equations. CG is effective for systems of the form: Ax=b where x is an unknown vector, b is a known vector, and A is a known, square, symmetric, positive-definite (or positive-indefinite) matrix. This system arises in many important settings, such as using finite difference and finite element methods for solving partial differential equations, structural analysis and circuit analysis. 4.4.1.2.1 Conjugate Iterative Procedure
d (0) = r (0) = b Ax (0)

(i ) =

r T ( i ) r( i ) d T ( i ) Ad( i )

x(i +1) = x( i ) + ( i ) d (i ) r( i +1) = r( i ) ( i ) Ad ( i ) r T (i +1) r( i +1) r T ( i ) r( i )

(i +1) =

d ( i +1) = r( i +1) + (i +1) d (i )

4.4.1.3. Multigrid Method Many standard iterative methods (i.e. Jacobi, SOR, Gauss-Seidel) possess the smoothing property. This property makes these methods very effective at eliminating the high-frequency or oscillatory components of the error, while leaving the low-frequency or smooth components relatively unchanged. One way to improve a relaxation scheme, at least in its early stages, is to use a good initial guess. A known technique for obtaining an improved initial guess is to perform some preliminary iterations on a coarse grid and then use the resulting approximation as an initial guess on the original fine grid. Relaxation on a coarser grid is less expensive since there are fewer unknowns to be updated. Also, since the convergence factor behaves like 1O(h2), the coarser grid will have a marginally improved convergence rate. The linear system of equations considered is: Ax = b 4.4.1.3.1 First Strategy 1. 2. 3. 4. 5. 6. 7. Relax on Ax=b on a very coarse grid. Relax on Ax=b on 4 h to obtain an initial guess for 2 h . Relax on Ax=b on 2 h to obtain an initial guess for h . Relax on Ax=b on h to obtain a final approximation to the solution.

4.4.1.3.2 Second Strategy (Coarse Grid Correction) 1. Relax on Ax=b on h to obtain an approximation v h . 2. Compute the residual r = b Av h . 3. Relax on the residual equation Ae=r to obtain an approximation to the error e2h. 4. Correct the approximation obtained on h with the error estimate obtained on
2 h : v h v h + e2 h .

Transformation between grids. Interpolation (prolongation) 1. Operator: I 2nh . nh 2. Transferring the data from a coarse grid 2nh to a finer grid nh . 3. Linear interpolation can be used. Injection (restriction) 1. 2. 3. 4. Operator: I 2nh nh Moving data from a finer grid nh to a coarser grid 2nh . Data on the same grid can be used directly. Full weighting can also be used.

Coarse Grid Correction Scheme: v h CG (v h , b h ) . Relax v1 times on Ah x h = bh with initial guess v h . Compute r 2 h = I h2 h (b h Ah v h ) . Solve A2 h e2 h = r 2 h on 2h . Correct fine grid approximation: v h v h + I 2hh e2 h . Relax v2 times on Ah x h = bh on h with initial guess v h .

4.4.2 Direct Solution Method 4.4.2.1 Gauss Elimination Method The Gauss Elimination Method is the most used direct solver for the linear system: Ax = b where A is a known, square, positive definite and dense system. The general procedure for Gauss elimination is to factor the A matrix into an upper-triangular matrix: Ux = y Then use back substitution to obtain the solution x. 4.4.2.1.1.Gauss Elimination Procedure: The Gaussian elimination algorithm can be written in algorithmic form as shown: For k = 1,..., n 1 For i = k + 1,..., n
lik = aik akk

For k = n, n 1,...,1
xk = bk

For i = k + 1,..., n
xk = xk aki xi x xk = k akk

For j = k + 1,..., n
aij = aij lik akj bi = bi lik bk

(a) Forward Reduction

(b) Back Substitution

5. REFERENCES 1. K. Hwang, Z. Xu, Scalable Parallel Computing, Boston: WCB/McGraw-Hill, c1998. 2. I. Foster, Designing and Building Parallel Programs, Reading, Mass: Addison-Wesley, c1995. 3. D. J. Evans, Parallel SOR Iterative Methods, Parallel Computing, Vol.1, pp. 3-8, 1984. 4. L. Adams, Reordering Computations for Parallel Execution, Commun. Appl. Numer. Methods, Vol.2, pp 263-271, 1985. 5. K. P. Wang and J. C. Bruch, Jr., A SOR Iterative Algorithm for the Finite Difference and Finite Element Methods that is Efficient and Parallelizable, Advances in Engineering Software, 21(1), pp. 37-48, 1994. 6. K. P. Wang and J. C. Bruch, Jr., An Efficient Iterative Parallel Finite Element Computational Method, The Mathematics of Finite Elements and Applications, edited by J. R. Whiteman, John Wiley and Sons, Inc., Chapter 12, pp. 179-188, 1994.

HPC-Unit-1
No ratings yet
HPC-Unit-1
65 pages
CSC580 Quick Notes Lect1and2
100% (1)
CSC580 Quick Notes Lect1and2
18 pages
Kai Hwang: Advanced Computer Architecture
No ratings yet
Kai Hwang: Advanced Computer Architecture
9 pages
Module -4 - Parallel Processing
No ratings yet
Module -4 - Parallel Processing
32 pages
FALLSEM2021-22 CSE4001 ETH VL2021220104078 Reference Material I 03-Aug-2021 Lecture1-Course Introduction
No ratings yet
FALLSEM2021-22 CSE4001 ETH VL2021220104078 Reference Material I 03-Aug-2021 Lecture1-Course Introduction
39 pages
PDC Complete Course File
No ratings yet
PDC Complete Course File
422 pages
Parallel and Distributed Computing
No ratings yet
Parallel and Distributed Computing
90 pages
Parallel computing a comparative
No ratings yet
Parallel computing a comparative
65 pages
Unit 1
No ratings yet
Unit 1
22 pages
PDA_2
No ratings yet
PDA_2
105 pages
HPC BOOk
No ratings yet
HPC BOOk
68 pages
BDS Session 2
No ratings yet
BDS Session 2
56 pages
Coa Unit 04
No ratings yet
Coa Unit 04
85 pages
p1
No ratings yet
p1
30 pages
An Approach To Parallel Processing: Yashraj Rai Puja Padiya
No ratings yet
An Approach To Parallel Processing: Yashraj Rai Puja Padiya
3 pages
1-Introduction
No ratings yet
1-Introduction
48 pages
Parallel Computing
No ratings yet
Parallel Computing
19 pages
001__DDS-IIIT-Jan-10th
No ratings yet
001__DDS-IIIT-Jan-10th
34 pages
A Survey On Parallel Architecture and Parallel Programming Languages and Tools
No ratings yet
A Survey On Parallel Architecture and Parallel Programming Languages and Tools
8 pages
Chapter 1 - Parallel Architectures
No ratings yet
Chapter 1 - Parallel Architectures
60 pages
Parallel 123
No ratings yet
Parallel 123
28 pages
Lec1 Introduction to Parallel Computing (2)
No ratings yet
Lec1 Introduction to Parallel Computing (2)
40 pages
Hpc_unit-1 Insem Notes
No ratings yet
Hpc_unit-1 Insem Notes
76 pages
Parallel_computing
No ratings yet
Parallel_computing
32 pages
Preview-9781482211191 A37870511
No ratings yet
Preview-9781482211191 A37870511
50 pages
W3C1 Principles of Parallel Computing
No ratings yet
W3C1 Principles of Parallel Computing
28 pages
Week1-Parallel-and-Distributed-Computing
No ratings yet
Week1-Parallel-and-Distributed-Computing
55 pages
Parallel Programming Module 1
No ratings yet
Parallel Programming Module 1
71 pages
Introduction To Parallel Computing
No ratings yet
Introduction To Parallel Computing
38 pages
Unit VI Parallel Programming Concepts
No ratings yet
Unit VI Parallel Programming Concepts
90 pages
PARALLEL VS DISTRIBUTED COMPUTING
No ratings yet
PARALLEL VS DISTRIBUTED COMPUTING
9 pages
FALLSEM2021-22 CSE4001 ETH VL2021220104078 Reference Material I 05-Aug-2021 Module1 (Part 1)
No ratings yet
FALLSEM2021-22 CSE4001 ETH VL2021220104078 Reference Material I 05-Aug-2021 Module1 (Part 1)
30 pages
Week1 - Parallel and Distributed Computing
100% (1)
Week1 - Parallel and Distributed Computing
46 pages
Parallel Computing
No ratings yet
Parallel Computing
32 pages
COA - Unit 4
No ratings yet
COA - Unit 4
84 pages
Basics of Parallel Programming: Unit-1
No ratings yet
Basics of Parallel Programming: Unit-1
79 pages
28895568
No ratings yet
28895568
9 pages
Parallel Computing Main
No ratings yet
Parallel Computing Main
47 pages
Flynns
No ratings yet
Flynns
41 pages
Introduction To Computing
No ratings yet
Introduction To Computing
6 pages
Theory of Distributed Computing and Parallel Processing With Its Applications, Advantages and Disadvantages
No ratings yet
Theory of Distributed Computing and Parallel Processing With Its Applications, Advantages and Disadvantages
11 pages
ADVANCED COMPUTER ARCHITECTURE - Parallelism, Scalability, Programmability
No ratings yet
ADVANCED COMPUTER ARCHITECTURE - Parallelism, Scalability, Programmability
9 pages
CS326 Parallel and Distributed Computing: SPRING 2021 National University of Computer and Emerging Sciences
No ratings yet
CS326 Parallel and Distributed Computing: SPRING 2021 National University of Computer and Emerging Sciences
47 pages
Memory in Multiprocessor System
No ratings yet
Memory in Multiprocessor System
52 pages
Parallel N Distributed Systems
No ratings yet
Parallel N Distributed Systems
44 pages
01 Intro Parallel Computing
No ratings yet
01 Intro Parallel Computing
40 pages
Parallel Computing
No ratings yet
Parallel Computing
28 pages
CS 213: Parallel Processing Architectures: Laxmi Narayan Bhuyan
No ratings yet
CS 213: Parallel Processing Architectures: Laxmi Narayan Bhuyan
26 pages
Project - ParallelComputing BSR v2
No ratings yet
Project - ParallelComputing BSR v2
40 pages
Parallel and Distributed Computing
No ratings yet
Parallel and Distributed Computing
28 pages
Introduction To Parallel Computing LLNL
No ratings yet
Introduction To Parallel Computing LLNL
44 pages
CUDA Programming Fundamentals: Definitive Reference for Developers and Engineers
From Everand
CUDA Programming Fundamentals: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Principles of Multiple Spanning Tree Protocol: Definitive Reference for Developers and Engineers
From Everand
Principles of Multiple Spanning Tree Protocol: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
A SECURE DATA AGGREGATION TECHNIQUE IN WIRELESS SENSOR NETWORK
From Everand
A SECURE DATA AGGREGATION TECHNIQUE IN WIRELESS SENSOR NETWORK
Dr Chaitra HV
No ratings yet
Professional CUDA C Programming
From Everand
Professional CUDA C Programming
John Cheng
5/5 (1)
Exploring the Python Library Ecosystem: A Comprehensive Guide
From Everand
Exploring the Python Library Ecosystem: A Comprehensive Guide
Kameron Hussain
No ratings yet
CompTIA Network+ Review Guide: Exam N10-008
From Everand
CompTIA Network+ Review Guide: Exam N10-008
Jon Buhagiar
No ratings yet
Understanding TCP/IP
From Everand
Understanding TCP/IP
Alena KabelovÃ¡
4/5 (2)
CompTIA A+ Complete Review Guide: Core 1 Exam 220-1101 and Core 2 Exam 220-1102
From Everand
CompTIA A+ Complete Review Guide: Core 1 Exam 220-1101 and Core 2 Exam 220-1102
Troy McMillan
5/5 (2)
Advanced Backend Code Optimization
From Everand
Advanced Backend Code Optimization
Sid Touati
No ratings yet