# 程序代写代做代考 algorithm flex cache cuda PowerPoint Presentation

PowerPoint Presentation

Dr Massoud Zolgharni

mzolgharni@lincoln.ac.uk

Room SLB1004, SLB

Dr Grzegorz Cielniak

gcielniak@lincoln.ac.uk

Room INB2221, INB

Week W/C Lecture Workshop

1 23/01 Introduction –

2 30/01 Architectures

Tutorial-1

3 06/02 Patterns 1

4 13/02 Patterns 2

Tutorial-2

5 20/02 Patterns 3

6 27/02 Patterns 4

Tutorial-3

7 06/03 Communication & Synchronisation

8 13/03 Algorithms 1

Assessment

support

9 20/03 Algorithms 2

10 27/03 Algorithms 3

11 03/04 Performance & Optimisation

Tutorial-4 &

Demonstration

12 24/04 Parallel Libraries

13 01/05 –

Parallel Software Patterns

Map – THE Parallel Pattern

Stencil – Local Reductions

Commonly recurring strategies to deal with problems

• general, so not dependent on particular platform

• still fairly low-level (not full algorithms)

Are used to implement algorithms

• sort, search

We will look at both semantic (how they work) and

specific implementation (e.g. exploiting the use of

shared memory)

Data Task

Dependency

Map applies a function (Elemental Function) to every

element of data in parallel

• resulting in a new collection of the same shape as the

input

Result does not depend on the order in which

various instances of elemental function are

executed

• no need to synchronize between separate elements of

map

Foundation of many important applications

Input

Elemental Function

Output

Task: C = A + B

A, B, C – vectors, N – number of elements

for (int id = 0; id < N; id++)
C[id] = A[id] + B[id];
par_for (int id = 0; id < N; id++)
C[id] = A[id] + B[id];
serial parallel
No dependencies so each
element can be processed
separately
Complexity measures (for N elements)
• work – a total number of operations
• span – a total number of sequential steps
serial parallel
Work – requires N operations, O(n)
Span – requires N steps, O(n)
Work – requires N operations, O(n)
Span – requires 1 step assuming
infinite computational resources!
• for p parallel processors it
is O(n/p)
Problems solved by the Map pattern are called
embarrassingly parallel problems
• straightforward to implement due to lack of dependency
between different tasks
• the best type of parallel problems you can think of!
Many important applications
• e.g. brightness/contrast adjustment in image processing
Unary Map: 1 input, 1 output
N-array Map: n input, 1 output
2*x
x + y
3 7 0 1 4 0 0 4 5 3 1 0
6 14 0 2 8 0 0 8 10 6 2 0
3 7 0 1 4 0 0 4 5 3 1 0
5 11 2 2 12 3 9 9 10 4 3 1
2 4 2 1 8 3 9 5 5 1 2 1
• often several map operations
of same shape occur in
sequence
• vector operations are maps
using very simple operations
like addition and
multiplication
• a naïve implementation may
write each intermediate
result to memory
Tutorial 1: C = A * B + B
C = mult(A, B): C = A * B
C = add(A, B): C = A + B
• convert a sequence of maps
into a map of sequences
• fuse together the operations
to perform them at once
inside a single map
• adds arithmetic intensity,
reduces memory
requirements and bandwidth
Tutorial 1: C = A * B + B
C = mult_add(A, B): C = A * B + B
SAXPY from BLAS library: Single-Precision A·X Plus Y
• a combination of scalar multiplication and vector
addition
• frequently used in linear algebra operations
void saxpy_serial(float a, const float* x, float* y, int N) {
for (int id = 0; id < N; id++)
y[id] = a*x[id] + y[id];
}
serial implementation in C
SAXPY from BLAS library: Single-Precision A·X Plus Y
• a combination of scalar multiplication and vector
addition
• frequently used in linear algebra operations
__kernel void saxpy(__global float a, __global const float* x,
__global float* y) {
int id = get_global_id(0);
y[id] = a*x[id] + y[id];
}
parallel implementation in OpenCL
a 4 4 4 4 4 4 4 4 4 4 4 4
11 23 8 5 36 12 36 49 50 7 9 4y[id]
y[id] 3 7 0 1 4 0 0 4 5 3 1 0
x[id] 2 4 2 1 8 3 9 5 5 1 2 1
*
+
0 1 2 3 4 5 6 7 8 9 10 11Global ID
a 4 4 4 4 4 4 4 4 4 4 4 4
11 23 8 5 36 12 36 49 50 7 9 4y[id]
y[id] 3 7 0 1 4 0 0 4 5 3 1 0
x[id] 2 4 2 1 8 3 9 5 5 1 2 1
*
+
4 1 3 5 8 2 6 10 11 9 0 7Global ID
The order in which each thread/work item is executed is not guaranteed!
Not important for the map pattern as there are no dependencies.
What if N > p (number of processors)?

• need to partition the input data into smaller parts

• workgroups in OpenCL, thread blocks in CUDA

Each partition is then processed in a sequence

In reality workgroups are much smaller than a

number of available processors

• good for synchronisation and communication

• in OpenCL workgroup is executed on a Compute Unit

(CU)

• if a device has many CUs, many workgroups can be

executed at the same time

Stencil is a generalization of Map in which an

elemental function can access not only a single

element in an input collection but also a set of

“neighbours”

• output depends on a “neighbourhood” of inputs

specified using a set of fixed offsets relative to the

output position

Common in image and signal processing

(convolution), PDE solvers over regular grids

Input

Elemental Function

Output

Neighbourhood

This stencil has 3

elements in the

neighbourhood:

id-1, id, id+1

idid-1 id+1

Applies some

function to them…

And outputs to

the ith position of

the output array

Neighbourhood

boundary

conditions

1D averaging filter (radius 1 = 3 neighbours)

• result is an average of values at the central position and

the immediate neighbours to the left and right.

• a simple smoothing filter for noise filtering

void avg_filter_serial(const float* a, float* b, int N) {

//handle boundary conditions:

//id = 0 and id = N-1

for (int id = 1; id < N-1; id++) b[id] = (a[id-1] + a[id] + a[id+1])/3; } serial implementation in C 1D averaging filter (radius 1 = 3 neighbours) • result is an average of values at the central position and the immediate neighbours to the left and right. • a simple smoothing filter for noise filtering __kernel void avg_filter(__global const float* a, __global float* b) { int id = get_global_id(0); //handle boundary conditions: //id = 0 and id = N-1 b[id] = (a[id-1] + a[id] + a[id+1])/3; } parallel implementation in OpenCL b a 2 7 3 2 10 0 Stencils can be one dimensional or multidimensional 1D Stencil nD Stencil 5-point stencil 9-point stencil4-point stencil center cell (P) is used as well center cell (C) is used as well center cell is not used shape of the neighbourhood used depends on the application itself 6-point stencil (7-point stencil) 24-point stencil (25-point stencil) • A – input array • R – output array • Apply a stencil operation to the inner square of the form: R(i,j) = avg(A(i,j), A(i-1,j), A(i+1,j), A(i,j-1), A(i,j+1)) 0 0 0 0 0 0 0 0 0 0 00 6 9 4 7 A 1) Average all blue squares 0 0 0 0 0 0 0 0 0 0 00 6 9 4 7 A 1) Average all blue squares 2) Store result in R 0 0 0 0 0 0 0 0 0 0 00 0 4.4 0 0 R 1) Average all blue squares 2) Store result in R 3) Repeat 1 and 2 for all green squares 0 0 0 0 0 0 0 0 0 0 00 6 9 4 7 A 0 0 0 0 0 0 0 0 0 0 00 6 9 4 7 0 0 0 0 0 0 0 0 0 0 00 0 4.4 0 4.0 RA 0 0 0 0 0 0 0 0 0 0 00 6 9 4 7 A 0 0 0 0 0 0 0 0 0 0 00 3.8 4.4 0 4.0 R 0 0 0 0 0 0 0 0 0 0 00 6 9 4 7 0 0 0 0 0 0 0 0 0 0 00 3.8 4.4 3.4 4.0 A R Original Filtered Original Filtered (larger stencil) Flexible stencil (type and size and shape) • provided externally by the host as an input argument Each input element is read multiple times • use temporary storage (local memory/cache) to store neighbourhood, which is much faster to access than the normal (global) memory Boundary handling • many different ways: ignore/expand borders, etc. • keep the main loop without boundary checks avg 0 0.2 0 0.2 0.2 0.2 0 0.2 0 convolution mask Structured parallel programming: patterns for efficient computation • Chapter 4, Map Heterogeneous computing with OpenCL • Chapter 3 and 4, Hardware and OpenCL Examples