Parallel Computing with MATLAB®

Presentation Transcript

Slide1

Parallel Computing with MATLAB®

How to Use Parallel Computing Toolbox™ and MATLAB® Distributed Computing Server™ on the Discovery Cluster

An EECE5640: High Performance Computing lecture

Benjamin Drozdenko

MathWorks TA & Graduate Research Assistant

MathWorksHelp@neu.edu

February 17, 2016

Slide2

MathWorks Teaching Assistant

Need Help with MATLAB or Simulink?

Then contact the MathWorks Teaching Assistant, Ben Drozdenko, a Ph.D. student and former MathWorks employee.

For general or public MATLAB & Simulink questions, please visit my Blackboard site: MATLAB Help & Mentoring, http://j.mp/neu-matlab-help

For specific or private MATLAB & Simulink questions, please send me an e-mail: MathWorksHelp@neu.edu

For personalized, immediate on-campus assistance, please attend my Spring 2016 office hours:
  Snell Library, 1st Floor, Room 138B
  Mondays: 1:30-5:30 pm
  Wednesdays: 9 am-12 pm
  Fridays: 12 pm-5 pm

To find the office, enter Snell Library and go down the hallway to the left of the information desk. At the end of the hallway, turn left at the printing station. Office hours are in the smaller study room, the “Bullpen”, room 138B.

Slide3

Lecture Outline

Parallel Computing Toolbox™ (PCT) and MATLAB® Distributed Computing Server™ (MDCS)
Starting a Parallel Pool using parpool
Task-Parallel Jobs with PCT
  Embarrassingly Parallel Tasks using parfor
  Setting up a Simple Independent Job using createJob
Data-Parallel Jobs with PCT
  Single Program, Multiple Data using spmd
  Distributed Datasets and Operations
  Passing Messages
  Setting up a Communicating Job
Increasing Scale using MDCS on the Discovery Cluster
GPU Computing using PCT on the Discovery Cluster

EECE5640 SP2016

Slide4

MathWorks® Products

MATLAB
  High-level language and interactive environment for numerical computation, visualization, and programming.
  Language, tools, and built-in math functions to explore multiple approaches and reach a solution faster.
  Used for a range of applications, including signal processing and communications, image and video processing, control systems, test and measurement, computational finance, and computational biology.

Parallel Computing Toolbox™ (PCT)
  Solve computationally intensive and data-intensive problems using multicore processors, GPUs, and computer clusters.
  High-level constructs (parallel for-loops, special array types, and parallelized numerical algorithms) to parallelize MATLAB® applications without CUDA or MPI programming.

MATLAB® Distributed Computing Server™ (MDCS)
  Run computationally intensive MATLAB® programs and Simulink® models on computer clusters, clouds, and grids.
  Develop your program or model on a multicore desktop computer using PCT and then scale up to many computers by running it on MDCS.
  Server includes a built-in cluster job scheduler and provides support for commonly used third-party schedulers (e.g. Platform LSF, MS Windows HPCS, PBS TORQUE, etc.).

Source: MathWorks® Product Page: http://www.mathworks.com/products/

Slide5

Starting a Parallel Pool with parpool

To get an estimate of the number of computational cores on the local machine:
  >> maxNumCompThreads

To open a pool of 4 workers, type at the command prompt:
  >> parpool(4)
This command starts 4 new instances of MATLAB, which are available for computations as part of a resource pool.

In Windows, press Ctrl-Alt-Del to start Task Manager and select Processes. In total, there are now 5 instances of MATLAB.exe running on the system: 4 as part of the pool plus the original instance.

To query the current pool:
  >> p = gcp; p.NumWorkers

A parallel pool automatically starts when you execute a parallel language construct that runs on a pool, such as parfor or spmd.

When finished, close the pool:
  >> delete(gcp)

Slide6
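The pool commands from the parpool slide can be combined into one guard that reuses an open pool instead of starting a second one. This is a minimal sketch, not part of the original lecture: the variable names `desiredWorkers` and `pool` are illustrative, and `gcp('nocreate')` is used so that querying the pool does not start one as a side effect.

```matlab
% Sketch only: reuse an existing pool, or open one of a chosen size.
desiredWorkers = 4;               % assumption: same size as parpool(4) in the slide
pool = gcp('nocreate');           % returns [] when no pool is open
if isempty(pool)
    pool = parpool(desiredWorkers);
end
fprintf('Pool has %d workers.\n', pool.NumWorkers);
delete(pool);                     % close the pool when finished
```

Checking for an existing pool first keeps a script from failing when a pool is already running.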

Lecture Outline (2)

Parallel Computing Toolbox™ (PCT) and MATLAB® Distributed Computing Server™ (MDCS)
Starting a Parallel Pool using parpool
Task-Parallel Jobs with PCT
  Embarrassingly Parallel Tasks using parfor
  Setting up a Simple Independent Job using createJob
Data-Parallel Jobs with PCT
  Single Program, Multiple Data using spmd
  Distributed Datasets and Operations
  Passing Messages
  Setting up a Communicating Job
Increasing Scale using MDCS on the Discovery Cluster
GPU Computing using MATLAB on the Discovery Cluster

Slide7

Task-Parallel Jobs with PCT

Parallel For Loops with parfor keyword
  Each worker runs independently
  Ideally suited for embarrassingly parallel tasks
Parallel Jobs using createJob function
  Assign specific tasks to run on workers
Specific rules for usage
  No communication between loop iterations

Slide8

Example #1: Birthday Paradox

What is the probability that in a group of 30 randomly selected individuals, at least two of the individuals will share the same birthday?

Assuming independent events,

MATLAB® code for one trial:
  function match = birthday(groupSize)
  bdays = randi(365, groupSize, 1);
  bdays = sort(bdays);
  match = any(diff(bdays) == 0);

Code to run many trials sequentially (“brute-force algorithm”):
  function prob = runBirthday(numtrials, groupsize)
  matches = false(1,numtrials);
  for trial = 1:numtrials
      matches(trial) = birthday(groupsize);
  end
  prob = sum(matches)/numtrials;

Slide9
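As a check on the simulation in Example #1, the birthday probability also has a closed form. The expression below is the standard textbook result, added here for reference rather than taken from the slides. It assumes 365 equally likely birthdays, the same assumption that `randi(365, groupSize, 1)` makes, with $n$ denoting the group size.

```latex
\[
P(\text{at least one shared birthday})
  = 1 - \prod_{k=0}^{n-1} \frac{365 - k}{365}
\]
\[
n = 30: \qquad
P = 1 - \frac{365}{365}\cdot\frac{364}{365}\cdots\frac{336}{365} \approx 0.706
\]
```

An estimate from `runBirthday(1e5, 30)` should therefore land close to 0.706, with small run-to-run variation because the trials are random.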

Parallel For Loops with parfor

Use the keyword parfor to make any for loop into a parallel loop that runs the independent iterations on different workers.

Code to run many trials in parallel:
  function prob = pRunBirthday(numtrials, groupsize)
  matches = false(1,numtrials);
  parfor trial = 1:numtrials
      matches(trial) = birthday(groupsize);
  end
  prob = sum(matches)/numtrials;

Slide10

Time Code using tic and toc

With the pool still open, time the parallel version:
  >> tic; p = pRunBirthday(1e5,30), toc

Close the pool, and time the sequential version:
  >> delete(gcp);
  >> tic; p = runBirthday(1e5,30), toc

On my local machine:

  Ntrials                 1e5         1e6
  Sequential              0.40 sec    4.00 sec
  Parallel, Nworkers=4    0.24 sec    1.81 sec
  Speedup                 1.67X       2.2X

Slide11
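The speedup figures in the timing table follow directly from the measured times. The worked arithmetic below uses only the numbers reported in that table. The parallel efficiency line (speedup divided by the number of workers) is an extra metric added here for context and does not appear in the original slides.

```latex
\[
S = \frac{T_{\text{sequential}}}{T_{\text{parallel}}}:
\qquad
S_{10^5} = \frac{0.40}{0.24} \approx 1.67,
\qquad
S_{10^6} = \frac{4.00}{1.81} \approx 2.21
\]
\[
E = \frac{S}{N_{\text{workers}}}:
\qquad
E_{10^5} = \frac{1.67}{4} \approx 0.42,
\qquad
E_{10^6} = \frac{2.21}{4} \approx 0.55
\]
```

Both speedup and efficiency are higher for the larger problem, which suggests the fixed cost of distributing work is spread over more computation per worker.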

Set Up a Parallel Job using createJob

Observe the behavior of a parallel for loop from the following output:
  >> parfor i=1:10, disp(i); end

parfor allows for little control over parallel execution. To assign specific tasks to each worker, create an independent job instead.

Connect to a cluster:
  >> cluster = parcluster('local');
Create an independent job:
  >> job = createJob(cluster);

Slide12

Set Up a Parallel Job using createJob (cont.)

Create many tasks for the job to handle (you could use a for loop or a while loop for this).
  >> createTask(job, @runBirthday, 1, {1e5,5});
  >> createTask(job, @runBirthday, 1, {1e5,10});
  >> createTask(job, @runBirthday, 1, {1e5,15});
  >> createTask(job, @runBirthday, 1, {1e5,20});
  >> createTask(job, @runBirthday, 1, {1e5,25});
  >> createTask(job, @runBirthday, 1, {1e5,30});

Submit the job. Optionally, wait for it to finish.
  >> submit(job); wait(job, 'finished');

Retrieve the results.
  >> results = fetchOutputs(job);
  >> results{end,1}
  >> r = cell2mat(results); mean(r)

When finished, delete the job and clear the object.
  >> delete(job); clear job;

Slide13
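The createJob walkthrough notes that the six `createTask` calls could be written as a loop. A minimal sketch of that loop follows, assuming `runBirthday.m` from Example #1 is on the MATLAB path and a 'local' cluster profile is available. The names `groupSizes` and `probs` are illustrative and not from the slides.

```matlab
% Sketch only: assumes runBirthday.m (Example #1) is on the MATLAB path.
cluster = parcluster('local');
job = createJob(cluster);
groupSizes = 5:5:30;                    % same group sizes as the six explicit calls
for g = groupSizes
    createTask(job, @runBirthday, 1, {1e5, g});   % one task per group size
end
submit(job);
wait(job, 'finished');
results = fetchOutputs(job);            % cell array with one row per task
probs = cell2mat(results);              % column vector of estimated probabilities
delete(job); clear job;                 % clean up, as in the walkthrough
```

Driving the tasks from a vector keeps the group sizes in one place, so changing the sweep means editing one line.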

Example #2: Gene Matching

  function results = pargenematchsol()
  searchSeq = repmat('gattaca', 1, 10);
  numTasks = 2;
  numBases = 7048095;
  cluster = parcluster('local');
  job = createJob(cluster);
  [startValues, endValues] = splitDataset(numBases, numTasks);
  offsetLeft = floor(length(searchSeq)/2);
  if mod(length(searchSeq),2) == 0
      offsetRight = offsetLeft - 1;
  else
      offsetRight = offsetLeft;
  end
  startValues(2:end) = startValues(2:end) - offsetLeft;
  endValues(1:end-1) = endValues(1:end-1) + offsetRight;
  for tasknum = 1:numTasks
      createTask(job, @genematch, 2, {searchSeq, 'gene.txt', ...
          startValues(tasknum), endValues(tasknum)});
  end

Slide14

Example #2: Gene Matching (cont.)

  % Submit and Wait for Results
  submit(job);
  wait(job, 'finished');
  results = fetchOutputs(job);

  % Report the results
  results = cell2mat(results);
  % Return absolute position
  [~,idx] = max(results(:,1));
  bpm = results(idx,1);
  msi = results(idx,2)+startValues(idx)-1;

  function [startValues, endValues] = splitDataset(numTotalElements, numTasks)
  numPerTask = repmat(floor(numTotalElements/numTasks), 1, numTasks);
  leftover = rem(numTotalElements, numTasks);
  numPerTask(1:leftover) = numPerTask(1:leftover) + 1;
  endValues = cumsum(numPerTask);
  startValues = [1 endValues(1:end-1) + 1];

Slide15

Example #2: Gene Matching (cont.)

  function [bestPctMatch,matchStartIdx] = genematch(searchSeq,file,startIdx,endIdx)
  fid = fopen(file, 'rt');
  geneSeq = fscanf(fid, '%c');
  fclose(fid);
  % Default to the whole sequence when no range is given
  if nargin < 3, startIdx = 1; end
  if nargin < 4, endIdx = length(geneSeq); end
  [bestPctMatch,matchStartIdx] = findsubstr(geneSeq(startIdx:endIdx),searchSeq);

  function [bestPctMatch,matchStartIdx] = findsubstr(baseString,searchString)
  bestPctMatch = 0;
  matchStartIdx = 0;
  for startIdx = 1:(length(baseString)-length(searchString)+1)
      currentSection = baseString(startIdx:startIdx+length(searchString)-1);
      pctMatch = nnz(currentSection==searchString)/length(searchString);
      if pctMatch >= bestPctMatch
          bestPctMatch = pctMatch;
          matchStartIdx = startIdx;
      end
  end

Slide16

Lecture Outline (3)

Parallel Computing Toolbox™ (PCT) and MATLAB® Distributed Computing Server™ (MDCS)
Starting a Parallel Pool using parpool
Task-Parallel Jobs with PCT
  Embarrassingly Parallel Tasks using parfor
  Setting up a Simple Independent Job using createJob
Data-Parallel Jobs with PCT
  Single Program, Multiple Data using spmd
  Distributed Datasets and Operations
  Passing Messages
  Setting up a Communicating Job
Increasing Scale using MDCS on the Discovery Cluster
GPU Computing using MATLAB on the Discovery Cluster

Slide17

Data-Parallel Jobs with PCT

Single Program, Multiple Data with spmd
  Self-identification using labindex and numlabs
  Types of arrays: replicated, variant, private
  Composite data type
Distributed Datasets and Operations
Passing Messages
  Practical Considerations
Setting up Communicating Jobs using createCommunicatingJob
  Assign one task to run on all labs
  A lab can pass messages to other labs

Slide18

Single Program, Multiple Data: spmd

  >> spmd
  >> labindex
  >> numlabs
  >> end
  Lab 1: 1 4
  Lab 2: 2 4
  Lab 3: 3 4
  Lab 4: 4 4

  >> spmd
  >> code
  >> end

  Lab 1     Lab 2     Lab 3     Lab 4
  >>code    >>code    >>code    >>code

Slide19

Types of Arrays on Labs in SPMD

Replicated Array
  >> spmd
  >> x = 5;
  >> end
  Lab 1: x = 5    Lab 2: x = 5    Lab 3: x = 5    Lab 4: x = 5

Variant Array
  >> spmd
  >> y = rand;
  >> end
  Lab 1: y = 0.3246    Lab 2: y = 0.2646    Lab 3: y = 0.8847    Lab 4: y = 0.8939

Private Array
  >> spmd
  >> if (labindex==2)
  >> z = 7;
  >> end
  >> end
  Lab 1: (empty)    Lab 2: z = 7    Lab 3: (empty)    Lab 4: (empty)

Slide20

Composite Class & Reductions

From the MATLAB client, all three of these array types show up in the workspace as a Composite data type.
  >> class(y)

Use curly braces to extract the contents at any lab index.
  >> y{3}

Use a global operation to combine the results from all the labs. This performs a Reduction & Broadcast, which turns a Composite variable into a replicated array.
  >> spmd
  >> ay = gcat(y); % Global concatenation
  >> sy = gplus(y); % Global summation
  >> my = gop(@max,y); % Global maximum
  >> end

Or, specify a lab index to perform a Reduction and turn it into a private array:
  >> spmd
  >> ay1 = gcat(y,1,1);
  >> sy1 = gplus(y,1);
  >> my1 = gop(@max,y,1);
  >> end

Slide21
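To connect the reduction functions back to Example #1, the birthday estimate could be computed inside an spmd block with a single `gplus` call. This is a rough sketch rather than the lecturer's code. It assumes an open parallel pool and `birthday.m` on the path, and the names `trialsPerLab`, `localMatches`, and `totalMatches` are hypothetical.

```matlab
% Sketch only: assumes an open pool and birthday.m (Example #1) on the path.
trialsPerLab = 25000;                   % hypothetical workload per lab
spmd
    localMatches = 0;                   % variant: each lab counts its own matches
    for t = 1:trialsPerLab
        localMatches = localMatches + birthday(30);
    end
    totalMatches = gplus(localMatches); % reduction and broadcast: same sum on every lab
    prob = totalMatches / (trialsPerLab * numlabs);
end
p = prob{1};                            % Composite on the client; any index holds the same value
```

Because `gplus` without a target lab broadcasts its result, `prob` is a replicated value and any lab's copy can be read from the client.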

Distributed Datasets and Operations

Use the Distributed data type from the client.
  vars = load('airportdata');
  dlat = distributed(vars.lat);
  dlong = distributed(vars.long);

OR: Use the Codistributed data type from the labs.
  spmd
      vars = load('airportdata');
      clat = codistributed(vars.lat);
      clong = codistributed(vars.long);
  end

Use the gather function to convert to a replicated array.
  >> allat = gather(dlat);
Or, specify a lab index to convert to a private array.
  >> allat1 = gather(dlat, 1);
Also, use getLocalPart to convert to a variant array.
  >> loclat = getLocalPart(dlat);

Slide22

Example #3: Airport Distances

  spmd
      R = 3963.2; % in miles
      vars = load('airportdata');
      lat = codistributed(vars.lat);
      long = codistributed(vars.long);
      lat = 90 - lat;
      long = 360 + long;
      x = R * sind(lat) .* cosd(long);
      y = R * sind(lat) .* sind(long);
      z = R * cosd(lat);
      coords = [x y z]';
      dotprod = coords' * coords;
      mag = sqrt(sum(coords.^2));
      angles = min(dotprod ./ (mag' * mag), 1);
      dist = R * acos(angles); % Arc length
  end

Slide23

Passing Messages

To send a variable x to another lab:
  >> labSend(x, dest_labindex);
To receive a variable x from another lab:
  >> x = labReceive(src_labindex);
To see whether data from another lab is ready to be received:
  >> isReady = labProbe(src_labindex);
To broadcast a variable x to all other labs:
  >> x = labBroadcast(src_lab,x);
To synchronize all the labs:
  >> labBarrier;

Slide24

Passing Messages: Practical Considerations

  spmd
      switch labindex
          case 1
              x1 = labindex * ones(1, 5); % Create local data
              x2 = labReceive(2);         % Receive data from peer
              labSend(x1, 2);             % Send data to peer
              y = x2;                     % Return peer's data
          case 2
              x2 = labindex * ones(1, 5); % Create local data
              x1 = labReceive(1);         % Receive data from peer
              labSend(x2, 1);             % Send data to peer
              y = x1;                     % Return peer's data
      end
  end

Deadlock! Lab 1 waits to receive, and Lab 2 waits to receive.

Use the labSendReceive function instead to exchange data between labs.

  spmd
      switch labindex
          case 1
              x1 = labindex * ones(1, 5);     % Create local data
              x2 = labSendReceive(2, 2, x1);  % Exchange data with lab 2
              y = x2;                         % Return peer's data
          case 2
              x2 = labindex * ones(1, 5);     % Create local data
              x1 = labSendReceive(1, 1, x2);  % Exchange data with lab 1
              y = x1;                         % Return peer's data
      end
  end

Slide25
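The two-lab exchange in the labSendReceive example generalizes to any number of labs arranged in a ring. The sketch below is illustrative only and assumes an open parallel pool. The neighbor formulas are the same ones used in `parheateqn` (Example #4), while the variable name `fromLeft` is hypothetical.

```matlab
% Sketch only: assumes an open parallel pool; fromLeft is an illustrative name.
spmd
    rightNeighbor = mod(labindex, numlabs) + 1;      % last lab wraps around to lab 1
    leftNeighbor  = mod(labindex - 2, numlabs) + 1;  % lab 1 wraps around to the last lab
    % Send this lab's index to the right and receive the left neighbor's index
    % in a single call, which avoids the receive-before-send deadlock.
    fromLeft = labSendReceive(rightNeighbor, leftNeighbor, labindex);
end
```

Every lab posts its send and its receive in one call, so no lab sits blocked in a receive while its partner is also waiting.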

Example #4: Parallel Heat Equation

Get a matrix representing the temperature at each point of a 2D square plate of length L and diffusivity c.

For example, solve for the temperature on a 3m-by-3m copper plate after 40 seconds have elapsed, using 500 time steps of 80 ms each. Thermal diffusivity of copper is 1.13e-4 m^2/s.

Slide26
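For context, the problem described for Example #4 is the 2D heat equation, and the update inside the `heateq` loop matches the standard explicit finite-difference scheme for it. The formulas below are a reading of that code rather than something stated on the slides. The symbols $u_{i,j}^{(m)}$ (temperature at grid point $(i,j)$ after $m$ steps) and $\Delta$ (grid spacing, `ms` in the code) are notation introduced here.

```latex
\[
\frac{\partial u}{\partial t}
  = c\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right)
\]
\[
u_{i,j}^{(m+1)} = u_{i,j}^{(m)} + \frac{c\,T_s}{\Delta^2}
  \left(u_{i-1,j}^{(m)} + u_{i+1,j}^{(m)} + u_{i,j-1}^{(m)} + u_{i,j+1}^{(m)} - 4\,u_{i,j}^{(m)}\right),
\qquad \Delta = \frac{L}{n}
\]
\[
\frac{40\ \text{s}}{0.08\ \text{s per step}} = 500\ \text{steps}
\]
```

Each new value depends only on a point and its four neighbors from the previous step, which is why the parallel version only needs to exchange boundary columns between neighboring labs.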

Example #4: Sequential Heat Equation

  function U = heateq(k, n, Ts, L, c)
  ms = L / n;
  if Ts > (ms^2/2/c), error('Selected time step is too large.'); end
  U = initialTempDistrib(n);
  north = 1:n;
  south = 3:(n + 2);
  curr = 2:(n + 1);
  east = 3:(n + 2);
  west = 1:n;
  for iter = 1:k
      U(curr, curr) = U(curr, curr) + c * Ts/(ms^2) * (U(north, curr) + ...
          U(south, curr) - 4*U(curr, curr) + U(curr, east) + U(curr, west));
  end

  function U = initialTempDistrib(n)
  U = 23*ones(n + 2);
  U(1, :) = (1:(n + 2))*700/(n + 2);
  U(end, :) = ((1:(n + 2)) + (n + 2))*700/2/(n + 2);
  U(:, 1) = (1:(n + 2))*700/(n + 2);
  U(:, end) = ((1:(n + 2)) + (n + 2))*700/2/(n + 2);

Slide27

Setting up Communicating Jobs

Connect to a cluster.
  >> cluster = parcluster('local');
Create a communicating job.
  >> job = createCommunicatingJob(cluster,'Type','SPMD');
Create a repeating task for the job to handle.
  >> createTask(job,@parheateqn,1,...
         {1e3,500,0.08,3,1.13e-4});
Set a range for the number of workers needed.
  >> set(job,'NumWorkersRange',[3 3]);
Submit the job. Optionally, wait for it to finish.
  >> submit(job); wait(job,'finished');
Retrieve the results.
  >> results = fetchOutputs(job);
  >> U = cell2mat(results');
  >> imagesc(U)
When finished, clean up. Delete the job and clear the object.
  >> delete(job); clear job;

Slide28

Example #4: Parallel Heat Equation

  function U = parheateqn(k, n, Ts, L, c)
  ms = L / n;
  if (Ts>(ms^2/2/c)), error('Selected time step is too large.'); end
  Uinit = initialTempDistrib(n);
  parts = codistributor1d.defaultPartition(n+2);
  numLocalCols = parts(labindex);
  leftColInd = sum(parts(1:labindex - 1)) + 1;
  rightColInd = leftColInd + numLocalCols - 1;
  U = Uinit(:, leftColInd:rightColInd);
  if (labindex > 1), U = [zeros(n+2, 1) U]; end
  if (labindex < numlabs), U = [U zeros(n+2, 1)]; end
  if (labindex == 1) || (labindex == numlabs)
      numLocalCols = numLocalCols - 1;
  end
  rightNeighbor = mod(labindex, numlabs) + 1;
  leftNeighbor = mod(labindex - 2, numlabs) + 1;
  north = 1:n;
  south = 3:n + 2;
  currRow = 2:n + 1;
  currCol = 2:numLocalCols + 1;
  east = 3:numLocalCols + 2;
  west = 1:numLocalCols;

Slide29

Example #4: Parallel Heat Equation (cont.)

  for iter = 1:k
      rightBoundary = labSendReceive(leftNeighbor,rightNeighbor,U(:,2));
      leftBoundary = labSendReceive(rightNeighbor,leftNeighbor,U(:,end-1));
      if (labindex > 1), U(:, 1) = leftBoundary; end
      if (labindex < numlabs), U(:, end) = rightBoundary; end
      % Update grid for current iteration
      U(currRow,currCol) = U(currRow,currCol) + ...
          c*Ts/(ms^2)*(U(north,currCol) + U(south,currCol) - ...
          4*U(currRow,currCol) + U(currRow,east) + U(currRow,west));
  end
  % Combine parts from all labs into a single matrix stored on lab 1
  U = gcat(U(currRow, currCol), 2, 1);

  function U = initialTempDistrib(n)
  U = 23*ones(n + 2);
  U(1, :) = (1:(n + 2))*700/(n + 2);
  U(end, :) = ((1:(n + 2)) + (n + 2))*700/2/(n + 2);
  U(:, 1) = (1:(n + 2))*700/(n + 2);
  U(:, end) = ((1:(n + 2)) + (n + 2))*700/2/(n + 2);

Slide30

Summary: Problem Types

                   Interactive         Batch
  Task-Parallel    parpool, parfor     createJob, createTask
  Data-Parallel    parpool, spmd       createCommunicatingJob, createTask

Slide31

Lecture Outline (4)

Parallel Computing Toolbox™ (PCT) and MATLAB® Distributed Computing Server™ (MDCS)
Starting a Parallel Pool using parpool
Task-Parallel Jobs with PCT
  Embarrassingly Parallel Tasks using parfor
  Setting up a Simple Independent Job using createJob
Data-Parallel Jobs with PCT
  Single Program, Multiple Data using spmd
  Distributed Datasets and Operations
  Passing Messages
  Setting up a Communicating Job
Increasing Scale using MDCS on the Discovery Cluster
GPU Computing using MATLAB on the Discovery Cluster

Slide32

Increasing Scale using Multiple Systems with MDCS

Slide33

MDCS One-Time Setup on the Discovery Cluster

Load the required modules for MATLAB R2013b.
  > module whatis matlab_dce_2013b

Best practice is to add the following lines to your ~/.bashrc file (if they’re not already in there):
  module load gnu-4.4-compilers
  module load fftw-3.3.3
  module load platform-mpi
  module load oracle_java_1.7u40
  module load matlab_dce_2013b
Then, log out and log back in to the Discovery Cluster for your changes to take effect.

Copy the .matlab directory to your home directory.
  > cp -R /shared/apps/matlab/matlab-2013b/env_script/.matlab ~/.

Get a compute node on the ht-10g queue interactively.
  > bsub -Is -n 2 -q ht-10g /bin/bash
Output like: <<Starting on compute-0-007>>

Verify that the proper modules have been loaded.
  > module list

Run MATLAB with no display to verify that you have MATLAB installed correctly.
  > matlab -logfile ./output.txt -dmlworker -nodisplay -r "ver;exit"
The terminal output shows MATLAB start, display all its product versions, and then exit.

If you’re done on the compute node, exit out of it.
  > exit

Source: http://nuweb12.neu.edu/rc/?page_id=18#matjobs

Slide34

Running an MDCS Submit Script on the Discovery Cluster

Create a Platform LSF submit script called “bsub_parfor.bash” with the following content:
  #!/bin/bash
  #BSUB -L /bin/bash
  #BSUB -J BensParforJob.01
  #BSUB -q ht-10g
  #BSUB -o %J.out
  #BSUB -e %J.err
  #BSUB -n 9
  work=/home/drozdenko.b/hpc/matlab_dcs_test
  MATLAB_infile=parfor_parallel
  cd $work
  matlab -logfile ./output.txt -nodisplay -r $MATLAB_infile

Always set -n to one more than the number of MATLAB worker threads your code expects.

Submit the job and check your job and output.
  > bsub < bsub_parfor.bash
  > bjobs -w
Output is something like: JOBID 36768/USER drozdenko.b/STAT RUN…
  > bpeek 582677 (if bjobs shows that your job’s status is still running)
  > cat output.txt (once bjobs shows that your job is finished)

Slide35

Running Jobs in Batch Mode from MATLAB GUI on the Discovery Cluster

Start an interactive session with X11-forwarding on the ht-10g queue.
  > bsub -Is -XF -n 1 -q ht-10g /bin/bash
Output is something like: <<Starting on compute-0-006>>

Ensure that the modules needed for MATLAB are loaded.
  > module list

Run MATLAB.
  > matlab &

Configure cluster profile settings.
  >> configCluster('discovery');
  >> ClusterInfo.setQueueName('ht-10g')
  >> ClusterInfo.setProcsPerNode(16)

Submit a batch SPMD job using the batch function.
  >> j = batch('spmd_parallel','matlabpool',16);

Wait for the job to finish. Check the diary. Fetch the outputs.
  >> j.State
  >> j.wait
  >> j.diary
  >> out = j.fetchOutputs{:}

Slide36

GPU Setup on Discovery Cluster

Load the required CUDA modules (in addition to the already loaded MATLAB R2013b modules).
  > module whatis cuda-5.5

Best to add the following lines to your ~/.bashrc file (if they’re not already in there):
  module load gnu-4.4-compilers
  module load fftw-3.3.3
  module load platform-mpi
  module load cuda-5.5

Start an interactive session with X11-forwarding on the par-gpu queue.
  > bsub -Is -XF -n 1 -q par-gpu-2 /bin/bash
Output: <<Starting on compute-2-160>>

Run MATLAB.
  > matlab &

Confirm that you are connected to a GPU device.
  >> gpuDeviceCount
MATLAB command line output should be “ans = 1”.

Get GPU device information:
  >> d = gpuDevice
Output shows the properties in the table below (among others):

  GPU Property          Value
  Name                  Tesla K20m/40m
  ComputeCapability     3.5
  SupportsDouble        1
  MaxThreadsPerBlock    1024
  MaxShmemPerBlock      49152
  MaxThreadBlockSize    [1024 1024 64]
  MaxGridSize           [2.1475e+09 65535 65535]

Source:
  http://nuweb12.neu.edu/rc/?page_id=18#gpujobs
  http://www.mathworks.com/help/distcomp/identify-and-select-a-gpu-device.html

Slide37

Run Built-in Functions on GPU from MATLAB GUI on the Discovery Cluster

Try running and timing the MATLAB built-in function FFT with an array of 10 million random doubles.
  n = 1e7;
  r = rand(n,1);
  tic; rf = fft(r); toc;
Output is like: Elapsed time is 0.151070 seconds.

Next, put the array on the GPU and run the GPU version of the FFT function on the same array. When you’re done with the gpuArray data, use the gather() function to transfer it back to your local workspace.
  tic; g=gpuArray(r); gf=fft(g); gg=gather(gf); toc;
Output is like: Elapsed time is 0.124526 seconds.

Note that the GPU version runs slightly faster with an array of this size, even with the transfers to and from the GPU included in the timing.

Refer to MathWorks documentation for the latest list of built-in functions: http://www.mathworks.com/help/distcomp/run-built-in-functions-on-a-gpu.html

Slide38
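The tic/toc comparison in the GPU FFT example includes the host-to-GPU and GPU-to-host transfers in the GPU timing. The sketch below is an alternative measurement, not the lecturer's method. It assumes a supported GPU is available and times only the FFT itself with `timeit` and `gputimeit`. The names `tCpu` and `tGpu` are illustrative, and results will vary by machine.

```matlab
% Sketch only: assumes a supported GPU; timings vary by machine.
n = 1e7;
r = rand(n, 1);
g = gpuArray(r);                    % transfer once, outside the timed region
tCpu = timeit(@() fft(r));          % CPU FFT, typical time over several runs
tGpu = gputimeit(@() fft(g));       % GPU FFT, waits for the GPU to finish before stopping the clock
fprintf('CPU: %.4f s, GPU: %.4f s\n', tCpu, tGpu);
```

Separating compute time from transfer time helps show whether a GPU version is limited by the computation or by data movement.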

Run CUDA PTX files on GPU from MATLAB GUI on the Discovery Cluster

Create a CUDA C kernel function. This add2 function adds two double vectors.
  __global__ void add2(double *v1, const double *v2) {
      int idx = threadIdx.x;
      v1[idx] += v2[idx];
  }

Next, compile the CUDA C kernel function using nvcc, producing only the .PTX file.
  > nvcc -ptx gpufcn.cu

In the MATLAB GUI, create a CUDAKernel object and set its properties.
  >> k=parallel.gpu.CUDAKernel('gpufcn.ptx','gpufcn.cu','add');
  >> k.ThreadBlockSize = 128;

Call the feval function to run the CUDA kernel with gpuArray data.
  >> x1 = gpuArray(rand(n,1));
  >> x2 = gpuArray(rand(n,1));
  >> y = feval(k,x1,x2);
  >> yg = gather(y);

Refer to MathWorks documentation for more detailed instructions: http://www.mathworks.com/help/distcomp/run-cuda-or-ptx-code-on-gpu.html

Slide39

Cleanup on Discovery Cluster

Close any interactive MATLAB GUI windows.
  >> exit

Check to see if you have any other processes still running on each compute node.
  > ps
Ignore ps and bash. For other listed processes, use the kill command.
  > kill 29777

Exit each interactive session on a compute node.
  > exit (each compute node)

Check to see if you have any remaining jobs still running or queued.
  > bjobs -w
Ignore jobs on a discovery login node where JOB_NAME is /bin/bash. For all other listed jobs, use the bkill command.
  > bkill 506225

When finished, exit each interactive session and exit Discovery.
  > exit (the Discovery cluster)

Slide40

Conclusion

Use Parallel Computing Toolbox on your local machine to prototype your parallel algorithms.

Use parfor to enact task-parallel algorithms (like with OpenMP) and spmd to enact data-parallel algorithms (like with MPI).

Move your parallel algorithms onto the Discovery Cluster to see significant speedup using MDCS.

You can also perform GPU computing on the Discovery Cluster using gpuArrays, built-in functions, CUDA PTX files, and gather.

Questions?