Advanced Architectures
Performance
The speed at which a computer executes a program is affected by:
the design of its hardware – processor speed, clock rate, memory access time, etc.
the machine language (ML) instructions – the instruction format, instruction set, etc.
the compiler that translates HLL programs into ML programs – how efficient the generated ML code is.
Performance – Memory Access Time
Techniques used to reduce memory access time are:
Use cache memory
Prefetch instructions and place them in the instruction queue in the processor
These bring instruction fetch time down to close to one processor clock cycle.
[Diagram: Memory → Cache → Instruction Queue → Instruction Register]
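The benefit of a cache can be quantified with the standard average-memory-access-time (AMAT) relation. This formula is a common textbook identity, not taken from these slides, and the timing numbers below are illustrative assumptions:

```python
# Sketch: effect of a cache on average memory access time (AMAT).
# AMAT = hit time + miss rate * miss penalty (standard textbook relation;
# the cycle counts used here are assumed, illustrative values).

def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles."""
    return hit_time + miss_rate * miss_penalty

# Assumed: 1-cycle cache hit, 5% miss rate, 100-cycle main-memory access.
without_cache = 100               # every access goes to main memory
with_cache = amat(1, 0.05, 100)   # 1 + 0.05 * 100 = 6 cycles
print(with_cache)                 # 6.0
```

Even a modest hit rate brings the average access time close to the cache hit time, which is why caching plus prefetching can pull instruction fetch to within about a clock cycle.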
Performance Equation
To execute a machine instruction, the processor divides the actions to be taken into a sequence of basic steps
Each basic step can be executed in one clock cycle
For a clock cycle period P, the clock rate R = 1/P
Let T be the processor time required to execute a program written in a high-level language (HLL)
N is the number of machine language instructions generated for the program
S is the average number of basic steps needed to execute a single machine instruction
The program execution time T is given as T = (N*S)/R
This is called the basic performance equation
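The basic performance equation can be evaluated directly. The parameter values below are illustrative assumptions, not from the slides:

```python
# Sketch of the basic performance equation T = (N * S) / R.
# The numbers are illustrative assumptions.

def execution_time(N, S, R):
    """N = instruction count, S = average basic steps per instruction,
    R = clock rate in cycles per second. Returns T in seconds."""
    return (N * S) / R

# Assumed: 10 million instructions, 4 basic steps each, 2 GHz clock.
T = execution_time(10_000_000, 4, 2_000_000_000)
print(T)  # 0.02 seconds
```

Doubling R halves T, while halving N or S has the same effect; the next slide explains why these three parameters cannot be improved independently.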
Performance Equation
To enhance the performance of a computer, the performance parameter T must be reduced
To reduce T: reduce N and S, and increase R
N, S and R are interdependent
A reduction in N usually comes at the cost of more basic steps per instruction
A reduction in S usually comes at the cost of more instructions
Increasing R reduces the length of the clock cycle, giving less time for the execution of a basic step
Hence the need is to introduce features that collectively bring down T
Pipelining
Normally it is assumed that instructions are executed one after another; therefore the total number of basic steps for a program is
S1 + S2 + … + SN = N*S
where S1, S2, …, SN are the individual numbers of basic steps for each of the N instructions of a given program, and S is the average number of basic steps
N*S is the total number of clock cycles required to execute the program if all the steps of an instruction are executed by the same module in the processor and the N*S steps are executed sequentially
Instead, there can be multiple functional modules to execute the different steps of an instruction, with these operating as a pipeline that executes the identical steps of consecutive instructions in successive clock cycles
This leads to overlapping execution of successive instructions in the program and results in a reduction of the program execution time
Pipelining
Pipelining takes advantage of the fact that the execution of any instruction can be broken down into a sequence of basic steps which are handled by different hardware units inside the processor
For example
Fetch (F) is performed by the system bus
Decode (D) is performed by the decode unit in the control unit
Execute (E) is performed by the ALU
Write (W) is performed by the internal processor bus
[Diagram: Instruction flowing through Fetch Unit → Decode Unit → Execute Unit → Write Unit]
Pipelining
In pipelining, different instructions at different stages of their execution are executed at the same time
It is assumed that each of the basic steps requires the same amount of time – 1 clock cycle
The pipeline completes the processing of one instruction in each cycle
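The overlap can be visualized with a small timing-chart generator. This is an illustrative sketch that assumes the four stages F, D, E, W, one cycle per stage, and no hazards:

```python
# Sketch: print a timing diagram for a 4-stage pipeline (F, D, E, W),
# assuming every stage takes exactly one clock cycle and there are no hazards.

STAGES = ["F", "D", "E", "W"]

def timing_chart(n_instructions):
    """One row per instruction; column i is clock cycle i+1."""
    rows = []
    total_cycles = n_instructions + len(STAGES) - 1
    for i in range(n_instructions):
        row = ["  "] * total_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name + str(i + 1)   # instruction i+1 in stage s at cycle i+s+1
        rows.append("I%d: %s" % (i + 1, " ".join(row)))
    return rows

for line in timing_chart(4):
    print(line)
# I1: F1 D1 E1 W1
# I2:    F2 D2 E2 W2   (and so on, one column later per instruction)
```

Each instruction starts one cycle after its predecessor, so after the pipeline fills, one instruction completes per cycle.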
Pipelining
If there are K stages in the pipeline, steps of up to K instructions are executed in parallel.
After the initial filling of the pipeline, the execution of one instruction is completed in each clock cycle.
The time required to execute a program involving N*S steps is (N*S)/K + (K-1)
Speed-up achieved = (time required to execute the program without the pipeline) / (time required to execute it with the pipeline)
= (N*S) / ((N*S)/K + (K-1)) = 1 / (1/K + (K-1)/(N*S))
≈ K for K << N*S
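The speed-up formula above can be checked numerically; the parameter values here are illustrative assumptions:

```python
# Sketch: pipeline speed-up per the slide's formula,
# speedup = (N*S) / ((N*S)/K + (K - 1)), which tends to K when N*S >> K.

def speedup(N, S, K):
    """N = instruction count, S = average steps/instruction, K = pipeline stages."""
    steps = N * S
    return steps / (steps / K + (K - 1))

# Assumed: 1 million instructions, 4 steps each, a 4-stage pipeline.
print(round(speedup(1_000_000, 4, 4), 4))  # 4.0 (essentially K)
```

For a long program the (K-1) fill cost is negligible, so the ideal speed-up equals the number of stages K.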
Dependencies – Pipeline Hazards
There are dependency conditions that prevent normal scheduling of the instruction steps in the pipeline. These lead to pipeline hazards.
Three types of dependencies:
Structural dependencies – when an instruction in the pipeline needs a hardware resource being used by another instruction
Data dependencies – when an instruction depends on a data value produced by an instruction still in the pipeline
Control dependencies – when whether an instruction will be executed or not is determined by a control instruction which is still in the pipeline
These hazards need resolution for the instructions to execute correctly.
Structural Hazard
Usually occurs when instructions have different sequences of basic steps.
To deal with such situations:
The programmer explicitly avoids such sequences of instructions
Stalling – postponing a step in the latter instruction to avoid a collision
Add more hardware resources to allow instructions to use independent resources at the same time
[Diagram notes: M2 is an operand fetch step; F5 is a stalled instruction fetch step; W3 and W4 are stalled write steps]
Data Hazard
To deal with it:
The programmer explicitly avoids such sequencing
Stalling – freeze the latter stages until the results of the preceding instruction are written
Bypassing/operand forwarding – data available at the output of the ALU is directly forwarded to the next instruction instead of waiting for the results to be written
Using software – during compilation, the compiler detects such hazards and introduces NOP (no-operation) instructions in between
[Diagram note: D2A is stalled because it uses the output of I1 before the results are written by W1]
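The cycle cost of stalling versus forwarding can be illustrated with a toy model. This is a simplified, assumed model of a 4-stage pipeline (F, D, E, W), not an implementation from the slides: a result reaches a dependent instruction one cycle after the producer's E stage with forwarding, or two cycles after (i.e. after W) without it.

```python
# Sketch: a toy model counting total cycles for a short instruction sequence
# in a 4-stage pipeline (F, D, E, W), with and without operand forwarding.
# Illustrative assumptions only; not taken from the slides.

def total_cycles(deps, forwarding):
    """deps[i] = index of the earlier instruction whose result instruction i
    reads, or None. Returns the cycle in which the last instruction's W runs."""
    exec_cycle = []  # cycle in which each instruction's E stage runs
    for i, dep in enumerate(deps):
        # Earliest E: cycle 3 for the first instruction (F=1, D=2, E=3),
        # otherwise one cycle after the previous instruction's E.
        cycle = 3 if i == 0 else exec_cycle[i - 1] + 1
        if dep is not None:
            # Operand ready one cycle after the producer's E (forwarding),
            # or two cycles after E, i.e. after its W (no forwarding).
            ready = exec_cycle[dep] + (1 if forwarding else 2)
            cycle = max(cycle, ready)
        exec_cycle.append(cycle)
    return exec_cycle[-1] + 1  # the last instruction's W follows its E

# I2 reads I1's result; I3 is independent.
deps = [None, 0, None]
print(total_cycles(deps, forwarding=False))  # 7: one stall cycle inserted
print(total_cycles(deps, forwarding=True))   # 6: forwarding removes the stall
```

Under these assumptions, forwarding lets the dependent instruction execute with no penalty, while stalling delays it (and everything behind it) by a cycle.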
Control Hazard
Usually as a result of a branch instruction
To deal with such hazards –
Branching delay/stalling
Introduce delay slots after every branch instruction to avoid the execution of unnecessary instructions
Reorder the sequence of instructions to avoid wasting processor cycles due to the introduced delay slots; useful only in the case of unconditional branch instructions
[Diagram note: I3 and I4 are executed needlessly because the branch instruction I2 jumps to Ik]
Control Hazard
To deal with such hazards …
2. Static branch prediction – in the case of conditional branching, predict whether or not a branch will be taken and fetch/execute the next instruction accordingly. The predicted instruction is executed, but its results are not written back.
Predict based on some heuristic, such as:
The branch is always taken
The branch is taken 50% of the time
Take the branch if the branch instruction is at the beginning of a loop
3. Dynamic branch prediction – in static branch prediction the taken/not-taken decision is the same for all cases, so at some point the static prediction will lead to a wrong decision. In dynamic prediction, the decision is made by looking at the instruction execution history
The probability of a branch being taken or not depends on the branch decisions taken so far.
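A common history-based scheme is the 2-bit saturating counter. The slides do not name a specific predictor, so this is an illustrative choice consistent with the idea of deciding from execution history:

```python
# Sketch: a 2-bit saturating-counter dynamic branch predictor.
# States 0,1 predict not-taken; states 2,3 predict taken. A single
# misprediction does not flip a strongly-established prediction.

class TwoBitPredictor:
    def __init__(self):
        self.state = 0  # start in "strongly not taken"

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(self.state + 1, 3)
        else:
            self.state = max(self.state - 1, 0)

# Assumed workload: a loop branch taken 9 times then not taken once, 10 times over.
history = ([True] * 9 + [False]) * 10
predictor = TwoBitPredictor()
correct = 0
for taken in history:
    if predictor.predict() == taken:
        correct += 1
    predictor.update(taken)
print(correct, "of", len(history), "predictions correct")  # 88 of 100
```

After warming up, the predictor misses only the final not-taken iteration of each loop, which is exactly the behavior a single static prediction cannot achieve.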
Control Hazards
Reordering Instructions
CISC/ RISC
CISC – Complex Instruction Set Computer
Allows different instructions to have different numbers and sequences of basic steps
Allows instructions of different lengths
RISC – Reduced Instruction Set Computer
All instructions have a fixed length of 1 word
All instructions require equal execution time
RISC
Disadvantages of CISC:
Complex compilers – complex machine instructions are often hard to exploit because the compiler needs to find the exact machine instructions that fit the HLL construct. Having a simple (reduced) instruction set means that there are fewer instructions to choose from
Smaller programs are not necessarily faster – ML programs with fewer instructions take less space, but the number of basic steps to execute each instruction is greater
RISC
Characteristics of RISC processors:
One instruction per cycle – with simple, one-cycle instructions there is little or no need for microcode; the machine instructions can be hardwired. Such instructions should execute faster
Register-to-register operations – most operations are register to register, with only simple LOAD and STORE operations accessing memory. This design feature simplifies the instruction set and therefore the control unit
Simple addressing modes – almost all RISC instructions use simple register addressing; this simplifies the instruction set and the control unit
Simple instruction formats – only one or a few formats are used. Instruction length is fixed; field locations (especially the opcode) are fixed. The benefits are:
Opcode decoding and register operand accessing can occur simultaneously
Simplified formats simplify the control unit
Instruction fetching is optimized because word-length units are fetched
Pipelining is optimized easily due to these RISC features
Superscalar and VLIW Processors
Superscalar and Very Long Instruction Word (VLIW) processors maintain multiple instruction execution pipelines
Instructions are scheduled in parallel and simultaneously executed in these pipelines
Superscalar/ VLIW Processors
The superscalar/VLIW approach depends on the ability to execute multiple instructions in parallel.
Instruction-level parallelism refers to the degree to which, on average, the instructions of a program can be executed in parallel
Data and control hazards are even more difficult to deal with
Superscalar Processors
In a superscalar processor, a normal machine code program is made available to the processor.
The processor schedules these instructions to its pipelines after resolving the hazards.
[Diagram: an Instruction Queue feeding several four-stage instruction execution pipelines (Stage1 → Stage2 → Stage3 → Stage4)]
VLIW Processors
In a VLIW processor, a specially designed compiler resolves the hazards at compilation time to prepare a machine code program of very long instruction words.
Each of these VLIWs contains several instructions, executable in parallel, to be executed in the available pipelines.
[Diagram: a VLIW feeding several four-stage instruction execution pipelines (Stage1 → Stage2 → Stage3 → Stage4)]
Parallel Processors
A taxonomy of different types of parallel processors was put forward by Flynn, as follows:
Single Instruction Single Data (SISD) stream – a single processor executes a single instruction stream to operate on data stored in a single memory. E.g. uniprocessors
Single Instruction Multiple Data (SIMD) stream – a single machine instruction controls the simultaneous execution of a number of processing elements on a lockstep basis. Each processing element has an associated data memory, so that instructions are executed on different sets of data by different processors
Multiple Instruction Single Data (MISD) stream – a sequence of data is transmitted to a set of processors, each of which executes a different instruction sequence. Not available commercially
Multiple Instruction Multiple Data (MIMD) stream – a set of processors simultaneously execute different instruction sequences on different data sets
Parallel Processors
Parallel Processors
MIMDs can be further subdivided by the means by which the processors communicate:
Multiprocessors: processors share a common memory; each processor accesses programs and data stored in the shared memory, and processors communicate with each other via that memory.
Multicomputers: processors have individual memory areas; these are basically collections of independent uniprocessors/multiprocessors, and communication among the computers is either via fixed paths or some network facility. Also called clusters.