Levels of abstraction
-
Architecture (high level)
-
functional behaviour
-
instruction set architecture
-
programmer visible instruction set
-
-
-
logical structure or organisation that performs the architecture
-
micro-architecture or organisation
-
this shows the constituent parts of the system and how they are interconnected and how they interoperate in order to implement the architectural specification
-
-
includes things like
-
the pipeline details
-
caching
-
-
-
Realisation (low level)
-
physical structure that embodies the implementation
-
hardware
-
Computer architecture
-
need to make decisions and trade-offs across the levels
-
optimise by
-
make common case fast
-
explore predictable behaviour
-
identify possible parallelism
-
-
“design for tomorrow not yesterday”
Performance gains, come from:
-
exponential increase in number of transistors
-
less logic between registers
-
pipeline depth increases
-
increased number of Instructions per Clock cycle (IPC)
Tech trends
-
VLSI tech scaling
-
smaller wires and transistors
-
gate capacitance decreases with feature size
-
-
FO4 metric
-
used to compare designs and normalise out impact of fabrication tech
-
is the delay of an inverter driving and inverter four times as large
-
-
scaling wires means resistance gets worse unless wire becomes shorter
-
problems with long wires
-
time to traverse a die increases quickly
-
have to have partitioning in complex designs
-
large centralised structures don't scale well
-
-
scaling leads to process variation
-
with a very large number of transistors on chip many will not work
-
one time production tests are not possible, chips will need to
-
be able to self reconfigure
-
have fault tolerant designs
-
-
-
is the hardware/software interface
-
includes
-
word size
-
operations
-
registers
-
operand types
-
addressing modes
-
instruction encodings
-
memory architecture
-
trap and interrupt architecture
-
-
backwards binary compatibility almost always a must
-
can remove the OS and ISA tie for applications by one of:
-
use virtual ISA (virtual machines)
-
translate from assembly to machine code at run time
-
most modern Intel processors transform to RISC internally
-
-
-
dynamic-static interface:
-
i.e. what is exposed to software and what is hidden by hardware
-
-
-
poor compiler technology
-
memory was expensive so code size important
-
complex instructions executed using micro-code internally
-
problems with microcode
-
efficiently controlling a highly concurrent datapath using microcode is complex
-
CISC exceptions in pipelines are complicated
-
microcode engine is an overhead
-
-
-
-
exposes pipeline to the compiler
-
only read the registers in the execute stage
-
worked as had different set of trade-offs
-
goals
-
common case fast (instructions, addressing modes etc.)
-
designed for pipelines
-
assumes high level languages
-
-
-
word size
-
is the native size of the integer register
-
64 bit?
-
most standard integers and pointers will be not using most of the top bits
-
-
-
operands
-
two address format, assume one operand is where store the result, i.e.
-
a = a + b
-
-
one address format – accumulator machine
-
assume second operand is accumulator, and store back to accumulator
-
-
zero address format – stack machine
-
-
memory accesses
-
memory-memory : can do instruction without any register
-
register-memory : memory accessed as part of instruction
-
register-register / load-store : memory only accessed by loads and stores
-
-
procedure calling conventions for registers
-
need to do something about preserving (some) registers on procedure call
-
could spill some automatically (temps preserved by callee manually if necessary)?
-
have certain registers for arguments and return values?
-
Register windows
-
e.g. each proc has 32 registers -
outs of the previous window become the ins of the next one
-
-
-
-
aligned
-
addresses are aligned to the size of the data type which is being accessed
-
still need logic to align anything smaller than a word
-
-
unaligned
-
-
addressing modes
-
classic RISC
-
register
-
immediate
-
register indirect with displacement
-
-
-
conditional things
-
can use
-
condition codes/flags
-
a conditional register
-
compare and branch
-
-
things to consider
-
impact on local code scheduling optimisations
-
general purpose registers
-
number of instructions required to implement a conditional branch
-
cycle time implications
-
-
-
-
-
variable length format
-
fixed length format
-
hybrid format
-
-
consider
-
number of registers/addressing modes supported
-
what type instructions are included
-
size of code/instructions
-
fetch/decode complexity
-
-
16 bit instruction set encoding
-
can get static code size reductions
-
performance and energy overheads of pure 16 bit code can be reduced by switching between 16 and 32 bit
-
-
Make the common case fast

-
some enhancements may slow the common case

where CPI is Cycles Per Instruction
Exploitable properties of programs
-
locality of reference
-
predictable control flow: branch prediction schemes
-
predictable data values (locality of value)
-
e.g. store function inputs and computed results to remove redundant computation
-
attempt to predict data values at run time
-
-
for instruction reference stream
-
temporal locality: loops and function calls
-
spatial locality: apart from branches instructions are executed sequentially
-
-
for data reference stream
-
temporal locality: program variables and the call stack
-
spatial locality: array access, process streams, stack frame
-
-
used to allow operations to be performed earlier that otherwise possible
-
e.g. speculate that
-
we have predicted branch correctly
-
no memory carried dependency between a store and a load instruction
-
Reconfigurable and adaptive hardware
-
the “common case” is highly dependent on the application
-
can the hardware be configured to deal with application characteristics?
-
configure processor just before production
-
allow compiler to create custom machine instructions
-
ability to change micro-architecture parameters at run time (cache configuration, queue length etc.)
-
-
power is a first order constraint
-
performance gains 1986-2002 at expense of sharp power consumption rise
-
low power design
-
dynamic power
-
-
static power
-
-
traditionally have tried to reduce dynamic as static was only a small amount of total power consumption
-
static heading towards 50% now
-
-
-
simulation of computer systems
-
direct hardware prototyping infeasible
-
accurate performance, power and thermal models are needed
-
these are hard to predict in an analytical/theoretical manner
-
-
you can't wait for the H/W to be produced
-
the design needs to be validated before the H/W is available/made
-
-
need flexible system (control over parameters) to explore design space (different alternatives)
-
-
have to trade off between performance and how detailed the simulation is
-
different models
-
functional simulators
-
this models the architecture level (ISA)
-
they are used to verify correct execution of a program
-
-
performance simulators
-
this models the micro-architectural level
-
generates performance metrics to execute a program for
-
cycles taken
-
power
-
thermal
-
-
-
Register transfer language (RTL) models
-
contains implementation (hardware) specific details
-
use to validate performance models
-
-
-
simulation approaches
-
trace driven simulation
-
a trace is generated beforehand by instrumenting software or hardware
-
the simulation is processing the trace
-
-
execution driven simulation
-
needs a functional simulator
-
can cope with speculation and timing dependent computation
-
-
-
-
there is a large design space
-
availability of a simulation tool shouldn't drive the fraction of the design space explored
-
-
measuring cross machine performance
-
generally use benchmarks to compare machines
-
running our application on all machines is usually impractical
-
-
need to ensure that the benchmarks have similar characteristics to the application
-
writers could fudge it so compiler significantly improves for the common case section of code for the benchmark and this code is never common in normal applications
-
-
can get suites of benchmarks to represent an application domain (e.g. desktop, server, etc.)
-
-
data dependency
-
when one instruction j is dependent on instruction i
-
-
hazards
-
structural hazards
-
stalling to wait for a shared resource
-
e.g. if the mul instruction takes two cycles to complete then two consecutive mul's
-
-
-
when parallel execution makes it possible for data dependencies to be violated
-
read after write
-
write after write
-
write after read
-
fixes
-
stall the pipeline
-
use compiler to schedule so don't have data hazards
-
have a shorter pipeline for memory access instructions
-
hence the load/store happens earlier than execute for ALU instructions
-
-
-
-
control hazards
-
on a conditional branch, where fetch the next instruction from?
-
-
-
exceptions
-
precise exceptions
-
it appears as if all those instructions prior to the faulting one have executed and none following it
-
this is not necessarily the case given that we are executing multiple instructions (in different stages of the pipeline) at once
-
i.e. we know exactly which instruction faulted
-
method
-
each instruction is accompanied by its PC and exception status as it moves down the pipeline
-
if an instruction causes an exception it is noted
-
no faulting instruction is permitted to modify state
-
-
at the write back stage deal with (flush pipeline etc.)
-
-
-
could deal with in hardware (whilst stalling the pipeline)
-
-
multi-cycle operations
-
multiplies, divides or floating point operations take more than one cycle in the execute stage
-
could have multiple execute pipelines
-
may need to separate for all stages from the execute in order to not interrupt the MEM or WB for following instructions
-
this could introduce more possible data hazards
-
-
-
limitations of pipelining
-
adding stages increases complexity
-
pipelining registers are not free
-
by increasing the depth the number of stalls due to hazards increases, especially
-
branch prediction miss costs
-
data dependencies
-
need more cycles to access memory
-
-
Branches
-
conditional branches implemented by
-
condition flags/codes
-
this may constrain ordering of instructions: introduces data dependencies
-
-
condition registers
-
e.g. cmp r3, r1, r2
-
is expensive in register use
-
-
compare and branch
-
e.g. blt r1, r2, loop branch is r1 < r2
-
-
-
ways to deal with control hazards
-
stall the pipeline
-
predict not taken
-
assume branch is not taken and fetch instructions accordingly
-
-
evaluate branch earlier in the pipeline
-
reduces the branch penalty, but there still is some
-
-
delayed branch slot
-
assume the n instructions following a conditional branch are taken
-
compiler has to fill then with instructions for which this is ok, or nop's
-
hard to do, can only do 60-70% of the time for one slot
-
-
-
static branch prediction
-
at compile time decide which way you think the branch will go and include that in the code
-
can do using simulation runs
-
-
fill branch delay slot with the instruction of where you think it will go
-
if it goes the unexpected way annul the delay slot instruction
-
-
dynamic branch prediction
-
1 bit or 2 bit saturating counters
-
read MSB to predict
-
update counter
-
if count > 0 and branch-not-taken then dec
-
if count < max and branch-taken then dec
-
-
-
local history predictors-
for each branch keep a shift register tracking whether or not a branch was taken
-
use the contents of the shift register to index into a table (e.g. right) to predict what happened last time this history was seen in the shift register
-
when the actual branch direction becomes known update the bit which was read previously
-
can use a two bit saturating counter etc. for the next instead of a single bit
-
-
global predictors
-
the behaviour of a branch is often correlated to the behaviour of other branches
-
have an n bit shift register which tracks taken/not taken for all branches
-
keep a table for each branch with n entries, like with local history
-
use the register to index into the specific table for this instruction (like with local history)
-
-
combinations
-
implement local and global and select which is best on a per branch basis
-
-
-
-
-
-
-
-
limitations of dynamic
-
takes time to train
-
limited size in terms of bits for the predictor
-
aliasing – index using branch address, but only first 10 bits or so
-
-
-
need to know the branch target by the end of the IF stage in order to fetch the next instruction, if the branch is predicted to be taken
-
keep a small cache mapping branch instruction address to target address
-
called Branch Target Buffer
-
-
return address predictors
-
procedures may be called from multiple places in a program
-
the probability that the address of a return branch is in the BTB is low
-
keep a small stack to store return addresses on
-
-
-
predicated execution
-
all instructions are conditional on flags
-
transforms control dependency into data dependency
-
-
-
ignoring implementation issues a super-pipelined machine of degree M and a superscalar machine of degree P should have roughly the same performance
-
have to find M or P independent instructions which can execute in parallel
-
data dependencies limit this Instruction Level Parallelism
-
-
limits of ILP
-
inside a basic block 2 < ILP
-
if you branch predict properly (bypass conditional branches) can get
-
-
for bypassing k branches
-
-
upper limit is enforced by true data dependencies
-
could in theory be exceeded
-
data value prediction
-
cache and recall common results
-
-
-
-
instruction fetch
-
we need to be able to fetch enough instructions to keep the superscalar pipeline busy
-
limitations
-
instruction cache performance
-
branches
-
instruction fetch and alignment issues
-
it's difficult to fetch unaligned instructions from non sequential addresses
-
need to be able to fetch multiple basic blocks per cycle
-
-
-
trace cache
-
cache basic blocks
-
include the branch prediction for the branches in the basic blocks cached
-
-
-
register renaming
-
name dependencies (as well as data dependencies) limit exploitation of ILP
-
there are more registers available than those made available to the compiler
-
one method
-
keep
-
a list of free physical registers
-
a map from architectural registers to real registers
-
-
take each destination register, find a free physical register and record the mapping in the table
-
take each rouse register and replace with currently mapped physical register
-
-
-
in order issue
-
instructions are handled in program (control flow) order
-
-
dynamic scheduling (out of order execution)
-
maintain a buffer of instructions waiting to execute
-
issue instructions from the buffer as soon as they are ready to execute, in any order
-
has to wait for operands to be produced
-
has to wait for functional unit to be available
-
-
issues:
-
load/store instructions
-
instructions which take more than one cycle
-
-

-
-
memory aliasing: two memory references involving the same location
-
memory disambiguation: process of determining if two memory references will alias or not
-
memory accesses consist of
-
calculating the address
-
accessing the data cache
-
-
load/store queues
-
put loads and stores in separate queues until they are ready to be retired
-
-
restrictions on out of order memory execution
-
store operations cannot be undone
-
must not be done based upon speculation of branch prediction
-
any earlier exceptions must be handled before writing back
-
-
memory carried dependencies: must not read a memory location before it has been written to by an older instruction
-
shared memory consistency models, e.g. some multiprocessor systems will require that loads access the same memory address in control flow order (load-load violations)
-
-
implementation
-
stores
-
execute in program order
-
-
loads
-
store to forward loading
-
all loads search the store queue for older instructions accesses the same memory location as it
-
if found use the data from the youngest store instruction
-
-
speculative loads-
can you read in this case, when you don't know the address of one of the stores?
-
can do speculatively and detect dependency later, if it exists
-
raise exception and restart from problem load
-
-
can keep a record of those instructions whose speculative loads fail
-
next time force them to wait
-
-
-
load-load violations
-
check no younger load has already read the same memory location
-
-
-
-
complexities
-
have to compare whole memory addresses
-
load/store queues must be content addressable memories and multiported
-
memory dependencies are not static
-
-
-
precise interrupts
-
if an instruction raises an exception, due to out of order execution those later in the control flow may have already executed – need precise interrupts
-
techniques
-
-
-
-
reorder buffer: retire instructions in order
-
-
-
-
-
-
don't commit register writes until all earlier instructions have completed
-
if there are any mispredicted branches/exceptions simply clear the buffer and restart
-
now operand values could be in the registers or the register buffer...
-
-
-
-
-
-
unified register file
-
-
-
-
-
-
store all registers in a single large register
-
requires
-
a map of the “current” (future) mapping of architectural registers to physical registers????
-
a map that is updated as instructions commit
-
handling mispredicted branches
-
can attempt to handle mispredicted branches before they commit
-
save the register map table each time encounter a branch for easy recovery
-
-
-
-
-
-
limits of superscalar processors
-
diminishing returns attempting to exploit ILP
-
complex to design and verify
-
-
VLIW and superscalar are at opposite ends of a spectrum
-
NB, compiler scheduling still affects superscalar in terms of the pool of instructions from which it executes
-
-
packs multiple independent operations into one long instruction
-
execution is statically scheduled by compiler
-
hence can have simpler hardware
-
-
an example of the make up of a VLIW is
-
two int ops (one cycle latency)
-
two memory ops (three cycle latency)
-
two floating point ops (four cycle latency)
-
-
we need to make architectural changes to expose greater amounts of ILP to aid the compiler
-
local scheduling (inside a basic block)
-
there is very limited ILP here, as often it is only 4-7 instructions long
-
-
loop unrolling
-
method
-
replicate the body of the loop n times
-
adjust memory accesses (+/- some constants) and the counter variable for the change
-
-
advantages
-
increases basic block size so permits more ILP (possibly)
-
reduces some increment/decrement of counter instructions
-
-
disadvantages-
increase in code size
-
may use more registers
-
-
-
for (i = 0; i < 10; i++)
a[i] = a[i] + 1;
-
-
-
the above can be unrolled to the right and scheduled as
ld r1, 0 (r2)
add r3, r1, #1 , ld r4, 4 (r2)
sw r3, 0 (r2) , add r5, r4, #1 , ld r6, 8 (r2)
...
-
schedule instructions from different iterations of the loop
-
-
global scheduling
-
the above (including the loops) does not include conditional branches
-
may be able to move instructions between basic blocks
-
must respect data dependencies
-
-
there is a very large search space in terms of what can be moved
-
optimisation will be hard
-
also shouldn't attempt to optimise all basic blocks, but focus on those who will be executed more often
-
-
-
trace scheduling
-
make the common case fast!
-
find the most common path (call it a trace) by
-
static branch prediction
-
profiling (running and measuring) the code
-
-
optimise the instruction schedule with this path
-
-
-
-
superblocks-
a superblock is a trace without side entrances
-
can remove side entrances by creating a copy of the basic blocks
-
this is called tail duplication
-
hence don't need complex compensation code on trace side entries
-
-
-
-
hardware support to aid the compiler
if (A == 0) { S = T }
bnez r1, L
add r2, r3, 0
comvz r2, r3, r1
-
-
-
can be used to eliminate some branches
-
is particularly good for hard to predict branches (those not in loops)
-
annulling instructions
-
predicated instructions are not removed as soon as they enter the pipeline if their conditions aren't met – leave till late in the pipeline
-
hence all predicated instructions are executed, and later annulled if their conditions weren't met
-
if a large number of instructions are predicated this can degrade performance (as there are many instructions being executed which don't add change anything)
-
-
-
speculation
-
better schedules can be created if the compiler is permitted to speculate
-
compiler may move instructions from later basic blocks to earlier ones, assuming the later will be executed
-
if this doesn't happen then the hardware must ensure the exception behaviour is correct
-
i.e. if the instruction moved forward and causes an exception but the later block isn't executed then the exception should not appear to have occurred
-
-
-
memory reference speculation
-
-
VLIW processors
-
rotating register file
-
register allocation for software pipelining is difficult
-
it would be much easier to let the hardware deal with it
-
in a rotating register file the physical register specifier is given by
(rotating register base + register number) mod (number of rotating registers)
-
each time the loop's branch is executed the RRB is decremented, so that a new set of registers are available for each loop iteration
-
do need to account for, once this happens there will still be code using the previous register values, and their register number will have to change to get the same physical register
-
this would be really hard to figure out what was going on in the code
-
-
-
code density
-
if a full set of independent instructions cannot be found the rest of the VLIW must be packed with nop's
-
can have variable length VLIW instructions...
-
-
-
in many applications there may be much coarse grained (thread level) parallelism
-
can even automatically identify threads in a single threaded program
-
-
throughput-
a thread which has I/O will have many times when it is waiting
-
if instead of waiting another job is switched in the overall throughput may be improved
-
however the run time of a single thread may be increased
-
-
-
hardware support for multithreading
-
context switches are expensive
-
could instead have PC and register file per thread
-
effectively need to duplicate all registers
-
-
when switching have to flush the pipeline
-
shorter pipelines will reduce the penalty
-
-
buffering a few instructions from each thread in order to be able to get going straight away
-
each thread can have its own pipeline registers at each stage
-
i.e. if an instruction stalls in the memory write stage those in the preceding three stages are not flushed but instead another thread, which already has instructions in its own pipeline registers in each stage, can start on the next cycle
-
-
-
when to switch threads
-
switch to an alternative thread when the current one stalls (cache miss or I/O etc.)
-
timeout
-
higher priority thread becomes runnable
-
-
fine grained (interleaved) multithreading
-
a new thread is selected on every clock cycle
-
scheduling
-
round robin to each thread
-
insert a nop if no instruction is available
-
this may reduce data dependencies and memory latency, as there is actually n cycles between each instruction (for n threads)
-
that can permit simplifications in the implementation (might not need a data cache...)
-
-
permits predictable performance (verses a pipelined processor with O/S handling the switching between threads)
-
-
programmable thread schedule
-
instead of going around in a round robin manner so that the O/S indicates which thread gets which slots
-
hardware then uses this schedule to select the next thread
-
-
dynamic thread selection policies
-
avoid switching to a thread if it's stalled
-
dynamic thread priorities could be supported
-
-
-
-
simultaneous multithreading
-
superscalar processors with out of order execution can be modified to permit instructions from many different threads to be in flight simultaneously
-
-
impact of multithreading on the memory system
-
protection issues
-
cache pressures
-
since have more threads each needing a minimum working set, have a much larger combiner working set
-
-
load store queues
-
hardware needs to be thread away are respect the memory consistency model
-
-
how much sharing?
-
which do we want, multithreading or chip-multiprocessor
-
-
-
alternative to cache memories
-
explicitly managed scratch-pad memories
-
software managed caches
-
-
unified/mixed cache
-
this is a cache which holds both instructions and data
-
-
advantages of separate caches
-
pipelined processors often want to access instruction and data memory in the same cycle
-
separate caches permit this
-
-
advantages of unified caches
-
can dynamically adjust the proportion of memory allocated to instructions or data
-
-
other possible classifications memory/caches
-
divide the data cache into scalar and array caches
-
permit cache line locking
-
-
block size
-
cache divided into blocks/cache lines
-
on cache miss must load a minimum of one block of data from main memory into the cache
-
larger block size
-
better exploits spacial locality
-
may reduce the miss rate
-
may increase the miss penalty
-
-
-
direct mapped cache
-
each block can only be in one place in the cache
-
i.e. index by lowest m bits of address
-
suffers from many collisions
-
simple and fast
-
-
n way set associative cache
-
each block can only be in one set in the cache
-
each cache has 2n values in it
-
have to store enough of the address to disambiguate between things which map to the same line
-
-
fully associative
-
content addressable memory
-
can perform a parallel search to locate an address in any location
-
-
CAM-tag/CAM-RAM
-
have multiple banks of fully associative caches
-
only activate a bank (in terms of power) if accessing
-
reduces power requirement
-
increases access time
-
requires custom implementation and has 10% area overhead
-
-
-
least recently used (due to locality)
-
approximating LRU
-
FIFO
-
not last used
-
-
random
-
-
write policies
-
if data isn't in the cache on a write
-
write allocate – allocate a block in the cache and then write
-
no write allocate – write to the next level memory
-
-
options for writing to next level of memory
-
write through – write cache and next level at the same time
-
write back – write back dirty cache lines when the cache line is replaced
-
-
-
reasons for cache misses
-
compulsory – a block is brought into the cache for the first time
-
capacity – number of blocks required the capacity if the cache
-
conflict – banks map to the same set
-
-
virtual addressing
-
how address/tag the cache given virtual addresses
-
aliasing issues with using virtual addresses
-
cache optimisations
-
write buffers
-
keep a buffer/cache for writes to the next level of memory
-
prioritise writes – either get the value from the buffer or pass through to the next level, before the writes in the cache
-
if each entry in the buffer holds a block rather than a word
-
may be able to merge sequential writes to generate more efficient burst writes to the next level
-
-
repeated writes to the same location can be made in the buffer without having to write through
-
-
multi level caches
-
L1
-
small
-
lower associativity
-
-
L2/3
-
off chip
-
large with higher associativity
-
-
multi level inclusion – L1 data always present in L2 cache
-
useful for multiprocessor systems as cache consistency can be checked only by looking at the L2 cache
-
disadvantages
-
may want different block sizes in the two
-
L2 is not that much bigger than L1
-
-
-
-
critical word first
-
often a program may only need a single word
-
provide a desired word first then fill in the rest of the block
-
-
sub block placement
-
permit loading a partial cache block
-
have to have bits to indicate which words are valid
-
-
victim cache
-
small fully associative cache
-
holds lines displaced from the L1 cache
-
-
assist cache
-
a small fully associative cache
-
used to assist direct mapped cache
-
data enters assist cache when L1 misses
-
-
non blocking caches
-
when out of order execution is permitted we want to be able to continue operating in the case of a cache miss
-
hence we need the cache to be able to serve new requests, whilst it is handling a miss
-
-
programming for caches
-
design algorithms so the working set fits in the cache
-
it may be quicker to do calculations repeatedly than to have a large look up table
-
-
reorganise (to enhance locality)
-
data structures
-

data access orders
-
-
loop fusion
-
fuse together loops that access the same data
-
-
array merging
-
references to different arrays may all map to the same cache block
a[i] = b[i] + c[i]
-
could reorganise into
struct merged { int a, b, c }
merged M[N]
-
or could array pad instead of merging
-
-
-
instruction and data prefetching
-
prefetching too early will pollute the cache
-
software prefetching
-
hardware prefetching
-
prefetch on miss – on a cache miss
-
check to see if it is in the prefetch buffer, and if not
-
fetch the required block
-
prefetch the next sequential block into prefetch buffer
-
-
strided prefetch
-
detect regular access patterns and prefetch (e.g. b, b + n, b + 2*n)
-
-
-
main memory
-
north bridge also known as memory hug
-
south bridge also known as I/O hub
-
types of memory arrays
-
static and dynamic RAM
-
ROM
-
content addressable memories
-
-
-
6 or 8 transistors per storage cell
-
reading data doesn't change it's value
-
access time is close to the cycle time
-
it can retain data in standby mode with minimal power
-
much lower latency than DRAM
-
fabricated using the standard logic process
-
SRAM cache and processor can share the same die
-
-
-
-
a single transistor and a capacitor
-
reading
-
the bit line is precharged to vdd/2
-
if the word line is true then the capacitor shares its charge with the bit line
-
the (small) change in voltage is detected
-
-
the read is destructive
-
data is not retained, it must be refreshed
-
optimised for capacity and low cost
-
if a DRAM chip outputs more than a single bit at a time, internally each bit is provided from a separate memory array
-
internally in a DRAM chip during a read
-
open a new page (otherwise known as a row)
-
a page can be many thousands of bits
-
-
buffer the page in a latch
-
hence to read other columns from this row don't need to read the row again
-
-
select certain columns to get bit values
-
DRAMs normally provide multiple (2, 4, 8 ... ) columns at once
-
hence data comes in bursts
-
-
-

-
-
a DRAM chip is made of multiple banks
-
each bank is made of n memory arrays (in order to output n bits at once)
-
since there are multiple banks we can pipeline accesses to different memory addresses in different rows
-
-
DIMMs
-
these provide two ranks, one per side
-
this can be used to increase the bandwidth
-
logically provides a 128 bit channel but is actually 64 bits from each rank
-
-
-
Address mapping
-
careful address mapping can result in lower average memory access delay
-
exploit
-
locality in a row
-
interleave accesses across banks and ranks
-
-
-
things to bear in mind in memory controller design
-
needing to generate signals for the memory, guaranteeing refresh
-
arbitrating between different components performing memory requests
-
power mode management
-
memory access scheduling
-
highly dependent upon access pattern
-
-
-
coping with failure
-
memories have lots of transistors
-
since the structures are regular we can introduce
-
redundancy
-
error detecting and error correcting codes
-
-
causes
-
transient errors
-
radiation
-
-
persistent errors
-
defects in manufacture
-
breakdown over time
-
-
-
error detection/correction schemes
-
parity
-
single bit error correction
-
single bit error correction, double bit error detection
-
-
-
die stacking-
bandwidth and latency are limited by the need to access a remote memory chip across a PCB
-
there is significant delay in going off chip
-
-
< name="fram">non volatile memory (Flash)
-
lower performance than DRAM
-
typically only block (rather than word) sized reads are supported
-
limited erase/write cycles
-
-
choices in terms of the ISA to support vector processing
-
memory-memory vector processors
-
vector register processors
-
lower memory bandwidth requirements
-
it is easy to check dependencies between vector ops
-
-
-
vector registers
-
each vector register holds a fixed number of elements, e.g. 64
-
vector register file has lots of ports so different vector operations can be performed in parallel
-
-
vector functional units
-
these are pipelined so that we can process a new element of a vector on each clock cycle
-
-
vector load/store unit
-
assume data is moved between registers and memory at a rate of one word per cycle
-
-
scalar registers
-
these also provide data as input to vector functional units
-
-
vector processing
Y = aX + Y
l.d f0, a
lv v1, x
mulvs.d v2, v1, f0
lv v3, y
addv.d v4, v2, v3
sv y, v4
-
advantages of vector architectures
-
exploit data level parallelism
-
potentially can energy per operation (by reducing complexity)
-
vector operations specify lots of independent operations which will execute simply in parallel
-
less switching activity in the datapath
-
large reduction (two order of magnitude in the example above) of number of instructions having to be processed
-
regular memory and register access patterns
-
-
-
vector data path-
vector operations specify a large number of independent small operations
-
since we are always operating on the same element from different vectors we can partition the register file and functional units into “lanes”
-
elements are interleaved across the lanes to ensure the results are obtained in order
-
-
vector execution time (time to execute a sequence of vector instructions)
-
start up time is the number of cycles before the first result is produced, i.e. the pipeline depth
-
the initiation rate is the number of elemental operations that are completed per cycle for each vector operation, after the start up
-
-
vector memory systems
-
the start up time for a vector load is the time to get the first word from memory into the register
-
to maintain the above initiation rate the memory must be capable of providing/accepting data at this rate
-
hence need an adequate number of independent memory banks
-
-
-
vector length
-
we can't assume all vectors are 64 elements long
-
keep a register which control the length of any vector operation
-
vector functional units must take a copy, as the value will be changed for subsequent operations
-
-
adjacent elements in a vector may not appear in consecutive memory locations
-
strip mining
-
we do this if the vector operation length is greater than maximum vector length (64)
-
we first operate on an odd sized piece of the vector
-
then operate on however many MVL pieces as necessary until the end of the vector
-
-
-
vector stride
-
we need to be able to read elements from memory separated by a given fixed distance
-
this is to access multi-dimensional arrays
-
-
this requires the load/store operations to be changed
-
“load with vector stride” instruction
-
-
-
vector chaining and tailgating
-
addv.d v1, v2, v3
-
don't really need to wait for the whole vector to be in the register file to start the operation
-
v1[1] = v2[1] + v3[1] only needs the first element from each vector
-
write after read
-
this permits the value in v2 to be overwritten as soon as it is used in the top instruction
mul.d v1, v2, v3
addv.d v2, v1, v5
-
-
-
vector conditional execution using vector mask control
-
we have an extra register which is a boolean vector of length MVL
-
this controls the execution of a vector instruction for each element separately
-
vector operation turns into nop at elements where the mask bit is clear
-
-
-
sparse matrices
-
need to support indirect accesses into arrays
-
introduce load (gather) and store (scatter) vector indexed instructions
-
v1(i) = Mem[t1 + v2(i)]
Mem[r1 + v2(i)] = v1[i]
lvi v1, (r1 + v2)
svi (r1 + v2), v1
-
-
-
pipeline start up time
-
we want to be able to issue the next instruction to a vector FU and keep the pipeline full
-
normally need some recovery time between instructions dispatched to the same FU
-
is a trade off for additional complexity
-
-
exploiting vector processing
-
for scientific and engineering applications
-
on the desktop
-
multimedia and communications code is often scientific style code
-
should exploit the fact that this is vectorisable
-
general purpose registers aren't good at this
-
requirements
-
identical operations on streams of input data
-
data normally has narrow data types (8 or 16 bits)
-
pipelining functions
-
very high memory bandwidth requirements
-
-
could modify the ISA to add SIMD instructions
-
single instruction multiple data (SIMD)
-
e.g. support data parallel operations on wide registers, treating 128 bit register as eight 16 bit values
-
have SIMD functional units and SIMD registers
-
-
-
-

-
using multiprocessors: to exploit N processors we need N threads which can run in parallel
-
threads could be independent processes scheduled by the OS
-
threads could come from a single process and be exposed to the programmer
-
-
interconnection options:
-


shared memory multiprocessor-
large caches permit a small number of processors to be satisfied by a single memory
-
-
-
-
distributed memory multiprocessor
-
the centralised memory can be a bottleneck for a large number of processors
-
can have the memory distributed amongst the processors
-
processors communicating and accessing data not in your memory is complex and expensive
-
models for communication
-
shared address space
-
address space is logically unified but physically distributed
-
distributed shared memory architectures
-
variable access times, depending on where it is
-
-
message passing
-
-
-
-
why not use uniprocessor?
-
exploiting greater levels of ILP starts having diminishing returns
-
limits to pipelining
-
power
-
complexity and verification of
-
-
why use multiprocessor?
-
can limit power consumption by simplifying cores
-
double the cores per generation (technology scaling)
-
we can use existing uniprocessor designs
-
-
cache coherency protocols
-
if there are multiple copies of an address in caches, and it is modified in one, how deal with it in others?
-
options
-
send a message out on the bus to invalidate other copies
-
generates least traffic
-
-
update the shared copies
-
snooping
-
access to the bus is arbitrated
-
all transactions are broadcast and can be observed by all processors in the same order
-
coherence is maintained by having all controllers snoop all changes put onto the bus
-
this assumes single level write through caches
-
-
snooping with write back (not write through) caches-
processors must snoop read requests too and service them if it has an updated version in its cache
-
also have to invalidate the memory access (which was snooped) at main memory
-
-
directory based cache coherency protocols
-
used for distributed memory multiprocessors
-
have to keep a directory (a table) about where every block is and whether it is modified or not, etc.
-
this directory could be kept centrally or with the memory at each node (and to get information about a remote block you have to ask the node which owns it's directory about it)
-
-
-
often four or more states are associated with each block
-
these are used to reduce the number of bus transactions
-
-
generally coherency is maintained at a block level
-
-
memory consistency
-

assume flag and a are 0 to start with -
previously, nothing has been specified about when a written value becomes visible to a reader
-
but the programmer who writes P1 and P2 could very reasonably expect the change in value of a to become visible before the change in value of flag
-
-
memory consistency models provide formal specification of memory semantics
-
sequential consistency
-
a multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in program order
-
this ensures that it appears that memory operations execute atomically with respect to other memory operations
-
implementation
-
this is hard as many common uniprocessor optimisations violate this (e.g. write buffers)
-
caches make it difficult as we have to write for previous write operations to complete
-
one option is using a speculative processor methodology
-
-
-
relaxed consistency
-
unfortunately sequential consistency restricts many performance optimisations
-
often relaxed memory consistency is used instead
-
-
-
-
cache misses (additional reasons)
-
true sharing misses
-
this is when a cache miss occurs in order to fulfil the protocol, and data is transferred
-
i.e. a useful cache miss
-
-
false sharing misses
-
this is when a cache miss occurs in order to fulfil the protocol, and no useful data is transferred
-
e.g. two processors writing to different words in the same block (since the granularity is at block level still have to cache miss even though the data transferred won't be used)
-
-
-
sometimes specific targets (in terms of performance, power, cost etc.) required specialised solutions
-
we want to make the common case fast for this application domain
-
-
options
-
specialised hardware solution (ASIC, FPGA)
-
programmable solution (preferred)
-
-
general purpose solutions have their costs-
they spend lots of effort attempting to find and exploit ILP
-
not very much area is spent on computational power
-
-
gaming and entertainment
-
typical applications
-
huge processing and memory bandwidth requirements
-
don't need to resort to instruction level to find parallelism
-
the general purpose memory hierarchy is inadequate
-
-
GPUs
-
GPUs are becoming increasingly programmable
-
fixed pipeline model has been replaced by the use of programmable processors
-
traditionally it was done with specific functional units doing specific jobs
-
some work on vertices
-
some work on pixels
-
now they're moving away from this to having lots of processors all capable of handling everything
-
-
GPUs are now being used to accelerate applications outside graphics
-
-
IBM cell
-
aim is to be a half way house between general purpose processors and GPUs
-
combines
-
a modest general purpose processor
-
many additional simpler cores which are optimised for floating point or multimedia computations
-
-
-
-
parallel supercomputers
-
issues for parallel computing
-
organisation and orchestration of multiple processing elements so that they may cooperatively solve large problems quickly
-
data locality and interconnection networks
-
programmability
-
-
today's grand “scientific challenges” have vast storage and computational performance requirements
-
nodes
-
need to connect directly into the switching fabric
-
on board-node FPGA is useful as a programmable co-processors to speed up key algorithms
-
-
-
replacing hardware with many cores
-
some applications may appear to need ASIC of FPGA implementations
-
challenging processing requirements
-
need to provide hard real-time response times
-
e.g. to implement communications protocols
-
-
it would be really good if we could do this in software, for
-
design time
-
flexibility
-
-
-
methodology
-
provide a large number of cores
-
the cores are connected with a flexible global interconnect
-
resources are partitioned and dedicated to different parts of the application
-
the mapping is similar to floorplanning in an ASIC or an FPGA
-
sometimes called “software circuits”
-
-
-
this works well as the nature of programs has changed
-
it's not about exploiting instruction level parallelism
-
-
-
-
-
signals may travel across tiles in less than one clock cycle
-
can be order(s) of magnitude better for stream or embedded computations
-
-
process activity can be data driven
-
processor waits for the bus
-
don't have to do a waiting loop
-
switch off logic when not performing useful computation
-
-
programmer describes process bodies and their interconnection
-
mapping to the processor array is handled automatically by the “place and switch” tool
-
-
the interconnect
-
is normally just a simple statically scheduled network
-
i.e. the switches rotate through a schedule fixed at compile time
-
processors link to the switched via a time division multiplexed bus
-
-
manufacturing defects-
we can apply the techniques used in DRAM chips to conceal a row with a defective processor
-
-
-
-
it could be possible to use processors to replace hardware with much more modest programming requirements (i.e. cheap ones)
-
would have hard real time requirements
-
could peripherals and external interfaces be implemented in software?
-
use threads to emulate hardware (including I/O control)
-
can be used to emulate parallel activity seen in an ASIC such as
-
input
-
processing
-
waiting
-
output
-
-
-
-

-
tailoring a processor to an application
-
given a processor design how might we improve its performance characteristics for a given application? (i.e. make the common case fast and efficient)
-
make manual modifications to the RTL
-
add a custom accelerator as a coprocessor
-
adopt a configurable processor approach
-
exploit reconfigurable functional units
-
design the processor to adapt at run time
-
e.g. if the system is not using the cache well then power down part of the cache
-
-
-
configurable processors
-
configure the base processor resources
-
extend the processor's basic capabilities
-
the user specifies extensions in a high level language
-
the process of configuration and extension are automated
-
the RTL description and tool chain are automatically extended
-
these systems impose restrictions on what can be specified
-
this is to simplify verification
-
-
-
some systems can analyse the source code and suggest the best ISA extensions automatically
-
e.g. triple DES requires
-
rotation on 28 bit boundaries
-
bit packing and unpacking
-
frequent table lookups
-
-
support these in the ISA
-
-
combining with chip multiprocessors
-
start with a base chip multiprocessor
-
allow the customisation of size of a tile and the type of the memory
-
use a configurable processor generator to optimise individual processors and functional units
-
optimise chip-wide interconnection networks and buffer size
-
replace some processor tiles with custom accelerators
-
they should use the same interface into the interconnect network
-
-
-
-

