Advanced System Topics

Tasks that can be parallelised between a number of threads

  • coarse granularity

    • parallel ray tracing

  • fine granularity

    • parallel garbage collection

    • lots of communication

 

General approaches

  • identify tasks

  • ensure they operate safely in parallel (basic locks etc.)

  • make it unlikely for them to get in each others way

    • there's no point running two CPU bound processes on a single processor

    • e.g. copy a shared read only structure to reduce locking time

    • try to

      • avoid contention for resources which cannot be shared

      • encourage locality for resources which can

  • reduce overheads introduced by their co-ordination, e.g.

    • context switching time

    • waiting for locks

    • pollution of caches

  • remember, the bottleneck resource is the most important

 

Java

public class Worker implements Runnable {

public void run () {

...

}

...

}

public static void main() {

Worker w = new Worker();

new Thread(w).start();

}

  • each object has a mutex

    • synchronised methods require holding the mutex

  • each object has a condvar

    • to block on an object waiting for a condition to be true

    • you do it after you've acquired a lock on an object (and entered a method sometimes) in order to let go control so another thread can do something you need, but to reacquire the lock afterwards

    • methods

      • wait()

        • atomically release mutex control and be put in waiting status

      • notify

        • wake up one thread which is waiting

        • the woken thread needs to reacquire the lock

      • notifyAll()

        • wakes all the threads blocked on this condvar

        • they all compete for the lock

 

Fine grained parallelism

  • explicitly use pragmas to indicate sections of code which are parallelisable and how

  • implementation is responsible for creating a suitable number of threads

 

Threads

  • separating reasonably independent tasks

    • advantages

      • in modern systems each thread can block/unblock independently

    • disadvantages

      • creating threads in an uncontrolled manner causes contention

      • high load can cause poor response which causes clients to retry causing higher load

  • kernel managed threads

    • usually separate threads sharing the same address space

    • advantages

      • OS can balance threads across CPUs

    • disadvantages

      • moderately expensive to create threads

  • user/language managed threads

    • threads are multiplexed onto a single kernel process

    • implemented within a library

    • advantages

      • fast creation and context-switch times

    • disadvantages

      • can't use CPU parallelism

      • a blocked thread can block the entire thread

 

Commands

  • things to be done

  • e.g.

    • each line of a scene to ray-trace

    • each object to examine in a garbage collector

    • each request received from a remote client

  • should be lightweight to create

  • in terms of Java: an instance of Runnable

 

Executors

  • things responsible for arranging the execution of commands

  • can be expensive to create, e.g. having associated with a thread

  • normally more long lived

 

Thread pools

  • commands enter a single queue

  • each thread takes a command and executes it before taking another

  • items are queued if six threads are running

  • if fewer that two threads are running then new threads are created

  • how to deal with signal queue overflow?

 

More complicated commands

  • above they just run to completion, could instead

    • block (e.g. ongoing interaction with clients)

      • could possibly vary maximum pool size, if suitably independent

    • generate other commands (e.g. parallel garbage collection)

      • adding them to the queue could

        • harm locality

        • cause problems if the queue fills

  • solutions

    • asynchronous I/O (generate new command when previous I/O is completed)

    • next command is responsible for the next step of execution before issuing next I/O

    • in order to reduce harm to locality, attempt to encourage the series of commands to be dealt with by the same thread(s)

 

More complicated thread pools

  • each worker thread has a separate pool

  • try to put a generated command in the pool where the generating command is executed

    • head or tail?

  • if empty take item from another queue

    • head or tail?

    • one or many?

  • useful to have concurrent queues which can be accessed independently at both ends

 

Shared data structures

  • techniques to reduce contention

    • confinement: ensure some objects are always local, no need to lock/unlock them separately

    • accept stale/changing data:

      • particularly allowing reads without locking

      • what's the worst that can happen?

      • can stale data be detected?

    • Copy on write

      • assumes writes are rare

    • reduce lock granularity

      • lots of small locks instead of a few larger ones

 

Hardware architectures:

  • Cache coherent shared memory multiprocessor, with

 

Write back cache coherency

  • write back only when line is replaced or other set intervals

  • requires there to be at most one dirty version of a cache line

 

 

 

 

 

 

SMP cache coherence

  • SMP snoop requests on the bus

  • multiple processors can read a line

  • use BusRdX to read for exclusive ownership

  • other processors snoop the BusRdX and invalidate their own copies

  • flush aborts remote bus cycle and writes back to memory

 

 

 

ccNUMA

  • a node is the grouping of CPUs with their local memory

  • interconnection permits message passing between nodes

    • no bus is shared

  • directories provide serialisation point

    • one directory per node

    • tracks caching of memory blocks belonging to its node

  • CPUs issue read write requests to the home node of the memory block

    • local node: the node where the CPU making the request is

    • home node: the node where the memory (which the CPU is trying to access) lives

    • remote node: any node that has the block in question cached

  • interconnect bandwidth can be problematic

  • this can be very, very bad delay in worst cases

 

Memory consistency issues

  • remember that instructions are executed out of order

  • can need to insert memory barriers to enforce ordering

  • programs using mutual exclusion generally needn't worry about this

 

Compare and swap

seen = CAS (address, old, new)

  • atomic operation

    • examines contents address

    • if they are equal to old then write new there

    • returns the value seen there before the (possible) write

 

Spin locks

  • naïve implementation

    • disadvantages

      • the CAS operation must acquire exclusive access to the entire cache block where the boolean variable locked is located

      • access to this block will probably be accessed by another as soon as one relinquishes, bouncing around the system, often with remote nodes caching it and the related overheads

      • the interconnect will probably become saturated

      • spinning will harm the performance of other processes on the machine

        • if the thread holding the lock is on the same processor than harm that one too

  • slightly better implementation

void lock() {

while( CAS (&locked, false, true) != false) {

while (locked == true) { }

}

}

  •  
    • advantages

      • threads don't waste resources doing CAS over and over

      • doesn't cause interconnect traffic

    • disadvantages

      • stampede for lock when it becomes available

  • other options

    • use other types of locks (e.g. multiple reader single writer MRSW)

    • introduce a purely local delay between seeing the lock available and going for it

    • explicitly queue threads and ensure the one at the front of the queue gets the lock

  • Linux “big reader” locks

    • assumes reads are much more common than writes

    • there is a series of per CPU or per thread MRSW locks

    • a reader acquires just a lock for that CPU

    • a writer must acquire the lock for all the CPUs

  • queue based spin locks

    • each thread

      • spins on a separate location/object

      • keeps a reference for who gets the lock next

    • adding a new node safely

      • prepare a new node

      • takes a copy of the tail reference

      • uses CAS to update tail to refer to its node

      • write the next field of the previous tail to refer to itself

    • how tell the next thread to unblock?

      • if next is not null then wake the successor (set its value to false)

      • otherwise

        • attempt to CAS the tail from itself to null

        • if that fails then something is attempting to add itself to the list

          • spin watching next until it is non null

 

Mutual exclusion

  • disadvantages

    • deadlock due to circular waiting

    • priority inversion problems

    • data shared between an interrupt handler and the rest of an OS

    • pre-emption whilst holding locks

  • performance concerns

    • programming with a few big locks is easy but prevents many valid concurrent actions

    • programming with lots of little locks is tricky

 

Non blocking data structures

  • guarantees that the system as a whole can still make progress if any finite number of threads in it are suspended

  • non blocking designs are categorised (strongest first) as

    • wait free: per-thread progress bound

    • lock free: system wide progress bound

    • obstruction free: system wide progress bound if threads run in isolation

  • CAS is a universal primitive for building wait free designs (i.e. it can build anything)

  • linearisability

    • e.g.

      • one thread invokes get_key_for('red') on a lookup table

      • another thread concurrently invokes insert(10, 'red'); delete(20) on the same table

    • this is important for correctness of non blocking structures

    • each operation should appear to take place atomically at some point between which when it was invoked and when it returned

    • this is more strict than serialisability

    • often implemented by making a single CAS operation which atomically which checks and/or updates all of the state that the result depends on

 

Software transactional memory

  • ensures all accesses within a transaction are linearisable

  • is a research topic

 

Summary

  • lots of threads usually means lots of context switching

  • using ~number of processors is good if the threads are CPU bound

  • excessive contention and low locality will lead to poor performance

  • try to ensure threads are just using local resources as much as possible

  • designing scalable shared data structures is hard and depends on

    • the workload

    • the execution environment

 

 

Internet properties

  • as opposed to with POTS, with the internet all the intelligence is in at the sender and receiver, not the network itself

  • connectivity

    • ISPs want to limit connectivity

    • connectivity turned off by default, on by config

    • they only want to carry traffic if one of the sender or received is a paying customer

  • autonomous system number (ASN)

    • sort of like an IP number but for networks

  • routers talking to routers (control plane)

    • tend to use layer 2, except BGP

    • routing computation is distributed among routers within a routing domain

    • routing messages are usually not routed but exchanged on layer 2 between physically adjacent routers

      • internal BGP and multi-hop external BGP are exceptions

    • distributed routing

      • link state

        • runs Dijkstra locally

        • all need to know is neighbour's best route

        • see DigiComms

        • fast but needs more state

          • converges fast but doesn't scale

      • vector state

        • see DigiComms

 

 

 

 

 

 

 

 

 

 

IP forwarding

  • forwarding table

    • includes

      • destination

      • next hop

      • interface (the card you're going to send it out on)

    • is configured

      • statically by administrator

      • dynamically

        • routers exchange network reachability information

        • is adaptive

        • dynamic routing adjustments operate more slowly than fluctuations in traffic load

  • addresses

    • classful

    • cider

  • routing: establishing an end to end path

  • forwarding: determining the next hop

Shortest path

  • shortest path problems

    • single source: find a shortest path from a given vertex to each of the other vertices

    • single pair: find a shortest path between a given pair of vertices

    • all pairs: find a shortest path between al pairs of vertices

  • Bellman-Ford algorithm

    • O(V * E)

Bellman-Ford(G, w, s)

1 for each vertex v ∈ V[G] do

2 d[u] ← ∞

3 π[u] ← nil

4 d[s] ← 0

5 for i ← 1 to (|V[G]| - 1) do

6 for each edge (u, v) ∈ E[G] do

7 relax(u, v, w)

8 for each edge (u, v) ∈ E[G] do

9 if d[v] > d[u] + w(u, v) then

10 return FALSE

11 return TRUE

Where it only returns true when there are no negative weight cycles

w is the weighting function

 

Relax (u, v, w)

1 if d[v] > d[u] + w(u,v) then

2 d[v] ← d[u] + w(u, v)

3 π[v] = u

  • Dijkstra algorithm

    • only works on graphs without negative weight edges

    • more efficient than Bellman-Ford

    • O(E + V log V)

Dijkstra(G, w, s)

1 for each vertex v ∈ V[G] do

2 d[u] ← ∞

3 π[u] ← nil

4 d[s] ← 0

5 S ← {}

6 Q ← V[G]

7 while Q ≠ {} do

8 u ← Extract-Min(Q)

9 S ← S ∪ {u}

10 for each vertex v ∈ Adj[u] do

11 Relax(u, v, w)

  • Routing Information Protocol (RIP)

    • distance vector

    • simple to configure and implement

    • doesn't scale up, is designed for small LANs

    • routers exchange their entire distance vector every 30s

    • count to infinity, infinity = 16 in this protocol

 

  • Open Shortest Path First (OSPF)

    • link state protocol

    • scales well

    • is complex

    • has rapid, loop free, non count to infinity convergence

    • areas

      • OSPF networks are divided into areas to help scalability

      • there is a backbone area and a set of k other areas

      • non backbone areas

        • the link state database inside each of these must be synchronised at all routers

        • a router doesn't have detailed information about topology outside its area

          • this helps reduce the size of the required database

      • route summarisation and filtering is possible between different areas

 

Autonomous routing domains

  • an autonomous routing domain is a collection of physical networks “glued together” using IP, which have a unified administrative routing policy

  • an autonomous system is an autonomous routing domain which has been assigned an Autonomous System Number (ASN)

  • scalability

    • AS graphs do not show topology, but only that particular autonomous systems are connected (somehow)

    • throw away topology information (internal structure) to make it stable

  • ASNs can be shared

  • multihoming with CIDR

    • this is when a customer wants to connect to the internet (advertise its IP address space) through two separate ISPs

      • e.g. to reduce single point of failure

    • to establish correct communication with the Internet, both ISP1 and ISP2 must advertise the subscriber's specific address space of 205.113.50.0/23

    • if ISP2 does not advertise this address, all the subscriber's incoming traffic passes through ISP1

    • if ISP2 advertises 205.113.50.0/23, whereas ISP1 advertises only its own CIDR block, all the subscriber's incoming traffic matches the more-specific route and passes through ISP2

    • requires

      • ISP1 to modify its filters and “punch a hole” in its own CIDR block

      • ISP2 to advertise some of a competitor's address space

  • transit ASs

    • a transit AS permits traffic with neither source nor destination within AS to flow across the network

    • a non-transit AS only permits traffic with source or destination within AS

      • this is easy to do, e.g. for C all that happens is you don't advertise to B that you have a route to A, and you don't advertise to A that you have a route to B

 

 

Inter domain routing

  • preferences

    • prefer customer routes over others

    • use provider (higher level) routers as a last resort (as they are charged for)

    • don't provide transit for peers

  • the ISP closest to a customer is responsible to ensure blocks announced by the customer are owned by the customer

    • i.e. verify the customer's address space!

    • otherwise the customer could accidentally announce some domain and traffic would go there rather than the correct place, “black holing” the site

  • Border Gateway Protocol (BGP)

    • policy based

    • horrible, horrible hacked things which no-one really understands

      • e.g. on the path of where the thing (packet?) has been, when you give it to someone else you hack the list of hops so that it goes to yourself over and over without going anywhere else, in order to make this path (and hence using your resources) less preferable

    • count to infinity is minimise by throwing away something if it has you in its path

    • remember every route your neighbours ever told you

      • if one goes down select the next best

    • inter domain routing is not based on shortest path

      • these paths may be contrary to commercial agreements

      • e.g. going to a third regional ISP via a second from a first, the first is getting a free ride, since it's business is providing regional connectivity, the national ISPs are for getting between regions

  •  
    • peers do not provide transit for other peers

    • operation

      • open a connection

      • share best data

      • after that send only keep-alives and changes

    • is not guaranteed to converge

      • can require lots of manual intervention

      • can be very hard to figure out what went wrong

        • “reboot and restart, put it down to a CISCO bug”

    • types of BGO

      • EBGP is for between different ASs

      • IBGP is for within the same autonomous system

    • dynamics of BGP, you can have any two of the following (at odds) goals

      • fast convergence

      • minimal updates

      • path redundancy

    • there are normally a lot of updates, the reason for this could be any or none of:

      • there's always something going on in the big wide world of the internet, so it's just good housekeeping

      • BGP exploring alternate paths

      • IGP instabilities being exported to the internet

      • bad trade offs in router software

      • BGP sessions being reset due to congestion

 

  •  
    • can reduce updates by

      • rate limiting updates

        • only send updates every 30s or so

        • a router can change its mind many times about the best route without having to tell its neighbour every time

        • this slows down BGP exploring alternate paths

      • route flap dampening

        • routes are given a penalty for changing

        • the total penalty decays over time, but if it goes over a limit then the route is dampened

        • this is bad for a mesh, as it punishes well connected destinations

  • implementing customer-provider and peer-peer relationships

    • enforce

      • transit relationship (outbound route filtering)

        • i.e. don't let any transit traffic through

      • order of route preference

        • provider < peer < customer

 

Link state routing

  • router knows entire network topology, computers shortest path itself

  • link state packet (LSP)

    • cost of link between itself and its neighbour, e.g. (from A) {A; B; 4}

  • control flood the network with LSPs

    • store LSPs

    • if one comes in that is new send it out on all routes besides incoming one

    • network with E edges will copy at most 2E times

  • use sequence numbers to tell if a LSP is new

    • sequence numbers run out/don't know which to use on boot up

    • solutions:

      • ageing

        • put a timeout value in the header

          • and control flood notification of the event so that routers that don't know the time can do it too

        • on booting a router has to wait until all the old LSPs are purged and refreshed

      • lollipop

        • need a unique start sequence number

        • a is older than b if

        • if a router gets an older LSP tell sender about newer LSP

  • recovering from a partition

    • the two nodes on either side of the re-established link co-ordinate to update databases

 

Distance vector routing

  • node sends Distance Vector to all its neighbours

    • that says what the node's best idea of distance to every node in the network, from it

  • advantages

    • distributed

    • adapts to traffic changes and link failures

    • suitable for networks with multiple administrative entities

  • problem: count to infinity, can deal with using

    • DV has path vector

    • triggered updates (faster count to infinity)

 

 

Memory models

  • shared memory model

    • collection of threads sharing address space

    • read writes on memory locations implicitly and immediately globally visible

    • used for tightly coupled systems

    • advantages

      • easy to use

      • scalable

    • disadvantages

      • race conditions

      • synchronisation issues

  • message passing model

    • collection of process with private address space

    • used for loosely coupled systems

    • advantages

      • control

      • protection

    • disadvantages

      • performance

  • demand paged virtual memory

    • run time mapping from logical to physical address space (by MMU)

    • use page table to indicate whether a page

      • should be accessible (valid) for that process

      • is in memory

    • page faults on a uniprocessor

      • check whether the page is valid

        • if not then kill the process

      • otherwise page in the desired page and restart the process from the faulting instruction

  • distributed shared virtual memory

    • have a large logical DSVM (e.g. 264 bits, is roughly 18 pico-bytes)

    • each processor has its own physical memory

    • each page

      • has a home processor

      • can be mapped into a remote (other processor) address space

        • on read then page in across network/interconnect

        • on write access sort out ownership

    • the DSVM library or OS is responsible for

      • tracking current ownership

      • copying data across the network

      • setting access bits to ensure coherence

    • implementation: centralised page manager

      • a single manager (on a single processor) keeps, for each page p

        • owner(p), the processor which created or last wrote to p

        • copyset(p), the processors with a copy of p

      • a read fault has four messages

  1. faulting processor contacts the manager

  2. the manager forwards to owner

  3. the owner sends page

  4. requester acks to manager

  •  
    •  
      • a write fault

        • faulting processor contacts the manager

        • the manager invalidates the copyset (contacts all processors which have a copy?)

        • manager contacts the owner

        • the owner relinquishes the page

        • the requester acks to manager

      • we can reduce messages by making manager(p) = owner(p)

        • however now we need to find out who is the manager(p)

          • broadcast a request to find manager(p)

            • this can lead to race conditions in broadcast

          • at each processor keep a record of who you think is manager (from snooping broadcasts, etc.)

      • we can reduce messages by changing the consistency model

        • false sharing (is expensive)

          • P1 owns p, P2 has read access to p

          • P1 writes to p

            • this copies the change to P2, but P2 doesn't care about the change

        • sequential consistency (what we have used up to now)

          • every read sees latest write

          • is easy to use for the programmer

          • is expensive (causes false sharing)

        • release consistency

          • read and writes occur locally

          • processors will explicitly do

            • release

              • this is a serialisation point to ensure all processes can see the writes

            • acquire

              • become the owner (I assume includes the previous owner doing a release before relinquishing ownership)

        • type specific coherence

          • the consistency depends upon type labels given to variables/state/data structures

          • gives the best performance

          • types

            • private memory: ignore (don't synchronise)

            • write once: only permit others to read it (i.e. only service read faults)

            • read mostly: writing is infrequent and owner broadcasts updates

            • producer-consumer: the producer sends the changes to the consumer before it requests them (through a read fault)

            • write many: release consistency + buffering

            • synchronisation: strong consistency (e.g. for locks)

          • this permits optimistic multiple writers

          • almost as fast as hand coded message passing

 

  •  
    •  
      •  
        • lazy release consistency

          • update not on release but on next acquire

          • reduced messages and higher complexity

  • persistent virtual memory

    • virtual memory is (or may be) backed by non-volatile storage (where pages that aren't in use are paged out to)

      • why then should virtual memory have to be volatile? this is persistent virtual memory

    • permits

      • no more distinction between files and memory

      • programmatic access to file structure or databases

        • is easier (e.g. with linked structures)

        • can benefit from a type system

    • persistence of data is the length of time it exists

    • orthogonal persistence

      • the length of persistence of some data does not control how that data is accessed

      • i.e. you don't have to treat files and variable separately

    • implementation options

      • functional/interpreted languages

        • these could fake out an implementation in their language runtime

      • imperative/compiled languages

        • these prescribe the way to access data (e.g. pure OO, pointer chasing, etc.)

        • we can use virtual memory to implement persistent memory

    • Multics virtual memory

      • no file system per se

        • user sees large number of orthogonal linear regions (segments)

          • each segment is backed by secondary store

        • there is an overarching directory tree (like in Unix) which transcends segments

      • segments are created by users and exist until explicitly deleted

      • many concentric rings of privilege

      • directories contain a set of branches (~ inodes)

      • a branch contains a set of attributes for a segment

        • a unique segment ID (identifies to which segment the branch belongs)

        • an access control list for each UID

        • a ring bracket and limit (ACL only applies within the bracket)

        • a list of gates (procedure entry points)

      • however this good security makes it harder to do things (Multics is a victim of complexity)

    • secondary storage got bigger than 232 bits (~4 GB)

      • the words size is 32 bits

        • hence we cannot safely refer to (name) all the data

      • an implementation using pointer swizzling

        • any data in persistent pages is accessed by special 64 bit persistent pointers

        • these persistent pointers are never directly accessed/dereferenced

          • any resident persistent pages are marked as invalid

          • on access, trap and for each persistent pointer p'

            • allocate a new page P and mark it as invalid

            • rewrite (swizzle) p' to be a normal pointer referring to P

            • unprotect the original page and resume

      • 64 bit word sized machines

        • virtual addresses can serve as unique names

        • we can have a single address space operating system

 

  • recoverable virtual memory

    • recoverable virtual memory refers to a region of virtual memory on which transactional semantics apply

    • lightweight recoverable virtual memory

      • full transactional semantics are too expensive

        • just considers atomicity and some durability

        • full transactional semantics can be built on top of LVRM

      • the method:

        • a process maps region(s) of some segment (memory/secondary storage etc.) into its virtual address space

        • invokes begin_transaction(mode)

        • invokes set_range(t, base_address, nbytes)

          • LVRM will normally copy the range

            • it makes a note in the log that it has done this

            • if there is an abort it can hence restore the values, unless the mode is “no restore”

      • this is expensive

        • the redo log gets full – log truncation is complicated

        • up to three copies of the data are made

        • there are expensive synchronous disk writes

      • alternate implementation (Rio Vista)

        • controversial method

        • uses a persistent (non volatile RAM, NVRAM) file cache

          • on map just mmap a region of the NVRAM

          • set_range(...) copies the range into the undo log which lives in NVRAM

        • all updates are immediately durable – no redo log

        • reduced cost

          • no synchronous disk writes

          • no redo log (and hence reduce the number of copies required)

        • in the case of a crash

          • early in the reboot flush the NVRAM contents to disk (this is possible since NVRAM persists)

          • on a map lazily undo any transactions which had not committed at time of crash

        • performance issues if the database doesn't fit into NVRAM

  • capability based addressing

    • a capability is a protected name for an object

      • possession is necessary and sufficient for access

      • need to be unforgeable

      • can be manipulated in a defined and restricted set of ways (passing as a parameter, refined, etc.)

    • these are implementable as

      • hardware

      • software (using cryptography)

 

  •  
    • Cambridge CAP computer

      • controlled what a process can write to registers, not which process (i.e. not who can write to registers)

      • a capability consists of the values

        • base address & limit (defines where and how large the segment is)

        • access code (read only read/write, etc.)

      • memory is divided into segments (like lines or blocks)

        • data segments contain data and instructions, and may be transferred to and from arithmetic registers

        • capability segments contain capability values, and may be transferred to and from capability registers

      • mechanism for loading capabilities into C registers

        • capabilities are passed as arguments in a function call

        • loading is implicit whenever a capability is referred to

      • there are both data and capability registers

        • capability registers will hold the capability for the present process

          • they will limit

            • the data segments accessible

              • i.e. you can't go and access some other process's memory, we'll give an address fault if you do

            • the capability segments accessible

              • i.e. when calling a function you can't pass it any capability segment (and hence any capability) you want, you can only pass those which you own yourself

              • i.e. capabilities can only become more refined (less privileged), not more

        • one has to hold the capability for both C registers and D registers for the same segment, in order to control and create capabilities

          • only done by some highly trusted part of the kernel

      • the CAP is to hardware what scoping is to software

        • with this you can actually prove correctness of programs

    • hierarchies are useful in organisation of flow of control but are unnecessarily restrictive for protection

    • domains of protection

      • the set of capabilities to which a process has access

      • there is a special instruction needed to change domain of protection (ENTER)

      • when leaving a protection domain you can't just leave capabilities lying around in capability registers

      • ENTER and RETURN give a hierarchy of control but not of protection

    • protection of processes

      • this is necessary to support multiprogramming

      • need to define a protection environment for each process (differing privileges)

      • “kernel” ENTERs user process

        • control RETURNs on trap or interrupt

      • requires hardware support

 

  •  
    • relative capabilities

      • rather than each capability be for an absolute base address, we can permit the base of a capability to be relative to the base of some other segment

        • {base, limit, access code, reference}

        • reference is {capability | whole memory}

      • hence a process can hand on a subset of its privileges to a sub-process

      • in CAP

        • the kernel capabilities are in a segment called the master resource list

        • each process has a process resource list with capabilities relative to those in the MRL

        • this makes revocation easy as you only have to revoke the top level process and all subprocesses are relative to it

    • summary

      • capabilities mean that processes run with minimum degree of required privileges

      • hardware is expensive and complex

      • system ends up being slow

 

Microkernel operating systems

  • kernel operating systems became very complex

  • microkernels attempt to

    • simplify the kernel

    • build a more modular system

    • support multiprocessors, distributed computing etc.

  • the new structure attempted to

    • move some functionality into user space servers

    • access servers via some inter process communication (IPC) system

    • increase modularity

  • Mach microkernel

    • provided compatibility with 4.3 BSD, OS/2 ...

    • attempted to

      • support diverse architectures and multiprocessors (SMP, NUMA ...)

      • scale across network speeds

      • have distributed operation

    • a task is an execution environment

    • a thread is a unit of execution

    • IPC is based on ports and messages

      • a port is a generic reference to a resource

      • IPC is message passing between threads

  • L3/L4 microkernels

    • a problem with microkernels: many crossings in and out of the kernel is expensive (e.g. in terms of worse locality)

    • approach:

      • minimise what should be in the kernel

      • make those primitives really fast

    • L3/L4 did that by providing only

      • recursive construction of address spaces

      • threads

      • IPC

      • unique identifier support

 

  •  
    • design/implementation

      • address spaces are supported by the primitives

        • grant: give pages to another address space

        • map: share pages with another address space

        • flush: take back mapped or granted pages

      • threads execute with address space

        • the microkernel manages thread to address space binding

      • IPC is message passing between address space

        • interrupts are handled as messages too

  • EROS

    • a persistent, software capability microkernel

      • motivation behind EROS

        • reliability requires

          • system decomposition

          • access delegation

        • access policy is a run time problem

        • persistence simplifies applications and improves IO

        • mutually suspicious users

      • attempted to challenge the idea that

        • capabilities are slow

        • microkernels are slow

        • capabilities can't support discretionary access control (just passing them on)

    • there are two disjoint “spaces” (like in the CAP)

  1.  
    1. data space

  •  
    •  
      •  
        • set of pages

        • each page holds 4096 bytes

        • we can read and write data to/from data registers

  1.  
    1. capability space

  •  
    •  
      •  
        • set of cgroups

        • each cgroup holds 16 capabilities

        • can read and write to/from capability registers

    • each capability is (type, oid, authority)

      • types

        • basic types are

          • page

          • cgroup

          • number

          • schedule

        • complex types include

          • segment: corresponds to address spaces

          • domain: corresponds to processes

    • persistence is achieved by flushing objects to disk

      • use a log for check-pointing

    • before using capabilities they must be prepared

      • a capability will (in general) be pointing at an object on disk

      • the object is brought into memory

      • the capability is rewritten to point at the object's memory location

    • segment tree

      • this is a tree of cgroups

      • the leaf cgroups refer to data pages (others refer to more cgroups)

 

  •  
    • speed-ups are achieved by

      • keeping a cache (a page-table representation) of the segment tree

        • associate each data page with a capability, like a key|-> value map

        • i.e. don't have to walk the tree to get the capabilities for a data page

        • we simply associate the evaluated capabilities with each page table entry

      • fast capability based IPC scheme

        • hand coded (similarly to L4)

 

Virtual machine monitors

  • virtual machine monitor = hypervisor

    • multiplexes multiple OSs

  • IBM's VM/CMS

    • control program on top of which OSs sit

    • control program provides each OS with

      • virtual console

      • virtual processor

      • virtual memory

      • virtual I/O devices

    • performance is good since most instructions run directly on hardware

  • Disco

    • want

      • to be able to run commodity OSs on ccNUMA

      • fault tolerance between operating systems

      • sharing between OSs

    • to the OS the CPU looks like a real MIPS R10000

      • TLB fills etc. are emulated

  • VMWare

    • we want to have virtual machines for x86

    • there are 17 instructions which have different user/kernel semantic and don't trap

      • these cannot be emulated

      • solution: perform binary rewriting to manually insert traps (hack!)

    • virtual physical addresses for an OS are mapped to actual physical addresses by shadow page tables

    • performance issues

      • don't have the source to make small modifications

      • instead write special device drivers and other low level code

  • Denalie

    • uses VMM as an isolation kernel

      • security isolation: no sharing between VMs

      • performance isolation: VMM supports fairness mechanisms (fair queueing and LRP on network path) and static memory allocation

    • uses para-virtualisation to improve performance

      • uses x86-like ISA

      • have to rewrite the OS to deal with this

    • Yakima is a work in progress OS

      • is written to use the para-virtualisation

 

  • XenoServers

    • vision: XenoServers across the globe that are usable by anyone to host services, applications etc.

    • use the Xen hypervisor to allow the running of arbitrary untrusted code (including OSs)

    • insight

      • guarantee resources in time and space, and then charge for them

      • share and protect CPU, memory, network, disks etc.

    • uses para-virtualisation, but with real OSs

    • implementation (Xen 1)

      • based on low level Linux (i.e. doesn't rewrite bootstrapping)

      • has Xeno-aware device drivers in the OSs

      • a special OS (domain 0) is started at boot time

        • is used to create/suspend/resume/kill other domains (OSs)

      • physical memory is allocated at start-of-day

      • interrupts are converted into events

        • write event to event queue in the domain

        • the domain only “sees” the event when it is activated

      • guest OSs can run a scheduler off of either

        • virtual timer

        • real time timer

      • asynchronous queues are used for network and disk

    • Xen 2

      • supports live migration of virtual machines

        • copy all of memory from one server to another

        • whilst copying watch to see where anything is modified during the continued running of the VM

          • iterate, copying only the sections which are dirtied during the previous copy

          • when down to ~20MB suspend and copy the rest, with a down time of around ~200ms

    • Xen 3

      • supports h/w assisted full virtualisation

      • 32/36/64 bit support

  • conclusions

    • VMs are popular as

      • OS static size is small compared to memory

      • OS level security is perceived as weak

      • flexibility is desirable

    • emerging applications

      • internet suspend-and-resume

        • at the end of the day suspend the VM to disk

        • copy to another site (conference etc.)

      • multi level secure systems

        • many people use virtual private networks from home to work

        • this risks leakage/cross contamination of viruses etc.

        • instead run a VM with only VPN accessibility

 

Extensibility

  • this is basically about allowing variation/changes in some section of the system environment

    • fixing mistakes

    • supporting new features

    • efficiency

    • individualism

      • per-process thread scheduling algorithms

      • customising page replacement schemes

  • different methods to implement this exist

    • low level techniques

      • i.e. give everyone their own (virtual) machine

      • low level software provides

        • virtual hardware

        • some simple, secure multiplexing (getting N pieces of hardware from one)

      • need to ensure not too much control is built into the Virtual Machine Monitor or we go back to the same problem

      • users can choose, reboot or even recompile their OS without logging off

    • kernel level schemes

      • replacing the entire OS can be a little overkill

      • sometimes we just want to replace/modify some part of the OS

        • permit portions of the OS to be dynamically loaded and unloaded

      • problems

        • this requires clean and stable interfaces

        • security: who do you trust to load into your OS?

      • schemes to avoid security problems

        • trusted compiler

          • if you trust the compiler it can provide code with a digital signature, and then you can trust the code

        • proof carrying code

          • the compiler generates a proof, against which the program is checked at load time

          • the proof is of certain properties

            • doesn't read/write/execute outside some logical fault domain

          • you end up needing

            • formal specification language for safety policy

            • formal semantics of the language for untrusted code

            • language for expressing proofs

            • algorithm for generating proofs

            • algorithm for validating proofs

        • sandboxing

          • limit absolute memory references to per-module segments

          • check for certain instructions

          • method: transform untrusted code

            • e.g. insert bounds checking of branches before they occur

          • this does lead to code expansion

 

  •  
    •  
      • SPIN operating system

        • permits extensions to be downloaded into the kernel

        • we want comparable performance to a procedure call, so use compiler check language safety

        • the kernel is written mainly in Modula-3

          • capabilities are Modula-3 pointers

          • the protection domain is enforced by language name space (not virtual addressing)

        • the extensions

          • define events and handlers

          • applications register handlers for specific events

            • e.g. a handler for “select a runnable thread”

        • problems

          • trusted compiler

          • locks

          • termination

      • the VINO operating system

        • downloads things called grafts into the kernel

        • grafts

          • written in C or C++

          • have free access to most kernel interfaces

          • this is achieved safely by sandboxing

          • has to use a trusted compiler

        • prevents resource abuse by resource quotas and accounting

        • prevents resource starvation by timeouts

          • grafts must be preemptible

        • safe graft termination assured by transactions

          • transactions are wrapper functions around grafts

          • all access to kernel data is via accessors

          • transactions

            • use two phase (fine grained) locking

            • an in-memory undo stack

            to achieve the safety properties required

    • user level schemes

      • we can avoid the complexity required of putting extensions in kernel space by putting them in user space

        • e.g. microkernels

      • alternatively reconsider how the kernel works

        • the kernel provides protection and abstraction

          • only abstraction needs to be trusted

        • implementations

 

  •  
    •  
      •  
        •  
          • Exokernel

 

 

 

 

 

 

 

 

  •  
    •  
      •  
        •  
          •  
            • run most of the OS in user-space library

            • attempts to use as few abstractions as possible, as abstractions

              • deny application specific optimisations

              • discourage innovation

              • impose mandatory costs

            • functionality is limited to ensuring protection and multiplexing of resources

              • this is vastly simpler than conventional microkernels' implementation of message passing

            • applications may request specific memory addresses, disk blocks, etc.

              • the kernel only ensures that the requested resource is free, and the application is allowed to access it

          • Nemesis operating system

 

 

 

 

 

 

 

 

 

 

 

 

  •  
    •  
      •  
        •  
          •  
            • guarantees each application a share of physical resources (in space and time)

              • designed to support soft real-time applications which have QoS needs

              • isolation: explicit guarantees to applications

              • exposure: multiplexes real resources

              • responsibility: applications must handle their own data path

            • use interface description language (IDL, the language neutral way to specify communication between different languages) to allow user-space extensibility

            • main principle: the majority of code could execute in the application process itself

            • Nemesis therefore has an extremely small lightweight kernel, and performs most operating system functions in shared libraries which execute in the user's process

              • this leads to a vertically-structured operating system

 

  • conclusions

    • it's more than just a “performance hack” as it

      • simplifies system monitoring

      • enables dynamic system tuning

      • provides the possibility for better system and application integration

    • it allows extensible applications to take advantage of OS flexibility

 

Database storage

  • naïve implementation (relations and directories in ASCII formant)

 

 

 

 

 

  •  
    • e.g. to do select * from R, S where condition you have to

      • read file the directory to get R and S's attributes

      • read file containing R and for each line

        • read file containing S and for each line

          • create a join tuple between that and the R line

          • check condition

          • display if ok

    • issues

      • tuple layout on disk is inefficient

      • search is expensive as there are no indexes

      • no reliability

        • can lose data

        • can leave operations half undone

      • no security

      • no buffer management/concurrency control

  • disk storage issues

    • block size

      • larger block size amortises I/O coast

      • smaller blocks

        • read in less useless stuff

        • takes shorter time

    • we need efficient use of the disk

      • e.g. sorting of data

      • I/O costs are likely to dominate (we want algorithms to do as little as possible)

    • need to maximise concurrency

      • use asynchronous I/O and a database specific buffer manager

    • need to improve reliability

      • use write ahead log to deal with failures mid-transaction

 

  • representing records

    • records can be

      • fixed or variable format

      • fixed or variable length

    • fixed format

      • this is the “traditional” model where you have a schema for the table which specifies

        • number of fields

        • types of each field

        • order in the record

      • the record isn't really interpretable without the schema

    • variable format

      • each record self describes itself

      • you still have to have some basic schema, e.g.

        • the first byte is the number of fields

        • the first byte of each field is a type code

        • the second byte of each field is the type specific information

        • ...

    • fixed length

      • each field is a fixed size

      • e.g. a name field where the name has to be 30 characters or fewer

    • variable length

      • each field can vary in size

      • e.g. a string can be as long as it needs to be

    • hybrid schemes

      • record header with schema id and length

      • fixed record with a variable suffix

  • block level storage of records

    • storing records within a block

      • fixed sized records don't need to be explicitly separated

      • variable length ones might use a separation marker

        • it might be better to have a header at the start of the block with pointers to the start of each variable length record

    • spanning multiple blocks

      • we can implement with pointers at the end of a block to indicate where the next part is

    • mixing record types within a block

      • this can provide a clustering locality benefit for related records

      • normally is too messy

    • order of records is important

      • if we get them sequential and sorted it's easier to do merge-join

 

  • efficient record retrieval

    • instead of keeping the table in memory, keep its index instead

    • we assume that the table is a sequential file ordered by key

      • just because a key isn't in the index doesn't mean the record doesn't exist

      • we could in theory just have the pointer to the first record and traverse the table each time (pointer chasing)

    • dense index

      • advantage: existence checks can be done without accessing the file

      • is required for secondary indexes

    • sparse index

      • advantage: can fit more of it into memory

      • can implement multilevel sparse indexes

        • sort of like multi level page tables (with the second table being dense)

        • entries in the top-level sparse index point to ranges in the lower-level sparse index

        • these in turn contains entries that point to ranges of actual records

    • can also include block pointers (which are for inside the block to which the index pointer points)

    • insertions and deletions are messy

    • an alternative

      • don't focus on the index being sequential, instead focus on balance

      • B+ trees

 

 

 

 

 

 

 

 

  •  
    •  
      •  
        • each node has n keys and n+1 pointers

        • non-leaf nodes

          • each pointer points to nodes where the key values

            • are less equal to than the key value for this pointer, for the left pointer

            • are greater than the key value for this pointer, for the right pointer

        • leaf nodes

          • pointers point directly to record (or across to the next section)

        • balanced tree

          • all leaves are the same depth

          • search is easy and fast O(log(n))

        • insertion

          • if there is space in the leaf then fine

            • “space” is defined by the order of the tree, e.g. 4 (maximum no of keys in each node)

            • a function space also defines the minimum number of keys permissible at each node

          • if there is not space

            • split the node, promoting the middle key to the next level up

            • do this as far up the tree as necessary

        • deletion is harder

          • if the minimum number of keys permissible is not violated then fine

          • otherwise need to one of

            • redistribute keys (and propagate upward)

            • coalesce siblings

        • buffering: want to keep root and higher levels in memory

        • the link between leaves provides for interesting optimisations of sequential access at leaf level

      • B trees

        • a B-tree of order d

        • each node

          • contain at most 2d keys and at most 2d+1 pointers

          • must have at least d keys and d+1 pointers

          • is at least half full

        • insertion and deletion methods are utilised to ensure that the tree remains balanced

        • the longest path in a B-tree of n keys is (approximately) at most log(d.n) nodes

        • leaves are all at the same height in the tree

        • at any node, each key has stored with its associated data value, therefore leaves are simply nodes without any forwards pointers to other nodes

  • spacial indexes

    • spatial data pertains to the space occupied by objects (points, lines, surfaces, volumes etc.)

      • real life applications: roads, cities, countries, internet...

    • this is hard to represent for conventional DBMS

      • it is highly dimensional/continuous

        • hence cannot store it simply as a relation

      • we want to be able to express special queries (e.g. close to, encompasses, intersects etc.)

    • B-tree cannot handle high dimensionality

    • hashing cannot handle range queries

    • possible approaches

      • balanced trees in spatial occupancy (R-trees)

      • multi-dimensional hashing (grid files)

  • Postgres DBMS

    • old DBMS were data management only, i.e.

      • fixed format records

      • traditional transactions

      • queries

    • we now need object management

      • bitmaps

      • vector graphics

      • free text

    • Postgres uses set oriented Postquel

      • advantages

        • variable persistence

        • standard control flow

        • has a small number of concepts so is simple for users

      • disadvantages

        • has a large memory footprint

 

  •  
    • implementation

      • doesn't use a write-ahead log

      • uses a no-overwrite storage manager

        • leave the old version of the record in the database

        • the log is now just 2 bits per transaction stating if the transaction is

          • in progress

          • committed

          • aborted

      • advantages

        • abort is cheap

        • recovery is cheap

        • can support historic “time travel” queries

      • disadvantages

        • need to flush new records to disk on a commit

        • may need multiple indexes

        • disk fills up

          • have to archive it

        • time travel queries are hard to express

  • operating systems and databases

    • operating systems do not work well with databases

      • extra copies to/from disk

      • buffer replacement

        • most OSes use Least Recently Used

        • DBMS accesses are

          • sequential access to blocks without re-reference

          • sequential access to blocks with cyclic reference (which kills LRU)

          • random access to blocks without re-reference

            • should discard immediately

          • random access to blocks non-zero probability of re-reference

            • this is the only case that works ok with LRU

        • the DBMS knows about these things

      • similar issues with prefetching

    • DBMSes end up doing their own buffer pool management in user space

      • issues with user space buffer management

        • we want access to the blocks, not variable byte-length files

        • the buffer pool is in virtual memory, so can suffer from “double paging” (interacting badly with the VM system)

    • other problems

      • multiple trees are wasteful

      • multi-process DBMSes can suffer from priority inversion if there are user-space locks in the DBMS

        • lower priority DBMS processes could hold a lock which a higher one needs, etc.

      • single process DBMSes are complex and can't benefit from multiprocessor support

    • possible solution

      • OS accepts “hints” from the DBMS

      • give DBMS raw partition access

 

 

Distributed storage

  • both file-systems and DBMSes want big, reliable disks

  • RAID

    • redundant array of inexpensive disks

    • can get better performance through striping

      • saving across disks, so can be writing multiple files at once

    • can get better reliability though redundancy

      • write the same thing to multiple disks at once

  • distributed storage

    • separation of data management from applications and servers

    • permits

      • centralised data management

      • more scalability

      • location fault tolerance

      • mobility for access

  • network attached storage (NAS)

    • NAS uses file-based protocols such as NFS or SMB/CIFS where it is clear that the storage is remote, and computers request a portion of an abstract file rather than a disk block

    • runs over TCP/IP

    • more general purpose

    • generally accessed by NFS, CIFS, SQL etc.

  • storage area networks (SAN)

    • appears to software as if it's local storage, i.e.

      • it allows access to blocks

      • the filesystem/DBMS runs on the host (see right)

    • is accessed via encapsulated SCSI

    • more specialised (usually enterprise level)

    • runs over a fibre channel

  • network file systems

    • these are often used to access NAS

    • are mainly remote procedure call (RPC) based, at some level

    • NFS

 

 

 

 

 

 

 

 

  •  
    •  
      • designed (originally, in v2) to be stateless

        • no record of clients or open files

        • no implicit arguments (e.g. a file pointer) to a request

        • no write back caching on the server

        • requests are idempotent (i.e. have the same result even if applied over and over) where possible

        • only hard state is on the server local filesystem

      • NFS v3

        • scalability

          • permits 64 bit offsets

          • removes some limits on file names etc.

        • explicit asynchrony

          • server can do asynchronous writes

          • client sends a commit after some number of writes

            • must keep a cached copy until the commit is successful

      • NFS v4

        • a major rethink

        • is a single, stateful protocol

        • TCP access only (or suitable reliable transport layer)

        • explicit open and close operations

        • file level revocable locks (share reservations)

        • server gives the client some revocable autonomy (delegation)

        • arbitrary compound operations

    • AFS

      • all accesses in this protocol are RPC, even if the file system is local

      • modifications to files are made on locally cached copies

        • when a file is closed the required sections are written back

        • callback: this is a procedure such that each client which has a copy of a file out is informed by the server if another client writes back a modification

      • a volume is a tree of files, sub-directories and AFS mountpoints (links to other AFS volumes)

        • volumes are created by administrators

        • once created, users of the filesystem may create directories and files as usual without concern for the physical location of the volume

          • live replication and relocation of volumes

    • Coda

      • supports disconnected operation

        • under normal operation, a user reads and writes to the file system normally

        • during this the client fetches, or "hoards", all of the data the user has listed as important in the event of network disconnection

        • if the network connection is lost, the Coda client's local cache serves data from this cache and logs all updates (during this stage is called disconnected operation)

        • upon network reconnection, the client moves to reintegration state; it sends logged updates to the servers

        • then it transitions back to normal connected-mode operation

      • cache manager operates in one of three modes

        • hoarding

        • emulating (for disconnected operation)

        • reintegrating (assisted conflict resolution)

      • extensions can support weakly connected (e.g. wireless) operation

 

  •  
    • LBFS

      • Low-bandwidth Network File System

      • used for wide area/wireless etc. networks

        • need to compress an order of magnitude better

      • maintain a persistent cache on the client

      • method

        • server divides its files into chunks and computes a secure hash on each

          • client does the same

        • if the client wants to write a portion of a file

          • first just transfer hashes for the relevant chunks

          • if the server already has the chunks then you can avoid the transfer

          • otherwise it requests whatever chunks are missing

      • possible issues

        • edit in place?

        • fetching files never seen before?

      • more detail of method (deals with issues)

        • file chunk boundaries are based on the contents (data)

        • a Rabin fingerprint is computed over every overlapping 48 byte region of every file

        • if the lower 13 bits match a magic boundary then that position is deemed to be a chunk boundary

          • expected chunk size is 8K

          • use 2K and 64K thresholds to deal with pathological cases

        • hence we can have hits if the contents appear anywhere in any file we know about

          • ~20% redundancy between unrelated files

        • what about hash collisions?

    • serverless file systems

      • all nodes hold some data

        • think peer to peer in the local area

      • xFS

        • this file system includes

          • clients

          • cleaners

          • managers

          • storage servers

        • any machine can be almost any subset of these

        • to read a file

          • look up the manager in a globally replicated map

          • contact the manager with request

          • manager redirects to cache or disk

        • to read a file

          • obtain a write token from manager

          • append all changes to the log

          • when hit threshold, flush to stripe group (RAID)

        • can approach 10x better than NFS, as

          • has co-operative caching

          • parallelism via software RAID (striping)

          • avoids read-modify-write by using a log structure

          • managers are replicated for fault tolerance

 

  •  
    •  
      • JetFile

        • serverless internet storage

        • aims

          • support shared personal file access in LAN and WAN

          • provide comparable performance to the local file system

        • implementation

          • uses scalable reliable multicast (SRM)

            • receiver driven: send out a request to the multicast group for the data needed

              • hopefully receive 1 or a few responses

            • version numbers are kept on all the data, so the receiver can retry to guarantee eventual delivery

          • file name/identifier is (org, vol, fileID, version)

            • the first three are hashed to get a file address

            • to retrieve the file first multicast a data request

            • responses will indicate who has it

              • pick one and use unicast with them to transfer the file

          • updates are handled by versioning

            • write on close semantics (this increases the version)

            • the client is now the server for the new version

            • explicit version requests, plus “current table” are multicast over per-volume channel

  • file systems for SANs

    • SAN has a pool of disks accessed via iSCSI

    • when a SAN is being accessed by multiple clients coordination is needed

    • methods to build a Shared Disk File System (SDFS)

      • asymmetric

        • add a metadata manager

        • there is exclusive access to metadata disks

        • clients access data disks directly

        • are simpler but less scalable

      • symmetric

        • clients access data and metadata directly

        • distributed locking used for synchronisation

        • implementation

          • lower level – network storage pool driver

            • combines all disks into a single address space

            • supports striping for performance/reliability

          • higher level – file system

            • almost a standard Unix structure (inodes etc.)

            • device locks and global locks for synchronisation

 

  •  
    • network attached secure disks (NASD)

      • concept: a less stupid SAN

      • still have network attached disks but

        • disks export variable length object interface

          • i.e. allow storage devices to transfer data directly to clients

        • most client operations like Read/Write go directly to the disks

        • less frequent operations like authentication go to the file manager

        • disks transfer variable-length objects instead of fixed-size blocks to clients

      • integrity/privacy available on transfers

      • security

        • a disk and the file manager share a secret key

        • after file system specific checks the file manager can issue derived capabilities to clients

          • this permits clients to securely access the disk directly

      • advantages

        • data path operations are fast and secure

        • a lot of work is offloaded from the file manager

        • multiple NASDs can be accessed in parallel

      • disadvantages

        • less allocation control

  • Venti

    • this is distributed archival storage

    • archived storage is normally on tape

      • magnetic disk is cheap, ubiquitous and fast

      • it is also subject to overwrite

    • use software to provide immutable storage

      • every file update a new version

      • the address of a block is a 160-bit SHA hash of the data

        • this enforces write-once since no other data block should be able to be found with the same address

      • directories map names to inode content hashes

    • a nice property

      • if a file changes in a leaf directory then some of the block hashes will change

      • hence the data in the inode will change

      • hence the inode content hash will change

      • hence some directory block(s) will change

      • this carries on all the way up the tree

    • issues

      • all addressing is via hashing – how do we map back to block location?

      • Venti uses log-based (append only) storage

        • blocks are stored sequentially in an arena

          • each block has a header including its hash

        • a trailer on the arena points back to blocks

        • build an index to map hashes to block headers

      • the index can be a bottleneck

        • hence add an index cache plus a regular block cache