software solutions
Computer Science » Advanced System Topics
Tasks that can be parallelised between a number of threads coarse granularity parallel ray tracing fine granularity parallel garbage collection lots of communication General approaches identify tasks ensure they operate safely in parallel (basic locks etc.) make it unlikely for them to get in each others way there's no point running two CPU bound processes on a single processor e.g. copy a shared read only structure to reduce locking time try to avoid contention for resources which cannot be shared encourage locality for resources which can reduce overheads introduced by their co-ordination, e.g. context switching time waiting for locks pollution of caches remember, the bottleneck resource is the most important Java public class Worker implements Runnable { public void run () { ... } ... } public static void main() { Worker w = new Worker(); new Thread(w).start(); } each object has a mutex synchronised methods require holding the mutex each object has a condvar to block on an object waiting for a condition to be true you do it after you've acquired a lock on an object (and entered a method sometimes) in order to let go control so another thread can do something you need, but to reacquire the lock afterwards methods wait() atomically release mutex control and be put in waiting status notify wake up one thread which is waiting the woken thread needs to reacquire the lock notifyAll() wakes all the threads blocked on this condvar they all compete for the lock Fine grained parallelism explicitly use pragmas to indicate sections of code which are parallelisable and how implementation is responsible for creating a suitable number of threads Threads separating reasonably independent tasks advantages in modern systems each thread can block/unblock independently disadvantages creating threads in an uncontrolled manner causes contention high load can cause poor response which causes clients to retry causing higher load kernel managed threads usually separate threads sharing the same address space advantages OS can balance threads across CPUs disadvantages moderately expensive to create threads user/language managed threads threads are multiplexed onto a single kernel process implemented within a library advantages fast creation and context-switch times disadvantages can't use CPU parallelism a blocked thread can block the entire thread Commands things to be done e.g. each line of a scene to ray-trace each object to examine in a garbage collector each request received from a remote client should be lightweight to create in terms of Java: an instance of Runnable Executors things responsible for arranging the execution of commands can be expensive to create, e.g. having associated with a thread normally more long lived commands enter a single queue each thread takes a command and executes it before taking another items are queued if six threads are running if fewer that two threads are running then new threads are created how to deal with signal queue overflow? More complicated commands above they just run to completion, could instead block (e.g. ongoing interaction with clients) could possibly vary maximum pool size, if suitably independent generate other commands (e.g. parallel garbage collection) adding them to the queue could harm locality cause problems if the queue fills solutions asynchronous I/O (generate new command when previous I/O is completed) next command is responsible for the next step of execution before issuing next I/O in order to reduce harm to locality, attempt to encourage the series of commands to be dealt with by the same thread(s) each worker thread has a separate pool try to put a generated command in the pool where the generating command is executed head or tail? if empty take item from another queue head or tail? one or many? useful to have concurrent queues which can be accessed independently at both ends Shared data structures techniques to reduce contention confinement: ensure some objects are always local, no need to lock/unlock them separately accept stale/changing data: particularly allowing reads without locking what's the worst that can happen? can stale data be detected? Copy on write assumes writes are rare reduce lock granularity lots of small locks instead of a few larger ones Hardware architectures: write back only when line is replaced or other set intervals requires there to be at most one dirty version of a cache line SMP cache coherence multiple processors can read a line use BusRdX to read for exclusive ownership other processors snoop the BusRdX and invalidate their own copies flush aborts remote bus cycle and writes back to memory ccNUMA a node is the grouping of CPUs with their local memory interconnection permits message passing between nodes no bus is shared directories provide serialisation point one directory per node tracks caching of memory blocks belonging to its node CPUs issue read write requests to the home node of the memory block local node: the node where the CPU making the request is home node: the node where the memory (which the CPU is trying to access) lives remote node: any node that has the block in question cached interconnect bandwidth can be problematic this can be very, very bad delay in worst cases Memory consistency issues remember that instructions are executed out of order can need to insert memory barriers to enforce ordering programs using mutual exclusion generally needn't worry about this Compare and swap seen = CAS (address, old, new) atomic operation examines contents address if they are equal to old then write new there returns the value seen there before the (possible) write Spin locks naïve implementation disadvantages the CAS operation must acquire exclusive access to the entire cache block where the boolean variable locked is located access to this block will probably be accessed by another as soon as one relinquishes, bouncing around the system, often with remote nodes caching it and the related overheads the interconnect will probably become saturated spinning will harm the performance of other processes on the machine if the thread holding the lock is on the same processor than harm that one too slightly better implementation void lock() { while( CAS (&locked, false, true) != false) { while (locked == true) { } } } advantages threads don't waste resources doing CAS over and over doesn't cause interconnect traffic disadvantages stampede for lock when it becomes available other options use other types of locks (e.g. multiple reader single writer MRSW) introduce a purely local delay between seeing the lock available and going for it explicitly queue threads and ensure the one at the front of the queue gets the lock Linux “big reader” locks assumes reads are much more common than writes there is a series of per CPU or per thread MRSW locks a reader acquires just a lock for that CPU a writer must acquire the lock for all the CPUs each thread spins on a separate location/object keeps a reference for who gets the lock next adding a new node safely prepare a new node takes a copy of the tail reference uses CAS to update tail to refer to its node write the next field of the previous tail to refer to itself how tell the next thread to unblock? if next is not null then wake the successor (set its value to false) otherwise attempt to CAS the tail from itself to null if that fails then something is attempting to add itself to the list spin watching next until it is non null Mutual exclusion disadvantages deadlock due to circular waiting priority inversion problems data shared between an interrupt handler and the rest of an OS pre-emption whilst holding locks performance concerns programming with a few big locks is easy but prevents many valid concurrent actions programming with lots of little locks is tricky Non blocking data structures guarantees that the system as a whole can still make progress if any finite number of threads in it are suspended non blocking designs are categorised (strongest first) as wait free: per-thread progress bound lock free: system wide progress bound obstruction free: system wide progress bound if threads run in isolation CAS is a universal primitive for building wait free designs (i.e. it can build anything) linearisability e.g. one thread invokes get_key_for('red') on a lookup table another thread concurrently invokes insert(10, 'red'); delete(20) on the same table this is important for correctness of non blocking structures each operation should appear to take place atomically at some point between which when it was invoked and when it returned this is more strict than serialisability often implemented by making a single CAS operation which atomically which checks and/or updates all of the state that the result depends on Software transactional memory ensures all accesses within a transaction are linearisable is a research topic Summary lots of threads usually means lots of context switching using ~number of processors is good if the threads are CPU bound excessive contention and low locality will lead to poor performance try to ensure threads are just using local resources as much as possible designing scalable shared data structures is hard and depends on the workload the execution environment Internet properties as opposed to with POTS, with the internet all the intelligence is in at the sender and receiver, not the network itself connectivity ISPs want to limit connectivity connectivity turned off by default, on by config they only want to carry traffic if one of the sender or received is a paying customer autonomous system number (ASN) sort of like an IP number but for networks routers talking to routers (control plane) tend to use layer 2, except BGP routing computation is distributed among routers within a routing domain routing messages are usually not routed but exchanged on layer 2 between physically adjacent routers internal BGP and multi-hop external BGP are exceptions distributed routing link state runs Dijkstra locally all need to know is neighbour's best route see DigiComms fast but needs more state converges fast but doesn't scale vector state see DigiComms IP forwarding forwarding table includes destination next hop interface (the card you're going to send it out on) is configured statically by administrator dynamically routers exchange network reachability information is adaptive dynamic routing adjustments operate more slowly than fluctuations in traffic load addresses classful cider routing: establishing an end to end path forwarding: determining the next hop Shortest path shortest path problems single source: find a shortest path from a given vertex to each of the other vertices single pair: find a shortest path between a given pair of vertices all pairs: find a shortest path between al pairs of vertices Bellman-Ford algorithm O(V * E) Bellman-Ford(G, w, s) 1 for each vertex v ∈ V[G] do 2 d[u] ← ∞ 3 π[u] ← nil 4 d[s] ← 0 5 for i ← 1 to (|V[G]| - 1) do 6 for each edge (u, v) ∈ E[G] do 7 relax(u, v, w) 8 for each edge (u, v) ∈ E[G] do 9 if d[v] > d[u] + w(u, v) then 10 return FALSE 11 return TRUE Where it only returns true when there are no negative weight cycles w is the weighting function Relax (u, v, w) 1 if d[v] > d[u] + w(u,v) then 2 d[v] ← d[u] + w(u, v) 3 π[v] = u Dijkstra algorithm only works on graphs without negative weight edges more efficient than Bellman-Ford O(E + V log V) Dijkstra(G, w, s) 1 for each vertex v ∈ V[G] do 2 d[u] ← ∞ 3 π[u] ← nil 4 d[s] ← 0 5 S ← {} 6 Q ← V[G] 7 while Q ≠ {} do 8 u ← Extract-Min(Q) 9 S ← S ∪ {u} 10 for each vertex v ∈ Adj[u] do 11 Relax(u, v, w) Routing Information Protocol (RIP) distance vector simple to configure and implement doesn't scale up, is designed for small LANs routers exchange their entire distance vector every 30s count to infinity, infinity = 16 in this protocol Open Shortest Path First (OSPF) link state protocol scales well is complex has rapid, loop free, non count to infinity convergence areas OSPF networks are divided into areas to help scalability there is a backbone area and a set of k other areas non backbone areas the link state database inside each of these must be synchronised at all routers a router doesn't have detailed information about topology outside its area this helps reduce the size of the required database route summarisation and filtering is possible between different areas Autonomous routing domains an autonomous routing domain is a collection of physical networks “glued together” using IP, which have a unified administrative routing policy an autonomous system is an autonomous routing domain which has been assigned an Autonomous System Number (ASN) scalability AS graphs do not show topology, but only that particular autonomous systems are connected (somehow) throw away topology information (internal structure) to make it stable ASNs can be shared multihoming with CIDR this is when a customer wants to connect to the internet (advertise its IP address space) through two separate ISPs e.g. to reduce single point of failure to establish correct communication with the Internet, both ISP1 and ISP2 must advertise the subscriber's specific address space of 205.113.50.0/23 if ISP2 does not advertise this address, all the subscriber's incoming traffic passes through ISP1 if ISP2 advertises 205.113.50.0/23, whereas ISP1 advertises only its own CIDR block, all the subscriber's incoming traffic matches the more-specific route and passes through ISP2 requires ISP1 to modify its filters and “punch a hole” in its own CIDR block ISP2 to advertise some of a competitor's address space a transit AS permits traffic with neither source nor destination within AS to flow across the network a non-transit AS only permits traffic with source or destination within AS this is easy to do, e.g. for C all that happens is you don't advertise to B that you have a route to A, and you don't advertise to A that you have a route to B Inter domain routing preferences prefer customer routes over others use provider (higher level) routers as a last resort (as they are charged for) don't provide transit for peers the ISP closest to a customer is responsible to ensure blocks announced by the customer are owned by the customer i.e. verify the customer's address space! otherwise the customer could accidentally announce some domain and traffic would go there rather than the correct place, “black holing” the site Border Gateway Protocol (BGP) policy based horrible, horrible hacked things which no-one really understands e.g. on the path of where the thing (packet?) has been, when you give it to someone else you hack the list of hops so that it goes to yourself over and over without going anywhere else, in order to make this path (and hence using your resources) less preferable count to infinity is minimise by throwing away something if it has you in its path remember every route your neighbours ever told you if one goes down select the next best inter domain routing is not based on shortest path these paths may be contrary to commercial agreements e.g. going to a third regional ISP via a second from a first, the first is getting a free ride, since it's business is providing regional connectivity, the national ISPs are for getting between regions peers do not provide transit for other peers operation open a connection share best data after that send only keep-alives and changes is not guaranteed to converge can require lots of manual intervention can be very hard to figure out what went wrong “reboot and restart, put it down to a CISCO bug” types of BGO EBGP is for between different ASs IBGP is for within the same autonomous system dynamics of BGP, you can have any two of the following (at odds) goals fast convergence minimal updates path redundancy there are normally a lot of updates, the reason for this could be any or none of: there's always something going on in the big wide world of the internet, so it's just good housekeeping BGP exploring alternate paths IGP instabilities being exported to the internet bad trade offs in router software BGP sessions being reset due to congestion can reduce updates by rate limiting updates only send updates every 30s or so a router can change its mind many times about the best route without having to tell its neighbour every time this slows down BGP exploring alternate paths route flap dampening routes are given a penalty for changing the total penalty decays over time, but if it goes over a limit then the route is dampened this is bad for a mesh, as it punishes well connected destinations implementing customer-provider and peer-peer relationships enforce transit relationship (outbound route filtering) i.e. don't let any transit traffic through order of route preference provider < peer < customer Link state routing router knows entire network topology, computers shortest path itself link state packet (LSP) cost of link between itself and its neighbour, e.g. (from A) {A; B; 4} control flood the network with LSPs store LSPs if one comes in that is new send it out on all routes besides incoming one network with E edges will copy at most 2E times use sequence numbers to tell if a LSP is new sequence numbers run out/don't know which to use on boot up solutions: ageing put a timeout value in the header and control flood notification of the event so that routers that don't know the time can do it too on booting a router has to wait until all the old LSPs are purged and refreshed need a unique start sequence number a is older than b if if a router gets an older LSP tell sender about newer LSP recovering from a partition the two nodes on either side of the re-established link co-ordinate to update databases Distance vector routing node sends Distance Vector to all its neighbours that says what the node's best idea of distance to every node in the network, from it advantages distributed adapts to traffic changes and link failures suitable for networks with multiple administrative entities problem: count to infinity, can deal with using DV has path vector triggered updates (faster count to infinity) Memory models shared memory model collection of threads sharing address space read writes on memory locations implicitly and immediately globally visible used for tightly coupled systems advantages easy to use scalable disadvantages race conditions synchronisation issues message passing model collection of process with private address space used for loosely coupled systems advantages control protection disadvantages performance demand paged virtual memory run time mapping from logical to physical address space (by MMU) use page table to indicate whether a page should be accessible (valid) for that process is in memory page faults on a uniprocessor check whether the page is valid if not then kill the process otherwise page in the desired page and restart the process from the faulting instruction distributed shared virtual memory have a large logical DSVM (e.g. 264 bits, is roughly 18 pico-bytes) each processor has its own physical memory each page has a home processor can be mapped into a remote (other processor) address space on read then page in across network/interconnect on write access sort out ownership the DSVM library or OS is responsible for tracking current ownership copying data across the network setting access bits to ensure coherence implementation: centralised page manager a single manager (on a single processor) keeps, for each page p owner(p), the processor which created or last wrote to p copyset(p), the processors with a copy of p a read fault has four messages faulting processor contacts the manager the manager forwards to owner the owner sends page requester acks to manager a write fault faulting processor contacts the manager the manager invalidates the copyset (contacts all processors which have a copy?) manager contacts the owner the owner relinquishes the page the requester acks to manager we can reduce messages by making manager(p) = owner(p) however now we need to find out who is the manager(p) broadcast a request to find manager(p) this can lead to race conditions in broadcast at each processor keep a record of who you think is manager (from snooping broadcasts, etc.) we can reduce messages by changing the consistency model false sharing (is expensive) P1 owns p, P2 has read access to p P1 writes to p this copies the change to P2, but P2 doesn't care about the change sequential consistency (what we have used up to now) every read sees latest write is easy to use for the programmer is expensive (causes false sharing) release consistency read and writes occur locally processors will explicitly do release this is a serialisation point to ensure all processes can see the writes acquire become the owner (I assume includes the previous owner doing a release before relinquishing ownership) type specific coherence the consistency depends upon type labels given to variables/state/data structures gives the best performance types private memory: ignore (don't synchronise) write once: only permit others to read it (i.e. only service read faults) read mostly: writing is infrequent and owner broadcasts updates producer-consumer: the producer sends the changes to the consumer before it requests them (through a read fault) write many: release consistency + buffering synchronisation: strong consistency (e.g. for locks) this permits optimistic multiple writers almost as fast as hand coded message passing lazy release consistency update not on release but on next acquire reduced messages and higher complexity persistent virtual memory virtual memory is (or may be) backed by non-volatile storage (where pages that aren't in use are paged out to) why then should virtual memory have to be volatile? this is persistent virtual memory permits no more distinction between files and memory programmatic access to file structure or databases is easier (e.g. with linked structures) can benefit from a type system persistence of data is the length of time it exists orthogonal persistence the length of persistence of some data does not control how that data is accessed i.e. you don't have to treat files and variable separately implementation options functional/interpreted languages these could fake out an implementation in their language runtime imperative/compiled languages these prescribe the way to access data (e.g. pure OO, pointer chasing, etc.) we can use virtual memory to implement persistent memory Multics virtual memory no file system per se user sees large number of orthogonal linear regions (segments) each segment is backed by secondary store there is an overarching directory tree (like in Unix) which transcends segments segments are created by users and exist until explicitly deleted many concentric rings of privilege directories contain a set of branches (~ inodes) a branch contains a set of attributes for a segment a unique segment ID (identifies to which segment the branch belongs) an access control list for each UID a ring bracket and limit (ACL only applies within the bracket) a list of gates (procedure entry points) however this good security makes it harder to do things (Multics is a victim of complexity) secondary storage got bigger than 232 bits (~4 GB) the words size is 32 bits hence we cannot safely refer to (name) all the data an implementation using pointer swizzling any data in persistent pages is accessed by special 64 bit persistent pointers these persistent pointers are never directly accessed/dereferenced any resident persistent pages are marked as invalid on access, trap and for each persistent pointer p' allocate a new page P and mark it as invalid rewrite (swizzle) p' to be a normal pointer referring to P unprotect the original page and resume 64 bit word sized machines virtual addresses can serve as unique names we can have a single address space operating system recoverable virtual memory recoverable virtual memory refers to a region of virtual memory on which transactional semantics apply lightweight recoverable virtual memory full transactional semantics are too expensive just considers atomicity and some durability full transactional semantics can be built on top of LVRM the method: a process maps region(s) of some segment (memory/secondary storage etc.) into its virtual address space invokes begin_transaction(mode) invokes set_range(t, base_address, nbytes) LVRM will normally copy the range it makes a note in the log that it has done this if there is an abort it can hence restore the values, unless the mode is “no restore” this is expensive the redo log gets full – log truncation is complicated up to three copies of the data are made there are expensive synchronous disk writes alternate implementation (Rio Vista) controversial method uses a persistent (non volatile RAM, NVRAM) file cache on map just mmap a region of the NVRAM set_range(...) copies the range into the undo log which lives in NVRAM all updates are immediately durable – no redo log reduced cost no synchronous disk writes no redo log (and hence reduce the number of copies required) in the case of a crash early in the reboot flush the NVRAM contents to disk (this is possible since NVRAM persists) on a map lazily undo any transactions which had not committed at time of crash performance issues if the database doesn't fit into NVRAM capability based addressing a capability is a protected name for an object possession is necessary and sufficient for access need to be unforgeable can be manipulated in a defined and restricted set of ways (passing as a parameter, refined, etc.) these are implementable as hardware software (using cryptography) Cambridge CAP computer controlled what a process can write to registers, not which process (i.e. not who can write to registers) a capability consists of the values base address & limit (defines where and how large the segment is) access code (read only read/write, etc.) memory is divided into segments (like lines or blocks) data segments contain data and instructions, and may be transferred to and from arithmetic registers capability segments contain capability values, and may be transferred to and from capability registers mechanism for loading capabilities into C registers capabilities are passed as arguments in a function call loading is implicit whenever a capability is referred to there are both data and capability registers capability registers will hold the capability for the present process they will limit the data segments accessible i.e. you can't go and access some other process's memory, we'll give an address fault if you do the capability segments accessible i.e. when calling a function you can't pass it any capability segment (and hence any capability) you want, you can only pass those which you own yourself i.e. capabilities can only become more refined (less privileged), not more one has to hold the capability for both C registers and D registers for the same segment, in order to control and create capabilities only done by some highly trusted part of the kernel the CAP is to hardware what scoping is to software with this you can actually prove correctness of programs hierarchies are useful in organisation of flow of control but are unnecessarily restrictive for protection domains of protection the set of capabilities to which a process has access there is a special instruction needed to change domain of protection (ENTER) when leaving a protection domain you can't just leave capabilities lying around in capability registers ENTER and RETURN give a hierarchy of control but not of protection protection of processes this is necessary to support multiprogramming need to define a protection environment for each process (differing privileges) “kernel” ENTERs user process control RETURNs on trap or interrupt requires hardware support relative capabilities rather than each capability be for an absolute base address, we can permit the base of a capability to be relative to the base of some other segment {base, limit, access code, reference} reference is {capability | whole memory} hence a process can hand on a subset of its privileges to a sub-process in CAP the kernel capabilities are in a segment called the master resource list each process has a process resource list with capabilities relative to those in the MRL this makes revocation easy as you only have to revoke the top level process and all subprocesses are relative to it summary capabilities mean that processes run with minimum degree of required privileges hardware is expensive and complex system ends up being slow Microkernel operating systems kernel operating systems became very complex microkernels attempt to simplify the kernel build a more modular system support multiprocessors, distributed computing etc. the new structure attempted to move some functionality into user space servers access servers via some inter process communication (IPC) system increase modularity Mach microkernel provided compatibility with 4.3 BSD, OS/2 ... attempted to support diverse architectures and multiprocessors (SMP, NUMA ...) scale across network speeds have distributed operation a task is an execution environment a thread is a unit of execution IPC is based on ports and messages a port is a generic reference to a resource IPC is message passing between threads L3/L4 microkernels a problem with microkernels: many crossings in and out of the kernel is expensive (e.g. in terms of worse locality) approach: minimise what should be in the kernel make those primitives really fast L3/L4 did that by providing only recursive construction of address spaces threads IPC unique identifier support design/implementation address spaces are supported by the primitives grant: give pages to another address space map: share pages with another address space flush: take back mapped or granted pages threads execute with address space the microkernel manages thread to address space binding IPC is message passing between address space interrupts are handled as messages too EROS a persistent, software capability microkernel motivation behind EROS reliability requires system decomposition access delegation access policy is a run time problem persistence simplifies applications and improves IO mutually suspicious users attempted to challenge the idea that capabilities are slow microkernels are slow capabilities can't support discretionary access control (just passing them on) there are two disjoint “spaces” (like in the CAP) data space set of pages each page holds 4096 bytes we can read and write data to/from data registers capability space set of cgroups each cgroup holds 16 capabilities can read and write to/from capability registers each capability is (type, oid, authority) types basic types are page cgroup number schedule complex types include segment: corresponds to address spaces domain: corresponds to processes persistence is achieved by flushing objects to disk use a log for check-pointing before using capabilities they must be prepared a capability will (in general) be pointing at an object on disk the object is brought into memory the capability is rewritten to point at the object's memory location segment tree this is a tree of cgroups the leaf cgroups refer to data pages (others refer to more cgroups) speed-ups are achieved by keeping a cache (a page-table representation) of the segment tree associate each data page with a capability, like a key|-> value map i.e. don't have to walk the tree to get the capabilities for a data page we simply associate the evaluated capabilities with each page table entry fast capability based IPC scheme hand coded (similarly to L4) Virtual machine monitors virtual machine monitor = hypervisor multiplexes multiple OSs IBM's VM/CMS control program on top of which OSs sit control program provides each OS with virtual console virtual processor virtual memory virtual I/O devices performance is good since most instructions run directly on hardware Disco want to be able to run commodity OSs on ccNUMA fault tolerance between operating systems sharing between OSs to the OS the CPU looks like a real MIPS R10000 TLB fills etc. are emulated VMWare we want to have virtual machines for x86 there are 17 instructions which have different user/kernel semantic and don't trap these cannot be emulated solution: perform binary rewriting to manually insert traps (hack!) virtual physical addresses for an OS are mapped to actual physical addresses by shadow page tables performance issues don't have the source to make small modifications instead write special device drivers and other low level code Denalie uses VMM as an isolation kernel security isolation: no sharing between VMs performance isolation: VMM supports fairness mechanisms (fair queueing and LRP on network path) and static memory allocation uses para-virtualisation to improve performance uses x86-like ISA have to rewrite the OS to deal with this Yakima is a work in progress OS is written to use the para-virtualisation vision: XenoServers across the globe that are usable by anyone to host services, applications etc. use the Xen hypervisor to allow the running of arbitrary untrusted code (including OSs) insight guarantee resources in time and space, and then charge for them share and protect CPU, memory, network, disks etc. uses para-virtualisation, but with real OSs implementation (Xen 1) based on low level Linux (i.e. doesn't rewrite bootstrapping) has Xeno-aware device drivers in the OSs a special OS (domain 0) is started at boot time is used to create/suspend/resume/kill other domains (OSs) physical memory is allocated at start-of-day interrupts are converted into events write event to event queue in the domain the domain only “sees” the event when it is activated guest OSs can run a scheduler off of either virtual timer real time timer asynchronous queues are used for network and disk Xen 2 supports live migration of virtual machines copy all of memory from one server to another whilst copying watch to see where anything is modified during the continued running of the VM iterate, copying only the sections which are dirtied during the previous copy when down to ~20MB suspend and copy the rest, with a down time of around ~200ms Xen 3 supports h/w assisted full virtualisation 32/36/64 bit support conclusions VMs are popular as OS static size is small compared to memory OS level security is perceived as weak flexibility is desirable emerging applications internet suspend-and-resume at the end of the day suspend the VM to disk copy to another site (conference etc.) multi level secure systems many people use virtual private networks from home to work this risks leakage/cross contamination of viruses etc. instead run a VM with only VPN accessibility Extensibility this is basically about allowing variation/changes in some section of the system environment fixing mistakes supporting new features efficiency individualism per-process thread scheduling algorithms customising page replacement schemes different methods to implement this exist low level techniques i.e. give everyone their own (virtual) machine low level software provides virtual hardware some simple, secure multiplexing (getting N pieces of hardware from one) need to ensure not too much control is built into the Virtual Machine Monitor or we go back to the same problem users can choose, reboot or even recompile their OS without logging off kernel level schemes replacing the entire OS can be a little overkill sometimes we just want to replace/modify some part of the OS permit portions of the OS to be dynamically loaded and unloaded problems this requires clean and stable interfaces security: who do you trust to load into your OS? schemes to avoid security problems trusted compiler if you trust the compiler it can provide code with a digital signature, and then you can trust the code proof carrying code the compiler generates a proof, against which the program is checked at load time the proof is of certain properties doesn't read/write/execute outside some logical fault domain you end up needing formal specification language for safety policy formal semantics of the language for untrusted code language for expressing proofs algorithm for generating proofs algorithm for validating proofs sandboxing limit absolute memory references to per-module segments check for certain instructions method: transform untrusted code e.g. insert bounds checking of branches before they occur this does lead to code expansion SPIN operating system permits extensions to be downloaded into the kernel we want comparable performance to a procedure call, so use compiler check language safety the kernel is written mainly in Modula-3 capabilities are Modula-3 pointers the protection domain is enforced by language name space (not virtual addressing) the extensions define events and handlers applications register handlers for specific events e.g. a handler for “select a runnable thread” problems trusted compiler locks termination the VINO operating system downloads things called grafts into the kernel grafts written in C or C++ have free access to most kernel interfaces this is achieved safely by sandboxing has to use a trusted compiler prevents resource abuse by resource quotas and accounting prevents resource starvation by timeouts grafts must be preemptible safe graft termination assured by transactions transactions are wrapper functions around grafts all access to kernel data is via accessors transactions use two phase (fine grained) locking an in-memory undo stack to achieve the safety properties required user level schemes we can avoid the complexity required of putting extensions in kernel space by putting them in user space e.g. microkernels alternatively reconsider how the kernel works the kernel provides protection and abstraction only abstraction needs to be trusted implementations Exokernel run most of the OS in user-space library attempts to use as few abstractions as possible, as abstractions deny application specific optimisations discourage innovation impose mandatory costs functionality is limited to ensuring protection and multiplexing of resources this is vastly simpler than conventional microkernels' implementation of message passing applications may request specific memory addresses, disk blocks, etc. the kernel only ensures that the requested resource is free, and the application is allowed to access it Nemesis operating system guarantees each application a share of physical resources (in space and time) designed to support soft real-time applications which have QoS needs isolation: explicit guarantees to applications exposure: multiplexes real resources responsibility: applications must handle their own data path use interface description language (IDL, the language neutral way to specify communication between different languages) to allow user-space extensibility main principle: the majority of code could execute in the application process itself Nemesis therefore has an extremely small lightweight kernel, and performs most operating system functions in shared libraries which execute in the user's process this leads to a vertically-structured operating system conclusions it's more than just a “performance hack” as it simplifies system monitoring enables dynamic system tuning provides the possibility for better system and application integration it allows extensible applications to take advantage of OS flexibility Database storage naïve implementation (relations and directories in ASCII formant) e.g. to do select * from R, S where condition you have to read file the directory to get R and S's attributes read file containing R and for each line read file containing S and for each line create a join tuple between that and the R line check condition display if ok issues tuple layout on disk is inefficient search is expensive as there are no indexes no reliability can lose data can leave operations half undone no security no buffer management/concurrency control disk storage issues block size larger block size amortises I/O coast smaller blocks read in less useless stuff takes shorter time we need efficient use of the disk e.g. sorting of data I/O costs are likely to dominate (we want algorithms to do as little as possible) need to maximise concurrency use asynchronous I/O and a database specific buffer manager need to improve reliability use write ahead log to deal with failures mid-transaction representing records records can be fixed or variable format fixed or variable length fixed format this is the “traditional” model where you have a schema for the table which specifies number of fields types of each field order in the record the record isn't really interpretable without the schema each record self describes itself you still have to have some basic schema, e.g. the first byte is the number of fields the first byte of each field is a type code the second byte of each field is the type specific information ... fixed length each field is a fixed size e.g. a name field where the name has to be 30 characters or fewer variable length each field can vary in size e.g. a string can be as long as it needs to be hybrid schemes record header with schema id and length fixed record with a variable suffix block level storage of records storing records within a block fixed sized records don't need to be explicitly separated variable length ones might use a separation marker it might be better to have a header at the start of the block with pointers to the start of each variable length record spanning multiple blocks we can implement with pointers at the end of a block to indicate where the next part is mixing record types within a block this can provide a clustering locality benefit for related records normally is too messy order of records is important if we get them sequential and sorted it's easier to do merge-join efficient record retrieval instead of keeping the table in memory, keep its index instead we assume that the table is a sequential file ordered by key just because a key isn't in the index doesn't mean the record doesn't exist we could in theory just have the pointer to the first record and traverse the table each time (pointer chasing) dense index advantage: existence checks can be done without accessing the file is required for secondary indexes sparse index advantage: can fit more of it into memory can implement multilevel sparse indexes sort of like multi level page tables (with the second table being dense) entries in the top-level sparse index point to ranges in the lower-level sparse index these in turn contains entries that point to ranges of actual records can also include block pointers (which are for inside the block to which the index pointer points) insertions and deletions are messy an alternative don't focus on the index being sequential, instead focus on balance B+ trees each node has n keys and n+1 pointers non-leaf nodes each pointer points to nodes where the key values are less equal to than the key value for this pointer, for the left pointer are greater than the key value for this pointer, for the right pointer leaf nodes pointers point directly to record (or across to the next section) balanced tree all leaves are the same depth search is easy and fast O(log(n)) insertion if there is space in the leaf then fine “space” is defined by the order of the tree, e.g. 4 (maximum no of keys in each node) a function space also defines the minimum number of keys permissible at each node split the node, promoting the middle key to the next level up do this as far up the tree as necessary deletion is harder if the minimum number of keys permissible is not violated then fine otherwise need to one of redistribute keys (and propagate upward) coalesce siblings buffering: want to keep root and higher levels in memory the link between leaves provides for interesting optimisations of sequential access at leaf level B trees a B-tree of order d each node contain at most 2d keys and at most 2d+1 pointers must have at least d keys and d+1 pointers is at least half full insertion and deletion methods are utilised to ensure that the tree remains balanced the longest path in a B-tree of n keys is (approximately) at most log(d.n) nodes leaves are all at the same height in the tree at any node, each key has stored with its associated data value, therefore leaves are simply nodes without any forwards pointers to other nodes spacial indexes spatial data pertains to the space occupied by objects (points, lines, surfaces, volumes etc.) real life applications: roads, cities, countries, internet... this is hard to represent for conventional DBMS it is highly dimensional/continuous hence cannot store it simply as a relation we want to be able to express special queries (e.g. close to, encompasses, intersects etc.) B-tree cannot handle high dimensionality hashing cannot handle range queries possible approaches balanced trees in spatial occupancy (R-trees) multi-dimensional hashing (grid files) Postgres DBMS old DBMS were data management only, i.e. fixed format records traditional transactions queries we now need object management bitmaps vector graphics free text Postgres uses set oriented Postquel advantages variable persistence standard control flow has a small number of concepts so is simple for users disadvantages has a large memory footprint implementation doesn't use a write-ahead log uses a no-overwrite storage manager leave the old version of the record in the database the log is now just 2 bits per transaction stating if the transaction is in progress committed aborted advantages abort is cheap recovery is cheap can support historic “time travel” queries disadvantages need to flush new records to disk on a commit may need multiple indexes disk fills up have to archive it time travel queries are hard to express operating systems and databases operating systems do not work well with databases extra copies to/from disk buffer replacement most OSes use Least Recently Used DBMS accesses are sequential access to blocks without re-reference sequential access to blocks with cyclic reference (which kills LRU) random access to blocks without re-reference should discard immediately random access to blocks non-zero probability of re-reference this is the only case that works ok with LRU the DBMS knows about these things similar issues with prefetching DBMSes end up doing their own buffer pool management in user space issues with user space buffer management we want access to the blocks, not variable byte-length files the buffer pool is in virtual memory, so can suffer from “double paging” (interacting badly with the VM system) other problems multiple trees are wasteful multi-process DBMSes can suffer from priority inversion if there are user-space locks in the DBMS lower priority DBMS processes could hold a lock which a higher one needs, etc. single process DBMSes are complex and can't benefit from multiprocessor support possible solution OS accepts “hints” from the DBMS give DBMS raw partition access Distributed storage both file-systems and DBMSes want big, reliable disks RAID redundant array of inexpensive disks can get better performance through striping saving across disks, so can be writing multiple files at once can get better reliability though redundancy write the same thing to multiple disks at once distributed storage separation of data management from applications and servers permits centralised data management more scalability location fault tolerance mobility for access NAS uses file-based protocols such as NFS or SMB/CIFS where it is clear that the storage is remote, and computers request a portion of an abstract file rather than a disk block runs over TCP/IP more general purpose generally accessed by NFS, CIFS, SQL etc. storage area networks (SAN) appears to software as if it's local storage, i.e. it allows access to blocks the filesystem/DBMS runs on the host (see right) is accessed via encapsulated SCSI more specialised (usually enterprise level) runs over a fibre channel network file systems these are often used to access NAS are mainly remote procedure call (RPC) based, at some level NFS designed (originally, in v2) to be stateless no record of clients or open files no implicit arguments (e.g. a file pointer) to a request no write back caching on the server requests are idempotent (i.e. have the same result even if applied over and over) where possible only hard state is on the server local filesystem NFS v3 scalability permits 64 bit offsets removes some limits on file names etc. explicit asynchrony server can do asynchronous writes client sends a commit after some number of writes must keep a cached copy until the commit is successful NFS v4 a major rethink is a single, stateful protocol TCP access only (or suitable reliable transport layer) explicit open and close operations file level revocable locks (share reservations) server gives the client some revocable autonomy (delegation) arbitrary compound operations AFS all accesses in this protocol are RPC, even if the file system is local modifications to files are made on locally cached copies when a file is closed the required sections are written back callback: this is a procedure such that each client which has a copy of a file out is informed by the server if another client writes back a modification a volume is a tree of files, sub-directories and AFS mountpoints (links to other AFS volumes) volumes are created by administrators once created, users of the filesystem may create directories and files as usual without concern for the physical location of the volume live replication and relocation of volumes Coda supports disconnected operation under normal operation, a user reads and writes to the file system normally during this the client fetches, or "hoards", all of the data the user has listed as important in the event of network disconnection if the network connection is lost, the Coda client's local cache serves data from this cache and logs all updates (during this stage is called disconnected operation) upon network reconnection, the client moves to reintegration state; it sends logged updates to the servers then it transitions back to normal connected-mode operation cache manager operates in one of three modes hoarding emulating (for disconnected operation) reintegrating (assisted conflict resolution) extensions can support weakly connected (e.g. wireless) operation LBFS Low-bandwidth Network File System used for wide area/wireless etc. networks need to compress an order of magnitude better maintain a persistent cache on the client method server divides its files into chunks and computes a secure hash on each client does the same if the client wants to write a portion of a file first just transfer hashes for the relevant chunks if the server already has the chunks then you can avoid the transfer otherwise it requests whatever chunks are missing possible issues edit in place? fetching files never seen before? more detail of method (deals with issues) file chunk boundaries are based on the contents (data) a Rabin fingerprint is computed over every overlapping 48 byte region of every file if the lower 13 bits match a magic boundary then that position is deemed to be a chunk boundary expected chunk size is 8K use 2K and 64K thresholds to deal with pathological cases hence we can have hits if the contents appear anywhere in any file we know about ~20% redundancy between unrelated files what about hash collisions? serverless file systems all nodes hold some data think peer to peer in the local area xFS this file system includes clients cleaners managers storage servers any machine can be almost any subset of these to read a file look up the manager in a globally replicated map contact the manager with request manager redirects to cache or disk to read a file obtain a write token from manager append all changes to the log when hit threshold, flush to stripe group (RAID) can approach 10x better than NFS, as has co-operative caching parallelism via software RAID (striping) avoids read-modify-write by using a log structure managers are replicated for fault tolerance JetFile serverless internet storage aims support shared personal file access in LAN and WAN provide comparable performance to the local file system implementation uses scalable reliable multicast (SRM) receiver driven: send out a request to the multicast group for the data needed hopefully receive 1 or a few responses version numbers are kept on all the data, so the receiver can retry to guarantee eventual delivery file name/identifier is (org, vol, fileID, version) the first three are hashed to get a file address to retrieve the file first multicast a data request responses will indicate who has it pick one and use unicast with them to transfer the file updates are handled by versioning write on close semantics (this increases the version) the client is now the server for the new version explicit version requests, plus “current table” are multicast over per-volume channel file systems for SANs SAN has a pool of disks accessed via iSCSI when a SAN is being accessed by multiple clients coordination is needed methods to build a Shared Disk File System (SDFS) asymmetric add a metadata manager there is exclusive access to metadata disks clients access data disks directly are simpler but less scalable symmetric clients access data and metadata directly distributed locking used for synchronisation implementation lower level – network storage pool driver combines all disks into a single address space supports striping for performance/reliability higher level – file system almost a standard Unix structure (inodes etc.) device locks and global locks for synchronisation network attached secure disks (NASD) concept: a less stupid SAN still have network attached disks but disks export variable length object interface i.e. allow storage devices to transfer data directly to clients most client operations like Read/Write go directly to the disks less frequent operations like authentication go to the file manager disks transfer variable-length objects instead of fixed-size blocks to clients integrity/privacy available on transfers security a disk and the file manager share a secret key after file system specific checks the file manager can issue derived capabilities to clients this permits clients to securely access the disk directly advantages data path operations are fast and secure a lot of work is offloaded from the file manager multiple NASDs can be accessed in parallel disadvantages less allocation control Venti this is distributed archival storage archived storage is normally on tape magnetic disk is cheap, ubiquitous and fast it is also subject to overwrite use software to provide immutable storage every file update a new version the address of a block is a 160-bit SHA hash of the data this enforces write-once since no other data block should be able to be found with the same address directories map names to inode content hashes a nice property if a file changes in a leaf directory then some of the block hashes will change hence the data in the inode will change hence the inode content hash will change hence some directory block(s) will change this carries on all the way up the tree issues all addressing is via hashing – how do we map back to block location? Venti uses log-based (append only) storage blocks are stored sequentially in an arena each block has a header including its hash a trailer on the arena points back to blocks build an index to map hashes to block headers the index can be a bottleneck hence add an index cache plus a regular block cache
Thread pools
More complicated thread pools
Cache coherent shared memory multiprocessor, with


Write back cache coherency
SMP snoop requests on the bus
queue based spin locks

transit ASs
lollipop
![]()
![]()
![]()
XenoServers



variable format

if there is not space
network attached storage (NAS)
