AST

AST


  • Scalable Synchronization
  • Threads in Java
  • Fine grained parallelism
  • User language threads
  • Thread Pool
  • Spin locks
  • Queue based locks
  • Non blocking data structures
  • Routing information protocol
  • OSPF
  • Link state vs Vector routing
  • Autonomous Systems
  • Border Gateway Protocol
  • Hot potato routing
  • Cold potatato routing
  • BGP Wedgie
  • Distributed Shared Virtual Memory
  • Recoverable Virtual Memory
  • Mach kernel
  • Extremely Reliable Operating Systems
  • Scalable Synchronization:

    Work sharing: Coarse granularity- little communication once started. Fine granularity- potential for lots of communication.

    Identify tasks that can usefully operate in parallel -> Ensure they operate safely while doing so -> Make it unlikely they get in each others way (reduce contention, increase locality) -> Reduce overheads introduced by their co-ordination.

    It's the bottleneck resource that's important.

    Eg; With a web server could have a single thread, bottleneck on hard drive. Could have thread per client, bottleneck on hard drive and have lots of open threads -overload. Could use thread pool to cap number of threads active.

    Threads in Java: Implement runnable, Thread(class).start();

    If a thread calls a synchronized method it blocks until it can acquire the target object's mutex. Each object also has an associated condition variable which controls when threads acting on the object get blocked and woken up. If a thread has an objects mutex it can can call: wait() releases the mutex on this and blocks on its condition variable. notify() selects one of the threads blocked on this's condvar and releases it ( the woken thread must then re-acquire a lock on the mutex before continuing). notifyAll() releases all of the threads block on this's condvar.

    Fair lock: Use currentTurn and nextTicket. synchronized awaitTurn {wait() while not current turn}, synchronized finished().

    Fine grained parallelism: Operations on some data can be executed in parallel. Can use OpenMP to tell processor that concurrent execution on a section of code is safe (parallel), by only one thread (single) and by only one thread at a time (critical). Also have barrier points which synchronize all threads in a team and reduction clauses to avoid data dependencies.

    Kernel threads: OS manages threads. OS can balance threads across multiple CPUs, and individually block when they submit I/O. Can be expensive to create.

    User language threads: Implemented within an application runtime. Multiplexed into a single kernel visible process. Fast creation and context switch, but can't exploit CPU parallelism and one thread can block entire application.

    Commands: Things to be done, light weight to create. eg each object in a garbage collector.

    public interface Runnable {

    public abstract void run();

    }

     

    Executors: Responsible for arranging the execution of commands. Costly to create, long lived.

    public interface Executor {

    public void execute(Runnable command)

    throws InterruptedException;

    }

    Can have synchronous, asynchronous, pooled (bound number of threads), queued (one at a time) executors.

    Thread pool: Minimum and maximum pool size. Commands enter a single queue. Threads take commands from the queue and execute them to completion before taking the next. New threads created if fewer than min are running. If overflow could block caller, run immediately, signal, discard. On blocking could just increase the max pool size. Or could use asynchronous I/O so that a command is generated in response to each I/O completing. Could provide separate queues for each active worker thread. If C1 generates command C2 then put it on the queue that C1 came from. If queue runs empty just take from another queue - specialized work queues to make this ok.

    Work queues:


    Head always points to a dummy node. Tail points to the last node in the list. Separate mutexes protect the two separations.

    Reducing contention: Confinement: Guarantee that some objects must always remain thread local, no need to lock them. Accept stale/changing data: Sometimes ok, particularity during reads. Copy on write: Access data through indirection and copy when updated.

    Reduce locking granularity: Lots of small locks instead of a few large ones. Eg; ConcurrentHashMap: Table divided into segments. One lock per segment, when reading try without locking first then try locking if fail.

    Uniprocessor cache coherency: Clean lines set to dirty when modified. Dirty lines written back to cache wehn replaced. DCI - dirty, clean, invalid.

    Shared processor cache coherency: MSI - modified, shared, invalid. Also can have owned and exclusive state.

    ccNUMA cache coherence: CPU's and their local memory grouped into nodes. Message passing between nodes. One directory per node, tracks the caching of memory blocks belonging to its node. CPU issues read/write requests to the home node of the memory block. Can get three-hop miss.

    Memory consistency: Superscalar processors execute out of order. Aliased writes by different CPUs are serialized. Sequential Consistency: Strong, but expensive. Speculative processor ordering: Modern x86.

    Compare and Swap: cas(&x, old_value, new_value). If value at x == old_value then write new_value into x. Atomic.

    Basic Spin Lock:

    class BasicSpinLock {

    private boolean locked=false;

     

    void lock() {

    // while locked==true, spin

    // if locked ==false, set locked=true and exit

    while ( CAS(&locked,false,true) != false {}

    }

     

    void unlock () {

    locked=false;

    }

    }

    But CAS requires exclusive access to the cache block holding locked, slows down. also processors ping pong between each other, crap interconnect.

    Better Spin Lock:

    Could make spin on locked instead (as just accessing isn't atomic) ie

    while ( CAS(&locked,false,true) != false {

    while (locked == true) { spin here instead }

    }

    Though this way you get a stampede.

    Multireader Single Writer:

    //if readers =0, lock with -1, otherwise spin.

    CAS(&readers, 0, -1) !=0 {}

    Linux big reader locks: Built from per-CPU MRSW locks.

    Queue based spin locks:

    Basic idea: each thread spins on an entirely separate location and keeps a reference to who gets the lock next. Each qnode has a next field and a blocked flag. In this case thread 3 holds the lock and will pass it to 1 and then to 2. A shared tail reference shows which thread is last in the queue.


    If thread 4 wants the lock: Creates a new qnode, uses CAS to update tail to refer to its node, writes to next field of previous tail to point to it's qNode. Thread 4 now spins watching the flag in its qnode.

    Disadvantages of mutual exclusion: Ensure safety, but may reduce liveness. Deadlock due to circular waiting, priority inversion problems.

    Non blocking data structure: Guarantees the system as a whole will still make progress if any finite number of threads in it are suspended. Three kinds of non blocking design, each weaker than before: Wait free - per thread progress bound. Lock free - system wide progress bound. Obstruction free - system wide progress bound if threads run in isolation. CAS is a universal primitive for building wait free designs, ie can build anything. Can use CAS to link nodes in command queues.

    Linearisability: Each operation should appear to take place atomically. A linearizable non-blocking implementation can be used knowing only the operations it provides, not the detail of how they are implemented - just use a CAS that checks and/or updates all of the state the result depends on.

    Software transactional memory: Holds ordinary values and supports operations such as StartTransaction. Ensures that all of the accesses within a transaction appear to be executed in a linearizable manner.

    Routing

    IP routing: Clever ends, dumb middle.

    ICP: Interior gateway protocol. Metric based. OSPF, IS-IS, RIP, EIGRP.

    EGP: Exterior gateway protocol. Policy based. BGP. The routing domain of BGP is the entire internet.

    Routers: Talk to other routers in the control plane to exchange routing information. Routing messages aren't normally routed, instead exchanged via layer two between adjacent routers.

    Link state: Topology information is flooded within the routing domain. Best end-to-end paths are computed locally at each router. Best end-to-end paths determine next hops. Based on minimizing some notion of distance. Works only if policy is shared and uniform. Eg; OSPF, IS-IS.

    Vectoring: Each router knows little about network topology. Only best next hops are chosen by each router for each destination network. Best end-to-end paths result from composition of all next-hop choices. Does not require any notion of distance. Does not require uniform policies at all routers. Eg; RIP, BGP.

     

    Link State

    Vectoring

    IGP

    OSPF, IS-IS

    RIP

    EGP

     

    BGP

     

    IP header: TTL, Protocol, Source, Destination, Checksum. IP is a network layer protocol.

    IP Forwarding table: Destination (normally a network, may be a host or default) | Next hop (network or router) | Interface (Physical interface to send out on).

    IP Forwarding: Remove a packet from an input queue. Check sanity, decrement TTL. Match packets destination to IP forwarding table entry. Place packet on correct otuput queue.

    IPv4: 32 bit. IPv^ is 128 bits.

    Classless Inter-Domain Routing: CIDR uses two 32 bit numbers to represent a network. Network number = IP address + Mask. Allows a hierarchy in routing, rather than just a flat address space. Can route to best hierarchical match in forwarding table.

    Populating Forwarding tables: Can do statically, allows for more control but doesn't scale. Can do dynamically, which adapts to change and scales well, but complex.

    Routing vs. Forwarding: Forwarding determines the next hop. Routing establishes end to end paths. Forwarding always works, routing can be badly broken.

    Network Congestion: IP routing protocols don't dynamically route around network congestion.

    Shortest path: Single source problem- Find a shortest path from a given source to each of the vertices. Single pair- given two vertices, find the shortest path between them. All pairs - find the shortest paths for every pair of vertices (use a dynamic programming algorithm).

    Negative Edges: Are ok so long as there are no negative weight cycles. Shortest paths will have no cycles.

    Relaxation: For each vertex v in the graph, maintain an estimate of the shortest path from s. Relaxing an edge (u,v) means testing whether we can improve the shortest path (s,v) by going through u.

    Dijkstra's algorithm: Non negative edge weights. Greedy algorithm, bit like breadth first search.


    Maintain a set S of solved vertices. At each step select "closest" vertex u, add it to S, and relax all edges from u.

    Running times:

    Extract-min executed |V| time. Decrease key executed |E| time. Time |V|+|E|.

    Q

    Running time

    Array

    O(V2)

    Binary heap

    O(E lg V)

    Fibonacci heap

    O(V lg V + E)

     

    Bellman ford algorithm: Dijkstra doesn't work when there are negative edges. BF detects negative cycles (returns false) or returns the shortest path tree. Runs in O(VE).


    Eg;


    Runs in O(VE).

    Routing information protocol: Doesn't scale well, designed for small LANs. Distance vector. Simple. RIP routing table: Destination | Next hop | Metric (How many hops).

    Periodically (30 seconds) exchanges list of destinations and metrics with all neighbouring routers. If finds smaller metric updates table. Count to infinity problem. Uses Bellman Ford.

    Open Shortest Path First: OSPF has rapid, loop free convergence. Doesn't count to infinity. Link state protocol. Scales well, complex. Each router has a link state database that represent the entire network constructed from local knowledge at each router, the routing table's metrics are computed locally using Dijkstra. Synchronization of the distributed, replicated link state database is complex. Hierarchical OSPF by having separate areas, using distance vector to exchange routes between areas.

    Link state vs. Vectoring: Link state has faster convergence, but requires more memory, CPU and message overhead. Vectoring requires few resources, but convergence can be very slow. Counting to infinity can be a problem. Both protocols can induce transient forwarding loops during convergence (addressed by EIGRP).

    Autonomous routing domains: A collection of physical networks glued together using IP, that have a unified administrative routing policy.

    Autonomous Systems: An autonomous routing domain that has been assigned an autonomous system number. AS Numbers are 16 bit values that represent units of routing policy. ASN's can be shared. ARD's normally have no ASN's (edges of the internet are statically routed).

    AS Graphs don't show topology, BGP is designed to throwaway such information. They depend on your point of view.

    Transit vs. Nontransit: A transit AS (eg ATT) allows traffic with neither source nor destination to flow across the network. A nontransit (eg MIT) AS allows only traffic originating from AS or traffic destination within AS.

    Policy vs. Distance routing: Minimizing hop count can violate commercial relationships, so may use policy based rather than distance based routing.

    Peers: provide transit between their respective customers. They don't provide transit between peers due to cost. Peering allows shortcuts.

    Reasons to peer: Reduce upstream transit costs, can increase end-to-end performance, may be only way to connect customers to some part of the internet. Reasons not to peer: Would rather have customers, peers are usually your competition, may require boring negotiation. Peering agreements are often confidential.

    Blackholes: Accidental or malicious announcement of your prefix can black hole destinations.

    ISP's: Prefer customer routes over all others. Use provider routes only as a last resort. Dont provide transit between peers or providers. Verify customer address space.

    Border Gateway Protocol: A policy based routing protocol, used for EGP. Quite simple, but configuration errors distribute. Works by:

    Establish session on TCP port 179 with another AS. Exchange all active routes. Exchange incremental updates- while connection is alive exchange route update messages. Four types of BGP messages: Open, Keep Alive, Notification (bye), Update (update routes).

    Receive BP updates -> Apply import policies (tweaks) -> Best route selection -> Best route table (IP forwarding table) -> Apply export policies -> Transmit BGP Updates.

    Route selection works by first highest local preference (tagged values of routes), then shortest ASpath, then just lowest router ID (give up essentially). Shortest AS path isn't neccesarily better, one AS could have twenty hops in it. BGP won't accept paths with loops. Packet may end up not following AS path (due to filters on routers).

    Community attributes: Used for signalling between ASes, so can tell what packets to transport/filter.

    Two types of BGP sessions: External Neighbour (EGBP) in a different Autonomous System. Internetal Neighbour (IBGP, using IGP) in the same autonomous system. Every time a route announcement crosses an AS boundary, the Next Hop attribute is changed to to the IP address of the border router that announced the route. EHP is joined with IGP for connectivity.

    Hot potato routing: Go for the closest egress point as getting traffic off your network as soon as possible is cheaper. Often you send out a tiny http request mainly down a high bandwidth provider backbone, then get a huge http reply down the low bandwidth customer backbone.

    Cold potato routing: Using MEDs (multi exit discriminator attribute) routers prefer lower MED values, which are considered before IGP distance. Some providers ignore MEDs. AS's can tweak, particularly outbound traffic, by filtering inbound/outbound routes.

    Backup links: Through local setting preference higher for primary link than backup link can force outbound traffic to take primary link, unless link is down. Can shed inbound traffic with ASPATH padding ie set backup link ASPATH to 2 2 2 and as loops are dropped will force to take primary link. May still send to backup link anyway because prefers customer routes and local preference is considered before ASPATH length. Can set community reference to force traffic down primary link.

    BGP Wedgie: BGP policies make sense locally, and interaction of local policies allows multiple stable routings. Some routings are consistent with intended policies, some aren't. A full wedgie (as opposed to half) is when an unintended routing is installed, and no single group of network operators has enough knowledge to debug the problem. Can end up accidentally using back up link when path was supposed to avoid it. Can be recovered with manual intervention by bringing down faulty AS sessions then bringing them back up. There is no guarantee that a BGP configuration has a unique routing solution. When multiple exist, the unpredictable order of updates will determine which one wins. There is also no guarantee that a BGP configuration has any solution. Complex policies increase the chances of routing anomalies. A system can end up in many stable states, which cant be un-wedged with session resets.

    Example:


    Which leads to...


    Inter domain routing: Often asymmetric. The goals of fast convergence, minimal updates and path redundancy are all at odds. IGP tie breaking can export internal instability to the whole internet. MEDs can export internal instability. Congestion can take down BGP, eg the SQL Slammer Worm. Can get MED oscillation, when lower MED routers knock higher MED routes out of the picture, then cant find low IGP routes and ends up back in initial state again.

    Squashing updates: Have min route advertisement interval. Hides internal changes but effective in dampening oscillations inherent in the vectoring approach.

    Route flap dampening: Routes are given a penalty for changing. If penalty exceeds suppress limit the route is dampened. When the route is not changing the penalty decays exponentially. If the penalty goes below the reuse limit, then it is announced again. However will punish small updates which are common on routers with lots of connections.

    Advanced Operating Systems:

    Shared Memory model: Collection of threads sharing address space. Read/writes on memory locations implicitly and globally visible. Eg; X := X + 1.

    Easy to use, transparent and scalable. But race conditions, synchronisation and cost.

    Message passing model: Collection of processes with private address spaces. Explicit co-ordination through messages eg;

    send message("fetch(x)")

    Received message

     

    send_message(“X”)

    tmp := recv message(P2)

     

    tmp := tmp + 1

     

    send message("tmp")

    x: = recv message(P1)

    Message passing provides control, protection and performance.

    Demand Paged Virtual Memory: Run time mapping from logical to physical addresses performed by special hardware.

    Variants: Segmentation, capabilities, paging. Typically use demand paging: Create process address space (setup page tables), mark page table entries as either invalid or non resident. Add PCB to scheduler.

    Whenever we receive a page fault:

    Check PTE to determine if invalid -> If invalid, kill process -> If not page in the desired page by finding a free frame (may require page replacement) then read in from disk and restart the process as the faulting instruction.

    Distributed Shared Virtual Memory: Shared memory on tightly coupled systems. Message passing on loosely coupled systems.

    DVSM provides shared memory on clusters. Each page has a home processor, can be mapped into remote address spaces, on read access page in across network, on write access sort out ownership.

    DVSM tracks current ownership, copies data across the network and sets access bits to ensure coherence.

    Simple case: Centralized page manager runs on a single processor, maintains owner(p) (processor which created or last wrote to page p) and copyset(p) (all processors with a copy of p) per page.

    Then on read fault need four messages: Contact manager, manager forwards to owner. Owner sends page, requester acks to manager.

    On write fault contact manager, invalidate copyset. Manager contacts owner, relinquishes page, requester acks to manager.

    Load balance: manager(p) is (p % #processors). Reduce messages: manager(p) = owner(p).

    can broadcast to find manager(p), or keep a per processor hint that is updated on forwarding or invalidate.

    DVSM can still be expensive, eg false sharing when a processor writes to a page everyone with read access has to be updated, even if they dont care.

    Can reduce traffic by using weaker memory consistency, where rather then every read seeing every write use explicit acquire() and release() for synch. The best performance is achieved by doing type-specific coherence: Private memory-> Ignore. Write-once-> Just service read faults. Read-mostly->Owner broadcasts updates. Producer-Consumer->Live at P, ship to C. Write-many -> release consistency & buffering. Synchronization-> Strong consistency.

    Persistence: Virtual memory means memory is backed by non volatile storage. Could make this the default case, with no distinction between files and memory. Could do so by faking at run time or prescribing how data is accessed.

    Persistence of data: Length of time it exists. Orthogonal persistence: Manner in which data is accessed is independent of how long it persists.

    Multics Virtual Memory: Many concentric rings of privilege. No file system per se, users see large numbers of segments of virtual address space, each backed by a secondary store. Directories contain branches (bit like inodes) with contain unique segment id, access control list, a ring bracket (where the ACL applies) and a list of gates (procedure entry points). Flexible security as processes jump through gates, but complex. Names translated to segments through pathnames or reference names. Known and active segments.

    Persistent Virtual Memory: Can't safely name all data. Possible solution via pointer swizzling: can allocate objects on persistent heap. All data in persistent pages are addressed by special 64 bit persistent points, which are never directly accessed. On access a new page is allocated and the pointer is swizzled to point to it. Now we have 64 bit address space means virtual addressed can serve directly as unique names - single address space.

    Recoverable Virtual Memory: Regions of virtual memory on which transactional semantics apply.

    Lightweight RVM: Cheaper than full transaction semantics, just considers atomicity and some durability. Processes map regions of external segments into their virtual address space then begins transaction, sets the range to work on which is copied and added to undo log (so on abort can restore), on end transaction LVRM writes all ranges to the redo log. When the redo log gets full push contents to external segments and truncate log.

    Making RVM faster: LVRM is faster but up to three copies of data (undo,redo,truncate) and expensive synchronous disk writes. Rio Vista uses a persistent NVRAM cache. On set_range copy to undo log in NVRAM, all updates immediate so no redo log. If the machine crashes, flush NVRAM to disk. On map, lazily undo any transactions which had not committed at time of crash. Problem if DB doesn't fit in NVRAM.

    Capability based addressing: A capability is a protected name for an object. Possession is necessary and sufficient for accesses. Can implement in software (crypto) or hardware).

    Cambridge Cap Computer: Fine grained memory protection. Controls what can be written to registers but not who. Base limit registers and their contents become capability registers and capabilities.

    A capability consists of base, limit and access code. In normal systems all control of privilege lies with the OS designer (coarse grained). CAP has rings of protection, rather than being hierarchical. Hierarchies are useful for control, but unnecessarily restrictive for protection. Domains of protection: Set of capabilities to which a process has access. Enter instruction used to change domain of protection, return. Define protection environment for processes to give different privileges to different processes. Kernel ENTERs use process; control RETURNs on process trap or interrupt. Requires hardware support. Relative capabilities allow the base address of a capability to be relative to the base of some other segment. Now capability is (base, limit, access code, reference) and a reference is (capability|whole memory). Processes can now hand on subset of memory access to sub processes.

    Kernel lives in a segment called the master resource list. Processes have a process resource list.

    Key features of CAP: Segment swapping, local naming, flexible control of sub processes. Minimum privilege and modularity in programs. But hardware is complex, slow and security vs. usability.

    Microkernel Operating Systems: Simplify kernel and build modular system. Support multiprocessors and distributed computing. Moves functions to user space servers, access servers via interprocess communication. More modular, so more robust and scalable.

    Mach Microkernel: Allows you to run different OS's through a software emulation layer. Support diverse architectures, scale across network speeds, distributed operation. User space: User process -> Software emulation layer (containing BSD etc.). Microkernel: Threads,IPC,Virtual memory, Scheduling -> Mach. A task is an execution environment, a thread is the unit of execution. IPC uses buffered comms channel between ports to send messages. Memory objects are a source of memory eg; memory manager, file on file server.

    L3/L4 Microkernels: Too much in micro kernel, lost benefits. Too little, too slow. Minimise kernel, make it really fast. L3 systems provides just: recursive construction of address spaces, threads, IPC, unique identifier support. Address spaces support by three primitives: Grant (give pages to another address space). Map (Share pages). Flush (take back mapped or granted pages). Microkernel manages thread to address space binding. IPC is optimised message passing between address spaces, interrupts handled as messages too.

    Extremely Reliable Operating System: EROS is a persistent software capability microkernel. Capabilities allow decomposition and so accesses delegation. Data space: Set of pages using data registers. Capability space: Set of cgroups uses capability registers. Each capability is (type,oid,authority). Types such as page, cgroup, number, segment, domain. Segments correspond to address spaces, domains correspond to processes. Persistence achieved by flushing objects to disk, circular log. Capabilities are prepared before being used (bring object into memory). Only unprepared capabilities written to disk. Capability based IPC scheme: Names capability to be invoked, operation code, data.

    Virtual Machine Monitors: Use hypervisor (beyond supervisor, ie normal OS) to multiplex multiple OSes. Eg; IBM's VM/CMS, VMWare, Denali.

    CMS: provides each OS with virtual console, processor, memory and I/O devices. Can even run another VM.

    Disco: Designed to run commodity OS's on ccNUMA. CPU looks like real MIPS R10000.

    VMWare: Founded by Disco dudes. Virtual machines for x86. 17 instructions on the x86 aren't virtualizable (different user/kernel semantics that don't trap) so can't be emulated- have to perform binary rewriting to manually insert traps. Physical to machine address mapping realized by using shadow page tables. VMWare writes special device drivers (eg display) and other low level code optimizations to improve speed.

    Denali: Use VMM as an isolation kernel. Supports security and fair performance. Uses para virtualization: Invents new x86 like ISA, write/rewrite OS to deal with it.

    XenoServers: Control Plane (User software) -> Guest OS (XenoLinux, XenoXP) with Xeno device drivers -> Xen Hypervisor (Virtual processor, memory,networks) -> Hardware. Guarantee resources and charge for them, share and protect hardware. Uses paravirtualisation, but real operating systems.

    Xen: Based on low level parts of linux. Device drivers, special interface to XEN booted at run time. Can see real addresses. Interrupts converted to events. Asynchronous queues for network and disk. Fast. Now moved device drivers into driver domains, h/w assisted full virtualization.

    Why VMM: OS static size small compared to modern memory. Flexibility and security. Can work at office in day, suspend VM to disk and copy to home pc. Avoid using home pc with viruses for work.

    Extensibility: Easy to fix mistakes, add new features, run time specialisation. Low-level scheme: With a virtual machine you can recompile and reboot OS without logging off. Kernel level scheme: Often just want to change small part of OS. Allow OS parts to by dynamically [un]loaded. In Linux register on load.

    Security in Kernel level Extensibility: Trusted compiler and digital signature. Proof carrying code. Sandboxing - limit memory references. Timeouts prevent CPU hoarding.

    Proof carrying Code: Take code, check it and run if checker says it can't access outside some logical fault domain. Easier to check proof than generate it. Need formal language and algorithms for generating and validating proofs.

    Sandboxing: Transform code to make it safe. Software fault isolation: Insert instructions to check memory access bounds. Need trusted optimising compiler.

    SPIN operating system: Kernel resources defined by capabilities (unforgeable reference to a resource). Protection domain is enforced by language name space (not virtual addressing). Extensions define events and handlers. Need trusted compiler, termination.

    VINO operating system: Download C++ grafts into kernel. Free access to most kernel interfaces achieved by SFI (sandboxing). Stop hogging with quotas. Prevent resource starvation with timeouts. Safe graft termination assured by two phase locking + kernel data access via accessors, and wrapper functions around grafts.

    User level schemes: Simpler than kernel level schemes. Eg; Nemesis.

    The Exokernel: Seperatae concepts of protection and abstraction. Abstractions impose costs. Exokernel compiles packets you wish to receive using DPF to fast, unsafe machine code. Untrusted Deterministic functions allow exokernel to sanity check block allocations.

    Nemesis: Designed to support soft real time applications. Isolation: Explicitly guarantees to applications. Exposure: Multiplex real resources. Responsibility: Application must do data path. Guarantees each application a share of physical resources (space and time), use IDL to allow user space extensibility.

    Extensibility Summary: Simplifies system monitoring. Allows dynamic system tuning. Allows better system/application integration. Can do it with virtual machine monitors, downloading untrusted code (and checking), putting things in user space, pushing protection boundary to rock bottom.

    Database Storage: Could use standard flat files, go through line by line checking if query condition is set. Inefficient as deletions expensive, no indexes so search expensive, need brute force query processing. No reliability or security, no buffer or concurrency control.

    Block size: Large blocks amortise I/O costs but reads in more useless stuff. Need to sort data on disk. Use double buffering to maximise concurrency. Need write ahead log to deal with failures mid transaction.

    Representing Records: A record is a collection of related data (fields), can be fixed format or variable length. Fixed format by using schema - holds #fields, types, meaning, order. Variable format with self describing records. Normally hybrid schemes.

    Could store records directly in blocks, but with variable length records need separation marker. Usually too messy to mix record types within a block.

    Efficient Record Retrieval: Spare index: Smaller, more index in memory. Better for inserts. Dense: Existence check without accessing files. Needed for secondary indexes.

    Can use block pointers, or if contiguous can omit. Need a balanced tree eg; B+- tree.

    B+- tree: All nodes have n keys and n+1 pointers. In non leaf nodes, each pointer points to nodes with key values <right key, > left key.

    Balanced tree (all leaves the same depth): Keep > [(n+1)/2] in non leaves, keep > [(n+1)/2].

    Search is fast: Binary search at each level - O(log(n)). With N records, height lognN.

    Insertion: Space in leaf ok. Otherwise split, if root split new root and new height. Deletion: If min bounds not violated ok, otherwise need to propagate keys upward and coalesce siblings.

    LRU buffering is a bad idea, keep root and higher levels in memory.

    Can get a B tree which avoids key duplication, smaller but harder deletion.

    Spatial Indexes: Want to express spatial queries( close to, intersects etc.). Can store with multi dimensional hashing or balanced trees in spatial occupancy.

    Postgres DBMS: Provides object management. Uses set orientated Postquel, simple but big memory footprint. Rather than using a write ahead log leaves old version of record in database, and attaches 2 bits per transaction stating if in progress, committed or aborted. Cheap abort, but must flush new records to disk on commit, when disk fills up flush to write once media.

    Operating Systems and Databases: Most OSes use LRU for buffer replacement, but DBMS accesses are often sequential (with nested loops) or random access. So DBMSes end up doing their own buffer pool management in user space. However doing this in user space can get double paging, multiple trees. Could give OS "hints". NTF directly supports an infinite write ahead log.

    Distributed Storage: Redundant Array of Inexpensive Disks (RAID). Better performance through striping. More reliable via redundancy (just mirror). Scalable, fault tolerance. Can do distributed storage, where data is managed by a central policy.

    Network attached storage: NAS runs over TCP/IP exporting NFS,SQL... and distributes storage at the file server/ database level. Good general purpose.

    Storage area networks: SAN distributes storage at the block level. It runs over a fibre channel and is accessed via encapsulated SCSI. Specialised.

    Network File Systems: NFS originally designed to be stateless (no record of clients or open files, no write back caching). This is good for discovery but synchronous disk write on server sucks, can't help client caching. Now can do asynchronous writes, stateful (including mount, lock) and only supports TCP. Share reservations (locks) and delegation (server gives client revocable autonomy).

    AFS: Remote, even local access is through a client. Persistent client caching with delegation. Increment file version # on every change.

    Coda: Extends AFS by supporting disconnected operation. Cache manager operates in hoarding (when connected cache copies of working set for user),emulating (when disconnected), reintegrating (when reconnected performs conflict resolution).

    LBFS: Wide area low bandwidth (wireless) networks. Aggressive compression on wire/air and use persistent cache on client. Server divides files into chunks and for each chunk computes a SHA-1 hash. Client does the same, can now do some things by just exchanging hashes. Chunk boundaries based on contents, computer fingerprint over every overlapping regions of every file. Now we get hits if contents appear anywhere in any file that we know about.

    Serverless File Systems: Trend towards serverless file-systems. Eg; xFS has clients, cleaners, managers, storage servers. To read a file: look up manager in globally replicated map, contact manager with request, manager redirect to cache or disk (imap). To write obtain write token from manager, append all changes to log, when hit threshold flush to stripe group. 10x faster than NFS due to co-operative caching, parallelism via software RAID (striping). Managers replicated for fault tolerance.

    JetFile: Serverless internet storage. Shared personal file access.

    Multicast request for data, receive 1 or few responses. Pick a response and unicast for data. Files named by (org, vol, fileID, version). Hash (org,vol,fileID) to get file address. Client now server for new version.

    File systems for SANs: Pool of disks accessed via iICSI. Can build a shared disk file system (SDFS) either: asymmetric- add a data manager, exclusive access to metadata disks, clients access data disks directly. Or symmetric- clients access data and meta data directly, distributed locking used for synchronisation. Asymmetric is simpler but less scalable and fault tolerant.

    Network Attached Secure Disks: Makes less stupid SAN. Disks provide variable length object interface (create, read, write) etc. Disk and file manager share secret key, after file system specific checks, file manager issues derived capabilities to clients -> clients can securely access disk directly. Fast and secure data path operations. But less allocation control.

    Venti: Distributed Archival Storage. Rather than using typical tape for archiving, use magnetic disk. Cheap, ubiquitous and fast but subject to overwrite. Use software to provide immutable storage, every file update a new version. Use content hashes to access everything, inode is a tree of block content hashes. Directories map names to inode content hashes. If a file changes in leaf directory, then some of its block hashes change -> inode changes -> root hash uniquely captures file system snapshot. Now addressing is via hashes need some way to map from hash to actual block location. Venti uses log based (append only storage). Index cache to stop index being a bottleneck, regular block cache too.