Bioinformatics

 

<!-- @page { size: 8.5in 11in; margin-left: 0.41in; margin-right: 0.36in; margin-top: 0.23in; margin-bottom: 0.33in } P { margin-bottom: 0.08in } H1 { margin-bottom: 0.08in } H1.western { font-family: "Arial", sans-serif; font-size: 16pt } H1.cjk { font-family: "Lucida Sans Unicode"; font-size: 16pt } H1.ctl { font-family: "Tahoma"; font-size: 16pt } TD P { margin-bottom: 0in } -->

Bioinformatics Notes

  • Human Cells
  • Edit & Hamming Distance
  • Sequence Alignment Graphs
  • Needleman Wunsch Algorithm
  • Smith Waterman Algorithm
  • Four russians speedup
  • Fasta
  • Pattern Hunter
  • Phylogeny Trees
  • Small Parsimony Problem
  • Fitch Margoliash
  • Hidden Markov Models
  • Gene Expression Microarrays
  • Gillespie Algorithm

    Human Cells: The human genome consists of 46 chromosomes which are formed from genes. DNA formed from four nucleotides (G-C,A-T). Genes are transcribed to RNA then translated into amino acids (protein building blocks).

  • Sequence alignment: Given two strings an alignment is an assignment of gaps so the sequences match. Between two sequences there can be matches, mismatches and insertions/deletions (where there is a gap in one sequence).

    Edit Distance: The minimum number of operations (insertions, deletions,substitutions) to transform one string into another. Eg;


    requires the insertion of – at the beggining, and deleting T from the end. Edit distance=2. Hard to compute.

    Hamming Distance: Aligning sequences without insertions and deletions, ie compute how many mismatches. Trivial to compute. Eg;


    requires 8 changes, hamming distance-8.

    Alignment Graphs: Score a 1 for a match (diagonal), 0 for an insertion/delete (horizontal/vertical). Eg;


    Has score 5.

    A Scoring function: F(from,to) = (#matches x m) - (#mismatches x s) - (#gaps x d).

    Needlemann-Wunsch:

    Go through all cells computing the score of the path to there. Say +1 for a match m, -1 for a mismatch s, and -2 for a gap d. At each point choose the optimal neighbor. The bottom right is the best match for the whole sequence.

    Eg;

    Start by filling in matches all with 0 (-2 for each time).

     

    0 _

    1 A

    2 G

    3 C

    0 _

    0

    -2

    -4

    -6

    1 A

    -2

     

     

     

    2 A

    -4

     

     

     

    3 A

    -6

     

     

     

    4 C

    -8

     

     

     

    The cell at (1,1) gets a +1 for a match (A,A).

    We could then match the A with a _ by going to (2,1) which gets -2, so 1-2=-1.

    We could then go right again by matching with a _, scoring -2, so -1-2=-3.

    Then onto the next row, going from the 1 at (1,1) we match a _, scoring -2, so 1-2=-1. We have:

     

    0 _

    1 A

    2 G

    3 C

    0 _

    0

    -2

    -4

    -6

    1 A

    -2

    1

    -1

    -3

    2 A

    -4

    -1

     

     

    3 A

    -6

     

     

     

    4 C

    -8

     

     

     

    Continuing on we have:

     

    0 _

    1 A

    2 G

    3 C

    0 _

    0

    -2

    -4

    -6

    1 A

    -2

    1

    -1

    -3

    2 A

    -4

    -1

    0

    -2

    3 A

    -6

    -3

    -2

    -4

    4 C

    -8

    -5

    -4

    -1

    With the path (the minimal at each point) highlighted.

    The Smith-Waterman Algorithm:

    This finds the best local alignments out of a whole string, whereas the NW algorithm only considers the whole string. Eg;

    x = aaaacccccgggg

    y = cccgggaaccaacc

    It works the same as NW, except there are no penalty for gaps but there still is for mismatches. Once the matrix is filled in the highest scored cell is traced back (not neccesarily the bottom right cell).


     

    The lowest common subsequence algorithm: Finds the lowest common subsequence of two strings, running in O(nm).

    Convex Gap Sequences: Gap penalty decreases with distance rather than linearly.

    Normal Gap penalty

    Convex Gap penalty

    Affine Gap




    Convex gap runs in O(N2M). Affine gap is an approximation.

    Banded DP:

    If two strings are close, most alignments will be diagonal.

    Only need to compute diagonal around matrix.

    Takes O(nk) rather than O(n2).



    Four russians speedup:

    Split a sequence into chunks.


    An entire block in the first string is aligned with an entire block in the second string. Entire blocks are inserted and deleted. A block path must travel through the corners of blocks. By putting mini alignments into a lookup table for scores of blocks, runs in O(n2/log n).

    FASTA:

    Search regions for exact word matches. Select top 10 scoring local diagonals, trim off the ends to achieve the highest score. Join diagonals by introducing gaps, Compute the Z score = (x - mean of random scores)/standard distribution to estimate the number of sequences that would randomly produced. This is the E value. The higher the e value, the higher the less likely the sequence is aligned purely by chance.

    E = Kmn eS

    E is the number of hits you would expect from your search with scores greater than S where:

    m is the size of the query

    n is the size of the database being searched

    λ scales for the specific scoring matrix used (decay constant from the extreme value distribution)

    K is a constant

    p=1-e-E

    BLAST:

    BLAST takes as input a query sequence and a sequence database.

    BLAST returns sub-sequences in in the query that are similar to the database.

    W=Size of words to split query sequence into

    T=Threshold to consider a match

    M=% of matches required to consider a sequence still matched

    E=If alignment is of enough interest to be reported, its score will be above E

    - Split the query into words of size W

    - Scan the database for these words, for any ungapped matches within a threshold of T (dont have to be exact)

    For example, given the sequences AGTTAC and ACTTAG and a word length W = 3, BLAST would identify the matching substring TTA that is common to both sequences. By default, W = 11 for nucleic seeds.

    - For each match, keep extending in both directions until the percentages of matches is below M

    For the example, the ungapped alignment between the sequences AGTTAC and ACTTAG centered around the common word TTA would be:

    ..AGTTAC..

    | | | |

    ..ACTTAG..

    - Report the matched sequences found, if their score is above E

    - Continue scanning for matches until all words have been considered

    • Then try gapped sequences


    Runs in O(nm). A segment pair is maximal if it has the best score over all segment pairs. A segment pair is locally maximal if its score can't be improved by extending or shortening. Gapped BLAST searches for two non overlapping seeds occurring within a fixed size window. PSI Blast uses significant alignments to construct "position specific" score matrix. Longer seeds speedup, but lose sensitivity. BLASTZ uses a spaced seed with dont care positions to increase sensitivity.

     

    Pattern hunter:

    Spaced seed eg 111010010100110111: Eleven required matches and seven dont care positions.

    Phylogeny Trees: A root in a phylogeny tree is the point furthest back in time. By setting the length of an edge to the Hamming distance the parsimony score is the sum of lengths of the edges.

    Small Parsimony Problem:

    Input: Tree T with each leaf labeled by an m-character string.

    Output: Labeling of internal vertices of the tree T minimizing the parsimony score.

    With the weighted small parsimony problem the Input includes a k * k scoring matrix

    describing the cost of transformation of each of k states into another one.

    Fitch-Margoliash:

    1) Find the mostly closely related pairs of sequences (A, B).

    2) Treat the rest of the sequences as a composite. Calculate the average distance from A to all others; and from B to all others.

    3) Use these values to calculate the length of the edges a and b.

    4) Treat A and B as a composite. Calculate the average distances between AB and each of the other sequences. Create a new distance table.

    5) Identify next pair of related sequences and begin as with step 1.

    6) Subtract extended branch lengths to calculate lengths of intermediate branches.

    7) Repeat the entire process with all possible pairs of sequences.

    8) Calculate predicted distances between each pair of sequences for

    each tree to find the best tree.

    Sankoffs Algorithm:

    Check children's every vertex and determine the minimum between them.

    Calculate and record a minimum parsimony score of the subtree for every possible label at each vertex. The score is the minimum of the left child+right child.

    Identical results to fitch.

    Tree distance: Length of path between two leaves.

    Edit distance: The distance matrix is computed by the distances between leaves. Eg;



    UPGMA:

    Computes the distance between clusters using average pairwise distance. It assigns a height to every vertex, unlike Fitch it assumes a molecular clock. The distance from the root to any leaf is the same.

    Initialization:

    Assign each xi to its own cluster Ci

    Define one leaf per sequence, each at height 0

    Iteration:

    Find two clusters Ci and Cj such that dij is min

    Let Ck = Ci U Cj

    Add a vertex connecting Ci, Cj and place it at height dij /2

    Delete Ci and Cj

    Termination:

    When a single cluster remains

    Ultrametricity: All tips are an equal distance from the root.

    Additivity: Distance between any two tips equals the total branch length between them.

    Likelihood criterion: Given two trees, the one maximizing the probability of the observed data is best.

    The likelihood for site i on a tree:



    Markov chains: For a zero order Markov chain we estimate the frequency of a word from base composition alone. p(GGATCC) = p(G)p(G)p(A)p(T)p(C)p(C).

    Hidden Markov Models: Consist of a hidden stochastic process that satisfies the property of a Markov chain and a series of observable variables generated by the process. The transmission matrix represents the probability of a transition. Consists of an alphabet, a set of states. Transition matrix of transition probabilities between states, and an emission matrix of emission probabilities within each state. A hidden markov model is memory-less, at each time step t the only thing that affects future states is the current state. A parse of a sequence is a sequence of states. The probability of a parse is the probability of the next letter at each state.

    Forward HMM algorithm: at(i)= P(The observed sequence, ending in state i at base t). What is the probability of the observed sequence? Forward.

    Backward HMM algorithm: bt(i)= P(observation after t | ending in state i at base t). What is the probability that the 3rd state is B, given the observed sequence? Backward.

    Viterbi: Max P(observation, ending in state i at base t). What is the most likely die sequence? Viterbi.

    Evaluation problem: How likely is a sequence, given a model? Find the sequence of states that maximize P(x,|M).

    Decoding problem: How was the sequence generated? Find the sequence π of states that maximize P(x,π|M).

    Learning Problem: What are the probabilities of the states, and from which states was a sequence produced? Find parameters 0 that maximize P(x|0).

    Viterbi algorithm: A dynamic programming algorithm for finding the most likely sequence of hidden states. Runs in O(K2N) time and O(KN) space. The probabilities multiplied together are very small, often take the log.

    The Forward algorithm: Finds P(x) = probability of x, given the HMM. Works by summing over all the possible ways of generating x, P(X)=ΣπP(x|π)P(π). To avoid summing over an exponential number of paths π define fk(i)=P(x1,...,xj, π i=k) - the forward probability.



    Backward algorithm: Computes P(πi= k|X), the probability distribution on the ith position given x. P(πi= k|X) = P(X1,.., Xi, πi =k)P(Xj+1...Xn | πi=k) = fk(i) x bk(i). Runs in O(K2N) time and O(KN) space. Often multiply by a constant at each point (As with forward algorithm) to prevent underflows.



    Posterior decoding: Answers "What is the most likely state at position i of sequence x.

    π^i = argmaxk x P(πi= k|X). Where P(πi= k|X) = fk(i) x bk(i) /P(X).

    Gene expression microarrays: cDNA arrays (long sequences, spot unknown sequences, more variability) and Oligonucleotide arrays (short sequences, spot known sequences, more reliable data). Clustering: Genes with similar functions have similar expression patterns. Classification: Determining which states (eg health, cancer) cells are in algorithmically.

    K-Means Clustering Algorithm: 1 Randomly initialize k cluster means. 2 Assign n genes to the nearest cluster mean. 3 Recompute cluster means and go to 2, unless the clustering has converged. The random selection of initial center points creates on-determinism and may produce clusters without patterns. Linear complexity.

    Hierarchical clustering: Start with each point its own cluster. At each iteration, merge the two clusters with the smallest distance.

    Principal component analysis: Reduces the dimensionality of a data set by finding a new set of bariables, smaller than the original set of variables, that nonetheless retains most of the samples information. Principal component analysis: Reduces the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables, that nonetheless retains most of the samples information. Works by computing Cov(x), eigenvalues and eigenvectors, construct matrix.

    MCL Clustering algorithm: Take a random walk on the graph described by similarity matrix but after each step we weaken the links between distant nodes and strengthen the links between nearby nodes. Given a network with n vertexes, it takes the corresponding n×n adjacency matrix A and normalizes each column to obtain a stochastic matrix M. It takes the k th power Mk of this matrix (expansion) and then the r th power mr(ij) of every element (inflation).

    Photolithography: Masks added to each base, if present there will be a hole in the corresponding mask. Creates high density arrays, but sequence length is limited.

    Genetic network: A group of genes which influence the activity (eg expression) of each other.

    Genetic perturbation: An experimental manipulation of gene activity.

    Digraph: Used to represent genetic networks. The adjacency list is the list of nearest neighbors, the list of direct regulatory interactions. The accessibility list shows all nodes (Genes) that cn be influenced from a given gene. The adjacency matrix of a graph contains elements that are 1 if a directed edge exists between the two nodes. The accessibility matrix has a 1 if a path exists between the two nodes. For any list of perturbation effects there exists exactly one genetic network with fewer edges than any other network with the same list of perturbation effects. An accessibility list Acc and a digraph G are compatible if the graph has Acc as its accessibility list. The range of an edge e connecting two nodes i and j is the length of the shortest path between i and j in the absence of e. An edge e with range greater than two, but not infinity, is called a shortcut. A shortcut free graph compatible with an accessibility list is a unique graph with the fewest edges among all graphs compatible with the accessibility list. For any accessibility list Acc of a digraph there exists a compatible graph that is free of shortcuts. Assume that Acc is the accessibility list of a digraph G. For each node x, the adjacency list Adj(x) of a shortcut-free graph Gpars compatible with Acc is a subset of the adjacency list Adj(x) of any graph compatible with Acc.


    In words, for each node i the adjacency list Adj(i) of the most parsimonious genetic network is equal to the accessibility list Acc(i) after removal of all nodes that are accessible from any node in Acc(i).


    A strongly connected component or strong component of a directed graph G is a maximal subset of nodes of G in which every two nodes are mutually accessible. That means, for any two nodes i and j, there is a path from i to the j, and vice versa.

    Gillespie algorithm:

    Initialise the system at time t=0.

    For every reaction calculate aj.

    Calculate a0=Saj.

    Simulate the time to the next reaction using exponential distribution with parameter (a0).

    Increase time by this amount.

    Generate a random number r from U[0,1].

    Pick reaction j such that – (a1+ …aj-1) < r a0 U (a1+ …aj)

    Execute reaction Rj and update the number of molecules in the system.