Experimentation plan

Experiments carried out

  • Model parameters
    • number of genes in the root (n)
    • number of genomes in the leaves (m)
    • number of rearrangements per edge (k)
    • percentage of deletions (D)
    • genomes are circular ? (C)
    • generate outgroup ? (O)
  • Algorithm parameters
    • use heursitic 5, favour short reversals (H5)
    • use “best rearrangements” only (B or H9)
    • normalize all the genomes at the start (E)
    • use iterative tree building only, do not use “good” rearrangements (G)
  • Performance metrics
    • score of the recovered tree
    • percentage of edges which correspond to edges in the model tree
      • computed as (num_edges - RF_dist) / num_edges
    • reversal distance between recovered root with the gene order of the true root (for data sets with outgroup)
  1. For 30 makers and 3 genomes, plot ratio of rearrangements/markers vs average score difference of the trees for 10%, 20%, 30% and 40% single gene deletion events
  2. For k = 2 and k = 3 using 30 markers, plot number of genomes vs average score difference. Model tree is that of a complete binary tree with k rearrangements per edge.
  3. For 30 markers and 10 genomes, plot the percentage of good edges vs number of rearrangements per edge for mgr, mgr using only the iterative tree building, best rearrangements, best rearrangements using only iterative tree building. Topology is generated using the Beta-splitting model with beta = -1.
  4. For 30 markers and 10 genomes, generate datasets with outgroup and compute the average reversal distance between the recovered ancestor (taken to be the parent of the outgroup) and the actual root of the model tree. The outgroup is generated by first generating the topology of 9 genomes and then generating another genome from the root to form the outgroup. The number of rearrangements used to generate the outgroup is 2 + (height of the smallest complete tree with >= 9 leave nodes).
    • Problem: average reversal distance decreases as the percentage of deletions increases
    • Solution: Maximum reversal distance between two genomes of size n is n, therefore a solution to the problem is to use normalized reversal distance instead, i.e. distrev/n

Tree models

Beta-splitting model (Aldous 1995)

Procedure for generating a tree topology base on the beta-splitting model:

  1. Place n genomes on the unit interval at uniform random position
  2. Split the interval according to a random point chosen from a probability density function f.
    • In the case of the beta-splitting model, f ∝ xβ(1 − x)
  3. Rescale the two subintervals to unit length repeat recursively on subintervals for as long as subintervals contain at least two particles

Aspects of the beta-splitting model:

Beta Description Median split
−2 Completely unbalanced 1
-1.5 PDA model 1.5
−1 Unnamed √m
0 Markov model m/4
An almost completely balanced model m/2

Experiments in literature

Genome-Scale Evolution (Bourque and Pevzner 2002)

  • For 30 and 100 markers, plot ratio of reverals/markers vs average score difference of the trees and ratio of reverals/markers vs average reversal distance of recovered ancestral permutation to identity
    • Starting with the identity permutation, perform k reverals each to get G1, G2 and G3.
    • Similar as above except perform k, k and 2k random reversals
    • Small tree with 4 genomes as leaves and two internal nodes, pick one of the internal nodes as identity
  • For k = 2 and k = 3 using 30 markers, plot number of genomes vs average score difference
    • Larger complete unrooted binary trees and simulated k random reversals on each branch
  • Hepesvirus data, 3 genomes
  • Human, fruit fly and sea urchin mtDNA data, 3 genomes (ref Sankoff et al. 1996)
  • Metazoan mtDNA data, 11 genomes (ref Blanchette et al. 1999)
  • Campanulacea cpDNA data, 13 genomes (ref Cosnet et al. 2000a,b and Moret et al. 2001)

Reconstructing Optimal Phylogenetic Trees (Moret and Warnow 2002)

  • Trees generated using Jukes-Cantor model with varing number of taxa and rates of evolution
  • Randomly generate model tree topologies from the uniform distribution on binary leaf-labelled trees
  • Used 5, 10, 20, 40 and 80 taxa and 8 different expected evolutionary rates. For each evolutionary rate and problem size, we generate a total of 100 topologies, grouped into 10 runs of 10 trials.
  • For 40 taxa, sequence length 500 and 2000, plot % quartets vs % tree edges for various evolutionary rates.

Phylogenetic Reconstruction from Arbitrary Gene-Order Data (Tang et al. 2004)

  • Datasets of 10 and 11 genomes, genome sizes of 50 and 100 (typical mtDNA and cpDNA)
  • Evolutionary rates of 2, 4 and 6 expected events per tree edge
  • Each tree node has a 5% chance to lose one segment of genes, length of this segment is at most 10% of number of genes in its parent
  • For each combination of settings, ran 10 datasets
  • Table 1 and 2: average false positives and false negatives of number of edges in error
  • cpDNA dataset of 7 genomes
    • new method
    • neighbour-joining on distance matrix
    • regular GRAPPA using breakpoints medians
    • regular GRAPPA with inversion medians

Reversing Gene Erosion (Young et al. 2004)

  • 10 gamma-proteobacteria
  • Simulated data was created using the same tree as for the bacterial dataset, edge lengths were interpreted as the number of operations per gene.
  • Allowed operations are insertions, deletions and inversions. In moving from root to leaves, a particular gene can only be inserted along one edge of the tree.
  • Simulated 100 labellings of the tree with a root genome size of 200 genes for each of 5 scenarios: inversion only, no deletions, no insertions, low levels of insertion and deltion and high levels of insertion and deletion. Leaf genomes produced ranged from 70 genes to 400 genes
  • Compared number of operation over all edges, compared edge lengths in reconstructed trees with those in true trees by computing the ratio of the lengths for each edge.
  • Calculated the number of operations needed to convert the reconstructed genome labels at internal nodes into the corresponding labels of the true tree. Distances are normalized by dividing by the size of the true tree genome.

Experimentation in Phylogeny (Moret et al. 2004)

  • An assessment must take into account
    • accurary of the reconstruction
    • biological significance of data
    • scaling up of resource consumption (time and space)
  • Simulations must be based on the best possible models, in this case we need accurate models of speciation and extinction, gene gain/loss and of genome rearrangements
  • A typical simulation run as follows:
    1. generate a rooted binary tree (model speciation and extinction) with the appropriate number of leaves
    2. assign a length to each edge of the tree (model of species divergence)
    3. place a genome of suitable size and composition at the root
    4. evolve the genomes down the tree (model of genome evolution)
    5. collect the genomes thus generated at the leaves and use them as input to the reconstruction algorithm
    6. compare the topology of the reconstructed tree and the model tree
  • Guidelines
    • tree shape plays a surprising large role
    • evolutionary models for divergence and genome evolution are important, most reconstruction exhibit poor accurary when the diameter (ratio of the largest to smallest pairwise distance) of the dataset is large
    • testing a large range of parameters using many runs for each setting to estimate variance are essentially parts of any testing strategy
 
mgr/experimentation_plan.txt · Last modified: 2007/12/29 09:37 (external edit)
 
Recent changes RSS feed Creative Commons License Donate Powered by PHP Valid XHTML 1.0 Valid CSS Driven by DokuWiki