| Project | Rearrangements in genomes with unequal content |
|---|---|
| Date | 04/11/05 |
| Version | 1.0 |
The purpose of a prototype is to work through the entire cycle end to end, from solution proposal, development and testing through to experimentation and analysis, in order to gain a deeper understanding of the process and its inherent difficulties. The prototype should therefore be something that can be completed in a relatively short period of time.
The solution I propose for the prototype is a simple generalization of the distance function. Whenever we need to compute the genomic distance between two gene sequences, we consider only the genes that are common to both sequences. A preprocessing step extracts the common genes, and we then compare their gene orders using the standard method derived from HP (Hannenhalli-Pevzner) theory.
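The preprocessing step can be sketched as follows. This is a minimal illustration, not the prototype's actual code: gene orders are assumed to be lists of signed integers, the function names are hypothetical, and a simple breakpoint count is swapped in for the HP reversal distance, which is considerably more involved to compute.

```python
def restrict_to_common(a, b):
    """Keep only the genes present in both signed gene orders (order and sign preserved)."""
    common = {abs(g) for g in a} & {abs(g) for g in b}
    return ([g for g in a if abs(g) in common],
            [g for g in b if abs(g) in common])


def breakpoint_distance(a, b):
    """Count breakpoints of a relative to b (both over the same gene set).

    Stand-in for the HP reversal distance, used here only for illustration.
    """
    # Frame both linear gene orders with a start cap (0) and an end cap.
    end = max((abs(g) for g in a + b), default=0) + 1
    ext_a = [0] + a + [end]
    ext_b = [0] + b + [end]
    # An adjacency (x, y) is the same as (-y, -x) read from the other strand.
    adjacencies = set()
    for x, y in zip(ext_b, ext_b[1:]):
        adjacencies.add((x, y))
        adjacencies.add((-y, -x))
    return sum((x, y) not in adjacencies for x, y in zip(ext_a, ext_a[1:]))


def common_gene_distance(a, b):
    """Distance between two genomes computed on their common genes only."""
    a_common, b_common = restrict_to_common(a, b)
    return breakpoint_distance(a_common, b_common)


a = [1, -3, -2, 4, 5]   # gene 5 is absent from b
b = [1, 2, 3, 4, 6]     # gene 6 is absent from a
print(common_gene_distance(a, b))  # 2
```

Note that two genomes with no genes in common get a distance of 0 under this scheme, since nothing is left to compare after the preprocessing step.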
The first set of test data consists of only 3 input genomes. We start with the identity permutation as the ancestral genome and apply a number of reversals and single-gene deletions to obtain 3 new genomes, which serve as input to the algorithm. The prototype algorithm is then used to reconstruct the phylogeny and the ancestral gene order.
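Test data of this kind could be generated along the following lines. This is a sketch under assumptions: the function names, the number of genes, the number of operations per genome and the random seed are all illustrative, and each of the 3 genomes is derived independently from the ancestor, which may differ from how the actual test sets were built.

```python
import random


def reverse_segment(genome, i, j):
    """Signed reversal of genome[i..j]: reverse the order and flip every sign."""
    return genome[:i] + [-g for g in reversed(genome[i:j + 1])] + genome[j + 1:]


def delete_gene(genome, i):
    """Single-gene deletion at position i."""
    return genome[:i] + genome[i + 1:]


def evolve(ancestor, n_reversals, n_deletions, rng):
    """Apply the requested reversals and deletions in a random order."""
    genome = list(ancestor)
    ops = ["reversal"] * n_reversals + ["deletion"] * n_deletions
    rng.shuffle(ops)
    for op in ops:
        if not genome:  # nothing left to rearrange or delete
            break
        if op == "reversal":
            i, j = sorted((rng.randrange(len(genome)), rng.randrange(len(genome))))
            genome = reverse_segment(genome, i, j)
        else:
            genome = delete_gene(genome, rng.randrange(len(genome)))
    return genome


rng = random.Random(0)            # fixed seed so a test set can be regenerated
ancestor = list(range(1, 11))     # identity permutation on 10 genes
inputs = [evolve(ancestor, n_reversals=3, n_deletions=2, rng=rng) for _ in range(3)]
for genome in inputs:
    print(genome)
```

Keeping the ancestor and the list of applied operations alongside each generated genome makes it possible to compare the reconstructed ancestral gene order against the true one.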
Analysis of the experiments reveals that the new distance function does not behave as expected. Pairs of genomes with fewer genes in common tend to produce smaller distances, whereas pairs with many genes in common produce larger distances. This is because the number of rearrangement events that can be measured is limited by the number of genes being compared: a reversal distance over k common genes cannot exceed k + 1. The result is counterintuitive, as we expect genomes with widely different gene content to have a larger genomic distance and genomes with similar gene content to have a smaller one. The underlying reason for this anomaly is that the distance function does not take the gene content of the genomes into consideration.
Possible extensions to this approach would be to incorporate differences in gene content into the distance function and to add insertion and deletion of genes to the list of allowed rearrangement operations. The problem, however, is that an insertion or deletion is usually not a “good” rearrangement operation, so it may only be used when solving for the median.
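One way the first extension might look is sketched below. This is purely an assumption about the form such a function could take, not something that was implemented: the additive penalty, the `indel_weight` parameter and all names are hypothetical, and `count_mismatches` is only a placeholder for the real gene-order distance.

```python
def content_aware_distance(a, b, order_distance, indel_weight=1.0):
    """Gene-order distance on the common genes plus a penalty per unshared gene.

    order_distance is any function comparing two gene orders over the same
    gene set (for example an HP reversal distance); it is passed in so that
    this sketch does not depend on a particular implementation.
    """
    genes_a = {abs(g) for g in a}
    genes_b = {abs(g) for g in b}
    common = genes_a & genes_b
    a_common = [g for g in a if abs(g) in common]
    b_common = [g for g in b if abs(g) in common]
    # Each gene found in only one genome counts as one insertion or deletion.
    unshared = len(genes_a ^ genes_b)
    return order_distance(a_common, b_common) + indel_weight * unshared


def count_mismatches(x, y):
    """Placeholder order distance: positions at which the two orders differ."""
    return sum(p != q for p, q in zip(x, y))


print(content_aware_distance([1, 2, 3, 5], [1, 2, 4], count_mismatches))  # 3.0
```

With a penalty of this kind, genomes with widely different gene content can no longer end up closer than genomes with similar content simply because few genes remain to compare. How the weight should be chosen relative to a reversal is left open.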
In the prototype, the main approach is to “delete” redundant genes when computing the genomic distance between two genomes. An approach with the opposite effect would be to insert “missing” genes so that the two genomes under comparison have the same content.
The idea here is the exact opposite of what was used in the prototype. We first preprocess all the genomes, and into each genome we randomly insert its “missing” genes (genes which are found in other genomes but not in the current one). The resulting set of genomes all have the same content, so the original MGR algorithm can be used to recover the phylogeny. The output can then be processed to remove the genes inserted in the preprocessing phase.
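The insertion and removal steps can be sketched as follows. This is an illustration only: the function names are hypothetical, and choosing a random sign for each inserted gene is an assumption, since the text above only specifies that the position is random. The call to MGR itself is not shown.

```python
import random


def pad_with_missing(genomes, rng):
    """Insert every missing gene at a random position so all genomes share the same content.

    Returns the padded genomes and, for each one, the set of genes that was inserted.
    """
    universe = sorted({abs(g) for genome in genomes for g in genome})
    padded, inserted = [], []
    for genome in genomes:
        present = {abs(g) for g in genome}
        missing = [g for g in universe if g not in present]
        new_genome = list(genome)
        for g in missing:
            position = rng.randrange(len(new_genome) + 1)
            sign = rng.choice((1, -1))  # assumption: orientation is also random
            new_genome.insert(position, sign * g)
        padded.append(new_genome)
        inserted.append(set(missing))
    return padded, inserted


def strip_inserted(genome, inserted_genes):
    """Remove the genes that were added during preprocessing."""
    return [g for g in genome if abs(g) not in inserted_genes]


genomes = [[1, 2, 3], [1, -3, 4], [2, 4, 5]]
padded, inserted = pad_with_missing(genomes, random.Random(1))
for genome in padded:
    print(genome)  # each padded genome now contains genes 1..5
```

Stripping is straightforward for the input genomes, where the inserted genes are known. Which genes to remove from a reconstructed ancestral genome is a separate decision that this sketch does not address.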
A few initial experiments showed that this approach is worse than the prototype, because the additional genes are inserted randomly. When a large number of genes are inserted this way, the resulting set of genomes tends to lack “good” rearrangements, so the algorithm resorts to the iterative tree-building step. Occasionally it also results in triplets which cannot be solved by the median algorithm.
The previous two iterations showed that deleting genes is more accurate than randomly inserting genes. The approach taken in this iteration is therefore similar to that of the prototype.
See Minutes 06/12/05