The Laboratory for Biocomputing and Informatics concentrates on developing algorithms for the discovery, analysis, and annotation of DNA sequence features. In the last year, the lab has been working on the following topics:
- Seed design. Homologous DNA sequences share a common evolutionary ancestor and are similar but not identical because of mutations. Seeds are used to search for homologies or repetitive sequences. A seed is a short DNA word used to find exact matches that may be part of longer approximate matches. In the previous year, the lab developed a new seed model for detecting homologous sequences that differ by insertion and deletion mutations. These indel seeds were shown to be more sensitive for repeat detection than an earlier model, spaced seeds. In the last year, we have extended our research on indel and spaced seeds. We presented a new algorithm and data structure, based on the Aho-Corasick pattern matching tree, that allows the determination of optimally sensitive spaced seeds independent of the mutation characteristics of the homologous regions. All previous work in this area detected optimal seeds only for homologous regions with specific mutation characteristics. We are now investigating the use of multiple seeds (both indel and spaced) to improve search sensitivity and decrease the number of false positive detections. We can calculate the sensitivity of multiple seeds using a modification of this algorithm and data structure.
- Tandem Repeats. We have contributed to a study of tandem repeat plasticity in the dog genome. One of our collaborators previously argued that microsatellite tandem repeats (with periods of 1-6 nucleotides) exhibit an increased tendency to undergo slippage mutations, which change the number of copies present in individual repeats. This increased ability to mutate was argued to be responsible for the wide diversity in dog morphology. In essence, slippery tandem repeats have allowed the breeding of dogs of all shapes and sizes. The question under investigation was whether the increased slippage tendency was present in dogs before domestication began or arose afterward. Examination of many mammalian species showed that evidence of increased slippage is also present in other species of Canidae, suggesting that dogs were domesticated because they had a pre-existing plasticity. In other ongoing work, we are investigating approaches for whole genome clustering of tandem repeats by sequence similarity. We have also recently published a paper on our Tandem Repeats Database, a publicly available repository of information on these repeats.
- Transposons. We have been involved in the analysis of transposon history in the human and other mammalian genomes. Transposons are nucleic acid sequences that can copy themselves from one place to another in a genome. One family of transposons in the human genome has over 10,000 individual copies. Each copy is subject to mutation, and over time most transposons become inactive. There have been hundreds of transposon families active over the course of vertebrate evolution. This research has concentrated on accurately placing the active period of these families along a timeline. This was accomplished by observing the interruption of existing transposons by members of another family. An interruption looks like this, where the older transposon is in upper case and the newer one in lower case: AAAAAAAAAAbbbbbbbbbbAAAAAAAAAA (a schematic, not real sequence).
Interruptions of this type have happened thousands of times. We reconstructed the older, interrupted transposons and determined the ordering of families by collecting statistics on which family member was intact and which was split in each interruption.
- Copy Number Variation. We are collaborating with researchers from the BU medical school to identify copy number variation in participants in the Framingham Heart Study. Copy number variation refers to small parts of a chromosome, perhaps 10k to 100k bases long, that are deleted or present in multiple copies. It was recently discovered that copy number variation is common across the human population. It most likely contributes to disease, but is also present in "normal" phenotypes. We are building a database for 100k SNP data (whole genome, 100k single nucleotide polymorphism data) to quickly identify copy number variants in some 170 Framingham participants. Ultimately, these variations will be correlated with disease phenotypes and family transmission.
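To make the notion of seed sensitivity in the seed design item concrete, here is a minimal sketch. It assumes a simple model in which every column of a fixed-length homologous region matches independently with probability `p_match`, and it writes a spaced seed as a string of `1` (position must match) and `0` (don't care). It enumerates all match/mismatch patterns by brute force, so it is not the Aho-Corasick based algorithm described above and is practical only for short regions. The function names are hypothetical.

```python
from itertools import product


def seed_hits(seed: str, alignment: str) -> bool:
    """Return True if the spaced seed matches somewhere in the alignment.

    seed: string of '1' (position must match) and '0' (don't care).
    alignment: string of '1' (match) and '0' (mismatch), one per column.
    """
    span = len(seed)
    for start in range(len(alignment) - span + 1):
        window = alignment[start:start + span]
        # Only the '1' positions of the seed constrain the window.
        if all(a == "1" for s, a in zip(seed, window) if s == "1"):
            return True
    return False


def sensitivity(seed: str, length: int, p_match: float) -> float:
    """Probability that the seed hits a random alignment of the given length.

    Assumes independent columns, each a match with probability p_match.
    Brute force over all 2**length patterns, so keep length small.
    """
    total = 0.0
    for columns in product("01", repeat=length):
        alignment = "".join(columns)
        if seed_hits(seed, alignment):
            matches = alignment.count("1")
            total += p_match ** matches * (1 - p_match) ** (length - matches)
    return total
```

Calling `sensitivity("111", 12, 0.7)` and `sensitivity("1101", 12, 0.7)` compares a contiguous seed with a spaced seed of the same weight under this model.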
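For the tandem repeats item, the following sketch shows what a microsatellite with a period of 1-6 nucleotides looks like to a program. It finds only exact back-to-back copies of a unit, whereas real tandem repeats are usually approximate, so it should be read as an illustration rather than as the method behind the Tandem Repeats Database. The function name and the `min_copies` threshold are assumptions.

```python
def find_microsatellites(seq, max_period=6, min_copies=3):
    """Return (start, period, copies, unit) for exact tandem repeats.

    Scans left to right. At each position it tries the smallest period
    first and reports the first one with at least min_copies exact copies.
    """
    seq = seq.upper()
    results = []
    n = len(seq)
    i = 0
    while i < n:
        found = False
        for period in range(1, max_period + 1):
            unit = seq[i:i + period]
            if len(unit) < period:
                break  # not enough sequence left for this period
            copies = 1
            # Count how many times the unit repeats back to back.
            while seq[i + copies * period:i + (copies + 1) * period] == unit:
                copies += 1
            if copies >= min_copies:
                results.append((i, period, copies, unit))
                i += copies * period  # continue after the repeat
                found = True
                break
        if not found:
            i += 1
    return results
```

A slippage mutation of the kind discussed above would show up as a change in the `copies` value for the same repeat in two individuals.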
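The ordering idea in the transposons item can be sketched as follows. Each observed interruption is recorded as a pair (split family, intact family); the split family was already present when the intact one inserted, so it is taken to be older. This sketch keeps the direction supported by the majority of observations for each pair of families and then sorts the families topologically. It is a simplification under those assumptions, not the statistical procedure used in the study, and the function name and family labels are hypothetical. It uses `graphlib`, available in Python 3.9 and later.

```python
from collections import Counter
from graphlib import TopologicalSorter


def order_families(interruptions):
    """Order transposon families from oldest to youngest.

    interruptions: iterable of (split_family, intact_family) pairs, one per
    observed interruption.
    """
    counts = Counter(interruptions)
    graph = {}  # family -> set of families that must come before it
    for (split, intact), n in counts.items():
        graph.setdefault(split, set())
        graph.setdefault(intact, set())
        # Keep only the direction seen more often than its reverse.
        if n > counts.get((intact, split), 0):
            graph[intact].add(split)  # split (older) precedes intact
    # Raises graphlib.CycleError if the majority votes are inconsistent.
    return list(TopologicalSorter(graph).static_order())


# Placeholder labels, not real family names.
observations = (
    [("famA", "famB")] * 3
    + [("famB", "famA")]
    + [("famB", "famC")] * 2
)
print(order_families(observations))
```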
See our list of papers for more details.