Comprehensive Sequence Analyses of 5' Flanking Regions of Primate Alu Elements

Yoshimi Toda[1][2]
Rintaro Saito[1][2]

[1] Laboratory for Bioinformatics
[2] Graduate School of Media and Governance
[3] Faculty of Environmental Information, Keio University
5322 Endo, Fujisawa 252-0816, Japan


Retrotransposons have generally been thought to integrate randomly into host genomes. Jurka (1997) [3], however, reported consensus sequence patterns at the integration sites of certain mammalian retrotransposons and suggested that sequence-specific enzymes mediate integration.

We conducted comprehensive sequence analyses of the 5' flanking regions of primate Alu elements. In contrast to the small but clean data set used by Jurka (1997) [3], we (1) used a larger number of samples, (2) analyzed a wider region upstream of the 5' end of Alu elements, and (3) compared different subfamilies, in order to identify characteristic sequence pattern(s) preceding the 5' end of Alu elements. Nucleotide occurrences at each position within 500 bases of the 5' end of Alus were counted to obtain profiles, and the information content at each nucleotide position in the same region was then computed. A distinct difference in nucleotide composition and information content values divides the region into two parts. The region between position -20 and the 5' end of Alu elements is highly adenine-rich and shows significantly higher information content values than the rest of the region, implying the existence of a characteristic sequence pattern there. In addition, younger Alu subfamilies show higher information content values than older subfamilies. This suggests that a characteristic sequence pattern already existed in the region between -20 and the 5' end of Alu elements at the time of Alu integration, and that mutations accumulated over time have made the pattern less distinct in older sequences. The frequencies of all 64 possible triplets were measured in the same region in order to identify characteristic sequence pattern(s). The high frequencies of the triplets "aaa," "taa" and "tta" in the 5' flanking sequences are consistent with Jurka (1997) [3]. Some other triplets that are not among the primary candidates for the nick site in Jurka (1997), including "gaa," "caa," "aac," "ctt," "gtt" and "atg," also show significantly high frequencies.
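
The three measurements described above (per-position nucleotide profiles, per-position information content, and triplet frequencies) can be sketched roughly as follows. This is only an illustration, not the authors' implementation: the abstract does not state which information content formula was used, so the sketch assumes the common Shannon-based definition (2 bits minus the positional entropy, with a uniform background and no small-sample correction). The function names and the toy sequences are hypothetical and are not data from the study.

```python
import math
from collections import Counter

ALPHABET = "acgt"


def position_profiles(flanks):
    """Count nucleotide occurrences at each position of equal-length,
    aligned flanking sequences (last column = base next to the Alu 5' end)."""
    length = len(flanks[0])
    profiles = [Counter() for _ in range(length)]
    for seq in flanks:
        for i, base in enumerate(seq.lower()):
            if base in ALPHABET:  # skip ambiguous symbols such as 'n'
                profiles[i][base] += 1
    return profiles


def information_content(counts):
    """Assumed definition: 2 - H bits, where H is the Shannon entropy
    of the observed nucleotide frequencies at one position."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for base in ALPHABET:
        p = counts[base] / total
        if p > 0:
            entropy -= p * math.log2(p)
    return 2.0 - entropy


def triplet_frequencies(flanks):
    """Relative frequency of each overlapping triplet over all sequences."""
    counts = Counter()
    for seq in flanks:
        s = seq.lower()
        for i in range(len(s) - 2):
            triplet = s[i:i + 3]
            if all(base in ALPHABET for base in triplet):
                counts[triplet] += 1
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()} if total else {}


# Hypothetical toy input: four made-up 7-base flanks with an a-rich 3' part.
toy_flanks = ["gctaaaa", "tgcaaaa", "cagaaaa", "atcaaaa"]

ic_values = [information_content(c) for c in position_profiles(toy_flanks)]
frequencies = triplet_frequencies(toy_flanks)

for offset, ic in zip(range(-len(ic_values), 0), ic_values):
    print(f"position {offset}: {ic:.2f} bits")
print("most frequent triplets:",
      sorted(frequencies, key=frequencies.get, reverse=True)[:3])
```

With real data, each input sequence would be the 500 bases upstream of one Alu element, grouped by subfamily so that the per-position values can be compared between younger and older subfamilies.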


Japanese Society for Bioinformatics