IRF Definitions
The FASTA format is a plain text format which looks something like this:
>myseq
AGTCGTCGCT AGCTAGCTAG CATCGAGTCT TTTCGATCGA GGACTAGACT TCTAGCTAGC TAGCATAGCA
TACGAGCATA TCGGTCATGA GACTGATTGG GCTTTAGCTA GCTAGCATAG CATACGAGCA TATCGGTAGA
CTGATTGGGT TTAGGTTACC
The first line starts with a greater-than sign ">" and contains a name or other identifier for the sequence. This is the sequence header and must be on a single line. The remaining lines contain the sequence data. The sequence can be in uppercase or lowercase letters. Anything other than letters (numbers, for example) is ignored. Multiple sequences can be present in the same file as long as each sequence has its own header.
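The rules above can be sketched as a small Python reader. This is only an illustration of the format as described here, not the program's own parser; the function name `parse_fasta` is hypothetical, and preserving the original letter case is an assumption.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Keep letters only; digits, spaces and punctuation are dropped.
            chunks.append("".join(ch for ch in line if ch.isascii() and ch.isalpha()))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

For example, a file containing two headers yields two records, and the blanks that separate groups of ten nucleotides in the sample above are discarded.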
The summary table includes the following information:
- Left repeat copy (left stem) indices relative to the start of the sequence.
- Length of the left stem.
- Right repeat copy (right stem) indices relative to the start of the sequence.
- Length of the right stem.
- Loop length (distance between stems).
- Percent of matches (pairings) in the alignment between left and right stems.
- Percent of indels in the alignment between left and right stems.
- Alignment score.
- Percent composition for A or T nucleotides in the stems.
- Percent composition for C or G nucleotides in the stems.
- Percent of complementary matchings (pairs) in the alignment that are A-T.
- Percent of complementary matchings (pairs) in the alignment that are G-C.
- Percent of matchings (pairs) in the alignment that are G-T.
- Index for center of loop times two.
- Average center for pairings times two.
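Several of these columns follow from the stem indices by simple arithmetic. The sketch below assumes 1-based, inclusive indices and assumes that the "times two" values are doubled so that a center falling between two positions stays a whole number; the function names and the index convention are illustrative assumptions, not taken from the program.

```python
def loop_stats(left_start, left_end, right_start, right_end):
    """Derive lengths and the doubled loop center from stem indices.

    Assumes 1-based, inclusive indices (an assumption of this sketch).
    """
    return {
        "left_length": left_end - left_start + 1,
        "right_length": right_end - right_start + 1,
        # Positions strictly between the two stems.
        "loop_length": right_start - left_end - 1,
        # Midpoint of the loop is (left_end + right_start) / 2; doubled it is a sum.
        "loop_center_x2": left_end + right_start,
    }


def average_pair_center_x2(pairs):
    """Average of (i + j) over paired positions (i in left stem, j in right stem)."""
    if not pairs:
        raise ValueError("at least one pair is required")
    return sum(i + j for i, j in pairs) / len(pairs)
```

With stems at 1-10 and 15-24, the loop covers positions 11-14, its center is 12.5, and the doubled value is 25.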
If the output contains more than 120 repeats, multiple linked
tables are produced. The links to the other tables appear at the top and bottom
of each table.
Note: If you save multiple linked summary table files,
use the default names supplied by your browser to
preserve the automatic linking.
The alignment is presented as follows:
- The first line shows the 10 nucleotides in each flanking sequence. The left flanking sequence (LF) immediately precedes the left stem and the right flanking sequence (RF) immediately follows the right stem.
- In each subsequent pair of lines, the left stem is on the top and the right stem is on the bottom.
- All the nucleotides are shown in the left stem.
- Symbol * in the right stem indicates a matching pair.
- A nucleotide letter in the right stem indicates a mismatch pair.
- Symbol - in the right or left stem indicates an insertion or deletion.
- Statistics:
  - Percentages for matches (matching pairs), mismatches (mismatching pairs) and indels overall between the left and right stems.
  - Percentages for pairs (AT, CG, GT) among the matches.
Note: If you save the alignment file, use the default name supplied by your browser to preserve the automatic cross-referencing with the summary table.
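The display conventions for the stem lines can be illustrated with a short Python sketch. It assumes the two stems are supplied as already-aligned strings of equal length, with the right stem ordered so that column k pairs with column k of the left stem, and it assumes percentages are taken over alignment columns. The name `render_alignment` is hypothetical and the program's actual output layout may differ.

```python
COMPLEMENT_PAIRS = {("A", "T"), ("T", "A"), ("C", "G"), ("G", "C")}


def render_alignment(left, right, wobble=False):
    """Return the right-stem display line and overall percentages.

    '*' marks a matching (complementary) pair, a letter marks a mismatch,
    and '-' marks an indel, following the conventions described above.
    """
    if len(left) != len(right) or not left:
        raise ValueError("stems must be non-empty and of equal aligned length")
    shown = []
    counts = {"match": 0, "mismatch": 0, "indel": 0}
    for a, b in zip(left.upper(), right.upper()):
        if a == "-" or b == "-":
            counts["indel"] += 1
            shown.append(b)
        elif (a, b) in COMPLEMENT_PAIRS or (wobble and {a, b} == {"G", "T"}):
            counts["match"] += 1
            shown.append("*")
        else:
            counts["mismatch"] += 1
            shown.append(b)
    total = len(left)
    percents = {kind: 100.0 * n / total for kind, n in counts.items()}
    return "".join(shown), percents
```

The left stem would be printed unchanged on the top line, with the returned string printed beneath it.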
Input to the program consists of a sequence file and the following parameters:
- Alignment Parameters. Weights for match, mismatch, and indels. These parameters are for Smith-Waterman style local alignment. Lower weights allow alignments with more mismatches and indels. The match weight is +2 in all options on this website. Mismatch and indel weights (interpreted as negative numbers) are 3, 5, or 7. A weight of 3 is more permissive of mismatches and indels, and a weight of 7 is less permissive.
- Minimum Alignment Score. The alignment score must meet or exceed this value for the repeat to be reported.
- Maximum Stem Length. The stem lengths (repeat copy lengths) must be no larger than this value for the repeat to be reported. Limited to 10,000 on this website.
- Maximum Loop Length. The loop (sequence between the stems) must be no longer than this value for the repeat to be reported. Limited to 14,800 on this server.
- Detection Parameters. Matching probability Pm and indel probability Pi. Pm = .80 and Pi = .10 by default and cannot be modified in this version of the program.
- Flanking Sequence. Flanking sequence consists of the 500 nucleotides on each side of a repeat. Flanking sequence is recorded in the alignment file. This may be useful for PCR primer determination.
- Masked Sequence File. The masked sequence file is a FASTA format file containing a copy of the sequence with every character that occurred in an inverted repeat changed to the letter N. The word "masked" is added to the sequence description line just after the > character.
- Data File. The data file is a text file which contains the same information, in the same order, as the repeat table file, plus consensus and repeat sequences. This file contains no labeling except for header lines for each submitted sequence and is suitable for additional processing outside of the program, for example with a Python script.
- Wobble Pairings. Allows a GT match (wobble base pairing in RNA). Matching weight is set to +1.
- Mirror Repeats. Detects mirror repeats rather than inverted repeats. In mirror repeats, matches are between identical nucleotides rather than complementary nucleotides. Useful for establishing a background frequency of repeats in a sequence to compare to the detection of inverted repeats in a separate run.
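To make the weights and options concrete, the Python sketch below scores an alignment that has already been found and masks given ranges of a sequence. It does not perform the Smith-Waterman search itself. The function names, the default penalty values, the choice to ignore wobble pairing in mirror mode, and the 1-based inclusive ranges passed to `mask_repeats` are all assumptions made for illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def pair_score(a, b, mismatch=3, indel=5, wobble=False, mirror=False):
    """Score one aligned column; '-' marks a gap.

    Match weight is +2; mismatch and indel weights are applied as negatives.
    """
    if a == "-" or b == "-":
        return -indel
    if mirror:
        # Mirror repeats pair identical nucleotides instead of complements.
        return 2 if a == b else -mismatch
    if COMPLEMENT.get(a) == b:
        return 2
    if wobble and {a, b} == {"G", "T"}:
        # Wobble GT pairing counts as a match with weight +1.
        return 1
    return -mismatch


def alignment_score(left, right, **options):
    """Sum the column scores of two aligned stems of equal length."""
    return sum(pair_score(a, b, **options)
               for a, b in zip(left.upper(), right.upper()))


def mask_repeats(sequence, ranges):
    """Replace every position in the given (start, end) ranges with 'N'.

    Ranges are assumed to be 1-based and inclusive.
    """
    chars = list(sequence)
    for start, end in ranges:
        for i in range(start - 1, end):
            chars[i] = "N"
    return "".join(chars)
```

A repeat would then be reported only if `alignment_score(...)` meets or exceeds the minimum alignment score and the stem and loop lengths are within their limits.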
Last revised July 5, 2023
Send any questions or comments to:
Gary Benson