Using IRF on the Command Line
Example usage: irf3.exe Human21.fa 2 3 5 80 10 40 100000 500000 -t7 500000 -d -l
Things to note:
- This will use alignment scoring parameters: +2,-3,-5 (match, mismatch, indel) and a minimum score of 40
- The probability of a match is 80% and the probability of an indel is 10% (these values cannot be changed)
- A maximum stem length of 100K is allowed
- A maximum loop length of 500K is allowed
- A tuple of length 7 (least sensitive) will look back at most 500K
- A data file will be produced
- Lowercase letters will be ignored during the detection phase
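When runs are scripted, it can help to assemble the argument list from named values so that the positional order is not lost. The following is a minimal Python sketch, not part of IRF itself: the function name build_irf_command and its keyword names are hypothetical, and the defaults simply mirror the example above.

```python
def build_irf_command(fasta, match=2, mismatch=3, delta=5, pm=80, pi=10,
                      minscore=40, maxlength=100000, maxloop=500000,
                      t7=None, data_file=False, ignore_lowercase=False,
                      exe="irf3.exe"):
    """Assemble an argument list in the positional order shown in the example.

    Hypothetical helper: only the options used in the example are covered.
    """
    positional = (match, mismatch, delta, pm, pi, minscore, maxlength, maxloop)
    args = [exe, fasta] + [str(v) for v in positional]
    if t7 is not None:
        # In the example, the -t7 switch and its separation value are written
        # as "-t7 500000", i.e. two separate arguments.
        args += ["-t7", str(t7)]
    if data_file:
        args.append("-d")   # produce a data file
    if ignore_lowercase:
        args.append("-l")   # lowercase letters ignored during detection
    return args


cmd = build_irf_command("Human21.fa", t7=500000,
                        data_file=True, ignore_lowercase=True)
print(" ".join(cmd))
# irf3.exe Human21.fa 2 3 5 80 10 40 100000 500000 -t7 500000 -d -l
```

The resulting list can be handed to subprocess.run unchanged; building a list rather than a single string avoids shell quoting problems with file names.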
Once the program is installed, you can run it with no parameters to obtain information on proper usage syntax. For example, if the program was installed as irf.exe, then by typing irf.exe on the command line, you will see the following:
Please use: irf.exe File Match Mismatch Delta PM PI Minscore Maxlength MaxLoop [options]
Where: (all weights, penalties, and scores are positive)
File = sequences input file
Match = matching weight
Mismatch = mismatching penalty
Delta = indel penalty
PM = match probability (whole number)
PI = indel probability (whole number)
Minscore = minimum alignment score to report
MaxLength = maximum stem length to report (10,000 minimum and no upper limit, but system will run out memory if this is too large)
MaxLoop = filters results to have loop less than this value (will not give you more results unless you increase -t4,-t4,-t7 as well)
[options] = one or more of the following :
-m masked sequence file
-f flanking sequence
-d data file
-h suppress HTML output (this automatically switches -d to ON)
-l lowercase letters do not participate in a k-tuple match, but can be part of an alignment
-gt allow the GT match (gt matching weight must follow immediately after the switch)
-mr target is mirror repeats
-r set the identity value of the redundancy algorithm (value 60 to 100 must follow immediately after the switch)
-la lookahead test enabled. Results are slightly different as a repeat might be found at a different interval. Faster.
-a3 perform a third alignment going inward. Produces longer or better alignments. Slower.
-a4 same as a3 but alignment is of maximum narrowband width. Slightly better results than a3. Much slower.
-i1 Do not stop once a repeat is found at a certain interval and try larger intervals at nearby centers. Better(?) results. Slower.
-i2 Do not stop once a repeat is found at a certain interval and try all intervals at same and nearby centers. Better(?) results. Much slower.
-r0 do not eliminate redundancy from the output
-r2 modified redundancy algorithm, does not remove stuff which is redundant to redundant. Slower and not good for TA repeat regions, would not leave the largest, but a whole bunch.
-t4 set the maximum loop separation for tuple of length4 (default 154, separation <=1,000 must follow)
-t5 set the maximum loop separation for tuple of length5 (default 813, separation <=10,000 must follow)
-t7 set the maximum loop separation for tuple of length7 (default 14800, limited by your system's memory, make sure you increase maxloop to the same value)
-ngs more compact .dat output on multisequence files, returns 0 on success.
Note the sequence file should be in FASTA format:
>Name of sequence
aggaaacctg ccatggcctc ctggtgagct gtcctcatcc actgctcgct gcctctccag
atactctgac ccatggatcc cctgggtgca gccaagccac aatggccatg gcgccgctgt
actcccaccc gccccaccct cctgatcctg ctatggacat ggcctttcca catccctgtg
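To check an input file before a run, a small reader along the following lines may be useful. This is a minimal Python sketch, not part of IRF: the function name read_fasta is hypothetical, and it assumes that blanks inside sequence lines (as in the sample above) are only formatting and can be dropped.

```python
def read_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            # Remove the blanks that separate blocks of ten letters.
            chunks.append("".join(line.split()))
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


example = """>Name of sequence
aggaaacctg ccatggcctc ctggtgagct gtcctcatcc actgctcgct gcctctccag
atactctgac ccatggatcc cctgggtgca gccaagccac aatggccatg gcgccgctgt
"""

for seq_name, seq in read_fasta(example):
    print(seq_name, len(seq))
# Name of sequence 120
```

Letter case is preserved on purpose, since lowercase letters are treated differently when the -l option is used.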
- File: The sequence file to be analyzed in FASTA format. Multiple sequences in the same file are allowed.
- Match, Mismatch, and Delta: These are the alignment weights for matches (complementary pairs: AT, CG), mismatches (all other pairs), and indels in Smith-Waterman style local alignment. Mismatch and indel weights are entered as positive numbers and interpreted as negative numbers (penalties). Lower penalties allow alignments with more mismatches and indels: a 3 is more permissive and a 7 less permissive. A match weight of 2 has proven effective with mismatch and indel penalties in the range of 3 to 7. The recommended values for Match, Mismatch, and Delta are 2, 7, and 7 respectively.
- PM and PI: Probabilistic data is available for PM values of 80 and 75 and PI values
of 10 and 20. The best performance can be achieved with values of PM=80 and PI=10. Values of
PM=75 and PI=20 give results which are very similar, but often require as much as ten times the
processing time when compared with values of PM=80 and PI=10.
- Minscore: The alignment must meet or exceed this alignment score to be reported. For example, if we set the matching weight to 2 and the minimum score to 50, then, assuming perfect alignment, we will need to align at least 25 characters to meet the minimum score.
- MaxLength: Only alignments with stem sizes (repeat copies) less than or equal to this value will be reported. Minimum is 10,000. Larger sizes are permitted, but affect total memory usage.
- Maxloop: Only alignments with loop sizes less than or equal to this value will be reported.
- -m: This is an optional parameter and when present instructs the program to generate
a masked sequence file. The masked sequence file is a FASTA format file containing a copy of the
sequence with every location that occurred in an inverted repeat changed to the letter 'N'. The
word "masked" is added to the sequence description line just after the '>' character.
- -f: If this option is present, flanking sequence around each repeat is recorded in
the alignment file. This may be useful for PCR primer determination. Flanking sequence consists
of the 500 nucleotides on the outer ends of a repeat.
- -d: A data file (.dat) is produced if this option is present. This file is a text file
which contains the same information, in the same order, as the summary table file. This file contains no labeling except header lines for each sequence processed, and is suitable for additional processing outside of the program, for example with a Python script.
- -h: suppress HTML output (this automatically switches -d to ON).
- -l: lowercase letters do not participate in a k-tuple match (during detection of a repeat), but can be part of an alignment.
- -gt: allows a GT match (wobble base pairing in RNA) and must be followed by an integer for the match weight, similar in magnitude to the match parameter.
- -mr: detects mirror repeats rather than inverted repeats. In mirror repeats, matches are between identical nucleotides rather than complementary nucleotides.
- -r: sets the identity value of the redundancy algorithm and must be followed by an integer between 60 and 100.
- -la: lookahead test enabled. Results are slightly different as a repeat might be found at a different interval. Faster.
- -a3: performs a third forward alignment. Produces longer or better alignments. Slower.
- -a4: same as a3 but the alignment is of maximum narrowband width. Slightly better results than a3. Much slower.
- -i1: do not stop once a repeat is found at a certain interval and try larger intervals at nearby centers. Better(?) results. Slower.
- -i2: do not stop once a repeat is found at a certain interval and try all intervals at same and nearby centers. Better(?) results. Much slower.
- -r0: do not eliminate redundancy from the output.
- -r2: modified redundancy algorithm, do not remove repeats which are redundant to other redundant repeats. Slower and performs poorly for TA repeat regions. May not leave the longest repeat, but rather leaves many smaller repeats.
- -t4: sets the maximum loop separation for tuples of length 4 (default is 154), and must be followed by an integer separation <=1,000.
- -t5: sets the maximum loop separation for tuples of length 5 (default is 813), and must be followed by an integer separation <=10,000.
- -t7: sets the maximum loop separation for tuples of length 7 (default is 14800), and must be followed by an integer separation, limited only by your system's memory. Make sure to increase MaxLoop to the same value.
- -ngs: more compact .dat output on multisequence files, returns 0 on success. You may pipe input in with this option using - for file name. Short 50 nucleotide flanks are appended to .dat output. .dat output actually goes to stdout instead of a file. Sequence headers are displayed in output as @header. Only headers containing repeats are shown.
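The interaction between Match, Mismatch, Delta, and Minscore described above can be checked with simple arithmetic. The following is a minimal Python sketch under the assumption that an alignment's score is the plain sum of its match weights minus its mismatch and indel penalties; the function names are hypothetical and this is not IRF's own code.

```python
import math


def alignment_score(matches, mismatches, indels, match=2, mismatch=7, delta=7):
    """Additive score: matches add weight, mismatches and indels subtract it.

    The penalties are passed as positive numbers, as on the command line.
    """
    return matches * match - mismatches * mismatch - indels * delta


def min_perfect_length(minscore, match=2):
    """Smallest number of matched characters that reaches minscore
    when the alignment contains no mismatches or indels."""
    return math.ceil(minscore / match)


print(min_perfect_length(50, 2))   # 25, as in the Minscore example
print(alignment_score(25, 0, 0))   # 50, exactly the minimum score
print(alignment_score(30, 1, 0))   # 53, one mismatch costs 7 at the recommended weights
```

With the recommended weights, each mismatch or indel has to be offset by four additional matches (4 x 2 = 8, which exceeds 7) before the score recovers.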
Using the recommended parameters, the command line will look something like:
irf yoursequence.txt 2 7 7 80 10 50 500 -f -d -m
Once the program starts running it will print update messages to the screen. The word
"Done" will be printed when the program finishes.
For single sequence input files there will be at least two HTML format output files, a
repeat table file and an alignment file.
If the number of repeats found is greater than 120, multiple linked repeat tables are
produced. The links to the other tables appear at the top and the bottom of each table.
To view the results, start by opening the first repeat table file with your web browser. This file has the
extension ".1.html". Alignment files can be accessed from the repeat table
files. Alignment files end with the ".txt.html" extension.
For input files containing multiple
sequences a summary page is produced that links to the output of individual
sequences. This file has the extension "summary.html". You should
start by opening this file if your input had multiple sequences in the same
file. Also note that the output files of individual sequences will have an
identifier of the form ".sn." (where n is an integer) embedded in the name, indicating
the index of the sequence in the input file. The identifier is omitted for
single sequence input files.
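The choice of which file to open first can be automated from the file name endings alone. The following is a minimal Python sketch: the function name entry_page and the example file names are hypothetical, and only the endings "summary.html" and ".1.html" are taken from the description above.

```python
def entry_page(filenames):
    """Pick the file to open first from a list of IRF output file names.

    A file ending in "summary.html" (multiple-sequence input) takes priority;
    otherwise the first repeat table, ending in ".1.html", is returned.
    Returns None if neither kind of file is present.
    """
    summaries = sorted(f for f in filenames if f.endswith("summary.html"))
    if summaries:
        return summaries[0]
    first_tables = sorted(f for f in filenames if f.endswith(".1.html"))
    return first_tables[0] if first_tables else None


# Hypothetical file names, used only to exercise the ending rules.
single = ["run.1.html", "run.2.html", "run.1.txt.html"]
multi = ["run.summary.html", "run.s1.1.html", "run.s2.1.html"]

print(entry_page(single))  # run.1.html
print(entry_page(multi))   # run.summary.html
```

In practice the list could come from os.listdir on the output directory, and the chosen file could be passed to webbrowser.open.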
For more information on the output, please see Table Explanation and Alignment Explanation.
Last revised July 5, 2023
Send any questions or comments to:
Gary Benson