Using Tandem Repeats Finder on the Command Line
Once the program is installed you can run it with no parameters to obtain information on proper usage syntax. If you installed the program as trf then by typing trf on the command line you will see the following output:
Please use: trf File Match Mismatch Delta PM PI Minscore MaxPeriod [options]
Where: (all weights, penalties, and scores are positive)
File = sequences input file
Match = matching weight
Mismatch = mismatching penalty
Delta = indel penalty
PM = match probability (whole number)
PI = indel probability (whole number)
Minscore = minimum alignment score to report
MaxPeriod = maximum period size to report
[options] = one or more of the following:
-m masked sequence file
-f flanking sequence
-d data file
-h suppress html output
-r no redundancy elimination
-l <n> maximum TR length expected (in millions) (eg, -l 3 or -l=3 for 3 million)
-ngs more compact .dat output on multisequence files, returns 0 on success.
Output is printed to the screen, not a file. You may pipe input in with
this option using - for file name. Short 50 flanks are appended to .dat
output.
Note the sequence file should be in FASTA format:
>Name of sequence
aggaaacctgccatggcctcctggtgagctgtcctcatccactgctcgctgcctctccag
atactctgacccatggatcccctgggtgcagccaagccacaatggccatggcgccgctgt
actcccacccgccccaccctcctgatcctgctatggacatggcctttccacatccctgtg
- File: The sequence file to be analyzed in FASTA format. Multiple sequences in the same file are allowed.
- Match, Mismatch, and Delta: These are alignment weights for match, mismatch and indels in Smith-Waterman style local alignment using wraparound dynamic programming. Lower weights allow alignments with more mismatches and indels. A match weight of 2 has proven effective with mismatch and indel penalties in the range of 3 to 7. Mismatch and indel weights are interpreted as negative numbers. A 3 is more permissive and a 7 less permissive. The recomended values for Match Mismatch and Delta are 2, 7, and 7 respectively.
- PM and PI: Probabilistic data is available for PM values of 80 and 75 and PI values
of 10 and 20. The best performance can be achieved with values of PM=80 and PI=10. Values of
PM=75 and PI=20 give results which are very similar, but often require as much as ten times the
processing time when compared with values of PM=80 and PI=10.
- Minscore: The alignment of a tandem repeat must meet or exceed this alignment score
to be reported. For example, if we set the matching weight to 2 and the minimun score to 50,
assuming perfect alignment, we will need to align at least 25 characters to meet the minimum score
(for example 5 copies with a period of size 5).
- Maxperiod: Period size is the program's best guess at the pattern size of the tandem
repeat. The program will find all repeats with period size between 1 and 2000, but the output
can be limited to a smaller range.
- -m: This is an optional parameter and when present instructs the program to generate
a masked sequence file. The masked sequence file is a FASTA format file containing a copy of the
sequence with every location that occurred in a tandem repeat changed to the letter 'N'. The
word "masked" is added to the sequence description line just after the '>' character.
- -f: If this option is present, flanking sequence around each repeat is recorded in
the alignment file. This may be useful for PCR primer determination. Flanking sequence consists
of the 500 nucleotides on each side of a repeat.
- -d: A data file (.dat) is produced if this option is present. This file is a text file
which contains the same information, in the same order, as the summary table file, plus
consensus pattern and repeat sequences. This file contains no labeling and is suitable for additional processing, for example with a python script, outside of the program.
- -h: suppress HTML output (this automatically switches -d to ON).
- -l <n>: Specifies that the longest TR array expected in the input is at most n million bp long. The default is 2 (for 2 million). Setting this option too high may result in an error message if you did not have enough available memory. We have only tested this option up to value 29. Omit brackets when using, i.e., -l 2 or -l=2.
- -u: Prints the help/usage message above.
- -v: Prints the version information.
- -ngs: More compact .dat output on multisequence files, returns 0 on success. You may pipe input in with this option using - for file name. Short 50 nucleotide flanks are appended to .dat output. .dat output actually goes to stdout instead of a file. Sequence headers are displayed in output as @header. Only headers containing repeats are shown.
Using recommended parameters the command line will look something like:
trf yoursequence.txt 2 7 7 80 10 50 500 -f -d -m
Once the program starts running it will print update messages to the screen. The word
"Done" will be printed when the program finishes.
For single sequence input files there will be at least two HTML format output files, a
repeat table file and an alignment file.
If the number of repeats found is greater than 120, multiple linked repeat tables are
produced. The links to the other tables appear at the top and the bottom of each table.
To view the results start by opening the first repeat table file with your web browser. This file has the
extension ".1.html". Alignment files can be accessed from the repeat table
files. Alignment
files end with the ".txt.html" extension.
For input files containing multiple
sequences a summary page is produced that links to the output of individual
sequences. This file has the extension "summary.html". You should
start by opening this file if your input had multiple sequences in the same
file. Also note that the output files of individual sequences will have an
identifier of the form ".sn." ( n an integer) embedded in the name indicating
the index of the sequence in the input file. The identifier is omitted for
single sequence input files.
For more information on the output please see
Table Explanation and
Alignment Explanation.
|
Last revised October 20, 2022
Send any questions or comments to:
Gary Benson
|