fasta

fasta is a very simple script to help manipulate fasta files.

  • Supports converting multiline sequences into single line
  • Supports splitting a fasta file into separate files, each named after its sequence identifier
  • Supports disambiguating ambiguous sequences

Usage

fasta --help

Examples

The following examples all use the test fasta file found under tests/testinput/col.fasta

>sequence1 some description !@#$%^&*()_+-=[]{}.,></?';:"
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
>sequence2!@#$%^&*()_+-=[]{}.,></?';:"
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

Convert column fasta into single lines

The following is a simple shell pipeline that uses fasta to ensure each sequence is on a single line:

$> cat tests/testinput/col.fasta | fasta -

Alternatively, you can read directly from a fasta file:

$> fasta tests/testinput/col.fasta
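
For readers who want to see what joining a multi-line record involves, here is a minimal Python sketch. It is not the script's actual implementation; the function name unwrap_fasta and the sample records are hypothetical and exist only for illustration.

```python
def unwrap_fasta(lines):
    """Yield (header, sequence) pairs with each sequence joined onto one line."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif line:
            # Sequence line: collect it until the next header or end of input.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical input, much shorter than col.fasta.
example = [">seq1 demo\n", "AAAA\n", "TTTT\n", ">seq2\n", "GG\n", "CC\n"]
for header, sequence in unwrap_fasta(example):
    print(header)
    print(sequence)
# prints:
# >seq1 demo
# AAAATTTT
# >seq2
# GGCC
```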

Convert single line fasta into column fasta

The following converts single line fasta sequences into column formatted fasta. By default, each line is wrapped at 80 characters.

$> fasta --wrap tests/testinput/col.fasta

You can verify that it is wrapping correctly by simply piping the fasta command back into itself and then comparing to the original input file.

Here we do that and then use diff to show there is no difference between the original file (col.fasta) and the new one (newfile.fasta):

$> cat tests/testinput/col.fasta | fasta - | fasta --wrap - > newfile.fasta
$> diff tests/testinput/col.fasta newfile.fasta

There will be no output, as there is no difference between newfile.fasta and tests/testinput/col.fasta.
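
The wrapping step and the round-trip check can be sketched in a few lines of Python. This is an illustrative sketch only, assuming wrapping simply slices the sequence every 80 characters; wrap_sequence is a hypothetical name, not a function the script is known to expose.

```python
def wrap_sequence(sequence, width=80):
    """Split a single-line sequence into chunks of at most `width` characters."""
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


# Same shape as one record of col.fasta: four runs of 80 identical bases.
single_line = "A" * 80 + "T" * 80 + "G" * 80 + "C" * 80
wrapped = wrap_sequence(single_line)
print(len(wrapped), [len(chunk) for chunk in wrapped])  # prints: 4 [80, 80, 80, 80]

# Round trip: joining the wrapped lines gives back the original sequence.
assert "".join(wrapped) == single_line
```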

Simple shell pipeline using fasta

The following is a simple shell pipeline that counts how many A’s there are in the sequence lines. There should be 160: col.fasta has 80 characters per line, only the first line of each sequence contains A, and there are 2 sequences.

$> fasta tests/testinput/col.fasta | grep -v '>' | grep -Eo '[Aa]' | wc -l
160
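
The expected count can be checked independently of the pipeline. The sketch below rebuilds the two col.fasta sequences in memory (the dictionary is a stand-in for the file, not something the script produces) and counts the A's directly.

```python
# Each sequence in col.fasta is 80 A's, 80 T's, 80 G's and 80 C's.
records = {
    "sequence1": "A" * 80 + "T" * 80 + "G" * 80 + "C" * 80,
    "sequence2": "A" * 80 + "T" * 80 + "G" * 80 + "C" * 80,
}

# Case-insensitive count, matching the [Aa] pattern used with grep.
total_a = sum(seq.upper().count("A") for seq in records.values())
print(total_a)  # prints: 160
```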

Split fasta file into separate files named after identifiers

The following example shows how you can split a fasta file into multiple fasta files, each named after an identifier in the original:

$> fasta tests/testinput/col.fasta --split
$> ls -1 *.fasta
sequence1.fasta
sequence2____________________________.fasta

Note: sequence2 has such a long name because every punctuation character in its identifier is replaced with an underscore. col.fasta is a test file that contains a lot of punctuation, hence all the underscores.

As above, you can also read the fasta input from standard input:

$> cat tests/testinput/col.fasta | fasta --split -
$> ls -1 *.fasta
sequence1.fasta
sequence2____________________________.fasta
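
The identifier-to-filename mapping used in this section can be sketched as follows. This is a guess at the behaviour based on the listings above, not the script's code: it assumes the identifier is the first whitespace-delimited token after ">", and that "punctuation" means the characters in Python's string.punctuation. The name filename_for is hypothetical.

```python
import string

# Map every ASCII punctuation character to an underscore.
PUNCT_TO_UNDERSCORE = str.maketrans({ch: "_" for ch in string.punctuation})


def filename_for(header):
    """Build an output filename from a fasta header line (assumed non-empty)."""
    identifier = header.lstrip(">").split()[0]
    return identifier.translate(PUNCT_TO_UNDERSCORE) + ".fasta"


header1 = ">sequence1 some description"
header2 = '>sequence2!@#$%^&*()_+-=[]{}.,></?\';:"'
print(filename_for(header1))  # prints: sequence1.fasta
print(filename_for(header2))  # "sequence2", then 28 underscores, then ".fasta"
```

The description after the first space is dropped under this assumption, which is why sequence1 gets a short filename even though its header also contains punctuation.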

Disambiguate ambiguous sequences

You can expand a sequence that contains ambiguous bases into every permutation of that sequence in which each ambiguous base is replaced by a non-ambiguous base.

At most 100 sequences are generated from any one input sequence, to avoid creating thousands of sequences or consuming all of your computer’s RAM.

If a sequence would generate more than 100 sequences, it is skipped and a message such as the following is generated:

Sequence too_many has 7 ambiguous bases that would produce 128 permutations and was skipped

$> fasta --disambiguate tests/testinput/ambiguous.fasta > disambiguous.fasta
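
To make the expansion and the 100-sequence limit concrete, here is a hedged Python sketch. It assumes the standard IUPAC nucleotide ambiguity codes and builds the permutations with itertools.product; the script may handle case, ordering, or the set of codes differently. The names disambiguate and MAX_SEQUENCES are hypothetical. Note that 7 two-way ambiguous bases give 2^7 = 128 permutations, which exceeds 100 and matches the numbers in the message above.

```python
from itertools import product
from math import prod

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
MAX_SEQUENCES = 100  # limit described in the text


def disambiguate(name, sequence, limit=MAX_SEQUENCES):
    """Return all non-ambiguous versions of `sequence`, or [] if over the limit."""
    # For each position, list the bases it may stand for (itself if unambiguous).
    choices = [IUPAC.get(base, base) for base in sequence.upper()]
    ambiguous = sum(1 for options in choices if len(options) > 1)
    total = prod(len(options) for options in choices)
    if total > limit:
        print(f"Sequence {name} has {ambiguous} ambiguous bases that would "
              f"produce {total} permutations and was skipped")
        return []
    return ["".join(combo) for combo in product(*choices)]


print(disambiguate("two_codes", "ARYT"))
# prints: ['AACT', 'AATT', 'AGCT', 'AGTT']
print(disambiguate("too_many", "RRRRRRR"))
# prints the "was skipped" message for 128 permutations, then: []
```

Counting the permutations before generating them is what keeps memory use bounded: the product of the per-position option counts is cheap to compute, so an oversized sequence can be rejected without building any of its expansions.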