parallel_blast

Parallel blast is a wrapper script around the blast commands as well as diamond. It utilizes GNU Parallel to run the commands in parallel by splitting up the input fasta files and distributes them across multiple subprocesses. If it detects that it is running inside of a PBS or SGE job it will run the job on multiple hosts that may be allocated to the job.

parallel_blast requires that you have gnu parallel installed and in your environments PATH as well as diamond and/or blastn/blastx/blastp.

Usage

You can get all the arguments that can be supplied via the following

$> parallel_blast --help

Examples

For the examles below assume you have an input fasta in the current directory called input.fasta

Running blastn

$> parallel_blast input.fasta output.blast --ninst 4 --db /path/to/nt \
--blast_exe blastn --task megablast --blast_options "--evalue 0.01"
[cmd] /path/to/parallel -u --pipe --block 10 --recstart > --sshlogin 4/: /path/to/blastn -task megablast -db /path/to/nt -max_target_seqs 10 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore" -query -

Notice how we had to quote the additional --blast_options

Running diamond

Diamond v0.7.9 is the version that was tested with parallel_blast. As diamond is still in development the options may change in future versions and parallel_blast may not run them correctly. Please submit a new issue if you find any issues.

$> parallel_blast input.fasta out.blast --ninst 4 --db /path/to/diamondnr \
--blast_exe diamond --task blastx --blast_options "--tmpdir dtmp"
[cmd] /path/to/parallel -u --pipe --block 10 --recstart > --cat --sshlogin 1/: /path/to/diamond blastx --threads 4 --db /path/to/diamondnr --query {} --compress 0 -a out.blast

Notice how even though we specified --ninst 4 that --sshlogin 1/: was used and --threads 4 was set instead.

Note In recent versions of diamond, diamond outputs a daa binary file instead of a tab separated file. parallel_blast automatically converts the diamond output from daa to tab format for you but leaves the daa file behind(Same name as the output file you specify, but with the extension .daa)

Command that is run

You will notice in the examples above that when you run parallel_blast that it outputs the command that it is running in case you want to copy/paste it and run it yourself sometime.

You might notice that the command does not include all the quoted arguments such as the --recstart argument which should be --recstart ">" as well as the --outfmt which should be quoted as --outfmt "6 ...". If you intend on rerunning the command you will have to add the quotes manually.

Running inside of a PBS or SGE Job

parallel_blast is able to detect if it is running inside of a PBS or SGE job by looking to see if PBS_NODEFILE or PE_HOSTFILE is set in the environment’s variables.

If it finds either of them it will run the job by supplying --sshlogin for each host it finds in the file.

PBS_NODEFILE and PE_HOSTFILE have different syntax so parallel_blast first builds a CPU,NODENAME list from them.

PBS_NODEFILE

This file is parsed and counts how many of each unique host is listed such that the following PBS_NODEFILE:

node1.localhost
node2.localhost
node2.localhost
node3.localhost
node3.localhost
node3.localhost

would run 1 instance on node1.localhost, 2 instances on node2.localhost and 3 instances on node3.localhost

PE_HOSTFILE

This file is almost in the exact syntax that parallel_blast uses so it is almost a 1-to-1 mapping.

Diamond and multiple hosts

Since diamond utilizes threads much more efficiently than blast, for each unique host in a job only 1 instance is launched but the -p option is set to the number of CPUS for each host listed in the PE_HOSTFILE or PBS_NODEFILE