I have been working as a bioinformatician for ~8 years, but for some reason I never had to set up my own local nucleotide Blast database. Until now.
And I found it to be a bit complicated if you don’t know what to do. So, here it is. A quick little guideline to set up and use Blast in your local cluster.
Shout-out to this blog for setting me on the right track.
Step 0: Load your module
module load blast/2.12.0+
Step 1: Download the database
We need to generate a folder where we will download both the Blast database and the taxonomy database.
mkdir -p my_blastdb
Let’s get into our new folder:
cd /PATH/TO/my_blastdb
Here, we are downloading the nucleotide database and asking the software to automatically decompress the files.
update_blastdb.pl --decompress nt
A bunch of files starting with nt.XX.YY
should be in your database folder.
Step 2: Download taxonomy
To add taxonomy information, we need to manually download the taxonomy database and decompress.
cd /PATH/TO/my_blastdb
update_blastdb.pl taxdb
tar -zxvf taxdb.tar.gz
Two additional files should now be in your database folder: taxdb.bti
and taxdb.btd
.
Step 3: Let’s Blast
We are ready to actually Blast. Let’s go to our working directory.
To use our local database, we need to add the path + the prefix of our files. In my example:
/PATH/TO/my_blastdb/nt
To use our taxonomy database, we need to export the path as follows:
export BLASTDB="/PATH/TO/my_blastdb"
Also, to obtain the species of our subject sequences, we need to manually request the staxids
(Subject Tax IDs). Blast formats 6, 7, and 10 can be customize to add staxids
.
Addressing these two points, we have the following command:
blastn -num_threads 34 -db /PATH/TO/my_blastdb/nt -query input.fa -out output.out -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore staxids sscinames scomnames"
(Here, I am requesting default parameters + subject’s taxonomy id, scientific name, and common name.)
Default column names for format 6
1 | qseqid | query or source (gene) sequence id |
2 | sseqid | subject or target (reference genome) sequence id |
3 | pident | percentage of identical positions |
4 | length | alignment length (sequence overlap) |
5 | mismatch | number of mismatches |
6 | gapopen | number of gap openings |
7 | qstart | start of alignment in query |
8 | qend | end of alignment in query |
9 | sstart | start of alignment in subject |
10 | send | end of alignment in subject |
11 | evalue | expect value |
12 | bitscore | bit score |
Additional columns for custom format
qseqid | Query Seq-id |
qgi | Query GI |
qacc | Query accesion |
qaccver | Query accesion.version |
qlen | Query sequence length |
sseqid | Subject Seq-id |
sallseqid | All subject Seq-id(s), separated by a ';' |
sgi | Subject GI |
sallgi | All subject GIs |
sacc | Subject accession |
saccver | Subject accession.version |
sallacc | All subject accessions |
slen | Subject sequence length |
qstart | Start of alignment in query |
qend | End of alignment in query |
sstart | Start of alignment in subject |
send | End of alignment in subject |
qseq | Aligned part of query sequence |
sseq | Aligned part of subject sequence |
evalue | Expect value |
bitscore | Bit score |
score | Raw score |
length | Alignment length |
pident | Percentage of identical matches |
nident | Number of identical matches |
mismatch | Number of mismatches |
positive | Number of positive-scoring matches |
gapopen | Number of gap openings |
gaps | Total number of gaps |
ppos | Percentage of positive-scoring matches |
frames | Query and subject frames separated by a '/' |
qframe | Query frame |
sframe | Subject frame |
btop | Blast traceback operations (BTOP) |
staxids | Subject Taxonomy ID(s), separated by a ';' |
sscinames | Subject Scientific Name(s), separated by a ';' |
scomnames | Subject Common Name(s), separated by a ';' |
sblastnames | Subject Blast Name(s), separated by a ';' (in alphabetical order) |
sskingdoms | Subject Super Kingdom(s), separated by a ';' (in alphabetical order) |
stitle | Subject Title |
salltitles | All Subject Title(s), separated by a '<>' |
sstrand | Subject Strand |
qcovs | Query Coverage Per Subject |
qcovhsp | Query Coverage Per HSP |
Original source.
And that’s it. Enjoy Blasting!