How to set up your local BLAST nucleotide database

Bioinformatics diaries No. 1

I have been working as a bioinformatician for ~8 years, but for some reason I never had to set up my own local nucleotide Blast database. Until now.

And I found it a bit complicated if you don’t know what to do. So, here it is: a quick little guideline to set up and use Blast on your local cluster.

Shout-out to this blog for setting me on the right track.

Step 0: Load your module

module load blast/2.12.0+

Step 1: Download the database

We need to create a folder where we will download both the Blast database and the taxonomy database.

mkdir -p /PATH/TO/my_blastdb

Let’s get into our new folder:

cd /PATH/TO/my_blastdb

Here, we are downloading the nucleotide database and asking the software to automatically decompress the files.

update_blastdb.pl --decompress nt

A bunch of files named nt.XX.YY (XX is the volume number, YY the file extension; for example, nt.00.nsq) should now be in your database folder.
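Before moving on, it can be worth checking that the download actually produced database volumes. Here is a minimal sketch that counts the sequence files in a folder. The helper name count_nt_volumes is made up for this post, and it assumes the volumes are named PREFIX.XX.nsq. The demo runs on empty placeholder files, not on a real database.

```shell
# Hypothetical helper: count the sequence volumes of a BLAST nucleotide database.
# Assumes volume files are named PREFIX.XX.nsq (for example nt.00.nsq, nt.01.nsq).
count_nt_volumes() {
    db_dir="$1"
    prefix="${2:-nt}"
    # find prints one line per matching file; wc -l counts the lines
    find "$db_dir" -maxdepth 1 -type f -name "${prefix}.*.nsq" | wc -l | tr -d ' '
}

# Demo on a throwaway folder with empty placeholder files
demo_dir="$(mktemp -d)"
touch "$demo_dir/nt.00.nsq" "$demo_dir/nt.01.nsq" "$demo_dir/nt.00.nhr"
count_nt_volumes "$demo_dir"   # prints 2
rm -rf "$demo_dir"
```

On the real folder, you would call count_nt_volumes /PATH/TO/my_blastdb. If the module is loaded, blastdbcmd -db /PATH/TO/my_blastdb/nt -info should also print a summary of the database.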

Step 2: Download taxonomy

To add taxonomy information, we need to manually download the taxonomy database and decompress it.

cd /PATH/TO/my_blastdb
update_blastdb.pl taxdb
tar -zxvf taxdb.tar.gz

Two additional files should now be in your database folder: taxdb.bti and taxdb.btd.
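If you want to script this check rather than eyeball the folder, something like the sketch below could work. The helper name have_taxdb is hypothetical; it only tests that both files exist and are not empty, not that their contents are valid.

```shell
# Hypothetical helper: succeed only if both taxonomy files exist and are non-empty.
have_taxdb() {
    [ -s "$1/taxdb.bti" ] && [ -s "$1/taxdb.btd" ]
}

# /PATH/TO/my_blastdb is the placeholder path used throughout this post
if have_taxdb "/PATH/TO/my_blastdb"; then
    echo "taxonomy files found"
else
    echo "taxdb.bti and/or taxdb.btd missing: rerun Step 2"
fi
```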

Step 3: Let’s Blast

We are ready to actually Blast. Let’s go to our working directory.

To use our local database, we need to add the path + the prefix of our files. In my example:

/PATH/TO/my_blastdb/nt

To use our taxonomy database, we need to export the path as follows:

export BLASTDB="/PATH/TO/my_blastdb" 

Also, to obtain the species of our subject sequences, we need to manually request the staxids (Subject Tax IDs). Blast formats 6, 7, and 10 can be customized to add staxids.

Addressing these two points, we have the following command:

blastn -num_threads 34 -db /PATH/TO/my_blastdb/nt -query input.fa -out output.out -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore staxids sscinames scomnames"

(Here, I am requesting the default columns + the subject’s taxonomy ID, scientific name, and common name.)
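Blasting against nt usually returns many hits per query, so a common next step is to keep only the best one. Below is a minimal sketch, assuming tab-separated format 6 output with the bit score in column 12, as in the command above. The helper name best_hits and the demo rows are made up for illustration.

```shell
# Hypothetical helper: keep the highest-bitscore hit for each query.
# Assumes tab-separated format 6 output with bitscore in column 12.
best_hits() {
    awk -F '\t' '
        # a new query, or a better score than the one stored for this query
        !($1 in best) || $12 + 0 > score[$1] {
            if (!($1 in best)) order[++n] = $1   # remember first-seen order
            best[$1] = $0
            score[$1] = $12 + 0
        }
        END { for (i = 1; i <= n; i++) print best[order[i]] }
    ' "$1"
}

# Demo with made-up hits (12 default columns only)
demo_file="$(mktemp)"
printf 'q1\ts1\t90\t100\t10\t0\t1\t100\t1\t100\t1e-20\t100\n' > "$demo_file"
printf 'q1\ts2\t99\t100\t1\t0\t1\t100\t1\t100\t1e-50\t180\n' >> "$demo_file"
printf 'q2\ts3\t95\t50\t2\t0\t1\t50\t1\t50\t1e-10\t80.5\n' >> "$demo_file"
best_hits "$demo_file"   # keeps the s2 row for q1 and the s3 row for q2
rm -f "$demo_file"
```

On real results, you would run best_hits output.out > best_hits.tsv. Ties are resolved in favour of the hit that appears first in the file.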

Default column names for format 6
1  qseqid    query or source (gene) sequence id
2  sseqid    subject or target (reference genome) sequence id
3  pident    percentage of identical positions
4  length    alignment length (sequence overlap)
5  mismatch  number of mismatches
6  gapopen   number of gap openings
7  qstart    start of alignment in query
8  qend      end of alignment in query
9  sstart    start of alignment in subject
10 send      end of alignment in subject
11 evalue    expect value
12 bitscore  bit score

Available columns for custom format
qseqid     Query Seq-id
qgi        Query GI
qacc       Query accession
qaccver    Query accession.version
qlen       Query sequence length
sseqid     Subject Seq-id
sallseqid  All subject Seq-id(s), separated by a ';'
sgi        Subject GI
sallgi     All subject GIs
sacc       Subject accession
saccver    Subject accession.version
sallacc    All subject accessions
slen       Subject sequence length
qstart     Start of alignment in query
qend       End of alignment in query
sstart     Start of alignment in subject
send       End of alignment in subject
qseq       Aligned part of query sequence
sseq       Aligned part of subject sequence
evalue     Expect value
bitscore   Bit score
score      Raw score
length     Alignment length
pident     Percentage of identical matches
nident     Number of identical matches
mismatch   Number of mismatches
positive   Number of positive-scoring matches
gapopen    Number of gap openings
gaps       Total number of gaps
ppos       Percentage of positive-scoring matches
frames     Query and subject frames separated by a '/'
qframe     Query frame
sframe     Subject frame
btop       Blast traceback operations (BTOP)
staxids    Subject Taxonomy ID(s), separated by a ';'
sscinames  Subject Scientific Name(s), separated by a ';'
scomnames  Subject Common Name(s), separated by a ';'
sblastnames  Subject Blast Name(s), separated by a ';' (in alphabetical order)
sskingdoms  Subject Super Kingdom(s), separated by a ';' (in alphabetical order)
stitle     Subject Title
salltitles  All Subject Title(s), separated by a '<>'
sstrand    Subject Strand
qcovs      Query Coverage Per Subject
qcovhsp    Query Coverage Per HSP
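
Once the taxonomy columns are in the output, a quick summary per species is easy to get. Here is a minimal sketch, assuming the column order from my command above, where sscinames is column 14. The helper name species_counts and the demo rows are made up for illustration. Entries holding several names separated by ';' are counted as they are.

```shell
# Hypothetical helper: count hits per scientific name.
# Assumes tab-separated output with sscinames in column 14 (override with a second argument).
species_counts() {
    awk -F '\t' -v col="${2:-14}" '
        { count[$col]++ }
        END { for (name in count) printf "%d\t%s\n", count[name], name }
    ' "$1" | sort -t "$(printf '\t')" -k1,1nr -k2,2   # most frequent species first
}

# Demo with made-up hits (15 columns, as requested in the blastn command)
demo_file="$(mktemp)"
printf 'q1\ts1\t99\t100\t1\t0\t1\t100\t1\t100\t1e-50\t180\t9606\tHomo sapiens\thuman\n' > "$demo_file"
printf 'q2\ts2\t98\t100\t2\t0\t1\t100\t1\t100\t1e-48\t175\t9606\tHomo sapiens\thuman\n' >> "$demo_file"
printf 'q3\ts3\t97\t100\t3\t0\t1\t100\t1\t100\t1e-45\t170\t10090\tMus musculus\thouse mouse\n' >> "$demo_file"
species_counts "$demo_file"
rm -f "$demo_file"
```

On real results, you would run species_counts output.out.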

Original source.

And that’s it. Enjoy Blasting!
