Taxoblast 1.2 Readme (May 2018)
Installing Taxoblast
Both Taxoblast and TaxServer are delivered as executable JAR files and
require a recent, working installation of Java (tests were performed with
Java 1.7). Please check the Oracle website for instructions on how to install
Java on your computer: https://www.oracle.com/java/. TaxoblastGUI requires the
nodes.dmp file, which can be found in the archive taxdmp.zip on the NCBI
ftp server: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip
The Taxoblast pipeline: an overview
Background: Raw genomic sequences of a target organism are frequently
contaminated with sequences of other organisms (e.g. bacteria). The
identification and removal of such contaminants is essential for the
interpretation of genomic data. Attempting to remove these sequences prior to
assembly is difficult because one cannot distinguish between horizontal gene
transfers and contaminations.
Approach: Frequently the genomic context of these sequences can help
distinguish the two scenarios, and cleaning may therefore be more efficient
based on assembled sequences rather than raw reads. Taxoblast splits long
genomic scaffolds into subsequences of defined length, and for each of them it
determines the taxon the closest related sequences were found in. Given a
target taxon (e.g. eukaryota) and a potential contaminant taxon (e.g.
bacteria), it then summarizes this information for the entire scaffold,
taking into account the taxonomic ontology. Scaffolds that exclusively match
potential contaminants may be safely removed, while scaffolds that partially
match contaminants and partially match the target organism may constitute
horizontal transfers or assembly artifacts and need to be examined manually.
Application: Taxoblast is a simple pipeline that has been successfully used to
remove major bacterial contaminations from the two Ectocarpus genomes in our
laboratory. A current limitation is the amount of time required for BLAST
searches, especially when submitting blast searches to the NCBI server.
Reduced query databases and the use of KLAST and PLAST algorithms may greatly
improve these aspects.
*Step 1* of the pipeline simply splits each supercontig / scaffold in a fasta
file into sequences of a defined length (500 or 1000 bp are lengths that
appear to work well; this parameter can be set by modifying the split length
field). The resulting file with the extension .split is also fasta-formatted,
but each header has the extension |partX, where X is the number of the part.
The last part of each scaffold is likely to be shorter than the defined
length, depending on the length of the input sequence. Please note that each
step will only be run if the checkbox next to the step is checked.
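The splitting performed in Step 1 can be illustrated with a short Python
sketch. This is not the code used by Taxoblast, only an approximation of the
behaviour described above; the function name split_fasta is hypothetical, and
numbering the parts from 1 is an assumption.

```python
def split_fasta(lines, split_length=1000):
    """Yield (header, sequence) pairs for fixed-length parts of each record."""
    header, chunks = None, []

    def emit(name, sequence):
        # Assumption: parts are numbered from 1; the real tool may differ.
        starts = range(0, len(sequence), split_length)
        for part, start in enumerate(starts, start=1):
            yield f"{name}|part{part}", sequence[start:start + split_length]

    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record begins: flush the previous one first.
            if header is not None:
                yield from emit(header, "".join(chunks))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield from emit(header, "".join(chunks))


if __name__ == "__main__":
    demo = [">scaf1", "ACGTAC", "GTA", ">scaf2", "AC"]
    for name, seq in split_fasta(demo, split_length=4):
        print(f">{name}\n{seq}")
```

As in the description above, the last part of a scaffold is simply whatever
remains and may be shorter than the split length.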
*Step 2* of the pipeline consists of searching public databases, notably the
NCBI nt (or possibly nr) databases, for sequences homologous to each of the
sequence fragments. Please note that it is essential that your sequences have
not already been published in these databases at the time of your analysis.
There are two ways of performing these searches.
* Online (using the NCBI server): If Step 2 is checked, Taxoblast will send
search requests directly to the NCBI blast server (currently only blastn is
supported). The number of sequences submitted in parallel is determined by
the number of threads used by the program. Each thread will submit one
sequence at a time and check for results at a given interval (30 seconds by
default; changing these parameters is discouraged). Once a result has been
obtained, the next search is launched. Excessive use of NCBI resources may
result in your IP address being banned. Please review these instructions
from NCBI.
* Using your local cluster (recommended): If you have access to a local
cluster I strongly encourage you to use this rather than the NCBI server.
To do so, uncheck Step 2 and manually run the blast searches on your
cluster with the output file of Step 1. Please note that, in our tests,
megablast and discontinuous megablast were not sufficiently sensitive to
reliably detect bacterial contaminations in eukaryote genomes (unless the
genomes of all of the organisms you will be examining, including
contaminants, are already sequenced and available in your database), while
blastn yielded the best results, also detecting contaminations that were
more distantly related to model species. Please note that with the
introduction of BLAST+ the blastn command uses megablast by default, unless
you add -task blastn as an option. The output format needs to be tabular
(-outfmt 6 in BLAST+ or -m 8 in legacy blast), and only one hit per sequence
should be reported (as reporting several hits for the same part of the
sequence would bias the analyses). Here is an example (running 8 threads):
blastn -task blastn -db nt -query Query.split -outfmt 6 -evalue 0.01
-num_threads 8 -max_target_seqs 1 -out Query.split.blast
The use of KLAST or PLAST rather than BLAST is encouraged, but has not been
tested. BlastX searches against the nr database yield results similar to
those of blastn against nt, but the analyses are significantly slower. Once
the blast searches have finished, proceed to Step 3.
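Tabular BLAST output contains one line per alignment, so a fragment can still
appear on several lines even when only one target sequence is requested. If
you want to check your file before Step 3, the following Python sketch keeps
a single line per query, namely the one with the highest bit score. It
assumes the default 12-column tabular layout (query ID in column 1, bit score
in column 12); the function name best_hits is hypothetical and this filter is
not part of Taxoblast itself.

```python
def best_hits(rows):
    """Return {query_id: fields} keeping the highest-bit-score row per query."""
    best = {}
    for row in rows:
        row = row.rstrip("\n")
        # Skip blank lines and comment lines (e.g. from commented formats).
        if not row or row.startswith("#"):
            continue
        fields = row.split("\t")
        query, bitscore = fields[0], float(fields[11])
        # On ties the row seen first is kept.
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, fields)
    return {query: fields for query, (_, fields) in best.items()}


if __name__ == "__main__":
    import sys

    # Usage (file name taken from the example command above):
    #   python best_hits.py Query.split.blast > Query.split.best
    with open(sys.argv[1]) as handle:
        for fields in best_hits(handle).values():
            print("\t".join(fields))
```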
*Step 3* requires three files as input: the split fasta file from step 1; the
blast results of that exact file from step 2, and a file describing the
relationships between different taxa (nodes.dmp). The latter file can be
obtained from the NCBI ftp server (ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/)
and is contained in the taxdmp archive. When using the WebBlast option it
needs to be updated regularly, otherwise it needs to be updated every time you
update your blast database. An out of date nodes.dmp file will result in
unidentified taxa which will be assigned to the root node (i.e. 1). You will
also need to specify the taxa you are trying to separate. By default, Taxoblast
will distinguish between bacterial and eukaryote hits, but this can be narrowed
down (e.g. you could look for fungal sequences in a plant genome). To do so,
search for the corresponding NCBI taxon IDs on the NCBI taxonomy page and enter
the numbers into the corresponding field. Clicking on update will perform a
quick online check of the validity of these taxon IDs and display the
corresponding descriptions.
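How a taxon is matched against a query taxon via nodes.dmp can be pictured
with the following Python sketch. It assumes the usual layout of nodes.dmp,
in which fields are separated by "|" and the first two fields are the taxon
ID and the ID of its parent. The function names load_parents and belongs_to
are hypothetical, and the demo tree is a heavily shortened toy lineage, not
the real NCBI one.

```python
def load_parents(lines):
    """Map each taxon ID to its parent ID from nodes.dmp-style lines."""
    parents = {}
    for line in lines:
        fields = line.split("|")
        if len(fields) < 2:
            continue  # blank or malformed line
        parents[int(fields[0].strip())] = int(fields[1].strip())
    return parents


def belongs_to(taxon, ancestor, parents):
    """Return True if `ancestor` is `taxon` itself or lies above it."""
    seen = set()
    while taxon not in seen:
        if taxon == ancestor:
            return True
        seen.add(taxon)
        # IDs missing from nodes.dmp fall back to the root node (1),
        # mirroring the behaviour described for an out-of-date file.
        taxon = parents.get(taxon, 1)
    return False


if __name__ == "__main__":
    toy = [
        "1\t|\t1\t|\tno rank\t|",
        "2\t|\t1\t|\tsuperkingdom\t|",      # Bacteria
        "2759\t|\t1\t|\tsuperkingdom\t|",   # Eukaryota
        "33090\t|\t2759\t|\tkingdom\t|",    # green plants
        "3702\t|\t33090\t|\tspecies\t|",    # Arabidopsis (toy shortcut)
    ]
    parents = load_parents(toy)
    print(belongs_to(3702, 2759, parents))
    print(belongs_to(3702, 2, parents))
```

With a real nodes.dmp file you would pass the open file handle to
load_parents instead of the toy list.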
Taxonomic assignments are generated using the following procedure:
1. For each BLAST hit, the GI number is extracted.
2. The taxon corresponding to this GI is requested using NCBI E-utilities.
3. This taxon is then compared to your query taxa considering hierarchical
relationships (e.g. Arabidopsis is a green plant, which is a eukaryote, as
defined in the nodes.dmp file).
4. Results are summarized for each scaffold in your original query file. The
complete statistics can be found in the ".taxSum" output file, while the
".tax1" and the ".tax2" files list all scaffolds for which at least 90% of
hits match your query taxon 1 or 2, respectively.
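The per-scaffold summary of point 4 can be sketched in Python as follows.
This is only an illustration under stated assumptions: each fragment is
assumed to have already been labelled "taxon1", "taxon2" or "other", the
scaffold name is recovered by cutting the header at the last "|part", and the
percentage is computed over fragments with a hit. The function name summarize
and the label "mixed" are hypothetical; the actual output files may be
organised differently.

```python
from collections import Counter, defaultdict


def summarize(fragment_labels, threshold=0.9):
    """Assign each scaffold to taxon1/taxon2 if enough of its hits agree."""
    counts = defaultdict(Counter)
    for fragment, label in fragment_labels.items():
        # "scaf1|part3" -> "scaf1"
        scaffold = fragment.rsplit("|part", 1)[0]
        counts[scaffold][label] += 1

    calls = {}
    for scaffold, counter in counts.items():
        total = sum(counter.values())
        calls[scaffold] = "mixed"  # default: needs manual examination
        for label in ("taxon1", "taxon2"):
            if counter[label] / total >= threshold:
                calls[scaffold] = label
    return calls


if __name__ == "__main__":
    labels = {
        "scafA|part1": "taxon2",
        "scafA|part2": "taxon2",
        "scafB|part1": "taxon1",
        "scafB|part2": "taxon2",
    }
    for scaffold, call in summarize(labels).items():
        print(scaffold, call)
```

Scaffolds labelled "mixed" here correspond to the cases that may be
horizontal transfers or assembly artifacts and should be examined manually.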
Legacy instructions (version 1.1)
How to set up and run a local taxonomy server:
NCBI E-utilities have a rather short response time, and it is thus easiest to
use this web service to retrieve the taxonomic associations of the blast hits.
However, if you are running a large number of analyses installing a local
taxonomy server may save some time. To do so download the corresponding
database (gi_taxid_nucl.dmp or gi_taxid_prot.dmp) from the NCBI ftp server
(ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/) and in a terminal window run the
TaxServer.jar application on a computer with sufficient free memory
(>=8GB as of October 2014). The command is structured as follows:
java -Xmx8000m -jar NcbiTaxServer.jar portnr database
Portnr is the internet port the server will be listening to (default 9000,
please make sure that whatever port you choose is free and not blocked by your
firewall), and database is the complete path to the gi_taxid_nucl.dmp file
including the file name (default gi_taxid_nucl.dmp, i.e. this parameter is
not required if the file is located in the present working directory).
Once the server application has started, you can specify the server name (or
IP address) in the local server field of the graphical user interface, and the
corresponding port just below.
Release notes:
Version 1.0beta (August 2016): Initial release
Version 1.1 (July 2017): Update to fix compatibility issues with the new NCBI
BLAST interface and Java 1.8
Version 1.21beta (May 2018):
* Update to fix connection with NCBI server
* Remove support for local taxonomy server
* Avoid redundant queries to NCBI using a local hash map
* Support input from diamond searches (diamond blastx --query XXX.split
--db nr --out XXX --evalue XXX --outfmt 6 --max-target-seqs 1
--more-sensitive)