Taxoblast 1.2 Readme (May 2018)

Installing Taxoblast

Both Taxoblast and NcbiTaxServer are delivered as executable JAR files and 
require a recent, working installation of Java (tests were performed with 
Java 1.7). Please check the Oracle website for instructions on how to install 
Java on your computer: https://www.oracle.com/java/. TaxoblastGUI requires the 
“nodes.dmp” file, which can be found in the archive “taxdmp.zip” on the NCBI
ftp server: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip


The Taxoblast pipeline: an overview

Background: Raw genomic sequences of a target organism are frequently 
contaminated with sequences of other organisms (e.g. bacteria). The 
identification and removal of such contaminants is essential for the 
interpretation of genomic data.  Attempting to remove these sequences prior to 
assembly is difficult because one cannot distinguish between horizontal gene 
transfers and contaminations.

Approach: Frequently the genomic context of these sequences can help 
distinguish the two scenarios, and cleaning may therefore be more efficient 
when based on assembled sequences rather than raw reads. Taxoblast splits long 
genomic scaffolds into subsequences of defined length and, for each of them,
determines the taxon in which the most closely related sequence was found.
Given a target taxon (e.g. Eukaryota) and a potential contaminant taxon (e.g.
Bacteria), it then summarizes this information for the entire scaffold,
taking into account the taxonomic ontology. Scaffolds that exclusively match
potential contaminants may be safely removed, while scaffolds that match
partly contaminants and partly the target organism may constitute
horizontal transfers or assembly artifacts and need to be examined manually.

Application: Taxoblast is a simple pipeline that has been successfully used to
remove major bacterial contaminations from the two Ectocarpus genomes in our
laboratory. A current limitation is the amount of time required for BLAST
searches, especially when submitting BLAST searches to the NCBI server.
Reduced query databases and the use of KLAST and PLAST algorithms may greatly
improve these aspects.

*Step 1*
of the pipeline simply splits each supercontig / scaffold in a fasta file 
into sequences of a defined length (500 or 1000 bp are lengths that appear to 
work well; this parameter can be set by modifying the split length field). The 
resulting file with the extension “.split” is also fasta-formatted, but each 
header has the extension “|partX” where X is the number of the part. The last 
part of each scaffold is likely to be shorter than the defined sequence length, 
depending on the input sequence. Please note that each step will only be run 
if the checkbox next to the step is checked.
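
The splitting described in Step 1 can be sketched in a few lines of Python. 
This is only an illustration and not the code used by Taxoblast: the function 
name “split_fasta” is hypothetical, part numbers are assumed to start at 1, 
and headers are assumed to consist of a single identifier without a 
description.

```python
def split_fasta(lines, split_length=1000):
    """Yield (header, sequence) pairs, one per fragment of each scaffold.

    Assumptions (not taken from Taxoblast itself): parts are numbered
    from 1, and the "|partX" suffix is appended to the full header line.
    """
    def fragments(header, sequence):
        # Cut the sequence into consecutive windows; the last one may be shorter.
        starts = range(0, len(sequence), split_length)
        for part, start in enumerate(starts, start=1):
            yield f"{header}|part{part}", sequence[start:start + split_length]

    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the fragments of the previous one.
            if header is not None:
                yield from fragments(header, "".join(chunks))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield from fragments(header, "".join(chunks))
```

Passing an open fasta file as “lines” and writing each pair back as 
“>header” followed by the sequence would give a file comparable to the 
“.split” output.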

*Step 2*
of the pipeline consists of searching for sequences homologous to each of the 
sequence fragments in public databases, notably the NCBI nt (or possibly nr) 
databases. Please note that it is essential that your sequences have not 
already been published in these databases at the time of your analysis. There 
are two ways of performing these searches.

  * Online (using the NCBI server): If “Step 2” is checked, Taxoblast will send
    search requests directly to the NCBI blast server (currently only blastn is
    supported). The number of sequences submitted in parallel is determined by
    the number of threads used by the program. Each thread will submit one
    sequence at a time and check for results at a given interval (30 seconds by
    default; changing these parameters is discouraged). Once a result has been
    obtained, the next search is launched. Excessive use of NCBI resources may
    result in your IP address being banned. Please review these instructions
    from NCBI.

  * Using your local cluster (recommended): If you have access to a local 
    cluster I strongly encourage you to use this rather than the NCBI server.
    To do so, uncheck “Step 2” and manually run the blast searches on your
    cluster with the output file of Step 1. Please note that, in our tests,
    megablast and discontinuous megablast were not sufficiently sensitive to
    reliably detect bacterial contaminations in eukaryote genomes (unless the
    genomes of all of the organisms you will be examining, including
    contaminants, are already sequenced and available in your database), while
    blastn yielded the best results, also detecting contaminations that were
    more distantly related to model species. Please note that with the
    introduction of BLAST+ the “blastn” command uses “megablast” by default,
    unless you add “-task blastn” as an option. The output format needs to be
    tabular (“-outfmt 6” in BLAST+ or “-m 8” in legacy blast), and only one hit
    per sequence should be reported (as reporting several hits for the same
    part of the sequence would bias the analyses). Here is an example (running
    8 threads):
    
    blastn -task blastn -db nt -query Query.split -outfmt 6 -evalue 0.01 \
    -num_threads 8 -max_target_seqs 1 -out Query.split.blast

    The use of KLAST or PLAST rather than BLAST is encouraged, but has not been
    tested. BlastX searches against the nr database yield results similar to 
    blastn against nt, but the analyses are significantly slower. Once the blast 
    searches have finished, proceed to Step 3.
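
Tabular BLAST output can still contain several lines for the same query, for 
example when more than one alignment to the same subject is reported. If you 
want to check your file before Step 3, the following Python sketch keeps a 
single line per query. It is not part of Taxoblast; the function name 
“best_hits” is hypothetical, and choosing the line with the highest bit score 
(column 12 of the standard 12-column tabular format) is an assumption about 
what “one hit per sequence” should mean.

```python
def best_hits(lines):
    """Return {query_id: fields}, keeping the highest bit score per query.

    Expects the standard 12-column tabular layout: qseqid, sseqid, pident,
    length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.
    """
    best = {}
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, bitscore = fields[0], float(fields[11])
        # Replace the stored hit only if this one scores strictly higher.
        if query not in best or bitscore > float(best[query][11]):
            best[query] = fields
    return best
```

Writing the retained lines back to a file, joined by tabs, gives a table with 
exactly one hit per fragment.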


*Step 3* requires three files as input: the split fasta file from step 1, the 
blast results of that exact file from step 2, and a file describing the 
relationships between different taxa (nodes.dmp). The latter file can be 
obtained from the NCBI ftp server (ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/) 
and is contained in the “taxdmp” archive. When using the WebBlast option, this 
file needs to be updated regularly; otherwise, update it every time you 
update your blast database. An out-of-date “nodes.dmp” file will result in 
unidentified taxa, which will be assigned to the root node (i.e. 1). You will 
also need to specify the taxa you are trying to separate. By default, Taxoblast 
will distinguish between bacterial and eukaryote hits, but this can be narrowed 
down (e.g. you could look for fungal sequences in a plant genome). To do so, 
search for the corresponding NCBI taxon IDs on the NCBI taxonomy page and enter 
the numbers into the corresponding fields. Clicking on update will perform a 
quick online check of the validity of these taxon IDs and display the 
corresponding descriptions.
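
To illustrate how “nodes.dmp” encodes these relationships, here is a minimal 
Python sketch. It is not Taxoblast's own code, and the function names are 
hypothetical. It relies on the layout of the NCBI dump, in which fields are 
separated by “|” and the first two fields are the taxon ID and the ID of its 
parent. Taxa missing from the file are treated as the root node (1), as 
described above.

```python
ROOT = 1  # taxon ID of the root node in the NCBI taxonomy


def load_parents(lines):
    """Build a {taxon_id: parent_id} map from the lines of nodes.dmp."""
    parents = {}
    for line in lines:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 2 or not fields[0]:
            continue
        parents[int(fields[0])] = int(fields[1])
    return parents


def is_descendant(taxon, ancestor, parents):
    """Return True if `ancestor` lies on the path from `taxon` to the root.

    Taxa that are absent from `parents` are treated as the root node.
    """
    current = taxon if taxon in parents else ROOT
    seen = set()
    # The root is its own parent, so stop as soon as a node repeats.
    while current not in seen:
        if current == ancestor:
            return True
        seen.add(current)
        current = parents.get(current, ROOT)
    return False
```

With the full file loaded, “is_descendant(hit_taxon, 2, parents)” would test 
whether a hit belongs to Bacteria and “is_descendant(hit_taxon, 2759, 
parents)” whether it belongs to Eukaryota (2 and 2759 are the NCBI taxon IDs 
of these two groups).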

Taxonomic assignments are generated using the following procedure:

1. For each BLAST hit, the GI number is extracted.
2. The taxon corresponding to this GI is requested using NCBI E-utilities.
3. This taxon is then compared to your query taxa considering hierarchical 
   relationships (e.g. Arabidopsis is a green plant, which is a eukaryote, as 
   defined in the nodes.dmp file).
4. Results are summarized for each scaffold in your original query file. The
   complete statistics can be found in the ".taxSum" output file, while the 
   ".tax1" and the ".tax2" files list all scaffolds with at least 90% of hits
   with your query taxon 1 or 2, respectively.
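
The summary in point 4 can be pictured with the following Python sketch. It 
is an illustration under stated assumptions and not the actual Taxoblast 
implementation: the function name “summarize” and the labels “tax1”, “tax2” 
and “other” are hypothetical, each fragment is assumed to carry exactly one 
label, and the 90% threshold is assumed to be computed over the fragments of 
a scaffold that have a hit.

```python
from collections import Counter, defaultdict


def summarize(assignments, threshold=0.9):
    """Summarize per-fragment labels for each scaffold.

    `assignments` maps fragment IDs such as "scaffold7|part3" to a label
    ("tax1", "tax2" or anything else). Returns the per-scaffold counts and
    the scaffolds reaching the threshold for taxon 1 and taxon 2.
    """
    counts = defaultdict(Counter)
    for part_id, label in assignments.items():
        # Strip the "|partX" suffix added in Step 1 to recover the scaffold.
        scaffold = part_id.rsplit("|part", 1)[0]
        counts[scaffold][label] += 1

    tax1, tax2 = [], []
    for scaffold, labels in counts.items():
        total = sum(labels.values())
        if labels["tax1"] / total >= threshold:
            tax1.append(scaffold)
        elif labels["tax2"] / total >= threshold:
            tax2.append(scaffold)
    return counts, tax1, tax2
```

Scaffolds that end up in neither list are the mixed cases that, as noted in 
the overview, need to be examined manually.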



Legacy instructions (version 1.1)
	How to setup and run a local taxonomy server:

	NCBI E-utilities have a rather short response time, and it is thus easiest to
	use this web service to retrieve the taxonomic associations of the blast hits.
	However, if you are running a large number of analyses installing a local
	taxonomy server may save some time. To do so download the corresponding
	database (gi_taxid_nucl.dmp or gi_taxid_prot.dmp) from the NCBI ftp server
	(ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/) and in a terminal window run the
	NcbiTaxServer.jar application on a computer with sufficient free memory
	(>=8GB as of October 2014).  The command is structured as follows:

		   java -Xmx8000m -jar NcbiTaxServer.jar portnr database

	Portnr is the network port the server will be listening on (default 9000,
	please make sure that whatever port you choose is free and not blocked by your 
	firewall), and database is the complete path to the “gi_taxid_nucl.dmp” file 
	including the file name (default “gi_taxid_nucl.dmp”, i.e. this parameter is 
	not required if the file is located in the present working directory).

	Once the server application has started, you can specify the server name (or IP 
	address) in the “local server” field of the graphical user interface, and the 
	corresponding port just below.


  
Release notes:

Version 1.0beta (August 2016): Initial release

Version 1.1 (July 2017): Update to fix compatibility issues with the new NCBI 
BLAST interface and Java 1.8

Version 1.21beta (May 2018): 
	* Update to fix connection with NCBI server
	* Remove support for local taxonomy server
	* Avoid redundant queries to NCBI using a local hash map
	* Support input from diamond searches (diamond blastx --query XXX.split --db nr --out XXX 
	  --evalue XXX --outfmt 6 --max-target-seqs 1 --more-sensitive)


  
Source: Readme.txt, updated 2018-05-04