Cerulean: Hybrid Genome Assembler Wiki
Brought to you by:
virajbdeshpande
Welcome to your wiki!
This is the default page, edit it as you see fit. To add a new page simply reference it within brackets, e.g.: [SamplePage].
The wiki uses Markdown syntax.
Cerulean Hybrid Genome Assembler v0.1
This software extends contigs assembled using short read datasets like Illumina
paired-end reads using long reads like PacBio RS long reads.
A] Requirements:
Ubuntu 12.04 (may run on other operating systems, but not tested)
Python 2.7.1 (may run on older versions, but not tested)
numpy, matplotlib libraries for Python
ABySS assembler: http://www.bcgsc.ca/platform/bioinfo/software/abyss
SMRT Analysis tookit (for BLASR): http://pacbiodevnet.com/
PBJelly: https://sourceforge.net/projects/pb-jelly/
B] Inputs and Pre-processing:
i) Assembled contigs from ABySS short read assembler
ii)Mapping of Pacbio reads to ABySS contigs using BLASR
i) Assembly of Illumina paired-end reads:
If the paired-end reads are stored in fastq format in the files reads1.fastq
and reads2.fastq, then contigs may be assembled by:
$ abyss-pe k=64 n=10 in='reads1.fastq reads2.fastq' name=
<dataname>This will generate 2 files used for inputs to Cerulean:
*
<dataname>-contigs.fa #This contains the contig sequences*
<dataname>-contigs.dot #This contains the graph structureii)Mapping PacBio reads to ABySS contigs using BLASR:
Note: sawriter and blasr are part of SMRT Analysis toolkit
Note: You need to set the environmental variables and path:
$ export SEYMOUR_HOME=/opt/smrtanalysis/
$ source $SEYMOUR_HOME/etc/setup.sh
Suppose PacBio reads are stored in
<dataname>_pacbio.fasta$ sawriter
<dataname>-contigs.fa$ blasr
<dataname>_pacbio.fa<dataname>-contigs.fa -minMatch 10 \-minPctIdentity 70 -bestn 30 -nCandidates 30 -maxScore -500 \
-nproc <numthreads> -noSplitSubreads \
-out
<dataname>_pacbio_contigs_mapping.fasta.m4</numthreads>Make sure the fasta.m4 file generated has the following format:
qname tname qstrand tstrand score pctsimilarity tstart tend tlength \
qstart qend qlength ncells
The file format may be verified by adding the option -header to blasr.
C] Execute:
Cerulean requires that all input files are in the same directory <basedir>:
i)
<basedir>/<dataname>-contigs.faii)
<basedir>/<dataname>-contigs.dotiii)
<basedir>/<dataname>_pacbio_contigs_mapping.fasta.m4</basedir>To run:
$ python src/Cerulean.py --dataname
<dataname>--basedir<basedir>\--nproc
<numthreads>This will generate:
i)
<dataname>_cerulean.fastaii)
<dataname>_cerulean.dotNote: The dot does not have same contigs as fasta, but intermediate graph.
D] Post-processing:
Currently Cerulean does not include consensus sequence of PacBio reads in gaps
The gaps may be filled using PBJelly.
$ python $JELLYPATH/fakeQuals.py
<dataname>cerulean.fasta <dataname>_cerulean.qual$ python $JELLYPATH/fakeQuals.py
<dataname>_pacbio.fasta<dataname>_pacbio.qual$ cp $JELLYPATH/lambdaExample/Protocol.xml .
$ mkdir PBJelly
Modify Protocol.xml as follows:
Set
<reference>to $PATH_TO</dataname><basedir>/<dataname>cerulean.fastaSet
<outputDir>to $PATH_TO<basedir>/PBJellySet
<baseDir>to $PATH_TO_<basedir>Set
<job>to<dataname>_pacbio.fastaSet
<blasr>option -nproc<numthreads>Note: PBJelly requires that the suffix be .fasta and not .fa
Next run PBJelly:
($ source $JELLYPATH/exportPaths.sh)
$ python $JELLYPATH/Jelly.py
<stage>Protocol.xmlwhere
<stage>has to be in the order:setup
mapping
support
extraction
assembly
output
The assembled contigs may be view in <basedir>/PBJelly/assembly/jellyOutput.fasta</basedir>
Last edit: Viraj Deshpande 2013-08-05