####################################################################################################
# Arpeggio complete analysis
#
# We herein include a step-by-step guide to reproduce the data preprocessing and analysis steps 
# done in the paper:
#
# "Arpeggio: Harmonic compression of ChIP-seq data reveals protein-chromatin interaction signatures"
# 
# The tutorial is divided as follows: in step 1 we will install the external software needed, 
# in step 2 we will create the input files and in step 3 we will run ArpeggioJava
#
# Notes: 
# - The installation assumes we are running on a linux64 machine and we are creating the datasets 
#   under the folder /raid1. Of course, the locations can be changed and the external software may be
#   available for other operating systems and architectures.
# - The programs may require a large amount of storage and a good internet connection.
#   Running the software on the dataset provided (~800 experiments) will require around 
#   4 TB of storage and will download around 560 GB of data from the SRA repository.
# - It's a good idea to run these steps after creating a shell with "screen", since the 
#   computation may take hours or even days. 
# 
#
#
####################################################################################################
# Step 1a: Install bowtie and build reference genomes
####################################################################################################
mkdir -p /raid1/software
cd /raid1/software
wget http://sourceforge.net/projects/bowtie-bio/files/bowtie/0.12.9/bowtie-0.12.9-linux-x86_64.zip/download
mv download bowtie-0.12.9-linux-x86_64.zip
unzip bowtie-0.12.9-linux-x86_64.zip
rm bowtie-0.12.9-linux-x86_64.zip 

cd /raid1/software/bowtie-0.12.9
# Build genomes 
scripts/make_mm9.sh && rm *.fa
scripts/make_hg19.sh && rm *.fa 
scripts/make_d_melanogaster_fb5_22.sh && rm dmel-all-chromosome-r5.22.fasta

####################################################################################################
# Step 1b: Install sra toolkit
####################################################################################################
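# Return to the software folder so that the toolkit is unpacked under /raid1/software
cd /raid1/software
wget http://ftp-private.ncbi.nlm.nih.gov/sra/sdk/2.2.2a/sratoolkit.2.2.2a-centos_linux64.tar.gz
tar xzvf sratoolkit.2.2.2a-centos_linux64.tar.gz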
rm sratoolkit.2.2.2a-centos_linux64.tar.gz
mv sratoolkit.2.2.2a-centos_linux64/ sratoolkit.2.2.2a

####################################################################################################
# Step 2a: Create folders to store data. 
# Note: this may require a few TB of data storage!
####################################################################################################
mkdir -p /raid1/sra
mkdir -p /raid1/sra/data/sra
mkdir -p /raid1/sra/data/fastq
mkdir -p /raid1/sra/data/sam
mkdir -p /raid1/sra/data/tmp


####################################################################################################
# Step 2b: Download the ArpeggioJava software from the SourceForge repository
# and store it under /raid1/software/ArpeggioJava/
#
# The ArpeggioJava.jar file is located at https://sourceforge.net/projects/klugerlab/files/Arpeggio/
# Note: the commands in step 3 expect the jar at /raid1/software/ArpeggioJava/dist/ArpeggioJava.jar
####################################################################################################



####################################################################################################
# Step 2c: Create a properties file (properties.txt) with default options. 
#
# The files.xxx options specify where the sequence files should be located.
# The bowtie option specifies the location of the bowtie executable, bowtie.opt the options to be 
#     passed to bowtie, and the bowtie.genome.xxx options the locations of the reference genomes 
#     (see step 1a).
# The fastq-dump option specifies the location of the fastq-dump executable (see step 1b).
# The arpeggio.window_size option specifies the window size used for computing the autocorrelation.
# The arpeggio.remove_duplicate_reads option removes duplicate reads when computing autocorrelations.
#
# An example properties file is:
#
# files.sra=/raid1/sra/data/sra
# files.fastq=/raid1/sra/data/fastq
# files.sam=/raid1/sra/data/sam
# files.tmp=/raid1/sra/data/tmp
# bowtie=/raid1/software/bowtie-0.12.9/bowtie
# bowtie.opt=-n2 -k1 -m1 --best --strata --chunkmbs 512
# bowtie.genome.mm9=/raid1/software/bowtie-0.12.9/mm9
# bowtie.genome.hg19=/raid1/software/bowtie-0.12.9/hg19
# bowtie.genome.d_melanogaster_fb5_22=/raid1/software/bowtie-0.12.9/d_melanogaster_fb5_22
# fastq-dump=/raid1/software/sratoolkit.2.2.2a/bin/fastq-dump
# arpeggio.window_size=8192
# arpeggio.remove_duplicate_reads=true
####################################################################################################
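
# A minimal sketch for writing the example above to a file with a shell here-document. 
# The BASE and OUT variables are illustrative names, not ArpeggioJava options; change BASE if 
# the software and data are not stored under /raid1.

```shell
BASE="${BASE:-/raid1}"          # illustrative variable: root folder used in this guide
OUT="${OUT:-properties.txt}"    # illustrative variable: file to write

# Write the twelve options from the example, substituting the root folder.
cat > "$OUT" <<EOF
files.sra=$BASE/sra/data/sra
files.fastq=$BASE/sra/data/fastq
files.sam=$BASE/sra/data/sam
files.tmp=$BASE/sra/data/tmp
bowtie=$BASE/software/bowtie-0.12.9/bowtie
bowtie.opt=-n2 -k1 -m1 --best --strata --chunkmbs 512
bowtie.genome.mm9=$BASE/software/bowtie-0.12.9/mm9
bowtie.genome.hg19=$BASE/software/bowtie-0.12.9/hg19
bowtie.genome.d_melanogaster_fb5_22=$BASE/software/bowtie-0.12.9/d_melanogaster_fb5_22
fastq-dump=$BASE/software/sratoolkit.2.2.2a/bin/fastq-dump
arpeggio.window_size=8192
arpeggio.remove_duplicate_reads=true
EOF
```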


####################################################################################################
# Step 2d: Create a data.csv input file using Excel or any CSV editor
#
# The following columns are sufficient to run the pipeline:
# 
# experiment.name : unique experiment identifier
# genome          : reference genome to use for mapping (hg19, mm9, d_melanogaster_fb5_22)
# exp_type        : set to ChIP-seq
# layout          : use "single" if the experiment is single-end, 
#                   use "5'-3'-3'-5'", "3'-5'-5'-3'", "fr", "rf", or "ff" for paired-end
# Run             : the SRA identifier for the experiment (e.g. SRR054876 or ERR011988)
# DNA_shearing    : DNA shearing (e.g. Sonication, MNase). This information is used for matching 
#                   experiments to controls
# Protein         : The protein investigated by the experiment. "IgG" and "DNAInput" are used to 
#                   identify control experiments. 
#
#
# Note: If "Run" is not available, sequence files named [experiment.name].sra, [experiment.name].fastq 
# or [experiment.name].sam should be copied to the corresponding /raid1/sra/data/xxx folder (see step 2a).
#
# 
####################################################################################################
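
# A minimal sketch for starting the file from the shell instead of a spreadsheet. It assumes a 
# comma-separated file whose first line holds the column names; the separator and the column 
# order are assumptions, since this guide does not state them. The file name data_template.csv 
# is illustrative. The loop then checks that each of the seven columns listed above is present.

```shell
CSV="${CSV:-data_template.csv}"   # illustrative file name; rename to data.csv once filled in

# Header row with the seven columns listed above (one experiment per following row).
echo 'experiment.name,genome,exp_type,layout,Run,DNA_shearing,Protein' > "$CSV"

# Check that every required column name appears in the header line.
header=$(head -n 1 "$CSV")
missing=0
for col in experiment.name genome exp_type layout Run DNA_shearing Protein; do
    case ",$header," in
        *",$col,"*) ;;                                   # column found
        *) echo "missing column: $col"; missing=$((missing + 1)) ;;
    esac
done
echo "missing columns: $missing"
```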


####################################################################################################
# Step 2e: Create a folder (e.g. "input_data") with the properties.txt and data.csv files.
####################################################################################################
mkdir -p /raid1/sra/input_data
cp data.csv /raid1/sra/input_data/
cp properties.txt /raid1/sra/input_data/
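
# Since step 3 can run for hours or days, it may be worth checking first that the paths in 
# properties.txt exist. The sketch below is not part of ArpeggioJava; check_properties and PROPS 
# are illustrative names. It assumes one key=value pair per line and that each bowtie genome 
# option is the prefix of a bowtie 0.12.x index (files ending in .1.ebwt).

```shell
check_properties() {
    props_file="$1"
    missing=0
    # Split each line at the first "="; the last clause also handles a final line without newline.
    while IFS='=' read -r key value || [ -n "$key" ]; do
        case "$key" in
            files.*)
                [ -d "$value" ] || { echo "MISSING directory: $key=$value"; missing=$((missing + 1)); } ;;
            bowtie|fastq-dump)
                [ -x "$value" ] || { echo "MISSING executable: $key=$value"; missing=$((missing + 1)); } ;;
            bowtie.genome.*)
                [ -e "$value.1.ebwt" ] || { echo "MISSING bowtie index: $key=$value"; missing=$((missing + 1)); } ;;
        esac
    done < "$props_file"
    # Succeed only if nothing was reported missing.
    [ "$missing" -eq 0 ]
}

PROPS="${PROPS:-/raid1/sra/input_data/properties.txt}"
if [ -f "$PROPS" ]; then
    check_properties "$PROPS" || echo "Fix the paths listed above before running step 3."
else
    echo "No properties file found at $PROPS"
fi
```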

####################################################################################################
# Step 3: Run the pipeline and build all the autocorrelation profiles. Note: this may take a long 
# time and needs a good internet connection to download the data. 
#
# The program will download data from the SRA, map it and build autocorrelation profiles. 
####################################################################################################
java -Xmx4g -server -cp /raid1/software/ArpeggioJava/dist/ArpeggioJava.jar  arpeggio.gui.DataFill /raid1/sra/input_data bda     >> arpeggiolog.txt

# Once the autocorrelation profiles are calculated, the controls can be matched by:
java -Xmx4g -server -cp /raid1/software/ArpeggioJava/dist/ArpeggioJava.jar  arpeggio.gui.DataFill /raid1/sra/input_data ctr  >> arpeggiolog.txt

# Finally, the Arpeggio profiles can be computed by:
java -Xmx4g -server -cp /raid1/software/ArpeggioJava/dist/ArpeggioJava.jar  arpeggio.gui.DataFill /raid1/sra/input_data arp  >> arpeggiolog.txt

# Note: the Arpeggio profiles and matching controls in the paper were calculated using R code, so there may be slight numerical differences.  
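
# The three stages (bda, ctr, arp) can also be chained in one script. This is a sketch under 
# two assumptions that this guide does not confirm: that all stages read the same input folder, 
# and that the Java program exits with a nonzero status on failure. JAR, INPUT_DIR, LOG, DRY_RUN 
# and run_stage are illustrative names. By default the script only prints the commands; set 
# DRY_RUN=0 to execute them.

```shell
JAR="${JAR:-/raid1/software/ArpeggioJava/dist/ArpeggioJava.jar}"
INPUT_DIR="${INPUT_DIR:-/raid1/sra/input_data}"
LOG="${LOG:-arpeggiolog.txt}"
DRY_RUN="${DRY_RUN:-1}"   # illustrative switch: 1 prints the command, 0 runs it

run_stage() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "java -Xmx4g -server -cp $JAR arpeggio.gui.DataFill $INPUT_DIR $1"
    else
        java -Xmx4g -server -cp "$JAR" arpeggio.gui.DataFill "$INPUT_DIR" "$1" >> "$LOG"
    fi
}

# Run the stages in the order given above and stop at the first failure.
for stage in bda ctr arp; do
    run_stage "$stage" || { echo "stage $stage failed, see $LOG" >&2; exit 1; }
done
```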



Source: README.txt, updated 2013-05-05