| Name | Modified | Size | Downloads / Week |
|---|---|---|---|
| input_data.csv | 2013-05-03 | 199.2 kB | |
| properties.txt | 2013-05-03 | 538 Bytes | |
| Totals: 2 Items | | 199.7 kB | 0 |
####################################################################################################
# Arpeggio complete analysis
#
# This is a step-by-step guide to reproducing the data preprocessing and analysis steps
# done in the paper:
#
# "Arpeggio: Harmonic compression of ChIP-seq data reveals protein-chromatin interaction signatures"
#
# The tutorial is divided as follows: in step 1 we install the external software needed,
# in step 2 we create the input files and in step 3 we run ArpeggioJava.
#
# Notes:
# - The installation assumes a linux64 machine, with the datasets created under the folder
#   /raid1. The locations can be changed, and the external software may be available for
#   other operating systems and architectures.
# - The programs may require a large amount of storage and a good internet connection.
#   Running the software on the dataset provided (~800 experiments) will require around
#   4 TB of storage and will download around 560 GB of data from the SRA repository.
# - It's a good idea to run these steps inside a shell created with "screen", since the
#   computation may take hours or even days.
#
#
#
####################################################################################################
# Step 1a: Install bowtie and build reference genomes
####################################################################################################
mkdir -p /raid1/software
cd /raid1/software
wget http://sourceforge.net/projects/bowtie-bio/files/bowtie/0.12.9/bowtie-0.12.9-linux-x86_64.zip/download
mv download bowtie-0.12.9-linux-x86_64.zip
unzip bowtie-0.12.9-linux-x86_64.zip
rm bowtie-0.12.9-linux-x86_64.zip
cd /raid1/software/bowtie-0.12.9
# Build genomes
scripts/make_mm9.sh && rm *.fa
scripts/make_hg19.sh && rm *.fa
scripts/make_d_melanogaster_fb5_22.sh && rm dmel-all-chromosome-r5.22.fasta
####################################################################################################
# Step 1b: Install sra toolkit
####################################################################################################
# Return to the software folder so that the toolkit is installed under /raid1/software
cd /raid1/software
wget http://ftp-private.ncbi.nlm.nih.gov/sra/sdk/2.2.2a/sratoolkit.2.2.2a-centos_linux64.tar.gz
tar xvf sratoolkit.2.2.2a-centos_linux64.tar.gz
rm sratoolkit.2.2.2a-centos_linux64.tar.gz
mv sratoolkit.2.2.2a-centos_linux64/ sratoolkit.2.2.2a
####################################################################################################
# Step 2a: Create folders to store data.
# Note: this may require a few TB of data storage!
####################################################################################################
mkdir -p /raid1/sra
mkdir -p /raid1/sra/data/sra
mkdir -p /raid1/sra/data/fastq
mkdir -p /raid1/sra/data/sam
mkdir -p /raid1/sra/data/tmp
####################################################################################################
# Step 2b: Copy the ArpeggioJava software from the sourceforge repository.
# Download the Arpeggio java software and store it in /raid1/software/ArpeggioJava/
#
# The ArpeggioJava.jar file is located at https://sourceforge.net/projects/klugerlab/files/Arpeggio/
####################################################################################################
####################################################################################################
# Step 2c: Create a property file (properties.txt) with default options.
#
# The files.xxx options specify where the sequence files should be located.
# The bowtie option specifies the location of the bowtie executable, bowtie.opt the options to be
# passed to bowtie and the bowtie.genome.xxx options the locations of the reference genomes
# (see step 1a).
# The fastq-dump option specifies the location of the fastq-dump executable (see step 1b).
# The arpeggio.window_size option specifies the window size used for computing the autocorrelation.
# The arpeggio.remove_duplicate_reads option removes duplicate reads when computing autocorrelations.
#
# An example property file is:
#
# files.sra=/raid1/sra/data/sra
# files.fastq=/raid1/sra/data/fastq
# files.sam=/raid1/sra/data/sam
# files.tmp=/raid1/sra/data/tmp
# bowtie=/raid1/software/bowtie-0.12.9/bowtie
# bowtie.opt=-n2 -k1 -m1 --best --strata --chunkmbs 512
# bowtie.genome.mm9=/raid1/software/bowtie-0.12.9/mm9
# bowtie.genome.hg19=/raid1/software/bowtie-0.12.9/hg19
# bowtie.genome.d_melanogaster_fb5_22=/raid1/software/bowtie-0.12.9/d_melanogaster_fb5_22
# fastq-dump=/raid1/software/sratoolkit.2.2.2a/bin/fastq-dump
# arpeggio.window_size=8192
# arpeggio.remove_duplicate_reads=true
####################################################################################################
####################################################################################################
# Step 2d: Create a data.csv input file using Excel or any csv editor
#
# The following columns are sufficient to run the pipeline:
#
#   experiment.name : unique experiment identifier
#   genome          : reference genome to use for mapping (hg19, mm9, d_melanogaster_fb5_22)
#   exp_type        : set to ChIP-seq
#   layout          : use "single" if the experiment is single-end,
#                     use "5'-3'-3'-5'", "3'-5'-5'-3'", "fr", "rf", or "ff" for paired-end
#   Run             : the SRA identifier for the experiment (e.g. SRR054876 or ERR011988)
#   DNA_shearing    : DNA shearing (e.g. Sonication, MNase). This information is used for matching
#                     experiments to controls
#   Protein         : the protein investigated by the experiment. "IgG" and "DNAInput" are used to
#                     identify control experiments.
#
#
# Note: If "Run" is not available, sequence files named [experiment.name].sra, [experiment.name].fastq
# or [experiment.name].sam should be copied to the corresponding /raid1/sra/data/xxx folder
# (see the files.xxx options in step 2c).
#
#
####################################################################################################
####################################################################################################
# Step 2e: Create a folder (e.g. "input_data") with the properties.txt and data.csv files.
####################################################################################################
mkdir -p /raid1/sra/input_data
cp data.csv /raid1/sra/input_data/
cp properties.txt /raid1/sra/input_data/
####################################################################################################
# Step 3: Run ArpeggioJava to build all the autocorrelation profiles. Note: this may take a long
# time and needs a good internet connection to download the data.
#
# The program will download data from the SRA, map it and build autocorrelation profiles.
####################################################################################################
java -Xmx4g -server -cp /raid1/software/ArpeggioJava/dist/ArpeggioJava.jar arpeggio.gui.DataFill /raid1/sra/input_data bda >> arpeggiolog.txt
# Once the autocorrelation profiles are calculated, the controls can be matched by:
java -Xmx4g -server -cp /raid1/software/ArpeggioJava/dist/ArpeggioJava.jar arpeggio.gui.DataFill /raid1/sra/input_data ctr >> arpeggiolog.txt
# Finally, arpeggio profiles can be computed by:
java -Xmx4g -server -cp /raid1/software/ArpeggioJava/dist/ArpeggioJava.jar arpeggio.gui.DataFill /raid1/sra/input_data arp >> arpeggiolog.txt
# Note: the Arpeggio profiles and matching controls in the paper were calculated using R code,
# so there may be slight numerical differences.
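# Since Step 3 can run for hours or days, it may be worth checking the property file before
# starting it. The sketch below is not part of ArpeggioJava: the helper names get_prop and
# check_properties are hypothetical, and it assumes the property file consists of plain
# key=value lines as in the Step 2c example. It only checks that the files.xxx folders exist
# and that the bowtie and fastq-dump entries point to executable files; the bowtie.genome.xxx
# entries are not checked.

```shell
# Print the value stored under key $2 in property file $1 (prints nothing if the key is absent).
get_prop() {
    awk -F= -v k="$2" '$1 == k { sub(/^[^=]*=/, ""); print; exit }' "$1"
}

# Report every folder or executable named in property file $1 that is missing.
# Returns 0 when nothing is missing, 1 otherwise. The body runs in a subshell,
# so the variables used here do not leak into the calling shell.
check_properties() (
    props=$1
    missing=0
    # Folders for the sequence files (files.xxx options)
    for key in files.sra files.fastq files.sam files.tmp; do
        path=$(get_prop "$props" "$key")
        if [ ! -d "$path" ]; then
            echo "MISSING directory: $key=$path"
            missing=$((missing + 1))
        fi
    done
    # External executables installed in steps 1a and 1b
    for key in bowtie fastq-dump; do
        path=$(get_prop "$props" "$key")
        if [ ! -f "$path" ] || [ ! -x "$path" ]; then
            echo "MISSING executable: $key=$path"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
)
```

# With the layout used in this guide, the check could be run before Step 3 as:
#   check_properties /raid1/sra/input_data/properties.txt && echo "configuration looks complete"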