GenomeRunner Blog

Annotation and enrichment of Next-Gen sequencing data

Brought to you by: mikhaildozmorov


New databases available for download

At long last, the updated SQLite databases for the hg19 human genome assembly are available for download. Because of their large size, they are hosted outside of SourceForge.

Go to the "Help" page at http://www.genomerunner.org; the links are at the bottom. Please send your comments and suggestions, and I'll adjust the tables to your needs.

Posted 2014-01-04

Database update

Due to many inquiries about the databases for GenomeRunner, I am compiling a new version of the genome annotation data. The database will now reflect the structure of the trackDb table.

The ENCODE data will be separated from the UCSC annotations and will follow a hierarchical organization similar to that of UCSC. That is, the data will be split into "data source/type" categories (like BroadHistone) and by cell type, as defined by the ENCODE data coordination center at UCSC. The goal is to make it much easier to run cell type-specific analyses and/or to focus on the best-quality (Tier 1) data....

Posted 2013-11-20

Genome annotation databases

At the heart of GenomeRunner lies genome annotation data from the ENCODE project. The human genome is big, and so is the growing amount of data generated by ENCODE. Therefore, for the human genome I provide only part of the genome annotation data: the best-quality Tier 1 data (hg19tier1.sqlite) and the Tier 2 data (hg19tier2.sqlite). Open these files in GenomeRunner individually to perform annotation/enrichment analyses against Tier 1 or Tier 2 data, respectively. Also note the hg19tier100.sqlite file, which contains cell type-specific transcription factor binding sites....

Posted 2013-10-27

Most biologically interesting (easily interpretable) genome annotation data

If you want to quickly get some biological insight into what functional elements may be affected by your data, use the following genomic features:

Tier1/Regulation/wgEncodeRegTfbsClustered* - experimentally validated TFBSs
Tier1/Regulation/tfbsConsSites* - computationally predicted TFBSs
Tier1/Histone Modifications/wgEncodeBroadHmmGm12878HMM* - chromatin segmentation states
Tier1/Genes/knownAlt* - alternative splicing sites...

Posted 2013-08-25

GenomeRunner video overview

On 10/26/2011 I gave a video overview of GenomeRunner at BioConference LIVE. It outlines the layers of genome organization and how GenomeRunner can use genome annotation data to interpret genome-wide experimental regions. Enjoy!

Posted 2013-07-14

GenomeRunner v4.0 featured on SoftPedia

With the new release, GenomeRunner has been featured on the SoftPedia web site for the second time.

As you may already know, Genome Runner, one of your products, is part of
Softpedia's database of software programs for the Windows operating system.
It is featured with a description text, screenshots, download links and
technical details on this page:
http://www.softpedia.com/get/Science-CAD/Genome-Runner.shtml ...

Posted 2013-07-14

GenomeRunner v4.0

GenomeRunner v4.0 is here! The version number changed because the heart of GenomeRunner, genome annotation data handling, has changed. GenomeRunner no longer requires a permanent Internet connection to access a remote MySQL database. Instead, download one of the SQLite databases, and GenomeRunner will use it locally. See the updated screenshot for an example....

Posted 2013-06-29

MySQL retirement

GenomeRunner was initially designed to work with a MySQL database containing various genome annotation elements and epigenomic features. This works well locally, but often poses a problem for other users accessing the MySQL database hosted by us. Network load and latency, severe weather disruptions of our servers, security considerations: many factors led to the decision to make GenomeRunner completely stand-alone....

Posted 2013-06-20

Visualization

By default, GenomeRunner outputs an n x m matrix of -log10-transformed p-values. The n rows are genomic features (GFs) and the m columns are features of interest (FOIs). Each cell shows how strongly a FOI is enriched in a GF (or depleted, if a "-" is present). The matrix is written to a tab-separated file, "*_ChiSquare_Matrix.gr", which can be opened in Excel or any text editor.
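To make the matrix layout concrete, here is a minimal R sketch that reads such a tab-separated matrix and splits each cell back into a p-value and a direction. The matrix contents, the GF/FOI names and the helper `decode_matrix` are made up for illustration; the header details of a real GenomeRunner file may differ.

```r
# Hypothetical example matrix: rows are GFs, columns are FOIs, values are
# -log10 p-values, negative when the association is a depletion.
example_text <- paste(
  "GF\tFOI_untreated\tFOI_treated",
  "tfbs_A\t3\t-2",
  "tfbs_B\t0\t1.30103",
  sep = "\n"
)

# Read the tab-separated text. For a real file, replace 'text = example_text'
# with 'file = "<your file>_ChiSquare_Matrix.gr"'.
mtx <- as.matrix(read.table(text = example_text, header = TRUE, sep = "\t",
                            row.names = 1, check.names = FALSE))

# decode_matrix (hypothetical helper): recover p-values and direction
decode_matrix <- function(m) {
  list(
    p_value   = 10^(-abs(m)),
    direction = ifelse(m < 0, "depleted", ifelse(m > 0, "enriched", "none"))
  )
}

decoded <- decode_matrix(mtx)
print(decoded$p_value)
print(decoded$direction)
```

Keeping the sign separate from the magnitude makes it easy to sort by significance first and look at the direction afterwards.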

How to visualize it? I have commented and elaborated on this before, and provided an R script for visualizing such a matrix. But what about those who are not familiar with R? Thanks to an excellent heatmap visualization tool, you can upload GenomeRunner's matrix to it, click "Submit", and voila! A high-quality PDF heatmap is ready. The tool also offers many ways to tweak colors and clustering parameters (although the default settings work OK), and it gives you the R code that makes the heatmap, for your learning and use. So, head over to the heatmap visualization tool and try it out.

Posted 2013-05-25

Transcription Factor Binding Sites enrichment analysis

I've been asked which genome annotation elements one should use to get quick insight into easily interpretable biological functions. TFBS enrichment analysis is a good place to start.

Example (for the human hg19 genome assembly): You have ChIP-seq data, have identified peaks, and have their locations in BED format.

To identify whether they are enriched in experimentally validated TFBSs, add the Tier 1/Regulation/wgEncodeRegTfbsClustered genomic feature. Click the "Run Enrichment for all names" button; it will expand this single table into 148 individual TFBSs, like NFkB, IRF1, etc. Running "Enrichment analysis" will answer whether your peaks are statistically significantly enriched in any of these TFBSs....
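The idea of expanding one clustered table into per-factor sets can be sketched in a few lines of R. This is only an illustration of the concept, not GenomeRunner's own code: the coordinates, the column names and the helper `peak_hits` are invented, and the sketch stops at counting overlaps, whereas the real analysis goes on to test each count for enrichment.

```r
# Toy table of clustered TFBSs: one row per binding site, 'name' holds the factor
tfbs <- data.frame(
  chrom = c("chr1", "chr1", "chr1", "chr2"),
  start = c(100, 500, 900, 100),
  end   = c(200, 600, 1000, 200),
  name  = c("NFkB", "IRF1", "NFkB", "IRF1"),
  stringsAsFactors = FALSE
)

# Toy ChIP-seq peaks in BED-like form
peaks <- data.frame(
  chrom = c("chr1", "chr1", "chr2"),
  start = c(150, 950, 5000),
  end   = c(250, 1050, 5100),
  stringsAsFactors = FALSE
)

# For each peak: TRUE if it overlaps at least one site in 'sites'
# (intervals treated as half-open, as in BED)
peak_hits <- function(sites) {
  vapply(seq_len(nrow(peaks)), function(i) {
    any(sites$chrom == peaks$chrom[i] &
        sites$start < peaks$end[i] &
        sites$end   > peaks$start[i])
  }, logical(1))
}

# Expand the single table into one set per factor name and count overlapping peaks
per_factor   <- split(tfbs, tfbs$name)
hits_by_name <- vapply(per_factor, function(s) sum(peak_hits(s)), numeric(1))
print(hits_by_name)
```

Each per-factor count would then be compared against what is expected by chance to decide whether the peaks are enriched for that factor.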

Posted 2013-04-14 | Labels: Tips

Database fully functional!

Earlier than anticipated, the MySQL database is fully functional! Use the link with the tooltip "Change database connection settings" to load the default settings, click "OK", and it will connect. HG19, HG18 and MM9 genome annotation data are available, updated on 02/14/2013.
In the meantime, we're integrating the ability to handle SQLite databases. This will simplify work with GenomeRunner and eliminate the need to access a remote database. Keep checking!

Posted 2013-02-16

Database restoration

We're working on restoring database access for GenomeRunner. Due to a shipping delay of hard drives, we expect the MySQL database to be up and running by ~02/23/2013. We also plan to add SQLite support to the VB version of GenomeRunner so that we can provide downloadable databases; this will remove the dependency on a remotely hosted database.

Posted 2013-02-08

Database unavailable

We've experienced a hard disk failure on our main server hosting the public MySQL database for GenomeRunner. The server is currently offline and is expected to be restored by the end of December or early January. Apologies for the inconvenience; please find the contact e-mail in the README file.

Posted 2012-12-15

GenomeRunner updated to v3.1

This release includes numerous minor adjustments for ease of use. The main improvements are the ability to use any genomic feature as a background and output showing what fraction of user-supplied features of interest overlaps with an (epi)genomic annotation feature. Check the README.txt file for more. The code is also updated in the consolidate branch....

Posted 2012-11-25 | Labels: Updates

ENCODE data

September 2012 will be remembered for a huge release of genome annotation data from ENCODE. Massive as the release is, many gaps still remain.

What do these data mean for GenomeRunner? They'll add enormously to the power of GenomeRunner to find yet unknown associations of experimental genome-wide data with (epi)genomic elements. Work is under way to incorporate these data into the GenomeRunner database without overwhelming the user. It may take me some time, as the ENCODE project release was created over more than 3 years by 442 authors; again, massive data. Check some facts about it here. ...

Posted 2012-09-26

Calculating p-value from binomial distribution

New code has been pushed into the "consolidate" branch. It adds p-value calculation using expected-by-random-chance calculations and the binomial distribution. Although it is not in the compiled release yet, you may explore its usefulness and speedup now.
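For readers who want to see what a binomial p-value of this kind looks like, here is a small R sketch. It is not the code from the "consolidate" branch; it assumes that each FOI overlaps a genomic feature independently with a fixed chance probability, and all numbers and variable names are invented for the example.

```r
# All numbers below are invented for illustration.
n_foi      <- 200    # number of features of interest (FOIs)
observed   <- 30     # FOIs that overlap the genomic feature (GF)
p_expected <- 0.10   # assumed chance that a random FOI overlaps the GF

expected <- n_foi * p_expected   # overlaps expected by random chance

# Upper tail for enrichment: P(X >= observed)
p_enrich  <- pbinom(observed - 1, size = n_foi, prob = p_expected,
                    lower.tail = FALSE)
# Lower tail for depletion: P(X <= observed)
p_deplete <- pbinom(observed, size = n_foi, prob = p_expected)

# Cross-check the enrichment p-value with R's exact binomial test
bt <- binom.test(observed, n_foi, p = p_expected, alternative = "greater")

cat("expected overlaps:", expected, "\n")
cat("enrichment p-value:", p_enrich, "\n")
cat("binom.test p-value:", bt$p.value, "\n")
```

Because the p-value comes from a closed-form distribution, no random simulations are needed, which is where the speedup comes from.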

Posted 2012-08-20

HG19 database update

The human genome annotation database HG19 was updated on 07/23/2012. As usual, only a few tables were affected, including gene-/mRNA-related tables. From now on, updates are scheduled every 6 months, starting in January 2013.

Posted 2012-07-24

First paper using GenomeRunner results published

Fine mapping and conditional analysis identify a new mutation in the autoimmunity susceptibility gene BLK that leads to reduced half-life of the BLK protein.
Delgado-Vega AM, Dozmorov MG, Quirós MB, Wu YY, Martínez-García B, Kozyrev SV, Frostegård J, Truedsson L, de Ramón E, González-Escribano MF, Ortego-Centeno N, Pons-Estel BA, D'Alfonso S, Sebastiani GD, Witte T, Lauwerys BR, Endreffy E, Kovács L, Vasconcelos C, da Silva BM, Wren JD, Martin J, Castillejo-López C, Alarcón-Riquelme ME.
Ann Rheum Dis. 2012 Jul;71(7):1219-26.
PMID: 22696686...

Posted 2012-06-20

Table Creator for GenomeRunner

A separate program for local re-creation of the MySQL database for GenomeRunner is posted on GitHub.
git@github.com:MikhailD/GenomeRunnerTC.git, branch experimental

It was developed in the hope of a one-click-do-it-all design, where a single button does everything. It is still in development, which is why it is not available through SourceForge yet. More changes are coming, which will greatly simplify data handling and provide closer integration with the UCSC genome database. Keep checking back!

Posted 2012-05-24

Update for Analytical method

In the original publication, the analytical method was called "experimental", as we had not tested it thoroughly. GenomeRunner is now at the stage of thoroughly testing the analytical method and comparing it with the default Monte-Carlo simulation coupled with the Chi-squared test.

The goal of this testing is to identify discrepancies, if any, in the analytical method. The analytical method was designed to perform statistical calculations based on the binomial distribution, which increases the speed of calculations significantly. All updates regarding analytical method testing will be posted....
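The contrast between the two approaches can be illustrated with a toy R sketch. It is a simplified stand-in under stated assumptions, not GenomeRunner's implementation: the "genome" is a single 100 kb sequence, FOIs are single positions, and the way the Monte-Carlo estimate is fed into the Chi-squared test is only one plausible reading of "Monte-Carlo coupled with Chi-squared test". All names and numbers are invented.

```r
set.seed(1)  # reproducible random placements

genome_len <- 100000                          # toy genome length (assumption)
gf_start   <- seq(1, genome_len, by = 1000)   # toy GF: 100 intervals of 100 bp
gf_end     <- gf_start + 99
foi_pos    <- c(sample(gf_start, 25) + 50,    # 25 FOIs placed inside GF intervals
                sample.int(genome_len, 75))   # 75 FOIs placed at random

# Count positions that fall inside any GF interval
count_overlaps <- function(pos) {
  idx <- findInterval(pos, gf_start)  # last interval starting at or before pos
  sum(idx > 0 & pos <= gf_end[pmax(idx, 1)])
}

observed <- count_overlaps(foi_pos)
n_foi    <- length(foi_pos)

# Monte-Carlo: place the same number of random positions many times
n_sim       <- 1000
simulated   <- replicate(n_sim, count_overlaps(sample.int(genome_len, n_foi)))
expected_mc <- mean(simulated)

# Chi-squared test on observed vs. Monte-Carlo-expected counts
tab <- rbind(observed = c(observed, n_foi - observed),
             expected = c(round(expected_mc), n_foi - round(expected_mc)))
p_chisq <- chisq.test(tab)$p.value

# Analytical alternative: binomial with p = fraction of the genome covered by the GF
p_cover <- sum(gf_end - gf_start + 1) / genome_len
p_binom <- pbinom(observed - 1, n_foi, p_cover, lower.tail = FALSE)

cat("observed:", observed, " expected (MC):", expected_mc, "\n")
cat("chi-squared p:", p_chisq, " binomial p:", p_binom, "\n")
```

The Monte-Carlo branch needs `n_sim` rounds of random placement to estimate the expected overlap, while the binomial branch gets it from a single coverage fraction, which shows why an analytical calculation is so much faster.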

Posted 2012-03-14

GenomeRunner at F1000

A presentation about GenomeRunner is available at the Faculty of 1000 web site, at http://f1000.com/posters/browse/summary/1089942. This presentation won first place at the MCBIOS2012 meeting (http://www.mcbios.org), held February 17-18 at the Inn at Ole Miss, University of Mississippi, Oxford, MS.

Posted 2012-03-14

Visualization of GenomeRunner's enrichment results - clustering

The heatmaps.R script prompted a question: what clustering method and dissimilarity metric (distance) to use? It depends. I prefer "euclidean" distance and "ward" clustering, but often other combinations give better results. Here is code to view all combinations of clustering parameters, with each heatmap output on a separate page of a .pdf file:

~~~~~~
:::R
# Assumes mtx (numeric matrix of -log10 p-values) and color (a vector of colors)
# are already defined, e.g. by heatmaps.R
library(gplots) # provides heatmap.2() and greenred()
dist.methods<-c("euclidean", "manhattan", "binary", "minkowski") #"canberra","maximum",
# "ward" is named "ward.D" in R >= 3.1.0
hclust.methods<-c("ward.D", "single", "complete", "average", "mcquitty", "median", "centroid")
pdf("Output_File_Name.pdf")
par(oma=c(5,0,0,5)) #Make right and bottom margins larger
for (d in dist.methods) {
  for (h in hclust.methods){
    # With breaks, greenred colors (needs my.breaks and granularity to be defined)
    # heatmap.2(as.matrix(mtx), distfun=function(x){dist(x,method=d)}, hclustfun=function(x){hclust(x,method=h)}, breaks=my.breaks, lwid=c(1.5,3), lhei=c(1.5,4), key=T, keysize=0.1, density.info="none", trace="none", cexCol=1.5, cexRow=1.5, main=paste("Dist : ",d,"; Hclust : ",h), col=greenred(2*granularity-1))
    # Without breaks, brewer colors. Best for enrichment only
    heatmap.2(as.matrix(mtx),trace="none",col=color,distfun=function(x){dist(x,method=d)}, hclustfun=function(x){hclust(x,method=h)}, density.info="none",cexCol=1,cexRow=1, notecex=1.5, main=paste("Dist : ",d,"; Hclust : ",h)) # cellnote=mtx.raw, notecol='darkgreen',
  }
}
dev.off() # close the PDF so all pages are written
~~~~~~

Posted 2012-02-14

heatmaps.R - Visualization of GenomeRunner's enrichment results

I got a very relevant and expected question: what to do with the output from GenomeRunner's enrichment analysis? Well, if you have a single set of genomic features of interest, exploring the log file may be sufficient, as it lists enrichment p-values that can be sorted to find the most significant over/underrepresentations.

When one has multiple conditions, like ChIP-seq peaks in untreated and treated cells, it may be desirable to see p-values side by side, to contrast the differences between the conditions. While it's possible to copy-paste values from the log file, it's already done for you in a *_Matrix.gr file, with genomic features listed vertically and conditions horizontally. The p-values are -Log10-transformed, and a '-' is added in case of underrepresentation. This matrix can be visualized, with a red/green gradient signifying over/underrepresentation....

Posted 2012-02-06

GenomeRunner's code

Those who follow code development may have noticed the creation of another branch, 'consolidate'. This branch contains a reorganized structure of code files, making it possible to have the GUI and command-line versions within one project. Thanks, Cory. This convenience comes at the cost of manually disabling one or the other group of files when compiling one or the other version. Also, when compiling the command-line version, the application type should be set to 'Console application', and it should be 'Windows Forms Application' when compiling the GUI. Details are in the 'BUILD' file....

Posted 2012-02-05

GenomeRunner version 3.0.0.1

This update includes cosmetic code changes, hiding menu items that are not fully functional and making the remaining ones fully functional. Some important changes include:

! When running the traditional Monte-Carlo simulation, which takes a while, zero p-values are now correctly -Log10-transformed into System.Double.MaxValue. Other methods for processing p-values rarely encounter zero p-values and won't be affected by this...
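Why zero p-values need special handling is easy to see in R, where `.Machine$double.xmax` plays the role of .NET's System.Double.MaxValue. The snippet below is only an illustration of the idea with invented values, not a port of GenomeRunner's code.

```r
# -log10 of a zero p-value is infinite, which breaks sorting scales and plotting
p_values <- c(0.05, 1e-10, 0)

transformed <- -log10(p_values)   # the zero p-value becomes Inf

# Cap at the largest finite double, R's counterpart of System.Double.MaxValue
capped <- pmin(transformed, .Machine$double.xmax)

print(transformed)
print(capped)
```

A zero p-value typically arises in a simulation when none of the random runs reaches the observed value, so capping keeps such results finite while still ranking them as the most significant.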

Posted 2012-02-05
