GenomeRunner Home

Annotation and enrichment of Next-Gen sequencing data

Brought to you by: mikhaildozmorov

Latest news: check our blog @ http://sourceforge.net/p/genomerunner/blog/

GenomeRunner is a tool for automating genome exploration. It performs annotation and enrichment analyses of user-provided genomic regions (SNPs, ChIP-seq binding sites, etc.) against more than 6,000 epigenomic features (for the human genome) available from the UCSC Genome Browser.

Input - any genome-wide data in .bed format (a tab-delimited text file with chrom, chromStart and chromEnd columns).
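As a rough illustration of the expected input, here is a minimal Python sketch that reads the three required columns from BED-formatted lines. The function name `read_bed` and the skipping of `#`, `track` and `browser` header lines are assumptions made for this example; this is not GenomeRunner's own parser.

```python
from typing import Iterable, List, Tuple

Region = Tuple[str, int, int]  # (chrom, chromStart, chromEnd)


def read_bed(lines: Iterable[str]) -> List[Region]:
    """Parse the first three tab-delimited BED columns from an iterable of lines."""
    regions = []
    for line in lines:
        line = line.rstrip("\r\n")
        # Skip blank lines and optional header lines (assumed convention).
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        # Extra columns (name, score, ...) are ignored in this sketch.
        regions.append((fields[0], int(fields[1]), int(fields[2])))
    return regions


if __name__ == "__main__":
    example = ["chr1\t1000\t2000\tpeak1", "chr2\t500\t750"]
    print(read_bed(example))
```

In practice the lines would come from an open file handle, e.g. `read_bed(open("my_regions.bed"))`, where the file name is hypothetical.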

Annotation analysis output - detailed annotation of each genomic region in input data. Used to prioritize individual genomic regions by the total number of epigenomic features they co-localize with.

Enrichment analysis output - p-values of statistically significant co-localizations of input genome-wide data with genome annotation features selected for the analysis. Used to prioritize epigenomic features associated with user data.

GenomeRunner video overview
GenomeRunner poster

06/29/2013 Version 4.0 released. Databases are now available for download.
11/25/2012 Version 3.1 of GenomeRunner is released.

GenomeRunner v4.0 workflow

As featured on SoftPedia (http://www.softpedia.com/get/Science-CAD/Genome-Runner.shtml)

Genome Runner 4.0.0.0 - 100% Clean

Using spot background instead of sampling from the whole genome

New databases available for download

by Mikhail Dozmorov 2014-01-04

At long last, the updated SQLite databases for the hg19 human genome assembly are available for download. Due to their large size, they are hosted outside of SourceForge.

Go to http://www.genomerunner.org and open the "Help" page; the links are at the bottom. Please provide your comments and suggestions, and I'll adjust the tables to your needs.

Database update

by Mikhail Dozmorov 2013-11-20

Due to many inquiries about the databases for GenomeRunner, I am currently compiling a new version of the genome annotation data. The database will now reflect the structure of the trackDb table.

The ENCODE data will be separated from the UCSC annotations and will have a hierarchical organization similar to that of UCSC. That is, the data will be split into "data source/type" categories (like BroadHistone) and by cell type, as defined by the ENCODE data coordination center at UCSC. The goal is to make it much easier to run cell type-specific analyses and/or to focus on the best quality (Tier 1) data.

Stay tuned: the update is planned for release before year's end. As always, I am grateful for everyone's insights and suggestions.

Genome annotation databases

by Mikhail Dozmorov 2013-10-27

At the heart of GenomeRunner lies genome annotation data from the ENCODE project. The human genome is big, and so is the growing amount of data generated by ENCODE. Therefore, for the human genome I provide only part of the genome annotation data: the best quality Tier 1 data, hg19tier1.sqlite, and the Tier 2 data, hg19tier2.sqlite. These files should be opened in GenomeRunner individually to perform annotation/enrichment analyses against Tier 1 or Tier 2 data, respectively. Also note the hg19tier100.sqlite file, which contains cell type-specific transcription factor binding sites.

If more data are needed, such as Tier 3 data or a combined dataset for all tiers, please contact me directly at my Gmail address, mikhail dot dozmorov.

Finally, genome annotation data for mouse are much less extensive. Therefore, mm9.sqlite contains all tiered mouse genome annotation data.

Most biologically interesting (easily interpretable) genome annotation data

by Mikhail Dozmorov 2013-08-25

If you want to quickly get some biological insight into what functional elements may be affected by your data, use the following genomic features:

Tier1/Regulation/wgEncodeRegTfbsClustered* - experimentally validated TFBSs
Tier1/Regulation/tfbsConsSites* - computationally predicted TFBSs
Tier1/Histone Modifications/wgEncodeBroadHmmGm12878HMM* - chromatin segmentation states
Tier1/Genes/knownAlt* - alternative splicing sites

* - a table can and should be expanded into subtables using the “Run Enrichment for all names” button

GenomeRunner video overview

by Mikhail Dozmorov 2013-07-14

On 10/26/2011 I gave a video overview of GenomeRunner at BioConference LIVE. It outlines the layers of genome organization and shows how GenomeRunner can use genome annotation data to interpret genome-wide experimental regions. Enjoy!

GenomeRunner v4.0 featured on SoftPedia

by Mikhail Dozmorov 2013-07-14

With the new release, GenomeRunner has been featured on the Softpedia web site for the second time. Their notification reads:

As you may already know, Genome Runner, one of your products, is part of
Softpedia's database of software programs for the Windows operating system.
It is featured with a description text, screenshots, download links and
technical details on this page:
http://www.softpedia.com/get/Science-CAD/Genome-Runner.shtml

The description text was created by our editors, using sources such as text
from your product's homepage, information from its help system, the PAD
file (if available) and the editor's own opinions on the program itself.

"Genome Runner" has been tested in the Softpedia labs using several
industry-leading security solutions and found to be completely clean of
adware/spyware components. We are impressed with the quality of your
product and encourage you to keep these high standards in the future.

To assure our visitors that Genome Runner is clean, we have granted it with
the "100% CLEAN" Softpedia award. To let your users know about this
certification, you may display this award on your website, on software
boxes or inside your product.

More information about your product's certification and the award is
available on this page:
http://www.softpedia.com/progClean/Genome-Runner-Clean-206245.html

Feel free to link to us using the URLs above. If you choose to link to the
clean award page for your product, you may use the award graphic or a text
link: "100% CLEAN award granted by Softpedia".

GenomeRunner v4.0

by Mikhail Dozmorov 2013-06-29

GenomeRunner v4.0 is here! The version number changes because the heart of GenomeRunner, genome annotation data handling, has changed. GenomeRunner no longer requires a permanent Internet connection to access a remote MySQL database. Instead, download one of the SQLite databases, and GenomeRunner will use it locally. See the updated screenshot for an example.

Although we made every effort to eliminate bugs, GenomeRunner is a complex tool, and it may still break in unforeseen situations. Please report any bugs or ask any questions to the authors listed in the README file. All feedback will be answered.

The older version, 3.1, will remain functional. But, as mentioned in the previous post, the MySQL database is going to be retired. There is no set date for it, but we encourage you to use the latest SQLite version of GenomeRunner.

MySQL retirement

by Mikhail Dozmorov 2013-06-20

GenomeRunner was initially designed to work with a MySQL database containing various genome annotation elements and epigenomic features. This works well locally, but often poses a problem for other users accessing the MySQL database hosted by us. Network load and latency, severe weather disruptions of our servers, security considerations: many factors led to the decision to make GenomeRunner completely stand-alone.

From July we will retire our MySQL database (no relation to the Google Reader retirement). GenomeRunner will be able to use SQLite databases, downloadable from this site. The advantages are obvious: users can run analyses locally, a single file serves as a snapshot of the data used for the analysis, and there is no need for login credentials to a remote database. There are disadvantages too: SQLite is slower, even when used locally; database files, especially for human, are large; and updates will require downloading new files. In the end, I hope, the SQLite version of GenomeRunner will be much appreciated.
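Because each database is a single local file, it can also be inspected outside GenomeRunner. The Python sketch below lists the tables in a SQLite file using the standard `sqlite3` module and SQLite's built-in `sqlite_master` catalog. The helper name `list_tables` is made up for this example, and nothing is assumed here about the actual table layout of the GenomeRunner databases.

```python
import sqlite3
from contextlib import closing
from typing import List


def list_tables(db_path: str) -> List[str]:
    """Return the names of all tables in a SQLite database file, sorted by name."""
    # closing() makes sure the connection is actually closed afterwards.
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


# Hypothetical usage with a downloaded database file:
# print(list_tables("hg19tier1.sqlite"))
```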

Visualization

by Mikhail Dozmorov 2013-05-25

By default, GenomeRunner outputs an n x m matrix of -log10 transformed p-values, where n (rows) are genomic features (GFs) and m (columns) are features of interest (FOIs). Each cell shows the enrichment (or depletion, if "-" is present) of a FOI in a GF. The matrix is written to a tab-separated file, "*_ChiSquare_Matrix.gr", which can be opened in Excel or any text editor.
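To make the encoding concrete, a cell value can be converted back into a plain p-value and a direction. The following Python sketch assumes only what is described above (the magnitude is -log10 of the p-value, and a leading "-" marks depletion); the function name `decode_cell` and the direction labels are illustrative choices.

```python
from typing import Tuple


def decode_cell(value: float) -> Tuple[float, str]:
    """Convert a signed -log10(p) matrix cell back to (p-value, direction)."""
    # A negative sign marks depletion; everything else is treated as enrichment.
    direction = "depleted" if value < 0 else "enriched"
    # Undo the -log10 transform on the magnitude.
    p_value = 10.0 ** (-abs(value))
    return p_value, direction


if __name__ == "__main__":
    for cell in (3.0, -2.0):
        print(cell, decode_cell(cell))
```

A cell of 0 corresponds to a p-value of 1, i.e. no evidence in either direction, even though this sketch labels it "enriched".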

How to visualize it? I have commented and elaborated on this before, and provided an R script for visualizing such a matrix. But what about those who are not familiar with R? Thanks to an excellent heatmap visualization tool, you can upload GenomeRunner's matrix to it, click "Submit", and voila! A high quality PDF heatmap is ready. It also provides many ways to tweak colors and clustering parameters (although the default settings work OK), and it gives the R code used to make the heatmap, for your learning and use. So, head over to the heatmap visualization tool and try it out.

Transcription Factor Binding Sites enrichment analysis

by Mikhail Dozmorov 2013-04-14

I've been asked which genome annotation elements one should use to get quick insight into easily interpretable biological functions. TFBS enrichment analysis is a good place to start.

Example (for human hg19 genome assembly): You have ChIP-seq data, identified peaks, and have their locations in BED format.

To identify whether they are enriched in experimentally validated TFBSs, add the Tier 1/Regulation/wgEncodeRegTfbsClustered genomic feature. Click the "Run Enrichment for all names" button: it will expand this single table into 148 individual TFBSs, like NFkB, IRF1, etc. Running "Enrichment analysis" will then answer whether your peaks are statistically significantly enriched in any of these TFBSs.
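Conceptually, expanding a table "for all names" means splitting its rows into one group per distinct label, so that each label is tested as its own genomic feature. A minimal Python sketch of that idea follows; the row layout `(chrom, chromStart, chromEnd, name)` and the function name `split_by_name` are assumptions for illustration and do not describe GenomeRunner's internal code.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int, int, str]  # (chrom, chromStart, chromEnd, name)


def split_by_name(rows: Iterable[Row]) -> Dict[str, List[Tuple[str, int, int]]]:
    """Group regions by their label so each label becomes a separate feature."""
    groups = defaultdict(list)
    for chrom, start, end, name in rows:
        groups[name].append((chrom, start, end))
    return dict(groups)


if __name__ == "__main__":
    # Toy rows, not real binding sites.
    rows = [("chr1", 10, 20, "NFkB"), ("chr1", 50, 60, "IRF1"), ("chr2", 5, 15, "NFkB")]
    for name, regions in split_by_name(rows).items():
        print(name, len(regions))
```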

The same test can be done for computationally predicted TFBSs. Use Tier 1/Regulation/tfbsConsSites.

It is always advisable to use a custom background, as nearly half of the human genome consists of low-complexity regions. Random sampling from them will exaggerate the significance of the resulting p-values. More about this in subsequent posts.

Database fully functional!

by Mikhail Dozmorov 2013-02-16

Earlier than anticipated, the MySQL database is fully functional! Use the link with the tooltip "Change database connection settings" to load default settings, click "OK" and it will connect. HG19, HG18 and MM9 genome annotation data are available, updated on 02/14/2013.
In the meantime, we're integrating the ability to handle SQLite databases. This will simplify work with GenomeRunner and eliminate the need to access a remote database. Keep checking!

Database restoration

by Mikhail Dozmorov 2013-02-08

We're working on restoring database access for GenomeRunner. Due to a shipping delay of hard drives, we expect the MySQL database to be up and running by ~02/23/2013. We also plan to add SQLite support to the VB version of GenomeRunner in order to provide downloadable databases, which will remove the dependency on a remotely hosted database.

Database unavailable

by Mikhail Dozmorov 2012-12-15

We've experienced a hard disk failure on our main server hosting the public MySQL database for GenomeRunner. The server is currently offline and is expected to be restored by the end of December or early January. Apologies for the inconvenience; please find the contact e-mail in the README file.

GenomeRunner updated to v3.1

by Mikhail Dozmorov 2012-11-25

This release includes numerous minor adjustments for ease of use. The main improvements are the ability to use any genomic feature as a background and a new output showing what fraction of user-supplied features of interest overlaps with an (epi)genomic annotation feature. Check the README.txt file for more. The code is also updated in the consolidate branch.

Using SNP tables as a background is trickier due to memory limitations, so it is included as a separate feature. With it, one can test whether a set of SNPs is enriched in any (epi)genomic annotation as compared with a set of SNPs randomly selected from all SNPs identified. This restriction avoids genomic regions that have low complexity and/or are poorly annotated. Furthermore, one can select different SNP tables as a background, to further restrict the background for random sampling. For example, if one tests a set of SNPs identified from a GWA study, snp135 would be more appropriate as a background, as it includes all human SNPs ever identified. If, on the other hand, one is evaluating a set of SNPs drawn from clinically relevant SNPs, the snp135flagged table would be more appropriate as a background.
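The idea of sampling from a restricted background can be sketched in a few lines of Python. This is a deliberately simplified illustration, not GenomeRunner's procedure: it works on positions of a single chromosome and reports a plain empirical p-value, i.e. the share of random draws from the background that overlap the annotation at least as often as the SNPs of interest. All function and parameter names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on a single chromosome


def count_hits(positions: Sequence[int], feature: Sequence[Interval]) -> int:
    """Count positions that fall inside any interval of the annotation feature."""
    return sum(any(s <= pos < e for s, e in feature) for pos in positions)


def empirical_enrichment_p(
    foi: Sequence[int],
    background: List[int],
    feature: Sequence[Interval],
    n_iter: int = 1000,
    seed: int = 0,
) -> float:
    """Share of random background draws with at least as many hits as the FOI set."""
    rng = random.Random(seed)
    observed = count_hits(foi, feature)
    at_least = sum(
        count_hits(rng.sample(background, len(foi)), feature) >= observed
        for _ in range(n_iter)
    )
    # Add-one correction keeps the estimate away from an exact zero.
    return (at_least + 1) / (n_iter + 1)
```

Swapping the `background` list (say, all SNPs versus a flagged subset) changes what "random" means, which is exactly the effect of choosing a different background table.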

ENCODE data

by Mikhail Dozmorov 2012-09-26

September 2012 will be remembered for a huge release of genome annotation data from ENCODE. Massive as the release is, many gaps still remain.

What do these data mean for GenomeRunner? They'll add enormously to the power of GenomeRunner to find yet unknown associations of experimental genome-wide data with (epi)genomic elements. Work is under way to incorporate these data into the GenomeRunner database without overwhelming the user. It may take me some time: the ENCODE project release was created over more than 3 years by 442 authors, so, again, the data are massive. Check some facts about it here.

In the meantime, a web version of GenomeRunner is slowly taking shape.

Calculating p-value from binomial distribution

by Mikhail Dozmorov 2012-08-20

New code has been pushed to the "consolidate" branch. It adds p-value calculation based on the overlap expected by random chance and the binomial distribution. Although it is not in the compiled release yet, you can already explore its usefulness and speedup.
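For readers who want to see the general shape of such a calculation, here is a minimal Python sketch of a one-sided binomial test. It is written under stated assumptions and is not the code from the branch: `n_regions` is the number of features of interest, `n_overlapping` is how many of them overlap the genomic feature, and `p_expected` is the chance that a single random region overlaps it (how that probability is estimated is left open here).

```python
from math import comb


def binomial_enrichment_pvalue(n_regions: int, n_overlapping: int, p_expected: float) -> float:
    """Upper-tail probability P(X >= n_overlapping) for X ~ Binomial(n_regions, p_expected)."""
    return sum(
        comb(n_regions, k) * p_expected**k * (1.0 - p_expected) ** (n_regions - k)
        for k in range(n_overlapping, n_regions + 1)
    )


if __name__ == "__main__":
    # Toy numbers: 20 regions, 8 overlaps observed, 10% overlap expected by chance.
    print(binomial_enrichment_pvalue(20, 8, 0.1))
```

The speedup comes from replacing repeated random sampling with a closed-form sum; a test for depletion would use the lower tail, P(X <= n_overlapping), instead.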

HG19 database update

by Mikhail Dozmorov 2012-07-24

Human genome annotation database HG19 was updated on 07/23/2012. As usual, only a few tables were affected, including gene-/mRNA-related tables. From now on, updates are scheduled every 6 months, starting in January 2013.

First paper using GenomeRunner results published

by Mikhail Dozmorov 2012-06-20

Fine mapping and conditional analysis identify a new mutation in the autoimmunity susceptibility gene BLK that leads to reduced half-life of the BLK protein.
Delgado-Vega AM, Dozmorov MG, Quirós MB, Wu YY, Martínez-García B, Kozyrev SV, Frostegård J, Truedsson L, de Ramón E, González-Escribano MF, Ortego-Centeno N, Pons-Estel BA, D'Alfonso S, Sebastiani GD, Witte T, Lauwerys BR, Endreffy E, Kovács L, Vasconcelos C, da Silva BM, Wren JD, Martin J, Castillejo-López C, Alarcón-Riquelme ME.
Ann Rheum Dis. 2012 Jul;71(7):1219-26.
PMID: 22696686

GenomeRunner did the analysis of SNPs enriched/depleted in transcription factor binding sites and histone modification marks. Genomic regions associated with Systemic Lupus Erythematosus were found to be enriched in NFkB binding, and in several epigenetic marks. In contrast, no enrichment was observed for non-associated regions.

That's a start, more on the way.

Table Creator for GenomeRunner

by Mikhail Dozmorov 2012-05-24

A separate program for local re-creation of MySQL database for GenomeRunner is posted on GitHub.
git@github.com:MikhailD/GenomeRunnerTC.git, branch experimental

It was developed in the hope of a one-click-do-it-all tool, where a single button would do everything. It is still in development, which is why it is not available through SourceForge yet. Moreover, new changes are coming that will greatly simplify data handling and provide closer integration with the UCSC genome database. Keep checking back!

Update for Analytical method

by Mikhail Dozmorov 2012-03-14

In the original publication, the analytical method was called "experimental", as we had not tested it thoroughly. Now GenomeRunner is at the stage of thoroughly testing the analytical method and comparing it with the default Monte-Carlo simulation coupled with the Chi-squared test.

The goal of this testing is to identify discrepancies, if any, in the analytical method. The analytical method was designed to perform statistical calculations based on the binomial distribution, which increases the speed of calculations significantly. All updates regarding the testing of the analytical method will be posted.

An updated version of GenomeRunner's code is posted on SourceForge and on GitHub. I'm thinking about release cycles for GenomeRunner binaries; in the meantime, use the current GenomeRunner version with default settings.

GenomeRunner at F1000

by Mikhail Dozmorov 2012-03-14

A presentation about GenomeRunner is available on the Faculty of 1000 web site, at http://f1000.com/posters/browse/summary/1089942. This presentation won first place at the MCBIOS2012 meeting (http://www.mcbios.org), held February 17-18 at the Inn at Ole Miss, University of Mississippi, Oxford, MS.

Visualization of GenomeRunner's enrichment results - clustering

by Mikhail Dozmorov 2012-02-14

The heatmaps.R script prompted a question: which clustering method and dissimilarity metric (distance) should one use? It depends. I prefer "euclidean" distance and "ward" clustering, but other combinations often give better results. Here is code to view all combinations of clustering parameters, with each heatmap output on a separate page of a .pdf file:

:::R
library(gplots) # provides heatmap.2() and greenred()
# 'mtx' (numeric matrix) and 'color' (palette) are expected to be defined earlier in heatmaps.R
dist.methods <- c("euclidean", "manhattan", "binary", "minkowski") # "canberra", "maximum",
# Note: in R >= 3.1.0 the "ward" method is named "ward.D"
hclust.methods <- c("ward", "single", "complete", "average", "mcquitty", "median", "centroid")
pdf("Output_File_Name.pdf")
par(oma=c(5,0,0,5)) # Make right and bottom margins larger
for (d in dist.methods) {
  for (h in hclust.methods) {
    # With breaks, greenred colors
    # heatmap.2(as.matrix(mtx), distfun=function(x){dist(x,method=d)}, hclustfun=function(x){hclust(x,method=h)}, breaks=my.breaks, lwid=c(1.5,3), lhei=c(1.5,4), key=T,  keysize=0.1, density.info="none", trace="none",  cexCol=1.5, cexRow=1.5, main=paste("Dist : ",d,"; Hclust : ",h), col=greenred(2*granularity-1))
    # Without breaks, brewer colors. Best for enrichment only
    heatmap.2(as.matrix(mtx), trace="none", col=color, distfun=function(x){dist(x,method=d)}, hclustfun=function(x){hclust(x,method=h)}, density.info="none", cexCol=1, cexRow=1, notecex=1.5, main=paste("Dist : ",d,"; Hclust : ",h)) # cellnote=mtx.raw, notecol='darkgreen',
  }
}
dev.off() # close the PDF device so the file is written

Make this part of heatmaps.R, and page through the results!

heatmaps.R - Visualization of GenomeRunner's enrichment results

by Mikhail Dozmorov 2012-02-06

I got a very relevant and expected question: what to do with the output from GenomeRunner's enrichment analysis? Well, if you have a single set of genomic features of interest, exploring the log file may be sufficient, as it lists enrichment p-values that can be sorted to find the most significant over/underrepresentations.

When one has multiple conditions, like ChIP-seq peaks in untreated and treated cells, it may be desirable to see p-values side by side, to contrast the differences between the conditions. While it's possible to copy-paste values from the log file, this is already done for you in a *_Matrix.gr file, with genomic features listed vertically and conditions horizontally. p-values are -Log10 transformed, and a '-' is added in case of underrepresentation. This matrix can be visualized, and a red/green gradient will signify over/underrepresentation.

heatmaps.R is a script written to do just that: visualize matrices. See the Downloads section. I highly recommend using the RStudio environment for R. Under Windows, I open a matrix file in Excel, copy the content to the clipboard, and then run it through the script. Questions? Contact me. I'm working on seamless integration of .NET with R to include these visualization capabilities in GenomeRunner.

GenomeRunner's code

by Mikhail Dozmorov 2012-02-05

Those who follow code development may have noticed the creation of another branch, 'consolidate'. This branch contains a reorganized structure of code files, which allows the GUI and command line versions to live within one project. Thanks, Cory. This convenience comes at the cost of manually disabling one or the other group of files when compiling one or the other version. Also, when compiling the command line version, the application type should be set to 'Console application', and it should be 'Windows Forms Application' when compiling the GUI. Details are in the 'BUILD' file.

This restructuring was also aimed at simplifying compilation of a Linux version of GenomeRunner. It's on the way, check back soon!

GenomeRunner version 3.0.0.1

by Mikhail Dozmorov 2012-02-05

This update includes cosmetic code changes, hiding menu items that are not fully functional and making the remaining ones fully functional. Some important changes include:

! When running the traditional Monte-Carlo simulation, which takes a while, zero p-values are now correctly -Log10-transformed into System.Double.MaxValue. Other methods for processing p-values rarely encounter zero p-values and won't be affected by this.

! The 'Run enrichment for all names' button now functions correctly for other categories. E.g., the table 'nestedRepeats' contains the column 'repClass', with 18 labels. The button now correctly recognizes these 18 labels as different genomic features. Previously it worked correctly on 'name' columns only, such as in the 'wgEncodeRegTfbsClustered' table.

! Now ANY genomic feature can be loaded as genomic background. This is needed, for example, when analyzing a set of SNPs for enrichment, and comparing it with random selection from all the SNPs available instead of the whole genome. Select 'snp135' genomic feature into 'Genomic features that will be run' panel, then 'File/Load selected Genomic Feature as spot background'. Note that for huge tables, such as SNPs, loading can take up to 3 hours.

! Generating random genomic features. Now the code includes weighted selection of chromosomes, so small and random chromosomes would be rarely selected.
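The weighted chromosome selection mentioned in the last item can be illustrated with a short Python sketch. It is an assumption-laden approximation of the idea rather than GenomeRunner's VB code: a chromosome is drawn with probability proportional to its length, and a start position is then drawn uniformly within it. The function name and the toy chromosome sizes are hypothetical.

```python
import random
from typing import Dict, Tuple


def random_region(chrom_sizes: Dict[str, int], length: int, rng: random.Random) -> Tuple[str, int, int]:
    """Pick a chromosome with probability proportional to its size, then a start position."""
    # Only chromosomes long enough to hold the region are eligible.
    chroms = [c for c, size in chrom_sizes.items() if size >= length]
    weights = [chrom_sizes[c] for c in chroms]
    chrom = rng.choices(chroms, weights=weights, k=1)[0]
    start = rng.randint(0, chrom_sizes[chrom] - length)  # inclusive bounds
    return chrom, start, start + length


if __name__ == "__main__":
    toy_sizes = {"chrA": 900_000, "chrB": 100_000, "chrSmall": 500}  # toy values, not real lengths
    rng = random.Random(42)
    print([random_region(toy_sizes, 200, rng) for _ in range(3)])
```

With these toy sizes, "chrSmall" is picked only about once in two thousand draws, which mirrors the intent that small chromosomes are rarely selected.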

GenomeRunner on Softpedia

by Mikhail Dozmorov 2012-02-03

GenomeRunner can now also be downloaded from Softpedia.

They also granted GenomeRunner their "100% clean" award. Of course, everyone can look at the source code and see for themselves, but it is nice to display Softpedia's logo on the front page.
100% CLEAN award granted by Softpedia

GenomeRunner tutorial

by Mikhail Dozmorov 2012-01-31

Since publication, GenomeRunner's inner workings have been upgraded. I described all the changes in the Supplemental Material, aka Tutorial. It's available for download or as a web page. Red highlights the differences from the published version.

All examples are updated to be analyzed with the latest HG19 human genome assembly. Random FOIs, of course, remain random, and no enrichment was observed. I selected different sets of SNPs; the results changed, as expected, but the expected findings hold. That is, SNPs associated with genes are heavily enriched in genes, and SNPs associated with conserved elements show association with those elements. The association of H3K4me2 with DNase hypersensitive sites also holds in the HG19 genome assembly, again as expected.

GenomeRunner v.3.0 is available

by Mikhail Dozmorov 2012-01-31

Both the GUI and console versions have been updated. Although you may not notice many changes in appearance, behind the scenes there are many improvements designed to provide a better user experience.

The databases for GenomeRunner now have an improved Tier structure for the human and mouse genome assemblies. Older features, as well as the hg18 human genome assembly, are still kept for compatibility. The chromInfo table is used to get genome background information and changes automatically as one selects a different organism. Server selection got easier, in light of new database servers coming online soon.

The command line version now uses universal settings stored in an XML file, which can be changed in any text editor. The help file was rewritten, and there are many other small enhancements and fool-proofing hooks.

It's hardly possible to foresee all situations that may disrupt GenomeRunner's work. Any comments about user experience are welcomed in Tickets, Blog, Discussion or Wiki pages.

New databases available

by Mikhail Dozmorov 2012-01-30

The latest data for human and mouse are available for GenomeRunner in the databases hg19 and mm9, respectively. Database hg18test remains available as a legacy of the published version of GenomeRunner, but using hg19 is highly encouraged. As always, the databases are accessible via
mysql -h156.110.144.34 -ugenomerunner -pgenomerunner hg19

What's new:
- Better Tiers organization within genomerunner master table
- Genome background information is stored in chromInfo table, downloadable from UCSC database
- Timestamp of the last update, and Completed code, stored in genomerunner master table

Behind the scenes:
- Programmatic database update and maintenance. The program will be released separately

Coming soon:
- More efficient handling of the genomerunner master table, making use of the trackDb table. This table contains track descriptions and serves as a master table for the UCSC genome browser. It will greatly simplify incorporation of all genome annotation data for use in GenomeRunner, but still requires manual selection of tiers, query types, etc. I'm debating the best way of using it: whether to recode GenomeRunner to use trackDb directly or to process it into the genomerunner master table.

Database update

by Mikhail Dozmorov 2012-01-28

GenomeRunner server is down for maintenance.

After MySQL crashed and the server required a restart, a new database assembly is being loaded. It sounds simple, but installing a 77 GB database onto a 100 GB drive over the network is not trivial. The human database should be up and running in about a day, and I'll try to put at least part of the mouse data in the remaining space.

In a few days a new server will be open, which will solve disk space limitations. Check back soon!

Check all news and updates to ensure you're using the latest version of GenomeRunner!
