Background

History

This module was inspired by the TheAnnotator tool by Gerton Lunter, first used in the analysis by Ponjavic et al. (2007), and by early work of Caleb Webber.

Compared to the original Annotator, this module aims to:

  • permit more measures of association. The original Annotator used nucleotide overlap, but other measures might be useful, such as the number of elements overlapping by at least x nucleotides or the proximity to the closest element.
  • provide an easier user interface and use standard formats.
  • permit incremental runs. Annotations can be added without recomputing the samples.
  • run faster.
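The first point can be made concrete with a minimal sketch of two of the measures named above: nucleotide overlap and the number of elements overlapping by at least x nucleotides. The function names and the half-open (start, end) interval convention are assumptions made for illustration; this is not the module's own implementation.

```python
def merge(intervals):
    """Sort half-open (start, end) intervals and merge overlapping or adjacent ones."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def nucleotide_overlap(segments, annotations):
    """Number of bases covered by both interval sets."""
    a, b = merge(segments), merge(annotations)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if hi > lo:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def segments_overlapping(segments, annotations, min_bases=1):
    """Number of segments sharing at least min_bases bases with the annotations."""
    return sum(
        1 for segment in segments
        if nucleotide_overlap([segment], annotations) >= min_bases
    )


# Hypothetical toy intervals on a single contig.
segments = [(10, 20), (30, 40), (100, 110)]
annotations = [(15, 35), (105, 106)]

print(nucleotide_overlap(segments, annotations))       # 11
print(segments_overlapping(segments, annotations, 5))  # 2
```

The two measures can disagree: the third segment contributes to the nucleotide overlap but does not count as overlapping once at least 5 shared bases are required.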

Comparison to other methods

Testing for association between genomic features is a field of long-standing interest in genomics and has gained considerable traction with the publication of large-scale genomic data sets such as the ENCODE data.

Generally we believe that the problem of testing for association has not been fully resolved and advise every genomicist to apply several methods. The list of tools/services below is not exhaustive:

GREAT (McLean et al. (2010)) uses a binomial test to assess whether transcription factor binding sites are associated with regulatory domains of genes. GREAT has a convenient web interface with many annotations pre-loaded. Compared to GREAT, GAT can measure depletion and can
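As a rough illustration of a binomial test of this kind (not GREAT's actual code), the sketch below computes a one-sided p-value for enrichment. It assumes that each of n binding sites falls into a regulatory domain independently with probability p, where p is the fraction of the genome covered by the domains. The function name and all numbers are hypothetical.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical example: 100 binding sites, 20 of which fall into
# regulatory domains that cover 10% of the genome.
n_sites = 100
k_hits = 20
p_domain = 0.1

p_value = binomial_upper_tail(k_hits, n_sites, p_domain)
print(f"expected hits: {n_sites * p_domain:.1f}, observed: {k_hits}, P = {p_value:.4g}")
```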

GenometriCorr (Favorov et al. (2012)) computes a variety of distance metrics when comparing two interval sets and then applies a set of statistical tests to measure association. GenometriCorr is a good exploratory tool to generate hypotheses about the relationships between two sets of genomic intervals. Compared to GenometriCorr, GAT can simulate more realistic genomic scenarios. For example, segments might not occur in certain regions (due to mapping problems) or might occur at reduced frequency (G+C biases).
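The idea of excluding regions from the simulation can be sketched as follows: segments are placed at random, but only inside a set of permitted regions (called the workspace here). This is a simplified illustration under stated assumptions, not GAT's sampler. The function names are hypothetical, sampled segments may overlap each other, and reduced frequencies such as G+C biases are not modelled.

```python
import random


def sample_segment(length, workspace, rng):
    """Place one segment of the given length uniformly inside the workspace.

    workspace is a list of non-overlapping half-open (start, end) regions.
    Regions too short to hold the segment are skipped.
    """
    # Number of valid start positions within each sufficiently long region.
    slots = [
        (start, end - start - length + 1)
        for start, end in workspace
        if end - start >= length
    ]
    if not slots:
        raise ValueError("no workspace region can hold a segment of this length")
    pick = rng.randrange(sum(n for _, n in slots))
    for start, n in slots:
        if pick < n:
            return (start + pick, start + pick + length)
        pick -= n


def randomize(segments, workspace, rng):
    """Resample every segment inside the workspace, preserving its length."""
    return [sample_segment(end - start, workspace, rng) for start, end in segments]


# Hypothetical workspace: positions 100-500 are excluded, for example
# because reads cannot be mapped there.
workspace = [(0, 100), (500, 520)]
rng = random.Random(0)
print(randomize([(10, 30), (40, 45)], workspace, rng))
```

Because every start position inside the workspace is equally likely, larger regions receive proportionally more sampled segments and excluded regions receive none.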

The GSC (The ENCODE Project Consortium (2012)) metric (for Genome Structure Correlation) is inspired by the analysis of approximately piecewise stationary time series. GSC assesses the significance of an association metric by estimating its random expectation from randomly chosen intervals on the genome. This expectation is then used to test whether the observed value of the metric (nucleotide overlap, region overlap, ...) is higher than expected. The method is described in the supplemental details of the first and recent ENCODE papers (Birney et al. (2007), ...).

BITS (Layer et al. (2013)) (Binary Interval Search) is a method to perform quick overlap queries between genomic data sets. It implements a Monte Carlo simulation that is particularly suited to all-against-all comparisons between a large number of data sets.
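A generic Monte Carlo test of this kind (a sketch, not the BITS implementation) compares the observed value of a metric with the values obtained from randomized data. The function names and the toy numbers below are assumptions. Note that with n simulations the smallest p-value that can be resolved is 1/n, so with n = 1000 an observation more extreme than every simulated value is reported as <0.001.

```python
import random
from statistics import mean


def empirical_p_value(observed, simulated):
    """One-sided p-value for enrichment: fraction of simulated values >= observed."""
    n_extreme = sum(1 for value in simulated if value >= observed)
    return n_extreme / len(simulated)


def format_p_value(p, n_samples):
    """Report p-values below the resolution of the simulation as an upper bound."""
    floor = 1 / n_samples
    return f"<{floor:g}" if p < floor else f"{p:g}"


# Hypothetical data: overlap counts from 1000 randomizations and one observation.
rng = random.Random(42)
simulated = [rng.randint(0, 20) for _ in range(1000)]
observed = 450

expected = mean(simulated)
p = empirical_p_value(observed, simulated)
print(f"observed={observed} expected={expected:.2f} "
      f"fold={observed / expected:.1f} P={format_p_value(p, len(simulated))}")
```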

Benchmark

We used the example from Tutorial - Interval overlap to perform a rough comparison between various methods. In all cases, we used n = 1000 for simulations. Times are wall-clock times. Please note that this is not a rigorous benchmark.

Method  Set1    Set2          Observed  Expected  P-value  Time
BITS    srf     jurkat        450       5.24      <0.001   43s
BITS    srf     hepg2         381       9.87      <0.001   39s
BITS    srf     hepg2/jurkat  9         5.7       0.13     28s
BITS    jurkat  hepg2         47237     3548      <0.001   106s
GSC     srf     jurkat        -         -         0.0004   58s
GSC     srf     hepg2         -         -         6.9E-11  54s
GSC     srf     hepg2/jurkat  -         -         5.23E-7  40s
GSC     jurkat  hepg2         -         -         0        159s
GAT     srf     jurkat        20183     247.6     <0.001   11s
GAT     srf     hepg2         18965     601.4     <0.001   11s
GAT     srf     hepg2/jurkat  425       327.3     0.21     11s
GAT     jurkat  hepg2         6163503   457332.8  <0.001   316s

BITS and GAT give fairly comparable results, even though they use different metrics for the association (number of segments overlapping versus number of nucleotides overlapping). GAT is quicker on smaller data sets, while BITS is faster on large data sets.
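Because the two tools count different things, their raw observed values cannot be compared directly. One way to put them on a common scale is the ratio of observed to expected. The sketch below computes this ratio from the values in the benchmark table; using fold enrichment as the summary is an assumption made here for illustration.

```python
# (observed, expected) pairs copied from the benchmark table.
benchmark = {
    ("srf", "jurkat"): {"BITS": (450, 5.24), "GAT": (20183, 247.6)},
    ("srf", "hepg2"): {"BITS": (381, 9.87), "GAT": (18965, 601.4)},
    ("srf", "hepg2/jurkat"): {"BITS": (9, 5.7), "GAT": (425, 327.3)},
    ("jurkat", "hepg2"): {"BITS": (47237, 3548), "GAT": (6163503, 457332.8)},
}

# Fold enrichment = observed / expected, per comparison and tool.
folds = {
    pair: {tool: observed / expected for tool, (observed, expected) in tools.items()}
    for pair, tools in benchmark.items()
}

for (set1, set2), by_tool in folds.items():
    print(f"{set1:>6} vs {set2:<12} BITS {by_tool['BITS']:6.2f}  GAT {by_tool['GAT']:6.2f}")
```

On this scale the two tools agree to within roughly 20% for each comparison, for example about 86-fold (BITS) versus about 82-fold (GAT) for srf against jurkat.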

GSC reports a significant association between srf and the DHS intervals specific to hepg2 cells (hepg2/jurkat), while BITS and GAT do not. The absence of an association is the biologically plausible result. It is difficult to say whether there is indeed an association or whether GSC is overestimating it.