Background

History

This module was inspired by the TheAnnotator tool by Gerton Lunter, first used in the analysis by Ponjavic et al. (2007), and by early work of Caleb Webber.

Compared with the original Annotator, this module aims to:

Permit more measures of association. The original Annotator used nucleotide overlap, but other measures might be useful, such as the number of elements overlapping by at least x nucleotides or the proximity to the closest element.

Offer an easier user interface and use standard formats.

Permit incremental runs. Annotations can be added without recomputing the samples.

Run faster.
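To make the difference between association measures concrete, here is a minimal Python sketch of two of them: nucleotide overlap and the number of segments overlapping by at least x nucleotides. The function names and the half-open `(start, end)` tuple representation are assumptions made for illustration and are not taken from this module's code.

```python
def merge(intervals):
    """Sort half-open (start, end) intervals and merge overlapping or abutting ones."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def shared_bases(segment, annotations):
    """Number of bases that one segment shares with a list of merged annotations."""
    s_start, s_end = segment
    return sum(max(0, min(s_end, a_end) - max(s_start, a_start))
               for a_start, a_end in annotations)


def nucleotide_overlap(segments, annotations):
    """Measure 1: total number of bases covered by both interval sets."""
    annotations = merge(annotations)
    return sum(shared_bases(seg, annotations) for seg in merge(segments))


def segments_overlapping(segments, annotations, min_bases=1):
    """Measure 2: number of segments sharing at least min_bases bases."""
    annotations = merge(annotations)
    return sum(1 for seg in segments
               if shared_bases(seg, annotations) >= min_bases)


# Hypothetical toy data on a single contig.
segments = [(10, 20), (30, 40), (100, 110)]
annotations = [(15, 35), (105, 106)]

print(nucleotide_overlap(segments, annotations))
print(segments_overlapping(segments, annotations, min_bases=1))
print(segments_overlapping(segments, annotations, min_bases=5))
```

The nested scan is quadratic and only suitable for small examples. The point is that the same pair of interval sets yields different observed values depending on the measure chosen.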

Comparison to other methods

Testing for association between genomic features is a problem of long-standing interest in genomics and has gained considerable traction with the publication of large-scale genomic data sets such as the ENCODE data.

Generally we believe that the problem of testing for association has not been fully resolved and advise every genomicist to apply several methods. The list of tools/services below is not exhaustive:

GREAT (McLean et al. (2010)) uses a binomial test to assess whether transcription factor binding sites are associated with the regulatory domains of genes. GREAT has a convenient web interface with many annotations pre-loaded. Compared to GREAT, GAT can measure depletion and can
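As a rough illustration of a binomial test of this kind, the sketch below treats each binding site as a trial that lands in a regulatory domain with probability equal to the fraction of the genome those domains cover. All numbers and names are hypothetical, and this is not GREAT's implementation.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed by direct summation."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


# Hypothetical inputs, chosen only for illustration.
genome_size = 1_000_000     # bases in the genome
domain_bases = 100_000      # bases covered by the regulatory domains
n_sites = 20                # number of binding sites tested
sites_in_domains = 8        # binding sites that fall into the domains

p_hit = domain_bases / genome_size
p_value = binomial_upper_tail(sites_in_domains, n_sites, p_hit)
print(f"P(X >= {sites_in_domains}) = {p_value:.2e}")
```

An upper tail like this one only addresses enrichment. Testing for depletion would need the lower tail, P(X <= k), instead.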

GenometriCorr (Favorov et al. (2012)) computes a variety of distance metrics when comparing two interval sets and then applies a set of statistical tests to measure association. GenometriCorr is a good exploratory tool to generate hypotheses about the relationships between two sets of genomic intervals. Compared to GenometriCorr, GAT can simulate more realistic genomic scenarios: for example, segments might not occur in certain regions (due to mapping problems) or might occur at reduced frequency (G+C biases).

The GSC metric (Genome Structure Correlation; The ENCODE Project Consortium (2012)) is inspired by the analysis of approximately piecewise stationary time series. It estimates the significance of an association metric by estimating the metric's random expectation from randomly chosen intervals on the genome. This expectation is then used to test whether the observed value of the metric (nucleotide overlap, region overlap, ...) is higher than expected. The method is described in the supplemental details of the first and recent ENCODE papers (Birney et al. (2007), ...) and here.

BITS (Layer et al. (2013)) (Binary Interval Search) is a method to perform quick overlap queries between genomic data sets. It implements a Monte-Carlo simulation method that is particularly suited to all-against-all comparisons among a large number of data sets.
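The simulation-based tools in this list share a common idea: compare an observed association value with the values obtained from randomized data. The sketch below shows that idea in its simplest form, placing each segment uniformly at random on a single contig. It is not the sampler used by GAT, BITS or GSC; the function names, the uniform placement and the add-one correction are all assumptions made for illustration.

```python
import random


def overlap(segments, annotations):
    """Bases shared between segments and non-overlapping annotations."""
    return sum(max(0, min(s_end, a_end) - max(s_start, a_start))
               for s_start, s_end in segments
               for a_start, a_end in annotations)


def randomize(segments, workspace_size, rng):
    """Place each segment at a uniform random position, keeping its length.

    Simplification: randomized segments may overlap each other.
    """
    placed = []
    for start, end in segments:
        length = end - start
        new_start = rng.randrange(workspace_size - length + 1)
        placed.append((new_start, new_start + length))
    return placed


def empirical_p_value(segments, annotations, workspace_size,
                      n_samples=1000, seed=0):
    """Return (observed, expected, p_value) for enrichment of overlap."""
    rng = random.Random(seed)
    observed = overlap(segments, annotations)
    sampled = [overlap(randomize(segments, workspace_size, rng), annotations)
               for _ in range(n_samples)]
    expected = sum(sampled) / n_samples
    n_extreme = sum(1 for value in sampled if value >= observed)
    # Add-one correction (an assumption of this sketch): the estimate
    # can never fall below 1 / (n_samples + 1).
    p_value = (n_extreme + 1) / (n_samples + 1)
    return observed, expected, p_value


# Hypothetical toy data: segments coincide exactly with the annotations.
segments = [(1000, 1100), (5000, 5100)]
annotations = [(1000, 1100), (5000, 5100)]
observed, expected, p_value = empirical_p_value(
    segments, annotations, workspace_size=1_000_000, n_samples=200)
print(observed, expected, p_value)
```

The number of samples bounds the resolution of the p-value, which is why results from a run with 1000 simulations are reported as thresholds such as <0.001 instead of exact small values.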

Benchmark

We used the example from Tutorial - Interval overlap to perform a rough comparison between various methods. In all cases, we used n = 1000 for simulations. Times are wall-clock times. Please note that this is not a rigorous benchmark.

Method	Set1	Set2	Observed	Expected	P-value	Time
BITS	srf	jurkat	450	5.24	<0.001	43s
BITS	srf	hepg2	381	9.87	<0.001	39s
BITS	srf	hepg2/jurkat	9	5.7	0.13	28s
BITS	jurkat	hepg2	47237	3548	<0.001	106s
GSC	srf	jurkat			0.0004	58s
GSC	srf	hepg2			6.9E-11	54s
GSC	srf	hepg2/jurkat			5.23E-7	40s
GSC	jurkat	hepg2			0	159s
GAT	srf	jurkat	20183	247.6	<0.001	11s
GAT	srf	hepg2	18965	601.4	<0.001	11s
GAT	srf	hepg2/jurkat	425	327.3	0.21	11s
GAT	jurkat	hepg2	6163503	457332.8	<0.001	316s

BITS and GAT give fairly comparable results, even though they use different association metrics (number of segments overlapping versus number of nucleotides overlapping). GAT is quicker on the smaller data sets (11s versus 28s to 43s), while BITS is faster on the large jurkat versus hepg2 comparison (106s versus 316s).
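One way to compare the two tools despite their different metrics is the ratio of observed to expected, which is unit-free. The short sketch below computes this fold enrichment from the Observed and Expected columns of the benchmark table. The dictionary layout and the `fold_enrichment` helper are assumptions made for illustration.

```python
# (method, set1, set2) -> (observed, expected), copied from the benchmark table.
results = {
    ("BITS", "srf", "jurkat"): (450, 5.24),
    ("BITS", "srf", "hepg2"): (381, 9.87),
    ("BITS", "srf", "hepg2/jurkat"): (9, 5.7),
    ("BITS", "jurkat", "hepg2"): (47237, 3548),
    ("GAT", "srf", "jurkat"): (20183, 247.6),
    ("GAT", "srf", "hepg2"): (18965, 601.4),
    ("GAT", "srf", "hepg2/jurkat"): (425, 327.3),
    ("GAT", "jurkat", "hepg2"): (6163503, 457332.8),
}


def fold_enrichment(observed, expected):
    """Ratio of the observed to the expected association value."""
    return observed / expected


folds = {key: fold_enrichment(*values) for key, values in results.items()}

for (method, set1, set2), fold in folds.items():
    print(f"{method}\t{set1}\t{set2}\t{fold:.2f}")
```

For srf versus jurkat, for example, the ratios are roughly 86 for BITS and 82 for GAT, and for jurkat versus hepg2 they are both close to 13.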

GSC reports a significant association between srf and the dhs intervals specific to hepg2 cells (the hepg2/jurkat set in the table), while the other two tools do not. The absence of an association is the biologically plausible result. It is difficult to say whether there is indeed an association or whether GSC is overestimating it.