Background
History
This module was inspired by the TheAnnotator tool by Gerton Lunter, first used in the analysis by Ponjavic et al. (2007), and by early work of Caleb Webber.
The differences are:
- permit more measures of association. The original Annotator used nucleotide overlap, but other measures might be useful, such as the number of elements overlapping by at least x nucleotides or the proximity to the closest element.
- an easier user interface that uses standard formats.
- permit incremental runs. Annotations can be added without recomputing the samples.
- faster.
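The measures of association named in the list above can be illustrated with a small sketch. This is not GAT's implementation; the function names are hypothetical, and the code assumes intervals are half-open `(start, end)` tuples on a single contig, sorted and non-overlapping within each set.

```python
def nucleotide_overlap(segments, annotations):
    """Total number of bases shared by two sorted, non-overlapping interval sets."""
    total = 0
    i = j = 0
    while i < len(segments) and j < len(annotations):
        s_start, s_end = segments[i]
        a_start, a_end = annotations[j]
        # Length of the intersection; negative values mean no overlap.
        total += max(0, min(s_end, a_end) - max(s_start, a_start))
        # Advance whichever interval ends first.
        if s_end <= a_end:
            i += 1
        else:
            j += 1
    return total


def segments_overlapping(segments, annotations, min_overlap=1):
    """Number of segments sharing at least min_overlap bases with one annotation."""
    count = 0
    for s_start, s_end in segments:
        if any(min(s_end, a_end) - max(s_start, a_start) >= min_overlap
               for a_start, a_end in annotations):
            count += 1
    return count


def distance_to_closest(segment, annotations):
    """Gap in bases to the nearest annotation (0 if overlapping or adjacent)."""
    s_start, s_end = segment
    return min(max(0, a_start - s_end, s_start - a_end)
               for a_start, a_end in annotations)


segments = [(0, 10), (20, 30)]
annotations = [(5, 25), (40, 50)]
print(nucleotide_overlap(segments, annotations))
print(segments_overlapping(segments, annotations, min_overlap=6))
print(distance_to_closest((20, 30), [(40, 50)]))
```

The same pair of interval sets can look strongly or weakly associated depending on the measure: here both segments overlap an annotation, but neither does so by six bases or more.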
Comparison to other methods
Testing for association between genomic features is a field of long-standing interest in genomics and has gained considerable traction with the publication of large-scale genomic data sets such as the ENCODE data.
Generally we believe that the problem of testing for association has not been fully resolved and advise every genomicist to apply several methods. The list of tools/services below is not exhaustive:
GREAT (McLean et al. (2010)) uses a binomial test to test if transcription factor binding sites are associated with regulatory domains of genes. GREAT has a convenient web interface with many annotations pre-loaded. Compared to GREAT, GAT can measure depletion and can
GenometriCorr (Favorov et al. (2012)) computes a variety of distance metrics when comparing two interval sets and then applies a set of statistical tests to measure association. GenometriCorr is a good exploratory tool for generating hypotheses about the relationships between two sets of genomic intervals. Compared to GenometriCorr, GAT can simulate more realistic genomic scenarios: for example, segments might not occur in certain regions (due to mapping problems) or might occur at reduced frequency (G+C biases).
The GSC (The ENCODE Project Consortium (2012)) metric (for Genome Structure Correlation) is inspired by the analysis of approximately piecewise stationary time series. The GSC metric estimates the significance of an association metric by estimating its random expectation from randomly chosen intervals on the genome. This expectation is then used to test if the observed value of the metric (nucleotide overlap, region overlap, ...) is higher than expected. The method is described in the supplemental details of the first and recent ENCODE papers (Birney et al. (2007), ...) and here.
BITS (Layer et al. (2013)) (Binary Interval Search) is a method to perform quick overlap queries between genomic data sets. It implements a Monte-Carlo method for simulation that is particularly suited to making all-against-all comparisons between a large number of data sets.
Benchmark
We used the example from Tutorial - Interval overlap to perform a rough comparison between various methods. In all cases, we used n = 1000 simulations. Times are wall-clock times.
Please note that this is not a rigorous benchmark.
Method | Set1 | Set2 | Observed | Expected | P-value | Time |
BITS | srf | jurkat | 450 | 5.24 | <0.001 | 43s |
BITS | srf | hepg2 | 381 | 9.87 | <0.001 | 39s |
BITS | srf | hepg2/jurkat | 9 | 5.7 | 0.13 | 28s |
BITS | jurkat | hepg2 | 47237 | 3548 | <0.001 | 106s |
GSC | srf | jurkat | | | 0.0004 | 58s |
GSC | srf | hepg2 | | | 6.9E-11 | 54s |
GSC | srf | hepg2/jurkat | | | 5.23E-7 | 40s |
GSC | jurkat | hepg2 | | | 0 | 159s |
GAT | srf | jurkat | 20183 | 247.6 | <0.001 | 11s |
GAT | srf | hepg2 | 18965 | 601.4 | <0.001 | 11s |
GAT | srf | hepg2/jurkat | 425 | 327.3 | 0.21 | 11s |
GAT | jurkat | hepg2 | 6163503 | 457332.8 | <0.001 | 316s |
BITS and GAT give comparable results, even though they use different metrics for the association (number of segments overlapping versus number of nucleotides overlapping). GAT is quicker on the smaller data sets (11s versus 28-43s), while BITS is faster on the large jurkat versus hepg2 comparison (106s versus 316s).
GSC reports a significant association between srf and the dhs intervals specific to hepg2 cells (the hepg2/jurkat row), while the other two tools do not; the latter is the biologically plausible result. It is difficult to say whether there is indeed an association or whether GSC is overestimating it.
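The Observed, Expected and P-value columns of a sampling-based method can be read with the following sketch in mind. It is not taken from any of the tools above: it assumes that the expected value is the mean of the metric over the simulations and that the P-value is the one-sided fraction of simulations at least as extreme as the observed value. The function name `empirical_stats` and the floor of 1/n on the P-value are assumptions of this sketch.

```python
def empirical_stats(observed, simulated):
    """Summarise an observed metric against values from randomised data.

    Returns (expected, fold, pvalue). The P-value is one-sided: it tests for
    enrichment when observed >= expected and for depletion otherwise.
    """
    n = len(simulated)
    expected = sum(simulated) / n
    fold = observed / expected if expected > 0 else float("inf")
    if observed >= expected:
        extreme = sum(1 for s in simulated if s >= observed)
    else:
        extreme = sum(1 for s in simulated if s <= observed)
    # Assumption of this sketch: with n simulations, a P-value below 1/n
    # cannot be resolved, so it is floored there.
    pvalue = max(extreme / n, 1.0 / n)
    return expected, fold, pvalue


# Toy values, not benchmark data.
print(empirical_stats(10, [1, 2, 3, 4]))
```

Under these assumptions, n = 1000 simulations cannot resolve a P-value smaller than 0.001, which is consistent with the "<0.001" entries in the table.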