Tutorial - Genomic annotation¶

This tutorial demonstrates the usage of gat with a simple example - where does a transcription factor bind in the genome?

As opposed to Tutorial - Interval overlap we will not be looking at the overlap between one set of intervals with another, but at the overlap of one set of intervals with multiple others.

This tutorial uses the SRF data set described in Valouev et al. (2008). The data sets used in this tutorial are available at:

http://www.cgat.org/~andreas/documentation/gat-examples/TutorialGenomicAnnotation.tar.gz

The data is in srf.hg19.bed.gz. This bed formatted file contains 556 high confidence peaks from the analysis of Valouev et al. (2008) mapped to human chromosome hg19.

We build the analysis in multiple steps. First, we will perform a simple analysis, which will also motivate the use of gat. Later, we will build more sophisticated analyses that take into account the effective genome.

First analysis¶

gat requires three sets of intervals:

a set of segments delineating the active part of the genome (workspace), and
a set of segments of interest (tracks), and
a set of segments with annotations.

gat accepts bed formatted files as input.

As segments of interest we will be using the srf.hgf19.bed.gz containing the results of the ChIP-Seq experiment:

chr5    60627981        60628031        SRF.1
chr5    137801055       137801105       SRF.2
chr5    137800766       137800816       SRF.3
chr7    5570273 5570323 SRF.4
chr5    137827838       137827888       SRF.5
...

As our workspace we will for now use the bed formatted contigs.bed, which simply lists all chromosomes in hg19:

chr13   0       115169878       ws
chr12   0       133851895       ws
chr11   0       135006516       ws
chr10   0       135534747       ws
chr17   0       81195210        ws
...

As our annotations file, we will use the annotations_geneset.bed.gz.

This file required a little more effort to build. We took all protein genes of Ensembl (release 67) and merged the exons of all transcripts of a gene. Based on these gene definitions we then divided genomic regions into intergenic, intronic and exonic regions. We also annotated the UTR (5’ and 3’), and the 5kb flank upstream and downstream. The result is a set of non-overlapping intervals covering the full genome:

...
chr1    362640  367639  5flank
chr1    367640  367658  UTR5
chr1    367659  368594  CDS
chr1    368595  368634  UTR3
chr1    368635  373634  3flank
chr1    373635  616058  intergenic
chr1    616059  621058  3flank
chr1    621059  621098  UTR3
chr1    621099  622034  CDS
chr1    622035  622053  UTR5
chr1    622054  627053  5flank
...

We can now run gat by giving specifying the three input files:

gat-run.py --ignore-segment-tracks --segments=srf.hg19.bed.gz
   --annotations=annotations_geneset.bed.gz --workspace=contigs.bed.gz
--num-samples=1000 --log=gat.log > gat.out

The option –ignore-segment-tracks tells gat to ignore the fourth column in the tracks file and assume that all intervals in this file belong to the same track. If not given, each interval would be treated separately.

The above statement finishes in a few seconds. With large interval collections or many annotations, gat might take a while. It is thus good practice to always save the output in a file. The option –log tells gat to save information or warning messages into a separate log file.

The first 11 columns of the output file are the most informative:

track	annotation	observed	expected	Ci95low	CI95high	stddev	fold	l2fold	pvalue	qvalue
merged	intergenic	5800	14056.3300	13100.0000	15000.0000	583.7181	0.4126	-1.2771	1.0000e-03	1.5714e-03
merged	intronic	8816	10633.8530	9665.0000	11602.0000	592.7589	0.8291	-0.2705	1.0000e-03	1.5714e-03
merged	UTR3	233	278.0720	100.0000	493.0000	117.3112	0.8379	-0.2551	3.6500e-01	4.4611e-01
merged	3flank	800	659.6560	400.0000	1000.0000	175.0544	1.2128	0.2783	2.3100e-01	3.1762e-01
merged	CDS	754	360.7680	161.0000	580.0000	127.2204	2.0900	1.0635	1.0000e-03	1.5714e-03
merged	flank	1334	167.8620	50.0000	350.0000	91.4581	7.9470	2.9904	1.0000e-03	1.5714e-03
merged	5flank	6524	691.5400	400.0000	1000.0000	185.0053	9.4340	3.2379	1.0000e-03	1.5714e-03
merged	UTR5	3441	87.0110	0.0000	200.0000	60.9119	39.5467	5.3055	1.0000e-03	1.5714e-03

The first two columns contain the name of the track and annotation that are being compared. The columns observed and expected give the observed and expected nucleotide overlap, respectively, between the track and annotation.

The following columns CI95low, CI95high, stddev give 95% confidence intervals and the standard deviation of the sample distribution, respectively.

The fold column is the fold enrichment or depletion and is computed as the ratio of observed over expected. The column l2fold is the log2 of this ratio.

The column pvalue gives the empirical p-value, i.e. in what proportion of samples was a higher enrichment or lower depletion found than the one that was observed.

The column qvalue lists a multiple testing corrected p-value. Setting a qvalue threshold and accepting only those comparisons with a qvalue below that threshold corresponds to controlling the false discovery rate at that particular level.

What does this table tell us? Looking at the column observed only, we see that most binding of SRF occurs in intronic and intergenic regions:

Strictly speaking, this is a a naive analysis that does not require gat. The observed overlap alone does not tell us if the overlap we see is more or less than we expect. We do know that there are much more and larger intronic regions than there are UTRs, for example.

More instructive is to look at the enrichment within the various genomic regions, which is given by the fold change.

Here, we clearly see that SRF binds preferentially at transcription start sites (UTR5 and 5flank), while its binding is actually less than expected in introns and intergenic regions.

_images/genomic_annotation_foldchange.png

Binding distribution of SRF with respect to known protein coding genes. Plotted is the log2(fold change). Value not significant are transparent.

The effective genome¶

In the previous analysis we used the complete genome (3.1Gb) as the workspace. However, that is not realistic. For example, SRF will not be predicted in regions that are assembly gaps. Generally speaking, if the workspace is too large, fold enrichment values will be too optimistic.

To get a more accurate estimate of the enrichment in various regions, we should exclude assembly gaps.

The bed formatted file contigs_ungapped.bed.gz contains only those genomic regions that are not assembly gaps (2.86Gb). We can use this file instead:

gat-run.py --ignore-segment-tracks --segments=srf.hg19.bed.gz
   --annotations=annotations_geneset.bed.gz --workspace=contigs_ungapped.bed.gz
   --num-samples=1000 --log=gat.log > gat.out

annotation	observed	expected	fold	l2fold	pvalue	qvalue
intergenic	5800	13806.4540	0.4201	-1.2512	1.0000e-03	2.2000e-03
UTR3	233	303.6340	0.7674	-0.3820	2.5300e-01	3.9757e-01
intronic	8816	11473.2200	0.7684	-0.3801	1.0000e-03	2.2000e-03
3flank	800	713.4290	1.1213	0.1652	3.4000e-01	4.6750e-01
CDS	754	391.1840	1.9275	0.9467	5.0000e-03	9.1667e-03
flank	1334	182.0200	7.3289	2.8736	1.0000e-03	2.2000e-03
5flank	6524	761.1600	8.5711	3.0995	1.0000e-03	2.2000e-03
UTR5	3441	97.3670	35.3405	5.1433	1.0000e-03	2.2000e-03

The associated fold changes change, albeit not much. But have we done enough? The SRF intervals are the result of a ChIP-Seq experiment. Because these were short reads (25bp), not all can be unambiguously mapped to a unique genomic location. This again effectively removes some genomic regions from the analysis.

The bed formatted mapability_36.filtered.bed.gz contains all those genomic regions, that are uniquely mapable with reads of 24 bases. These regions have been derived from the UCSC mapability tracks and reduce the effective genome considerably (1.96Gb).

We could intersect the two bed files ourselves, but we can also supply multiple workspaces to gat. gat will automatically intersect multiple workspaces:

gat-run.py --ignore-segment-tracks --segments=srf.hg19.bed.gz
   --annotations=annotations_geneset.bed.gz
   --workspace=contigs_ungapped.bed.gz
   --workspace=mapability_36.filtered.bed.gz
   --num-samples=1000 --log=gat.log > gat.out

As a consequence of reducing the workspace the fold changes change:

annotation	observed	expected	fold	l2fold	pvalue	qvalue
intergenic	5800	12531.2490	0.4628	-1.1114	1.0000e-03	1.6000e-03
UTR3	233	385.1620	0.6049	-0.7251	1.1000e-01	1.2571e-01
intronic	8816	10942.7440	0.8056	-0.3118	1.0000e-03	1.6000e-03
3flank	800	625.3780	1.2792	0.3553	1.6500e-01	1.6500e-01
CDS	754	540.3700	1.3953	0.4806	8.2000e-02	1.0933e-01
flank	1334	166.6400	8.0053	3.0010	1.0000e-03	1.6000e-03
5flank	6524	638.2110	10.2223	3.3537	1.0000e-03	1.6000e-03
UTR5	3441	122.2010	28.1585	4.8155	1.0000e-03	1.6000e-03