This page describes basic and advanced usage of GAT.
A list of all command-line options is available via:
The gat tool is controlled via the
gat-run.py script. This
script requires the following input:
GAT requires bed formatted files. In its simplest form, GAT is then run as:
gat-run.py --segment-file=segments.bed.gz --workspace-file=workspace.bed.gz --annotation-file=annotations.bed.gz
The script recognizes gzip-compressed files by the suffix .gz.
The principal output is a tab-separated table of pairwise comparisons between
each set of segments of interest and each set of annotations. The table
is written to stdout unless the option
--stdout is given
with a filename to which output should be redirected.
The main columns in the table are:
- the segments of interest track
- the annotations track
- the observed count
- the expected count based on the sampled segments
- the value at the 5th percentile of samples
- the value at the 95th percentile of samples
- the standard deviation of samples
- the fold enrichment, given by the ratio observed / expected
- log2 of the fold enrichment value
- the p-value of enrichment/depletion
- a multiple-testing corrected p-value. See multiple testing correction.
Additionally, there are the following columns:
- number of segments in track in segments of interest
- number of residues covered by track in segments of interest within the workspace
- fraction of residues in track in segments of interest within the workspace
- number of segments in track in annotations
- number of residues covered by track in annotations within the workspace
- fraction of residues in track in annotations within the workspace
- number of segments overlapping between segments of interest and annotations
- number of nucleotides overlapping between segments of interest and annotations
- fraction of residues overlapping between segments of interest and annotations within workspace
- percentage of segments in segments of interest overlapping annotations
- percentage of nucleotides in segments of interest overlapping annotations
- percentage of segments in annotations overlapping segments of interest
- percentage of nucleotides in annotations overlapping segments of interest
- additional description of track (requires
--descriptions to be set).
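As an illustration of how the main columns relate to each other, the following minimal sketch derives them from an observed count and a list of sampled counts. It is not GAT's implementation: the function name, the dictionary keys and the nearest-rank percentile rule are assumptions made for this example, and GAT may compute the percentiles or the fold value differently.

```python
import math
import statistics


def summarize(observed, samples):
    """Summarize one segments/annotations comparison (illustrative only)."""
    ordered = sorted(samples)
    n = len(ordered)
    expected = statistics.mean(ordered)
    # Nearest-rank percentiles; an assumption, GAT may interpolate instead.
    low = ordered[max(0, math.ceil(0.05 * n) - 1)]
    high = ordered[max(0, math.ceil(0.95 * n) - 1)]
    fold = observed / expected  # ratio observed / expected
    return {
        "observed": observed,
        "expected": expected,
        "low": low,
        "high": high,
        "stddev": statistics.pstdev(ordered),
        "fold": fold,
        "log2_fold": math.log2(fold),
    }


row = summarize(observed=40, samples=[10, 20, 30, 20, 20])
print(row["fold"], row["log2_fold"])
```

With these toy samples the expected count is 20, so an observed count of 40 corresponds to a fold enrichment of 2 and a log2 fold of 1.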
Further output files, such as auxiliary summary statistics, go to files
named according to
--filename-output-pattern. The argument to
--filename-output-pattern should contain one %s,
which is then substituted with section names.
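The substitution of section names into a filename pattern can be pictured with the short sketch below. It assumes that the placeholder is the same %s wild card that the intermediate-output options use. The helper name and the section name "counts" are hypothetical and only serve the example.

```python
def output_filename(pattern, section):
    """Replace the single %s placeholder in pattern with a section name."""
    if pattern.count("%s") != 1:
        raise ValueError("pattern must contain exactly one %s placeholder")
    return pattern.replace("%s", section)


# "counts" is a made-up section name used for illustration.
name = output_filename("gat.%s.tsv.gz", "counts")
print(name)
```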
Count here denotes the measure of association and defaults to the number of overlapping nucleotides.
Submitting multiple files
All of the options --segment-file, --workspace-file, --annotation-file can be used several times on the command line. What happens with multiple files depends on the file type:
track name="segmentset1"
chr1 23 100
chr3 50 2000
track name="segmentset2"
chr1 1000 2000
chr3 4000 5000
or alternatively, using the fourth column in a bed formatted file:
chr1 23 100 segmentset1
chr3 50 2000 segmentset1
chr1 1000 2000 segmentset2
chr3 4000 5000 segmentset2
The latter takes precedence. The option --ignore-segment-tracks forces gat to ignore the fourth column and consider all intervals to be from a single interval set.
Be careful with bed files in which each interval has a unique
identifier. Gat will interpret each interval as a separate
segment set to read. This is usually not intended and causes
gat to require a very large amount of memory.
(see the option
By default, tracks cannot be split over multiple files. The option
--enable-split-tracks permits this.
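Before running gat on a bed file, it can help to check how many segment sets the file would define. The following sketch counts intervals per track, using the fourth column when present and the most recent track line otherwise. It is an independent illustration, not gat's parser: the function name and the fallback track name "default" are assumptions.

```python
import re
from collections import Counter


def count_tracks(bed_lines):
    """Count intervals per track name in bed-formatted lines."""
    tracks = Counter()
    current = "default"  # hypothetical name for intervals without a track
    for line in bed_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("track"):
            match = re.search(r'name="?([^"\s]+)"?', line)
            if match:
                current = match.group(1)
            continue
        fields = line.split()
        # The fourth column, if present, takes precedence over track lines.
        name = fields[3] if len(fields) > 3 else current
        tracks[name] += 1
    return tracks


lines = [
    "chr1 23 100 segmentset1",
    "chr3 50 2000 segmentset1",
    "chr1 1000 2000 segmentset2",
    "chr3 4000 5000 segmentset2",
]
tracks = count_tracks(lines)
print(len(tracks), "tracks for", sum(tracks.values()), "intervals")
```

If the number of tracks equals the number of intervals, every interval probably carries a unique identifier, which is the situation the warning above describes.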
Isochores are genomic segments with common properties that are potentially correlated with the segments of interest and the annotations, but the correlation is not of interest here. For example, consider a ChIP-Seq experiment that tests whether ChIP-Seq intervals are close to genes. G+C rich regions in the genome are gene rich, while at the same time there is possibly a nucleotide composition bias in the ChIP-Seq protocol that depletes A+T rich sequence. An association between genes and ChIP-Seq intervals might simply be due to the G+C effect. Using isochores can control for this effect to some extent.
Isochores split the workspace into smaller workspaces of similar properties, so-called isochore workspaces. Simulations are performed for each isochore workspace separately. At the end, the results for all isochore workspaces are aggregated.
In order to add isochores, use the --isochore-file command line option.
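The splitting of a workspace into isochore workspaces can be sketched as an interval intersection. The example below is a simplified illustration rather than gat's implementation: the function name, the tuple layout and the isochore labels are assumptions, and the quadratic loop is only suitable for small inputs.

```python
from collections import defaultdict


def isochore_workspaces(workspace, isochores):
    """Split workspace intervals by isochore label.

    workspace: list of (chrom, start, end), half-open coordinates.
    isochores: list of (chrom, start, end, label).
    """
    result = defaultdict(list)
    for w_chrom, w_start, w_end in workspace:
        for i_chrom, i_start, i_end, label in isochores:
            if i_chrom != w_chrom:
                continue
            start = max(w_start, i_start)
            end = min(w_end, i_end)
            if start < end:  # keep only non-empty intersections
                result[label].append((w_chrom, start, end))
    return dict(result)


workspace = [("chr1", 0, 1000)]
isochores = [("chr1", 0, 400, "low_gc"), ("chr1", 400, 1200, "high_gc")]
split = isochore_workspaces(workspace, isochores)
print(split)
```

Sampling would then be carried out within each of the returned workspaces, and the per-isochore counts combined afterwards.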
Choosing measures of association
Counters describe the measure of association that is tested. Counters
are selected with the command line option
nucleotide-overlap: number of bases overlapping [default]
segment-overlap: number of intervals in the segments of interest overlapping annotations. A single base-pair overlap is sufficient.
segment-mid-overlap: number of intervals in the segments of interest whose midpoint overlaps annotations.
annotations-overlap: number of intervals in the annotations overlapping segments of interest. A single base-pair overlap is sufficient.
segment-mid-overlap: number of intervals in the annotations whose midpoint overlaps segments of interest.
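The difference between the counters listed above can be illustrated on a toy example. The sketch below is not gat's code: it assumes a single chromosome, half-open (start, end) intervals, and interval sets that do not overlap within themselves. The function names mirror the counter names for readability only.

```python
def nucleotide_overlap(segments, annotations):
    """Number of bases shared between segments and annotations."""
    return sum(
        max(0, min(s_end, a_end) - max(s_start, a_start))
        for s_start, s_end in segments
        for a_start, a_end in annotations
    )


def segment_overlap(segments, annotations):
    """Number of segments overlapping any annotation by at least one base."""
    return sum(
        any(s_start < a_end and a_start < s_end for a_start, a_end in annotations)
        for s_start, s_end in segments
    )


def segment_mid_overlap(segments, annotations):
    """Number of segments whose midpoint falls inside an annotation."""
    return sum(
        any(a_start <= (s_start + s_end) // 2 < a_end for a_start, a_end in annotations)
        for s_start, s_end in segments
    )


segments = [(0, 100), (200, 300), (500, 510)]
annotations = [(50, 220), (400, 505)]

counts = {
    "nucleotide-overlap": nucleotide_overlap(segments, annotations),
    "segment-overlap": segment_overlap(segments, annotations),
    "segment-mid-overlap": segment_mid_overlap(segments, annotations),
    # Counting from the annotations' side swaps the two arguments.
    "annotations-overlap": segment_overlap(annotations, segments),
}
print(counts)
```

In this example all three segments touch an annotation, but only one has its midpoint inside one, which shows how the choice of counter changes the observed count.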
Multiple counters can be given. If only one counter is provided, the
output will be to stdout. Otherwise, separate output files will be
created for each counter. The filename can be controlled with the
Changing the PValue method
Sometimes the lower bound on p-values causes methods that estimate the FDR to fail because the distribution of p-values is atypical. With empirical p-values, this bound is set by the number of samples, so estimating lower p-values requires more samples. Unfortunately, the run-time of gat is directly proportional to the number of samples.
A solution is to set the option
--pvalue-method=norm. In that
case, p-values are estimated by fitting a normal distribution to the
samples. Small p-values are obtained by extrapolating from this fit.
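The contrast between an empirical p-value and one extrapolated from a normal fit can be sketched as follows. This is a minimal illustration for the enrichment case only, not gat's implementation: the function names are assumptions, and the empirical estimate is floored at 1 / number of samples to show the lower bound discussed above.

```python
import math
from statistics import mean, stdev


def empirical_pvalue(observed, samples):
    """Fraction of samples at least as large as observed, floored at 1/n."""
    n = len(samples)
    hits = sum(1 for s in samples if s >= observed)
    return max(hits, 1) / n


def normal_pvalue(observed, samples):
    """Upper-tail probability under a normal distribution fitted to samples."""
    mu = mean(samples)
    sigma = stdev(samples)
    # erfc keeps precision in the far tail, where 1 - cdf would round to 0.
    return 0.5 * math.erfc((observed - mu) / (sigma * math.sqrt(2)))


samples = list(range(90, 111))  # 21 toy sampled counts centred on 100
p_emp = empirical_pvalue(200, samples)
p_norm = normal_pvalue(200, samples)
print(p_emp, p_norm)
```

The empirical estimate cannot drop below 1/21 here, whereas the fitted value is far smaller. The extrapolation is only as good as the assumption that the sampled counts are approximately normal.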
Multiple testing correction
gat provides several methods for controlling the false discovery
rate. The default is to use the Benjamini-Hochberg procedure.
Different methods can be chosen with the --qvalue-method option.
--qvalue-method=storey uses the procedure by Storey et al. (2002) to compute a
q-value for each pairwise comparison. The implementation
is in its functionality equivalent to the qvalue package implemented
Other options are equivalent to the methods as implemented in the
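For reference, the Benjamini-Hochberg procedure named above as the default can be written in a few lines. The sketch below is a generic textbook version, not gat's code, and the function name is an assumption.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


qvalues = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print([round(q, 4) for q in qvalues])
```

Each p-value is scaled by the number of tests divided by its rank, and the running minimum ensures that a smaller p-value never receives a larger adjusted value than a larger one.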
Caching sampling results
gat can save and retrieve samples from a cache
cache_filename does not exist, samples will be saved to the
cache after computation. If
cache_filename already exists,
samples will be retrieved from the cache instead of being re-computed.
Using cached samples is useful when trying different counters
(see Choosing measures of association).
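The save-or-retrieve behaviour described in this section follows a common pattern, sketched below. This is an illustration of the logic only: gat's actual cache format is not described here, and the JSON file, the function names and the toy sample values are assumptions.

```python
import json
import os
import tempfile


def load_or_compute(cache_filename, compute):
    """Return (samples, from_cache), computing and saving on a cache miss."""
    if os.path.exists(cache_filename):
        with open(cache_filename) as handle:
            return json.load(handle), True
    samples = compute()
    with open(cache_filename, "w") as handle:
        json.dump(samples, handle)
    return samples, False


calls = []


def expensive_sampling():
    """Stand-in for the sampling step; records how often it runs."""
    calls.append(1)
    return [12, 15, 9]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "samples.json")
    first, first_cached = load_or_compute(path, expensive_sampling)
    second, second_cached = load_or_compute(path, expensive_sampling)

print(first_cached, second_cached, len(calls))
```

The second call reads the stored samples instead of repeating the sampling step, which is why a cache saves time when only the counter changes between runs.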
If the option
--counts-file is given, gat will skip the sampling
and counting step completely and read observed counts from
Using multiple CPU/cores
GAT can make use of several CPUs/cores if they are available. Use the
--num-threads=# option to specify how many CPUs/cores
GAT will use. The default
--num-threads=0 means that GAT
will not use any multiprocessing.
Outputting intermediate results
A variety of options govern the output of intermediate results by gat.
These options usually accept patterns that represent filenames with
%s as a wild card character. The wild card is replaced with
various keys. Note that the amount of data output can be substantial.
- output counts. One file is created for each counter. Counts output files are required for gat-compare.
- create plots (requires matplotlib). One plot for each annotation is created showing the distribution of expected counts and the observed count. Also, outputs the distribution of p-values and q-values.
- output bed formatted files with individual samples.
The gat-compare tool can be used to test if the fold changes found in two or more different gat experiments are significantly different from each other.
This tool requires the output files with counts created using the
For example, to test whether fold changes are significantly different between two cell lines, execute:
gat-run.py --segments=CD4.bed.gz <...> --output-counts-pattern=CD4.%s.overlap.counts.tsv.gz
gat-run.py --segments=CD14.bed.gz <...> --output-counts-pattern=CD14.%s.overlap.counts.tsv.gz
gat-compare.py CD4.nucleotide-overlap.counts.tsv.gz CD14.nucleotide-overlap.counts.tsv.gz
Plot gat results.