Usage instructions

This page describes basic and advanced usage of GAT.

A list of all command-line options is available via:

gat-run.py --help

Basic usage

The gat tool is controlled via the gat-run.py script. This script requires the following input:

  1. A set of intervals S with segments of interest to test.
  2. A set of intervals A with annotations to test against.
  3. A set of intervals W describing a workspace

GAT requires bed formatted files. In its simplest form, GAT is then run as:

gat-run.py
   --segment-file=segments.bed.gz
   --workspace-file=workspace.bed.gz
   --annotation-file=annotations.bed.gz

The script recognizes gzip compressed files by the suffix .gz.

The principal output is a tab-separated table of pairwise comparisons between each segments of interest and annotations. The table will be written to stdout, unless the option --stdout is given with a filename to which output should be redirected.

The main columns in the table are:

track
the segments of interest track
annotation
the annotations track
observed
the observed count
expected
the expected count based on the sampled segments
CI95low
the value at the 5% percentile of samples
CI95high
the value at the 95% percentile of samples
stddev
the standard deviation of samples
fold
the fold enrichment, given by the ratio observed / expected
l2fold
log2 of the fold enrichment value
pvalue
the p-value of enrichment/depletion
qvalue
a multiple-testing corrected p-value. See multiple testing correction.
Additionally, there are the following columns:
track_nsegments
number of segments in track in segments of interest
track_size
number of residues in covered by track in segments of interest within the workspace
track_density
fraction of residues in track in segments of interest within the workspace
annotation_nsegments
number of segments in track in annotations.
annotation_size
number of residues in covered by track in annotations within the workspace
annotation_density
number of residues in covered by track in annotations within the workspace
overlap_nsegments
number of segments in overlapping between segments of interest and annotations
overlap_size
number of nucleotides overlapping between segments of interest and annotations
overlap_density
fraction of residues overlapping between segments of interest and annotations within workspace
percent_overlap_nsegments_track
percentage of segments in segments of interest overlapping annotations
percent_overlap_size_track
percentage of nucleotides in segments of interest overlapping annotations
percent_overlap_nsegments_annotation
percentage of segments in annotations overlapping segments of interest
percent_overlap_size_annotation
percentage of nucleotides in annotations overlapping segments of interest
description
additional description of track (requires --descriptions to be set).

Further output files such as auxiliary summary statistics go to files named according to --filename-output-pattern. The argument to filename-output-pattern should contain one %s placeholder, which is then substituted with section names.

Count here denotes the measure of association and defaults to number of overlapping nucleotides.

Advanced Usage

Submitting multiple files

All of the options –segment-file, –workspace-file, –annotation-file can be used several times on the command line. What happens with multiple files depends on the file type:

  1. Multiple –segment-file entries are added to the list of segments of interest to test with.
  2. Multiple –annotation-file entries are added to the list of annotations to test against.
  3. Multiple –workspace entries are intersected to create a single workspace.

Generally, gat will test m segments of interest lists against n annotations lists in all m * n combinations.

Within a bed formatted file, different tracks can be separated using a UCSC formatted track line, such as this:

track name="segmentset1"
chr1  23     100
chr3  50     2000
track name="segmentset2"
chr1  1000   2000
chr3  4000   5000

or alternatively, using the fourth column in a bed formatted file:

chr1 23      100     segmentset1
chr3 50      2000    segmentset1
chr1 1000    2000    segmentset2
chr3 4000    5000    segmentset2

The latter takes precedence. The option –ignore-segment-tracks` forces gat to ignore the fourth column and consider all intervals to be from a single interval set.

Note

Be careful with bed-files where each interval gets a unique identifier. Gat will interprete each interval as a separate segment set to read. This is usually not intended and causes gat to require a very large amount of memory. (see the option --ignore-segment-tracks

By default, tracks can not be split over multiple files. The option --enable-split-tracks permits this.

Adding isochores

Isochores are genomic segments with common properties that are potentially correlated with the segments of interest and the annotations, but the correlation is not of interest here. For example, consider a CHiP-Seq experiment and the testing if CHiP-Seq intervals are close to genes. G+C rich regions in the genome are gene rich, while at the same time there is possibly a nucleotide composition bias in the CHiP-Seq protocol depleting A+T rich sequence. An association between genes and CHiP-Seq intervals might simply be due to the G+C effect. Using isochores can control for this effect to some extent.

Isochores split the workspace into smaller workspaces of similar properties, so called isochore workspaces. Simulations are performed for each isochore workspaces separately. At the end, results for each all isochore workspaces are aggregated.

In order to add isochores, use the –isochore-file command line option.

Choosing measures of association

Counters describe the measure of association that is tested. Counters are selected with the command line option --counter. Available counters are:

  1. nucleotide-overlap: number of bases overlapping [default]
  2. segment-overlap: number of intervals intervals in the segments of interest overlapping annotations. A single base-pair overlap is sufficient.
  3. segment-mid-overlap: number of intervals in the segments of interest overlapping at their midpoint annotations.
  4. annotations-overlap: number of intervals in the annotations overlapping segments of interest. A single base-pair overlap is sufficient.
  5. segment-mid-overlap: number of intervals in the annotations overlapping at their midpoint segments of interest

Multiple counters can be given. If only one counter is provided, the output will be to stdout. Otherwise, separate output files will be created each counter. The filename can be controlled with the --output-table-pattern option.

Changing the PValue method

By default, gat returns the empirical p-value based on the sampling procedure. The minimum p-value is 1 / number of samples.

Sometimes the lower bound on p-values causes methods that estimate the FDR to fail as the distribution of p-values is atypical. In order to estimate lower pvalues, the number of samples needs to be increased. Unfortunately, the run-time of gat is directly proportional to the number of samples.

A solution is to set the option --pvalue-method to --pvalue-method=norm. In that case, pvalues are estimated by fitting a normal distribution to the samples. Small p-values are obtained by extrapolating from this fit.

Multiple testing correction

gat provides several methods for controlling the false discovery rate. The default is to use the Benjamini-Hochberg procedure. Different methods can be chosen with the --qvalue-method option.

--qvalue-method=storey uses the procedure by Storey et al. (2002) to compute a q-value for each pairwise comparison. The implementation is in its functionality equivalent to the qvalue package implemented in R.

Other options are equivalent to the methods as implemented in the R function p.adjust.

Caching sampling results

gat can save and retrieve samples from a cache --cache=cache_filename. If cache_filename does not exist, samples will be saved to the cache after computation. If cache_filename does already exist, samples will be retrieved from the cache instead of being re-computed. Using cached samples is useful when trying different counters (see Choosing measures of association).

If the option --counts-file is given, gat will skip the sampling and counting step completely and read observed counts from --count-file=counts_filename.

Using multiple CPU/cores

GAT can make use of several available CPU/cores if available. Use the --num-threads=# option in order to specify how many CPU/cores GAT will make use of. The default --num-threads=0 means that GAT will not use any multiprocessing.

Outputting intermediate results

A variety of options govern the output of intermediate results by gat. These options usually accept patterns that represent filenames with a %s as a wild card character. The wild card is replaced with various keys. Note that the amount of data output can be substantial.

--output-counts-pattern
output counts. One file is created for each counter. Counts output files are required for gat-compare.
--output-plots-pattern
create plots (requires matplotlib). One plot for each annotation is created showing the distribution of expected counts and the observed count. Also, outputs the distribution of p-values and q-values.
--output-samples-pattern
output bed formatted files with individual samples.

Other tools

gat-compare

The gat-compare tool can be used to test if the fold changes found in two or more different gat experiments are significantly different from each other.

This tool requires the output files with counts created using the --output-counts-pattern option.

For example, to compare if fold changes are signficantly different between two cell lines, execute:

gat-run.py --segments=CD4.bed.gz <...>
--output-counts-pattern=CD4.%s.overlap.counts.tsv.gz
gat-run.py --segments=CD14.bed.gz <...>
--output-counts-pattern=CD14.%s.overlap.counts.tsv.gz

gat-compare.py CD4.nucleotide-overlap.counts.tsv.gz CD14.nucleotide-overlap.counts.tsv.gz

gat-plot

Plot gat results.

gat-great

Perform a GREAT analysis:

gat-great.py
   --segment-file=segments.bed.gz
   --workspace-file=workspace.bed.gz
   --annotation-file=annotations.bed.gz