Usage instructions¶
This page describes basic and advanced usage of GAT.
A list of all command-line options is available via:
gat-run.py --help
Basic usage¶
The gat tool is controlled via the gat-run.py script. This script requires the following input:
- A set of intervals S with segments of interest to test.
- A set of intervals A with annotations to test against.
- A set of intervals W describing a workspace.
GAT requires bed formatted files. In its simplest form, GAT is then run as:
gat-run.py \
    --segment-file=segments.bed.gz \
    --workspace-file=workspace.bed.gz \
    --annotation-file=annotations.bed.gz
The script recognizes gzip compressed files by the suffix .gz.
The principal output is a tab-separated table of pairwise comparisons between
each track of segments of interest and each track of annotations. The table
is written to stdout unless the option --stdout is given
with a filename to which output should be redirected.
The main columns in the table are:
- track
- the segments of interest track
- annotation
- the annotations track
- observed
- the observed count
- expected
- the expected count based on the sampled segments
- CI95low
- the value at the 5th percentile of the samples
- CI95high
- the value at the 95th percentile of the samples
- stddev
- the standard deviation of samples
- fold
- the fold enrichment, given by the ratio observed / expected
- l2fold
- log2 of the fold enrichment value
- pvalue
- the p-value of enrichment/depletion
- qvalue
- a multiple-testing corrected p-value. See multiple testing correction.
Additionally, there are the following columns:
- track_nsegments
- number of segments in track in segments of interest
- track_size
- number of residues covered by track in segments of interest within the workspace
- track_density
- fraction of residues in track in segments of interest within the workspace
- annotation_nsegments
- number of segments in track in annotations.
- annotation_size
- number of residues covered by track in annotations within the workspace
- annotation_density
- fraction of residues in track in annotations within the workspace
- overlap_nsegments
- number of segments overlapping between segments of interest and annotations
- overlap_size
- number of nucleotides overlapping between segments of interest and annotations
- overlap_density
- fraction of residues overlapping between segments of interest and annotations within workspace
- percent_overlap_nsegments_track
- percentage of segments in segments of interest overlapping annotations
- percent_overlap_size_track
- percentage of nucleotides in segments of interest overlapping annotations
- percent_overlap_nsegments_annotation
- percentage of segments in annotations overlapping segments of interest
- percent_overlap_size_annotation
- percentage of nucleotides in annotations overlapping segments of interest
- description
- additional description of track (requires --descriptions to be set).
Further output files such as auxiliary summary statistics go to files
named according to --filename-output-pattern. The argument to
--filename-output-pattern should contain one %s placeholder,
which is then substituted with section names.
Count here denotes the measure of association and defaults to number of overlapping nucleotides.
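The relationship between the main columns can be illustrated with a short Python sketch that derives them from one observed count and a list of sampled counts. The function name summarize, the example numbers and the nearest-rank percentile rule are assumptions made for this illustration; gat's own computation may differ in detail.

```python
import math
import statistics

def summarize(observed, samples):
    """Derive table-like columns from one observed count and sampled counts."""
    ordered = sorted(samples)
    n = len(ordered)
    expected = statistics.mean(ordered)
    # Nearest-rank percentiles (an assumption; other interpolation rules exist).
    low = ordered[max(0, math.ceil(0.05 * n) - 1)]
    high = ordered[max(0, math.ceil(0.95 * n) - 1)]
    fold = observed / expected  # undefined when the expected count is zero
    return {
        "observed": observed,
        "expected": expected,
        "CI95low": low,
        "CI95high": high,
        "stddev": statistics.stdev(ordered),
        "fold": fold,
        "l2fold": math.log2(fold),
    }

# Hypothetical counts: 40 observed nucleotides, ten sampled counts around 10.
row = summarize(observed=40, samples=[8, 9, 10, 10, 10, 10, 11, 12, 10, 10])
print(row["fold"], row["l2fold"])
```

With these made-up numbers the observed count is four times the expected count, so the fold enrichment is 4 and l2fold is 2.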
Advanced Usage¶
Submitting multiple files¶
All of the options --segment-file, --workspace-file, --annotation-file can be used several times on the command line. What happens with multiple files depends on the file type:
- Multiple --segment-file entries are added to the list of segments of interest to test with.
- Multiple --annotation-file entries are added to the list of annotations to test against.
- Multiple --workspace-file entries are intersected to create a single workspace.
Generally, gat will test m segments of interest lists against n annotations lists in all m * n combinations.
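As a rough illustration of these two rules, the sketch below intersects two workspaces and enumerates all m * n comparisons. It assumes sorted, non-overlapping, half-open (start, end) intervals on a single chromosome; the function name intersect and the track names are hypothetical, and this is not gat's own code.

```python
from itertools import product

def intersect(a, b):
    """Intersect two sorted, non-overlapping lists of (start, end) intervals."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            result.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result

# Two hypothetical workspaces on one chromosome.
workspace = intersect([(0, 1000), (2000, 3000)], [(500, 2500)])

# m segment lists against n annotation lists give m * n comparisons.
segment_tracks = ["segmentset1", "segmentset2"]
annotation_tracks = ["genes", "enhancers", "repeats"]  # hypothetical names
comparisons = list(product(segment_tracks, annotation_tracks))
print(workspace, len(comparisons))
```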
Within a bed formatted file, different tracks can be separated using a UCSC formatted track line, such as this:
track name="segmentset1"
chr1 23 100
chr3 50 2000
track name="segmentset2"
chr1 1000 2000
chr3 4000 5000
or alternatively, using the fourth column in a bed formatted file:
chr1 23 100 segmentset1
chr3 50 2000 segmentset1
chr1 1000 2000 segmentset2
chr3 4000 5000 segmentset2
The latter takes precedence. The option --ignore-segment-tracks forces gat to ignore the fourth column and consider all intervals to be from a single interval set.
Note
Be careful with bed files in which each interval has a unique
identifier. Gat will interpret each interval as a separate
segment set to read. This is usually not intended and causes
gat to require a very large amount of memory
(see the option --ignore-segment-tracks).
By default, tracks cannot be split over multiple files. The option
--enable-split-tracks permits this.
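The grouping rules described in this section can be sketched in Python as follows. This is an illustrative parser only, not gat's reader: the function name group_tracks, the ignore_tracks parameter and the labels "default" and "merged" are assumptions made for the example.

```python
from collections import defaultdict

def group_tracks(lines, ignore_tracks=False):
    """Group bed lines into named interval sets."""
    tracks = defaultdict(list)
    current = "default"  # hypothetical name for intervals without any label
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("track"):
            # Minimal parsing of a line such as: track name="segmentset1"
            if 'name="' in line:
                current = line.split('name="', 1)[1].split('"', 1)[0]
            continue
        fields = line.split()
        if ignore_tracks:
            name = "merged"  # hypothetical label for the single combined set
        elif len(fields) > 3:
            name = fields[3]  # the fourth column takes precedence
        else:
            name = current
        tracks[name].append((fields[0], int(fields[1]), int(fields[2])))
    return dict(tracks)

bed = [
    "chr1 23 100 segmentset1",
    "chr3 50 2000 segmentset1",
    "chr1 1000 2000 segmentset2",
    "chr3 4000 5000 segmentset2",
]
by_name = group_tracks(bed)
merged = group_tracks(bed, ignore_tracks=True)
print(sorted(by_name), list(merged))
```

If every line carried a unique identifier in the fourth column, by_name would contain one set per interval, which is the situation the note above warns about.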
Adding isochores¶
Isochores are genomic segments with common properties that are potentially correlated with the segments of interest and the annotations, but whose correlation is not of interest here. For example, consider a ChIP-Seq experiment that tests whether ChIP-Seq intervals are close to genes. G+C rich regions in the genome are gene rich, while at the same time there may be a nucleotide composition bias in the ChIP-Seq protocol that depletes A+T rich sequence. An association between genes and ChIP-Seq intervals might then simply be due to the G+C effect. Using isochores can control for this effect to some extent.
Isochores split the workspace into smaller workspaces of similar properties, so-called isochore workspaces. Simulations are performed for each isochore workspace separately. At the end, results for all isochore workspaces are aggregated.
In order to add isochores, use the --isochore-file command line option.
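To make the idea of isochore workspaces concrete, the following sketch splits a workspace by labelled isochore intervals. It assumes half-open (start, end) coordinates on a single chromosome; the function name split_workspace and the labels low_gc and high_gc are hypothetical, and the sketch does not reproduce how gat samples or aggregates within each part.

```python
def split_workspace(workspace, isochores):
    """Split workspace intervals into per-isochore workspaces.

    workspace: list of (start, end) intervals on one chromosome.
    isochores: list of (start, end, label) intervals.
    """
    parts = {}
    for iso_start, iso_end, label in isochores:
        for ws_start, ws_end in workspace:
            start, end = max(iso_start, ws_start), min(iso_end, ws_end)
            if start < end:
                parts.setdefault(label, []).append((start, end))
    return parts

workspace = [(0, 1000)]
isochores = [(0, 300, "low_gc"), (300, 700, "high_gc"), (700, 1200, "low_gc")]
parts = split_workspace(workspace, isochores)
print(parts)
```

Each value in parts plays the role of one isochore workspace in which sampling would be carried out separately.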
Choosing measures of association¶
Counters describe the measure of association that is tested. Counters
are selected with the command line option --counter. Available
counters are:
- nucleotide-overlap: number of bases overlapping [default]
- segment-overlap: number of intervals in the segments of interest overlapping annotations. A single base-pair overlap is sufficient.
- segment-mid-overlap: number of intervals in the segments of interest overlapping annotations at their midpoint.
- annotations-overlap: number of intervals in the annotations overlapping segments of interest. A single base-pair overlap is sufficient.
- annotations-mid-overlap: number of intervals in the annotations overlapping segments of interest at their midpoint.
Multiple counters can be given. If only one counter is provided, the
output will be to stdout. Otherwise, separate output files will be
created for each counter. The filename can be controlled with the
--output-table-pattern option.
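The difference between the counters can be seen in a small Python sketch. It assumes half-open (start, end) intervals on one chromosome and non-overlapping intervals within each list; the function names are chosen for this example and are not gat's internals.

```python
def nucleotide_overlap(segments, annotations):
    """Number of bases shared between two lists of (start, end) intervals."""
    total = 0
    for s_start, s_end in segments:
        for a_start, a_end in annotations:
            total += max(0, min(s_end, a_end) - max(s_start, a_start))
    return total

def segment_overlap(segments, annotations):
    """Number of segments overlapping any annotation by at least one base."""
    return sum(
        any(s_start < a_end and a_start < s_end for a_start, a_end in annotations)
        for s_start, s_end in segments
    )

def segment_mid_overlap(segments, annotations):
    """Number of segments whose midpoint lies inside an annotation."""
    count = 0
    for s_start, s_end in segments:
        mid = (s_start + s_end) // 2
        if any(a_start <= mid < a_end for a_start, a_end in annotations):
            count += 1
    return count

segments = [(0, 100), (200, 300), (400, 500)]
annotations = [(90, 210), (450, 460)]
print(nucleotide_overlap(segments, annotations))   # bases in common
print(segment_overlap(segments, annotations))      # segments touched
print(segment_mid_overlap(segments, annotations))  # segments hit at midpoint
```

Swapping the two arguments gives the annotation-centred variants, for example segment_overlap(annotations, segments) counts annotations that overlap a segment of interest.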
Changing the PValue method¶
By default, gat returns the empirical p-value based on the sampling
procedure. The minimum p-value is 1 / number of samples.
Sometimes the lower bound on p-values causes methods that estimate the FDR to fail because the distribution of p-values is atypical. In order to estimate lower p-values, the number of samples needs to be increased. Unfortunately, the run-time of gat is directly proportional to the number of samples.
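The lower bound can be illustrated with a minimal sketch of a one-sided empirical p-value for enrichment. The function name empirical_pvalue and the exact counting rule are assumptions for illustration; gat's implementation may treat ties and depletion differently.

```python
def empirical_pvalue(observed, samples):
    """One-sided empirical p-value for enrichment, floored at 1 / n."""
    n = len(samples)
    # Count sampled values at least as extreme as the observed one.
    at_least = sum(1 for s in samples if s >= observed)
    # Never report 0: the smallest attainable value is 1 / n.
    return max(at_least, 1) / n

samples = list(range(1000))            # hypothetical sampled counts
p_moderate = empirical_pvalue(990, samples)
p_extreme = empirical_pvalue(2000, samples)
print(p_moderate, p_extreme)
```

However extreme the observed count, p_extreme cannot fall below 1 / 1000 with 1000 samples.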
A solution is to set the option --pvalue-method=norm. In that
case, p-values are estimated by fitting a normal distribution to the
samples. Small p-values are obtained by extrapolating from this fit.
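A sketch of this idea, under the assumption of a one-sided test for enrichment: fit a mean and standard deviation to the samples and read the upper-tail probability from the fitted normal distribution. The function name normal_pvalue is hypothetical and the code is not taken from gat.

```python
import math
import statistics

def normal_pvalue(observed, samples):
    """Upper-tail p-value from a normal distribution fitted to the samples."""
    mu = statistics.mean(samples)
    sigma = statistics.stdev(samples)
    z = (observed - mu) / sigma
    # erfc keeps precision far in the tail, where 1 - cdf would round to 0.
    return 0.5 * math.erfc(z / math.sqrt(2))

samples = [8, 9, 10, 11, 12]           # hypothetical sampled counts
p_center = normal_pvalue(10, samples)  # observed equals the sample mean
p_tail = normal_pvalue(30, samples)    # far beyond any sampled value
print(p_center, p_tail)
```

Unlike the empirical estimate, p_tail can be much smaller than 1 / number of samples, at the cost of relying on the normal approximation.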
Multiple testing correction¶
gat provides several methods for controlling the false discovery
rate. The default is to use the Benjamini-Hochberg procedure.
Different methods can be chosen with the --qvalue-method option.
--qvalue-method=storey uses the procedure by Storey et al. (2002) to compute a
q-value for each pairwise comparison. The implementation
is functionally equivalent to the qvalue package implemented
in R.
Other options are equivalent to the methods as implemented in the
R function p.adjust.
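For readers unfamiliar with the default method, the Benjamini-Hochberg adjustment can be written in a few lines of Python. This is a generic textbook sketch, not gat's code; the function name benjamini_hochberg and the example p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    # Visit p-values from largest to smallest.
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for rank_from_top, idx in enumerate(order):
        rank = n - rank_from_top  # 1-based rank among ascending p-values
        # Scale by n / rank and enforce monotonicity with a running minimum.
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted

qvalues = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(qvalues)
```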
Caching sampling results¶
gat can save samples to and retrieve them from a cache given by --cache=cache_filename.
If cache_filename does not exist, samples will be saved to the
cache after computation. If cache_filename already exists,
samples will be retrieved from the cache instead of being re-computed.
Using cached samples is useful when trying different counters
(see Choosing measures of association).
If the option --counts-file is given, gat will skip the sampling
and counting step completely and read observed counts from
--count-file=counts_filename.
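The save-or-retrieve behaviour of the cache follows a common pattern, sketched below. The use of pickle, the function name load_or_sample and the toy sampler are assumptions for illustration only; gat's cache file format is not described here and is likely different.

```python
import os
import pickle
import random
import tempfile

def load_or_sample(cache_filename, sampler):
    """Return cached samples if the file exists, otherwise compute and store them."""
    if os.path.exists(cache_filename):
        with open(cache_filename, "rb") as handle:
            return pickle.load(handle)
    samples = sampler()
    with open(cache_filename, "wb") as handle:
        pickle.dump(samples, handle)
    return samples

calls = []

def sampler():
    """Toy stand-in for an expensive sampling step."""
    calls.append(1)
    return [random.random() for _ in range(5)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "samples.cache")
    first = load_or_sample(path, sampler)   # computes and writes the cache
    second = load_or_sample(path, sampler)  # reads the cache, no re-sampling
print(first == second, len(calls))
```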
Using multiple CPU/cores¶
GAT can make use of several CPUs/cores if available. Use
the --num-threads=# option to specify how many CPUs/cores
GAT will use. The default, --num-threads=0, means that GAT
will not use any multiprocessing.
Outputting intermediate results¶
A variety of options govern the output of intermediate results by gat.
These options usually accept patterns that represent filenames with
a %s as a wild card character. The wild card is replaced with
various keys. Note that the amount of data output can be substantial.
--output-counts-pattern
- output counts. One file is created for each counter. Counts output files are required for gat-compare.
--output-plots-pattern
- create plots (requires matplotlib). One plot for each annotation is created showing the distribution of expected counts and the observed count. Also, outputs the distribution of p-values and q-values.
--output-samples-pattern
- output bed formatted files with individual samples.
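How a pattern with a single %s wild card expands into one filename per key can be sketched as follows. The function name expand_pattern and the pattern results.%s.tsv.gz are hypothetical, and which keys gat substitutes for a given option is not specified beyond the descriptions above.

```python
def expand_pattern(pattern, keys):
    """Substitute each key for the single %s placeholder in a filename pattern."""
    if pattern.count("%s") != 1:
        raise ValueError("pattern must contain exactly one %s placeholder")
    return {key: pattern % key for key in keys}

# Hypothetical pattern, expanded with two counter names as keys.
files = expand_pattern("results.%s.tsv.gz", ["nucleotide-overlap", "segment-overlap"])
print(files)
```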
Other tools¶
gat-compare¶
The gat-compare tool can be used to test if the fold changes found in two or more different gat experiments are significantly different from each other.
This tool requires the output files with counts created using the
--output-counts-pattern option.
For example, to test whether fold changes are significantly different between two cell lines, execute:
gat-run.py --segments=CD4.bed.gz <...> \
    --output-counts-pattern=CD4.%s.counts.tsv.gz
gat-run.py --segments=CD14.bed.gz <...> \
    --output-counts-pattern=CD14.%s.counts.tsv.gz
gat-compare.py CD4.nucleotide-overlap.counts.tsv.gz CD14.nucleotide-overlap.counts.tsv.gz
gat-plot¶
Plot gat results.