Sampling strategies

Sampling creates a new set of interval P. There are several different strategies possible.

Annotator strategy

In the original Annotator strategy, samples are created in a two step procedure. First, the length of a sample segment is chosen randomly from the empirical segment length distribution. Then, a random coordinate is chosen. If the sampled segment does not overlap with the workspace it is rejected.

Before adding the segment to the sampled set, it is truncated to the workspace.

If it overlaps with a previously sampled segment, the segments are merged. Thus, bases shared between two segments are not counted twice.

Sampling continues, until exactly the same number of bases overlap between the P and the W as do between S and W.

Note that the length distribution of the intervals in P might be different from S.

The length is sampled from the empirical histogram of segment sizes. The granularity of the histogram can be controlled by the options -bucket-size and --nbuckets. The largest segment should be smaller than bucket-size * nbuckets. Increase either if you have large segments in your data set, but smaller values of nbuckets are quicker.

This method is quick if the workspace is large. If it is small, a large number of samples will be rejected and the procedure slows down.

This sampling method is compatible with both distance and overlap based measures of associaton.

Workspaces and isochores

Workspaces limit the genomic space available for segments and annotations. Isochores split a workspace into smaller parts that permit to control for confounding variables like G+C content.

The simplest workspace is the full genome. For some analyses it might be better to limit to analysis to autosomes.

Examples for the use of isochores could be to analyze chromosomes or chromosomal arms separately.

If isochores are used, the length distribution and nucleotide overlaps are counted per isochore ensuring that the same number of nucleotides overlap each isochore in P and S and the length distributions per isochore are comparable.

Empirical length distribution

The empirical length distribution is created from all intervals in S. The full segment length is chosen even if there is partial overlap. Optionally, the segment can be truncated. From Gerton:

What is the best choice depends on the data. Not truncating can lead
to a biased length distribution if it is expected that segments that
only partially overlap the workspace have very different lengths. However,
truncating can lead to spurious short segments.