11 Learn more

This section is always expanding. If you have a contribution and would like to add an explanation for a method related to copy number data, please submit a PR. Contributions are extremely welcome!

11.1 Copy number data from short-read sequencing

11.2 Principle

To infer copy number data from short-read sequencing we count the number of reads that align to a particular region of the genome. Counterintuitively, we are not interested in the individual nucleotides, but in the regions of the genome to which the reads have aligned.
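As a simple illustration, the sketch below counts reads per bin on a single chromosome from simulated alignment positions. The object names, chromosome length, and bin size are hypothetical, and fixed-size bins are used here only for clarity; the VarBin scaffold described below uses bins of variable size.

```r
# Minimal sketch of binned read counting, assuming we already have the
# alignment start positions (in bp) of the reads on one chromosome.
set.seed(42)

chrom_length <- 50e6                                # hypothetical chromosome length
read_starts  <- sort(runif(1e5, 1, chrom_length))   # simulated alignment positions

bin_size   <- 200e3                                 # 200 kb bins, for illustration only
bin_breaks <- seq(1, chrom_length + bin_size, by = bin_size)

# Count how many reads fall into each bin
bin_counts <- table(cut(read_starts, breaks = bin_breaks, right = FALSE))
head(bin_counts)
```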

11.3 VarBin

To align millions of reads in a reasonable time frame, aligners trade off accuracy for speed. Furthermore, the genome is filled with repetitive and hard-to-map regions. As a consequence, reads can be placed in the wrong location; these errors are known as mapping errors.

Mapping errors may be a significant source of bias. Correct estimation of copy number gains and losses depends on accurately controlling for different biases. To infer copy number variations while accounting for these sources of bias, different methods have been developed, including the Variable Binning (VarBin) method.

The VarBin method accounts for mapping bias by partitioning the genome into bins of variable size. The guiding principle is that, if we were to map reads from a diploid genome to this scaffold, each bin would receive an equal number of reads.

To construct the VarBin scaffolds, reads are simulated from a reference genome and mapped back to it. The reference genome is then partitioned into bins of variable size such that each bin receives an equal number of simulated reads.
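A minimal sketch of this principle, assuming we already have the mapped positions of the simulated reads for one chromosome (`sim_positions` and `reads_per_bin` are hypothetical names): bin boundaries are placed so that every bin contains the same number of simulated reads.

```r
# Sketch of the VarBin principle: given the mapped positions of reads simulated
# from the reference, place bin boundaries so that every bin holds the same
# number of simulated reads. In real data, hard-to-map regions attract fewer
# simulated reads and therefore end up with wider bins.
set.seed(1)
sim_positions <- sort(runif(1e5, 1, 50e6))   # hypothetical mapped positions

reads_per_bin <- 400                          # target number of reads per bin
idx           <- seq(1, length(sim_positions), by = reads_per_bin)
bin_starts    <- sim_positions[idx]           # variable-size bin boundaries

summary(diff(bin_starts))                     # bin widths vary along the genome
```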

We can construct scaffolds of different resolutions. Higher resolutions have smaller bin sizes and detect more ‘focal’ copy number events. The choice of resolution depends on several factors, including library complexity, the number of multiplexed cells, and sequencer output, among others. Generally, a target of 1M reads/cell, with a 10% PCR duplicate rate, is sufficient to generate high-quality copy number profiles with the 220kb scaffold.

11.4 GC correction

GC content can be a source of bias in the bin counts. GC normalization is performed as follows:

Both fragment counts and GC counts are binned to a bin-size of choice. A curve describing the conditional mean fragment count per GC value is estimated. The resulting GC curve determines a predicted count for each bin based on the bin’s GC. These predictions can be used directly to normalize the original signal, or as the rates for a heterogeneous Poisson model. Extracted from: Benjamini & Speed, Summarizing and correcting the GC content bias in high-throughput sequencing.

In practice, we estimate this GC curve with loess smoothing and use it to normalize the counts.
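A minimal sketch of this idea, assuming numeric vectors `bin_counts` (reads per bin) and `bin_gc` (G+C fraction of each bin) are available; this illustrates loess-based GC normalization in general, not the exact implementation.

```r
# Sketch of loess-based GC correction, assuming `bin_counts` and `bin_gc`
# are numeric vectors of the same length.
gc_fit <- loess(bin_counts ~ bin_gc, span = 0.3)   # conditional mean count per GC value

# Predicted (GC-expected) count for each bin
expected <- fitted(gc_fit)

# Normalize the observed counts by the GC-expected counts,
# rescaling so the corrected values keep the original mean
corrected_counts <- bin_counts / expected * mean(bin_counts)
```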

11.5 Merge Levels

After segmentation, some adjacent segments differ only slightly, generating spurious breakpoints that are unlikely to represent real copy number events. To remove this effect we perform a Wilcoxon rank-sum test comparing the observed bin values of adjacent segments. Pairs of segments whose difference does not reach significance are merged.
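As a sketch, the helper below tests one pair of adjacent segments with a Wilcoxon rank-sum test and merges them if the difference is not significant. `ratios`, `seg_id`, and the cutoff are hypothetical names and values, and the actual procedure applies this test iteratively over all adjacent segment pairs.

```r
# Merge two adjacent segments when their bin values are not significantly
# different. `ratios` holds the per-bin ratio values and `seg_id` labels the
# segment each bin belongs to.
merge_adjacent <- function(ratios, seg_id, left, right, p_cutoff = 0.05) {
  test <- wilcox.test(ratios[seg_id == left], ratios[seg_id == right])
  if (isTRUE(test$p.value > p_cutoff)) {
    # Difference not significant: collapse the two segments into one
    seg_id[seg_id == right] <- left
  }
  seg_id
}
```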

11.6 How do we infer copy numbers?

We previously partitioned the reference genome into bins of variable size, aligned the reads, and quantified the number of reads contained in each bin.

From these counts, we can infer the copy number of each region relative to the sample’s average read count.

Therefore, we perform a sample-wise normalization of the bin counts by their mean. This way, a value of 1 corresponds to the average copy number of the sample, whereas higher values reflect amplified regions and lower values represent genomic losses. The resulting matrix is known as the ratio matrix, and its segmented counterpart is known as the segment ratios mean matrix.
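A minimal sketch of this normalization, assuming a GC-corrected count matrix `bin_counts_mat` with bins as rows and cells as columns (a hypothetical name and orientation):

```r
# Per-cell normalization of the bin counts by the cell's mean count.
# Each column (cell) is divided by its own mean.
ratio_mat <- sweep(bin_counts_mat, 2, colMeans(bin_counts_mat), FUN = "/")

# A value of 1 is the average copy number of the cell;
# values > 1 suggest gains, values < 1 suggest losses.
```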

Importantly, with this normalization we can’t immediately determine the integer copy number of a given segment. However, several methods have been developed to infer the ploidy of a cell and its segments from short-read sequencing.
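One common strategy, sketched below under the assumption that a vector of segment ratios is available, is to scale the ratios by candidate ploidies and keep the ploidy that brings the scaled values closest to integers. This is a generic illustration, not a description of any specific tool.

```r
# Sketch of one common strategy for recovering integer copy numbers:
# scale the segment ratios of a cell by candidate ploidies and keep the
# ploidy that minimizes the distance to the nearest integers.
infer_ploidy <- function(seg_ratios, candidates = seq(1.5, 6, by = 0.05)) {
  err <- sapply(candidates, function(p) {
    scaled <- seg_ratios * p
    mean((scaled - round(scaled))^2)   # squared distance to the nearest integers
  })
  candidates[which.min(err)]
}

# Integer profile for the chosen ploidy:
# ploidy <- infer_ploidy(seg_ratios)
# int_cn <- round(seg_ratios * ploidy)
```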

11.7 Segmentation

11.7.1 CBS

Circular Binary Segmentation (CBS) is a popular segmentation method.

From the help page of the DNAcopy package (see ?DNAcopy::segment):

This function implements the circular binary segmentation (CBS) algorithm of Olshen and Venkatraman (2004). Given a set of genomic data, either continuous or binary, the algorithm recursively splits chromosomes into either two or three subsegments based on a maximum t-statistic. A reference distribution, used to decide whether or not to split, is estimated by permutation. Options are given to eliminate splits when the means of adjacent segments are not sufficiently far apart. Note that after the first split the α-levels of the tests for splitting are not unconditional.
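A minimal usage sketch of CBS with DNAcopy; `ratios`, `chrom`, and `bin_start` are placeholder objects for the per-bin ratio values and their genomic coordinates, and the parameter values shown are illustrative rather than recommended defaults.

```r
library(DNAcopy)

# Build a CNA object from per-bin log ratios and their genomic coordinates
cna_obj <- CNA(genomdat  = log2(ratios),
               chrom     = chrom,
               maploc    = bin_start,
               data.type = "logratio",
               sampleid  = "cell_1")

# Smooth single-bin outliers before segmentation
smoothed <- smooth.CNA(cna_obj)

# Run CBS; alpha, undo.splits and min.width control the split acceptance
seg <- segment(smoothed, alpha = 0.01, undo.splits = "prune", min.width = 5)

# One row per segment: ID, chrom, loc.start, loc.end, num.mark, seg.mean
head(seg$output)
```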