4 Pre-processing

CopyKit’s pre-processing module workhorse is runVarbin()

runVarbin() is a wrapper for a series of functions that perform three main processes:

  1. Binning and read counting from the .bam files
  2. Applying a variance stabilization transformation to the bin counts
  3. Piece-wise segmentating stabilized bin counts

We load CopyKit with the library() function.

library(copykit)

NOTE: If the BiocParallel framework was registered, CopyKit functions will, whenever possible, run in parallel. For more information check the parallelization section.

4.1 From Single Cell BAM files

4.1.1 runVarbin()

The runVarbin() function takes in the file path of the folder where your marked duplicate BAM files are located.

Each cell must be its own BAM file. For Chromium single cell CNA users (10X Genomics), this means splitting the possorted.bam file by cell barcode into individual BAM files.

If files are originated from paired-end sequencing, please ensure that the argument “is_paired_end” is set to TRUE during `runVarbin().

Once run, runVarbin() uses the variable binning method to count the number of reads in each genomic bin. - Learn More!.

A snakemake pipeline is provided here to convert FASTQ files to marked BAM files, but it is not a required step. If you already have marked BAM files from another source, you can skip this step. Note that the reads must be aligned to the same genome assembly as used in CopyKit.

The input for runVarbin() is the path of the folder containing your marked duplicate .bam files. An optional second argument remove_Y provides a convenient shortcut to exclude chromosome Y from the dataset.

To learn about all arguments that runVarbin() accepts use ?runVarbin
tumor <- runVarbin("~/path/to/bam/files/", 
                   remove_Y = TRUE)
## Counting reads for genome hg38 and resolution: 220kb
## 34 bam files had less than 10 mean bincounts and were removed.
## Performing GC correction.
## Running variance stabilization transformation: ft
## Smoothing outlier bins.
## Running segmentation algorithm: CBS for genome hg38
## Merging levels.
## Done.

The runVarbin() function uses the hg38 genome assembly and a 220kb bin resolution by default, but these can be adjusted using the "genome" or "resolution" arguments. If your BAM files come from paired-end sequencing, be sure to set the "is_paired_end" argument to TRUE. (See the function documentation for more information: ?runVarbin)

tumor <- runVarbin("~/path/to/bam/files/", 
                   remove_Y = TRUE,
                   is_paired_end = TRUE)

The resulting object is the CopyKit object.

tumor
## class: CopyKit 
## dim: 11268 1502 
## metadata(3): genome resolution vst
## assays(6): bincounts ft ... ratios logr
## rownames(11268): 1 2 ... 11267 11268
## rowData names(3): gc_content abspos arm
## colnames(1502):
##   PMTC6LiverC100DL1S2_S100_L001_R1_001
##   PMTC6LiverC100DL1S6_S484_L002_R1_001 ...
##   PMTC6LiverC9DL1S5_S393_L002_R1_001
##   PMTC6LiverC9DL6L7S1_S1161_L004_R1_001
## colData names(9): sample reads_assigned_bins ...
##   reads_total percentage_duplicates
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowRanges has: 11268 ranges
## Phylo: Phylogenetic tree with 0 tips and  nodes
## consensus dim: 0 0

CopyKit objects inherit from the SingleCellExperiment class.

Each column corresponds to a single cell and each row corresponds to a bin. The bincounts assay, which can be accessed using the bincounts() function, stores the read counts for each bin.

head(bincounts(tumor)[,1:5])
## # A tibble: 6 × 5
##   PMTC6LiverC100DL1S2_S… PMTC6LiverC100D… PMTC6LiverC100D…
##                    <dbl>            <dbl>            <dbl>
## 1                   37.1            10.3              44.5
## 2                   28.8            12.8              53.1
## 3                   34.8             9.03             50.4
## 4                   24.3            14.2              34.9
## 5                   30.8             9.18             55.8
## 6                   29.3             5.88             43.0
## # … with 2 more variables:
## #   PMTC6LiverC100DL6L7S3_S1252_L004_R1_001 <dbl>,
## #   PMTC6LiverC101DL1S2_S101_L001_R1_001 <dbl>

To address overdispersion, CopyKit performs a variance stabilization transformation (VST) of the count matrix using the Freeman-Tukey transformation. Alternatively, log transformation is also available (see the function documentation for more information ?runVst())

The resulting transformation is stored within the ft assay, which can be accessed with assay():

head(assay(tumor, 'ft')[,1:5])
##   PMTC6LiverC100DL1S2_S100_L001_R1_001
## 1                            12.258603
## 2                            10.832889
## 3                            11.880779
## 4                             9.951377
## 5                            11.181771
## 6                            10.921155
##   PMTC6LiverC100DL1S6_S484_L002_R1_001
## 1                             6.558716
## 2                             7.300770
## 3                             6.172013
## 4                             7.674830
## 5                             6.220463
## 6                             5.047847
##   PMTC6LiverC100DL4L5S1_S868_L003_R1_001
## 1                               13.42067
## 2                               14.63815
## 3                               14.27288
## 4                               11.89424
## 5                               15.01185
## 6                               13.19827
##   PMTC6LiverC100DL6L7S3_S1252_L004_R1_001
## 1                                13.36691
## 2                                14.48431
## 3                                13.82443
## 4                                12.00304
## 5                                12.26513
## 6                                13.20130
##   PMTC6LiverC101DL1S2_S101_L001_R1_001
## 1                             12.50574
## 2                             11.30275
## 3                             11.37859
## 4                             12.23737
## 5                             10.90282
## 6                             12.46409

We use segmentation to divide the genome-ordered bin counts into piecewise constant segments, and the means of these segments can be used to infer copy number states across the genomic regions.

The runVarbin() function allows for the selection of different segmentation methods using the method argument, which can be set to either CBS or multipcf. By default, the function uses the Circular Binary Segmentation (CBS) - Learn More! - method from the DNAcopy package. An additional argument, alpha, controls the significance levels required to accept change-points.

The second segmentation option is the Multi-sample Piecewise Constant Fit (multipcf) segmentation from the copynumber package, which performs a joint segmentation of the samples and identifies common breakpoints across all samples.

The resulting segmentation is stored within the CopyKit object into two different assays: ratios and segment_ratios. Learn More! which can be accessed with the functions ratios() and segment_ratios()

head(ratios(tumor)[,1:5])
##   PMTC6LiverC100DL1S2_S100_L001_R1_001
## 1                                 1.15
## 2                                 0.89
## 3                                 1.08
## 4                                 0.75
## 5                                 0.95
## 6                                 0.91
##   PMTC6LiverC100DL1S6_S484_L002_R1_001
## 1                                 0.90
## 2                                 1.13
## 3                                 0.80
## 4                                 1.25
## 5                                 0.81
## 6                                 0.52
##   PMTC6LiverC100DL4L5S1_S868_L003_R1_001
## 1                                   0.94
## 2                                   1.12
## 3                                   1.06
## 4                                   0.74
## 5                                   1.18
## 6                                   0.91
##   PMTC6LiverC100DL6L7S3_S1252_L004_R1_001
## 1                                    1.06
## 2                                    1.25
## 3                                    1.14
## 4                                    0.86
## 5                                    0.89
## 6                                    1.04
##   PMTC6LiverC101DL1S2_S101_L001_R1_001
## 1                                 1.03
## 2                                 0.84
## 3                                 0.85
## 4                                 0.99
## 5                                 0.78
## 6                                 1.03
head(segment_ratios(tumor)[,1:5])
##   PMTC6LiverC100DL1S2_S100_L001_R1_001
## 1                                    1
## 2                                    1
## 3                                    1
## 4                                    1
## 5                                    1
## 6                                    1
##   PMTC6LiverC100DL1S6_S484_L002_R1_001
## 1                                 0.99
## 2                                 0.99
## 3                                 0.99
## 4                                 0.99
## 5                                 0.99
## 6                                 0.99
##   PMTC6LiverC100DL4L5S1_S868_L003_R1_001
## 1                                      1
## 2                                      1
## 3                                      1
## 4                                      1
## 5                                      1
## 6                                      1
##   PMTC6LiverC100DL6L7S3_S1252_L004_R1_001
## 1                                    1.03
## 2                                    1.03
## 3                                    1.03
## 4                                    1.03
## 5                                    1.03
## 6                                    1.03
##   PMTC6LiverC101DL1S2_S101_L001_R1_001
## 1                                    1
## 2                                    1
## 3                                    1
## 4                                    1
## 5                                    1
## 6                                    1

4.2 runVarbin() modules

The runVarbin() module includes the following functions, which are detailed below. It is important to note that if you have already started from runVarbin(), you do not need to run these functions again. The details of these functions is to enable you to run the modules under different conditions without requiring you to re-run all runVarbin() module.

4.2.0.1 runCountReads()

The runCountReads() function counts the reads into bins, and performs GC correction of the counts - Learn More!.

The genome argument defines the genome assembly (“hg38”, “hg19”) and the resolution argument specifies the size of the variable bins (“55kb”, “110kb”, “195kb”, “220kb”, “280kb”, “500kb”, “1Mb”, “2.8Mb”). The remove_Y argument provides a convenient shortcut to exclude the chromosome Y from the dataset.

4.2.0.2 runVst()

The runVst() function performs variance stabilization transformation on the bin counts, using either the freeman-tukey (‘ft’) or ‘log’ transformation.

4.2.0.3 runSegmentation()

The runSegmentation() function runs either the ‘CBS’ or ‘multipcf’ segmentation, and then merges consecutive segments that don’t reach the significance threshold Learn More!. The argument alpha controls the significance threshold for CBS segmentation, whil th gamma argument controls the penalty for the multipcf segmentation.

4.2.0.4 logNorm()

logNorm() performs a log transformation of the segment_ratios assay and stores into the logr assay. The logr assay is used downstream by functions such as runUmap().


4.3 From external count data or user-defined scaffolds

To use CopyKit’s downstream functions with processed datasets, we can create a CopyKit object meeting the following requirements:

  1. Either a bin count matrix where columns represent cells and rows represent bins (to be stored in the bincounts assay) or a segment mean ratios matrix where columns represent cells and rows represent segment mean values for each bin. An integer matrix of copy number calls can also be used at this step (to be stored in the segment_ratios assay).

  2. A GenomicRanges object with length equal to the number of bins from the cell assay matrix. As long as the length requirement is respected this allows user-defined scaffolds to be used within CopyKit.

To construct the CopyKit object, the following mock code is presented as an example. If providing bincounts:

obj <- CopyKit(list(bincounts = cell_bincount_matrix),
               rowRanges = genomic_ranges_scaffold)

If providing bincounts, the functions runVst(), runSegmentation(), and logNorm() can be used to continue with the analysis. To obtain variance stabilized counts, and segment ratio means.

If providing segment mean ratios:

obj <- CopyKit(list(segment_ratios = cell_bincount_matrix),
               rowRanges = genomic_ranges_scaffold)

It is useful to call logNorm() and set the logr slot for downstream functions. The resulting object can then be passed on to the Quality Control and Analysis modules of CopyKit.

Furthermore, it is useful for downstream functions to add the genome assembly used on the dataset.

metadata(obj)$genome <- "hg38"

NOTE: If no bincount matrix is provided, functions that require a matrix of bincounts, such as runMetrics() can’t be used with this object.