5 QC module

The CopyKit quality control module consists of three main functions:

  1. runMetrics()

  2. findAneuploidCells()

  3. findOutliers()

5.1 runMetrics()

runMetrics() adds basic quality control information to colData, namely sample-wise metrics of overdispersion and breakpoint counts.

tumor <- runMetrics(tumor)
## Calculating overdispersion.
## Counting breakpoints.
## Done.
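To make the breakpoint metric concrete, here is a minimal base R sketch that counts the positions where a cell's segment ratio mean changes between consecutive bins. The function name count_breakpoints and the toy vector are hypothetical, and this definition is an assumption for illustration, not CopyKit's internal code. A real implementation may, for instance, need to treat chromosome boundaries separately.

```r
# Illustrative definition (an assumption, not CopyKit's code): a breakpoint is
# any position where the segment mean differs from the previous bin.
count_breakpoints <- function(seg_means) {
  sum(diff(seg_means) != 0)
}

# Hypothetical segment ratio means for one cell across 10 bins (3 segments).
cell <- c(1, 1, 1, 1.5, 1.5, 1.5, 1.5, 0.5, 0.5, 0.5)
n_breakpoints <- count_breakpoints(cell)
print(n_breakpoints)
```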

The resulting information can be viewed with:

colData(tumor)
## DataFrame with 1502 rows and 11 columns
##                                                         sample
##                                                    <character>
## PMTC6LiverC100DL1S2_S100_L001_R1_001    PMTC6LiverC100DL1S2_..
## PMTC6LiverC100DL1S6_S484_L002_R1_001    PMTC6LiverC100DL1S6_..
## PMTC6LiverC100DL4L5S1_S868_L003_R1_001  PMTC6LiverC100DL4L5S..
## PMTC6LiverC100DL6L7S3_S1252_L004_R1_001 PMTC6LiverC100DL6L7S..
## PMTC6LiverC101DL1S2_S101_L001_R1_001    PMTC6LiverC101DL1S2_..
## ...                                                        ...
## PMTC6LiverC99DL4L5S1_S867_L003_R1_001   PMTC6LiverC99DL4L5S1..
## PMTC6LiverC99DL6L7S3_S1251_L004_R1_001  PMTC6LiverC99DL6L7S3..
## PMTC6LiverC9DL1S1_S9_L001_R1_001        PMTC6LiverC9DL1S1_S9..
## PMTC6LiverC9DL1S5_S393_L002_R1_001      PMTC6LiverC9DL1S5_S3..
## PMTC6LiverC9DL6L7S1_S1161_L004_R1_001   PMTC6LiverC9DL6L7S1_..
##                                         reads_assigned_bins
##                                                   <integer>
## PMTC6LiverC100DL1S2_S100_L001_R1_001                 362665
## PMTC6LiverC100DL1S6_S484_L002_R1_001                 130570
## PMTC6LiverC100DL4L5S1_S868_L003_R1_001               536017
## PMTC6LiverC100DL6L7S3_S1252_L004_R1_001              467002
## PMTC6LiverC101DL1S2_S101_L001_R1_001                 423654
## ...                                                     ...
## PMTC6LiverC99DL4L5S1_S867_L003_R1_001                460498
## PMTC6LiverC99DL6L7S3_S1251_L004_R1_001               306486
## PMTC6LiverC9DL1S1_S9_L001_R1_001                     274402
## PMTC6LiverC9DL1S5_S393_L002_R1_001                   465001
## PMTC6LiverC9DL6L7S1_S1161_L004_R1_001                336559
##                                         reads_unmapped
##                                              <integer>
## PMTC6LiverC100DL1S2_S100_L001_R1_001             30250
## PMTC6LiverC100DL1S6_S484_L002_R1_001             22260
## PMTC6LiverC100DL4L5S1_S868_L003_R1_001           30942
## PMTC6LiverC100DL6L7S3_S1252_L004_R1_001          28239
## PMTC6LiverC101DL1S2_S101_L001_R1_001             28756
## ...                                                ...
## PMTC6LiverC99DL4L5S1_S867_L003_R1_001            37945
## PMTC6LiverC99DL6L7S3_S1251_L004_R1_001           28090
## PMTC6LiverC9DL1S1_S9_L001_R1_001                 38326
## PMTC6LiverC9DL1S5_S393_L002_R1_001               34111
## PMTC6LiverC9DL6L7S1_S1161_L004_R1_001            38814
##                                         reads_duplicates
##                                                <integer>
## PMTC6LiverC100DL1S2_S100_L001_R1_001               34883
## PMTC6LiverC100DL1S6_S484_L002_R1_001               12657
## PMTC6LiverC100DL4L5S1_S868_L003_R1_001             58222
## PMTC6LiverC100DL6L7S3_S1252_L004_R1_001            53055
## PMTC6LiverC101DL1S2_S101_L001_R1_001               43008
## ...                                                  ...
## PMTC6LiverC99DL4L5S1_S867_L003_R1_001              54159
## PMTC6LiverC99DL6L7S3_S1251_L004_R1_001             36534
## PMTC6LiverC9DL1S1_S9_L001_R1_001                   25527
## PMTC6LiverC9DL1S5_S393_L002_R1_001                 48055
## PMTC6LiverC9DL6L7S1_S1161_L004_R1_001              37898
##                                         reads_multimapped
##                                                 <integer>
## PMTC6LiverC100DL1S2_S100_L001_R1_001                    0
## PMTC6LiverC100DL1S6_S484_L002_R1_001                    0
## PMTC6LiverC100DL4L5S1_S868_L003_R1_001                  0
## PMTC6LiverC100DL6L7S3_S1252_L004_R1_001                 0
## PMTC6LiverC101DL1S2_S101_L001_R1_001                    0
## ...                                                   ...
## PMTC6LiverC99DL4L5S1_S867_L003_R1_001                   0
## PMTC6LiverC99DL6L7S3_S1251_L004_R1_001                  0
## PMTC6LiverC9DL1S1_S9_L001_R1_001                        0
## PMTC6LiverC9DL1S5_S393_L002_R1_001                      0
## PMTC6LiverC9DL6L7S1_S1161_L004_R1_001                   0
##                                         reads_unassigned
##                                                <integer>
## PMTC6LiverC100DL1S2_S100_L001_R1_001               75555
## PMTC6LiverC100DL1S6_S484_L002_R1_001               28651
## PMTC6LiverC100DL4L5S1_S868_L003_R1_001            110352
## PMTC6LiverC100DL6L7S3_S1252_L004_R1_001            98560
## PMTC6LiverC101DL1S2_S101_L001_R1_001               87786
## ...                                                  ...
## PMTC6LiverC99DL4L5S1_S867_L003_R1_001              96735
## PMTC6LiverC99DL6L7S3_S1251_L004_R1_001             64866
## PMTC6LiverC9DL1S1_S9_L001_R1_001                   59107
## PMTC6LiverC9DL1S5_S393_L002_R1_001                 96126
## PMTC6LiverC9DL6L7S1_S1161_L004_R1_001              71998
##                                         reads_ambiguous
##                                               <integer>
## PMTC6LiverC100DL1S2_S100_L001_R1_001                 90
## PMTC6LiverC100DL1S6_S484_L002_R1_001                 27
## PMTC6LiverC100DL4L5S1_S868_L003_R1_001              103
## PMTC6LiverC100DL6L7S3_S1252_L004_R1_001              91
## PMTC6LiverC101DL1S2_S101_L001_R1_001                 76
## ...                                                 ...
## PMTC6LiverC99DL4L5S1_S867_L003_R1_001                86
## PMTC6LiverC99DL6L7S3_S1251_L004_R1_001               56
## PMTC6LiverC9DL1S1_S9_L001_R1_001                     54
## PMTC6LiverC9DL1S5_S393_L002_R1_001                  109
## PMTC6LiverC9DL6L7S1_S1161_L004_R1_001                60
##                                         reads_total
##                                           <numeric>
## PMTC6LiverC100DL1S2_S100_L001_R1_001         503443
## PMTC6LiverC100DL1S6_S484_L002_R1_001         194165
## PMTC6LiverC100DL4L5S1_S868_L003_R1_001       735636
## PMTC6LiverC100DL6L7S3_S1252_L004_R1_001      646947
## PMTC6LiverC101DL1S2_S101_L001_R1_001         583280
## ...                                             ...
## PMTC6LiverC99DL4L5S1_S867_L003_R1_001        649423
## PMTC6LiverC99DL6L7S3_S1251_L004_R1_001       436032
## PMTC6LiverC9DL1S1_S9_L001_R1_001             397416
## PMTC6LiverC9DL1S5_S393_L002_R1_001           643402
## PMTC6LiverC9DL6L7S1_S1161_L004_R1_001        485329
##                                         percentage_duplicates
##                                                     <numeric>
## PMTC6LiverC100DL1S2_S100_L001_R1_001                    0.069
## PMTC6LiverC100DL1S6_S484_L002_R1_001                    0.065
## PMTC6LiverC100DL4L5S1_S868_L003_R1_001                  0.079
## PMTC6LiverC100DL6L7S3_S1252_L004_R1_001                 0.082
## PMTC6LiverC101DL1S2_S101_L001_R1_001                    0.074
## ...                                                       ...
## PMTC6LiverC99DL4L5S1_S867_L003_R1_001                   0.083
## PMTC6LiverC99DL6L7S3_S1251_L004_R1_001                  0.084
## PMTC6LiverC9DL1S1_S9_L001_R1_001                        0.064
## PMTC6LiverC9DL1S5_S393_L002_R1_001                      0.075
## PMTC6LiverC9DL6L7S1_S1161_L004_R1_001                   0.078
##                                         overdispersion
##                                              <numeric>
## PMTC6LiverC100DL1S2_S100_L001_R1_001        0.00262992
## PMTC6LiverC100DL1S6_S484_L002_R1_001        0.01564945
## PMTC6LiverC100DL4L5S1_S868_L003_R1_001      0.00302844
## PMTC6LiverC100DL6L7S3_S1252_L004_R1_001     0.00379236
## PMTC6LiverC101DL1S2_S101_L001_R1_001        0.00674672
## ...                                                ...
## PMTC6LiverC99DL4L5S1_S867_L003_R1_001       0.00195059
## PMTC6LiverC99DL6L7S3_S1251_L004_R1_001      0.00362179
## PMTC6LiverC9DL1S1_S9_L001_R1_001            0.00406539
## PMTC6LiverC9DL1S5_S393_L002_R1_001          0.00276767
## PMTC6LiverC9DL6L7S1_S1161_L004_R1_001       0.00231466
##                                         breakpoint_count
##                                                <numeric>
## PMTC6LiverC100DL1S2_S100_L001_R1_001                   0
## PMTC6LiverC100DL1S6_S484_L002_R1_001                   0
## PMTC6LiverC100DL4L5S1_S868_L003_R1_001                 1
## PMTC6LiverC100DL6L7S3_S1252_L004_R1_001                0
## PMTC6LiverC101DL1S2_S101_L001_R1_001                   0
## ...                                                  ...
## PMTC6LiverC99DL4L5S1_S867_L003_R1_001                  2
## PMTC6LiverC99DL6L7S3_S1251_L004_R1_001                 1
## PMTC6LiverC9DL1S1_S9_L001_R1_001                       0
## PMTC6LiverC9DL1S5_S393_L002_R1_001                     0
## PMTC6LiverC9DL6L7S1_S1161_L004_R1_001                  2
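As a quick sanity check on the table above, the percentage_duplicates values for the displayed cells match reads_duplicates divided by reads_total, rounded to three decimals. The column therefore appears to hold a fraction between 0 and 1 rather than a percentage. This is inferred from the displayed rows only, not from CopyKit's source. The sketch below recomputes it for the first three cells in a plain data frame (the name qc is hypothetical).

```r
# Values copied from the first three rows of the colData output above.
qc <- data.frame(
  reads_duplicates = c(34883, 12657, 58222),
  reads_total = c(503443, 194165, 735636),
  percentage_duplicates = c(0.069, 0.065, 0.079)
)

# Recompute the duplicate fraction and compare it with the reported column.
qc$recomputed <- round(qc$reads_duplicates / qc$reads_total, 3)
print(qc)
```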

5.2 findAneuploidCells()

Datasets may contain euploid cells mixed with the aneuploid cells.

To detect euploid cells, findAneuploidCells() calculates the sample-wise coefficient of variation from the segment ratio means. The expected coefficient of variation for euploid cells is simulated by drawing x data points from N(0, 0.01), where x is the number of cells in the dataset. An expectation-maximization algorithm then fits a mixture of Gaussian distributions to the observed coefficients of variation together with the simulated values. The distribution containing the simulated values is inferred to be the euploid distribution. Samples that group with the inferred euploid distribution and have a coefficient of variation smaller than the mean of that distribution plus 5 standard deviations are classified as euploid.

The threshold can be changed from automatic detection to a custom value with the argument resolution. For example, with resolution = 0.1, findAneuploidCells() marks as euploid all cells with a coefficient of variation less than or equal to 0.1.
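The procedure described above can be sketched in base R on toy data. This is a simplified illustration under stated assumptions, not CopyKit's implementation: the data are simulated, the 0.01 in N(0, 0.01) is taken to be the standard deviation, and the helper fit_two_gaussians is a hypothetical, hand-written two-component EM.

```r
set.seed(17)

# Hypothetical two-component Gaussian mixture fitted by EM (fixed iterations).
fit_two_gaussians <- function(x, n_iter = 200) {
  mu <- c(min(x), max(x))
  sigma <- rep(sd(x), 2)
  lambda <- c(0.5, 0.5)
  for (i in seq_len(n_iter)) {
    # E-step: posterior probability of each component for each point.
    dens <- cbind(lambda[1] * dnorm(x, mu[1], sigma[1]),
                  lambda[2] * dnorm(x, mu[2], sigma[2]))
    post <- dens / rowSums(dens)
    # M-step: update weights, means, and standard deviations.
    n_k <- colSums(post)
    lambda <- n_k / length(x)
    mu <- colSums(post * x) / n_k
    sigma <- sqrt(colSums(post * outer(x, mu, "-")^2) / n_k)
    sigma <- pmax(sigma, 1e-4)  # guard against a collapsing component
  }
  list(mu = mu, sigma = sigma, lambda = lambda, posterior = post)
}

# Toy segment ratio means: rows are bins, columns are cells.
n_bins <- 200
n_each <- 30
euploid <- sapply(seq_len(n_each), function(i) rnorm(n_bins, mean = 1, sd = 0.01))
shape <- rep(c(-1, 0, 1, 0), each = n_bins / 4)  # loss, neutral, gain, neutral
amps <- seq(0.2, 0.6, length.out = n_each)
aneuploid <- sapply(amps, function(a) 1 + a * shape + rnorm(n_bins, sd = 0.01))
seg <- cbind(euploid, aneuploid)
truth <- rep(c("euploid", "aneuploid"), each = n_each)

# Sample-wise coefficient of variation of the segment ratio means.
cv <- apply(seg, 2, function(x) sd(x) / mean(x))

# Simulated euploid values, one per cell (assumes sd = 0.01).
cv_sim <- rnorm(length(cv), mean = 0, sd = 0.01)
x <- c(cv_sim, cv)
is_sim <- rep(c(TRUE, FALSE), each = length(cv))

# The component that absorbs the simulated values is taken as euploid.
fit <- fit_two_gaussians(x)
euploid_comp <- which.max(colSums(fit$posterior[is_sim, , drop = FALSE]))
auto_resolution <- fit$mu[euploid_comp] + 5 * fit$sigma[euploid_comp]

# Automatic rule: grouped with the euploid component and below the threshold.
cell_comp <- apply(fit$posterior[!is_sim, , drop = FALSE], 1, which.max)
is_euploid_auto <- cell_comp == euploid_comp & cv < auto_resolution

# Custom rule: a fixed threshold on the coefficient of variation.
is_euploid_fixed <- cv <= 0.1

print(table(truth, is_euploid_auto))
```

In this toy setup the two groups are well separated, so the automatic threshold and the fixed threshold of 0.1 give the same classification. On real data the two can differ, which is when a custom resolution becomes useful.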

tumor <- findAneuploidCells(tumor)
## number of iterations= 23
## Copykit detected 610 that are possibly diploid cells using a resolution of: 0.074
## Added information to colData(CopyKit).

The results from findAneuploidCells() are stored within the colData in the column is_aneuploid.

We can visualize the results with plotHeatmap():

plotHeatmap(tumor, label = 'is_aneuploid', row_split = 'is_aneuploid', n_threads = 40)
## order_cells argument is NULL. Samples are ordered according to
##               colnames(CopyKit)
## Plotting Heatmap.

The object can be subset like any other R object, here to keep only the aneuploid cells:

tumor <- tumor[,colData(tumor)$is_aneuploid == TRUE]

5.3 findOutliers()

findOutliers() annotates low-quality cells according to a defined resolution threshold.

To detect low-quality samples, CopyKit calculates the Pearson correlation matrix of all samples from the segment ratio means. Next, it calculates, for each cell, the mean correlation between that cell and its k nearest neighbors (default k = 5). Cells whose mean correlation is lower than the defined threshold (default = 0.9) are classified as low-quality cells.

tumor <- findOutliers(tumor)
## Calculating correlation matrix.
## Marked 99 cells as outliers.
## Adding information to metadata. Access with colData(scCNA).
## Done.

The default correlation cutoff for filtering can be adjusted with the argument resolution. For example, setting resolution = 0.8 will mark all cells with a mean correlation smaller than 0.8 as low-quality cells. Higher resolution values result in stricter filtering criteria.
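The k-nearest-neighbor correlation filter described above can be sketched in base R as follows. This is a minimal illustration on simulated data, assuming the steps exactly as described in this section. The object names (seg, knn_mean, outlier) and the toy clones are hypothetical, and CopyKit's own implementation may differ in its details.

```r
set.seed(1)

# Toy segment ratio means: rows are bins, columns are cells.
n_bins <- 300
clone_a <- rep(c(1, 1.5, 1, 0.5), each = n_bins / 4)
clone_b <- rep(c(0.5, 1, 1.5), each = n_bins / 3)
good_a <- sapply(1:10, function(i) clone_a + rnorm(n_bins, sd = 0.05))
good_b <- sapply(1:10, function(i) clone_b + rnorm(n_bins, sd = 0.05))
noisy <- sapply(1:3, function(i) 1 + rnorm(n_bins, sd = 0.3))  # low-quality cells
seg <- cbind(good_a, good_b, noisy)
colnames(seg) <- c(paste0("a", 1:10), paste0("b", 1:10), paste0("noisy", 1:3))

# Pearson correlation between all pairs of cells.
cor_mat <- cor(seg)
diag(cor_mat) <- NA  # a cell is not its own neighbor

# Mean correlation of each cell with its k most correlated neighbors.
k <- 5
knn_mean <- apply(cor_mat, 1, function(r) mean(sort(r, decreasing = TRUE)[1:k]))

# Cells below the resolution threshold are flagged as outliers.
resolution <- 0.9
outlier <- knn_mean < resolution
print(round(knn_mean, 2))
print(names(outlier)[outlier])
```

Raising resolution in this sketch flags more cells, which mirrors the stricter filtering described above.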

Results from findOutliers() are added to colData (column outlier) marking cells that can be removed or kept.

We can check the results with plotHeatmap(). To make visualization easier, rows can also be split according to elements of colData with the argument row_split.

plotHeatmap(tumor, label = 'outlier', row_split = 'outlier', n_threads = 40)
## order_cells argument is NULL. Samples are ordered according to
##               colnames(CopyKit)
## Plotting Heatmap.

We remove the marked low-quality cells from the object with:

tumor <- tumor[,colData(tumor)$outlier == FALSE]

The dataset should now be ready for downstream analysis.