5 QC module
The CopyKit quality control module consists of three main functions: runMetrics(), findOutliers(), and findAneuploidCells().
5.1 runMetrics()
runMetrics() adds basic quality control information to colData. It computes sample-wise metrics of overdispersion and breakpoint counts.
tumor <- runMetrics(tumor)
## Calculating overdispersion.
## Counting breakpoints.
## Done.
The resulting information can be viewed with:
colData(tumor)
## DataFrame with 1502 rows and 11 columns
## sample
## <character>
## PMTC6LiverC100DL1S2_S100_L001_R1_001 PMTC6LiverC100DL1S2_..
## PMTC6LiverC100DL1S6_S484_L002_R1_001 PMTC6LiverC100DL1S6_..
## PMTC6LiverC100DL4L5S1_S868_L003_R1_001 PMTC6LiverC100DL4L5S..
## PMTC6LiverC100DL6L7S3_S1252_L004_R1_001 PMTC6LiverC100DL6L7S..
## PMTC6LiverC101DL1S2_S101_L001_R1_001 PMTC6LiverC101DL1S2_..
## ... ...
## PMTC6LiverC99DL4L5S1_S867_L003_R1_001 PMTC6LiverC99DL4L5S1..
## PMTC6LiverC99DL6L7S3_S1251_L004_R1_001 PMTC6LiverC99DL6L7S3..
## PMTC6LiverC9DL1S1_S9_L001_R1_001 PMTC6LiverC9DL1S1_S9..
## PMTC6LiverC9DL1S5_S393_L002_R1_001 PMTC6LiverC9DL1S5_S3..
## PMTC6LiverC9DL6L7S1_S1161_L004_R1_001 PMTC6LiverC9DL6L7S1_..
## reads_assigned_bins
## <integer>
## PMTC6LiverC100DL1S2_S100_L001_R1_001 362665
## PMTC6LiverC100DL1S6_S484_L002_R1_001 130570
## PMTC6LiverC100DL4L5S1_S868_L003_R1_001 536017
## PMTC6LiverC100DL6L7S3_S1252_L004_R1_001 467002
## PMTC6LiverC101DL1S2_S101_L001_R1_001 423654
## ... ...
## PMTC6LiverC99DL4L5S1_S867_L003_R1_001 460498
## PMTC6LiverC99DL6L7S3_S1251_L004_R1_001 306486
## PMTC6LiverC9DL1S1_S9_L001_R1_001 274402
## PMTC6LiverC9DL1S5_S393_L002_R1_001 465001
## PMTC6LiverC9DL6L7S1_S1161_L004_R1_001 336559
## reads_unmapped
## <integer>
## PMTC6LiverC100DL1S2_S100_L001_R1_001 30250
## PMTC6LiverC100DL1S6_S484_L002_R1_001 22260
## PMTC6LiverC100DL4L5S1_S868_L003_R1_001 30942
## PMTC6LiverC100DL6L7S3_S1252_L004_R1_001 28239
## PMTC6LiverC101DL1S2_S101_L001_R1_001 28756
## ... ...
## PMTC6LiverC99DL4L5S1_S867_L003_R1_001 37945
## PMTC6LiverC99DL6L7S3_S1251_L004_R1_001 28090
## PMTC6LiverC9DL1S1_S9_L001_R1_001 38326
## PMTC6LiverC9DL1S5_S393_L002_R1_001 34111
## PMTC6LiverC9DL6L7S1_S1161_L004_R1_001 38814
## reads_duplicates
## <integer>
## PMTC6LiverC100DL1S2_S100_L001_R1_001 34883
## PMTC6LiverC100DL1S6_S484_L002_R1_001 12657
## PMTC6LiverC100DL4L5S1_S868_L003_R1_001 58222
## PMTC6LiverC100DL6L7S3_S1252_L004_R1_001 53055
## PMTC6LiverC101DL1S2_S101_L001_R1_001 43008
## ... ...
## PMTC6LiverC99DL4L5S1_S867_L003_R1_001 54159
## PMTC6LiverC99DL6L7S3_S1251_L004_R1_001 36534
## PMTC6LiverC9DL1S1_S9_L001_R1_001 25527
## PMTC6LiverC9DL1S5_S393_L002_R1_001 48055
## PMTC6LiverC9DL6L7S1_S1161_L004_R1_001 37898
## reads_multimapped
## <integer>
## PMTC6LiverC100DL1S2_S100_L001_R1_001 0
## PMTC6LiverC100DL1S6_S484_L002_R1_001 0
## PMTC6LiverC100DL4L5S1_S868_L003_R1_001 0
## PMTC6LiverC100DL6L7S3_S1252_L004_R1_001 0
## PMTC6LiverC101DL1S2_S101_L001_R1_001 0
## ... ...
## PMTC6LiverC99DL4L5S1_S867_L003_R1_001 0
## PMTC6LiverC99DL6L7S3_S1251_L004_R1_001 0
## PMTC6LiverC9DL1S1_S9_L001_R1_001 0
## PMTC6LiverC9DL1S5_S393_L002_R1_001 0
## PMTC6LiverC9DL6L7S1_S1161_L004_R1_001 0
## reads_unassigned
## <integer>
## PMTC6LiverC100DL1S2_S100_L001_R1_001 75555
## PMTC6LiverC100DL1S6_S484_L002_R1_001 28651
## PMTC6LiverC100DL4L5S1_S868_L003_R1_001 110352
## PMTC6LiverC100DL6L7S3_S1252_L004_R1_001 98560
## PMTC6LiverC101DL1S2_S101_L001_R1_001 87786
## ... ...
## PMTC6LiverC99DL4L5S1_S867_L003_R1_001 96735
## PMTC6LiverC99DL6L7S3_S1251_L004_R1_001 64866
## PMTC6LiverC9DL1S1_S9_L001_R1_001 59107
## PMTC6LiverC9DL1S5_S393_L002_R1_001 96126
## PMTC6LiverC9DL6L7S1_S1161_L004_R1_001 71998
## reads_ambiguous
## <integer>
## PMTC6LiverC100DL1S2_S100_L001_R1_001 90
## PMTC6LiverC100DL1S6_S484_L002_R1_001 27
## PMTC6LiverC100DL4L5S1_S868_L003_R1_001 103
## PMTC6LiverC100DL6L7S3_S1252_L004_R1_001 91
## PMTC6LiverC101DL1S2_S101_L001_R1_001 76
## ... ...
## PMTC6LiverC99DL4L5S1_S867_L003_R1_001 86
## PMTC6LiverC99DL6L7S3_S1251_L004_R1_001 56
## PMTC6LiverC9DL1S1_S9_L001_R1_001 54
## PMTC6LiverC9DL1S5_S393_L002_R1_001 109
## PMTC6LiverC9DL6L7S1_S1161_L004_R1_001 60
## reads_total
## <numeric>
## PMTC6LiverC100DL1S2_S100_L001_R1_001 503443
## PMTC6LiverC100DL1S6_S484_L002_R1_001 194165
## PMTC6LiverC100DL4L5S1_S868_L003_R1_001 735636
## PMTC6LiverC100DL6L7S3_S1252_L004_R1_001 646947
## PMTC6LiverC101DL1S2_S101_L001_R1_001 583280
## ... ...
## PMTC6LiverC99DL4L5S1_S867_L003_R1_001 649423
## PMTC6LiverC99DL6L7S3_S1251_L004_R1_001 436032
## PMTC6LiverC9DL1S1_S9_L001_R1_001 397416
## PMTC6LiverC9DL1S5_S393_L002_R1_001 643402
## PMTC6LiverC9DL6L7S1_S1161_L004_R1_001 485329
## percentage_duplicates
## <numeric>
## PMTC6LiverC100DL1S2_S100_L001_R1_001 0.069
## PMTC6LiverC100DL1S6_S484_L002_R1_001 0.065
## PMTC6LiverC100DL4L5S1_S868_L003_R1_001 0.079
## PMTC6LiverC100DL6L7S3_S1252_L004_R1_001 0.082
## PMTC6LiverC101DL1S2_S101_L001_R1_001 0.074
## ... ...
## PMTC6LiverC99DL4L5S1_S867_L003_R1_001 0.083
## PMTC6LiverC99DL6L7S3_S1251_L004_R1_001 0.084
## PMTC6LiverC9DL1S1_S9_L001_R1_001 0.064
## PMTC6LiverC9DL1S5_S393_L002_R1_001 0.075
## PMTC6LiverC9DL6L7S1_S1161_L004_R1_001 0.078
## overdispersion
## <numeric>
## PMTC6LiverC100DL1S2_S100_L001_R1_001 0.00262992
## PMTC6LiverC100DL1S6_S484_L002_R1_001 0.01564945
## PMTC6LiverC100DL4L5S1_S868_L003_R1_001 0.00302844
## PMTC6LiverC100DL6L7S3_S1252_L004_R1_001 0.00379236
## PMTC6LiverC101DL1S2_S101_L001_R1_001 0.00674672
## ... ...
## PMTC6LiverC99DL4L5S1_S867_L003_R1_001 0.00195059
## PMTC6LiverC99DL6L7S3_S1251_L004_R1_001 0.00362179
## PMTC6LiverC9DL1S1_S9_L001_R1_001 0.00406539
## PMTC6LiverC9DL1S5_S393_L002_R1_001 0.00276767
## PMTC6LiverC9DL6L7S1_S1161_L004_R1_001 0.00231466
## breakpoint_count
## <numeric>
## PMTC6LiverC100DL1S2_S100_L001_R1_001 0
## PMTC6LiverC100DL1S6_S484_L002_R1_001 0
## PMTC6LiverC100DL4L5S1_S868_L003_R1_001 1
## PMTC6LiverC100DL6L7S3_S1252_L004_R1_001 0
## PMTC6LiverC101DL1S2_S101_L001_R1_001 0
## ... ...
## PMTC6LiverC99DL4L5S1_S867_L003_R1_001 2
## PMTC6LiverC99DL6L7S3_S1251_L004_R1_001 1
## PMTC6LiverC9DL1S1_S9_L001_R1_001 0
## PMTC6LiverC9DL1S5_S393_L002_R1_001 0
## PMTC6LiverC9DL6L7S1_S1161_L004_R1_001 2
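Because these metrics are ordinary columns, they can be summarized with standard R tools. The sketch below is only an illustration: it uses a plain data.frame named qc (a hypothetical stand-in for colData(tumor)) filled with the first five rows of the output above, so that it runs without CopyKit or the dataset.

```r
# Hypothetical stand-in for colData(tumor): the first five cells shown above.
qc <- data.frame(
  reads_total      = c(503443, 194165, 735636, 646947, 583280),
  overdispersion   = c(0.00262992, 0.01564945, 0.00302844,
                       0.00379236, 0.00674672),
  breakpoint_count = c(0, 0, 1, 0, 0)
)

# Distribution of overdispersion across the cells.
summary(qc$overdispersion)

# Row index of the cell with the highest overdispersion,
# and of the cell with the fewest total reads.
most_dispersed <- which.max(qc$overdispersion)
fewest_reads   <- which.min(qc$reads_total)

qc[most_dispersed, ]
```

With the real object, the same expressions should apply to colData(tumor)$overdispersion and colData(tumor)$reads_total.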
5.2 findAneuploidCells()
Datasets may contain euploid cells mixed with the aneuploid cells.
To detect euploid cells, findAneuploidCells() calculates the sample-wise coefficient of variation from the segment ratio means.
The expected coefficient of variation for euploid cells, N(0, 0.01), is simulated for x data points, where x is equal to the number of cells in the dataset.
An expectation-maximization algorithm is used to fit a mixture of Gaussian distributions to the coefficients of variation from the samples together with the simulated data points.
The distribution containing the simulated data points is inferred to be the euploid distribution.
Samples that group with the inferred euploid distribution and have a coefficient of variation smaller than 5 standard deviations from the mean of the euploid distribution are classified as euploid samples.
The threshold can be changed from the automatic detection to a custom value with the argument resolution. For example, by setting resolution = 0.1, findAneuploidCells() will mark as euploid all cells with a coefficient of variation less than or equal to 0.1.
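The idea behind the coefficient of variation cutoff can be sketched in base R. This is a simplified illustration, not CopyKit's implementation: the segment ratio matrix seg is simulated toy data, 0.01 is assumed to be the standard deviation of the simulated euploid distribution, and the expectation-maximization mixture fit is replaced by a direct cutoff at the mean plus 5 standard deviations of the simulated values. All object names are hypothetical.

```r
set.seed(17)
n_bins <- 500

# Toy segment ratio means (bins x cells): 100 flat euploid-like cells...
euploid <- matrix(rnorm(n_bins * 100, mean = 1, sd = 0.01), nrow = n_bins)

# ...and 100 aneuploid-like cells carrying one gain and one loss.
aneuploid <- matrix(rnorm(n_bins * 100, mean = 1, sd = 0.01), nrow = n_bins)
aneuploid[1:100, ]   <- aneuploid[1:100, ] + 0.5
aneuploid[301:400, ] <- aneuploid[301:400, ] - 0.5

seg <- cbind(euploid, aneuploid)

# Sample-wise coefficient of variation of the segment ratio means.
cv <- apply(seg, 2, function(x) sd(x) / mean(x))

# Simulated euploid expectation, one data point per cell
# (assumption: 0.01 is a standard deviation).
sim_cv <- rnorm(ncol(seg), mean = 0, sd = 0.01)

# Simplification: cut at mean + 5 SD of the simulated values
# instead of fitting a Gaussian mixture by expectation-maximization.
threshold <- mean(sim_cv) + 5 * sd(sim_cv)
is_aneuploid <- cv > threshold

table(is_aneuploid)
```

In this toy setting the flat cells fall well below the cutoff and the cells with copy number events fall well above it, which is the separation the mixture model is meant to find on real data.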
tumor <- findAneuploidCells(tumor)
## number of iterations= 23
## Copykit detected 610 that are possibly diploid cells using a resolution of: 0.074
## Added information to colData(CopyKit).
The results from findAneuploidCells() are stored in colData in the column is_aneuploid.
We can visualize the results with plotHeatmap():
plotHeatmap(tumor, label = 'is_aneuploid', row_split = 'is_aneuploid', n_threads = 40)
## order_cells argument is NULL. Samples are ordered according to
## colnames(CopyKit)
## Plotting Heatmap.
The object can be subset in the same way as any other R object to keep only the aneuploid cells.
tumor <- tumor[, colData(tumor)$is_aneuploid == TRUE]
5.3 findOutliers()
findOutliers() annotates low-quality cells according to a defined resolution threshold.
To detect low-quality samples, CopyKit calculates the Pearson correlation matrix of all samples from the segment ratio means. Next, it calculates a sample-wise mean of the correlation between a cell and its k-nearest-neighbors (default = 5). Cells in which this mean correlation is lower than the defined threshold (default = 0.9) are classified as low-quality cells.
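The steps above can be sketched in base R on toy data. This is an illustration under stated assumptions rather than CopyKit's code: the matrix seg is simulated, a cell's nearest neighbors are assumed to be the k other cells with the highest correlation to it, and self-correlation is assumed to be excluded. All object names are hypothetical.

```r
set.seed(1)
n_bins <- 300

# Shared copy number profile: neutral, gained and lost regions.
profile <- rep(c(1, 1.5, 0.5), each = 100)

# 20 clean cells and 3 very noisy cells (bins x cells).
good  <- sapply(1:20, function(i) profile + rnorm(n_bins, sd = 0.05))
noisy <- sapply(1:3,  function(i) profile + rnorm(n_bins, sd = 1))
seg   <- cbind(good, noisy)

# Pearson correlation matrix between all cells.
cor_mat <- cor(seg, method = "pearson")
diag(cor_mat) <- NA  # assumption: a cell is not its own neighbor

# Mean correlation of each cell with its k most correlated cells.
k <- 5
knn_mean <- apply(cor_mat, 1, function(r) {
  mean(sort(r, decreasing = TRUE)[1:k])  # sort() drops the NA
})

# Cells below the resolution threshold are flagged as low quality.
resolution <- 0.9
outlier <- knn_mean < resolution

table(outlier)
```

Here the three noisy cells correlate poorly even with their best neighbors and are the only ones flagged, while the clean cells stay well above the 0.9 cutoff.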
tumor <- findOutliers(tumor)
## Calculating correlation matrix.
## Marked 99 cells as outliers.
## Adding information to metadata. Access with colData(scCNA).
## Done.
The default correlation cutoff for filtering can be adjusted with the argument resolution. For example, setting resolution = 0.8 will mark all cells with a mean correlation smaller than 0.8 as low-quality cells. Higher resolution values result in stricter filtering.
Results from findOutliers() are added to colData (column outlier), marking cells that can be removed or kept.
We can check the results with plotHeatmap(). To make visualization easier, rows can also be split according to elements of colData with the argument row_split.
plotHeatmap(tumor, label = 'outlier', row_split = 'outlier', n_threads = 40)
## order_cells argument is NULL. Samples are ordered according to
## colnames(CopyKit)
## Plotting Heatmap.
We remove the marked low-quality cells from the object with:
tumor <- tumor[, colData(tumor)$outlier == FALSE]
The dataset should be ready to proceed with the analysis.