GATK4's CalculateContamination reports no hom alt sites found
I have been trying to use GATK4's CalculateContamination but the output is not as expected:
level contamination error
whole_bam 0.0 1.0
The GATK log contained warnings that there were not enough data points to segment and that no hom alt sites were found.
Using GATK jar /mnt/projects/dlho/tancrc/bcbio_pipeline/anaconda/share/gatk44.0.4.00/gatkpackage4.0.4.0local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx16g -jar /mnt/projects/dlho/tancrc/bcbio_pipeline/anaconda/share/gatk44.0.4.00/gatkpackage4.0.4.0local.jar CalculateContamination -I out/BC00203042014_A_getpileupsummaries.table -O out/BC00203042014_A_calculatecontamination.table
Picked up _JAVA_OPTIONS: -XX:+UseSerialGC
09:46:05.758 INFO NativeLibraryLoader  Loading libgkl_compression.so from jar:file:/mnt/projects/dlho/tancrc/bcbio_pipeline/anaconda/share/gatk44.0.4.00/gatkpackage4.0.4.0local.jar!/com/intel/gkl/native/libgkl_compression.so
09:46:05.872 INFO CalculateContamination  
09:46:05.872 INFO CalculateContamination  The Genome Analysis Toolkit (GATK) v4.0.4.0
09:46:05.872 INFO CalculateContamination  For support and documentation go to https://software.broadinstitute.org/gatk/
09:46:05.872 INFO CalculateContamination  Executing as [email protected] on Linux v2.6.32431.el6.x86_64 amd64
09:46:05.872 INFO CalculateContamination  Java runtime: OpenJDK 64Bit Server VM v1.8.0_102b14
09:46:05.873 INFO CalculateContamination  Start Date/Time: May 14, 2018 9:46:05 AM SGT
09:46:05.873 INFO CalculateContamination  
09:46:05.873 INFO CalculateContamination  
09:46:05.873 INFO CalculateContamination  HTSJDK Version: 2.14.3
09:46:05.873 INFO CalculateContamination  Picard Version: 2.18.2
09:46:05.873 INFO CalculateContamination  HTSJDK Defaults.COMPRESSION_LEVEL : 2
09:46:05.873 INFO CalculateContamination  HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
09:46:05.873 INFO CalculateContamination  HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
09:46:05.873 INFO CalculateContamination  HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
09:46:05.873 INFO CalculateContamination  Deflater: IntelDeflater
09:46:05.874 INFO CalculateContamination  Inflater: IntelInflater
09:46:05.874 INFO CalculateContamination  GCS max retries/reopens: 20
09:46:05.874 INFO CalculateContamination  Using googlecloudjava patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/googlecloudjava/tree/dr_all_nio_fixes
09:46:05.874 INFO CalculateContamination  Initializing engine
09:46:05.874 INFO CalculateContamination  Done initializing engine
09:46:05.935 WARN KernelSegmenter  Specified dimension of the kernel approximation (100) exceeds the number of data points (2) to segment; using all data points to calculate kernel matrix.
09:46:05.961 WARN KernelSegmenter  Number of points needed to calculate local changepoint costs (2 * window size = 100) exceeds number of data points (2). Local changepoint costs will not be calculated for this window size.
09:46:05.961 WARN KernelSegmenter  No changepoint candidates were found. The specified window sizes may be inappropriate, or there may be insufficient data points
09:46:06.083 INFO KernelSegmenter  Found 0 changepoints after applying the changepoint penalty.
09:46:06.090 WARN KernelSegmenter  Specified dimension of the kernel approximation (100) exceeds the number of data points (3) to segment; using all data points to calculate kernel matrix.
09:46:06.090 WARN KernelSegmenter  Number of points needed to calculate local changepoint costs (2 * window size = 100) exceeds number of data points (3). Local changepoint costs will not be calculated for this window size.
09:46:06.090 WARN KernelSegmenter  No changepoint candidates were found. The specified window sizes may be inappropriate, or there may be insufficient data points
09:46:06.091 INFO KernelSegmenter  Found 0 changepoints after applying the changepoint penalty.
09:46:06.091 WARN KernelSegmenter  Specified dimension of the kernel approximation (100) exceeds the number of data points (2) to segment; using all data points to calculate kernel matrix.
09:46:06.092 WARN KernelSegmenter  Number of points needed to calculate local changepoint costs (2 * window size = 100) exceeds number of data points (2). Local changepoint costs will not be calculated for this window size.
09:46:06.092 WARN KernelSegmenter  No changepoint candidates were found. The specified window sizes may be inappropriate, or there may be insufficient data points
09:46:06.092 INFO KernelSegmenter  Found 0 changepoints after applying the changepoint penalty.
09:46:06.093 WARN KernelSegmenter  Specified dimension of the kernel approximation (100) exceeds the number of data points (1) to segment; using all data points to calculate kernel matrix.
09:46:06.093 WARN KernelSegmenter  Number of points needed to calculate local changepoint costs (2 * window size = 100) exceeds number of data points (1). Local changepoint costs will not be calculated for this window size.
09:46:06.093 WARN KernelSegmenter  No changepoint candidates were found. The specified window sizes may be inappropriate, or there may be insufficient data points
09:46:06.093 INFO KernelSegmenter  Found 0 changepoints after applying the changepoint penalty.
09:46:06.113 WARN CalculateContamination  No hom alt sites found! Perhaps GetPileupSummaries was run on too small of an interval, or perhaps the sample was extremely inbred or haploid.
09:46:06.116 WARN CalculateContamination  No hom alt sites found! Perhaps GetPileupSummaries was run on too small of an interval, or perhaps the sample was extremely inbred or haploid.
09:46:06.117 WARN CalculateContamination  No hom alt sites found! Perhaps GetPileupSummaries was run on too small of an interval, or perhaps the sample was extremely inbred or haploid.
To get the pileup file required for CalculateContamination, I ran GetPileupSummaries and restricted the region with -L to a BED file containing 77 genes of interest. The pileup file looks normal and contains 311 variants. Is this not enough for CalculateContamination? Can CalculateContamination not be run on small targeted sequencing panels? I would appreciate it if someone could assist, please!
Answers
@phu5ion
Hi,
It looks like you don't have enough data to run the contamination tools. Can you try running on your entire BAM file, or are the 77 genes the entire BAM file? Have a look at this tutorial and the hands-on tutorials in the Presentations section for more information.
Sheila
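One way to see how much data the tool has to work with is to count the sites in the GetPileupSummaries table that look homozygous for the alt allele. The following is a minimal sketch, not GATK's internal logic: the column names (ref_count, alt_count, other_alt_count) are assumed to match the table header, the 0.8 alt-fraction and depth-10 cutoffs are arbitrary illustrative thresholds, and the example rows are made up.

```python
import csv
import io


def summarize_pileup_table(handle, hom_alt_min_fraction=0.8, min_depth=10):
    """Return (total_sites, hom_alt_like_sites) for a pileup summaries table.

    The thresholds are illustrative assumptions, not the values GATK uses.
    """
    # Skip comment/metadata lines that some GATK versions write above the header.
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    total = 0
    hom_alt_like = 0
    for row in reader:
        ref = int(row["ref_count"])
        alt = int(row["alt_count"])
        other = int(row["other_alt_count"])
        depth = ref + alt + other
        total += 1
        # A site counts as "hom-alt-like" when nearly all reads carry the alt allele.
        if depth >= min_depth and alt / depth >= hom_alt_min_fraction:
            hom_alt_like += 1
    return total, hom_alt_like


# Made-up rows for illustration only.
example = io.StringIO(
    "contig\tposition\tref_count\talt_count\tother_alt_count\tallele_frequency\n"
    "1\t1000\t40\t0\t0\t0.05\n"
    "1\t2000\t1\t39\t0\t0.10\n"
    "1\t3000\t20\t18\t0\t0.20\n"
)
print(summarize_pileup_table(example))
```

If the second number is zero or very small for a real table, that would be consistent with the "No hom alt sites found!" warning in the log above.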
Hello,
I have used the pipeline given for GATK3 for my RNA-seq samples. https://gatkforums.broadinstitute.org/gatk/discussion/3892/thegatkbestpracticesforvariantcallingonrnaseqinfulldetail
I saw on your forum that SplitNCigarReads from GATK3 is not available in GATK4, so I am not sure whether I must use GATK4 for all of my steps.
I tried to run GetPileupSummaries and CalculateContamination on the tumor BAM file that I obtained from the GATK3 RNA-seq pipeline. For GetPileupSummaries I used all the VCF files downloaded from gnomAD. I have a similar problem with CalculateContamination. Only the error column is different: it is 0.0, as shown below.
level contamination error
whole_bam 0.0 1.0
Of course, FilterMutectCalls doesn't execute because of my contamination table.
What can I do to handle this problem?
Thank you so much.
Hello,
Sorry for my previous post. I realized that I had used the wrong gnomAD file. I have now downloaded the exomes file from gnomAD and run GetPileupSummaries and CalculateContamination again.
The output of the contamination run is in the file I attached, liftover.txt.
And the output of contamination table is
level contamination error
whole_bam 0.0 0.0
Could you please help me find what is wrong with this file?
Thank you so much.
I also forgot to mention that I have no matched samples. I created a PoN file from my normal samples and created the VCF file using only the tumor sample. Can I still use the GetPileupSummaries, CalculateContamination and FilterMutectCalls commands?
@mine
Hi,
The error message says you "hit memory limit at least once during execution. This may or may not result in some failure." It looks like you need to give more memory to the tools. https://software.broadinstitute.org/gatk/documentation/article?id=11050
Sheila
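As a rough illustration of giving the tools more memory, the sketch below assembles GetPileupSummaries and CalculateContamination command lines with an explicit Java heap size passed through the gatk wrapper's --java-options. All file names are hypothetical placeholders, the 16g value simply mirrors the -Xmx16g seen earlier in this thread, and the matched normal table is treated as optional for the tumor-only case. The helper only builds the argument lists; it does not run anything.

```python
def build_contamination_commands(tumor_bam, common_variants_vcf, intervals,
                                 pileup_table, contamination_table,
                                 java_heap="16g", matched_normal_table=None):
    """Build argument lists for the two contamination steps (hypothetical helper)."""
    java_opts = f"-Xmx{java_heap}"
    pileup_cmd = [
        "gatk", "--java-options", java_opts, "GetPileupSummaries",
        "-I", tumor_bam,
        "-V", common_variants_vcf,   # population resource, e.g. a gnomAD subset
        "-L", intervals,             # intervals or sites to summarize
        "-O", pileup_table,
    ]
    contamination_cmd = [
        "gatk", "--java-options", java_opts, "CalculateContamination",
        "-I", pileup_table,
        "-O", contamination_table,
    ]
    # Only add the matched normal pileup table when one exists.
    if matched_normal_table is not None:
        contamination_cmd += ["-matched", matched_normal_table]
    return pileup_cmd, contamination_cmd


# Placeholder file names, for illustration only.
pileup, contamination = build_contamination_commands(
    "tumor.bam", "common_variants.vcf.gz", "targets.bed",
    "tumor_pileups.table", "tumor_contamination.table",
)
print(" ".join(pileup))
print(" ".join(contamination))
```

The lists can be handed to subprocess.run once the paths are replaced with real ones; whether 16g is sufficient depends on the data and is not something this sketch can determine.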
I didn't notice the error at the end of the file. I know what this error means. I am really sorry for taking up your time. Thank you so much.
Hi, I have a similar problem related to the first part of this thread.
GATK 4.0.11.0, linux server, WES
I read the other threads 1, 2, 3 but I didn't find a solution for how to change the setup.
I used the entire BAM file, which is an old WES. I have an output file and I used it in the FilterMutectCalls step without any error.
Hi @manolis,
Can you please describe your problem? It looks like your CalculateContamination run ran fine and you have a contamination of 0.004.
Sorry! It is about all the warnings.
12:12:44.142 WARN KernelSegmenter  Specified dimension of the kernel approximation (100) exceeds the number of data points (11) to segment; using all data points to calculate kernel matrix.
12:12:44.143 WARN KernelSegmenter  Number of points needed to calculate local changepoint costs (2 * window size = 100) exceeds number of data points (11). Local changepoint costs will not be calculated for this window size.
12:12:44.143 WARN KernelSegmenter  No changepoint candidates were found. The specified window sizes may be inappropriate, or there may be insufficient data points
12:12:44.321 WARN KernelSegmenter  Specified dimension of the kernel approximation (100) exceeds the number of data points (59) to segment; using all data points to calculate kernel matrix.
12:12:44.556 WARN KernelSegmenter  Specified dimension of the kernel approximation (100) exceeds the number of data points (82) to segment; using all data points to calculate kernel matrix.
etc.
How can I fix them?
Many thanks!
Hi @manolis,
I believe these WARNs relate to asking the tool to segment the case by minor allele fraction with
--tumor-segmentation /home/manolis/GATK4/2.BQSR/segments.table
If you remove this parameter, these WARNs should disappear. However, I think you are asking how to overcome these WARNs. Based on the message:
Specified dimension of the kernel approximation (100) exceeds the number of data points (11) to segment; using all data points to calculate kernel matrix.
it appears that there are not enough data points. I think this message means you have 11 data points with which to segment but the kernel approximation requires 100. So the solution, I think, would be to provide more data points.
You should know though that we have a full-fledged workflow for segmentation that we recommend you use instead. The ModelSegments CNV workflow allows you to segment based on allelic data. You can read more about the workflow at https://software.broadinstitute.org/gatk/documentation/article?id=11682 and https://software.broadinstitute.org/gatk/documentation/article?id=11683.
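To see at a glance how far each segmentation attempt falls short, the KernelSegmenter warnings can be pulled out of the log and compared with the kernel approximation dimension. This is a small sketch that relies only on the wording of the WARN lines quoted in this thread; the function name is made up for illustration, and other GATK versions may word the message differently.

```python
import re

# Matches the "Specified dimension ... exceeds the number of data points" WARN line.
DATA_POINTS = re.compile(
    r"WARN\s+KernelSegmenter\s+Specified dimension of the kernel approximation "
    r"\((\d+)\) exceeds the number of data points \((\d+)\)"
)


def data_points_per_segmentation(log_text):
    """Return a list of (kernel_dimension, data_points) pairs found in the log."""
    return [(int(m.group(1)), int(m.group(2))) for m in DATA_POINTS.finditer(log_text)]


# Three of the WARN lines quoted in this thread.
log = """\
12:12:44.142 WARN KernelSegmenter  Specified dimension of the kernel approximation (100) exceeds the number of data points (11) to segment; using all data points to calculate kernel matrix.
12:12:44.321 WARN KernelSegmenter  Specified dimension of the kernel approximation (100) exceeds the number of data points (59) to segment; using all data points to calculate kernel matrix.
12:12:44.556 WARN KernelSegmenter  Specified dimension of the kernel approximation (100) exceeds the number of data points (82) to segment; using all data points to calculate kernel matrix.
"""
for dimension, points in data_points_per_segmentation(log):
    print(f"{points} data points, {dimension - points} short of the kernel dimension {dimension}")
```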
Hi @shlee, as always thank you very much for your time!