To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

GATK's UnifiedGenotyper SNP/Indel Calling Producing No Results

CindyCindy Member
edited December 2013 in Ask the GATK team

Hey there,

I was trying to build an analysis pipeline for paired reads with BWA, Duplicate Removal Local Realignment and Base Quality Score Recalibration to finally use GATK's UnifiedGenotyper for SNP and Indel calling. However, for both SNPs and Indels, I receive no called variants no matter how low my used thresholds are. Quality values of the reads look ok, leaving out dbSNP does not change results. I have used the same reference throughout the whole pipeline. I use GATK 2.7, nevertheless a switch to GATK 1.6 did not change anything.

This is my shell command for SNP calling on chromosome X (GATK delivers no results for all chromosomes):
java -Xmx4g -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R Homo_sapiens_assembly19.fasta -stand_call_conf 30.0 -stand_emit_conf 30.0 -glm SNP -mbq 17 -I test.bam -L X -o test.snps.vcf -D dbsnp_135.hg19.excluding_sites_after_129.vcf

Entries in my BAM file look like this:
SRR389458.1885965 113 X 10092397 37 76M = 10092397 1 CCTGTTTCCCCTGGGGCTGGGCTNGANACTGGGCCCAACCNGTGGCTCCCACCTGCACACACAGGGCTGGAGGGAC 98998999989:99:9:999888#88#79999:;:89998#99:;:88:989:;:91889888:;:9;:::::999 X0:i:1 X1:i:0 BD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN MD:Z:23G2G13C35 PG:Z:MarkDuplicates RG:Z:DEFAULT XG:i:0 BI:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN AM:i:37 NM:i:3 SM:i:37 XM:i:3 XO:i:0 MQ:i:37 XT:A:U
SRR389458.1885965 177 X 10092397 37 76M = 10092397 -1 CCTGTTTCCCCTGGGGCTGGGCTNGANACTGGGCCCAACCNGTGGCTCCCACCTGCACACACAGGGCTGGAGGGAC 98998999989:99:9:999888#88#79999:;:89998#99:;:88:989:;:91889888:;:9;:::::999 X0:i:1 X1:i:0 BD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN MD:Z:23G2G13C35 PG:Z:MarkDuplicates RG:Z:DEFAULT XG:i:0 BI:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN AM:i:37 NM:i:3 SM:i:37 XM:i:3 XO:i:0 MQ:i:37 XT:A:U
SRR389458.1888837 113 X 14748343 37 76M = 14748343 1 TCGTGAAAGTCGTTTTAATTTTAGCGGTTATGGGATGGGTCACTGCCTCCAAGTGAAAGATGGAACAGCTGTCAAG 889999:9988;98:9::9;9986::::99:8:::::999988989:8;;9::989:999:9;9:;:99:98:999 X0:i:1 X1:i:0 BD:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN MD:Z:76 PG:Z:MarkDuplicates RG:Z:DEFAULT XG:i:0 BI:Z:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 MQ:i:37 XT:A:U

And this is the output of the UnifiedGenotyper:
INFO 17:57:00,575 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:57:00,578 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-4-g6f46d11, Compiled 2013/10/10 17:27:51
INFO 17:57:00,578 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 17:57:00,578 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 17:57:00,582 HelpFormatter - Program Args: -T UnifiedGenotyper -R /hana/exchange/reference_genomes/hg19/Homo_sapiens_assembly19.fasta -stand_call_conf 30.0 -stand_emit_conf 30.0 -glm SNP -mbq 17 -I test.bam -L X -o testX.snps.vcf -D dbsnp_135.hg19.excluding_sites_after_129.vcf
INFO 17:57:00,583 HelpFormatter - Date/Time: 2013/12/17 17:57:00
INFO 17:57:00,583 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:57:00,583 HelpFormatter - --------------------------------------------------------------------------------
INFO 17:57:00,943 ArgumentTypeDescriptor - Dynamically determined type of /hana/exchange/reference_genomes/hg19/dbsnp_135.hg19.excluding_sites_after_129.vcf to be VCF
INFO 17:57:01,579 GenomeAnalysisEngine - Strictness is SILENT
INFO 17:57:02,228 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250
INFO 17:57:02,237 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 17:57:02,364 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.13
INFO 17:57:02,594 RMDTrackBuilder - Loading Tribble index from disk for file /hana/exchange/reference_genomes/hg19/dbsnp_135.hg19.excluding_sites_after_129.vcf
INFO 17:57:02,867 IntervalUtils - Processing 155270560 bp from intervals
INFO 17:57:02,943 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO 17:57:03,166 GenomeAnalysisEngine - Done preparing for traversal
INFO 17:57:03,167 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 17:57:03,167 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
INFO 17:57:33,171 ProgressMeter - X:11779845 1.01e+07 30.0 s 2.0 s 7.6% 6.6 m 6.1 m
INFO 17:58:03,173 ProgressMeter - X:24739805 1.89e+07 60.0 s 3.0 s 15.9% 6.3 m 5.3 m
INFO 17:58:33,175 ProgressMeter - X:37330641 3.25e+07 90.0 s 2.0 s 24.0% 6.2 m 4.7 m
INFO 17:59:03,177 ProgressMeter - X:49404077 4.94e+07 120.0 s 2.0 s 31.8% 6.3 m 4.3 m
INFO 17:59:33,178 ProgressMeter - X:64377965 5.36e+07 2.5 m 2.0 s 41.5% 6.0 m 3.5 m
INFO 18:00:03,180 ProgressMeter - X:75606869 7.54e+07 3.0 m 2.0 s 48.7% 6.2 m 3.2 m
INFO 18:00:33,189 ProgressMeter - X:88250233 7.74e+07 3.5 m 2.0 s 56.8% 6.2 m 2.7 m
INFO 18:01:03,190 ProgressMeter - X:100393213 9.94e+07 4.0 m 2.0 s 64.7% 6.2 m 2.2 m
INFO 18:01:33,192 ProgressMeter - X:110535705 1.09e+08 4.5 m 2.0 s 71.2% 6.3 m 109.0 s
INFO 18:02:03,193 ProgressMeter - X:121257489 1.20e+08 5.0 m 2.0 s 78.1% 6.4 m 84.0 s
INFO 18:02:33,195 ProgressMeter - X:132533757 1.32e+08 5.5 m 2.0 s 85.4% 6.4 m 56.0 s
INFO 18:03:03,197 ProgressMeter - X:144498909 1.41e+08 6.0 m 2.0 s 93.1% 6.4 m 26.0 s
INFO 18:03:30,079 ProgressMeter - done 1.55e+08 6.4 m 2.0 s 100.0% 6.4 m 0.0 s
INFO 18:03:30,079 ProgressMeter - Total runtime 386.91 secs, 6.45 min, 0.11 hours
INFO 18:03:30,080 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 150 total reads (0.00%)
INFO 18:03:30,080 MicroScheduler - -> 0 reads (0.00% of total) failing BadMateFilter
INFO 18:03:30,080 MicroScheduler - -> 0 reads (0.00% of total) failing DuplicateReadFilter
INFO 18:03:30,080 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO 18:03:30,081 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO 18:03:30,081 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
INFO 18:03:30,081 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter
INFO 18:03:30,081 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter
INFO 18:03:32,167 GATKRunReport - Uploaded run statistics report to AWS S3

Do I miss anything here?

Best,

Cindy

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi Cindy,

    Based on this line:

    INFO 18:03:30,080 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 150 total reads (0.00%)

    it looks like you have only 150 reads in total over the entire X chromosome. Assuming this is whole genome sequence (since you're not using a target intervals file) you should do a QC analysis to find out why you have so little coverage and whether the problem is genome-wide or just that chromosome.

  • yes, this is a very small data set just for testing if the pipeline delivers any result. Nevertheless, I think UG should at least detect some SNPs here, shouldn't it?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    It depends where the reads are. Maybe there is none that overlaps a SNP. Generally speaking that dataset sounds much too small to do any kind of useful testing.

  • zzyzzy BeijingMember
    edited June 2015

    @Geraldine_VdAuwera said:
    It depends where the reads are. Maybe there is none that overlaps a SNP. Generally speaking that dataset sounds much too small to do any kind of useful testing.

    Hi, Geraldine,
    I happened to come across the same problem, I have my downloaded SAM file from SRA, NCBI.
    After remove duplication (by samtools rmdup), realignment and recalibration (by GATK) I use UnifiedGenotyper to call variants, and there were no variation results.
    I tested a sub set (chr21) of my data and there were still no results even I set -stand_call_conf and -stand_emit_conf to small values like 2.0

    I am using GATK version 3.4-0-g7e26428.

    Here is my command:
    +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++``
    java -XX:+UseParallelGC -XX:ParallelGCThreads=8 -Xmx2g -jar ../GenomeAnalysisTK.jar \
    -R hg19_chr21.fa -T UnifiedGenotyper -I $file \
    --dbsnp /dbSNP/All.vcf.gz \
    -o $one.snp_indel.raw.vcf \
    -nct 8 \
    -glm BOTH \
    -stand_call_conf 10.0 -stand_emit_conf 10.0
    +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    And here is the output:
    +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

    INFO 21:19:02,915 HelpFormatter - Date/Time: 2015/06/06 21:19:02
    INFO 21:19:02,916 HelpFormatter - --------------------------------------------------------------------------------
    INFO 21:19:02,917 HelpFormatter - --------------------------------------------------------------------------------
    INFO 21:19:03,229 GenomeAnalysisEngine - Strictness is SILENT
    INFO 21:19:03,514 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMP
    LE, Target Coverage: 250
    INFO 21:19:03,558 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 21:19:03,663 SAMDataSource$SAMReaders - Done initializing BAM readers: tot
    al time 0.10
    WARN 21:19:04,714 IndexDictionaryUtils - Track dbsnp doesn't have a sequence di
    ctionary built in, skipping dictionary validation
    INFO 21:19:04,742 MicroScheduler - Running the GATK in parallel mode with 8 tot
    al threads, 8 CPU thread(s) for each of 1 data thread(s), of 48 processors avail
    able on this machine
    INFO 21:19:04,947 GenomeAnalysisEngine - Preparing for traversal over 1 BAM fil
    es
    INFO 21:19:05,790 GenomeAnalysisEngine - Done preparing for traversal
    INFO 21:19:05,791 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING
    ]
    INFO 21:19:05,792 ProgressMeter - | processed | time | pe
    r 1M | | total | remaining
    INFO 21:19:05,793 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
    INFO 21:19:05,882 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values.
    INFO 21:19:05,883 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values.
    INFO 21:19:39,070 ProgressMeter - chr21.fa:7378901 0.0 33.0 s 55.0 w 7.4% 7.4 m 6.9 m
    INFO 21:20:09,074 ProgressMeter - chr21.fa:14959593 1.4958592E7 63.0 s 4.0 s 15.0% 7.0 m 5.9 m
    INFO 21:20:39,079 ProgressMeter - chr21.fa:21960861 2.195456E7 93.0 s 4.0 s 22.1% 7.0 m 5.5 m
    INFO 21:21:09,085 ProgressMeter - chr21.fa:29103297 2.8983296E7 2.1 m 4.0 s 29.3% 7.0 m 5.0 m
    INFO 21:21:39,089 ProgressMeter - chr21.fa:34220793 3.4209792E7 2.6 m 4.0 s 34.4% 7.4 m 4.9 m
    INFO 21:22:09,108 ProgressMeter - chr21.fa:39526309 3.9518208E7 3.1 m 4.0 s 39.8% 7.7 m 4.6 m
    INFO 21:22:39,160 ProgressMeter - chr21.fa:44695469 4.4679168E7 3.6 m 4.0 s 44.9% 7.9 m 4.3 m
    INFO 21:22:58,701 ProgressMeter - done 4.8129895E7 3.9 m 4.0 s 48.4% 8.0 m 4.1 m
    INFO 21:22:58,702 ProgressMeter - Total runtime 232.91 secs, 3.88 min, 0.06 hours
    INFO 21:22:58,703 MicroScheduler - 226 reads were filtered out during the traversal out of approximately 656355 total reads (0.03%)
    INFO 21:22:58,703 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter
    INFO 21:22:58,704 MicroScheduler - -> 226 reads (0.03% of total) failing BadMateFilter
    INFO 21:22:58,705 MicroScheduler - -> 0 reads (0.00% of total) failing DuplicateReadFilter
    INFO 21:22:58,705 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
    INFO 21:22:58,706 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
    INFO 21:22:58,707 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
    INFO 21:22:58,707 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter
    INFO 21:22:58,708 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter

    ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

    I am looking forward to your reply, thanks!

    Zeyu

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @zzy
    Hi Zeyu,

    Can you try using Haplotype Caller on the same file and see if that gives any variants?

    Thanks,
    Sheila

  • zzyzzy BeijingMember

    @Sheila said:
    zzy
    Hi Zeyu,

    Can you try using Haplotype Caller on the same file and see if that gives any variants?

    Thanks,
    Sheila

    Thanks, @Sheila
    I tried, still no results there.
    I had converted the downloaded SAM file back to fastq file, and checked the data quality. Something must be wrong that the base qualties are all around 1. But the raw data have a pretty good quality.
    Therefore I am now doing the alignment with raw data by myself instead of just downloading SAM file from SRA, NCBI.

    Maybe my command for download SAM file is wrong:
    sam-dump -r --bzip2 --output-file /path/I/want/to/put/files <Accession>

    Hopefully, I will get the result this time.

    Thanks agian.

    Zeyu

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @zzy
    Hi,

    Is your input bam file only chr21 also? You can try adding -L chr21 to your command to restrict analysis to chr21 only.

    Also, you can try not using -nct in your command.

    I am not sure if reverting back to FASTQ is going to make a difference. Can you post an IGV screenshot of a region where you think variants should be called? If there are good mapping qualities and base qualities, variants should be called.

    -Sheila

  • zzyzzy BeijingMember

    Hi, @Sheila,
    Thank you for your reply.
    What I mean is that I converted the BAM file which gave me no results back to fastq file, and checked its reads quality, the quality was extremely low.
    I don't know which region should have variants, what I am going to use is the discovery mode. And for the data I described above just gave nothing.
    Then I downloaded the original raw data from NCBI, not the aligned data, and checked the base quality again, it was good.
    I am now applying best practise to this raw data.

    Zeyu

Sign In or Register to comment.