HaplotypeCaller produces incomplete output

jitendrasbhati (India) Posts: 24, Member

I am using GATK 2.7.2 and following the GATK Best Practices; I have completed all the steps as described there. All the .vcf files needed as input were downloaded from the ftp bundle/2.3/h37.

For the Variant Recalibration step, I used the HaplotypeCaller tool to generate raw_variants.vcf. The .vcf is generated, but it consists of only headers and no variants. Because of this, the VariantRecalibrator tool does not work, as it needs annotations.

I am attaching the raw_variants.vcf that is generated after executing HaplotypeCaller tool.

I have one more question: does HaplotypeCaller add quality scores to the output .vcf file?

Attachment: raw_variants.zip (4K)
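
For reference, whether the file is truly header-only can be confirmed with a quick count of non-header lines (a minimal check, assuming an uncompressed VCF; the filename here matches the attachment):

```shell
# Count variant records in the VCF: every non-header line is one record.
# A result of 0 means the file contains only header lines.
grep -vc '^#' raw_variants.vcf
```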


Answers

  • Geraldine_VdAuwera Posts: 6,643, Administrator, GATK Developer

    Hi there,

    To clarify, your problem is that the HaplotypeCaller VCF output is empty of variants, correct? Can you please show the console output produced by HaplotypeCaller, as well as your original command line?

    Geraldine Van der Auwera, PhD

  • kraigrskraigrs Ann Arbor, Michigan, USAPosts: 8Member

    I am having a similar issue at the recalibration step followed by the HaplotypeCaller walker. The recalibration seems to have executed successfully, followed by successful runs of the PrintReads and ReduceReads walkers, but the final HaplotypeCaller run outputs only 3 variants, when I know there are many more than that. I've uploaded the output of running all of these programs, but I can't find any indication of something going wrong. Please help!

    Attachment: GATK_SNPindel_recalibration.o11350595.txt (70K)
  • Geraldine_VdAuwera Posts: 6,643, Administrator, GATK Developer

    Hi @kraigrs,

    Your problem is that most of your reads are getting filtered out because they are unmapped. Have a look at the filtering summary, particularly this line:

    INFO 18:00:22,993 MicroScheduler - -> 333472716 reads (88.58% of total) failing MappingQualityZeroFilter

    Basically HaplotypeCaller sees little or no data to call variants on.
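
    As a rough check of this outside GATK, the MAPQ column (field 5 of a SAM record) can be tallied directly; a sketch assuming a text SAM file with a hypothetical filename:

```shell
# Count reads with mapping quality 0 (the reads MappingQualityZeroFilter drops).
# Skips header lines (starting with @); MAPQ is the 5th tab-separated field.
awk '!/^@/ { total++; if ($5 == 0) mq0++ }
     END   { printf "%d of %d reads (%.2f%%) have MAPQ 0\n", mq0, total, 100 * mq0 / total }' aligned.sam
```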

    Geraldine Van der Auwera, PhD

  • kraigrs (Ann Arbor, Michigan, USA) Posts: 8, Member

    So I ran a version of this on a separate file which had a similar percentage of "unmapped" reads, and comparing the number of raw variants (original run of HaplotypeCaller) to those called after recalibration, the number of variants went from 560,060 --> 521,841, whereas the sample I'm currently troubleshooting went from 710,940 --> 3.

    I've attached the output from the separate file with the similar percentage of "unmapped" reads for comparison. This one seems to have run successfully despite this higher proportion.

    Attachment: GATK_SNPindel_recalibration.o11350594.txt (75K)
  • kraigrs (Ann Arbor, Michigan, USA) Posts: 8, Member

    I think that those high proportions are a result of aligning gDNA sequence reads to a list of annotated genes from my reference, which is why in my "good" sample, there are still ample numbers of variants to work with.

  • pdexheimer Posts: 372, Member, GSA Collaborator ✭✭✭

    Your two samples used different knownSites for the BaseRecalibration process - I'm guessing, by the names, that this workflow is using bootstrapped variants. If your results are that different pre- and post- recalibration, I'd focus on the difference between those two knownSites files. Maybe one is dramatically more specific than the other?

    Also, a MAPQ of 0 isn't strictly unmapped; it's a read that maps equally well to multiple sites in the genome (and so is not confidently mapped). The end result in the GATK is the same, but it's a helpful distinction when you're thinking about where the numbers come from. My (somewhat random) guess is that your genome was recently duplicated.

  • Geraldine_VdAuwera Posts: 6,643, Administrator, GATK Developer

    Oh, well that makes more sense, although it's not very good practice to align gDNA to a small subset of genes, because that can introduce a lot of bias into the aligner's decisions. So you really should do a full alignment to the complete reference. Then afterwards you can restrict your analysis to your genes of interest using the -L argument.

    As to why your results are so different between files, @pdexheimer's suggestion of looking more closely at the known sites files is a very good idea. Also, right now you're recalibrating on a very small amount of data, which is not good for the recalibration model building. You should do the recalibration on the full genome for best results.

    Geraldine Van der Auwera, PhD

  • kraigrs (Ann Arbor, Michigan, USA) Posts: 8, Member

    I still don't think this explains why the two samples, processed identically, give such dramatically different numbers of variants after the recalibration step. You are correct @pdexheimer, I am using a small subset of very high-quality homozygous variant calls as my knownSites, which are different for each of my samples (two different Drosophila species).

    I will try the -L argument, but I still might find it useful to figure out why the results are so different for two identically processed files. The file that is "behaving" is a single-end sequencing run, while the one that is "misbehaving" is a paired-end sequencing run, but I'm not convinced that that explains the difference in recalibrated variant calls.

    What format should the -L targets.interval_list be in? Would a BED file be appropriate? And is invoking -L the same as -A?

  • pdexheimer Posts: 372, Member, GSA Collaborator ✭✭✭

    Single- versus paired-end is a pretty major change, and shifting between species may or may not be a large one. Both have the potential to be much larger than typical sample-to-sample variation, so I don't think it's particularly fair to expect similar behavior, even if the workflow is the same.

    Related to that, did you perchance filter your high-confidence variants based on MQ? I know BWA produces very different ranges of mapping qualities for single and paired data, I suspect Bowtie2 does as well. I wonder if your paired "high confidence" sites are not quite as stringent as your single-end?

    Also, Geraldine's comment that "right now you're recalibrating on a very small amount of data" could completely explain sample-to-sample variation, even without the other differences. If you don't give it enough data, the model can't stabilize, and your results are erratic.

  • Geraldine_VdAuwera Posts: 6,643, Administrator, GATK Developer

    Processing is not identical if you are using different sets of known sites, and apparently the data type is not the same either. Differences accumulate -- and keep in mind that one of your datasets may have suffered from specific machine errors during sequencing that the other wasn't exposed to. It happens; a technician bumps a machine, or a part misbehaves, and bam, you've got systematic errors throughout your data. That is why the base recalibrator builds a read group-specific model of error. And since you're recalibrating on a fairly small amount of data, any issues with the recalibration process could amplify pre-existing problems. Maybe suddenly the base quals that show variation are all bumped down below 20, which the HC will dismiss out of hand. Things like that can have a huge effect downstream.

    If you want to get to the bottom of this, have a look at the pre- and post-recalibration base qualities, compare the plots, all that can help. But in my opinion the best way to tackle this is to redo alignments from scratch on your whole genomes and recalibrate all the data, not just your genes of interest.
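
    One rough way to compare pre- and post-recalibration qualities without the plots is to average the Phred+33 base qualities (SAM field 11) in each file; a sketch assuming text SAM input with a hypothetical filename:

```shell
# Mean base quality across all reads; QUAL is the 11th field, Phred+33 encoded.
# Run once on the pre-recalibration file and once on the post-recalibration
# file, then compare the two means.
awk 'BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
     !/^@/ { q = $11
             for (i = 1; i <= length(q); i++) { sum += ord[substr(q, i, 1)] - 33; n++ } }
     END   { printf "mean base quality: %.1f over %d bases\n", sum / n, n }' recal.sam
```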

    See the FAQs on inputs for more details on the interval files requirements (BED is ok).
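
    For completeness, a BED interval file is just tab-separated chrom/start/end with 0-based, half-open coordinates; a minimal sketch (the chromosome names and lengths here are hypothetical):

```shell
# Write a two-interval BED file for use with "-L targets.bed".
# Columns: chromosome, 0-based start, end (exclusive), tab-separated.
printf '2L\t0\t23011544\n2R\t0\t21146708\n' > targets.bed
```

    Then add -L targets.bed to the GATK command line to restrict the analysis to those intervals.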

    Geraldine Van der Auwera, PhD

  • kraigrs (Ann Arbor, Michigan, USA) Posts: 8, Member

    I would definitely call 710,940 variants down to 3 variants "erratic"! That's a good point, though, about the potential mapping quality differences. So far I hadn't applied any filters to the variants themselves; I was just using the default behavior of HaplotypeCaller, which I think applies some level of quality filtering.

    Based on some of the recommendations I had seen for recalibration without a set of known sites, I was under the impression that using a few sites would still be effective. There are between 14,000 and 18,000 "known" sites being used for these two samples (and if it makes a difference, 14,000 for the "misbehaving" sample and 18,000 for the "behaving" sample).

  • jitendrasbhati (India) Posts: 24, Member

    Hi,

    Yes, the HaplotypeCaller VCF output is empty of variants. I am attaching the console output here.

    The command is as below:

    -T HaplotypeCaller -R ReferenceFiles\sequence.fasta -I ReferenceFiles\reduced_reads.bam --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -o ReferenceFiles\raw_variants.vcf

    Attachment: HaplotypeCallerConsole output.txt (5K)
  • Geraldine_VdAuwera Posts: 6,643, Administrator, GATK Developer

    OK, the command looks normal. Can you tell me exactly what data you are working with? Is it your own data or some data we provide?

    Geraldine Van der Auwera, PhD

  • jitendrasbhati (India) Posts: 24, Member

    We are using the .vcf and .fasta files from the GATK ftp site. But according to the Best Practices we need to start with a fastq file for the BWA step, and there is no fastq file in the resource bundle on your ftp. We downloaded an h37-related fastq from the internet and then followed the steps mentioned in the pre-processing part. Could you please point us to a fastq file along with the matching reference fasta file?

    Also, we are using the bwasw tool of BWA to generate the .sam file, as mem is not available. Will this create a problem for further processing?

  • Geraldine_VdAuwera Posts: 6,643, Administrator, GATK Developer

    In the near future we will add a separate tutorial bundle that includes a fastq file. In the meantime, for purposes of testing GATK, you can start with one of the bam files we provide in the bundle, and just start from the indel realignment step.

    Geraldine Van der Auwera, PhD

  • jitendrasbhati (India) Posts: 24, Member

    There is one bam file in the bundle/2.3/exampleFASTA folder on the FTP; there was no bam in the bundle/2.3/hg19 folder. Using the .bam file in the exampleFASTA folder together with its fasta file, HaplotypeCaller still gave an incomplete raw_variants.vcf file; that is, there were no variants.

    I suspect that we are not using the correct fasta (reference) file and its corresponding bam file. Could you provide a reference FASTA file along with its bam file and fastq file, so that I can test the complete workflow at my end?

    Thanks.

  • ebanks Posts: 684, GATK Developer, mod

    Hi @jitendrasbhati, We don't provide fastq files. I'm not sure where you got your original fastq file from, but it doesn't look complete (e.g. there are fewer than 3,000 reads with good mapping quality across all of chromosomes 20 and 21). Are you sure there are reads with variant alleles in the bam produced from this fastq? Have you looked at this in IGV? Ultimately, just because the HC produced no output doesn't necessarily mean that it's not doing the right thing.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • Geraldine_VdAuwera Posts: 6,643, Administrator, GATK Developer

    @jitendrasbhati, you shouldn't use the exampleFASTA, exampleBAM etc for testing the variant calling. They are very small files that are only meant for checking that your installation of GATK works. Instead, you can use the NA12878 or CEUTrio bam files in the b37 directory (along with the b37 reference). That will be a more appropriate test object.

    Geraldine Van der Auwera, PhD

  • jitendrasbhati (India) Posts: 24, Member

    I want to know about the .vcf files that are used in the VariantRecalibrator tool. Why do we need 4 .vcf files? Do we have the option of using only 3, 2, or 1 resource .vcf files? If all 4 are compulsory, where do we find the omni.vcf file? Is it a database like dbSNP or 1000G?

    Can you share with me a .vcf file which has variants and also annotations such as DP, QD, etc., which are needed by the VariantRecalibrator tool?

    Thanks.
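
    For readers with the same question: the omni file is the 1000 Genomes Omni 2.5M SNP array call set, provided in the bundle as 1000G_omni2.5.b37.vcf, and VariantRecalibrator accepts any combination of -resource arguments rather than requiring all four. A sketch of typical SNP-mode resource lines with bundle filenames; treat the exact known/training/truth flags and prior values as assumptions and check the VQSR documentation for your GATK version:

```shell
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf \
-resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbsnp_137.b37.vcf
```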
