Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Less unfiltered reads than filtered ones

ssandmannssandmann Münster, GermanyMember

Hey,

we are working with GATK 3.2.2 and focus on an optimization of the SNP and indel calling in the context of our data. Thereby, we also determine the frequency of a certain variant. However, we came across some strange results in this case.
In the vcf files there is of course the DP in the info field, representing the unfiltered depth, and the DP in the format field, representing the filtered depth. Yet, we also calculated the depth by ourselves in the bam file and looked it up in the IGV.
One especially strange case is the following:
For a sample GATK calls a deletion. The unfiltered DP (info field) is DP=644. The filtered DP (format field) is DP=505. It consists of 438 reads with the original genotype and 67 reads containing the variant. However, in our raw bam file and in the IGV things are different. There we see 499 reads in total. 425 containing a C (reference), one with an N and 73 with a deletion. We checked and the results are reproduceable.
Furthermore, we see similar differences as well. There are many cases in which the number of reads containing the variant or the reference allele as we count it and as it is observable in the IGV is smaller than the reported number of filtered reads.
How can this happen? I'm really looking forward to your answers.

Sarah

Tagged:

Best Answer

Answers

  • SheilaSheila Broad InstituteMember, Broadie admin

    @ssandmann
    Hi Sarah,

    I think those extra reads you are seeing are the artificial haplotypes created by Haplotype Caller. Please read more about it here: http://gatkforums.broadinstitute.org/discussion/4146/hc-step-2-local-re-assembly-and-haplotype-determination

    You can tell which reads are sample reads and artificial haplotypes by right clicking on the reads in IGV -> selecting "Color Alignments By" -> "Sample" .
    When you hover over the different colored reads, you can see in the label which are sample reads and which are artificial haplotypes.

    -Sheila

  • ssandmannssandmann Münster, GermanyMember

    Hi Sheila,

    thanks for your quick answer. Yet, I still do not understand properly what is going on with my reads. I colored them in the IGV according to "Sample". However, they are all of the same color (light red).
    Just to understand it correctly, but the Haplotype Caller creats artificial reads? How am I supposed to deal with them when I want to analyze the frequency by which a variant appears? I cannot really trust the frequency I calculate from the vcf file if it contains additional artificial reads, can I?
    In one case, I saw 3 reads with an insert and 1 with a deletion, but according to the vcf file there were 72 reads containing the variant. And in the IGV I just saw my four reads containing an indel.
    For us it is really important to have reliable information on the frequency, but sometimes the differences between the frequency based on the vcf file and the frequency I calculated from the raw bam file is really big. What would you suggest we should do?

    Sarah

  • SheilaSheila Broad InstituteMember, Broadie admin

    @ssandmann

    Hi Sarah,

    HaplotypeCaller does a local reassembly step that may cause some of the reads to change positions. To see the reads after the reassembly step, you need to get the bamout file. Please read more about it here: https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#--bamOutput

    When you check the bamout file, you will see the artificial haplotypes and sample reads having different colors.

    -Sheila

  • ssandmannssandmann Münster, GermanyMember

    Dear Sheila,

    thank you very much for your answer. I started looking at the bamout files and restarted our calculations. From the first results it seems that in most cases the raw number of reads is now bigger than the one reported in the vcf file. In most cases where this cannot be observed, different ways of counting (especially concering inserted bases) might explain the differences. However, there remain a few cases where I just can find no explanation for the results.
    For example: Once, a SNP is called. I count 124 reads containing the reference allele and 132 reads containing the alternate allele. These results are consistent with the observations in the IGV. Additionally, there are 9 reads containing a deletion. However, according to the format field in the vcf file there should be 178 reads with the original genotype and 45 with the variant. Do you have any idea how this could happen?

    Sarah

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Sarah,

    It's difficult to really understand the situation here. Can you please post a screenshot and the full VCF record? Also, please clarify whether in this example, both variants occur at the same site.

  • ssandmannssandmann Münster, GermanyMember

    Dear Geraldine,

    I chose a slightly different SNP as an example, because I think this is even more problematic:
    Below you find a screenshot from the IGV.
    For the SNP I counted 295 reads containing the reference and 277 reads containing the alternate allele (as you can also see in the screenshot). In addition there are 57 reads with a deletion at the position of the variant, which I ignored as they neither contained the reference nor the alternate allele.
    The corresponding line in the vcf is the following:

    4 106196834 . C T 5666.77 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=12.277;ClippingRankSum=1.464;DP=575;EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Cct/Tct|P1744S|2023|TET2|protein_coding|CODING|ENST00000513237|11|1);FS=13.364;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=0.462;QD=9.86;ReadPosRankSum=4.240;set=variant2 GT:AD:DP:GQ:PL 0/1:319,252:571:99:5695,0,8193

    So on the one hand, I cannot understand how 319 reads could have been counted containing the genotype 0. On the otherhand, it seems odd that there are a total of 572 reads in the IGV, but 575 according to the info field of the vcf.

    Please let me know if you need any further information.

    Sarah

    SNP.png 40.3K
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Sarah,

    Keep in mind that IGV hides some of the (filtered) reads by default, so that may explain some minor discrepancies.

    Can you tell me how you counted the reads, exactly?

  • ssandmannssandmann Münster, GermanyMember

    Dear Geraldine,

    I tried to disable all filtering options, so that all reads should be displayed in the IGV. Yet, there seems to be one read missing, which contais a deletion (I counted the reads manually to be sure). So, there seem to be 58 instead fo 57 reads containing a deletion at the side of the SNP. The other figures however do not change.

    What I did for the SNP is the following:
    I extract information on the SNP from the vcf file (using R and the package Rsamtools). I use this information to get all reads (from the bam file) that cover the mutated base. In total there are 630 reads. I checked the number of reads directly by samtools and it is the same.
    In the next step, I looked at the cigar string of every read. As I know the starting position of every read, I can simply calculate, whether a read covers the mutated base or whether there is a deletion at the site.
    In 58 out of the 630 cases I find a deletion, which is why I neglect these reads. However, for the remaining 572 reads I check whether the base at the positiion of the mutation is the same as the reference base, the same as the alternate base or a different base. I just count the number of reads containing the reference base (295) and those containing the alternate base (277) and compare the results with those reported in the vcf file.

    Sarah

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Ok, that makes sense. I assume these calculations were made on the original bam, rather than HC's output bam? If so I think what's happening is that HC realigns some of those reads with the artifactual indel so they now support the reference.

  • ssandmannssandmann Münster, GermanyMember

    Hello Geraldine,
    at first I made my calculations with the original bam, but as I read about the output bam (with the realigns), I changed my script so that I always use the output bams for the calculations. (Otherwise the differences in the numbers are even greater).
    Sarah

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hmm, then it is possible that we have an accounting problem. What command lines are you using, and does this reproduce with the latest version (3.3)?

  • ssandmannssandmann Münster, GermanyMember

    Dear Geraldine,
    I restarted our pipeline using GATK 3.3 and the results are changing:

    Old version:
    no SNP (bamout): 295; SNP (bamout): 277; 572 reads+58 deletions
    DP=575
    no SNP (vcf): 319; SNP (vcf): 252; 571 reads
    no SNP (IGV): 295; SNP (IGV): 277; 572 reads+57 deletions

    Latest version:
    no SNP (bamout): 270; SNP (bamout): 296; 566 reads+52 deletions
    DP=564
    no SNP (vcf): 295; SNP (vcf): 268; 563 reads
    no SNP (IGV): 270; SNP (IGV): 296; 566 reads+51 deletions

    So, concerning the total number of reads the results do now appear to be fine. Yet, there remains the great difference between reads without the SNP counted in the bamout-file and those counted in the vcf-file which I do not understand.

    The command lines concerning the variant calling I used are the following:

    java -Xmx32g -Djava.io.tmpdir=/mnt/fc6tb/tmp -jar $GATK \
    -T HaplotypeCaller \
    -R $GENOME \
    -stand_call_conf 30.0 \
    -stand_emit_conf 10.0 \
    --dbsnp $DBSNP \
    -L $TARGETREGION \
    --max_alternate_alleles 9 \
    --downsample_to_coverage $DOWNSAMPLING \
    -nct $THREADS \
    -I ${BAMDIR}/${SAMPLE}.bam \
    -o ${INTERMRES}/${SAMPLE}.rawMutations.vcf \
    -bamout ${BAMDIR}/${SAMPLE}.bamout.bam \
    2>&1 | tee $LOGS/03_HaplotypeCaller.txt

    We use DOWNSAMPLING=1500 and THREADS=1.
    Please let me know if you need any further information.

    Sarah

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    edited March 2015

    Oh, I just realized that it looks like the ADs are basically flipped between the alleles. Does that match what you're seeing? There was an ordering bug in 3.2 that occasionally switched the AD values between alleles. Is it possible that that explains what you're seeing? And do you still see that happen when you re-call variants from the original bam file with 3.3?

  • ssandmannssandmann Münster, GermanyMember

    Dear Geraldine,

    I found a little mistake in my former calculations (I used the bam file resulting from GATK 3.2.2 as an input for the pipeline using GATK 3.3.0 and not the original file resulting from the alignment with TMAP). Yet, I corrected it mistake and the results stay somehow strange:

    GATK 3.2.2:

    no SNP (bam): 360; SNP (bam): 370; 730 reads+68 deletions
    no SNP (IGV): 362; SNP (IGV): 371; 733 reads(incl.2 Inserts)+56 deletions
    no SNP (bamout): 295; SNP (bamout): 277; 572 reads+58 deletions
    no SNP (IGV): 295; SNP (IGV): 277; 572 reads+57 deletions
    DP=575
    no SNP (vcf): 319; SNP (vcf): 252; 571 reads

    GATK 3.3.0:

    no SNP (bam): 360; SNP (bam): 370; 730 reads+68 deletions
    no SNP (IGV): 362; SNP (IGV): 371; 733 reads(incl.2 Inserts)+56 deletions
    no SNP (bamout): 302; SNP (bamout): 286; 566 reads+46 deletions
    no SNP (IGV): 302; SNP (IGV): 286; 588 reads(incl.1 Insert)+45 deletions
    DP=576
    no SNP (vcf): 314; SNP (vcf): 254; 568 reads

    So, as the ADs reproted in the vcf result from the bamout files, it remains the problem that more reads containing no SNP get reported than there are in the IGV/that can be counted in the bamout file.

    Sarah

Sign In or Register to comment.