Does GenotypeGVCFs calculate ReadPosRankSum and MQRankSum correctly?

tommycarstensentommycarstensen United KingdomMember ✭✭✭
edited September 2014 in Ask the GATK team

On 2000 samples I have run HC3.2, CGVCFs3.2, GGVCFs3.2 and VR3.2.

For the GenotypeGVCFs step I used the current default annotations:

InbreedingCoeff FisherStrand QualByDepth ChromosomeCounts GenotypeSummaries

And these non-default annotations:

MappingQualityRankSumTest ReadPosRankSumTest

When running VariantRecalibrator and plotting each of the dimensions I noticed all of the non-default annotations taking on discrete values; see bottom of this post. Is it no longer recommended to use ReadPosRankSum and MQRankSum for VR? Should I calculate these annotation with VariantAnnotator instead of GenotypeGVCFs? If I have to run VariantAnnotator, should I then run it separately for SNPs and INDELs cf. my previous question about annotations being different, when applied to BOTH and SNPs:
http://gatkforums.broadinstitute.org/discussion/2620

Thank you.

zcat out_GenotypeGVCFs/chrom20.vcf.gz | grep -v ^# | cut -f8 | tr ";" "\n" | grep ReadPosRankSum | sort | uniq -c | awk '$1>20000' | sort -k1n,1
  41649 ReadPosRankSum=0.731
  41760 ReadPosRankSum=0.550
  46305 ReadPosRankSum=0.720
  47060 ReadPosRankSum=0.00
  87348 ReadPosRankSum=0.406
 105254 ReadPosRankSum=0.736
 116426 ReadPosRankSum=0.727
 164855 ReadPosRankSum=0.358

zcat out_GenotypeGVCFs/chrom20.vcf.gz | grep -v ^# | cut -f8 | tr ";" "\n" | grep "MQ=" | sort | uniq -c | awk '$1>5000' | sort -k1n,1
   5802 MQ=57.05
   8382 MQ=29.00
   8525 MQ=56.62
  10069 MQ=51.77
  10574 MQ=53.95
  10682 MQ=47.12
  10818 MQ=56.04
  11553 MQ=55.21
 802603 MQ=60.00

zcat out_GenotypeGVCFs/chrom20.vcf.gz | grep -v ^# | cut -f8 | tr ";" "\n" | grep MQRankSum | sort | uniq -c | awk '$1>20000' | sort -k1n,1
  21511 MQRankSum=-7.360e-01
  27222 MQRankSum=0.322
  33699 MQRankSum=0.550
  34481 MQRankSum=0.731
  37603 MQRankSum=0.720
  60729 MQRankSum=0.00
  76031 MQRankSum=0.406
  85812 MQRankSum=0.736
  98519 MQRankSum=0.727
 186092 MQRankSum=0.358

Best Answers

Answers

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    Here my HaplotypeCaller command, which yields lots of doubtful annotations:

    $java \
     -jar $jar \
     --analysis_type HaplotypeCaller \
     --reference_sequence $ref \
     --input_file $bam \
    --intervals $chrom \
     --variant_index_parameter 128000 \
     --variant_index_type LINEAR \
     --dbsnp $dbsnp \
     --out $out \
     -gt_mode DISCOVERY \
     -stand_call_conf 10 \
     -stand_emit_conf 10 \
      --annotation Coverage -A FisherStrand -A HaplotypeScore -A MappingQualityRankSumTest -A QualByDepth -A RMSMappingQuality -A ReadPosRankSumTest \
     --emitRefConfidence GVCF
    
  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    If I am forced to get correct MQ, MQRankSum and ReadPosRankSum annotations using VariantAnnotator, does it then work correctly for INDELs now cf. this discussion on version 2.2, when it didn't:
    http://gatkforums.broadinstitute.org/discussion/1842/variant-annotator-not-annotating-mappingqualityranksumtest-and-readposranksumtest-for-indels

    I couldn't find anything in the 2.3 release notes saying it had been fixed:
    http://gatkforums.broadinstitute.org/discussion/1981/release-notes-for-gatk-version-2-3

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Tommy,

    First, thanks for jumping in and helping with a lot of user questions while Sheila and I were away for the workshop! That was very helpful, thanks.

    MQRankSum and ReadPosRankSum are definitely still recommended for VQSR, and as far as I know the values given by GGVCFs should be fine. Can you elaborate on why you think the annotation values are not ok? I don't actually look at annotation values all that much myself, so it's not obvious to me why those you posted above would be wrong.

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    I have attached two figures showing what I expect the distribution of MQ, MQRankSum and ReadPosRankSum to look like. The difference compared to the other previously attached plots is quite obvious, no? It's a bit odd that MQ=60.00 for more than 80% of my variants, no? And MQRankSum and ReadPosRankSum also seem to take on discrete values. HaplotypeCaller also yields a lot of MQ=60.00 variants. MQ is the culprit for more than 80% of my variants. It all seems a bit suspicious to me.

    P.S. I knew you guys were away, so I felt like invading your territory and giving back to the community. I'm glad you guys are not upset about me trespassing. Glad to have you back.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Ooh those are beautiful recalibration plots :)

    I think the plots you attached earlier got lost somehow, because I can't see them. In any case I just now understood what you meant by discrete values, and yes that does seem rather odd. Have you looked at the distribution of MAPQs in your data? Any chance there's something funky happening there? I'll ask the devs what we would need to look into this, assuming it's not just due to something simple that I'm not aware of.

    Are you kidding, we love it when someone knowledgeable jumps in to help out other users. That's all less work for us to catch up on (thanks!). As long as they know what they're talking about and we don't have to do damage control, of course ;)

  • pdexheimerpdexheimer Member ✭✭✭✭

    It's a bit odd that MQ=60.00 for more than 80% of my variants, no?

    Hmm, I was all ready with a massive post about how wrong you are. But the thing is, you're not. I see a similar spike in MQ=60.00 with the GATK3 workflow compared to older projects. In my case, I looked at chromosome 20 of ~750 exomes. I see 10k variants with an MQ of 60, with the next most common score (59.56) only having 268 instances. By contrast, an old GATK 2.8 run of UG over 53 exomes only has 258 variants total with an MQ of exactly 60.00.

    I don't think we have enough evidence yet to say it's a problem, though. The absolute scores are not nearly as important as their ability to carry information, so the discrete values doesn't necessarily bother me. MQ has not been in the recommended covariates for VQSR for a while, so if it loses a bit of resolution I don't see it as an issue.

    Because ultimately, I suspect that this amounts to a difference in resolution. GGVCFs doesn't look at the bam files, only at the gvcfs -so it's only as precise as the data in the gvcfs. And so we get a situation where there's less resolution in reference regions (thanks to banding), more analyses are getting bigger (i.e., my example of 750 vs 50), and most variants are rare. So most variants will be in a reference band in most of the samples, and rounding will probably push everything to 60.00

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    Have you looked at the distribution of MAPQs in your data? Any chance there's something funky happening there?

    I looked at the MAPQ distribution of one of the bam files and MAPQ=60 is definitely more prevalent, so it's an upstream problem and not a GATK problem:

    samtools view $bam | cut -f5 | sort | uniq -c | sort -k1n,1 | tail
       517035 36
       576346 15
       706970 9
       873442 23
      1402171 17
      3934604 37
      7194895 29
     10056318 0
    141922136 60
    

    @pdexheimer said:
    I don't think we have enough evidence yet to say it's a problem, though. The absolute scores are not nearly as important as their ability to carry information, so the discrete values doesn't necessarily bother me. MQ has not been in the recommended covariates for VQSR for a while, so if it loses a bit of resolution I don't see it as an issue.

    I just checked "(howto) Recalibrate variant quality scores = run VQSR" in Best Practices > Variant Discovery > Variant Filtering:

    http://gatkforums.broadinstitute.org/discussion/2805/howto-recalibrate-variant-quality-scores-run-vqsr

    I also checked this page:

    http://gatkforums.broadinstitute.org/discussion/1259/what-vqsr-training-sets-arguments-should-i-use-for-my-specific-project

    MQ is not on the first page, but it is present on the second page. Same goes for InbreedingCoeff. The second page has been edited more recently. I'll leave out MQ (and InbreedingCoeff).

    The ReadPosRankSum and MQRankSum values are recommended on both pages and those have discrete values in my case, which they didn't have previously. I wonder what has changed. I'll try to run VR again without MQ and see if that generates acceptable plots of ReadPosRankSum and MQRankSum. Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Sorry about the lack of consistency between docs -- the tutorial (first page) is meant as an example only, while the FAQ (second page) is THE go-to reference for actual Best Practice recommendations (which does include MQ). I'm aware this is a source of confusion and have plans to clarify/ sort this out in the near future.

  • pdexheimerpdexheimer Member ✭✭✭✭

    it's still in the SNP recommendations

    My mistake, I obviously meant to say "MQ has not been in my recommended covariates for VQSR for a while" :)

    I thought I was pretty much in sync with the best practices, though, I wonder when/why I deviated...

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    The VQSR recommendations were updated in April, but we didn't announce the change explicitly, which I realize was poor communication since the change has stayed mostly under people's radar. Nothing has changed since then, but if/when it does we'll make some kind of announcement to ensure that people are properly informed.

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭
    edited September 2014

    Thanks. In that case I have a problem with HaplotypeCaller3.2 and over-representation of MQ 60 scores. This was not previously a problem with UnifiedGenotyper2.1 despite using the MQ annotation and also having an over-representation of MAPQ 60 in my bams.

    P.S. I'll be afk tomorrow...

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    I'll discuss this with the devs and see what they think. Will report back when I know more.

    FYI I just made some edits to the VQSR arguments doc to clarify and highlight some caveats, but none of the recommendations themselves have changed.

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    Much appreciated. Thank you. Please let me know, if I can provide you with additional information.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    OK, I discussed this with the devs and they are not convinced that this is not just due to newer data having better quality and better MQs overall. They'd need to see an example of a single site in which you think the MQ is incorrect, which we can check by looking at the reads. Does that sound reasonable?

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    That sounds very reasonable. I can also run version 3.2 on some old data. I'll convince the devs. Another plan is to generate ROC curves for different sets of annotations. I'll try to get it all done tomorrow. Thanks.

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    @Geraldine_VdAuwera I realize you never saw the original plots with the discrete values. Here they are. I don't know, why they are only visible to me. Perhaps because I edited my post and then attached them.

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭
    edited September 2014

    @Geraldine_VdAuwera said:
    OK, I discussed this with the devs and they are not convinced that this is not just due to newer data having better quality and better MQs overall.

    Attached plot for the old data using the new method; i.e. HC3.2. I think the contrast is quite stark in terms of discrete values. I forgot to include the MQ and InbreedingCoeff annotations. A plot of those two to follow tomorrow. I will have the MQ=60 problem for sure. I will check the previous (UG2.x) MQ values and the current (HC3.2) MQ values. I can then extract the relevant reads and VCF lines to further convince you.

    Post edited by tommycarstensen on
  • tommycarstensentommycarstensen United KingdomMember ✭✭✭
    edited September 2014
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Oh I see -- that is visually very striking. Let me know when you have the MQ values distribution and I'll re-discuss with the devs.

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭
    edited September 2014

    @Geraldine_VdAuwera said:
    Oh I see -- that is visually very striking. Let me know when you have the MQ values distribution and I'll re-discuss with the devs.

    See attachments. It's the same chromosome 20 data. 100 African samples at 4x. Those histograms are more terrifying than "The Scream" by Edvard Munch.

  • drkaosdrkaos MAMember

    Dear Geraldine..
    Amazing Gatk, congratulation to all your team....

    One question, wondering if you have fix the bug or problem with MQ filter problem that tommycarstensen had?

    Thanks
    dR kaOs

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @drkaos
    Hi dR kaOs,

    Yes, the issue has been fixed. Have a look at the latest release notes for more information.

    -Sheila

Sign In or Register to comment.