Is HaplotypeCaller suitable for use in a GWAS project?

ianwianw Bangor, UKMember

Hi GATK Team,

I'm performing a Genome Wide Association Study and have just used HaplotypeCaller to call SNPs. Having read previous threads discussing the reference bias inherent in HC (at least relative to UnifiedGenotyper), I included the recommended arguments so my command line was as follows:

java -Xmx16000m -XX:+UseSerialGC -jar /GenomeAnalysisTK.jar -T HaplotypeCaller -L /lustre/scratch113/projects/cichlid/Mzebra_UMD1_assembly/New_Intervals_May16/UMD_1_2.intervals -R /lustre/scratch113/projects/cichlid/Mzebra_UMD1_assembly/UMD1_mzebra_nuclear_and_mtDNA.fa --emitRefConfidence GVCF --variant_index_type LINEAR --minPruning 1 --minDanglingBranchLength 1 --variant_index_parameter 128000 -nct 4 -I {} -o GATKgVCFs/{}_UMD_1_2.g.vcf.gz

I understand that the minPruning and minDanglingBranchLength arguments are expected to mitigate the reference bias? I would like to know how it does so - does it just refuse to make calls at those sites rather than guessing or does it take another approach? As I'm sure you appreciate, in a GWAS study it could be very misleading to have certain sites incorrectly recorded as reference bases simply as a result of lack of sequencing depth. Is HC's appropriateness for a GWAS study comparable to that of UG or would UG be recommended for this particular application?

Kind regards,

Ian

Best Answer

Answers

  • ianwianw Bangor, UKMember

    I should edit this to say that, when I refer to 'reference bias' in HC, I mean in low coverage sequences :smile:

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @ianw
    Hi Ian,

    Thanks for the clarification. You are correct that GATK tools are tested on 30X coverage data, so it may not perform as well at much lower coverage.

    What coverage data do you have?

    As for the -minPruning and -minDanglingBranchLength, they will certainly help in low coverage regions. You can also use --stand_emit_conf and --stand_call_conf to see if there are any sites that have some low quality evidence for a variant, and look in IGV to see if they may be real or not.

    -Sheila

  • ianwianw Bangor, UKMember

    Hi @Sheila and @Geraldine_VdAuwera,

    Thank you both for your quick responses and apologies for the slight delay in my own. The majority of my samples have coverage levels of 5-7x, whilst others have roughly 15x. I currently have 230 genomes, though this will be increased to 370 over the next 12 months.

    Geraldine, your description of the minPruning and minDanglingBranchLength arguments are very helpful, thank you. I'm so glad I checked as I obviously hadn't fully considered their implications in a GWAS analysis. Looking at the very low RGQ scores attributed to a lot of sites generated in my first attempt, I assume that these scores represent coverage depth rather than being Phred scores? If this is the case, am I right in saying that it's down to personal preference at which RGQ score I set a cutoff for accepted reference calls? And once I set a cutoff, I can edit those sites to show that they have too low coverage to make reliable calls?

    Ian

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @ianw
    Hi Ian,

    The RGQ scores are the reference Genotype Quality. Basically, RGQ tells you how much confidence we have in the genotype actually being hom-ref. Have a look at this article under "GQ: Quality of the assigned genotype." for more information.
    You are correct it is up to you to set a cutoff for the RGQ score, as we do not have any recommendations. However, the RGQ alone will not tell you whether there was low coverage at the site or simply not enough informative reads to may a good call. You will need to look at the DP and AD fields for coverage information.

    -Sheila

  • ianwianw Bangor, UKMember

    Hi @Sheila,

    Thank you for clearing that up for me, that's very helpful!

    Many thanks to yourself and @Geraldine_VdAuwera !

Sign In or Register to comment.