Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

What is the best way to make an in-house database with allele frequency (like 100 genomes or ExAC)?

aneekaneek Hyderabad, IndiaMember ✭✭

Hi,
I am currently working on making a pilot in-house database of around 35 exomes data we have in our laboratory. I am following the GATK best practices guidelines for data analysis. The following pipeline I am using,

1. Data prepsocessing to make a clean BAM (Picardtools, BWA) from fastq files
2. Mark duplicates in Picard
3. Realignment around indels (RealignerTargetCreator + IndelRealigner)
4. Base quality score recalibration (BaseRecalibrator + PrintReads)
5. Generate GVCF or VCF (HaplotypeCaller).

After this there are two ways to proceed I guess.

Either I can generate seperate VCF files for 35 samples directly using HaplotypeCaller, hardfilter them and merge them using VCFtools (using mergeVCF command) to make a multisample VCF file. Then I can calculate the Minor Allele Frequency using VCFtools for each variant in the multisample VCF file.

The other way to proceed is to generate seperate GVCF files for 35 samples using HaplotypeCaller, do joint analysis using GenotypeGVCFs to make a multisample VCF file and then perform VQSR. Then again I can calculate the Minor Allele Frequency using VCFtools for each variant in the multisample VCF file.

My question is, what is your recommendation about the best way to proceed. Do you recommend any other pipeline? Please also correct me if I am wrong in choosing the pipelines mentioned above.

Thank you.

Best Answers

Answers

  • aneekaneek Hyderabad, IndiaMember ✭✭

    @Geraldine_VdAuwera

    Hi,

    Thank you very much for the suggestion. I would like to proceed with the second option. In this case do I need to perform CombineGVCFs first and then go for GenotypeGVCFs? Because the GVCF file sizes are generally larger than normal VCFs I guess. How to put them into single file like "combined_gvcfs.list" and use it in GenotypeGVCFs.

    Also is it recommended to pass -L with a bed file consist of intervals and -ip option. Also can I use -nt option to make the function faster?

    Thank you.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    For only 35 exomes you shouldn't need to run CombineGVCFs unless you find that GenotypeGVCFs is very slow or you run out of memory. It's not really a problem about file size, it's the number of files that your system needs to read in simultaneously that is the limiting factor.

    You can definitely make a list of the GVCFs to provide as a .lsit file. Depending on how you are managing your files it may be as simple as making a short script to list the gvcf files in your directories.

    You only need to use -L and -ip when you run HaplotypeCaller; it shouldn't be necessary for GenotypeGVCFs.

    Yes, I expect running with -nt should make it faster. You can experiment with different values to see what works well on your system.

  • aneekaneek Hyderabad, IndiaMember ✭✭
    edited April 2016

    @Geraldine_VdAuwera

    Hi,
    Thank you very much for these information. Actually I have used -L and -ip to run HaplotypeCaller to generate GVCF. It has generated a 797 Mb GVCF for a single sample. The size of the base recalibrated BAM file was 15 Gb.

    When I used the HaplotypeCaller (wih -L and -ip) to generate a normal unfiltered VCF from the recalibrated BAM it generated a 20 Mb size VCF.

    I would like to know whether the increased size of the GVCF is normal or I have done some mistake during analysis.

    Another thing is since our super computer has 8 CPUs can I use -nt 7?

    Thank you.

  • aneekaneek Hyderabad, IndiaMember ✭✭
    edited August 2016

    @Geraldine_VdAuwera

    Hi,
    further extending this discussion I am presently using GenotypeGVCFs command to merge 46 whole exome data to create the database. Following the GATK best practices guidelines I have used the following command:

    java -jar GenomeAnalysisTK.jar \
    -T GenotypeGVCFs \
    -R reference.fasta \
    -V sample1.g.vcf \
    -V sample2.g.vcf \
    ............................ \
    ............................. \
    -V sample46.g.vcf \
    -o output.vcf

    I have not used any other GenotypeGVCF specific advanced arguments and I would like to know if that is Ok as far as generating a database concern.

    After executing the command the program is running but with some warning messages which I am not understanding whether it may the purpose by generating an output with errors in joint genotyping which may ultimately can affect the subsequent steps of VQSR and alternate allele frequency calculation by VCFtools. The warning messages are as follows:

    INFO 11:48:44,912 HelpFormatter - Executing as [email protected] on Linux 2.6.18-194.17.4.el5 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_75-b13.
    INFO 11:48:44,913 HelpFormatter - Date/Time: 2016/08/19 11:48:44
    INFO 11:48:44,913 HelpFormatter - --------------------------------------------------------------------------------
    INFO 11:48:44,913 HelpFormatter - --------------------------------------------------------------------------------
    INFO 11:48:53,969 GenomeAnalysisEngine - Strictness is SILENT
    INFO 11:48:54,232 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
    INFO 11:49:08,714 GenomeAnalysisEngine - Preparing for traversal
    INFO 11:49:08,744 GenomeAnalysisEngine - Done preparing for traversal
    INFO 11:49:08,745 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
    INFO 11:49:08,746 ProgressMeter - | processed | time | per 1M | | total | remaining
    INFO 11:49:08,746 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
    WARN 11:49:09,360 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bias annotation values. If no sample has the SB genotype annotation, annotation may still fail.
    WARN 11:49:09,361 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bias annotation values. If no sample has the SB genotype annotation, annotation may still fail.
    INFO 11:49:09,362 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files
    WARN 11:49:10,102 HaplotypeScore - Annotation will not be calculated, must be called from UnifiedGenotyper, not org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs
    WARN 11:49:20,285 ExactAFCalculator - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at chr1:984147 has 8 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument. This warning message is output just once per run and further warnings will be suppressed unless the DEBUG logging level is used.
    INFO 11:49:38,759 ProgressMeter - chr1:1328887 53041.0 30.0 s 9.4 m 0.0% 19.7 h 19.7 h
    INFO 11:50:38,780 ProgressMeter - chr1:6479269 386757.0 90.0 s 3.9 m 0.2% 12.1 h 12.1 h

    After seeing these messages (StrandBiasTest and ExactAFcalculator warnings) I stopped the program here. Please suggest whether I can ignore the warnings and proceed with the analysis or what should I do to correct the warnings.

  • aneekaneek Hyderabad, IndiaMember ✭✭

    Hi, any suggestion please..

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @aneek
    Hi,

    You can proceed and ignore the warnings. Those are just telling you that some annotations are not present in the VCF records. The alternate alleles warning tells you only 6 most likely alternate alleles will be used. If you want more than 6 alternate alleles, you can change the settings using --max_alternate_alleles.

    -Sheila

  • aneekaneek Hyderabad, IndiaMember ✭✭

    @ Sheila

    Hi, Thank you very much for the suggestion. Since the warning message was showing 'chr1:984147 has 8 alternate alleles so only the top alleles will be used', I have selected --max_alternate_alleles value 8, and run the program. I hope this will be ok. Please correct me if I am wrong.

    Apart from this I have another query, in this database I have mixed two different category of samples. Both were sequenced in Illumina platforms however two different kits were used for exome capture. In some samples capturing was done using Agilent SureSelect kit and some using Nextera Rapid Capture kit. Therefore while analysing, for Agilent exomes I have used Agilent SureSelect bed file and for Nextera exomes I have used Nextera Rapid Capture bedfiles (provided by the vendor) in -L options with interval padding 100.

    My query is now if the allele frequency calculation by VCFtools will be wrong since I have mixed these two category of samples in the database?

    Thank you.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @aneek
    Hi,

    I think this thread will help.

    -Sheila

  • aneekaneek Hyderabad, IndiaMember ✭✭

    @ Sheila

    Hi,

    Thanks a lot. The information in the thread is very clear. What I have understood that, in RealignerTargetCreator, BaseRecalibrator and HaplotypeCaller I should per sample specfic bed files (interval) corresponding to how that sample was produced (which I have done). Now in this thread it has been recommended to perform the joint genotyping step using -isr INTERSECTION argument so that calls will be made only on those sites where all samples have data.

    Just one confusion. Should I have to use use both the interval files (like, –L Agilent –L Nextera -ip 100) in joint genotyping analysis (GenotypeGVCFs) along with the -isr INTERSECTION argument to achieve this?

    As far as I know you have not recommended to use -L option in any other steps than RealignerTargetCreator, BaseRecalibrator and HaplotypeCaller.

    http://gatkforums.broadinstitute.org/gatk/discussion/4133/when-should-i-use-l-to-pass-in-a-list-of-intervals

    Thank you.

  • aneekaneek Hyderabad, IndiaMember ✭✭

    Hi,

    Any suggestion please..

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @aneek
    Hi,

    Should I have to use use both the interval files (like, –L Agilent –L Nextera -ip 100) in joint genotyping analysis (GenotypeGVCFs) along with the -isr INTERSECTION argument to achieve this?

    Yes, that is correct.

    -Sheila

  • aneekaneek Hyderabad, IndiaMember ✭✭

    @ Sheila

    Hi, thanks a lot! I have understood clearly..

  • aneekaneek Hyderabad, IndiaMember ✭✭
    edited October 2016

    Hi,

    I have observed an issue in the multisample vcf generated after joint genotyping using GenotypeGVCFs. In some of the multiallelic site it has put a * as one of the alternate allele. Like as follows,

    chr1 26671575 . A G,* 17605.4 VQSRTrancheSNP99.90to100.00 AC=1,38;AF=0.010,0.388;AN=98;BaseQRankSum=2.79;ClippingRankSum=-4.670e-01;DP=2394;ExcessHet=0.3987;FS=0.550;InbreedingCoeff=0.1899;MLEAC=1,38;MLEAF=0.010,0.388;MQ=13.50;MQRankSum=-2.121e+00;QD=13.34;ReadPosRankSum=-1.200e-01;SOR=0.646;VQSLOD=-8.927e+00;culprit=MQ GT:AD:DP:GQ:PGT:PID:PL

    chr1 26671578 . A G,* 17609.73 VQSRTrancheSNP99.90to100.00 AC=1,38;AF=0.010,0.388;AN=98;BaseQRankSum=0.369;ClippingRankSum=-7.530e-01;DP=2356;ExcessHet=0.4700;FS=0.550;InbreedingCoeff=0.1815;MLEAC=1,38;MLEAF=0.010,0.388;MQ=13.34;MQRankSum=-3.350e+00;QD=13.36;ReadPosRankSum=0.930;SOR=0.646;VQSLOD=-8.746e+00;culprit=MQ GT:AD:DP:GQ:PGT:PID:PL

    chr1 1886517 . T TGAG,* 3105.97 PASS AC=2,12;AF=0.023,0.140;AN=86;DP=369;ExcessHet=0.0163;FS=0.000;InbreedingCoeff=0.3921;MLEAC=2,11;MLEAF=0.023,0.128;MQ=15.43;QD=28.24;SOR=0.894;VQSLOD=3.26;culprit=FS GT:AD:DP:GQ:PGT:PID:PL

    chr1 1886519 . C A,CGGCTCACACCGGAAGTGAGGCTCACACCGGAAGTGA,CTCACACCGGAAGTGA,* 5593.09 PASS AC=13,6,2,10;AF=0.232,0.107,0.036,0.179;AN=56;BaseQRankSum=-4.530e-01;ClippingRankSum=-5.330e-01;DP=358;ExcessHet=4.5175;FS=7.981;InbreedingCoeff=0.1537;MLEAC=12,6,2,9;MLEAF=0.214,0.107,0.036,0.161;MQ=40.65;MQRankSum=-5.920e-01;QD=28.83;ReadPosRankSum=-1.600e+00;SOR=1.649;VQSLOD=-4.104e-01;culprit=FS GT:AD:DP:GQ:PGT:PID:PL

    What this * means here? After splitting the multiallelic files into seperate lines, the * is considered as a variant like as follows,

    chr1 26671575 . A G 17605.4 VQSRTrancheSNP99.90to100.00 AC=1;AF=0.01;AN=98;BaseQRankSum=2.79;ClippingRankSum=-0.467;DP=2394;ExcessHet=0.3987;FS=0.55;InbreedingCoeff=0.1899;MLEAC=1;MLEAF=0.01;MQ=13.5;MQRankSum=-2.121;QD=13.34;ReadPosRankSum=-0.12;SOR=0.646;VQSLOD=-8.927;culprit=MQ GT:AD:DP:GQ:PGT:PID:PL
    chr1 26671575 . A * 17605.4 VQSRTrancheSNP99.90to100.00 AC=38;AF=0.388;AN=98;BaseQRankSum=2.79;ClippingRankSum=-0.467;DP=2394;ExcessHet=0.3987;FS=0.55;InbreedingCoeff=0.1899;MLEAC=38;MLEAF=0.388;MQ=13.5;MQRankSum=-2.121;QD=13.34;ReadPosRankSum=-0.12;SOR=0.646;VQSLOD=-8.927;culprit=MQ GT:AD:DP:GQ:PGT:PID:PL

    chr1 1886519 . C A 5593.09 PASS AC=13;AF=0.232;AN=56;BaseQRankSum=-0.453;ClippingRankSum=-0.533;DP=358;ExcessHet=4.5175;FS=7.981;InbreedingCoeff=0.1537;MLEAC=12;MLEAF=0.214;MQ=40.65;MQRankSum=-0.592;QD=28.83;ReadPosRankSum=-1.6;SOR=1.649;VQSLOD=-0.4104;culprit=FS GT:AD:DP:GQ:PGT:PID:PL
    chr1 1886519 . C CGGCTCACACCGGAAGTGAGGCTCACACCGGAAGTGA 5593.09 PASS AC=6;AF=0.107;AN=56;BaseQRankSum=-0.453;ClippingRankSum=-0.533;DP=358;ExcessHet=4.5175;FS=7.981;InbreedingCoeff=0.1537;MLEAC=6;MLEAF=0.107;MQ=40.65;MQRankSum=-0.592;QD=28.83;ReadPosRankSum=-1.6;SOR=1.649;VQSLOD=-0.4104;culprit=FS GT:AD:DP:GQ:PGT:PID:PL
    chr1 1886519 . C CTCACACCGGAAGTGA 5593.09 PASS AC=2;AF=0.036;AN=56;BaseQRankSum=-0.453;ClippingRankSum=-0.533;DP=358;ExcessHet=4.5175;FS=7.981;InbreedingCoeff=0.1537;MLEAC=2;MLEAF=0.036;MQ=40.65;MQRankSum=-0.592;QD=28.83;ReadPosRankSum=-1.6;SOR=1.649;VQSLOD=-0.4104;culprit=FS GT:AD:DP:GQ:PGT:PID:PL
    chr1 1886519 . C * 5593.09 PASS AC=10;AF=0.179;AN=56;BaseQRankSum=-0.453;ClippingRankSum=-0.533;DP=358;ExcessHet=4.5175;FS=7.981;InbreedingCoeff=0.1537;MLEAC=9;MLEAF=0.161;MQ=40.65;MQRankSum=-0.592;QD=28.83;ReadPosRankSum=-1.6;SOR=1.649;VQSLOD=-0.4104;culprit=FS GT:AD:DP:GQ:PGT:PID:PL

    If I put this multi sample VCF in variant annotation (using annovar) for a single sample, these alternate alleles are given as 0 like as follows,

    chr1 26671575 26671575 A 0 exonic AIM1L
    chr1 26671578 26671578 A 0 exonic AIM1L

    chr1 1886517 1886517 T 0 intronic CFAP74
    chr1 1886519 1886519 C 0 intronic CFAP74

    etc.....

    When I checked the BAM file for this sample in IGV the variants are chr1:26671575A>G, chr1:26671578A>G, chr1:1886517Tins variation is not present and chr1:1886519C>A is present.

    Hopefully I am able to put the problem correctly. I have copy pasted the lines here directly from the BQSR performed VCF and post VCF variant annotated file (annovar). I am confused what is happening here, if anyone can explain.

    Thanks & regards,
    Aneek

    Post edited by aneek on
  • aneekaneek Hyderabad, IndiaMember ✭✭

    Hi,
    Any comments or suggestion please..

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @aneek
    Hi Aneek,

    Have a look at this dictionary entry.

    -Sheila

  • aneekaneek Hyderabad, IndiaMember ✭✭

    @ Sheila

    Hi, Thanks a lot. Very important information.

    So the '0' value I am getting in the alternate allele columns after annotation are actually indicating deletions I think.

    chr1 26671575 26671575 A 0 exonic AIM1L
    chr1 26671578 26671578 A 0 exonic AIM1L

    chr1 1886517 1886517 T 0 intronic CFAP74
    chr1 1886519 1886519 C 0 intronic CFAP74

    etc.....

    I have checked some of these '0' variants in BAM file through IGV and found deletion region. Please correct me if I am wrong.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @aneek
    Hi,

    No, our tools do not output a "0" for the alternate allele column. Can you please post the entire VCF record for those sites? Please also post the BAM and bamout files for those positions.

    Thanks,
    Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    It sounds like the problem here is happening at the annotation stage. Since this involves non-GATK programs we can't help you with the interpretation of the output.
Sign In or Register to comment.