Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

Using Variant Filtration

delangeldelangel Posts: 71GATK Developer mod
edited November 2012 in Methods and Workflows

VariantFiltration

For a complete, detailed argument reference, refer to the GATK document page here.

The documentation for Using JEXL expressions within the GATK contains very important information about limitations of the filtering that can be done; in particular please note the section on working with complex expressions.

Filtering Individual Genotypes

One can now filter individual samples/genotypes in a VCF based on information from the FORMAT field: Variant Filtration will add the sample-level FT tag to the FORMAT field of filtered samples (this does not affect the record's FILTER tag). This is still a work in progress and isn't quite as flexible and powerful yet as we'd like it to be. For now, one can filter based on most fields as normal (e.g. GQ < 5.0), but the GT (genotype) field is an exception. We have put in convenience methods so that one can now filter out hets (isHet == 1), refs (isHomRef == 1), or homs (isHomVar == 1).

Post edited by yfarjoun on

Comments

  • isp2isp2 Posts: 7Member

    I'm having some problems marking heterozygous genotypes. I used the following command:

        java -Xmx12g -jar ${GATKPATH}/GenomeAnalysisTK.jar \
        -R reference.fa \
        -T VariantFiltration \
        -o markHET.vcf \
        --variant original.vcf \
        --genotypeFilterExpression "isHET == 1" \
        --genotypeFilterName "HET"
    

    But when I grep on "HET" I don't find anything. Looking at the vcf generated (markHET.vcf), for example the entry between '**' should have been marked:

        Chr1    1348    .       G       A       425.43  PASS    AC=3;AF=0.026;AN=116;BaseQRankSum=-0.841;DP=737;Dels=0.00;FS=3.899;HaplotypeScore=0.6335;InbreedingCoeff=0.5494;MLEAC=3;MLEAF=0.026;MQ=49.39;MQ0=0;MQRankSum=0.953;QD=23.64;ReadPosRankSum=-0.942;SB=-9.202e+01    GT:AD:DP:GQ:PL  0/0:11,0:11:33:0,33,414 0/0:8,0:8:15:0,15,183   0/0:9,0:9:27:0,27,326   0/0:18,0:18:51:0,51,640 0/0:7,0:7:21:0,21,264   0/0:13,0:13:39:0,39,479 0/0:14,0:15:42:0,42,527 0/0:18,1:19:23:0,23,522 **0/1:3,1:5:19:19,0,73**    0/0:19,0:19:57:0,57,678 [etc.]
    

    Is the command I'm using wrong?

    Thank you.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,046Administrator, GATK Developer admin

    Hi there,

    Sorry for the response delay, I had to check since this is a tricky topic (I hate JEXL, it's so finicky). Our documentation makes the genotype evaluation syntax sound too simple; it's actually a little more complicated than that (I will put fixing the doc on my to-do list). There are two options; either you access individual sample genotypes using 'vc.getGenotype("sampleName").isHet()', or you can filter over all sample genotypes using something like 'GT == 0/1'. I hope this helps.

    Geraldine Van der Auwera, PhD

  • flescaiflescai Posts: 53Member ✭✭

    Hi there, I have a problem when using VariantFiltration: it seems not to change the FORMAT field for all variants, which is obviously a big issue when subsequently using the VCF file with any program for the analysis..

    I'm applying a filter to the genotype field as follows:

    java -Xmx8g -Xms6g -jar /com/extra/GATK/2.7-2/jar-bin/GenomeAnalysisTK.jar \
    -T VariantFiltration \
    -R /data/refseq/tooldata/GATKbundle/2.5/b37/human_g1k_v37.fasta \
    -o UnifiedGenotyper_filteredGenotype_DP20GQ30.vcf \
    --variant UnifiedGenotyper.recalibrated.vcf \
    --genotypeFilterExpression "DP <20 && GQ <30" \
    --genotypeFilterName "FILTQDP"
    

    but in the output VCF file I have lines like this for some variants:

    1  762273  rs3115849   G   A   87315.72    PASS    AC=47;AF=0.783;AN=60;BaseQRankSum=4.447;DB;DP=4287;Dels=0.00;FS=33.490;HaplotypeScore=4.6255;InbreedingCoeff=
    -0.2766;MLEAC=47;MLEAF=0.783;MQ=44.06;MQ0=214;MQRankSum=-25.916;QD=20.37;ReadPosRankSum=-1.274;VQSLOD=-2.860e+00;culprit=MQRankSum;set=snps GT:AD:DP:GQ:PL  1/1:0,150:150:99:4365
    ,361,0  0/1:133,117:238:99:2646,0,3533  1/1:0,250:250:99:7128,587,0 1/1:1,249:243:99:6812,587,0 1/1:0,58:58:99:1806,150,0   0/1:143,106:237:99:2412,0,3790  0/1:129,121:2
    44:99:2867,0,3411   1/1:1,248:242:99:6890,593,0 1/1:0,247:247:99:6998,575,0 1/1:0,125:125:99:3559,298,0 0/1:96,66:154:99:1465,0,2348    0/1:76,65:135:99:1391,0,20630/1:73,82:148:99:1631,0,1886    1/1:0,148:148:99:4319,364,0 1/1:1,185:185:99:5001,427,0 1/1:0,106:106:99:3221,274,0 0/1:57,42:95:99:1016,0,1572 1/1:1,121:118:99:3145
    ,265,0  0/1:99,74:165:99:1702,0,2642    1/1:0,137:137:99:3993,343,0 0/1:54,36:86:99:909,0,1308  1/1:0,130:130:99:3784,322,0 0/1:39,46:81:99:893,0,1027  1/1:1,49:48:9
    9:1399,120,0    0/1:32,40:70:99:955,0,877   1/1:0,75:74:99:2045,174,0   0/1:55,46:96:99:1076,0,1283 1/1:0,52:52:99:1457,126,0   0/1:19,31:48:99:580,0,504   1/1:0
    ,67:67:99:1962,174,0

    and lines like this for others

    1  866319  rs9988021   G   A   10799.89    PASS    AC=60;AF=1.00;AN=60;DB;DP=355;Dels=0.00;FS=0.000;HaplotypeScore=0.4327;InbreedingCoeff=-0.0165;MLEAC=60;MLEAF
    =1.00;MQ=57.66;MQ0=0;QD=30.42;VQSLOD=7.16;culprit=FS;set=snps   GT:AD:DP:FT:GQ:PL   1/1:0,7:7:FILTQDP:21:245,21,0   1/1:0,18:19:PASS:36:459,36,0    1/1:0,14:14:PASS:30:396,30,01/1:0,11:11:PASS:30:380,30,0    1/1:0,4:4:FILTQDP:12:145,12,0   1/1:0,7:7:FILTQDP:18:224,18,0   1/1:0,13:13:PASS:39:464,39,0    1/1:0,23:23:PASS:57:720,57,0    1/1:0,9:9:FILTQDP:27:
    327,27,0    1/1:0,6:6:FILTQDP:15:182,15,0   1/1:0,4:4:FILTQDP:9:120,9,0 1/1:0,10:10:FILTQDP:21:268,21,0 1/1:0,15:15:PASS:33:414,33,0    1/1:0,2:2:FILTQDP:6:68,6,0  1/1:0
    ,12:12:PASS:30:383,30,0 1/1:0,8:8:FILTQDP:15:198,15,0   1/1:0,9:9:FILTQDP:24:294,24,0   1/1:0,14:14:PASS:39:472,39,0    1/1:0,37:37:PASS:90:1102,90,0   1/1:0,19:19:PASS:48:603,48,01/1:0,28:28:PASS:69:865,69,0    1/1:0,38:38:PASS:96:1192,96,0   1/1:0,4:4:FILTQDP:12:133,12,0   1/1:0,5:5:FILTQDP:12:152,12,0   1/1:0,4:4:FILTQDP:6:77,6,0  1/1:0,5:5:FILTQDP:12:
    130,12,0    1/1:0,12:12:FILTQDP:27:325,27,0 1/1:0,6:6:FILTQDP:15:169,15,0   1/1:0,5:5:FILTQDP:12:151,12,0   1/1:0,5:5:FILTQDP:15:165,15,0

    the first one has a FORMAT field GT:AD:DP:GQ:PL while the second one contains my filter GT:AD:DP:FT:GQ:PL. is there a specific reason why that happens?

    thanks, F

  • ebanksebanks Posts: 679GATK Developer mod

    When all genotypes within a record pass filters, it looks like as an optimization is just leaves out the PASS from each one (since it is implicit).

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • flescaiflescai Posts: 53Member ✭✭

    Hi Eric @ebanks, thanks for your answer, very valuable. I believe this efficiency, however, makes the output VCF less "rigorous" (if I may say that), because the sample field is not represented in a unique way for all the variants. While knowing that now I can manage my grep/filtering, it might create issues with other downstream applications. Although it's not a high priority, would be great if the output could be uniform. thanks, Francesco

Sign In or Register to comment.