Bug Bulletin: The GenomeLocPArser error in SplitNCigarReads has been fixed; if you encounter it, use the latest nightly build.

PL for Genotype?

schwarls37schwarls37 College Station, TXPosts: 2Member

Hi -- I'm trying out GATK for the first time, but seeing some "confusing" results. I've just run what seems to be an essentially standard unifiedGenotyper on two sorted bam files (single samples P1003 and 03090), and gotten what appears to be heavily prior/other parameter driven results. Here is the strangeness I'm seeing at, e.g., Chr1 811659.

For the first sample: A called homozygous non ref (1/1) based on ~40ref ~10alt bases with PL favoring non ref or het call :207,21,0

For the second sample: A called het (0/1) based on ~44ref ~6alt bases with strong PL het evidence: 91,0,154

The behavior doesn't seem consistent, or at the very least, it seems quite sensitive to "small" data fluctuations.
At any rate, I don't find the calls particularly appealing.

I haven't had any luck looking through the documentation to see what exactly the PL field is telling me (model, prior(?), etc.) so that I can work out why the numbers come out the way they do (i.e., whether or not they're right/I like what they're trying to compute).
Can anyone point me in the right direction?
It's quite possible that I'm not using some of the command line parameters correctly, but I think I'm just using standard defaults and I don't see any other parameters that would seem to be crucial to these results.

Anyway, any help to expedite my learning curve here would be much appreciated.

Thanks! Scott

grep "^#" -v P1003_TruSeq_1_ATCACG_R1.gatk.raw.vcf | head -3063 | tail -1 Chr1 811659 . C G 178.80 . AC=2;AF=1.00;AN=2;BaseQRankSum=0.733;DP=50;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=11.84;MQ0=28;MQRankSum=3.665;QD=3.58;ReadPosRankSum=1.273 GT:AD:DP:GQ:PL 1/1:40,10:48:21:207,21,0

grep "^#" -v 03090_F2_TruSeq_8_ACTTGA_R1.gatk.raw.vcf | head -3063 | tail -1 Chr1 811659 . C G 62.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.672;DP=50;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=1;MLEAF=0.500;MQ=14.29;MQ0=29;MQRankSum=1.567;QD=1.26;ReadPosRankSum=-1.746 GT:AD:DP:GQ:PL 0/1:44,6:48:91:91,0,154

for f in P1003_TruSeq_1_ATCACG_R1.sorted.bam 03090_F2_TruSeq_8_ACTTGA_R1.sorted.bam do ff=${f%.sorted.bam} echo $ff java -jar /usr/local/GenomeAnalysisTK-2.6-5-gba531bd/GenomeAnalysisTK.jar \ --validation_strictness LENIENT \ -nt 4 \ -R all.fasta \ -T UnifiedGenotyper \ -I $ff.sorted.bam \ -o $ff.gatk.raw.vcf \ -L potential.variants.intervals \ --output_mode EMIT_ALL_SITES & done

Answers

  • pdexheimerpdexheimer Posts: 360Member, GSA Collaborator ✭✭✭

    PL is the posterior probability.

    Neither of these are particularly strong variants - relatively low Quality, low QD, extremely low MQ and high MQ0. I don't think either of these will survive any sort of filter. There's a FAQ article about something like "Why wasn't my variant called?" - that will help you with some of your learning curve. I'd also strongly recommend reading through the best practices - it looks like you're just running UG on a raw BAM file, one sample at a time. You'll definitely want to spend some time on the pre-processing steps, and probably do a joint calling run

  • schwarls37schwarls37 College Station, TXPosts: 2Member

    Thank you very much, pdex! Okay yes, I definitely will go through these resources: they sound like exactly where I need to start.
    I kind of just rushed into things here. And while I did attempt to acclimate myself to the gatk environment, it sounds like I'm just not there yet. Okay, Cheers! That's perfect!

Sign In or Register to comment.