Find single mutation with unified genotyper

ualtermannualtermann Posts: 8Member
edited July 2013 in Ask the GATK team

Hi, I try to find a single mutation with unified genotyper but can't find it. In my sample (human gene SOD1 from chr21) I removed the stop codon by 2 deletions. Than i run UnifiedeGenotyper as follows:

java -Xmx4g -jar ../GenomeAnalysisTK-2.5-2-gf57256b/GenomeAnalysisTK.jar -T UnifiedGenotyper -R ../human_g1k_v37.fasta -I test/chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889.bwa.bam -o test/chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889.bwa.raw.indels.vcf -D ../dbsnp_137.b37.vcf -nt 2 -glm INDEL -stand_emit_conf 10 -stand_call_conf 10 -deletions -1 -L test.intervals.bed

The resulting .vcf-file contains only the header but no results.

I played with the following parameters, but with no success.

Parameters are:

--indelGapOpenPenalty 5 -minIndelFrac 0 --min_indel_count_for_genotyping 1 -dt NONE --genotyping_mode GENOTYPE_GIVEN_ALLELES --output_mode EMIT_ALL_SITES

What could I do to find my mutation?

gz
gz
chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889.tar.gz
546K
Post edited by Geraldine_VdAuwera on

Best Answer

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin

    When you ran with --genotyping_mode GENOTYPE_GIVEN_ALLELES, what did you pass as alleles file? Also, what exactly is in your intervals bed file?

    Geraldine Van der Auwera, PhD

  • ualtermannualtermann Posts: 8Member

    Hi Geraldine,
    thank you to respond to my question at this time! The .bed-file contains only the position for chr21 (21\t0\t48129895\tdesc) to make the run shorter. The genotyping_mode parameter was my fault. This day I run some tests with other parameter values and now UnifiedeGenotyper finds some results. The full call now is:

    
    java -Xmx4g -jar ../GenomeAnalysisTK-2.5-2-gf57256b/GenomeAnalysisTK.jar -T UnifiedGenotyper -R ../human_g1k_v37.fasta -I test/chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889.bwa.bam -o test/chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889.bwa.raw.indels.vcf -D ../dbsnp_137.b37.vcf -nt 2 -glm INDEL -stand_emit_conf 10 -stand_call_conf 10 --filter_mismatching_base_and_quals -baq CALCULATE_AS_NECESSARY -deletions -1 -L test.intervals.bed -minIndelFrac 0 --min_indel_count_for_genotyping 1 --output_mode EMIT_ALL_SITES
    

    Results are:

    
    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889
    21  33040886    .   CAA .   10000037.27 .   DP=87;MQ=52.92;MQ0=0    GT  ./.
    21  33040887    .   AAT .   10000037.27 .   DP=83;MQ=52.56;MQ0=0    GT  ./.
    21  33040889    .   TAA .   10000037.27 .   DP=84;MQ=53.58;MQ0=0    GT  ./.
    

    The expected mutation is:

    
    deletion at 33040889:
    TTGTGGTGTAATTGGGATCGCCCAA-TA-AACATTCCCTTGGATGTAGTCTGAG
    TTGTGGTGTAATTGGGATCGCCCAA-  -AACATTCCCTTGGATGTAGTCTGAG
    

    If I remove the EMIT_ALL_SITES output_mode, no results will be found.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin

    Based on the results you get with EMIT_ALL_SITES, it looks like your sample is reference at that site. Did you look at the pileup in a genome viewer such as IGV, to check visually if the deletion is actually present?

    Geraldine Van der Auwera, PhD

  • ualtermannualtermann Posts: 8Member

    Hi Geraldine,
    the mutations are present in the pileup.

    
    samtools mpileup -f ../human_g1k_v37.fasta -r 21:33040884-33040893 test/chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889.bwa.bam
    [mpileup] 1 samples in 1 input files
     Set max per-file depth to 8000
    21  33040884    C   94  .$,$,$,..,,,,,,,,,,,..,..,t,..,,..,.,.,.,,..,..,..,,,,,.,,,,.,,.,..,..,,,,.,g..G,.,,,,,,,.,,,.,.,   2222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222
    21  33040885    C   91  ,$..,,,,,,,,,,,..,..,,,..,,..,.,.,.,,..,..,..,,,,,.,,,,.,,.,..,..,,,,.,,...,.,,,,,,,.,,,.,.,    2222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222
    21  33040886    C   90  .$.$,$,$,$,,,,,,,,..,..,,,..,,..,.,.,.,,T.,..,..,,,,,.,,,,.-2AA,,.,..,..,,,,.,,...,.,,,,,,,.,,,.,., 222222222222222222222222222222222221222222222222222222222222222222222222222222222222222222
    21  33040887    A   83  ,,,,,-2at,,-2at,-2at.-2AT.-2AT,-2at.-2AT.-2AT,-2at,-2at,-2at.-2AT.-2AT,-2at,-2at.-2AT.-2AT,-2at.-2AT,-2at.-2AT,-2at.-2AT,-2at,-2at.-2AT.-2AT,-2at.-2AT,-2at.-2AT.-2AT,-2at,-2at,-2at,-2at,-2at.-2AT,-2at,-2at,-2at,-2at*,-2at,-2at.-2AT,-2at.-2AT.-2AT,-2at.-2AT.-2AT,-2at,-2at,-2at,-2at.-2AT,-2at,-2at.-2AT.-2AT.-2AT,-2at.-2AT,-2at,-2at,-2at,-2at,-2at,-2at.-2AT,-2at,-2at,-2at.-2AT,-2at.-2AT,-2at 00222222222222222222222222222202222212222222222222222222122222222222222222222222222
    21  33040888    A   1   *   2
    21  33040889    T   1   .   2
    21  33040890    A   1   .   2
    21  33040891    A   85  ,,,..,..,,,..,,..,.,.,.,,..,.,..,,,,.,,,,.,,.,..,..,,,,.,,...,.,,,,,,.,,,.,.,,,,,,.,,   2222222222222122222222222222222022222222222212222202222222222222222222222222222221222
    21  33040892    A   88  ,,,..,..,,,..,,..,.,.,.,,..,.,..,,,,.,,,,.,,.,..,..,,,,.,,...,.,,,,,,.,,,.,.,,,,,,.,,,^],^],    2222222222222222222222222222222222222222222222222222222222222222222222222222222222202222
    21  33040893    C   89  ,,,..,..,,,..g,..,.,.,.,,..,.,..,,,,.,,,,.,,T,..,..,,,,.,,...,.,,,,,,.,,,.,.,,,,,,.,,,,,^], 22222222222222222222222222222222222222222222222222222222222222222222222222222222222222222
    
    

    and the mutaion is found in the bamfile:

    
    $ samtools view chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889.bwa.bam | grep GGGATCGCCCAAAACATTCCCTTGGAT | wc -l
    34
    

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin

    I see. Did you perform indel realignment on your bam file? And have you tried using HaplotypeCaller rather than the UG? It is better for calling indels.

    Geraldine Van der Auwera, PhD

  • ualtermannualtermann Posts: 8Member

    Hi Geraldine,
    thank you for your suggestions! No, indel realignment was not performed before. The HaplotypeCaller was used but even with no results.
    Based on your hints, I performed a new run.

    Run RealignerTargetCreator:

    
    java -jar ../../bin/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ../../resources/human_g1k_v37.fasta -I chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889.bwa.bam -o realigner.intervals
    



    Run IndelRealigner:

    
    java -Xmx4g -jar ../../bin/GenomeAnalysisTK.jar -T IndelRealigner -R ../../resources/human_g1k_v37.fasta -I chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889.bwa.bam -targetIntervals realigner.intervals -o chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889.bwa.real.bam
    

    The contend of the realigner.intervals file:

    
    $ cat realigner.intervals 
    21:33040887-33040889
    



    Run UnifiedGenotyper:

    
     java -Xmx4g -jar ../../bin/GenomeAnalysisTK.jar -T UnifiedGenotyper -R ../../resources/human_g1k_v37.fasta -I chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889.bwa.real.bam -o chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889.bwa.real.raw.indels3.vcf -D ../../resources/dbsnp_137.b37.vcf -nt 2 -glm INDEL -stand_emit_conf 10 -stand_call_conf 10 --filter_mismatching_base_and_quals -baq CALCULATE_AS_NECESSARY -deletions -1 -L ../intervals.bed -minIndelFrac 0 --min_indel_count_for_genotyping 1 --output_mode EMIT_ALL_SITES
    

    Here we have the same results as before. With EMIT_ALL_SITES somthing is found, without nothing is found.

    
    $ tail *.indels2.vcf
    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889
    

    
    $ tail *.indels3.vcf
    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889
    21  33040886    .   CAA .   10000037.27 .   DP=91;MQ=52.84;MQ0=0    GT  ./.
    21  33040887    .   AAT .   10000037.27 .   DP=89;MQ=52.96;MQ0=0    GT  ./.
    

    Than i tried the HaplotypeGenotyper on the realigned bam file...

    Run HaplotypeCaller:

    
    java -Xmx4g -jar ../../bin/GenomeAnalysisTK.jar -T HaplotypeCaller -R ../../resources/human_g1k_v37.fasta -I chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889.bwa.real.bam -o chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889.bwa.real.raw.indelsnps.vcf -D ../../resources/dbsnp_137.b37.vcf -stand_emit_conf 10 -stand_call_conf 10 -L ../intervals.bed
    

    ... but unfortunately without any success.

    
    $ tail *.indelsnps.vcf
    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889
    


    What could I try next?

  • pdexheimerpdexheimer Posts: 387Member, GSA Collaborator ✭✭✭✭

    I'm pretty sure you need to run UnifiedGenotyper with GGA if you're using the EMIT_ALL_SITES/INDEL combination. It looks like your pileup gives a pretty good indication of the deletion you're expecting, so you should be able to make an alleles file from that.

    Several versions ago, I ran into a similar problem with HC not finding a known deletion in my data. I had much higher coverage than it looks like you do (around 700-1000x), but the advice given to me at that time was to increase the value of -minPruning. HaplotypeCaller's gotten a lot of work since then, so I don't know if that would solve your problem or not, but you might want to try it with a minPruning of 5 or so. There are also a couple of debug parameters (-debug, -bamout, maybe others) that could help you understand why it's missing the variation

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin

    Those are all good points from @pdexheimer. Be sure to provide the alleles file (containing the deletion you are looking for) when you run in GENOTYPE_GIVEN_ALLELES mode. Using the -bamout option can also be very helpful to understand how the HC is interpreting your region of interest, as documented here.

    Geraldine Van der Auwera, PhD

  • ualtermannualtermann Posts: 8Member

    Hi all and thank you for the hints!

    Please can you tell me how to create an alleles file? I searched google but without any success
    @pdexheimer what did you mean by 'GGA'? For what stand this abbreviation?

    I run HC in debug mode and with the other recommended parameters.

    • minPruning of 5 did not fix the problem, no results found
    • bamout produces a file containing only the header
    • debug mode gives the following informations:
    
    INFO  21:53:09,914 ProgressMeter -        Location processed.active regions  runtime per.1M.active regions completed total.runtime remaining 
    INFO  21:53:33,089 HaplotypeCaller - Assembling 21:33040846-33040931 with 672 reads:    (with overlap region = 21:33040646-33041131) 
    INFO  21:53:33,130 DeBruijnAssembler - Found only the reference haplotype in the assembly graph. 
    INFO  21:53:33,130 DeBruijnAssembler - AAATATTAGTATATCTCTCTACTAGGATTAATGTTATTTTTCTAATATTATGAGGTTCTTAAACATCTTTTGGGTATTGTTGGGAGGAGGTAGTGATTACTTGACAGCCCAAAGTTATCTTCTTAAAATTTTTTACAGGTCCATGAAAAAGCAGATGACTTGGGCAAAGGTGGAAATGAAGAAAGTACAAAGACAGGAAACGCTGGAAGTCGTTTGGCTTGTGGTGTAATTGGGATCGCCCAATAAACATTCCCTTGGATGTAGTCTGAGGCCCCTTAACTCATCTGTTATCCTGCTAGCTGTAGAAATGTATCCTGATAAACATTAAACACTGTAATCTTAAAAGTGTAATTGTGTGACTTTTTCAGAGTTGCTTTAAAGTACCTGTAGTGAGAAACTGATTTATGATCACTTGGAAGATTTGTATAGTTTTATAAAACTCAGTTAAAATGTCTGTTTCAATGACCTGTATTTTGCCAGACTTAA 
    INFO  21:53:33,130 DeBruijnAssembler - > Cigar = 486M : 486 score 1.7976931348623157E308 
    INFO  21:53:33,131 EventMap - === Best Haplotypes === 
    INFO  21:53:33,132 EventMap - AAATATTAGTATATCTCTCTACTAGGATTAATGTTATTTTTCTAATATTATGAGGTTCTTAAACATCTTTTGGGTATTGTTGGGAGGAGGTAGTGATTACTTGACAGCCCAAAGTTATCTTCTTAAAATTTTTTACAGGTCCATGAAAAAGCAGATGACTTGGGCAAAGGTGGAAATGAAGAAAGTACAAAGACAGGAAACGCTGGAAGTCGTTTGGCTTGTGGTGTAATTGGGATCGCCCAATAAACATTCCCTTGGATGTAGTCTGAGGCCCCTTAACTCATCTGTTATCCTGCTAGCTGTAGAAATGTATCCTGATAAACATTAAACACTGTAATCTTAAAAGTGTAATTGTGTGACTTTTTCAGAGTTGCTTTAAAGTACCTGTAGTGAGAAACTGATTTATGATCACTTGGAAGATTTGTATAGTTTTATAAAACTCAGTTAAAATGTCTGTTTCAATGACCTGTATTTTGCCAGACTTAA 
    INFO  21:53:33,132 EventMap - > Cigar = 486M 
    

    How should I interpred this output? The range is ok, because the expected mutation is within the start and end position, but what did "Found only the reference haplotype in the assembly graph." mean? The range is found but no mutation?

  • pdexheimerpdexheimer Posts: 387Member, GSA Collaborator ✭✭✭✭

    GGA = GENOTYPE_GIVEN_ALLELES.

    It looks like HC is pretty convinced that there's no variation. It may be worth looking more closely at your alignments - do the alignments carrying the deletion have a very low MAPQ, for instance?

  • ualtermannualtermann Posts: 8Member
    edited July 2013

    Hi pdexheimer,
    the MAPQ is about 60

    
    $ ../../bin/samtools view chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889.bwa.real.bam | grep GATCGCCCAAAACATTCCCTT | awk '{print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6}'
    21_8941_8897_1_0_0_0_1:0:0_1:0:0_371    163 21  33040831    60  57M2D13M
    21_8897_8783_1_0_0_0_0:0:0_2:0:0_9e8    83  21  33040831    60  57M2D13M
    21_8716_8898_0_1_0_0_0:0:0_3:0:0_1953   147 21  33040832    29  56M2D14M
    21_8898_8701_1_0_0_0_1:0:0_0:0:0_114d   83  21  33040832    60  56M2D14M
    21_9096_8899_1_0_0_0_0:0:0_2:0:0_17f1   163 21  33040833    60  55M2D15M
    21_8900_8693_1_0_0_0_4:0:0_1:0:0_1b5    83  21  33040834    29  54M2D16M
    21_8901_9096_0_1_0_0_1:0:0_1:0:0_b9b    99  21  33040835    60  53M2D17M
    ...
    21_8939_9053_0_1_0_0_1:0:0_4:0:0_61e    99  21  33040873    60  15M2D55M
    21_8939_8718_1_0_0_0_1:0:0_0:0:0_10a0   83  21  33040873    60  15M2D55M
    21_9077_8941_1_0_0_0_1:0:0_2:0:0_107d   163 21  33040875    29  13M2D57M
    21_8941_8826_1_0_0_0_0:0:0_1:0:0_2dc    83  21  33040875    60  13M2D57M
    21_8941_8897_1_0_0_0_1:0:0_1:0:0_371    83  21  33040875    60  13M2D57M
    21_8944_8716_1_0_0_0_2:0:0_0:0:0_1a19   83  21  33040878    60  10M2D60M
    21_8762_8945_0_1_0_0_5:0:0_0:0:0_c67    147 21  33040879    29  9M2D61M
    

    Post edited by ualtermann on
  • pdexheimerpdexheimer Posts: 387Member, GSA Collaborator ✭✭✭✭

    I'm running out of ideas. I can't read all the details in the samtools pileup (I just never use it), but it looks like this is homozygous deletion, right? Depths drop from 85-90 to 2? So I wouldn't expect this to be affected by the contamination filter. What did GGA mode give you?

  • KurtKurt Posts: 174Member ✭✭✭

    It's been a long while for me, but I think the last column (field #6) in the pileup represents the base call quality score. If this is Sanger encoding, then a "2" = phred 17 which i believe should be filtered out by the default settings of UG/HC. I can't remember if the default cut-off is =< 17 or just <17.

    `21 33040886 C 90 .$.$,$,$,$,,,,,,,,..,..,,,..,,..,.,.,.,,T.,..,..,,,,,.,,,,.-2AA,,.,..,..,,,,.,,...,.,,,,,,,.,,,.,., 222222222222222222222222222222222221222222222222222222222222222222222222222222222222222222'

  • KurtKurt Posts: 174Member ✭✭✭

    Yeah, the 6th column is base qualities, which are all 15-17 all around the region...so I guess they are all getting excluded.

    http://samtools.sourceforge.net/pileup.shtml

  • ualtermannualtermann Posts: 8Member
    edited July 2013

    @Kurt:
    If i run HC only HCMappingQualityFilter filtered out reads with a MAPQ (5ft column in the bam file) lower than the default of 20.

     
    INFO  12:32:21,808 DeBruijnAssembler - Found only the reference haplotype in the assembly graph. 
    INFO  12:32:21,809 DeBruijnAssembler - AAATATTAGTATATCTCTCTACTAGGATTAATGTTATTTTTCTAATATTATGAGGTTCTTAAACATCTTTTGGGTATTGTTGGGAGGAGGTAGTGATTACTTGACAGCCCAAAGTTATCTTCTTAAAATTTTTTACAGGTCCATGAAAAAGCAGATGACTTGGGCAAAGGTGGAAATGAAGAAAGTACAAAGACAGGAAACGCTGGAAGTCGTTTGGCTTGTGGTGTAATTGGGATCGCCCAATAAACATTCCCTTGGATGTAGTCTGAGGCCCCTTAACTCATCTGTTATCCTGCTAGCTGTAGAAATGTATCCTGATAAACATTAAACACTGTAATCTTAAAAGTGTAATTGTGTGACTTTTTCAGAGTTGCTTTAAAGTACCTGTAGTGAGAAACTGATTTATGATCACTTGGAAGATTTGTATAGTTTTATAAAACTCAGTTAAAATGTCTGTTTCAATGACCTGTATTTTGCCAGACTTAA 
    INFO  12:32:21,809 DeBruijnAssembler - > Cigar = 486M : 486 score 1.7976931348623157E308 
    INFO  12:32:21,810 EventMap - === Best Haplotypes === 
    INFO  12:32:21,811 EventMap - AAATATTAGTATATCTCTCTACTAGGATTAATGTTATTTTTCTAATATTATGAGGTTCTTAAACATCTTTTGGGTATTGTTGGGAGGAGGTAGTGATTACTTGACAGCCCAAAGTTATCTTCTTAAAATTTTTTACAGGTCCATGAAAAAGCAGATGACTTGGGCAAAGGTGGAAATGAAGAAAGTACAAAGACAGGAAACGCTGGAAGTCGTTTGGCTTGTGGTGTAATTGGGATCGCCCAATAAACATTCCCTTGGATGTAGTCTGAGGCCCCTTAACTCATCTGTTATCCTGCTAGCTGTAGAAATGTATCCTGATAAACATTAAACACTGTAATCTTAAAAGTGTAATTGTGTGACTTTTTCAGAGTTGCTTTAAAGTACCTGTAGTGAGAAACTGATTTATGATCACTTGGAAGATTTGTATAGTTTTATAAAACTCAGTTAAAATGTCTGTTTCAATGACCTGTATTTTGCCAGACTTAA 
    INFO  12:32:21,812 EventMap - > Cigar = 486M 
    INFO  12:32:21,812 EventMap - >> Events = EventMap{} 
    INFO  12:32:28,630 ProgressMeter -     21:41321778        3.30e+07   30.0 s        0.0 s     85.9%        34.0 s     4.0 s 
    INFO  12:32:33,410 HaplotypeCaller - Ran local assembly on 1 active regions 
    INFO  12:32:33,415 ProgressMeter -            done        4.81e+07   34.0 s        0.0 s    100.0%        34.0 s     0.0 s 
    INFO  12:32:33,416 ProgressMeter - Total runtime 34.79 secs, 0.58 min, 0.01 hours 
    INFO  12:32:33,416 MicroScheduler - 116 reads were filtered out during traversal out of 13229 total (0.88%) 
    INFO  12:32:33,416 MicroScheduler -   -> 116 reads (0.88% of total) failing HCMappingQualityFilter 
    INFO  12:32:34,816 GATKRunReport - Uploaded run statistics report to AWS S3 
    

    I checked the contend of the bam file and found the excluded read :

    
    $ ../../bin/samtools view chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889.bwa.bam | awk '{if($3==21 && $5 < 20){ print $0}}' | wc -l
    116
    

    also the used reads:

    
    $ ../../bin/samtools view chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889.bwa.bam | awk '{if($3==21){print $0}}' | wc -l
    13229
    

    These reads dosn't contain the mutations...

    
    $ ../../bin/samtools view chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889.bwa.bam | awk '{if($3==21 && $5 < 20){ print $0}}' | grep GATCGCCCAAAACATTCCCTT | wc -l
    0
    

    ... but the rest.

    
    $ ../../bin/samtools view chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889.bwa.bam | awk '{if($3==21 && $5 >= 20){ print $0}}' | grep GATCGCCCAAAACATTCCCTT | wc -l
    43
    


    @pdexheimer:
    I did not try with GGA because I dont know how to create a alleles file (I mentioned it some posts before).
    Please can you tell me how to prepare such an file?

    Post edited by ualtermann on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin

    The alleles file just needs to be a valid VCF with a variant record of the site and allele that you are looking for. You can create it by modifying the VCF you get from calling that interval, just substitute the desired allele in place of the reference call in the ALT field.

    Geraldine Van der Auwera, PhD

  • KurtKurt Posts: 174Member ✭✭✭

    Also, out of curiosity, is this ion torrent data?

  • ualtermannualtermann Posts: 8Member

    @Geraldine:
    I prepared the following knownalleles.vcf

    
    $ cat knownalleles.vcf
    ##fileformat=VCFv4.1
    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889
    21  33040889    .   T   .   100 .   .   GT  1/1
    21  33040890    .   A   .   100 .   .   GT  1/1
    

    because i expected tow deletions on this positions (please correct me if the knownalleles.vcf format is wrong). With this HC did not find the mutations.

    @Kurt:
    The recommended parameter will be ignored in cases of INDEL calling. Nevertheless your hint of the base quality was right!

    To prove this, I changed the base quality of the bam file by hand from 2 to 7

    
    ../../bin/samtools view -h chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889.bwa.bam | awk -F"\t" -v OFS="\t" '{ gsub(/2/,"7",$11); print $0;}' > chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889.bwa.repl.sam
    
    ../../bin/samtools view -S -b chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889.bwa.repl.sam > chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889.bwa.repl.bam

    
    $ ../../bin/samtools view chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889.bwa.repl.bam | grep GATCGCCCAAAACATTCCCTT
    21_8941_8897_1_0_0_0_1:0:0_1:0:0_371    163 21  33040831    60  57M2D13M    =   33040875    116 TACAAAGACAGGAAACGCTGGAAGTCGTTTGGCTTCTGGTGTAATTGGGATCGCCCAAAACATTCCCTTG  7777777777777777777777777777777777777777777777777777777777777777777777  X0:i:1  X1:i:0  MD:Z:35G21^AT13RG:Z:chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889    XG:i:2  AM:i:37 NM:i:3  SM:i:37 XM:i:1  XO:i:1  XT:A:U
    21_8897_8783_1_0_0_0_0:0:0_2:0:0_9e8    83  21  33040831    60  57M2D13M    =   33040717    -186    TACAAAGACAGGAAACGCTGGAAGTCGTTTGGCTTGTGGTGTAATTGGGATCGCCCAAAACATTCCCTTG  7777777777777777777777777777777777777777777777777777777777777777777777  X0:i:1  X1:i:0  MD:Z:57^AT13    RG:Z:chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889   XG:i:2  AM:i:37 NM:i:2  SM:i:37 XM:i:0  XO:i:1  XT:A:U
    21_8716_8898_0_1_0_0_0:0:0_3:0:0_1953   147 21  33040832    29  56M2D14M    =   33040650    -254    ACAAAGACAGGCAACGCTGGAAGTCGTATGGCTTTTGGTGTAATTGGGATCGCCCAAAACATTCCCTTGG  7777777777777777777777777777777777777777777777777777777777777777777777  MD:Z:11A15T6G21^AT14    RG:Z:chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889   XG:i:2  AM:i:29 NM:i:5  SM:i:29 XM:i:3  XO:i:1  XT:A:M
    21_8898_8701_1_0_0_0_1:0:0_0:0:0_114d   83  21  33040832    60  56M2D14M    =   33040635    -269    ACAAAGACAGGAAACGCTGGAATTCGTTTGGCTTGTGGTGTAATTGGGATCGCCCAAAACATTCCCTTGG  7777777777777777777777777777777777777777777777777777777777777777777777  X0:i:1  X1:i:0  MD:Z:22G33^AT14RG:Z:chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889    XG:i:2  AM:i:37 NM:i:3  SM:i:37 XM:i:1  XO:i:1  XT:A:U
    21_9096_8899_1_0_0_0_0:0:0_2:0:0_17f1   163 21  33040833    60  55M2D15M    =   33041032    269 CAAAGACAGGACACGCTGGAAGTCGTTTGGCTTGTCGTGTAATTGGGATCGCCCAAAACATTCCCTTGGA  7777777777777777777777777777777777777777777777777777777777777777777777  X0:i:1  X1:i:0  MD:Z:11A23G19^AT15  RG:Z:chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889   XG:i:2  AM:i:37 NM:i:4  SM:i:37 XM:i:2  XO:i:1  XT:A:U
    ...
    21_8941_8826_1_0_0_0_0:0:0_1:0:0_2dc    83  21  33040875    60  13M2D57M    =   33040760    -187    TTGGGATCGCCCAAAACATTCCCTTGGATGTAGTCTGAGGCCCCTTAACTCATCTGTTATCCTGCTAGCT  7777777777777777777777777777777777777777777777777777777777777777777777  X0:i:1  X1:i:0  MD:Z:13^AT57    RG:Z:chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889   XG:i:2  AM:i:37 NM:i:2  SM:i:37 XM:i:0  XO:i:1  XT:A:U
    21_8941_8897_1_0_0_0_1:0:0_1:0:0_371    83  21  33040875    60  13M2D57M    =   33040831    -116    TTGGGATCGCCCAAAACATTCCCTTGGATGTAGTCTGAGGCCCCTTAACTCATCTGTTATTCTGCTAGCT  7777777777777777777777777777777777777777777777777777777777777777777777  X0:i:1  X1:i:0  MD:Z:13^AT47C9RG:Z:chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889 XG:i:2  AM:i:37 NM:i:3  SM:i:37 XM:i:1  XO:i:1  XT:A:U
    21_8944_8716_1_0_0_0_2:0:0_0:0:0_1a19   83  21  33040878    60  10M2D60M    =   33040650    -300    GGATCGCCCAAAACATTCCCTTGGATGTAGTCTGAGGCCGCTTAACTCATCTGTTATCCTGCTAGCTGGA  7777777777777777777777777777777777777777777777777777777777777777777777  X0:i:1  X1:i:0  MD:Z:10^AT29C28T1   RG:Z:chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889   XG:i:2  AM:i:37 NM:i:4  SM:i:37 XM:i:2  XO:i:1  XT:A:U
    21_8762_8945_0_1_0_0_5:0:0_0:0:0_c67    147 21  33040879    29  9M2D61M =   33040702    -249    GATCGCCCAAAACATTCCCTTGGATGTAGTCTGAGGCCCCTTAACTCATCTGTTATCCTGCTAGCTGTAG  7777777777777777777777777777777777777777777777777777777777777777777777  X0:i:1  X1:i:0  MD:Z:9^AT61 RG:Z:chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889   XG:i:2  AM:i:29 NM:i:2  SM:i:29 XM:i:0  XO:i:1  XT:A:U
    

    ... ,run UG with the changed bam file...

    
    java -Xmx4g -jar ../../bin/GenomeAnalysisTK.jar -T UnifiedGenotyper -R ../../resources/human_g1k_v37.fasta -I chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889.bwa.repl.bam -o chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889.bwa.repl.raw.indels4.vcf -D ../../resources/dbsnp_137.b37.vcf -nt 2 -glm INDEL -stand_emit_conf 10 -stand_call_conf 10 --filter_mismatching_base_and_quals -deletions -1 -L ../../scripts/intervals.bed -minIndelFrac 0 -minIndelCnt 1
    

    ... and the deletions will be found.

    
    $ tail chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889.bwa.repl.raw.indels4.vcf
    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889
    21  33040886    .   CAA C   2257.73 .   AC=2;AF=1.00;AN=2;BaseQRankSum=0.563;DP=89;FS=3.233;MLEAC=2;MLEAF=1.00;MQ=51.54;MQ0=0;MQRankSum=-0.087;QD=12.68;ReadPosRankSum=1.689    GT:AD:DP:GQ:PL  1/1:1,79:89:99:2295,143,0
    21  33040887    .   AAT A   4177.73 .   AC=2;AF=1.00;AN=2;DP=88;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=51.74;MQ0=0;QD=23.74 GT:AD:DP:GQ:PL  1/1:0,83:88:99:4215,250,0
    21  33040888    .   AT  A   1195.73 .   AC=2;AF=1.00;AN=2;DP=88;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=51.74;MQ0=0;QD=13.59 GT:AD:DP:GQ:PL  1/1:0,83:88:99:1233,233,0
    

    The same for the realigned bam file:

    
    $ tail chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889.bwa.real.repl.raw.indels4.vcf
    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889
    21  33040886    .   CAA C   2309.73 .   AC=2;AF=1.00;AN=2;BaseQRankSum=0.591;DP=91;FS=3.340;MLEAC=2;MLEAF=1.00;MQ=52.84;MQ0=0;MQRankSum=-0.211;QD=12.69;ReadPosRankSum=1.605    GT:AD:DP:GQ:PL  1/1:1,81:91:99:2347,149,0
    21  33040887    .   AAT A   4220.73 .   AC=2;AF=1.00;AN=2;DP=89;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=52.96;MQ0=0;QD=23.71 GT:AD:DP:GQ:PL  1/1:0,84:89:99:4258,253,0
    



    Even so HC did find the mutations:

    
    java -Xmx4g -jar ../../bin/GenomeAnalysisTK.jar -T HaplotypeCaller -R ../../resources/human_g1k_v37.fasta -I chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889.bwa.real.repl.bam -o chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889.bwa.real.repl.raw.indelsnps4.vcf -D ../../resources/dbsnp_137.b37.vcf -stand_emit_conf 10 -stand_call_conf 10 -L ../../scripts/intervals.bed

    
    $ tail *real.repl.raw.indelsnps4.vcf
    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  chr21_SOD1_ENSG00000142168_removedStop_DEL_2_at_33040889
    21  33040887    .   AAT A   2722.73 .   AC=2;AF=1.00;AN=2;DP=57;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=59.09;MQ0=0;QD=23.88 GT:AD:GQ:PL 1/1:0,55:99:2760,168,0
    

    Many, many thanks @ALL

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin

    Glad you found your mutation. For the record, your known alleles file is not helpful because it just indicates what you expect the reference to be, not the allele you're looking for. In the ALT column you should indicate the mutation allele, as it appears in the successful HC call you posted. It should contain a line with:

     21 33040887 . AAT A [etc.]
    

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.