Download the latest Picard release at https://github.com/broadinstitute/picard/releases.
GATK version 4.beta.5 is out. See the GATK4 beta page for download and details.

vcf output question

In addition to the standard mutect output, I'm interested in vcf output, and was happy to find

a previous related question showing how to output vcf. However, I seem to be having some

trouble with what I think is misformed output. Specifically, the genotype field is "0" for

normal and "0/1" for tumor on every line

#CHROM  POS       ID           REF  ALT  QUAL  FILTER  INFO               FORMAT             normal  tumor  
7       55230840  rs7781264    A    G    .     REJECT  DB                 GT:AD:BQ:DP:FA     0:0,1:.:1:1.00                0/1:0,27:29:27:1.00           
7       55233109  rs150899403  G    A    .     PASS    DB;SOMATIC;VT=SNP  GT:AD:BQ:DP:FA:SS  0:134,0:.:132:0.00:0          0/1:380,296:24:688:0.438:2    
7       55233265  .            A    C    .     REJECT  .                  GT:AD:BQ:DP:FA     0:6,0:.:6:0.00                0/1:251,24:12:275:0.087       

The corresponding lines in the mutect output file are

## muTector v1.0.47986
contig  position  context  ref_allele  alt_allele  tumor_name   normal_name   score  dbsnp_site    covered    power     tumor_power  normal_power  total_pairs  improper_pairs  map_Q0_reads  t_lod_fstar  tumor_f   contaminant_fraction  contaminant_lod  t_ref_count  t_alt_count  t_ref_sum  t_alt_sum  t_ref_max_mapq  t_alt_max_mapq  t_ins_count  t_del_count  normal_best_gt  init_n_lod   n_ref_count  n_alt_count  n_ref_sum  n_alt_sum  judgement
7       55230840  ACTxTGC  A           G           tumor        normal        0      DBSNP         UNCOVERED  0         0.612407     0             30           1               0             93.647278    1         0.02                  -0.236654        0            27           0          808        0               70              0            0            GG              -3.882263    0            1            0          30         REJECT 
7       55233109  TGTxCCA  G           A           tumor        normal        0      DBSNP+COSMIC  COVERED    1         1            1             1140         3               7             665.64967    0.43787   0.02                  28.941878        380          296          10901      7263       70              70              0            0            GG              40.298595    134          0            4097       0          KEEP   
7       55233265  CCCxCAG  A           C           tumor        normal        0      NOVEL         UNCOVERED  0         1            0             305          5               0             8.112745     0.087273  0.02                  2.430961         251          24           6289       302        70              70              0            0            AA              1.803681     6            0            154        0          REJECT 

If it matters, this was with openjdk 1.6:

$ /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/jre/bin/java -version
java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.5) (rhel-1.50.1.11.5.el6_3-x86_64)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

Any idea what might be causing this, and is there anything you or I can do to fix it?

Thanks,
Kevin

Answers

  • I guess that's a little hard to read. Here are the actual files.

  • kcibulkcibul Cambridge, MAMember, Broadie, Dev

    Hi -- I believe what you're seeing is actually by design. The VCF output only contains the "passing" mutations (KEEP), and to be such a mutation you are variant in the tumor (0/1) and reference in the normal (0/0).

    Does that explain what you're seeing?

  • According to the vcf spec, an allele of "0" refers to a haploid call, and an allele of "0/0" refers to a diploid call. Since the "0" calls were for the normal (which should be diploid), I expected them to be "0/0" (except perhaps on the X or Y chromosome in a male sample).

    It's minor, and may not ever cause any problems, but the GATK folks, in general, have been pretty good at strictly following the specification to minimize any confusion in the future. You might consider modifying the code to output "0/0" for diploid normal calls.

  • UltimaSeqUltimaSeq Member
    edited June 2013

    Regarding genotype format in vcf, I also noticed that some variants in the tumor sample are genotyped as "0/1" even if the ref allele coverage is zero.
    Eg.

    chr10 105815204 . A T . PASS SOMATIC;VT=SNP GT:AD:BQ:DP:FA:SS 0/1:0,16:31:16:1.00:2 0:19,0:.:19:0.00:0

  • This actually does cause a problem in downstream tools (Annovar) which expects genotypes to be diploid. You'll get an error like:

    WARNING: Skipped 1 invalid genotype records in input file
    
  • (1) The genotype field is "0" for normal:
    As you say, genotype file should be "0/0" for normal instead of “0”, and its always 0/0 because you don't expect to find your tumor mutations in the normal.

    (2) The genotype field is "0/1" for tumor on every line:
    For genotyping, ie, whether the mutation is in one copy (0/1,1/0) or both copies of the gene (1/1), one must take into account the purity of tumor sample.
    If the purity is 100%, mutations found with a frequency greater than about 60% will be considered homozygous (1/1), and less than 60% will be considered heterozygous (0/1, 1/0). (http://www.bloodjournal.org/content/bloodjournal/121/9/1604.full.pdf)

    However, most of the times is almost impossible to obtain a 100% pure tumor sample, that means it will be sequenced "with contamination". So if the sample has a purity, for example, of 40%, we would never get mutations at a frequency greater than 60% and therefore considered homozygous (1/1) by variant calling methods. So maybe that's why the genotype field is "0/1" for tumor on every line.

    Another problem is that tumors are usually heterozygous: we get mutations present along the entire tumor, but also mutations present in subpopulations of this tumor! And determining these mutation genotypes will be much more difficult because its frequencies are conditioned with the size of its population within the tumor. For instance, considering a pure sample, a mutation with a frequency of 15% will be considered heterozygous (1/0), but it could be homozygous (1/1) if the mutation belongs to a tumor subpopulation that takes up 20% of the total tumor area.

    So you need to know the tumor sample purity to check genotypes in a vcf file. My problem is I didn't know it, so I tried to find out an approximation by analysing the maximum frequencies of found mutations. Thus, if in your tumor sample the maximum frequency between all found mutations you get is 80%, you could think your purity is more or less around 80%.

Sign In or Register to comment.