Ordering of tumor-normal column in output VCF file

Hi,

I have one normal sample and several tumor samples to compare against. I've been running Mutect to compare the normal sample to each tumor sample, and noticed some odd behavior in the output VCF file. In each run, I use essentially the same Mutect command (shown below), only changing the path specifying the tumor bam file. However, in the output VCF file, sometimes the column for the normal sample will appear before the column for the tumor sample, and sometimes the column for the normal sample will appear after the column for the tumor sample. Do you know what determines the order of the tumor-normal columns in the output VCF file? Just wondering if there might have been any mistake in my output, or if it's normal to see different column orderings. Thanks for the help!

Sincerely,
Henry

java -Xmx2g -jar -Djava.io.tmpdir=$TEMPDIR/${region} $MUTECTDIR/muTect-1.1.4.jar \
--analysis_type MuTect \
--enable_extended_output \
--reference_sequence $RESDIR/human_g1k_v37.fasta \
--cosmic $RESDIR/b37_cosmic_v54_120711.vcf \
--dbsnp $RESDIR/dbsnp_132_b37.leftAligned.vcf \
--input_file:normal ${NORMAL_PATH}/${region}.bam \
--input_file:tumor ${TUMOR_PATH}/${region}.bam \
--vcf $OUTDIR/${region}.vcf \
--out $OUTDIR/${region}.out \
--coverage_file $OUTDIR/${region}.wig.txt

Answers

  • rnahar (Singapore; Member)

    Yes, that's true; I have observed the same thing. It creates problems when you use the file for annotation, because the order of the sample columns is not consistent across VCF files. I hope this can be fixed in MuTect.

  • Ah, I see. Thanks for confirming the issue, rnahar. I hope they can fix it in the future too. Happy holidays!

  • Yes, I've known about this issue for some time as well. It's a minor annoyance when I have a handful of data sets, but it becomes a major pain when I have a ton of data sets and have to check, for each file, how the NORMAL and TUMOR columns were ordered.

  • Has the ordering been fixed in MuTect 1.1.7?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    I don't think so but I'll check with the developers tomorrow.

  • Thank you. That would be very helpful.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hey folks, sorry for the late response. This question slipped through my net.

    The developers were surprised to hear about this and asked for some additional details. Can one of you please post a few records from the outputs that show the different ordering?

  • Madhu (United States; Member)
    edited February 2015

    I faced the same issue in the VCF output with MuTect version 1.1.4. In the example below I used the same command for both samples.
    Example : sample 1
    CHROM POS ID REF ALT QUAL FILTER INFO FORMAT tumor normal
    chr1 1198749 . C A . PASS SOMATIC;VT=SNP GT:AD:BQ:DP:FA:SS 0/1:21,23:29:44:0.523:2 0:41,0:.:41:0.00:0
    chr1 2419144 . T A . PASS SOMATIC;VT=SNP GT:AD:BQ:DP:FA:SS 0/1:9,7:31:16:0.438:2 0:24,0:.:24:0.00:0

    sample 2:
    CHROM POS ID REF ALT QUAL FILTER INFO FORMAT normal tumor
    chr1 1690583 . G A . PASS SOMATIC;VT=SNP GT:AD:BQ:DP:FA:SS 0:9,0:.:9:0.00:0 0/1:20,3:30:23:0.130:2
    chr1 3328659 rs2493292 C T . PASS DB;SOMATIC;VT=SNP GT:AD:BQ:DP:FA:SS 0:36,0:.:36:0.00:0 0/1:37,3:33:40:0.075:2

    Please note that the order of the normal and tumor columns at the end is reversed. This can lead to misleading downstream analysis.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Thanks for posting this, @Madhu. What you're seeing is not something we consider a problem. Basically, we cannot guarantee the order of sample columns, so your downstream analysis should not assume conservation of column order. Instead, it should use the column names to properly identify columns when parsing the data. That will make your analysis robust to any ordering changes.
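A minimal sketch of that name-based lookup in Python. It assumes the sample columns are literally named `tumor` and `normal`, as in the records posted above; if your files carry the sample names from the BAM headers instead, substitute those. The helper `sample_fields` is hypothetical, and the header and record are rebuilt here with tabs (and a leading `#`) because a real VCF is tab-delimited, unlike the space-separated paste in this thread.

```python
def sample_fields(header_line, record_line, sample_name):
    """Return {FORMAT key: value} for one sample, located by column name."""
    columns = header_line.lstrip("#").rstrip("\n").split("\t")
    values = record_line.rstrip("\n").split("\t")
    # Find the sample by its header name, never by a hard-coded position.
    index = columns.index(sample_name)  # raises ValueError if the name is absent
    format_keys = values[columns.index("FORMAT")].split(":")
    return dict(zip(format_keys, values[index].split(":")))


# Header and first record of "sample 2" from this thread, rebuilt with tabs.
header = "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
                    "INFO", "FORMAT", "normal", "tumor"])
record = "\t".join(["chr1", "1690583", ".", "G", "A", ".", "PASS",
                    "SOMATIC;VT=SNP", "GT:AD:BQ:DP:FA:SS",
                    "0:9,0:.:9:0.00:0", "0/1:20,3:30:23:0.130:2"])

tumor = sample_fields(header, record, "tumor")
normal = sample_fields(header, record, "normal")
print("tumor FA:", tumor["FA"], "normal FA:", normal["FA"])
```

Because the column index is recomputed from each file's own header line, the same call returns the tumor genotype whether the file lists `tumor normal` or `normal tumor`.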

  • Is it possible to ask the developers why the ordering of the tumor/normal calls isn't fixed? Did they intentionally add a randomization step at the end to output a random ordering to make life difficult for everyone?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    That's right, the developers have a lot of time on their hands since they have no other high priority / high pressure work to do, and so they spend all that free time thinking about ways to make the tools harder to use.

  • Sorry, I didn't mean to sound so "snarky." I do appreciate the work that has been done at the Broad, and your replies too, Geraldine! They have been helpful. As for the question, I didn't mean to ask why the developers haven't had time to fix the issue. I was just really confused about why a program that takes a precisely specified tumor and normal sample somehow returns an output file whose column order isn't fixed (i.e. precisely defined). Can the engineers explain how or why that happens? It just seems a bit strange, and I've been wondering about it for over a year now.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hah, no worries @henry2304 -- I rather enjoyed the excuse to be snarky myself ;)

    This is related to the VCF format specification -- the order of the sample columns is not precisely defined so it should never be taken for granted. IIRC the ordering is generally alphabetical by sample name. Parsing the VCF should be done accordingly, by checking column headers systematically for the sample columns.
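For anyone with many files to process, one possible workaround is to rewrite each VCF so the sample columns always come out in the same order, again driven by the header names rather than by position. The sketch below is only an illustration under stated assumptions: the function `reorder_samples` is hypothetical, the samples are assumed to be named `normal` and `tumor`, the input is assumed to be tab-delimited, and the `##fileformat` line in the demo is a placeholder rather than something taken from the files discussed here.

```python
def reorder_samples(lines, wanted=("normal", "tumor")):
    """Yield VCF lines with the sample columns rearranged into `wanted` order."""
    order = None
    for line in lines:
        # Meta-information lines and blank lines pass through unchanged.
        if line.startswith("##") or not line.strip():
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            first_sample = fields.index("FORMAT") + 1
            samples = fields[first_sample:]
            if sorted(samples) != sorted(wanted):
                raise ValueError(f"unexpected sample columns: {samples}")
            # Keep the fixed columns in place, then pick samples by name.
            order = list(range(first_sample)) + [
                first_sample + samples.index(name) for name in wanted
            ]
        elif order is None:
            raise ValueError("data record found before the #CHROM header line")
        yield "\t".join(fields[i] for i in order) + "\n"


# Demo input: the "sample 1" layout from this thread (tumor before normal).
demo = [
    "##fileformat=VCFv4.1\n",  # placeholder meta line, an assumption
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ttumor\tnormal\n",
    "chr1\t1198749\t.\tC\tA\t.\tPASS\tSOMATIC;VT=SNP\tGT:AD:BQ:DP:FA:SS"
    "\t0/1:21,23:29:44:0.523:2\t0:41,0:.:41:0.00:0\n",
]
fixed_lines = list(reorder_samples(demo))
print("".join(fixed_lines), end="")
```

The function works on any iterable of lines, so it can be fed an open file handle; it raises an error instead of guessing when the header does not contain exactly the expected sample names.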
