MuTect2 sample names

My MuTect2 VCF records header looks like this:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TUMOR NORMAL
The samples are named TUMOR and NORMAL rather than by their actual names. It doesn't appear that the real sample names are stored anywhere in the VCF file. Is that correct? Is there a way to add sample names to the VCF file?

Comments

  • trevorconley (San Diego; Member)

    I believe that an older version of MuTect had an option to specify sample names, --tumor_sample_name. I no longer use MuTect, so I cannot confirm that this option still exists.

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @igor @trevorconley
    Hi,

    MuTect2 does not have an option to write the actual sample names to the output VCF. Because you can only input one tumor and one normal sample, you can instead record which samples were used in the name of the output VCF file.

    -Sheila

  • igor (New York; Member)

    @Sheila said:
    MuTect2 does not have an option to write the actual sample names to the output VCF. Because you can only input one tumor and one normal sample, you can instead record which samples were used in the name of the output VCF file.

    That's true, but it's not really possible to confirm that the three parameters (input T, input N, and output filename) are all properly matched up. Pulling the sample name from each BAM header and encoding it in the output VCF would provide extra confirmation of the sample names. Is there any way to have that option considered in the future?
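As a rough illustration of what "pulling the sample name from the BAM header" could look like, here is a minimal sketch that extracts the SM values from the @RG lines of SAM header text. It assumes the header has already been dumped to text (for example with `samtools view -H`); the function name and the example header are hypothetical and are not part of MuTect2.

```python
def sample_names_from_sam_header(header_text):
    """Return the distinct SM values found on @RG lines, in order of first appearance."""
    names = []
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # @RG fields are tab-separated TAG:VALUE pairs; SM holds the sample name.
        for field in line.split("\t")[1:]:
            if field.startswith("SM:"):
                sample = field[3:]
                if sample not in names:
                    names.append(sample)
    return names


# Hypothetical header text, e.g. as printed by `samtools view -H tumor.bam`.
example_header = (
    "@HD\tVN:1.5\tSO:coordinate\n"
    "@RG\tID:lane1\tSM:sample2\tPL:ILLUMINA\n"
    "@RG\tID:lane2\tSM:sample2\tPL:ILLUMINA\n"
)
print(sample_names_from_sam_header(example_header))
```

A BAM with more than one distinct SM value would return several names, which is itself a useful warning sign for a tumor/normal pair.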

    GitHub issue 1494, filed by Sheila (state: closed; closed by vdauwera)

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @igor
    Hi,

    I let the developers know, and they said to put in a feature request. However, please note this will be low priority. You can always submit a patch :smile:

    -Sheila

    P.S. I am about to put in the feature request.

    GitHub issue 2291, filed by Sheila (state: closed; closed by vdauwera)

  • igor (New York; Member)
    edited December 2016

    I don't think I am qualified to submit a patch, but thank you for putting in the feature request.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    @igor and anyone else who might be interested: there are several ways we could do this (taking out the ambiguity) so would you have a preference between the options below?

    We need to put some more thought into what form this should take. How exactly do we want to encode this information?

    Right now we have (a)

    #CHROM POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  TUMOR   NORMAL
    

    So would you want to change that to this? (b)

    #CHROM POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  sample1   sample2
    

    Or would you want to make it look like e.g. this for maximal clarity? (c)

    #CHROM POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  T-sample1   N-sample2
    

    Or just add an informational note line elsewhere in the VCF header, like this? (d)

    ##samples=<TUMOR="sample1",NORMAL="sample2">
    

    Or a combination of d + a, b or c?

  • igor (New York; Member)

    I am okay with any of the options. Of all of them, I prefer b. It would be nice to have d as well, since that's a little more verbose and probably doesn't interfere with any parsers.

    The one request I have is to always have T and N listed in the same order. I've seen somatic VCFs where the samples are arranged alphabetically, so the order doesn't actually tell you which one is T or N.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    @igor We've decided to change the output to the following, which is essentially b+d:

    Add an extra definition line for each sample in the file header (exact form to be finalized):

    ##sample=<ID="NORMAL",Sample="sample1">
    ##sample=<ID="TUMOR",Sample="sample2">
    

    And use the sample names (from SM) as column headers, in alphabetical order, as is standard elsewhere:

    #CHROM POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  sample1   sample2
    

    Note that we can't accommodate your request for a specific, consistent ordering of the tumor/normal sample columns because that would break standardization with many other tools. The right thing to do here is to write whatever code ingests the VCF so that it recognizes which column is T vs N from the sample names, using the header definition lines.
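The ingestion approach described above can be sketched as follows. This is only an illustration that assumes the provisional `##sample=<ID="...",Sample="...">` form shown in this thread (marked "exact form to be finalized", so released GATK versions may use a different header line); the function name and regular expression are hypothetical.

```python
import re

# Matches the provisional header form from this thread, with or without quotes.
SAMPLE_LINE = re.compile(
    r'^##sample=<ID="?(?P<role>[^",>]+)"?,Sample="?(?P<name>[^",>]+)"?>$'
)


def tumor_normal_columns(header_lines):
    """Map "TUMOR" and "NORMAL" to 0-based column indices of the #CHROM line."""
    roles = {}
    columns = None
    for raw in header_lines:
        line = raw.rstrip("\n")
        match = SAMPLE_LINE.match(line)
        if match:
            roles[match.group("role").upper()] = match.group("name")
        elif line.startswith("#CHROM"):
            # VCF is tab-delimited; fall back to whitespace for pasted examples.
            columns = line.split("\t") if "\t" in line else line.split()
    if columns is None:
        raise ValueError("no #CHROM header line found")
    sample_columns = columns[9:]  # sample columns follow the 9 fixed VCF columns
    result = {}
    for role in ("TUMOR", "NORMAL"):
        name = roles.get(role)
        if name is None or name not in sample_columns:
            raise ValueError(f"cannot resolve the {role} sample column")
        result[role] = 9 + sample_columns.index(name)
    return result


example = [
    "##fileformat=VCFv4.2",
    '##sample=<ID="NORMAL",Sample="sample1">',
    '##sample=<ID="TUMOR",Sample="sample2">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\tsample2",
]
print(tumor_normal_columns(example))
```

Because the lookup goes through the header definition lines, the result does not depend on whether the tumor or the normal column happens to sort first alphabetically.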

  • igor (New York; Member)

    Thanks for the update. Obviously I am sad about the sample order, but it's good that the sample info is now included.

    Is this going to be in GATK 4.0 or is it too early to know?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Yes, this will be in GATK4. If there's interest in getting this out sooner we can backport the fix to GATK 3.7 since it's a small change and we're planning to cut a patch release next week.

  • igor (New York; Member)

    I am okay with waiting until 4. I don't want this to break anyone's code (including mine).
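For VCFs already written with TUMOR and NORMAL columns, a post-processing step could add the sample information by hand. The sketch below is one hypothetical way to do that, not a GATK feature: it inserts the provisional `##sample` definition lines from this thread and renames the two sample columns. It assumes a tab-delimited header, leaves the column order unchanged (it does not re-sort alphabetically, which would also require reordering every data line), and relies on the caller to supply the correct names.

```python
def relabel_header(lines, tumor_name, normal_name):
    """Return VCF lines with ##sample lines added and TUMOR/NORMAL columns renamed."""
    rename = {"TUMOR": tumor_name, "NORMAL": normal_name}
    out = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#CHROM"):
            # Provisional definition lines, in the form proposed in this thread.
            out.append(f'##sample=<ID="NORMAL",Sample="{normal_name}">')
            out.append(f'##sample=<ID="TUMOR",Sample="{tumor_name}">')
            fields = line.split("\t")
            # Only the sample columns (index 9 onward) are renamed.
            fields[9:] = [rename.get(name, name) for name in fields[9:]]
            out.append("\t".join(fields))
        else:
            out.append(line)  # meta lines and data records pass through unchanged
    return out


old_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tTUMOR\tNORMAL",
    "chr1\t100\t.\tA\tT\t.\tPASS\t.\tGT\t0/1\t0/0",
]
for line in relabel_header(old_vcf, tumor_name="sample2", normal_name="sample1"):
    print(line)
```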
