Why does the GATK4 Mutect2 PoN use HG00190, NA19771, and HG02759?

Dear professor,
Thanks for your great tool. I want to ask why you selected these three samples. On the 1000 Genomes website I found that they come from different countries, and for each sample there is data from several sequencing methods. Which one did you choose? Also, do you know why the samples are named with the prefixes NA and HG? Thanks a lot.


  • picard_gatk_mj Unconfirmed
    Accepted Answer

    But on the 1000 Genomes website there are a lot of files, and they differ from what is on the FTP site. Can you tell me exactly which file you used? Thanks a lot.

    For example, I see an exome alignment directory on the FTP site, but the files are in CRAM format. I tried converting them to BAM with samtools and then converting from hg38 to hg19 with the Python package CrossMap, but the CRAM-to-BAM step still seems to be in progress.

  • picard_gatk_mj Unconfirmed

    I want to make the PoN. I tried the steps below, but I think they are too tedious, so I want to know how you made the VCFs for your three samples.
    step1: Download all the CHB samples (we are Chinese, so I guess CHB is a better control for us; do you agree?).

    step2: Convert all the CRAM files in the exome_alignment directory to BAM with samtools or cramtools (1000 Genomes recommends cramtools, but the cramtools GitHub page recommends samtools, which is a little funny).

    step3: Because the data I downloaded is hg38, I want to use the Python package 'CrossMap' to convert the BAMs to hg19.

    step4: Run MarkDuplicates, BuildBamIndex, BaseRecalibrator, ApplyBQSR, and Mutect2 on each BAM to get a VCF file.

    step5: Run CreateSomaticPanelOfNormals on the VCFs from step4 to get the final PoN VCF.

    But the page for your latest Mutect2 version, https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_mutect_Mutect2.php, lists only two modes: (i) tumor with matched normal and (ii) tumor-only. The normal BAM is required and the PoN is optional, so what should I do if I do not have a normal BAM?
    I am a little lost navigating the different GATK4 versions, and I am not sure how each version differs from the latest one.

    In other words, I want to call somatic SNVs and indels with no paired normal, just the tumor, and use 1000 Genomes data to create the PoN. Which version of GATK4 should I use, and can this be done?

    To make my meaning clearer, I pasted the link to Mutect2 in version and here

    No matter which page I look at, there is no clear description of how to use the PoN for filtering; the pages only tell me to treat the normal as a tumor to create the VCF for the PoN.
    So I am really worried, and I am eager to hear from you soon.

    Thanks a lot.

  • picard_gatk_mj Unconfirmed

    Another question: your example uses a single normal sample for panel of normals (PoN) creation and a single tumor sample. Does that mean I can use only one sample for the PoN, or is it better to use many samples?

    gatk Mutect2 \
    -R ref_fasta.fa \
    -I normal1.bam \
    -tumor normal1_sample_name \
    --germline-resource af-only-gnomad.vcf.gz \
    -L intervals.list \
    -O normal1_for_pon.vcf.gz

    In this command, how should I prepare the file passed to --germline-resource, af-only-gnomad.vcf.gz? Thanks a lot.

  • picard_gatk_mj Unconfirmed

    There is a detailed comparison between GATK3 and GATK4, but no detailed comparison covering the later updates. Please forgive my limited ability to read the source code. Thanks a lot.

  • SChaluvadi Member, Broadie, Moderator admin

    @picard_gatk_mj Sorry for the delay due to holidays! I am working on an answer and will get back to you soon.

  • picard_gatk_mj Unconfirmed

    I also have one more question: are the samples you used exome data? I do not know whether the data is WES, and there does not seem to be much information about that at http://www.internationalgenome.org/faq/what-capture-technology-does-exome-sequencing-used/
    Thanks a lot.

  • SChaluvadi Member, Broadie, Moderator admin


    1. You can make a PON without a matching normal for tumor samples. In the tutorial example, a normal bam was used in the tumor-only mode but if you do not have the paired normal you can use other non-related “normal” samples that you have access to. From the documentation, (1) “normal" means derived from healthy tissue that is believed to not have any somatic alterations and (2) their main purpose is to capture recurrent technical artifacts in order to improve the results of the variant calling analysis. Again, since the PoN is used to alleviate systematic noise in the samples, they do not have to be related to your tumor samples but they need to be from the same experimental design (meaning the same techniques/technology/equipment are used) and they need to be normal.

    2. The Broad recommended number of samples for creation of a PoN is 40 samples. The tutorial example that you see in the documentation only shows the steps for 1 sample at a time to make sure that the tutorial steps finish in a timely manner. You will have to repeat the steps for each sample that you would like to use for your PoN. Once your PoN vcf has been generated using the tumor-only mode, you can implement it as a filter, via the -pon argument, in either the somatic OR tumor-only mode in the variant calling steps.

    3. If you do not have matched normals to your tumor samples, you can still use the tumor-only mode to run analysis. In the tumor-only mode, a single sample’s alignment data undergoes analysis for variant calling - the single sample can be normal OR tumor. In your case, since there is no matching normal to your tumor samples, you will not have the ability to filter out common germline variation and individual-specific artifacts, which may lead to an order of magnitude more calls. If you want to filter for sequencing errors, you will have to generate and implement a PoN vcf.

    4. In any of the cases listed above, it would be best to use the most recent version of Mutect2, since there have been numerous updates since the BETA version. Regarding the comparison between GATK (version BETA) and later updates, here is a link to the GitHub Release Notes. If you search for “Mutect2”, you will be able to follow the updates up to the current version.

    5. You can find the germline resource file, af-only-gnomad.vcf.gz, here or from the best practices page. This version is simplified from the gnomAD browser to retain population allele frequencies.

    6. The 3 samples in the tutorial that were used are WES data from a breast cancer cell line and its matched normal cell line derived from blood (HCC1143 and HCC1143_BL, respectively). More details about the 3 samples, alignment, and pre-processing are listed in the Footnotes of the tutorial on Mutect2.

    Feel free to follow up with more questions if I missed anything!
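
    The workflow in points 1 to 3 can be sketched as a small shell script. This is only a dry run under stated assumptions: it prints the commands instead of executing them, the Mutect2 flags mirror the GATK 4.0-era command quoted earlier in this thread, and the -vcfs argument of CreateSomaticPanelOfNormals is assumed from that same release series. Later GATK releases may use a different PoN workflow, so check the tool documentation for your version. All file names, sample names, and function names are placeholders.

```shell
#!/bin/sh
# Dry-run sketch: every function only PRINTS a command line.
# File names are placeholders, not files shipped with the tutorial.
REF=ref_fasta.fa
GNOMAD=af-only-gnomad.vcf.gz
INTERVALS=intervals.list

# Step 1: one tumor-only Mutect2 call per normal.
# Each argument has the form "bam_path:sample_name".
normal_call_cmds() {
    for pair in "$@"; do
        bam=${pair%%:*}
        name=${pair#*:}
        echo "gatk Mutect2 -R $REF -I $bam -tumor $name --germline-resource $GNOMAD -L $INTERVALS -O ${name}_for_pon.vcf.gz"
    done
}

# Step 2: combine the per-normal VCFs into one PoN
# (-vcfs is assumed from GATK 4.0.x; newer releases may differ).
pon_cmd() {
    cmd="gatk CreateSomaticPanelOfNormals"
    for pair in "$@"; do
        name=${pair#*:}
        cmd="$cmd -vcfs ${name}_for_pon.vcf.gz"
    done
    echo "$cmd -O pon.vcf.gz"
}

# Step 3: tumor-only call that applies the PoN via -pon.
# $1 = tumor BAM, $2 = tumor sample name.
tumor_only_cmd() {
    echo "gatk Mutect2 -R $REF -I $1 -tumor $2 --germline-resource $GNOMAD -pon pon.vcf.gz -L $INTERVALS -O ${2}_somatic.vcf.gz"
}

NORMALS="normal1.bam:normal1_sample_name normal2.bam:normal2_sample_name"
normal_call_cmds $NORMALS
pon_cmd $NORMALS
tumor_only_cmd tumor.bam tumor_sample_name
```

    Review the printed commands first; once they match your GATK version and files, they can be run one by one. With the recommended 40 normals, only the NORMALS list grows.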

  • picard_gatk_mj Unconfirmed

    @SChaluvadi, I think your answer is very detailed and professional.
    I have some other comments:
    Q1: HG00190 (http://www.internationalgenome.org/data-portal/sample/HG00190), NA19771 (http://www.internationalgenome.org/data-portal/sample/NA19771), and HG02759 (http://www.internationalgenome.org/data-portal/sample/HG02759) do not seem to be the cell lines used in https://gatkforums.broadinstitute.org/gatk/discussion/11136/how-to-call-somatic-mutations-using-gatk4-mutect2

    Q2: On https://github.com/broadinstitute/gatk/releases, searching for Mutect2 just leads to the source code for the details (this is the Release Notes hyperlink in your fourth point).

    Q3: Access to the hyperlinks in point 5 is denied: gs://gatk-best-practices/ and ftp://[email protected]/bundle/. I cannot get into them.

    Q4: Point six does not seem to give a clear description.

    Thanks a lot.

  • SChaluvadi Member, Broadie, Moderator admin


    1. It looks like the HG00190, NA19771, and HG02759 samples are the normals whose exome data was used in the generation of the PoN (with tumor-only mode) in Step 2 of the tutorial. This step was to show how someone can make a PoN with samples. The cell lines, used to demonstrate the somatic mode, were breast cancer cell line (HCC1143, tumor) and the matched normal cell line derived from blood (HCC1143_BL, normal). The cell lines are independent of the 3 samples used to generate the example PoN. Also to note, the pon used in the somatic mode is chr17_pon.vcf.gz (which lives in the tutorial bucket).
    2. If you go to the Release Notes page, press cmd-f, and search for "Mutect2", you should be able to scroll through all the instances on the page where Mutect2 is mentioned. You will then be able to see, under each version, the updates made for that specific version. Here is an attachment with an example where I searched for Mutect2 and saw the release updates for the previous version.
    3. Please try this link to find the vcf file.
    4. I apologize, maybe I am not understanding what information you are looking to find about the 3 samples. Can you clarify more about what you would like for me to answer?
  • SChaluvadi Member, Broadie, Moderator admin

    We have had lots of trouble with spam on the forum, so we have set up a new system that requires new users to gain points, which earn them privileges like posting links in their discussions. Here is a blog post that explains some activities you can perform on the forum to gain more points. Once you are past the new user limits, you should be able to post links again.

    Q1: The tutorial data was you can download the CRAM file from the 1000 Genomes website if you are unable to download a file the size of the fasta file. Samtools has the ability to convert your CRAM to a BAM. You can use samtools view with the -b option to get a resulting BAM file.

    Q2: From this bucket you should be able to access the reference fasta for b37/hg19. Here is a link that points to the Resource Bundle page that has what standard files we provide and where to find them. In the same bucket that I pointed to above, you can access the reference fasta files for hg38 which I believe you asked about in one of your questions. The files in this bucket are Broad used files but you are not restricted to them.

    Q3: The --germline-resource allows you to include a VCF file consisting of allele fractions within populations. For example, this might allow you to look at variants that are at low frequencies in the population but with a large VCF that was generated from thousands of human samples, you have more information for analysis. Here is a resource that contains information about the gnomAD project that works to develop the population resources.

    Q4: I see in the documentation for the latest version of Mutect2 that you can use the --germline-resource and -pon options. The two versions of the command that you have listed relate to calling Mutect2 in the somatic mode and the tumor-only mode, respectively. In either case you can use the --germline-resource and -pon options.

    Q5: In reference to your error while using CrossMap - it might be because 'chr1_gl000192_random' is in your VCF file but is not present in your FASTA file. You should also check that, when running CrossMap, the format in which your chromosomes are denoted is the same across your input file, the chain file, and your FASTA file.
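
    The check suggested in Q5 can be sketched with awk. This is a minimal sketch, assuming an uncompressed VCF and a FASTA index (.fai) such as the one written by samtools faidx; the function name missing_contigs and the file names in the usage comment are hypothetical.

```shell
# Hypothetical helper: print every contig that appears in the VCF records
# but is absent from the FASTA index.
#   $1 = uncompressed VCF
#   $2 = FASTA index (.fai); its first column holds the contig names
missing_contigs() {
    awk '
        NR == FNR { known[$1] = 1; next }   # first file read: the .fai
        /^#/      { next }                  # skip VCF header lines
        !($1 in known) && !seen[$1]++ { print $1 }
    ' "$2" "$1"
}

# Usage (placeholder file names):
#   samtools faidx ref.fa                 # writes ref.fa.fai
#   missing_contigs input.vcf ref.fa.fai
```

    Any name printed here, for example chr1_gl000192_random, is a contig the reference does not know about. A mismatch such as "1" versus "chr1" shows up the same way, with every contig listed.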

  • SkyWarrior Turkey Member ✭✭✭

    You are using the wrong reference file. The BAM might have been mapped to hg19.

    Check the header with samtools view -H bamfile.bam.

    It will tell you more.
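
    One way to read the header is to look at the length of chromosome 1 in the @SQ lines: it is 249250621 in GRCh37/hg19 and 248956422 in GRCh38/hg38. The sketch below assumes the header is piped in on stdin; the function name guess_build is hypothetical, and the demo line uses a synthetic header rather than a real BAM.

```shell
# Hypothetical helper: read a SAM/BAM header on stdin and report which
# build the length of chr1 (or 1) corresponds to.
guess_build() {
    awk -F'\t' '
        $1 == "@SQ" {
            name = ""; len = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^SN:/) name = substr($i, 4)
                if ($i ~ /^LN:/) len  = substr($i, 4)
            }
            if (name == "chr1" || name == "1") {
                if (len == "249250621")      build = "GRCh37/hg19"
                else if (len == "248956422") build = "GRCh38/hg38"
                else                         build = "unknown"
                print name "\t" len "\t" build
                found = 1
                exit
            }
        }
        END { if (!found) print "no chr1/1 @SQ line found" }
    '
}

# Demo with a synthetic header; for a real file use:
#   samtools view -H bamfile.bam | guess_build
printf '@HD\tVN:1.6\n@SQ\tSN:1\tLN:249250621\n' | guess_build
```

    The contig naming ("1" or "chr1") is printed as well, since a reference with the right build but a different naming style will still fail.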

  • bhanuGandham Cambridge MA Member, Administrator, Broadie, Moderator admin

    Thank you for your input @SkyWarrior.

  • SChaluvadi Member, Broadie, Moderator admin

    1. The error that you are seeing with Picard can arise if you have combined multiple fastq files together, resulting in reads with the same name. It can also arise if you have not used the -M option with bwa-mem to mark secondary hits appropriately. You can remove duplicates in your fastq file if you only have a few that are duplicates. Here are some blog posts that have addressed this similar error, which you can use to determine whether the error is within the fastq file or not: post1, post2, post3.

    2. As the previous user has mentioned, please check the header of your BAM to determine which reference it was aligned to. This could be a reason for your error. If so, you may need to realign your data to the proper reference.

    3. Using HaplotypeCaller for your VCFs is not the right approach for creating your PoN. In your PoN, you are not looking to isolate germline variants - you are looking to compare the normal sample to the tumor sample. If something is called in your normal, then it is probably not a somatic mutation. The --germline-resource, however, is the option that allows you to add a population-based germline resource to filter for germline mutations. If you do not have matching normals to your tumor samples, you can use other normal samples that are as closely related to your tumor as possible. The file in the link, af-only-gnomad.hg38.vcf.gz, is the one used in the tutorial with --germline-resource. In the same link you can also find the 1000G PoN.

  • SChaluvadi Member, Broadie, Moderator admin

    Here is a link to the bucket that contains the hg19 version of the af-only-gnomad.hg38.vcf.gz. It is called af-only-gnomad.raw.sites.vcf (please note that it is not in a compressed .gz format).

    I believe that based on this post, the 1000G PoN has 40 samples.

    I believe there is some confusion around the PoN. The PoN is used in somatic variant analysis - which is why we implement it within the Mutect2 tutorial; Mutect2 is used to determine somatic variants. The way it does this is by comparing a tumor sample with a normal sample. A normal sample is derived from healthy tissue with no known somatic alterations. Combining multiple normal samples to create a PoN (which is essentially a VCF file) allows you to compare the resulting VCF from a tumor sample against that of the normals. By doing so you can determine, for example, that variants called in the tumor that are also called in the normals are probably not true somatic variants. If you were to use HaplotypeCaller, you would be looking at germline variants, which would not help you resolve somatic variants in your tumor samples. You can read this link about Panel of Normals for a more detailed explanation.

  • SChaluvadi Member, Broadie, Moderator admin

    That reply was written by the architect of the PoN as well as the tutorial/documentation for generation of a PoN as well as Mutect2. She was most likely aware of the dates that accompanied the versions of the PoNs that were posted so she was sharing her personal insight into the version.

    I believe that by non-primary, she is referring to the "ALT/decoy/HLA contigs" that are included in the 40 sample PoN. The ALT (alternate) contigs represent haplotypes that do not have a single representation. The HLA contigs are a subset of the alternate contigs, and decoy contigs refer to reads that do not map to the human reference genome.

    Yes you can try to do a liftover of the 40 sample PoN to match your desired reference build.

  • SChaluvadi Member, Broadie, Moderator admin

    There should not be a difference in the steps you would take to generate your PoN from WGS or WES samples. The 40 samples used in the example PoN are WES samples because of their smaller size, which makes tutorials easier to follow. However, all the steps in the Mutect2 short variant discovery tutorial are applicable to WGS data.

    Can you clarify your question about the bed file? Do you mean using the -L option in Mutect2? This indicates an intervals list to specify particular genomic intervals.

  • SChaluvadi Member, Broadie, Moderator admin

    @manba If I understand your question correctly, then the PoN should contain all the regions across all of your bed files. If this is the case, you can use the same PoN.

  • SChaluvadi Member, Broadie, Moderator admin

    The PoN should contain any/all regions of interest within the combination of your bedfiles if you want to use the same PoN. For example, if you have 2 bed files, the PoN should contain all the regions listed within the 2 bed files. The bed file you are using just restricts the regions of the genome that you are applying the tool.
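
    One way to get a single interval set that covers several bed files is to merge them before calling the normals. The sketch below assumes plain three-column, tab-separated BED files without track or browser header lines; the function name merge_beds and the file names in the usage comment are hypothetical.

```shell
# Hypothetical helper: merge any number of BED files into one sorted,
# non-overlapping interval list (overlapping or touching intervals are joined).
merge_beds() {
    LC_ALL=C sort -k1,1 -k2,2n "$@" | awk '
        BEGIN { OFS = "\t" }
        NR == 1 { chrom = $1; start = $2 + 0; stop = $3 + 0; next }
        # Same contig and overlapping/touching: extend the current interval
        $1 == chrom && ($2 + 0) <= stop {
            if (($3 + 0) > stop) stop = $3 + 0
            next
        }
        # Otherwise emit the finished interval and start a new one
        { print chrom, start, stop; chrom = $1; start = $2 + 0; stop = $3 + 0 }
        END { if (NR > 0) print chrom, start, stop }
    '
}

# Usage (placeholder file names):
#   merge_beds panel_a.bed panel_b.bed > union.bed
```

    The merged file could then be passed to -L when the normals are called, so that the resulting PoN spans every region of interest.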

    In the tutorial data, as I have mentioned previously, exome data was used to expedite the analysis, and chr17 was used as an example subset of the data for the same reason. The chr17 PoN was used with the chr17 germline resource to make the example run faster.

    In general, the way to calculate the --af-of-alleles-not-in-resource value is 1/(2 × number of samples in the germline resource). For the tutorial data, exome data was used. In the gnomAD resource file there are 200K exomes. Therefore the calculation there was 1/(2 × 200000) = 0.0000025. Depending on the number of samples that are in your germline resource, you can calculate this value. The default is 0.001.
    The --af-of-alleles-not-in-resource parameter is the assumed population frequency of alt alleles that aren't in the backing population resource (like dbSNP or gnomAD). Ultimately it affects whether variants will be filtered out by the germline filter. The higher the value, the more likely it is that an allele occurs in the population, and the more likely that any alt allele found will be filtered out by the germline filter (if that alt allele didn't occur in dbSNP or gnomAD).
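
    The calculation can be reproduced on the command line. This is a small sketch with awk; the function name af_not_in_resource is hypothetical, and the output is rounded to seven decimal places.

```shell
# Hypothetical helper: 1 / (2 * number_of_samples), rounded to 7 decimals.
#   $1 = number of samples in the germline resource
af_not_in_resource() {
    awk -v n="$1" 'BEGIN { printf "%.7f\n", 1 / (2 * n) }'
}

# 200,000 exomes, as in the tutorial example
af_not_in_resource 200000   # prints 0.0000025
```

    The printed value is what would be passed to --af-of-alleles-not-in-resource for a resource of that size.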

    The PoN and germline-resource parameters do different things. The PoN is used to capture TECHNICAL artifacts such as the different ways that a sample might have been generated. This is why samples used in the PoN should be as technically similar to the case samples even if they are not matched. The germline-resource is used to annotate variant alleles with frequencies. Please read the documentation for more detailed explanations and to understand better the application of the options to your specific use case.

  • SChaluvadi Member, Broadie, Moderator admin

    The --af-of-alleles-not-in-resource parameter is the argument which relates to the filter. It is the value passed to this, not the --germline-resource, that acts as the filter. This filter does not act automatically; you have to set this parameter when you run Mutect2.

    The file that is in the Google Cloud bucket is compressed, so it is possible that it has a smaller size. Can you check whether the one on the gnomAD website is also compressed? If it is not, can you post the gnomAD link to the file?

  • AdelaideR Unconfirmed, Member, Broadie, Moderator admin

    Hi @manba, if you take a look at the bottom of this page, you will find instructions on how to move files using FTP. You may need to have a program on your computer such as Cyberduck or Fetch or WinSCP to accomplish this method.

    If there is a specific file that cannot be found in the FTP page, let us know and we will make it available, but they should be the same.
