Why does the GATK4 Mutect2 PoN use HG00190, NA19771, and HG02759?

Dear professor,
Thanks for your great tool. I want to ask why you selected these three samples. On the 1000 Genomes website I found that they come from different countries, and for each sample there is data from several sequencing methods. Which one did you choose? Also, do you know why the samples are named with the prefixes NA and HG? Thanks a lot.

Answers

  • picard_gatk_mjpicard_gatk_mj Member
    Accepted Answer

    However, the 1000 Genomes website lists many files, and they differ from what is on the FTP site. Can you tell me exactly which files you used? Thanks a lot.

    For example, I see an exome alignment directory on the FTP site, but the files are in CRAM format. I am converting them to BAM with samtools and then plan to use the Python package CrossMap to convert from hg38 to hg19, but the CRAM-to-BAM step still seems to be in progress.

  • I want to make a PoN and am trying the steps below, but they seem too tedious. I would like to know how you made the VCFs for your three samples.
    step 1: Download all the CHB samples (we are Chinese, so I guess CHB is a better control for us; do you agree?).

    step 2: Convert all the CRAM files in the exome_alignment directory to BAM with samtools or cramtools (1000 Genomes recommends cramtools, but the cramtools GitHub page recommends samtools, which is a little funny).

    step 3: Because the data I downloaded is hg38, I want to use the Python package 'CrossMap' to convert the BAMs to hg19.

    step 4: Run MarkDuplicates, BuildBamIndex, BaseRecalibrator, ApplyBQSR, and Mutect2 on each BAM to get one VCF per sample.

    step 5: Run CreateSomaticPanelOfNormals on the VCFs from step 4 to get the final PoN VCF.

    However, the documentation for your latest Mutect2 version (https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_mutect_Mutect2.php) describes only two modes: (i) tumor with matched normal and (ii) tumor-only. It reads as though the normal BAM is required and the PoN is optional, so what should I do if I do not have a normal BAM?
    I am a little lost navigating the different GATK4 versions, and I am not sure how version 4.0.0.0 differs from the later releases.

    In other words, I want to call somatic SNVs and indels from tumor samples with no matched normal, and I want to use 1000 Genomes data to create a PoN. Which version of GATK4 should I use? Can this be done with 4.0.0.0?

    To make my meaning clearer, here are the links to the Mutect2 documentation for versions 4.0.0.0 and 4.0.11.0:

    4.0.0.0
    https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.0.0/org_broadinstitute_hellbender_tools_walkers_mutect_Mutect2.php

    4.0.11.0
    https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.11.0/org_broadinstitute_hellbender_tools_walkers_mutect_Mutect2.php

    Neither page clearly describes how to use a PoN for filtering; they only say to treat the normal as a tumor to create the VCFs for the PoN.
    I am quite worried about this and eager to hear from you soon.

    Thanks a lot.

  • Another question: the 4.0.0.0 documentation gives one example for "Single normal sample for panel of normals (PoN) creation" and one for a single tumor sample. Does that mean I can only use one sample for the PoN in version 4.0.0.0, or is it better to use many samples?

    gatk Mutect2 \
    -R ref_fasta.fa \
    -I normal1.bam \
    -tumor normal1_sample_name \
    --germline-resource af-only-gnomad.vcf.gz \
    -L intervals.list \
    -O normal1_for_pon.vcf.gz

    In this command, how should I prepare the file passed to --germline-resource (af-only-gnomad.vcf.gz)? Thanks a lot.

  • There is a detailed comparison between GATK3 and GATK4, but no detailed comparison between 4.0.0.0 and the later updates. Forgive me, I do not know enough to read the source code. Thanks a lot.

  • SChaluvadiSChaluvadi Member, Broadie, Moderator admin

    @picard_gatk_mj Sorry for the delay due to holidays! I am working on an answer and will get back to you soon.

  • manbamanba Member
    edited November 27

    This is a good question; I am waiting for a detailed answer.

  • I also have one question: are the samples you used exome data? I cannot tell whether the data is WES; there does not seem to be much information about that at http://www.internationalgenome.org/faq/what-capture-technology-does-exome-sequencing-used/
    Thanks a lot.

  • manbamanba Member
    edited November 28

    I think 1KG should give a better explanation on their website of how the seq

  • SChaluvadiSChaluvadi Member, Broadie, Moderator admin

    @picard_gatk_mj

    1. You can make a PoN without matched normals for your tumor samples. In the tutorial example, a normal BAM was run in tumor-only mode, but if you do not have paired normals you can use other unrelated "normal" samples that you have access to. From the documentation, (1) "normal" means derived from healthy tissue that is believed not to have any somatic alterations, and (2) the main purpose of the normals is to capture recurrent technical artifacts in order to improve the results of the variant calling analysis. Again, since the PoN is used to alleviate systematic noise in the samples, the normals do not have to be related to your tumor samples, but they need to come from the same experimental design (meaning the same techniques/technology/equipment are used) and they need to be normal.

    2. The Broad recommended number of samples for creation of a PoN is 40 samples. The tutorial example that you see in the documentation only shows the steps for 1 sample at a time to make sure that the tutorial steps finish in a timely manner. You will have to repeat the steps for each sample that you would like to use for your PoN. Once your PoN vcf has been generated using the tumor-only mode, you can implement it as a filter, via the -pon argument, in either the somatic OR tumor-only mode in the variant calling steps.

    3. If you do not have matched normals for your tumor samples, you can still use the tumor-only mode to run the analysis. In tumor-only mode, a single sample's alignment data undergoes analysis for variant calling; the single sample can be normal OR tumor. In your case, since there is no matched normal for your tumor samples, you will not be able to filter out common germline variation and individual-specific artifacts, which may lead to an order of magnitude more calls. If you want to filter for sequencing errors, you will have to generate and apply a PoN VCF.

    4. In any of the cases listed above, it would be best to use the most recent version of Mutect2 (GATK 4.0.11.0), since there have been numerous updates since the BETA version (4.0.0.0). Regarding the comparison between GATK 4.0.0.0 (version BETA) and later updates, here is a link to the GitHub Release Notes. If you search for "Mutect2", you will be able to follow the updates from 4.0.0.0 up to the current version, 4.0.11.0.

    5. You can find the germline resource file, af-only-gnomad.vcf.gz, here or from the best practices page. This version is simplified from the gnomAD browser to retain population allele frequencies.

    6. The 3 samples in the tutorial that were used are WES data from a breast cancer cell line and its matched normal cell line derived from blood (HCC1143 and HCC1143_BL, respectively). More details about the 3 samples, alignment, and pre-processing are listed in the Footnotes of the tutorial on Mutect2.

    Feel free to follow up with more questions if I missed anything!
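
Point 2 above says the per-sample step has to be repeated for every normal before the panel is built. A minimal sketch of that loop follows, written in Python only to assemble the command lines (it does not run GATK). The Mutect2 flags are copied from the command quoted earlier in this thread; the `-vcfs` flag for CreateSomaticPanelOfNormals follows the 4.0.x-era tutorial and is an assumption to check against the installed version. The file names and the helper names `mutect2_normal_command` and `pon_commands` are hypothetical.

```python
from typing import Dict, List


def mutect2_normal_command(bam: str, sample_name: str, out_vcf: str,
                           reference: str = "ref_fasta.fa",
                           germline_resource: str = "af-only-gnomad.vcf.gz",
                           intervals: str = "intervals.list") -> List[str]:
    """Build the tumor-only Mutect2 call for one normal, mirroring the thread's command."""
    return [
        "gatk", "Mutect2",
        "-R", reference,
        "-I", bam,
        "-tumor", sample_name,
        "--germline-resource", germline_resource,
        "-L", intervals,
        "-O", out_vcf,
    ]


def pon_commands(normals: Dict[str, str], pon_vcf: str = "pon.vcf.gz") -> List[List[str]]:
    """Return one Mutect2 command per normal, then one CreateSomaticPanelOfNormals command."""
    commands = []
    vcfs = []
    for sample_name, bam in normals.items():
        out_vcf = f"{sample_name}_for_pon.vcf.gz"  # hypothetical naming scheme
        commands.append(mutect2_normal_command(bam, sample_name, out_vcf))
        vcfs.append(out_vcf)
    create = ["gatk", "CreateSomaticPanelOfNormals"]
    for vcf in vcfs:
        create += ["-vcfs", vcf]  # assumed 4.0.x-style flag; verify for your version
    create += ["-O", pon_vcf]
    commands.append(create)
    return commands


if __name__ == "__main__":
    # Hypothetical BAM paths for the three samples named in this thread.
    normals = {"HG00190": "HG00190.bam", "NA19771": "NA19771.bam", "HG02759": "HG02759.bam"}
    for cmd in pon_commands(normals):
        print(" ".join(cmd))
```

Printing the commands first, rather than running them, makes it easy to review them or paste them into a job scheduler.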

  • @SChaluvadi, I think your answer is very detailed and professional.
    I have some other points:
    Q1: HG00190 (http://www.internationalgenome.org/data-portal/sample/HG00190), NA19771 (http://www.internationalgenome.org/data-portal/sample/NA19771), and HG02759 (http://www.internationalgenome.org/data-portal/sample/HG02759) do not seem to be the cell lines used in https://gatkforums.broadinstitute.org/gatk/discussion/11136/how-to-call-somatic-mutations-using-gatk4-mutect2

    Q2: On https://github.com/broadinstitute/gatk/releases (the Release Notes hyperlink in your fourth point), searching for mutect2 only leads into the source code for the details.

    Q3: The hyperlinks in point 5 are denied: gs://gatk-best-practices/ and ftp://[email protected]/bundle/. I cannot get into them.

    Q4: Point 6 does not seem to give a clear description.

    Thanks a lot.

  • SChaluvadiSChaluvadi Member, Broadie, Moderator admin

    @picard_gatk_mj

    1. It looks like the HG00190, NA19771, and HG02759 samples are the normals whose exome data was used in the generation of the PoN (with tumor-only mode) in Step 2 of the tutorial. This step was to show how someone can make a PoN with samples. The cell lines, used to demonstrate the somatic mode, were breast cancer cell line (HCC1143, tumor) and the matched normal cell line derived from blood (HCC1143_BL, normal). The cell lines are independent of the 3 samples used to generate the example PoN. Also to note, the pon used in the somatic mode is chr17_pon.vcf.gz (which lives in the tutorial bucket).
    2. If you go to the Release notes page, cmd-f, and search for "Mutect2", you should be able to scroll through all the instances on the page where Mutect2 is mentioned. You will be able to see then under each version of Mutect2, 4.0.0.0 to 4.0.11.0, the updates made for that specific version. Here is an attachment, of an example where I searched for Mutect2 and I saw the Release updates for the previous 4.0.10.0 version.
    3. Please try this link to find the vcf file.
    4. I apologize, maybe I am not understanding what information you are looking to find about the 3 samples. Can you clarify more about what you would like for me to answer?
  • manbamanba Member
    edited November 30

    Thanks SChaluvadi, you are so kind and professional.
    I do not know why signing in with a Google account does not work on my Mac, so I signed up for another account, manba. I also do not know why this new account is forbidden from posting links.
    Let me explain my meaning more clearly.

    Q1: The three samples on the 1000 Genomes website come in different formats, even for the exome data. If you search for HG00190 on the 1000 Genomes website or its FTP site, the exome data is supplied only in CRAM format. So what I want to ask is how you got the final BAMs. For example, if you started from CRAM, did you use samtools to change the format? Or did you download the FASTQ data and align it yourself? If so, how did you download so many samples?

    Q2: About the reference data, can you supply an hg19 version? Also, I am not very clear about the function of --germline-resource (Population vcf of germline sequencing containing allele fractions.). Can you explain it more clearly?

    Q3: Compare versions 4.0.0.0 and 4.0.11.0: the 4.0.11.0 documentation does not mention the germline-resource parameter and others for the PoN, but 4.0.0.0 does. If I want to use 4.0.11.0, is it suitable or necessary to add these parameters? I paste the two versions of the commands here for you to compare.

    gatk Mutect2 \
    -R ref_fasta.fa \
    -I normal1.bam \
    -tumor normal1_sample_name \
    --germline-resource af-only-gnomad.vcf.gz \
    -L intervals.list \
    -O normal1_for_pon.vcf.gz

    gatk Mutect2 \
    -R reference.fa \
    -I sample.bam \
    -tumor sample_name \
    -O single_sample.vcf.gz

  • manbamanba Member

    Another question I want to ask:

    Do you have any experience converting VCF or BAM files between genome builds?
    I tried CrossMap to convert a VCF, but it fails with the error "KeyError: "sequence 'b'chr1_gl000192_random'' not present"

    I also tried to convert BAMs with CrossMap, but for some files the sorting step stalls.

    Do you have any advice for me? Thanks a lot.

  • manbamanba Member

    Another important question is about the -R parameter: for the reference, should I choose GRCh38_full_analysis_set_plus_decoy_hla.fa or the reference you supply?

  • SChaluvadiSChaluvadi Member, Broadie, Moderator admin

    @manba
    We have had lots of trouble with spam on the forum so we have set up a new system that requires new users to gain points that allows them privileges like posting links on their discussions. Here is a blog that explains some activities that you can perform on the forum to gain more points. Once you are past the new user limits you should be able to post links again.

    Q1: You can download the CRAM file from the 1000 Genomes website if you are unable to download a file the size of the fasta file. Samtools can convert your CRAM to a BAM: use samtools view with the -b option to get a resulting BAM file.

    Q2: From this bucket you should be able to access the reference fasta for b37/hg19. Here is a link that points to the Resource Bundle page that has what standard files we provide and where to find them. In the same bucket that I pointed to above, you can access the reference fasta files for hg38 which I believe you asked about in one of your questions. The files in this bucket are Broad used files but you are not restricted to them.

    Q3: The --germline-resource allows you to include a VCF file consisting of allele fractions within populations. For example, this might allow you to look at variants that are at low frequencies in the population but with a large VCF that was generated from thousands of human samples, you have more information for analysis. Here is a resource that contains information about the gnomAD project that works to develop the population resources.

    Q4: I see in the documentation for the latest version of Mutect2 (4.0.11.0) that you can use the --germline-resource and -pon options. The two versions of the command that you listed relate to calling Mutect2 in the somatic mode and the tumor-only mode, respectively. In either case you can use the --germline-resource and -pon options.

    Q5: Regarding your error while using CrossMap: it might be because 'chr1_gl000192_random' is in your VCF file but is not present in your FASTA file. You should also check that the chromosome naming format is the same across your input file, the chain file, and your FASTA file.
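
The naming check suggested in Q5 can be sketched as a comparison between the contig names in the VCF and those in the FASTA index. This is a minimal illustration, assuming an uncompressed VCF and a standard `.fai` index produced by `samtools faidx`; the function names are hypothetical and this is not part of CrossMap.

```python
from typing import Iterable, Set


def fasta_index_contigs(fai_lines: Iterable[str]) -> Set[str]:
    """Contig names are the first tab-separated column of a .fai index."""
    return {line.split("\t")[0] for line in fai_lines if line.strip()}


def vcf_contigs(vcf_lines: Iterable[str]) -> Set[str]:
    """Collect the CHROM column from the data (non-header) lines of a VCF."""
    return {
        line.split("\t")[0]
        for line in vcf_lines
        if line.strip() and not line.startswith("#")
    }


def missing_contigs(vcf_lines: Iterable[str], fai_lines: Iterable[str]) -> Set[str]:
    """Contigs used by VCF records that the reference FASTA does not contain."""
    return vcf_contigs(vcf_lines) - fasta_index_contigs(fai_lines)


if __name__ == "__main__":
    # Toy data only; real use would read the lines from open file handles.
    fai = ["chr1\t249250621\t6\t50\t51\n"]
    vcf = ["##fileformat=VCFv4.2\n",
           "chr1\t100\t.\tA\tG\n",
           "chr1_gl000192_random\t5\t.\tC\tT\n"]
    print(sorted(missing_contigs(vcf, fai)))
```

Any name this reports is a candidate cause of a "not present" error. Dropping those records or using a reference that includes the contig are both options to weigh.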

  • manbamanba Member

    Thanks for your kind and professional answer.
    Because I cannot post a new question, I have to ask here.
    I ran into a java.lang.NullPointerException when I ran Mutect2.
    My command was 'gatk Mutect2 -L xx.bed -I xx.sorted.bam -R gatk_hg38_reference/Homo_sapiens_assembly38.fasta -tumor NA18525 -O NA18525.somatic.raw.vcf'

    I checked my sorted BAM with the command 'java -jar picard.jar ValidateSamFile I=NA18525.sorted.bam MODE=SUMMARY' (Picard version 2.18.17).
    The result is:

    INFO 2018-12-01 13:59:40 SamFileValidator Validated Read 150,000,000 records. Elapsed time: 00:11:37s. Time for last 10,000,000: 41s. Last read position: chr19_KI270922v1_alt:19,869
    ERROR 2018-12-01 14:00:10 ValidateSamFile Value was put into PairInfoMap more than once. 3072: SRR1518049.880567
    [Sat Dec 01 14:00:10 CST 2018] picard.sam.ValidateSamFile done. Elapsed time: 12.12 minutes.
    Runtime.totalMemory()=1163395072

  • manbamanba Member

    My sorted.bam is nearly 7 GB.

  • manbamanba Member

    About the --germline-resource parameter: the link contains very large VCFs, both per chromosome and for the whole genome. Can you tell me which one is af-only-gnomad.vcf.gz?
    And finally, will this parameter filter out low-frequency alleles? If so, I think it is very useful.

    thanks a lot

  • manbamanba Member
    edited December 3

    Hi, I also tried another way to check my BAM (converted from a 1000 Genomes CRAM):
    I ran the BAM file through the germline pipeline.
    I ran the following steps:

    A: cram to bam
    B: samtools sort and index
    C: MarkDuplicates
    D: BuildBamIndex
    E: BaseRecalibrator

    But I got an error in BaseRecalibrator, the step that uses the reference, no matter which hg38 reference I use (links and screenshots are not allowed, so I will try to describe it).
    The error is about differing contig lengths (A USER ERROR has occurred: Input files reference and features have incompatible contigs: Found contigs with the same name but different lengths:
    contig reference = chr1 / 248956422
    contig features = chr1 / 249250621.)

    (I had to replace some characters in the links to get past your website's spam check; I hope you can work out the real links.)

    The first hg38 reference is from https:__console.cloud.google.com_storage_browser_genomics-public-data_resources_broad_hg38_v0_pli=1

    The second hg38 reference is from the 1000 Genomes FTP site: ftp:ftp.1000genomes.ebi.ac.uk_vol1_ftp_technical_reference_GRCh38_reference_genome

    Q2: To create a PoN, we first need VCFs from some normal samples. Your tutorial does this with Mutect2, but can I use HaplotypeCaller instead? We only need files for filtering, in other words we only need to call the germline variants, and HaplotypeCaller is a germline caller. After creating many VCF files, we would then use CreateSomaticPanelOfNormals to get the final VCF. Thanks a lot.

    I cannot thank you enough for your great help; your timely, high-quality answers really light up my puzzled GATK world.

  • manbamanba Member

    Can anyone help me? Thanks a lot.

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    You are using the wrong reference file. Bam might have been mapped to hg19.

    Run samtools view -H bamfile.bam to print just the header and check it.

    It will tell you more.
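
One way to read the header is to look at the length of chr1 on its `@SQ` line. The error quoted earlier in this thread gives 248956422 for the reference and 249250621 for the features file; those are the chr1 lengths of GRCh38 and GRCh37/hg19 respectively, so the features file (for example a known-sites VCF) is worth checking as well as the BAM. The sketch below assumes the header text has already been captured, for example from `samtools view -H`; the function names and the two-entry lookup table are illustrative only.

```python
from typing import Dict, Optional

# chr1 lengths taken from the error message quoted in this thread.
CHR1_LENGTHS = {248956422: "hg38/GRCh38", 249250621: "hg19/GRCh37"}


def parse_sq_lengths(header_text: str) -> Dict[str, int]:
    """Map contig name (SN) to length (LN) from the @SQ lines of a SAM header."""
    lengths = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        if "SN" in fields and "LN" in fields:
            lengths[fields["SN"]] = int(fields["LN"])
    return lengths


def guess_build(header_text: str) -> Optional[str]:
    """Guess the genome build from the chr1 (or '1') length; None if unrecognised."""
    lengths = parse_sq_lengths(header_text)
    chr1 = lengths.get("chr1", lengths.get("1"))
    return CHR1_LENGTHS.get(chr1)


if __name__ == "__main__":
    header = "@HD\tVN:1.5\tSO:coordinate\n@SQ\tSN:chr1\tLN:248956422\n"
    print(guess_build(header))
```

A `None` result only means that the chr1 length matched neither table entry; it does not say that the file is malformed.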

  • bhanuGandhambhanuGandham Member, Administrator, Broadie, Moderator admin

    Thank you for your input @SkyWarrior.

  • SChaluvadiSChaluvadi Member, Broadie, Moderator admin

    @manba
    1. The error that you are seeing with Picard can arise if you have combined multiple fastq files together, resulting in reads with the same name. It can also arise if you have not used the -M option with bwa-mem to mark secondary hits appropriately. You can remove the duplicates from your fastq file if only a few reads are duplicated. Here are some blog posts that address this error and that you can use to determine whether the problem is in the fastq file: post1, post2, post3.

    2. As the previous user mentioned, please check the header of your BAM to determine which reference it was aligned to. This could be a reason for your error. If so, you may need to realign your data to the proper reference.

    3. Using HaplotypeCaller for your VCFs is not the right approach for creating your PoN. In your PoN, you are not looking to isolate germline variants; you are looking to compare the normal sample to the tumor sample. If something is called in your normal, then it is probably not a somatic mutation. The --germline-resource option, however, is what lets you add a population-based germline resource to filter for germline mutations. If you do not have matched normals for your tumor samples, you can use other normal samples that are as closely related to your tumor as possible. The file in the link named af-only-gnomad.hg38.vcf.gz is the one used in the tutorial with --germline-resource. In the same link you can also find the 1000G PoN.
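
For the first point in the list above, a quick way to see whether a FASTQ file contains repeated read names is to count the header lines. This is a minimal sketch that assumes a plain, uncompressed FASTQ with exactly four lines per record and checks one file at a time; the function name is hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def duplicated_read_names(fastq_lines: Iterable[str]) -> List[str]:
    """Return read names that occur more than once in a 4-line-per-record FASTQ."""
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 != 0:  # only the first line of each record carries the name
            continue
        fields = line.strip()[1:].split()  # drop the leading '@', ignore the description
        if fields:
            counts[fields[0]] += 1
    return sorted(name for name, n in counts.items() if n > 1)


if __name__ == "__main__":
    toy = ["@r1 desc\n", "ACGT\n", "+\n", "IIII\n",
           "@r1\n", "TTTT\n", "+\n", "IIII\n"]
    print(duplicated_read_names(toy))
```

The counter holds every read name in memory, so for a whole exome FASTQ a streaming or sorted approach may be more practical.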

  • manbamanba Member

    Thanks a lot.
    Is there an hg19 version of those files, especially af-only-gnomad.hg38.vcf.gz?
    How many samples does 1000g_pon.hg38.vcf.gz contain, and does it cover all human genes?

  • manbamanba Member

    In your PON, you are not looking to isolate germline variants - you are looking to compare the normal sample to the tumor sample.

    But first I need to create a PoN built from many normal-sample VCFs, and those VCFs will contain a lot of germline variants. HaplotypeCaller is designed to call germline variants, and I just need those variants to be filtered out of my tumor variants, so in this respect I think HaplotypeCaller is a better choice than Mutect2 for creating the germline VCFs.

    thanks a lot

  • SChaluvadiSChaluvadi Member, Broadie, Moderator admin

    @manba
    Here is a link to the bucket that contains the hg19 version of the af-only-gnomad.hg38.vcf.gz. It is called af-only-gnomad.raw.sites.vcf (please note that it is not in a compressed .gz format).

    Based on this post, I believe the 1000g PoN has 40 samples.

    I believe there is some confusion around the PoN. The pon is used in somatic variant analysis - which is why we implement it within the Mutect2 tutorial; Mutect2 is used to determine somatic variants. The way it does this is by comparing a tumor sample with a normal sample. A normal sample is derived from healthy tissue with no known somatic alterations. Combining multiple normal samples, to create a pon (which is essentially a vcf file) allows you to compare the resulting vcf from a tumor sample against that of the normal. By doing so you can determine, for example, variants called in the tumor that are also called in the normal are probably not true somatic variants. If you were to use HaplotypeCaller, you would be looking at germline variants which would not help you resolve somatic variants in your tumor samples. You can read this link about Panel of Normals for a more detailed explanation.
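
The idea in the paragraph above, that a site called in the tumor and also seen in the normals is probably not a true somatic variant, can be shown with a toy set comparison. This illustrates the concept only and is not how Mutect2 or CreateSomaticPanelOfNormals work internally. The `min_normals` threshold and both function names are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

Site = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def build_toy_pon(normal_callsets: Iterable[Iterable[Site]], min_normals: int = 2) -> Set[Site]:
    """Keep sites that were called in at least `min_normals` normal samples."""
    counts: Dict[Site, int] = {}
    for callset in normal_callsets:
        for site in set(callset):  # count each site once per sample
            counts[site] = counts.get(site, 0) + 1
    return {site for site, n in counts.items() if n >= min_normals}


def split_by_pon(tumor_calls: Iterable[Site], pon: Set[Site]) -> Tuple[List[Site], List[Site]]:
    """Split tumor calls into (not in the PoN, in the PoN)."""
    tumor_calls = list(tumor_calls)
    kept = [s for s in tumor_calls if s not in pon]
    flagged = [s for s in tumor_calls if s in pon]
    return kept, flagged


if __name__ == "__main__":
    normal1 = [("chr17", 100, "A", "G"), ("chr17", 200, "C", "T")]
    normal2 = [("chr17", 100, "A", "G")]
    pon = build_toy_pon([normal1, normal2])
    tumor = [("chr17", 100, "A", "G"), ("chr17", 300, "G", "A")]
    kept, flagged = split_by_pon(tumor, pon)
    print("kept:", kept)
    print("flagged as in PoN:", flagged)
```

Requiring a site to recur in several normals reflects the PoN's stated purpose in this thread, which is to capture recurrent technical artifacts rather than one individual's variants.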

  • manbamanba Member

    in https://gatkforums.broadinstitute.org/gatk/discussion/10983/firecloud-somatic-hg38-1000g-pon-resource-availability the author said

    "tells me it is the 39-sample PoN that does not cover the non-primary assembly contigs. I recently generated an updated 40-sample PoN that covers the ALT/decoy/HLA contigs, as one may need coverage for when doing analyses in GRCh38. "

    How did the author realize that the PoN did not cover the non-primary assembly contigs? I know the definition:
    "Primary Assembly:
    Relevant for haploid assemblies only. The primary assemblies represents the collection of assembled chromosomes, unlocalized and unplaced sequences that, when combined, should represent a non-redundant haploid genome. This excludes any of the alternate locus groups."

    So non-primary assembly contigs means redundant sequence.

    About the 1000g PoN having 40 samples: although the author said so, can you confirm it again? Also, I looked for an hg19 version and there is none, so I need to use CrossMap to convert the file to hg19. Am I right? Thanks a lot.

  • SChaluvadiSChaluvadi Member, Broadie, Moderator admin

    @manba
    That reply was written by the architect of the PoN as well as the tutorial/documentation for generation of a PoN as well as Mutect2. She was most likely aware of the dates that accompanied the versions of the PoNs that were posted so she was sharing her personal insight into the version.

    I believe that by non-primary, she is referring to the "ALT/decoy/HLA contigs" that are included in the 40-sample PoN. The ALT (alternate) contigs refer to representations of haplotypes that do not have a single representation. The HLA contigs are a subset of the alternate contigs, and decoy contigs refer to reads that do not map to the human reference genome.

    Yes you can try to do a liftover of the 40 sample PoN to match your desired reference build.

  • manbamanba Member

    Do the 40 samples use WES or WGS data?
    Do you think there will be a big difference between using WES and WGS data, and what is the difference?
    Thanks a lot.

    Since you use your own PoN, where is your bed file? If a PoN is used with a different bed file, what should I worry about? If the bed files must be related, what is the relationship: should one contain most sites of the other, or must it contain every site?
    Thanks a lot.

  • SChaluvadiSChaluvadi Member, Broadie, Moderator admin

    @manba
    There should not be a difference in the steps you would take to generate your PoN from WGS or WES samples. The 40 samples used in the example PoN are WES samples because of their smaller size, which makes tutorials easier to follow. However, all the steps in the Mutect2 short variant discovery tutorial are applicable to WGS data.

    Can you clarify your question about the bed file? Do you mean using the -L option in Mutect2? This indicates an intervals list to specify particular genomic intervals.

  • manbamanba Member

    I mean: if I use just one bed file to create a PoN and apply it to all my bed files (different targeted experiments), what bad effects will that have? To minimize them, should the bed files be related in some way, for example one bed containing all the sites or all the chromosome positions (starting before the start and ending after the end) of the others?

    I want to say that I only have normals for one bed file and want to apply the PoN to all bed files.
    Thanks a lot.

  • SChaluvadiSChaluvadi Member, Broadie, Moderator admin

    @manba If I understand your question correctly, then the PoN should contain all the regions across all of your bed files. If this is the case, you can use the same PoN.
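
Whether one set of regions contains another can be checked directly on the BED intervals. The sketch below assumes that the PoN's coverage is represented by the BED intervals used when calling the normals, and that intervals are BED-style half-open (chrom, start, end) tuples. It reports the parts of a target BED that fall outside those regions; the function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), half-open like BED


def merge_intervals(intervals: Iterable[Interval]) -> Dict[str, List[List[int]]]:
    """Merge overlapping or touching intervals per chromosome."""
    merged: Dict[str, List[List[int]]] = {}
    for chrom, start, end in sorted(intervals):
        spans = merged.setdefault(chrom, [])
        if spans and start <= spans[-1][1]:
            spans[-1][1] = max(spans[-1][1], end)
        else:
            spans.append([start, end])
    return merged


def uncovered(target: Iterable[Interval], pon_regions: Iterable[Interval]) -> List[Interval]:
    """Return the pieces of `target` intervals that `pon_regions` does not cover."""
    merged = merge_intervals(pon_regions)
    gaps: List[Interval] = []
    for chrom, start, end in target:
        pos = start
        for s, e in merged.get(chrom, []):
            if e <= pos:
                continue  # span lies entirely before the uncovered part
            if s >= end:
                break     # span lies entirely after the target
            if s > pos:
                gaps.append((chrom, pos, s))
            pos = max(pos, e)
            if pos >= end:
                break
        if pos < end:
            gaps.append((chrom, pos, end))
    return gaps


if __name__ == "__main__":
    pon_bed = [("chr17", 100, 200), ("chr17", 150, 300)]
    other_bed = [("chr17", 120, 280), ("chr17", 250, 350)]
    print(uncovered(other_bed, pon_bed))
```

An empty result means that every base of the target BED lies inside the regions used for the PoN.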

  • manbamanba Member

    Does "contain all the regions" mean containing all the genes, or just the regions on the chromosome?

    For example, one bed has genes A, B, and C, but there is a gene F whose region contains genes A, B, and C. Is that still OK? As you know, genes overlap each other.

    Also, when I apply one PoN to different bed files, the number of variant sites in some samples is higher after PoN filtering (by just 1 or 2). How can this be explained? Thanks a lot.

  • manbamanba Member

    :D Again, about the germline-resource parameter:
    "
    If a variant is absent from a given germline resource, then the value for --af-of-alleles-not-in-resource applies. For example, gnomAD's 16,000 samples (~32,000 homologs per locus) becomes a probability of one in 32,000 or less. Thus, an allele's absence from the germline resource becomes evidence that it is not a germline variant.

    "

    If I use the default value of this parameter:
    "
    --germline-resource af-only-gnomad.vcf.gz \
    --af-of-alleles-not-in-resource 0.00003125 \
    "
    0.00003125 * 16000 = 0.5. Does that mean a variant is defined as somatic if it exists in fewer than 0.5 samples, that is, half a sample?

    But how should I understand 0.5 samples? A sample count is an integer; can it be a fraction?

    The second question: the file in your Google Cloud bucket is much smaller than the one from gnomAD. Is it the right file?

    thanks a lot.

  • manbamanba Member


    In https://gatkforums.broadinstitute.org/gatk/discussion/11136/how-to-call-somatic-mutations-using-gatk4-mutect2

    why did you choose only chr17, and
    "
    -pon resources/chr17_pon.vcf.gz \
    --germline-resource resources/chr17_af-only-gnomad_grch38.vcf.gz \
    --af-of-alleles-not-in-resource 0.0000025 \
    "
    because you chose a chr17 PoN, must you also use the chr17 germline resource, or the reverse?

    And why do you need the PoN just for chr17? Is the bed file restricted to the chr17 region?

    Also, --af-of-alleles-not-in-resource changes from 0.00003125 to 0.0000025. Why did you decide to change it? Is there a rule for this value?

    thanks a lot

  • manbamanba Member

    Should I use this parameter, especially when I use one PoN for different bed files? Thanks a lot.

    To enable genotyping of PoN sites, use the --genotype-pon-sites option. If the match is not exact, e.g. there is an allele-mismatch, the tool reassembles the region, emits the calls and annotates matches in the INFO field with IN_PON.

  • SChaluvadiSChaluvadi Member, Broadie, Moderator admin

    @manba
    The PoN should contain any/all regions of interest within the combination of your bedfiles if you want to use the same PoN. For example, if you have 2 bed files, the PoN should contain all the regions listed within the 2 bed files. The bed file you are using just restricts the regions of the genome that you are applying the tool.

    In the tutorial data, as I mentioned previously, exome data was used to expedite the analysis, and chr17 was used as an example subset to speed it up further. The chr17 PoN was used with the chr17 germline resource to make the example run faster.

    In general, the way to calculate the --af-of-alleles-not-in-resource value is 1/(2 * number of samples in the germline resource). For the tutorial data, exome data was used. In the gnomAD resource file there are 200K exomes, so the calculation there was 1/(2 * 200000) = 0.0000025. You can calculate this value from the number of samples in your germline resource. The default is 0.001.
    The --af-of-alleles-not-in-resource parameter is the assumed population frequency of alt alleles that aren't in the backing population resource (like dbSnp or gnomAD). Ultimately it affects whether variants will be filtered out by the germline filter. The higher the value, the more likely that an allele occurs in the population and the more likely that any alt allele found will be filtered out by the germline filter (if that alt allele didn't occur in dbSnp or gnomAD).
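
The rule of thumb above, one over twice the number of samples (two alleles per diploid sample), is easy to check with a few lines. This is a small sketch of that arithmetic only; the function name is hypothetical, and the sample counts are the ones quoted in this thread.

```python
def af_of_alleles_not_in_resource(n_samples: int) -> float:
    """1 / (2 * n_samples): one allele out of all alleles in a diploid resource."""
    if n_samples <= 0:
        raise ValueError("n_samples must be a positive integer")
    return 1.0 / (2 * n_samples)


if __name__ == "__main__":
    print(af_of_alleles_not_in_resource(200_000))  # 200K exomes    -> 2.5e-06
    print(af_of_alleles_not_in_resource(16_000))   # 16,000 samples -> 3.125e-05
```

The value is an allele frequency, not a count of samples, so multiplying it by the number of samples and getting 0.5 does not mean "half a sample". It corresponds to one allele among the 2N alleles in the resource.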

    The PoN and germline-resource parameters do different things. The PoN is used to capture TECHNICAL artifacts such as the different ways that a sample might have been generated. This is why samples used in the PoN should be as technically similar to the case samples even if they are not matched. The germline-resource is used to annotate variant alleles with frequencies. Please read the documentation for more detailed explanations and to understand better the application of the options to your specific use case.
