Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

New to the forum? Ask your questions here!

Tiffany_at_BroadTiffany_at_Broad Cambridge, MAMember, Administrator, Broadie, Moderator admin
edited November 2018 in Ask the GATK team

We are facing a high volume of spam that we are working to prevent. Due to this, new users are only able to comment on current question threads, not create new ones. Please post your GATK question here if this applies to you. We apologize for this inconvenience and should have this resolved in the next few days.

Best Answers

Answers

  • Dear GATK-Team,

    I want to use CombineGVCF to combine 242samples into one file to analysis with GenotypeGVCFs _and _SelectVariants.

    I think I don't see something obvious but how do I do that in the call without entering all 242 file names manually?
    I tried (which did not work):

        java -jar GenomeAnalysisTK.jar \
           -T CombineGVCFs \
           -R reference.fasta \
           --variant *.g.vcf \
           -o output.g.vcf
    

    Also I would say a 'for'- or 'while'-loop seem not applicable. So how can I do this without entering all 242 file names manually?

    Thank you for your time and Ideas.
    Best,
    Lucy

  • Dear GATK team,
    I have a trouble with applying hard filters to a call set. I did 1.Extract the SNPs from the call set, and I wanted to do 2. Apply the filter to the SNP call set and I typed: ./gatk-4.0.10.1/gatk VariantFiltration -V raw_snps.vcf -R /home/path/to reference.fa -O filtered_snps.vcf --"QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"
    BUT a got the error message: QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" IS NOT A RECOGNIZED OPTION? Can you help me with this? Thank you in advance!
    All the best,
    Marija

  • @Geraldine_VdAuwera
    thank you very much Geraldine for your fast response and help. I will try it with the textfile and will look into GenomicsDB.

    All the best,
    Lucy

  • @Akidne Many thanks! That works!
    Marija

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @Rita_S Welcome to the forum, we have received your question and will get back to you.

  • zongjizongji Member

    Hi there,

    Is it possible to use the set of SNPs generated by GATK to directly generate the site frequency spectrum? Although ANGSD has this function, its input is bam file.

    Can someone tell me how to calculate SFS using the results of GATK?

    Thanks a lot.

    This Ticket has been deleted from Zendesk
    This Ticket has been deleted from Zendesk
  • Several Cromwell newbie questions here:

    1. Are out-of-memory errors retry-able? This is unfortunately a common problem in my multi-user HPC environment, e.g. if a competing job on the same node as my task didn't request sufficient slots and squeezed out my task as a result. From my local testing, killing a task command manually from a terminal does not result in a retry, but I haven't had the chance to force an OOM failure yet.

    2. Is there a recommended way to write task definitions so that they can support multiple backends (including Local), without getting the log warning about "unrecognized runtime attribute"? Or do tasks and workflows always need to be backend-specific? The current behavior (at least with my rough workflows) seems too noisy and inflexible.

    3. When I tried to use the SGE backend I quickly ran into problems with executables not being on path. (A) is there a way to propagate $PATH for a workflow? (B) the task actually just hangs and never completes, rather than registering the failure (or even writing the error to stderr or stdout), is this a known bug? (Using 35-71debed-SNAP, if it matters - although what I actually did to build it was "git checkout 36".)

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @natechols - Thank you for sharing your question, we did not have a link to the WDL/Cromwell forum in our banner, but it can be found here: https://gatkforums.broadinstitute.org/wdl/categories/ask-the-wdl-team

    I would recommend posting your question there to get a faster response. Let us know if you have any trouble doing so.

  • @AdelaideR I posted here because the WDL/Cromwell forum won't let new users post questions.

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @natechols - We have a restriction on new users with less than 10 posting points. You are at 7 currently. Maybe if we have a little back and forth, you will be able to post after your next response?

  • ChrisLChrisL Cambridge, MAMember, Broadie, Moderator, Dev admin

    @natechols -

    1. Not at present but this is certainly on our radar. Unless @Ruchi can contradict me... did your "always retry this job up to n times" apply to SGE memory failures?
      1.b. Not terminating certainly sounds like a bug... could you provide more detail on what caused it?
    2. If you know that you've written a task to be multi-backend then it's fine to ignore those "unrecognized" warnings. They're intended to help you catch typos, say if you had dockar: "ubuntu:latest" for example.
    3. When you submit jobs to SGE without docker, I believe you're relying on whatever environment the cluster has set up for you. If that sounds OK because the environment should be ok, you might just need to use a fully qualified path to the tool you want to use?
  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @zongii I found a reference to this issue from a [previous forum posting] (https://gatkforums.broadinstitute.org/gatk/discussion/5026/i-have-all-allele-frequency-0-5-or-1-0).

    They recommended RTGTools for calculating allele frequency from a VCF file.

    I would be interested to hear how this works out. We have the JEXL expression tool in GATK, but it may not be exactly what you are looking for if you care about the spectrum, and not just filtering on allele frequency.

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @natechols We have added another link to the banner for the WDL/Cromwell section of the forum.

    Here is the link

    Thank you for your patience.

  • RuchiRuchi Member, Broadie, Moderator, Dev admin

    @ChrisL @natechols -- the SGE memory failures should certainly be retried by maxRetries. Please report back if you don't see that working!

  • Thanks all for the help, I will follow up with further questions on the other board. @ChrisL my complaint about (2) is that this makes debugging more difficult because there will always be a bunch of spurious warning messages. Obviously I can modify the code to change the log level, but I'd prefer to avoid using custom builds in production if possible.

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    Thanks @natechols, if you could move this discussion over the Cromwell forum to make sure that they see your questions right away. This is the GATK forum and they may not see your request. @ChrisL if you could keep an eye on this thread, it would be much appreciated in getting it into the Cromwell space for future reference.

  • jparker4jparker4 Member

    Hi GATK team, I posted here last Friday but my post seems to have disappeared. I am having a problem when running GenomicsDBImport on a set of gVCF files.

    The error in question is: A USER ERROR has occurred: Failed to create reader from file:///mnt/fastdata/mbp15jdp/GTEx/SRA-private/SCAPT_gVCF/Variantcalls.dir/Frontal_Cortex-5074-NPJ8.g.vcf.gz

    And the statement I am running is:
    gatk --java-options -Xmx10G GenomicsDBImport -V Variantcalls.dir/Frontal_Cortex-5074-NPJ8.g.vcf.gz -V Variantcalls.dir/Frontal_Cortex-4565-NPJ8.g.vcf.gz -V Variantcalls.dir/Frontal_Cortex-4996-NPJ8.g.vcf.gz -V Variantcalls.dir/Frontal_Cortex-4005-NPJ8.g.vcf.gz -V Variantcalls.dir/Frontal_Cortex-4435-NPJ8.g.vcf.gz -V Variantcalls.dir/Cerebellar_Hemisphere-4643-NPJ8.g.vcf.gz -V

    Any help would be much appreciated, thanks.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @jparker4

    Lets handle this issue here.

    Regards
    Bhanu

  • NeginNegin Member
    Dear GATK Team,
    I am working with a bacteria dna sequence and I am confused about which best practices I should follow in my case. I read your website with rapt attention, but I found only two best practices: one for somatic and another one for germline, but nothing for bacteria! Could you please tell me which of them I should apply for bacteria? Or do you have any other special best practices for bacteria?
    For example, I don't know which of HaplotypeCaller or MuTect2 I should use for my case! Could you please help me in this regard?
    Thank you in advance,
    Negin
  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    @Negin

    We do not have any specific recommendation for bacterial dna variant calling. But my best guess would be to use Haplotypecaller for your purposes.

  • NeginNegin Member
    Thank you @bhanuGandham for your quick answer,
    In our group, we also work on RNA sequence of human. We have two groups: Normal and cancerous (Pancreas cancer) with 50 samples and 60 samples per each respectively. Again, the same question: which of HaplotypeCaller or MuTect2 we should use in this case? Or to be more general, which best practices we should follow? the one for somatic or the one for germline?
    Thanks again,
    Negin
  • MyoNaungMyoNaung Member
    Hi @Negin,

    My best guess here is I think HaplotypeCaller is common use among bacteria. HaplotypeCaller relies on ploidy assumption (diploid by default, and you can change it to ploidy = 1). MuTect2 allows for a varying allelic fraction for each variant, and more suitable for cancer genomic.

    Best,

    Myo
  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @Negin @MyoNaung is correct that for a single bacterial species HaplotypeCaller is correct for most purposes. If you are specifically interested in diversity (either a metagenome or just variation within a sample of cells from a single species) Mutect2 may be appropriate. I must admit that we have never tried Mutect2 on bacteria, but we would be interested in supporting any users who want to try!

  • NeginNegin Member
    Thanks all,
    Could also someone please answer my second question that I asked before. I will copy it again here:
    In our group, we also work on RNA sequence of human. We have two groups: Normal and cancerous (Pancreas cancer) with 50 samples and 60 samples per each respectively. Again, the same question: which of HaplotypeCaller or MuTect2 we should use in this case? Or to be more general, which best practices we should follow? the one for somatic or the one for germline?
    Thanks again,
    Negin
  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @Negin for calling somatic variants, use Mutect2. For your cancer samples with matched normals use tumor-normal mode (gatk Mutect2 -I tumor.bam -I normal.bam -normal normal_sample_name. . .) and for those without a matched normal use tumor-only mode (gatk Mutect2 -I tumor.bam . . .) If you are interested in the germline variants of your samples, use HaplotypeCaller.

  • NeginNegin Member
    @davidben
    thanks for your answer. The point is I have RNA sequence rather than DNA. So, is your answer still valid for this case?
    One more question, please: In the past, I used HapplotypeCaller for my case. Do you think it was totally wrong or just slightly different?
    Thanks,
    Negin
  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @Negin, yes my answer is the same for RNA. I would rate your previous use of HaplotypeCaller for somatic variants as mainly but not totally wrong. In practical terms, I would re-run analyses with the latest version of Mutect2 rather than rely on old results from HaplotypeCaller.

  • NeginNegin Member
    Another question regarding hard filtering:
    I found two different articles from GATK regarding hard filtering.
    The first article suggesting these values for hard filtering:
    For SNPs:
    QD < 2.0
    MQ < 40.0
    FS > 60.0
    SOR > 3.0
    MQRankSum < -12.5
    ReadPosRankSum < -8.0

    For indels:
    QD < 2.0
    ReadPosRankSum < -20.0
    InbreedingCoeff < -0.8
    FS > 200.0
    SOR > 10.0


    And, the second article suggesting, these:
    gatk VariantFiltration \
    -V snps.vcf.gz \
    -filter "QD < 2.0" --filter-name "QD2" \
    -filter "QUAL < 30.0" --filter-name "QUAL30" \
    -filter "SOR > 3.0" --filter-name "SOR3" \
    -filter "FS > 60.0" --filter-name "FS60" \
    -filter "MQ < 40.0" --filter-name "MQ40" \
    -filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
    -filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
    -O snps_filtered.vcf.gz

    gatk VariantFiltration \
    -V indels.vcf.gz \
    -filter "QD < 2.0" --filter-name "QD2" \
    -filter "QUAL < 30.0" --filter-name "QUAL30" \
    -filter "FS > 200.0" --filter-name "FS200" \
    -filter "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" \
    -O indels_filtered.vcf.gz


    But, the filtering arguments in these two articles are not totally same!! For example, on the second article, you filter also based on Qual, but on the first one, you suggest no filtering for Qual!
    So which article is more reliable or should be followed, in general? or do you suggest any article to follow for hard filtering?
    Thanks in advance,
    Negin
  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    I will defer to @bhanuGandham as to germline best practices.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    @Negin

    Hard filtering is a tough one as there is no "one size fits all". The hard filtering thresholds in the GATK articles are what we recommend but you should adjust it for your specific analysis needs. Try with and without the QUAL filter and see how your results vary and then decide.

    Take a look at this document: https://software.broadinstitute.org/gatk/documentation/article?id=23216#2

  • NeginNegin Member
    Thanks @bhanuGandham,
    I would follow the recommendations on the suggested document by you, but do I need to adjust some arguments for bacteria or is it same?
  • minimaxminimax Member
    edited May 13
    (Sorry I have to post it here since somehow I could not select a category to post.)

    Hi, I'm new to this forum and GATK . I have a question and would appreciate your answers.
    I'm running GATK **Best practice pipeline** locally (on my University's linux server) without using docker (so I also need to modify the codes). I have a question regarding the relationship between **Picard** and **GATK**, specifically about two operations: **SortSam** and **MergeBamAlignment**.

    From usage example on the documentation webpage, I need to call them using something like `java -jar picard.jar SortSam`, but it seems that I could also run these two operations using GATK like `gatk SortSam`. Is this correct? This is a little confusing as in the documentation webpage, the title is SortSam (Picard) indicating that Picard should be used.

    Thank you very much for your clarification!

    Also, how long should I wait before I could post links? and use markdown syntax?
  • bshifawbshifaw Member, Broadie, Moderator admin

    @Negin

    Bhanu's earlier comment probably applies to your last question, we don't have any specific recommendations on bacteria as GATK is intended and tested for human sequences.

    @bhanuGandham said:
    We do not have any specific recommendation for bacterial dna variant calling. But my best guess would be to use Haplotypecaller for your purposes.

  • NeginNegin Member

    @bshifaw,
    Can I imply from your speech that all of QD, QUAL, SOR, FS, MQ, MQRankSum, and ReadPosRankSum are also valid for bacteria (haploid organism)?

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @minimax -

    It depends on which version of GATK you have installed on your university's Linux system. Did you follow the instructions here?

    You should be able to call all of the tools by typing

    gatk toolName -parameters <arguments>
    
    

    For example:

    gatk SortSam -I input.bam -O output.sorted.bam
    
    

    It is not necessary to use java -jar picard.jar with the latest version of GATK.

  • minimaxminimax Member
  • stevenmstevenm Member
    Hi GATK team,

    I'm new to the forum and I have a question about VQSR.

    There are 31 exomes in my study but one half was captured using a different exome kit to the other half. Because there are regions of the exome where there are only effectively 15 individuals (the locus is only captured by one capture kit), does this mean VSQR is inappropriate for my data?

    Thanks!
  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @stevenm

    There two different issues here.

    1.) VQSR should not be run on two different exome kits combined, so VQSR should be run on these exome kits separately.

    2.) 15 exomes is a small number of samples, and may not be enough data for VQSR.

    Is there any way to increase the number of samples inside each kit?

  • stevenmstevenm Member
    edited May 17
    Thanks for getting back to me @AdelaideR

    I suspected this might be the case. Could you point me to a resource that explains why different exome kits should be treated separately in VQSR?

    Unfortunately I can't increase the number of samples. I could add 1000 genomes data, but these exomes wouldn't use the same kit either.

    It seems like hard filtering might be the way to go here.
  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin
    edited May 17

    @stevenm The expectations for TiTv are based on what parts of the genome are being examined. If a different part of the exome is captured by one kit versus another, it would be impossible to have a basis for predicting what the value should be. This is true of any protocol where two different types of preparations, kits or sequencing are used. It is difficult to cross-compare them based on the assumptions made in the kit itself.

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @trannguyen I have asked @bshifaw to take a look at this, but there is an extensive discussion on this error here.

    The simplest solution may be that the standard out may be running out of space. Are you running this on a local machine?

  • jayaramvjayaramv Member
    No calls for reference genotype in final vcf.

    I have a question regarding no calls at a particular site. A handful of samples from 600 samples are showing the PL values as 0,0,0.

    In the g.vcf
    14 95590730 . C <NON_REF> . . END=95590801 GT:DP:GQ:MIN_DP:PL 0/0:1060:0:864:0,0,0

    In the final vcf the genotype is not called as reference
    . ./.:452,0:452:.:0,0,0

    Human Gene panel germline calling pipeline (100 genes)
    GATK best practice (version 4)

    I have around 2KB of the region which I have performed HaplotypeCaller with bamout. Would anyone be willing to look at the command lines and bamout and tell me why this region ends up as ./. in few samples when the read depth is so high.

    I have a zip file with all the information if you require it for roughly 2Kb region.
  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @jayaramv

    Would you please share your gvcf, bam and bamout to reproduce this issue. You can share the files with us using these instructions: https://software.broadinstitute.org/gatk/documentation/article?id=23405

  • jayaramvjayaramv Member
    Hi @bhanuGandham ,

    I have uploaded a 5 Kb region for 3 samples with no calls, 3 samples with reference calls and one sample with a heterozygote call with bam, gvcf and bamout files for each (ftp.broadinstitute.org: snippet_files_jayaramv.gz)

    Just to reiterate my issue:
    GATK version 4 following best practise.
    600 samples of targeted gene panel using nextera (100 genes)
    I use join calling after merging gvcfs.

    However in the final vcf I am noticing that for some variants around 30% have no calls (./.) and the rest are (0/0) or (1/0) if there are any...

    The regions with (./.) do not appear to have poor read depths and looks quite similar in quality to the samples that are called (0/0).

    So my question is why do a large percentage of samples have no calls at certain variant locus for no apparent reason?

    I have uploaded a 5kb region for seven samples which show this problem clearly. All samples have good read depths but three samples have no calls at 16:50826538, but three are 0/0 and one sample is (0/1).

    Thanks again for looking into this issue, I really appreciate it and I am struggling to understand why this is happening at various different loci.

    Best Jay
  • NicolasKNicolasK GermanyMember
    Sorry, but I could not find a solution in the Forum.
    Is it still true that no new questions threads are possible to create?
    Somehow I can not find a button or something to start one.

    My question:
    I am using docker version 9e737a9f562c and try to do a germline CNV analysis.

    I did the following steps, which seemed to work fine to construct a reference:
    - CollectReadCounts
    - DetermineGermlineContigPloidy, for my control samples
    - GermlineCNVCaller in "cohort" mode with all the control samples

    To analyze a single sample for CNVs I used the "case" mode of GermlineCNVCaller.
    This also finished without giving an error.

    I am bit confused as many on the result files in the "calls" folder have only one column:

    '''
    ...-calls/SAMPLE_0$ head *.tsv
    ==> baseline_copy_number_t.tsv <==
    @RG ID:GATKCopyNumber SM:WES_val18_1
    BASELINE_COPY_NUMBER
    2
    2
    2
    2
    2
    2
    2
    2

    ==> log_c_emission_tc.tsv <==
    @RG ID:GATKCopyNumber SM:WES_val18_1
    COPY_NUMBER_0 COPY_NUMBER_1 COPY_NUMBER_2 COPY_NUMBER_3 COPY_NUMBER_4 COPY_NUMBER_5
    -797.34101889758176 -17.212628790078728 -4.6462717128635065 -12.586281678466182 -23.891742471520427 -35.389929576114589
    -1.00503186717672 -10.648211871210762 -20.89445011760084 -30.326362140815757 -38.991535584990238 -46.98785797588387
    -4.1188692069108237 -47.753298268967058 -82.408039394691031 -107.9894715938793 -128.2574206310149 -145.04386068138322
    -62.271633124290645 -7.0034421825122433 -2.777263369916942 -4.3458447102733873 -7.8524844258694193 -12.136902708501102
    -1.0065902169640228 -1.1734868460018859 -1.3399852819200373 -1.506088146632609 -1.6717980295158033 -1.8371174880361167
    -1.0064403157192932 -1.0899363956661245 -1.1733270435413026 -1.2566126688740686 -1.3397936778523547 -1.4228704733668354
    -1.0064425823902747 -1.0528490937360051 -1.0992156829227817 -1.1455424634436031 -1.1918295481346934 -1.2380770491814308
    -1.0064947759496747 -1.0773752950431721 -1.1481603858476805 -1.2188504454203655 -1.2894458676780125 -1.3599470434348337

    ==> log_q_c_tc.tsv <==
    @RG ID:GATKCopyNumber SM:WES_val18_1
    COPY_NUMBER_0 COPY_NUMBER_1 COPY_NUMBER_2 COPY_NUMBER_3 COPY_NUMBER_4 COPY_NUMBER_5
    -25.32907097420383 -12.67487930097565 -0.00037783103665156581 -7.8895830567291814 -19.07496184157387 -25.101562866624523
    -5.8263193408736313e-07 -14.463566157471806 -16.636037912009918 -25.328900873319753 -25.32907094249471 -25.32907097419011
    -5.9594652934702452e-08 -25.329070974079954 -16.636371013365732 -25.32907097420383 -25.32907097420383 -25.32907097420383
    -25.329070974203827 -7.1278057259145235 -0.0098545999998225775 -4.7325281685256098 -8.5316538559995632 -12.935436583897218
    -6.9726225402324351 -6.6471683949094906 -0.0083730105237666552 -5.189886120363731 -7.9764189185566048 -8.584067472627904
    -6.957523518734833 -6.6497465979355113 -0.00820190583407443 -5.2245577156054317 -7.9686671888574763 -8.5492934964207823
    -6.9893946064302099 -6.7130512755749594 -0.0074016930851855678 -5.3703464995815455 -7.9347205871438975 -8.3896736100551035
    -7.1004626041849379 -6.8247142007574118 -0.0065205113196458719 -5.5175040624646243 -7.9848006124285078 -8.3708009541338857

    ==> mu_denoised_copy_ratio_t.tsv <==
    @RG ID:GATKCopyNumber SM:WES_val18_1
    DENOISED_COPY_RATIO_MEAN
    1.8425756936764504
    -0.00097141029423242057
    0.037559795837351383
    2.0355285173512092
    -8.6200957016579025
    -11.737327135297674
    -45.054135273187669
    -21.859914219872724
    '''

    When I try to run ModelSegments I get an error:
    (gatk) [email protected]:/gatk/my_data/CollectReadCounts/WES_val18_1_CNV# gatk ModelSegments --denoised-copy-ratios WES_val18_1-calls/SAMPLE_0/std_denoised_copy_ratio_t.tsv --output-prefix WES_val18_1 -O ModelSegments
    ....
    A USER ERROR has occurred: Bad input: Bad header in file. Not all mandatory columns are present. issing: CONTIG, LOG2_COPY_RATIO, START, END
    .....

    Thank you for your help and kind regards.
  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @NicolasK - To avoid spam, we limit new users to this space. I have verified you, you should be able to post new topics now.

    I think there may be an issue with your input file, have you ValidateSam on that file?

  • NicolasKNicolasK GermanyMember
    @AdelaideR - thank you for letting me post new topics.
    Btw, should we movfe this conversation to a new topic?

    I ran ValidateSamFile on the BAM file, it gives me "Mate not found for paired read" for 12909 reads, no other errors.

    It is not clear to me how the BAM/SAM file is related to the error when I use "ModelSegments".
    Do you think the problem occurred already in the first step I used "CollectReadCounts", that did not throw any error.

    Thank you.
  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    Something is causing the MatePairs to be lost. It is probably before this step.

    Could you please provide the exact code for all of the steps that you have run? You may have set a parameter that is in conflict with the downstream code.

    Also, what version of GATK are you using?

    Thanks.

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @NicolasK Has your question been resolved?

  • NicolasKNicolasK GermanyMember
    @AdelaideR - No, I have still the same question.

    The Genome Analysis Toolkit (GATK) v4.1.2.0
    HTSJDK Version: 2.19.0
    Picard Version: 2.19.0

    # CollectReadCounts, gave no error
    for i in *.bam; do echo $i; gatk CollectReadCounts -I $i -L ../../S31285117_Regions.interval_list --interval-merging-rule OVERLAPPING_ONLY -O ${i%.*}.counts.hdf5; done

    # DetermineGermlineContigPloidy, gave no error
    gatk DetermineGermlineContigPloidy -L ../S31285117_Regions.interval_list --interval-merging-rule OVERLAPPING_ONLY --input hdf5_samples/A0421601.counts.hdf5 --input hdf5_samples/A0451701.counts.hdf5 --input hdf5_samples/A0461702.counts.hdf5 --input hdf5_samples/A0551801.counts.hdf5 --input hdf5_samples/A0581901.counts.hdf5 --input hdf5_samples/A0612001.counts.hdf5 --input hdf5_samples/A0642101.counts.hdf5 --input hdf5_samples/A0652102.counts.hdf5 --input hdf5_samples/A0682201.counts.hdf5 --input hdf5_samples/A0752401.counts.hdf5 --input hdf5_samples/A0782501.counts.hdf5 --input hdf5_samples/A0792502.counts.hdf5 --input hdf5_samples/A0943001.counts.hdf5 --input hdf5_samples/A1003101.counts.hdf5 --input hdf5_samples/A1033201.counts.hdf5 --input hdf5_samples/A1063301.counts.hdf5 --input hdf5_samples/A1093401.counts.hdf5 --input hdf5_samples/A1123501.counts.hdf5 --input hdf5_samples/A1153601.counts.hdf5 --input hdf5_samples/A1183701.counts.hdf5 --input hdf5_samples/A1223801.counts.hdf5 --input hdf5_samples/A1253901.counts.hdf5 --input hdf5_samples/A1284001.counts.hdf5 --input hdf5_samples/A1294002.counts.hdf5 --input hdf5_samples/A1324101.counts.hdf5 --input hdf5_samples/A1354201.counts.hdf5 --input hdf5_samples/A1384301.counts.hdf5 --input hdf5_samples/A1394302.counts.hdf5 --input hdf5_samples/A1420405.counts.hdf5 --input hdf5_samples/A1434301.counts.hdf5 --input hdf5_samples/A1461408.counts.hdf5 --input hdf5_samples/A1471409.counts.hdf5 --input hdf5_samples/A1494401.counts.hdf5 --input hdf5_samples/A1524501.counts.hdf5 --input hdf5_samples/A1554601.counts.hdf5 --input hdf5_samples/A1554601s.counts.hdf5 --input hdf5_samples/A1584701.counts.hdf5 --input hdf5_samples/A1614801.counts.hdf5 --input hdf5_samples/A1644901.counts.hdf5 --input hdf5_samples/A1675001.counts.hdf5 --input hdf5_samples/A1705101.counts.hdf5 --input hdf5_samples/A1735201.counts.hdf5 --input hdf5_samples/A1735201s.counts.hdf5 --input hdf5_samples/A1765301.counts.hdf5 --input hdf5_samples/A1795401.counts.hdf5 --input hdf5_samples/A1825501.counts.hdf5 --input hdf5_samples/A1825501s.counts.hdf5 --input hdf5_samples/A1855601.counts.hdf5 --input hdf5_samples/A1885701.counts.hdf5 --input hdf5_samples/A1915801.counts.hdf5 --input hdf5_samples/A1925901.counts.hdf5 --input hdf5_samples/A1925901s.counts.hdf5 --input hdf5_samples/A1956001.counts.hdf5 --input hdf5_samples/A1966002.counts.hdf5 --input hdf5_samples/ERR233225.counts.hdf5 --input hdf5_samples/WES_val18_1.counts.hdf5 --input hdf5_samples/WES_val18_2.counts.hdf5 --input hdf5_samples/WES_val18_3.counts.hdf5 --input hdf5_samples/WES_val18_4_8_12.counts.hdf5 --input hdf5_samples/WES_val18_5_6_7.counts.hdf5 --input hdf5_samples/WES_val18_9.counts.hdf5 --input hdf5_samples/WES_val18_9_10_11.counts.hdf5 --input hdf5_samples/s001a.counts.hdf5 --input hdf5_samples/s001b.counts.hdf5 --contig-ploidy-priors ../ploidy_priors.tsv --output CNV_REF --output-prefix index_normal_cohort;

    # GermlineCNVCaller, gave no error
    gatk GermlineCNVCaller --run-mode COHORT -L ../S31285117_Regions.interval_list --interval-merging-rule OVERLAPPING_ONLY --contig-ploidy-calls CNV_REF/index_normal_cohort-calls --input hdf5_samples/A0421601.counts.hdf5 --input hdf5_samples/A0451701.counts.hdf5 --input hdf5_samples/A0461702.counts.hdf5 --input hdf5_samples/A0551801.counts.hdf5 --input hdf5_samples/A0581901.counts.hdf5 --input hdf5_samples/A0612001.counts.hdf5 --input hdf5_samples/A0642101.counts.hdf5 --input hdf5_samples/A0652102.counts.hdf5 --input hdf5_samples/A0682201.counts.hdf5 --input hdf5_samples/A0752401.counts.hdf5 --input hdf5_samples/A0782501.counts.hdf5 --input hdf5_samples/A0792502.counts.hdf5 --input hdf5_samples/A0943001.counts.hdf5 --input hdf5_samples/A1003101.counts.hdf5 --input hdf5_samples/A1033201.counts.hdf5 --input hdf5_samples/A1063301.counts.hdf5 --input hdf5_samples/A1093401.counts.hdf5 --input hdf5_samples/A1123501.counts.hdf5 --input hdf5_samples/A1153601.counts.hdf5 --input hdf5_samples/A1183701.counts.hdf5 --input hdf5_samples/A1223801.counts.hdf5 --input hdf5_samples/A1253901.counts.hdf5 --input hdf5_samples/A1284001.counts.hdf5 --input hdf5_samples/A1294002.counts.hdf5 --input hdf5_samples/A1324101.counts.hdf5 --input hdf5_samples/A1354201.counts.hdf5 --input hdf5_samples/A1384301.counts.hdf5 --input hdf5_samples/A1394302.counts.hdf5 --input hdf5_samples/A1420405.counts.hdf5 --input hdf5_samples/A1434301.counts.hdf5 --input hdf5_samples/A1461408.counts.hdf5 --input hdf5_samples/A1471409.counts.hdf5 --input hdf5_samples/A1494401.counts.hdf5 --input hdf5_samples/A1524501.counts.hdf5 --input hdf5_samples/A1554601.counts.hdf5 --input hdf5_samples/A1554601s.counts.hdf5 --input hdf5_samples/A1584701.counts.hdf5 --input hdf5_samples/A1614801.counts.hdf5 --input hdf5_samples/A1644901.counts.hdf5 --input hdf5_samples/A1675001.counts.hdf5 --input hdf5_samples/A1705101.counts.hdf5 --input hdf5_samples/A1735201.counts.hdf5 --input hdf5_samples/A1735201s.counts.hdf5 --input hdf5_samples/A1765301.counts.hdf5 --input hdf5_samples/A1795401.counts.hdf5 --input hdf5_samples/A1825501.counts.hdf5 --input hdf5_samples/A1825501s.counts.hdf5 --input hdf5_samples/A1855601.counts.hdf5 --input hdf5_samples/A1885701.counts.hdf5 --input hdf5_samples/A1915801.counts.hdf5 --input hdf5_samples/A1925901.counts.hdf5 --input hdf5_samples/A1925901s.counts.hdf5 --input hdf5_samples/A1956001.counts.hdf5 --input hdf5_samples/A1966002.counts.hdf5 --input hdf5_samples/ERR233225.counts.hdf5 --input hdf5_samples/WES_val18_1.counts.hdf5 --input hdf5_samples/WES_val18_2.counts.hdf5 --input hdf5_samples/WES_val18_3.counts.hdf5 --input hdf5_samples/WES_val18_4_8_12.counts.hdf5 --input hdf5_samples/WES_val18_5_6_7.counts.hdf5 --input hdf5_samples/WES_val18_9.counts.hdf5 --input hdf5_samples/WES_val18_9_10_11.counts.hdf5 --input hdf5_samples/s001a.counts.hdf5 --input hdf5_samples/s001b.counts.hdf5 --output cohort --output-prefix normal_cohort_run

    # ModelSegments, error
    gatk ModelSegments --denoised-copy-ratios cohort/normal_cohort_run-calls/SAMPLE_55/std_denoised_copy_ratio_t.tsv --output-prefix WES_val18_1 -O ModelSegments
    Using GATK jar /gatk/gatk-package-4.1.2.0-local.jar
    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/gatk-package-4.1.2.0-local.jar ModelSegments --denoised-copy-ratios cohort/normal_cohort_run-calls/SAMPLE_55/std_denoised_copy_ratio_t.tsv --output-prefix WES_val18_1 -O ModelSegments
    08:09:56.353 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.1.2.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    Aug 09, 2019 8:10:38 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
    INFO: Failed to detect whether we are running on Google Compute Engine.
    08:10:38.144 INFO ModelSegments - ------------------------------------------------------------
    08:10:38.144 INFO ModelSegments - The Genome Analysis Toolkit (GATK) v4.1.2.0
    08:10:38.144 INFO ModelSegments - For support and documentation go to ...
    08:10:38.145 INFO ModelSegments - Executing as [email protected] on Linux v4.15.0-43-generic amd64
    08:10:38.145 INFO ModelSegments - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_191-8u191-b12-0ubuntu0.16.04.1-b12
    08:10:38.145 INFO ModelSegments - Start Date/Time: August 9, 2019 8:09:56 AM UTC
    08:10:38.145 INFO ModelSegments - ------------------------------------------------------------
    08:10:38.145 INFO ModelSegments - ------------------------------------------------------------
    08:10:38.146 INFO ModelSegments - HTSJDK Version: 2.19.0
    08:10:38.146 INFO ModelSegments - Picard Version: 2.19.0
    08:10:38.146 INFO ModelSegments - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    08:10:38.146 INFO ModelSegments - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    08:10:38.147 INFO ModelSegments - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    08:10:38.147 INFO ModelSegments - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    08:10:38.147 INFO ModelSegments - Deflater: IntelDeflater
    08:10:38.147 INFO ModelSegments - Inflater: IntelInflater
    08:10:38.147 INFO ModelSegments - GCS max retries/reopens: 20
    08:10:38.147 INFO ModelSegments - Requester pays: disabled
    08:10:38.147 INFO ModelSegments - Initializing engine
    08:10:38.147 INFO ModelSegments - Done initializing engine
    08:10:38.240 INFO ModelSegments - Reading file (cohort/normal_cohort_run-calls/SAMPLE_55/std_denoised_copy_ratio_t.tsv)...
    08:10:38.380 INFO ModelSegments - Shutting down engine
    ... done. Elapsed time: 0.70 minutes.
    Runtime.totalMemory()=2150629376
    ***********************************************************************

    A USER ERROR has occurred: Bad input: Bad header in file. Not all mandatory columns are present. Missing: CONTIG, LOG2_COPY_RATIO, START, END

    ***********************************************************************
    Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @NicholasK it still reads as bad inputs.

    A USER ERROR has occurred: Bad input: Bad header in file. Not all mandatory columns are present. Missing: CONTIG, LOG2_COPY_RATIO, START, END

    Try running ValidateSam on all input bams at each step go back to the beginning.

    Try running samtools flagstats on the bams for each step as well to determine where the mate pair reads are lost in the process.

    Regards,

    Adelaide

  • jayaramvjayaramv Member
    Hi @bhanuGandham / @AdelaideR , Did you get an oppurtunity to have a look at the files I uploaded?
    many thanks,
    Jay
  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @jayaramv

    Sorry we missed your question, it got lost in this long thread. In the future we recommend starting new threads so that this does not happen again. It is difficult for us to keep track of questions when the threads get long.
    I am looking into this and will get back to you shortly.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @jayaramv

    I took a look at the data you shared with me.

    1) 80019_snippet.g.vcf is an empty file in the bug report.
    2) I looked at the bamout for the other 6 samples, and all samples have no reads covering 14 95590730 region. That is the reason for the no call.
    3) And for this region:16:50826538, can you please post the joint called variant vcf record.

  • jayaramvjayaramv Member
    Hi @bhanuGandham

    Can I repost this as a new thread with your permission?

    I have now uploaded a new snippet file (snippet_jayaramv_Aug13) with a corrected 80019_snippet.g.vcf, merged gvcf and joint called vcf record.

    1) 80019_snippet.g.vcf is an empty file in the bug report.

    Reuploaded correct version

    2) I looked at the bamout for the other 6 samples, and all samples have no reads covering 14 95590730 region. That is the reason for the no call.

    I know, I didn't upload that region for the 7 samples. The region is 16:50824000-50829000 .

    3) And for this region:16:50826538, can you please post the joint called variant vcf record.

    Done


    Just to reiterate my previous question in a better way

    Study design
    GATK version 4 following best practice.
    600 samples of targeted gene panel using nextera (100 genes)
    I used join calling after merging gvcfs.

    Looking across the 600 samples [for example at variant 16:50826538, I have many similar loci across the panel] I notice the following:

    The final vcf for some variants around 30% samples have no calls (./.) and the rest are (0/0) or (1/0) if there are any...

    The regions with (./.) do not appear to have poor read depths and looks quite similar in quality to the samples that are called (0/0).

    So my question is why do a large percentage of samples have no calls at certain variant locus for no apparent reason?

    I have uploaded a 5 Kb region for 3 samples with no calls, 3 samples with reference calls and one sample with a heterozygote call with bam, gvcf , bamout, merged gvcf and vcf files for each.

    -L 16:50824000-50829000 was taken as an example.
    This region is just one of the many regions of my targeted resequencing just to highlight my issue.


    I have uploaded a 5kb region for seven samples which show this problem clearly.


    All samples have good read depths but three samples have no calls at 16:50826538, but three are 0/0 and one sample is (0/1).

    16:50826538
    79880: ./.
    80019: ./.
    80020: ./.

    80008: 0/0
    80009: 0/0
    80010: 0/0

    80600: 0/1

    My question why do some samples have no calls while others have.

    The VCF is before the step "Hard-filtering germline short variants" which may remove this variant because of QD<2. However I have many other variants that have say for example QD>20 that have the same no calls issue.


    Thanks again for looking into this issue, I really appreciate it and I am struggling to understand why this is happening at various different loci
  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @NicholasK Have you checked your inputs?

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @jayaramv

    Yes it will be very helpful if you could please repost this in a new thread. In the mean time I will dig into the data you sent to me.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @jayaramv

    I looked at your data. If you look at the bamout for the samples 79880, 80019 and 80020 at the 16:50826538 region you will see that no reads align to this position. This is the reason for the no calls. And if you looking for why you see no reads in the bamout, that description is given in these docs:
    https://software.broadinstitute.org/gatk/documentation/article?id=11068
    https://gatkforums.broadinstitute.org/gatk/discussion/11076/local-re-assembly-and-haplotype-determination-haplotypecaller-mutect2#latest

Sign In or Register to comment.