Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Attention:
We will be out of the office on November 11th and 13th 2019, due to the U.S. holiday(Veteran's day) and due to a team event(Nov 13th). We will return to monitoring the GATK forum on November 12th and 14th respectively. Thank you for your patience.

Inconsistent format of VCF files generated by PostprocessGermlineCNVCalls

Dear staff, I'm using gatk 4.1.3.0 to call CNVs from my WES data.
After running PostprocessGermlineCNVCalls on every single sample, I found that the VCF file generated has an inconsistent format regarding the FORMAT fields.

First, I found that in interval VCF file, the order of 'GT', 'CN', 'CNQ', 'CNLP' in FORMAT column seems not to be consistent across different files. For instance, I here paste two screenshots corresponding to two files:
file 1

file 2 (ideal format)

Second, I found the number corresponding to the 'GT' field does not consistently represent the same genotype. I believe in the official's document, 0 represents no copy number change, 1 represents deletion and 2 represents duplication. However, this rule can be a chaos in some files. Here I paste one screenshot of normal VCF file:

Here I paste more screenshots of abnormal VCF files(including both segments and interval VCF files) :


The most serious part is this rule, sometimes, is found to be inconsistent within a single VCF file. Which leaves me no choice but neglecting GT field at all for annotation and post-analysis.
Pls take a look at these issues and let me know what went wrong.

Here I also paste the command I used for PostprocessGermlineCNVCalls.
${gatk} PostprocessGermlineCNVCalls \ --model-shard-path ${v6_gCNV_model} \ --calls-shard-path ${cnv_dir}/${v6_gCNV_case_prefix}-calls \ --allosomal-contig chrX --allosomal-contig chrY \ --contig-ploidy-calls ${cnv_dir}/${v6_ploidy_case_prefix}-calls \ --sample-index ${sample_index} \ --output-denoised-copy-ratios ${cnv_dir}/${patientID}.sample_${sample_index}.denoised_copy_ration.tsv \ --output-genotyped-intervals ${cnv_dir}/genotyped-intervals-case-${patientID}-vs-v6cohort.vcf.gz \ --output-genotyped-segments ${cnv_dir}/genotyped-segments-case-${patientID}-vs-v6cohort.vcf.gz \ --sequence-dictionary ${ref_gen}/ucsc.hg19.dict

Thanks!

Answers

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    Interesting finding however mine don't look like this.

    Have you tried to merge or modify files with another tool by any means?

  • sleeslee Member, Broadie, Dev ✭✭✭

    Hi @Yangyxt, thanks for raising this issue. I don't see any obvious reason in the underlying code (which simply uses HTSJDK) that might cause this to happen. Did you merge/modify the files, as @SkyWarrior conjectured? In any case, could you provide test files and command lines so that we might try to reproduce the issue?

  • YangyxtYangyxt Member

    @slee said:
    Hi @Yangyxt, thanks for raising this issue. I don't see any obvious reason in the underlying code (which simply uses HTSJDK) that might cause this to happen. Did you merge/modify the files, as @SkyWarrior conjectured? In any case, could you provide test files and command lines so that we might try to reproduce the issue?

    Dear @slee ,@SkyWarrior

    To reproduce the issue, should I provide the raw BAM file ? Or is it sufficient to provide the built model directory(including ploidy) and the HDF5 files?

    And regarding to the merging/modifying, I don't think I've done any before inspecting the files. I directly inspect files right after the completion of PostprocessGermlineCNVCalls with less command.

    Thanks!

  • sleeslee Member, Broadie, Dev ✭✭✭

    Hi @Yangyxt, just a minimal set of the input files and the command line for the PostprocessGermlineCNVCalls step should be fine. For example, if you can reproduce the issue with just the HDF5 file from a single sample, you don't need to include files from all of your samples. Thanks!

  • YangyxtYangyxt Member

    @slee said:
    Hi @Yangyxt, just a minimal set of the input files and the command line for the PostprocessGermlineCNVCalls step should be fine. For example, if you can reproduce the issue with just the HDF5 file from a single sample, you don't need to include files from all of your samples. Thanks!

    Dear @slee and @SkyWarrior, I think I've found the problem. I actually did merge VCF files before inspecting them. Sorry, I forgot that last time I responded to you.
    I use vcftools to do a VCF merge with the command vcf-merge because some WES data are in trio(probands and parents).

    I checked the files again and found for those I did not need to execute a vcf-merge, the format is normal. While in other VCF files generated by merging, the format is a mess.

    But I wonder why this happens? Do you have other tools recommended for vcf merging? Thanks!

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭
    edited September 4

    Not all vcf parsers do a good job in keeping the format intact that's why.

    When I joint analyze my CNV vcf files I convert them to tabular format (using my custom tool using HTSJDK so format is always consistent) and also count the uniqueness of events at each sample to generate a table of unique events. From there I annotate intervals and try to find the culprit. Unlike SNP vcf files CNV vcf files will be harder to join in segment format I suppose.

  • YangyxtYangyxt Member

    @SkyWarrior said:
    Not all vcf parsers do a good job in keeping the format intact that's why.

    When I joint analyze my CNV vcf files I convert them to tabular format (using my custom tool using HTSJDK so format is always consistent) and also count the uniqueness of events at each sample to generate a table of unique events. From there I annotate intervals and try to find the culprit. Unlike SNP vcf files CNV vcf files will be harder to join in segment format I suppose.

    @slee and @SkyWarrior ,

    Thanks for the information. I find other issues regarding two VCF files.
    One is I found the segments VCF file's records are not in exact concordance with the intervals VCF file's records.
    Here I provide two VCF files of the same sample:
    interval VCF file:

    segment VCF file:

    You can notice that in the region(chr1:14576-chr1:17832) of deletion according to segment VCF file, there is a bin(chr1:15649-chr1:16030) annotated as amplification in interval VCF file. How does this happen and how can I treat this kind of conflicts?

    Another issue is I don't find detailed comments or description regarding fields of 'NP', 'QA', 'QS', 'QSE', 'QSS', even in https://gatkforums.broadinstitute.org/gatk/discussion/11686/ and https://gatkforums.broadinstitute.org/gatk/discussion/11685/How can I interpret them and why plotting them against each other can be used to filter false-positive calls?

    Also, I don't find a systematic/automatically way to filter falls positive calls in the two tutorial using Jupyter Notebook I mentioned above. All the processes recorded in those tutorials seem to be plotting for manual inspection(excluding the truth set overlap inspection).

    Any guidance are welcome and sincerely appreciated.

  • sleeslee Member, Broadie, Dev ✭✭✭

    @Yangyxt the *intervals.vcf is generated by concatenating the results of running the forward-backward algorithm in each shard, while the *segments.vcf gives the single-sample Viterbi segmentation across all shards. So differences such as that you observed might be expected.

    Brief descriptions for those annotations are given in the VCF header. Unfortunately, I don't think we have more detailed documentation (other than in the source code itself).

    Typically, we simply use QS > 50 for DUPs and QS > 100 for DELs, but you might need to adjust these thresholds or filter on additional annotations, depending on the purposes of your analysis and the quality of your data and model. Ideally, you would have some ground truth or known events that you could use for tuning.

  • YangyxtYangyxt Member

    @slee said:
    @Yangyxt the *intervals.vcf is generated by concatenating the results of running the forward-backward algorithm in each shard, while the *segments.vcf gives the single-sample Viterbi segmentation across all shards. So differences such as that you observed might be expected.

    Brief descriptions for those annotations are given in the VCF header. Unfortunately, I don't think we have more detailed documentation (other than in the source code itself).

    Typically, we simply use QS > 50 for DUPs and QS > 100 for DELs, but you might need to adjust these thresholds or filter on additional annotations, depending on the purposes of your analysis and the quality of your data and model. Ideally, you would have some ground truth or known events that you could use for tuning.

    @slee
    Thank you for your information!

    I saw the VCF headers and all these Quality scores kind of reflect how every bin included in the segments support the general genotype of the segment. But I wondered how you generate QSE and QSS? How to judge them to be genuine breakpoints?

    Regarding the two types of VCF files, I once saw you answered in another post, claiming that interval VCF file can be noisy. I now have a rough concept that these two VCFs are generated with different hidden-state inferring algorithms. But I am still kind of confused which one can I trust more regarding sensitivity and specificity, respectively.

    I understand that CNV breakpoints can be inaccurate for calls derived from WES data. But in my case, I need to do a pedigree-based filtration on the GT field. If I merge VCF files separately from proband and parents to one file. How can I merge and filter CNV calls since the start and end coordinates might be different? Do you have a threshold or cut-off value for the overlap proportions between two CNV events that can be used for filtration?

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭
    edited September 5

    Tabular format is still my recommendation.

    First column will be the sample names so I can easily track which samples are which and event count is a simple excel formula to count the interval name among all the interval names to see how many times I observe a particular locus to have a CNV.

    On the other hand by definition you cannot expect large CNVs (within exonic regions and more than a megabase or so) to be inherited because those tend to be in the category of contiguous gene disorders and usually they are pathogenic. Because of that filtering using event count and tabular format is simpler to follow.

  • YangyxtYangyxt Member

    @SkyWarrior said:
    Tabular format is still my recommendation.

    First column will be the sample names so I can easily track which samples are which and event count is a simple excel formula to count the interval name among all the interval names to see how many times I observe a particular locus to have a CNV.

    On the other hand by definition you cannot expect large CNVs (within exonic regions and more than a megabase or so) to be inherited because those tend to be in the category of contiguous gene disorders and usually they are pathogenic. Because of that filtering using event count and tabular format is simpler to follow.

    @SkyWarrior
    Thanks for your recommendation. It's just I use ANNOVAR as annotation tools, which takes vcfs as input files. Therefore, I need the merging (of probands and parents) happens as vcf files.

    Though I tried MergeVcfs(Picard tools) recommended by GATK, it seems requires the sample column are all the same in all input files. So it seems not fit my purpose.

    For vcftools and bcftools, they fail to merge vcfs with proper format.

    In this case, do you have any other vcf manipulation tools recommended?

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    You should try CombineVariants from gatk3 which some of the community is waiting to be ported to gatk4. CombineVariants does a better jon than any other.

  • sleeslee Member, Broadie, Dev ✭✭✭

    Thanks for your suggestions, @SkyWarrior. @Yangyxt if you'd like more detailed descriptions of the quality scores, you may find the code comments (and perhaps the code itself) at https://github.com/broadinstitute/gatk/blob/master/src/main/python/org/broadinstitute/hellbender/gcnvkernel/postprocess/segment_quality_utils.py useful. These qualities were inspired by analogous qualities reported by XHMM.

  • YangyxtYangyxt Member

    @SkyWarrior said:
    You should try CombineVariants from gatk3 which some of the community is waiting to be ported to gatk4. CombineVariants does a better job than any other.

    Thanks for the suggestions! I guess I just need to download an old release of GATK then.

Sign In or Register to comment.