How to merge the sample_X_genotyped_intervals.vcf files created by PostprocessGermlineCNVCalls?

WimS Member ✭✭

How to merge the sample_X_genotyped_intervals.vcf files created by PostprocessGermlineCNVCalls to a multi-sample VCF file?

The files all have the same bins/records, so it should be easy to create a multi-sample VCF from them.

I normally use bcftools to merge VCF files, but bcftools merge gives the following error when I try to merge the (bgzipped, tabix-indexed) sample_X_genotyped_intervals.vcf files created by PostprocessGermlineCNVCalls:

Incorrect number of FORMAT/CNLP values at Chr_01:1001, cannot merge. The tag is defined as Number=A, but found
6 values and 3 alleles. See also http://samtools.github.io/bcftools/howtos/FAQ.html#incorrect-nfields

Can you check whether the FORMAT declaration of CNLP is correct?
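
For context: in the VCF specification, Number=A means one value per alternate allele, while the error above reports 6 values at a site with 3 alleles, so the declaration and the data disagree. One possible workaround (not confirmed by the GATK developers in this thread) is to relax the declaration to Number=. in each file's header before merging. A minimal Python sketch, assuming the header line starts with `##FORMAT=<ID=CNLP,`; the function names and the example description text are hypothetical:

```python
import re


def relax_format_number(header_line, tag="CNLP", new_number="."):
    """Return header_line with its Number=... value replaced for one FORMAT tag.

    Lines that are not the FORMAT declaration of `tag` are returned unchanged.
    """
    prefix = "##FORMAT=<ID=" + tag + ","
    if not header_line.startswith(prefix):
        return header_line
    # Replace only the first Number=... field in the declaration.
    return re.sub(r"Number=[^,>]+", "Number=" + new_number, header_line, count=1)


def rewrite_vcf_header(lines, tag="CNLP", new_number="."):
    """Yield VCF lines, patching only the matching ##FORMAT header line."""
    for line in lines:
        if line.startswith("##"):
            yield relax_format_number(line, tag, new_number)
        else:
            yield line


# Hypothetical header line; the real Description text will differ.
example = '##FORMAT=<ID=CNLP,Number=A,Type=Integer,Description="hypothetical description">'
print(relax_format_number(example))
```

The patched files would still need to be bgzipped and indexed again before being passed to bcftools merge.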

Also, can you advise whether GATK has a tool to merge single-sample VCF files (created by PostprocessGermlineCNVCalls) into a multi-sample VCF file?

For the time being I wrote my own Python text-parsing script to create the multi-sample VCF file.
But this seems like something that should be possible with GATK or bcftools.
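
The script itself is not shown in this thread. As a rough illustration of the text-parsing approach, here is a minimal sketch that assumes all inputs are uncompressed, single-sample VCF texts with the same records in the same order. The function names and the toy records are hypothetical, and the meta header and the QUAL/FILTER/INFO columns are simply taken from the first file:

```python
FIXED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]
KEY_INDEXES = (0, 1, 3, 4, 8)  # CHROM, POS, REF, ALT, FORMAT must match across files


def parse_vcf(text):
    """Split VCF text into (meta lines, sample names, records as column lists)."""
    meta, samples, records = [], [], []
    for line in text.splitlines():
        if line.startswith("##"):
            meta.append(line)
        elif line.startswith("#CHROM"):
            samples = line.split("\t")[9:]
        elif line:
            records.append(line.split("\t"))
    return meta, samples, records


def merge_single_sample_vcfs(texts):
    """Append the sample columns of each input to the records of the first input."""
    parsed = [parse_vcf(text) for text in texts]
    meta, first_samples, first_records = parsed[0]
    samples = list(first_samples)
    merged = [list(record) for record in first_records]
    for _, other_samples, other_records in parsed[1:]:
        if len(other_records) != len(merged):
            raise ValueError("inputs have different numbers of records")
        for row, other in zip(merged, other_records):
            if [row[i] for i in KEY_INDEXES] != [other[i] for i in KEY_INDEXES]:
                raise ValueError("record mismatch at %s:%s" % (row[0], row[1]))
            row.extend(other[9:])
        samples.extend(other_samples)
    header = "\t".join(FIXED_COLUMNS + samples)
    body = ["\t".join(row) for row in merged]
    return "\n".join(meta + [header] + body) + "\n"


# Toy inputs with a hypothetical GT:CN format; real files carry more fields.
def toy_vcf(sample, genotype):
    return (
        "##fileformat=VCFv4.2\n"
        + "\t".join(FIXED_COLUMNS + [sample]) + "\n"
        + "\t".join(["Chr_01", "1001", ".", "N", "<DEL>,<DUP>", ".", ".",
                     "END=2000", "GT:CN", genotype]) + "\n"
    )


print(merge_single_sample_vcfs([toy_vcf("sample_0", "0:2"), toy_vcf("sample_1", "1:1")]))
```

Raising an error on any mismatch keeps the sketch from silently producing a matrix whose rows do not line up across samples.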

Thank you.

Best Answers


  • WimS Member ✭✭

    Hi @slee, thank you for opening the GitHub ticket.

    At this time I am only trying to merge the *intervals.vcf files. This is because it results in a nice square multi-sample CNV genotype matrix of 1 kbp intervals.

    I am not really sure what to do with the single-sample *segments.vcf files.
    The segmentation is probably different for each sample, so it will be difficult to merge them into a nice square multi-sample CNV genotype matrix.

    It makes more sense to me to do the segmentation on the merged multi-sample *intervals.vcf,
    via a GATK tool or (for the time being) a custom in-house script.

    That way the segmentation procedure can at least start from a nice square matrix, and at least create high-confidence reference blocks (CN=2 across multiple adjoining 1 kbp bins) that span all samples in the file.

  • WimS Member ✭✭

    Hi @slee, I can confirm that a known, important copy number variant is picked up by GATK4 gCNV and that the copy numbers it detected mostly match our 'truth' copy numbers.

    Our goal is to detect new copy number variants in the samples/species that we work on in known regions / genes of interest.

    To determine the exact breakpoints and the exact copy number states we use more targeted / sensitive / cost-effective wet-lab methods.

    To discover rough areas of interesting CNV variation, and to discover the rough differences in copy number state, the merged *intervals.vcf already seems to be useful enough.

    The entire GATK4 gCNV pipeline is a big improvement over manual CNV calling in IGV, and (much) more usable / reliable than other gCNV pipelines I tried. Thank you (and colleagues) for this.

  • slee Member, Broadie, Dev ✭✭✭

    Great to hear that, @WimS, thanks for the feedback!
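
The reference-block idea from the first answer (runs of adjoining bins where every sample has CN=2) can be sketched in Python as follows. This is only an illustration under stated assumptions, not a GATK tool: the function name, the `min_bins` threshold, and the toy matrix are hypothetical; bins are assumed to be sorted, with 1-based inclusive coordinates, and CN=2 is assumed to be the reference state (diploid).

```python
def reference_blocks(bins, ref_cn=2, min_bins=2):
    """Find runs of adjoining bins in which every sample has the reference copy number.

    `bins` is an iterable of (chrom, start, end, copy_numbers), one entry per
    row of the merged matrix, where copy_numbers holds one CN value per sample.
    Returns a list of (chrom, start, end, n_bins) tuples.
    """
    blocks = []
    current = None  # [chrom, start, end, n_bins] of the run being extended
    for chrom, start, end, copy_numbers in bins:
        is_ref = all(cn == ref_cn for cn in copy_numbers)
        extends = (
            current is not None
            and current[0] == chrom
            and current[2] + 1 == start  # bin starts right after the run ends
        )
        if is_ref and extends:
            current[2] = end
            current[3] += 1
            continue
        # The run (if any) ends here; keep it only if it is long enough.
        if current is not None and current[3] >= min_bins:
            blocks.append(tuple(current))
        current = [chrom, start, end, 1] if is_ref else None
    if current is not None and current[3] >= min_bins:
        blocks.append(tuple(current))
    return blocks


# Toy matrix with two samples; the third bin breaks the run.
toy_bins = [
    ("Chr_01", 1001, 2000, [2, 2]),
    ("Chr_01", 2001, 3000, [2, 2]),
    ("Chr_01", 3001, 4000, [2, 3]),
    ("Chr_01", 4001, 5000, [2, 2]),
]
print(reference_blocks(toy_bins))
```

With these toy bins only the first two form a block, because the last bin stands alone and falls below `min_bins`.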
