GATK4 CNV, output 3 million CNVs !!!

Hi, there:

I was very excited to know that GATK4 could now generate CNV data from WGS.

However, I have spent a long time trying to make it work, and I don't think it is working yet.

I recently followed all the instructions and managed to generate an output file. But this file has almost 3 million rows. The first 15 rows of the output file are shown below.

I think each person is expected to have ~1,000 CNVs, not ~3 million!!!
I understand that GATK uses a 1KB sliding window to detect CNVs. But then how can I get the ~1,000 CNVs that I could use for downstream analysis?

Your help would be greatly appreciated!

Thank you & best regards,
Jie

Answers

  • cruckert (Germany, Member)

    In your output you should also find a file with intervals merged into larger segments (parameter: --output-genotyped-segments). In this file consecutive 1KB windows with the same copy number are merged into larger segments resulting in far fewer entries.

    Best,
    Christian
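The merging idea described in this reply can be sketched in a few lines of Python. This is only a toy illustration of collapsing adjacent windows that share a copy number. It is not how GATK itself produces the segments file, and the tuple layout `(chrom, start, end, copy_number)` and the function name `merge_windows` are assumptions made for the example.

```python
def merge_windows(windows):
    """Collapse adjacent windows with the same copy number into segments.

    `windows` is a list of (chrom, start, end, copy_number) tuples,
    sorted by chromosome and position, with 1-based inclusive coordinates.
    (Hypothetical layout, chosen for illustration only.)
    """
    segments = []
    for chrom, start, end, cn in windows:
        if segments:
            last_chrom, last_start, last_end, last_cn = segments[-1]
            # Extend the previous segment if this window touches it
            # and carries the same copy number.
            if last_chrom == chrom and last_cn == cn and last_end + 1 == start:
                segments[-1] = (chrom, last_start, end, cn)
                continue
        segments.append((chrom, start, end, cn))
    return segments


# Six 1KB windows collapse into four segments.
example = [
    ("1", 1, 1000, 2),
    ("1", 1001, 2000, 2),
    ("1", 2001, 3000, 1),
    ("1", 3001, 4000, 1),
    ("1", 4001, 5000, 2),
    ("2", 1, 1000, 2),
]
merged = merge_windows(example)
```

Because most of the genome is copy-neutral, long runs of windows fall into a single segment, which is why the segments file is so much smaller than the per-window output.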

  • Thanks, Christian!

    I have now included --output-genotyped-segments. Please see the screenshot below for the first few rows of my output file.

    My output file still has over 60,000 CNVs. I thought that it should be only a few thousand. So, what is the normal range for the number of CNVs generated from GATK4 CNV?

    Also, there is no way to tell Deletion vs. Duplication from my output file. Did I miss something?

    Thank you & best regards,
    Jie

  • akovalsk (Member, Broadie, Moderator, admin)

    Hi @jiehuang001 thanks for your question!

    It looks like the reason you have so many is that the segments VCF includes copy-neutral intervals. Try filtering on GT > 0 to keep only the non-reference segments. We also suggest filtering the calls on the QS score; taking calls with QS > 80 is usually a good starting point.
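A minimal sketch of the suggested filter, in plain Python with no VCF library. It assumes a single-sample, uncompressed segments VCF whose FORMAT column contains `GT` and `QS` keys and whose INFO column carries an `END=` entry; check these against your own file's header before relying on it. The function name `filter_segment_calls` is made up for this example. Since a VCF genotype index points into the ALT column, the sketch also reports the matching ALT allele, which may help tell deletions from duplications if ALT lists symbolic alleles such as `<DEL>,<DUP>` (again an assumption about the file's layout).

```python
import re


def filter_segment_calls(vcf_lines, min_qs=80.0):
    """Keep non-reference calls (GT > 0) with QS above `min_qs`.

    Returns a list of (chrom, start, end, alt_allele, qs) tuples.
    Assumes one sample column and FORMAT keys named GT and QS.
    """
    kept = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip header and blank lines
        fields = line.split("\t")
        chrom, pos, alts = fields[0], int(fields[1]), fields[4].split(",")
        info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
        sample = dict(zip(fields[8].split(":"), fields[9].split(":")))

        # Collect non-reference allele indices from the genotype.
        gt = sample.get("GT", ".")
        nonref = [int(a) for a in re.split(r"[/|]", gt) if a.isdigit() and int(a) > 0]
        if not nonref:
            continue  # copy-neutral (reference) or missing genotype

        qs = sample.get("QS", ".")
        if qs == "." or float(qs) <= min_qs:
            continue  # missing or low segment quality

        end = int(info["END"]) if "END" in info else None
        kept.append((chrom, pos, end, alts[nonref[0] - 1], float(qs)))
    return kept
```

To try it on a real file, pass an open file handle, for example `filter_segment_calls(open("segments.vcf"))`, where the file name is a placeholder; use `gzip.open(path, "rt")` instead if the VCF is compressed. The threshold of 80 is only the starting point suggested above and can be tuned through `min_qs`.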
