Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Germline CNV output

Hi GATK Team,

I am testing the Germline-CNV-Tools. In the tools documentation of PostprocessGermlineCNVCalls it is stated, that "... CNV call is either < DEL > or ...". In my genotyped_segments output file there is "N" as REF and "<DEL,DUP>" as ALT. Here is one example line:

chr2 26414098 CNV_chr2_26414098_179423395 N < DEL>,<DUP> . . END=179423395 GT:CN:NP:QA:QS:QSE:QSS 0:2:376:56:3077:175:267

I used the following parameters:
gatk PostprocessGermlineCNVCalls --calls-shard-path 1446-18_GCNV-calls/ --model-shard-path /media/Berechnungen/AnnotationDBs/CNV/Exom_neu/Exom_GCNV-model/ --sample-index 0 --autosomal-ref-copy-number 2 --allosomal-contig chrX --allosomal-contig chrY --output-genotyped-intervals 1446-18.genotyped_intervals.vcf --output-genotyped-segments 1446-18.genotyped_segments.vcf --contig-ploidy-calls 1446-18_DGCP-calls/

Why is my reference Tag "N" and not 2? Is there any error in my parameters?

Thanks in advance
Stefan

Tagged:

Answers

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭
    edited February 18

    According to VCF 4.1 standards there are only certain characters that could be used as reference and alternative genotypes (Actually HTSJDK is even stricter than those written standards.). Your actual result lies in the FORMAT section of GT and CN.

  • Hi @SkyWarrior ,
    thanks for your fast answer. So according to the example line above my copy number would be 2 which would be normal for a diploid organism. But what does the Genotyping (GT) field tell me in this case?

    Thanks
    Stefan

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    GT section is the genotype and 0 represents homozygous reference. 0 is used for CN 2 only I believe. 1 is used for CN 0,1 and 2 is used for CN 3 and above. I am not quite sure if tool can distinguish between number of copies between different homologues. I would work with CN first to filter out sites of interest.

  • Thanks for your help! I will look for CN to get the relevat CNVs. Did you use the GermlineCNV Tools on exome data? If so, how many CNV events do you see? For me it looks like there are a lot of false positiv calls in the data. Do you use any quality filter on the QA,QS,QSE or QSS fields to get rid of the false positiv calls?

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭
    edited February 18

    Yes I recently started using CNV tools for that. I have been using XHMM and some other tools before that. For filtering false positives I usually include known samples in the cohort pool to find the confirmed events. I also focus on clinically relevant regions designated by ClinDB, DGV and Decipher. There are many regions on the genome that show variable copy number.

  • I build my model with 40 samples from the same WES library prep protocoll sequenced on the same NextSeq in 4 different runs. How many samples do you use to build a model?
    Reducing the regions to clinically relevant ones is a good idea. Do you define this regions in your analysis with the -L parameter or do you use the whole exome to model and call CNVs and focus on clinically relevant data in the result set of PostprocessGermlineCNVCalls?

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    30 is a good number which is also the recommended value.

    I usually reduce the regions after calling so perform calling first then reduce.

  • Okay so I am save with my 40 samples :-)
    May I ask you how many CNVs (del or dup) you normally see in a WES sample raw genotyped_segments.vcf file?

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭
    edited February 18

    Good question.

    In an unfiltered setting I observe around 700 to 800 events in a single sample of which 50 60 of them are actually in clinically relevant regions and of which only a few of them are clinically significant changes and it usually occured to me about 70 in 1300 whole exomes and 800 clinical exomes (AKA glorified gene panels) . Though the final number may not be the final result however I have found around 70 clinically significant and confirmed CNV events in my whole dataset until today.

    I wrote myself a small java app utilizing htsjdk to parse those segmented vcfs to generate a file like this

    From here it gets really easy to filter and search for clinically significant regions.

    The red one is already confirmed and filed for the patient.

  • I am establishing the CNV right now wo I do not have that much data. But in my genotyped_segments.vcf file I have arround total 450 segments and 230 copy number variants. I will try to annotate them and see if I can filter them in a way you did. Do you confirm the CNVs with array data then or just doing qPCR?
    Have you ever thought of including SNP information to support a CNV call. So for example if there is a het deletion (CN=1), there should only be hom SNPs in that region? In a duplicated (CN =3) region there you should see variants poping up in 33% and 66% ratio...

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    Confirmation is usually done via MLPA or qPCR or event regular PCR (especially when the breakpoints are evident.) We can also do PCR confirmation without knowing the breakpoints by using an internal PCR control and a control sample since we know which exons to target.

    I have not tried overlapping SNPs with CNV events though that would be a very nice additional information to process. BAF ratios plotted along with CNV events would give you copy neutral ROHs as well as CNV LOHs.

  • I thought of using H3M2 to analyze ROHs of my WES. Maybe I can somehow combine the information of CNVs from GATK, ROHs from H3M2 and the varinats found by HaplotypeCaller to get a better differentiation of true positive CNVs from false positive.
    I will give it a when I have some time left during the week. If I get something usefull out of it I will let you know ;-)

    Thanks for your help!

  • xinzhengxinzheng Member
    > @SkyWarrior said:
    > GT section is the genotype and 0 represents homozygous reference. 0 is used for CN 2 only I believe. 1 is used for CN 0,1 and 2 is used for CN 3 and above. I am not quite sure if tool can distinguish between number of copies between different homologues. I would work with CN first to filter out sites of interest.

    Hi SkyWarrior :smile: GT section: 0 represents homozygous reference,but when CN is 0 or 1 and GT is 1 , what is GT=1 represents ?
  • xinzhengxinzheng Member
    @SkyWarrior
    In my data , it contains so many data were GT:CN was 1:0, I don't know if it's credible .
  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    GT does not seem too credible. I would go with CN only.

  • sleeslee Member, Broadie, Dev ✭✭✭

    @xinzheng @SkyWarrior along with absolute copy-number calls (CN), gCNV reports the symbolic ALT alleles <DEL>, <DUP>. So GT=1 represents any deletion, which may be either CN=0 or 1.

    @xinzheng if you are getting a lot of questionable deletions in your data, I would try using FilterIntervals to more aggressively filter out poorly covered regions or targets.

  • xinzhengxinzheng Member

    @slee thanks for your suggestion! I will do that step.

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    @slee said:
    @xinzheng @SkyWarrior along with absolute copy-number calls (CN), gCNV reports the symbolic ALT alleles <DEL>, <DUP>. So GT=1 represents any deletion, which may be either CN=0 or 1.

    Thanks @slee. That makes it clearer.

  • xinzhengxinzheng Member

    @slee @SkyWarrior hi all, when I view IGV result, something makes me confused, my cnv results show as below:
    1 13140450 CNV_1_13140450_13140742 N <DEL>,<DUP> . . END=13140742 GT:CN:CNLP:CNQ 1:0:0,110,72,110,110,110:72 1 13141129 CNV_1_13141129_13141583 N <DEL>,<DUP> . . END=13141583 GT:CN:CNLP:CNQ 2:3:108,108,36,0,3,20:3 1 13142563 CNV_1_13142563_13142689 N <DEL>,<DUP> . . END=13142689 GT:CN:CNLP:CNQ 2:3:108,91,27,0,3,21:3 1 13144439 CNV_1_13144439_13145000 N <DEL>,<DUP> . . END=13145000 GT:CN:CNLP:CNQ 2:3:38,14,10,0,4,17:4

    for CNV_1_13140450_13140742:

    • GT is 1 and CN is 0 means DEL and Copy Number is 0, and the igv view like fig1
      fig 1

    for CNV_1_13141129_13141583:

    • GT is 2 and CN is 3, igv result view as fig2
    • Compare the two sites, the total reads count are similar, why the result is different? Can you help me find some problems . Sincere thanks!
      fig2
  • sleeslee Member, Broadie, Dev ✭✭✭

    @xinzheng I think you are looking at the *intervals.vcf, which reports per-bin quantities that may be noisy. You will want to look at the *segments.vcf, which gives Viterbi-smoothed segment-level calls along with various quality scores that may be useful in filtering false positives.

  • xinzhengxinzheng Member

    hi @slee I have another question, when I compare 2 samples results, they have a similar total counts number and distribution(Fig 3). But the reulst is different . Sample GW9B0013B06A(above) result is DUP and CN is 3, but GW9B0013B05A(below) has no result in the VCF.
    So, I guess the reason may be COHORT mode(DetermineGermlineContigPloidy) result is not good.
    Because, I have not enouth samples, I use 14 smaples form one platform and 4 sample from other platform, some bed(WES DATA) regions and depth different. What do you think caused the inconsistency between the results of the two samples?

  • xinzhengxinzheng Member

    @slee I don't why the picture upload failed, I will try it later

  • sleeslee Member, Broadie, Dev ✭✭✭

    @xinzheng it sounds like you might not have enough samples to train a good model. Ideally, all of your samples would be from the same batch, and you would have at least several tens, if not 50-100.

    If you know your samples are from multiple batches, and you have enough samples in each batch, you may want to train different models for each. Alternatively, you could train a single model on multiple batches. This would be more convenient, but some principal components would then be expended to model batch membership (rather than within-batch systematic noise, which is what we want the principal components to model), and the final denoising result might not be ideal.

  • rsinghaniarsinghania Member
    edited April 25
    Hi, I'm trying to understand how in the VCF files the ALT can be "< DEL >,< DUP >". Shouldn't ALT be just one of them or neither? Somehow both the interval and segment VCF files I'm looking at have all positions marked as "< DEL >,< DUP >". Is this a cause for concern? Thanks very much.
    Post edited by rsinghania on
  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
    edited April 26

    Hi @rsinghania

    This question is better suited for www.biostars.org or www.seqanswers.com . Sorry we weren't of more help.

  • sleeslee Member, Broadie, Dev ✭✭✭

    @rsinghania see above---we list all possible ALT alleles, but the call made is specified by GT. So GT=0 is REF, GT=1 is DEL, and GT=2 is DUP. The CN annotation gives the absolute copy number and the qualities give various posterior probabilities associated with this copy-number state.

Sign In or Register to comment.