Calling copy number from WGS data using PON from different platform

Hi, I'm trying to call somatic copy number from WGS data using the workflow laid out in cnv_somatic_copy_ratio_bam_workflow.wdl.
However, I do not have a panel of normals (PoN) for the sample. The data were sequenced on a HiSeq 4000 with 100bp reads. Would it be reasonable to construct a panel of normals from publicly available data (1000 Genomes Project) that were sequenced on a HiSeq 2000 with 90bp reads?

Answers

  • shlee, Cambridge (Member, Broadie)
    edited August 2017

    Hi @pkathail,

    The workflow requires a PoN, so a less-than-ideal PoN is better than no PoN.

    Please see this tutorial I wrote for the June 2017 workshop (in Cambridge and Edinburgh), which compares two different PoNs: https://drive.google.com/open?id=0BzI1CyccGsZiTU9aa1QyX0ZmSlE

    See whether the difference we illustrate is acceptable for your analysis. Note that the tutorial handles WES data, which show more extreme technology artifacts than WGS data when it comes to CNV analysis.

    We expect WGS data to be more even-keeled. However, consider the differences in sequencing technology.

    The HiSeq 4000 and 2000 use different flowcell configurations, i.e. patterned vs. non-patterned flowcells. Data from these two sequencers are preprocessed somewhat differently (at least in our Best Practices), e.g. in the OPTICAL_DUPLICATE_PIXEL_DISTANCE parameter of the MarkDuplicates step. I think the downstream artifacts of such processing differences will depend on the sample prep, the data preprocessing and (here I digress from your question) the reference the data were aligned to, e.g. GRCh38 should give better results than GRCh37. I recommend testing both a PoN built from data from the same sequencer as your sample and your proposed public-data PoN, to make sure that the resolution of your results is what you need and that there are no strange artifacts stemming from flowcell configuration.

  • Thanks for the quick response! Yes, it would be ideal to test out a PoN from the same sequencer as my sample, but so far I haven't been able to find any normal public datasets sequenced using the HiSeq 4000. Do you know if there is public data from the HiSeq 4000 or another patterned flowcell (perhaps HiSeq X Ten) that could be used to construct a more suitable PoN?

  • shlee, Cambridge (Member, Broadie)

    @pkathail, please check with Illumina and also with gnomAD.
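
To make the preprocessing difference mentioned in the first answer concrete, here is a minimal Python sketch of how the OPTICAL_DUPLICATE_PIXEL_DISTANCE value might be chosen per sequencer when assembling a MarkDuplicates command. It is an illustration under stated assumptions, not the Best Practices implementation: the helper names and file names are hypothetical, the set of patterned-flowcell models comes only from the instruments named in this thread, and the values 100 (tool default) and 2500 (suggested for patterned flowcells) follow the Picard MarkDuplicates documentation as far as I know. The `gatk MarkDuplicates` argument style assumes a GATK4 installation; check it against your version.

```python
import shlex

# Assumption: instrument models with patterned flowcells. This thread names the
# HiSeq 4000 and HiSeq X Ten as patterned and the HiSeq 2000 as non-patterned;
# extend the set for other instruments as needed.
PATTERNED_MODELS = {"hiseq 4000", "hiseq x ten"}

# 100 is the tool default (suited to non-patterned flowcells); 2500 is the
# value suggested for patterned flowcells in the Picard documentation.
DEFAULT_PIXEL_DISTANCE = 100
PATTERNED_PIXEL_DISTANCE = 2500


def optical_duplicate_pixel_distance(instrument_model):
    """Return the pixel distance for a sequencer model (hypothetical helper)."""
    normalized = " ".join(instrument_model.lower().split())
    if normalized in PATTERNED_MODELS:
        return PATTERNED_PIXEL_DISTANCE
    return DEFAULT_PIXEL_DISTANCE


def mark_duplicates_command(input_bam, output_bam, metrics_file, instrument_model):
    """Build (but do not run) a MarkDuplicates command as an argument list."""
    distance = optical_duplicate_pixel_distance(instrument_model)
    return [
        "gatk", "MarkDuplicates",
        "-I", input_bam,
        "-O", output_bam,
        "-M", metrics_file,
        "--OPTICAL_DUPLICATE_PIXEL_DISTANCE", str(distance),
    ]


if __name__ == "__main__":
    # Placeholder file names; the commands are printed rather than executed.
    for model in ("HiSeq 4000", "HiSeq 2000"):
        cmd = mark_duplicates_command(
            "sample.bam", "sample.dedup.bam", "sample.metrics.txt", model
        )
        print(shlex.join(cmd))
```

Building the command as a list and only printing it keeps the sketch safe to run without GATK installed; the list could be passed to `subprocess.run` once the paths and arguments are confirmed for your setup.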
