Dream challenge PON


The Supplementary Data for the Dream challenge (http://www.nature.com/nmeth/journal/v12/n7/full/nmeth.3407.html) indicates that a panel of normals filter was used for the Broad's Mutect submission:

"Panel of normals filter:
matlab: survey_panel_of_normals_for_mutations(maf_file, bamfile_list.txt) where bamfile_list.txt contains 258 normal whole genome bam files to use as an additional panel of normals."

Is this dataset available to the public?

Thank you.

Best Answer


  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @Imose,

    Can you point me to the exact Supplementary section that mentions the PON? If the PON was derived from TCGA data, then unfortunately it cannot be shared unless you have eRA Commons permissions for TCGA data. If it was derived from 1000 Genomes Project data, I believe it can be shared, and I can see if I can track it down for you.

    That being said, I would recommend constructing your own panel of normals from your own normal samples. I believe PONs derived from larger datasets are generally better, so if you need to pad your normal sample set, you should at least use samples prepared in the same manner (prep and tool-chain) and sequenced by the same center as your samples. The PON is meant to capture sequencing artifacts, which may differ between centers, sample preps and tool-chains, so matching its constituents closely to the provenance of your samples is ideal.

    See this document for how to create your PON. Be sure to search our forum also if you have questions as there are multiple threads that help people with their PON creation.

  • Hi,

    Please see SupplementaryTable_3/SuppTable_BroadSMC/IS3/* in Supplementary Data 1 from the article in the original post.

    Thanks for the tip on PON, and will definitely keep in mind that it should be technology / prep specific. For now, we were hoping to reproduce the Mutect results on the Dream challenge dataset, thus the request for the specific filters used in the challenge.


  • shleeshlee CambridgeMember, Broadie, Moderator

    Ok, thanks for the pointer. The PON is named wgs_hg19_125_cancer_blood_normal_panel.vcf. So it looks like it is derived from cancer patient matched normal samples, i.e. TCGA and perhaps other protected data. This may be a wild goose chase, but I can't help but think this is exactly the type of resource that may be made available in FireCloud. At the least, the FireCloud site has a document that outlines how you can obtain TCGA data and dbGaP authorization. If the PON was created using solely TCGA data, then once you have permissions you should be able to gain access, assuming it is available; otherwise you will need to ask someone from Broad CGA. If the PON incorporates other protected data, e.g. ICGC data, then the sharing situation may be more complicated.

  • shleeshlee CambridgeMember, Broadie, Moderator

    @Imose, I've just confirmed that this PON is unavailable on FireCloud. Apparently, we are not allowed to redistribute this data. You may be able to recreate a similar PON with TCGA data, if you have permissions, that is.

  • OK, thanks. The TCGA data is mostly exomes and RNA-seq whereas the Dream challenge was WGS. Was there a specific set of samples used to create the PON? All normal WGS from TCGA?

  • shleeshlee CambridgeMember, Broadie, Moderator

    @Imose, I'm told this particular PON was created using the methods outlined in the MuTect publication. Can you check their methods?

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @Imose,

    I just wanted to let you know that despite my answer above, I'm working on your behalf to make this particular PON available. In the meanwhile, I have some information for you that may be of interest in case my efforts to provide you the actual PON are unfruitful.

    From talking to folks here in the know, it appears that creating a different PON and using it in place of this one should still yield similar results for your DREAM challenge recapitulation. This is because PONs capture common (n>=2) artifacts of sequencing and of tool-chains.

    So let me tell you what I learned about the data underlying wgs_hg19_125_cancer_blood_normal_panel.vcf. This PON was created with 125 whole genome samples derived using 2012 technology. The sample libraries were from the blood normal tissue of cancer patients. We do NOT use matched normal tissue samples, as these can be contaminated with tumor/pre-tumor tissue because they are typically derived from tissue adjacent to the tumor. The samples are deep coverage, ~30x, aligned to hg19. The libraries were paired-end with reads of approximately 101 bp (2x101).

    I'll let you know if I make any progress in getting you the actual VCF. In the meanwhile, I hope this is enough information to enable you to progress in your research aims.
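
For anyone building a substitute PON, here is a minimal Python sketch of the "common (n>=2)" idea from the reply above. It assumes n>=2 means a site was called in at least two normal samples, and it works on plain tuples rather than real VCF records. The function names `build_pon` and `filter_calls` and the example sites are hypothetical, made up for illustration; this is not the MuTect or GATK implementation.

```python
from collections import Counter


def build_pon(per_normal_calls, min_samples=2):
    """Return the set of sites called in at least `min_samples` normals.

    `per_normal_calls` is a list with one entry per normal sample; each
    entry is an iterable of (chrom, pos, ref, alt) tuples.
    Assumption: "n>=2" is read as "seen in two or more normals".
    """
    counts = Counter()
    for calls in per_normal_calls:
        # Convert to a set so a site counts at most once per normal.
        counts.update(set(calls))
    return {site for site, n in counts.items() if n >= min_samples}


def filter_calls(tumor_calls, pon):
    """Drop tumor calls whose site is present in the panel of normals."""
    return [site for site in tumor_calls if site not in pon]


# Toy data: three normals, one site shared by two of them.
normals = [
    [("chr1", 1000, "A", "T"), ("chr2", 500, "G", "C")],
    [("chr1", 1000, "A", "T")],
    [("chr3", 42, "C", "A")],
]
pon = build_pon(normals)

tumor = [("chr1", 1000, "A", "T"), ("chr7", 777, "G", "A")]
kept = filter_calls(tumor, pon)
```

Here only the chr1 site ends up in the panel, so the chr1 tumor call is removed as a likely recurrent artifact and the chr7 call is kept. A real workflow would parse VCFs and normalize alleles before comparing sites.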
