Questions about datasets for benchmarking

Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)
This discussion was created from comments split from: Which datasets should I use for reviewing or benchmarking purposes?.

Comments

  • angel (Member)
    edited December 2012

    Hi,
    From the 1000 Genomes DCC link above, what is the difference between the set of NA12878 BAM files split by chromosome and the set containing just a single file?

    By chromosome:

    CEUTrio.HiSeq.WGS.jaffe.b37_decoy.NA12878.chr{1}.clean.dedup.recal.20120117.bam
    

    Single file:

    CEUTrio.HiSeq.WGS.b37_decoy.NA12878.clean.dedup.recal.20120117.bam
    

    Thanks,
    Angel

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi Angel,

    That's a question for the curators of the 1000 Genomes DCC website -- they should be able to tell you how the different files they offer for download differ.

  • egafni (Member)

    Should we be using Picard's RevertSam before SamToFastq when trying to convert the bam to fastq? Thanks!

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    That's a question for the Picard dev team -- they will be able to better tell you what their tools do.

  • charade (Member)

    Hi. I was wondering if any FASTQ sequence files have been released for the datasets posted here, preferably with sequence.index files. Thanks.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi @charade,

    I believe these are all provided as bams, but if you want you can revert them to fastq using Picard tools.

  • charade (Member)

    Well, that I know. But I need the meta information for the sequence data, such as how many submissions there are and whether they come from different libraries with different insert sizes. Could you refer me to anyone who might know the details? Thanks.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Any meta information available on these datasets should be provided on the 1000 Genomes project web page; I suggest you ask there.

  • SiyangLiu (Member)

    The decoy bam doesn't seem to contain reads from the fragment library (correct me if I am wrong). Do you know where I can find the bam file for data from the fragment library? Thank you.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    I'm not sure what you're referring to. For questions about obtaining datasets from the 1000 Genomes project, please ask on the 1000 Genomes project website.

  • bettaberg (Italy; Member)

    Hi, I downloaded the file CEUTrio.HiSeq.WEx.b37_decoy.NA12878.clean.dedup.recal.20120117.bam from the link above, and using samtools mpileup I found that the great majority of positions in the genome have very low coverage (0, 1 or 2 reads).
    I also computed the average coverage as the sum of the number of reads mapped to each position of the genome divided by the total number of positions, and I obtained ~5, while I expected ~150 as stated in the table above. Have I misunderstood the meaning of ~150x coverage? Sorry if the question is silly, but I can't understand why my result is so different from what I expected.
    Thank you,
    Elisabetta

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi Elisabetta,

    To clarify, that dataset is an exome, not a whole genome, so you should compute the number of positions based on the exome capture intervals. Otherwise you're averaging covered regions with regions that have zero coverage by design.
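
The difference between the two averages can be illustrated with a small sketch. It uses a made-up 1,000-position contig and two hypothetical target intervals (not the real Broad exome targets), and assumes per-position depths are already available, for example parsed from samtools output. The function name `mean_depth` is purely illustrative.

```python
def mean_depth(depth_by_pos, intervals):
    """Mean depth over 1-based, inclusive (start, end) intervals on one contig.

    Positions missing from depth_by_pos count as depth 0.
    """
    total_bases = 0
    total_depth = 0
    for start, end in intervals:
        for pos in range(start, end + 1):
            total_depth += depth_by_pos.get(pos, 0)
            total_bases += 1
    return total_depth / total_bases if total_bases else 0.0


# Hypothetical toy data: 35 target bases, each covered at 150x; nothing else covered.
targets = [(101, 120), (501, 515)]
depth = {pos: 150 for start, end in targets for pos in range(start, end + 1)}

genome_wide = mean_depth(depth, [(1, 1000)])  # averages in the uncovered bases
on_target = mean_depth(depth, targets)        # averages over targets only

print(f"genome-wide mean: {genome_wide:.2f}")  # 5.25
print(f"on-target mean:   {on_target:.2f}")    # 150.00
```

In this toy case only 3.5% of positions are targeted, so dividing by the whole contig length shrinks a 150x on-target mean to about 5x, which mirrors the numbers reported in the question.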

  • bettaberg (Italy; Member)

    Ok, thank you. Could you please tell me where I can find the coordinates of these intervals?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi again and sorry for the response delay. Until now we did not provide the exome target list for this dataset, but we will now make it public via our resource bundle. It may take another day or two as we arrange final details, but it should be available in the bundle by the end of the week.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    And it's a little later than promised, but I have good news: the Broad exome target list is now available in our resource bundle on our FTP. We hope this will be useful for testing purposes.

  • fiapinto (Portugal; Member)

    Hi,
    I would like to know whether VCF files are available for WGS and WEx of the samples NA12891 and NA12892, similar to what exists for the sample NA12878 on the NCBI site.
    Thank you,
    Sofia

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @fiapinto
    Hi Sofia,

    We do not offer those in our bundle. But, I think you can find them on the Illumina website.

    -Sheila

  • iceandfire

    I am just wondering where I can download Gnerre's data, as used in the ALLPATHS-LG paper, "High-quality draft assemblies of mammalian genomes from massively parallel sequence data"?

    Thanks.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Sorry @iceandfire, I have no idea. Have you tried asking the corresponding author?

  • mglclinical (USA; Member)

    Hi @Geraldine_VdAuwera,

    I want to see how my exome pipeline (bwa + GATK HaplotypeCaller) performs on an external reference dataset (NA12878), so I downloaded the BAM file CEUTrio.HiSeq.WEx.b37_decoy.NA12878.clean.dedup.recal.20120117.bam from this FTP link (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20120117_ceu_trio_b37_decoy/). I will extract FASTQ files from this BAM file.

    The comments above mention that the exome target list has been uploaded to the resource bundle, so I assume the exome target list file you refer to is "Broad.human.exome.b37.interval_list.gz". Could you please confirm whether I am using the correct exome intervals file?

    In the downloaded file (Broad.human.exome.b37.interval_list.gz), I noticed that the intervals are annotated as target_1, target_2, target_3, and so on in the 5th column, but some of the intervals are annotated as "new_exome_1.1_content". Should these intervals be treated differently?

    Thanks,
    mglclinical
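
For readers who want to inspect the 5th-column names themselves, here is a minimal sketch of tallying an interval list. It assumes the standard Picard interval_list layout (SAM-style header lines starting with `@`, then tab-separated contig, 1-based inclusive start and end, strand, and name). The sample lines are invented for illustration and are not taken from the Broad file, and the helper names are hypothetical.

```python
from collections import Counter


def read_interval_list(lines):
    """Parse interval_list lines into (contig, start, end, strand, name) tuples."""
    intervals = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("@"):
            continue  # skip blank lines and the SAM-style header
        contig, start, end, strand, name = line.split("\t")[:5]
        intervals.append((contig, int(start), int(end), strand, name))
    return intervals


def name_class(name):
    """Collapse target_1, target_2, ... into one class; keep other names as-is."""
    return "target_N" if name.startswith("target_") else name


# Invented example lines (assumption: not real coordinates from the bundle file).
sample = [
    "@HD\tVN:1.4\tSO:coordinate",
    "@SQ\tSN:1\tLN:249250621",
    "1\t100\t199\t+\ttarget_1",
    "1\t300\t349\t+\ttarget_2",
    "1\t500\t519\t+\tnew_exome_1.1_content",
]

intervals = read_interval_list(sample)
# Coordinates are 1-based and inclusive, hence the +1.
territory = sum(end - start + 1 for _, start, end, _, _ in intervals)
by_class = Counter(name_class(name) for *_, name in intervals)

print("total target bases:", territory)
print("intervals per name class:", dict(by_class))
```

To run this on the downloaded file, the lines could be read with `gzip.open(path, "rt")` from the standard library; the total target territory is also the denominator needed for an on-target mean coverage.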

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @mglclinical
    Hi mglclinical,

    You are using the correct intervals file, and I don't think you need to treat those targets differently (but let me confirm with the team and get back to you).

    -Sheila

  • mglclinical (USA; Member)

    Hi @Sheila and @shlee,

    Thank you for the answer. My follow-up question is:

    I am following Tutorial#6483 (https://gatkforums.broadinstitute.org/gatk/discussion/6483) to extract FASTQ files from the downloaded BAM file (CEUTrio.HiSeq.WEx.b37_decoy.NA12878.clean.dedup.recal.20120117.bam).

    I have already performed the bamshuf and RevertSam steps. Now I am trying to do the MarkIlluminaAdapters step. Tutorial#6483 says that the default standard Illumina adapter sequences can be adjusted to any adapter sequence using the FIVE_PRIME_ADAPTER and THREE_PRIME_ADAPTER parameters.

    So, I would like to know if there are any specific adapter sequences (5 prime and 3 prime) that need to be given as inputs for this specific BAM file.

    Or should I just let MarkIlluminaAdapters use its default Illumina adapter sequences?

  • shlee (Cambridge; Member, Broadie)
    edited August 2017

    Hi @mglclinical,

    You are downloading a WES BAM from NCBI, whose directory structure indicates the file originates from the 1000 Genomes Project. The file is CEUTrio.HiSeq.WEx.b37_decoy.NA12878.clean.dedup.recal.20120117.bam and you download it from ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20120117_ceu_trio_b37_decoy/. Your questions are:

    [1]

    I am wondering if I am using the correct exome intervals file or not, Please confirm ?

    Here you refer to an exome targets file Broad.human.exome.b37.interval_list.gz from our resource bundle.

    [2]

    I would like to know if there are any specific adapter sequences(5 prime and 3 prime) that needs to be given as inputs for this specific BAM file ?

    I suggest you check the header of the BAM for information regarding the origin of the data, e.g. exome target list and preprocessing pipeline (@PG header lines). Depending on what the preprocessing pipeline included, you could check whether any program-specific tags remain, e.g. XT for MarkIlluminaAdapters, that could clue you in to which adapter kit was used. You can also run some of the data through MarkIlluminaAdapters with the default settings, which assume Illumina-specific adapters, and see if the metrics look reasonable. Is there any reason for you to believe this sample was processed with other types of adapters?

    As for the exome target file in our bundle, I'm afraid I do not know anything about it. We'll see if others on the team might know. I think it would be better for you to check with NCBI or 1KGP for the correct exome target list for this sample.
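
As a rough sketch of the header check suggested above, the following parses @PG lines from header text, such as the output of `samtools view -H`, and looks for an integer-typed XT tag on a SAM record. The header and record shown are invented examples, not the contents of the NA12878 BAM, and the function names are hypothetical. One caveat, stated as an assumption to verify: `bwa aln` also writes an XT tag, of character type `A`, so the sketch matches only `XT:i:`.

```python
def program_records(header_text):
    """Return one dict of tag -> value for each @PG line in SAM header text."""
    records = []
    for line in header_text.splitlines():
        if not line.startswith("@PG"):
            continue
        fields = {}
        for field in line.split("\t")[1:]:
            # Split on the first colon only; CL values may contain colons.
            tag, _, value = field.partition(":")
            fields[tag] = value
        records.append(fields)
    return records


def has_int_tag(sam_record, tag):
    """True if the record carries an integer-typed optional field with this tag."""
    optional_fields = sam_record.split("\t")[11:]  # optional fields start at column 12
    return any(f.startswith(tag + ":i:") for f in optional_fields)


# Invented header and record for illustration only.
header = (
    "@HD\tVN:1.4\tSO:coordinate\n"
    "@SQ\tSN:1\tLN:249250621\n"
    "@PG\tID:bwa\tPN:bwa\tVN:0.5.9\tCL:bwa aln ref.fasta reads.fq\n"
    "@PG\tID:MarkDuplicates\tPN:MarkDuplicates\n"
)
record = "read1\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tFFFF\tXT:i:3"

for pg in program_records(header):
    print(pg.get("ID"), pg.get("VN", "(no version)"), pg.get("CL", "(no command line)"))

print("adapter-marking XT tag present:", has_int_tag(record, "XT"))
```

The @PG records only show which programs touched the file; whether they reveal the adapter kit depends on what the original pipeline recorded.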
