How to generate a BAM file from paired-end FASTQ files that is compatible with CollectInsertSizeMetrics

I have two FASTQ files (the first and second reads of paired-end data, in separate .fastq files). I want to convert them to a BAM file so that I can use the Picard tool CollectInsertSizeMetrics. However, I am unable to use the tool, as I get the following messages:

#[Mon Oct 02 18:12:59 GMT 2017] CollectInsertSizeMetrics HISTOGRAM_FILE=insert_size_histogram.pdf INPUT=input.bam OUTPUT=output_insert_size_metrics.txt DEVIATIONS=10.0 MINIMUM_PCT=0.05 METRIC_ACCUMULATION_LEVEL=[ALL_READS] INCLUDE_DUPLICATES=false ASSUME_SORTED=true STOP_AFTER=0 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
WARNING 2017-10-02 18:12:59 SinglePassSamProgram File reports sort order 'queryname', assuming it's coordinate sorted anyway.
WARNING 2017-10-02 18:12:59 CollectInsertSizeMetrics All data categories were discarded because they contained < 0.05 of the total aligned paired data.
WARNING 2017-10-02 18:12:59 CollectInsertSizeMetrics Total mapped pairs in all categories: 0.0
[Mon Oct 02 18:12:59 GMT 2017] picard.analysis.CollectInsertSizeMetrics done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=126877696

Is the BAM file I generated with the FastqToSam tool incompatible?
If so, how should I correct it?

What SORT_ORDER should I use?

Why does the warning "CollectInsertSizeMetrics All data categories were discarded because they contained < 0.05 of the total aligned paired data." appear?

What is meant by "CollectInsertSizeMetrics Total mapped pairs in all categories: 0.0"?

Additionally, what attributes must a BAM file have to be compatible with the Picard tool CollectInsertSizeMetrics? Do I need to change the SORT_ORDER when creating the BAM with FastqToSam from the two FASTQ files (first and second reads of paired-end data)? Which of the two, "queryname" or "coordinate", is correct for the task I want to perform?

Please help.

Answers

  • SkyWarrior (Turkey, Member ✭✭✭)

    I think you need to map the reads to a reference before collecting any insert size metrics.
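
    To illustrate why mapping matters here, the following Python sketch gathers insert sizes from the TLEN column of SAM records. It is a simplified illustration under stated assumptions, not Picard's actual implementation: the function name insert_sizes, the read names and the contig name contig_1 are all hypothetical. Unaligned paired reads carry the "unmapped" flag bits and a TLEN of 0, so they contribute no pairs at all, which is consistent with the "Total mapped pairs in all categories: 0.0" warning.

```python
import statistics

# SAM flag bits (from the SAM specification)
PAIRED, UNMAPPED, MATE_UNMAPPED = 0x1, 0x4, 0x8
SECONDARY, SUPPLEMENTARY = 0x100, 0x800


def insert_sizes(sam_lines):
    """Return TLEN once per pair for mapped, paired, primary alignments."""
    sizes = []
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        rname, rnext, tlen = fields[2], fields[6], int(fields[8])
        if not flag & PAIRED:
            continue
        if flag & (UNMAPPED | MATE_UNMAPPED | SECONDARY | SUPPLEMENTARY):
            continue
        if rnext not in ("=", rname):
            continue  # mates on different contigs have no usable insert size
        if tlen > 0:  # the leftmost read of a pair has a positive TLEN
            sizes.append(tlen)
    return sizes


# Hypothetical records: an unaligned pair, with the flags 77 and 141
# that unmapped paired reads typically carry.
unaligned = [
    "r1\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r1\t141\t*\t0\t0\t*\t*\t0\t0\tTGCA\tIIII",
]

# Hypothetical records: two pairs aligned to one contig.
aligned = [
    "r1\t99\tcontig_1\t100\t60\t4M\t=\t296\t200\tACGT\tIIII",
    "r1\t147\tcontig_1\t296\t60\t4M\t=\t100\t-200\tTGCA\tIIII",
    "r2\t99\tcontig_1\t500\t60\t4M\t=\t706\t210\tACGT\tIIII",
    "r2\t147\tcontig_1\t706\t60\t4M\t=\t500\t-210\tTGCA\tIIII",
]

print(len(insert_sizes(unaligned)))             # 0 pairs
print(statistics.median(insert_sizes(aligned)))  # 205.0
```

    Only the aligned records yield usable values, so a BAM produced directly from FASTQ files has nothing for an insert size calculation to work with.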

  • Dear @SkyWarrior,

    I am actually trying to do a de novo assembly, and as a preprocessing step I want to collect the insert size metrics and jump metrics for the FRAG_INSERT (paired-end), JUMP_READS (mate-pair) and LONG_JUMP_READS (mate-pair). This is only the first step; I expect two more similar steps to be necessary, one for each of the two types of mate-pair reads.

    After that, I plan to create two CSV files, in_libs.csv and in_groups.csv, and use ALLPATHS-LG (The ALLPATHS-LG manual r). Since it is de novo, I don't think I need, or rather I don't have, a reference. I might be absolutely wrong, but I don't know how else to proceed, since I am stuck at the preprocessing step of generating the ALLPATHS-LG input files.

    Please let me know if I am headed in the wrong direction, and how I should proceed from here.

  • SkyWarrior (Turkey, Member ✭✭✭)

    If you are trying to do de novo assembly, there are tools available that will guesstimate an appropriate insert size for proper contig/scaffold formation. AFAIK SPAdes does that to an extent (if you are working with microbial/prokaryotic genomes). SPAdes also has an alignment fix step that uses BWA to map reads to contigs to generate corrected genomes. But I don't think people here can help you with de novo assemblers.

    Unless you know exactly how your library was prepared before sequencing and de novo assembly, it is impossible to find out your insert size. To my knowledge, the average Nextera insert size is between 200 and 1000 bp unless you perform size selection by Bioanalyzer or gel electrophoresis.

    If none of these apply to you, my best suggestion would be to try a range of insert sizes from 200 to 1000 and check for proper scaffold formation after de novo assembly.

  • Dear @SkyWarrior

    Thank you for the suggestions, but my genome is very large (reptile). I will try the tools you suggested.

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @SkyWarrior
    Hi,

    Thanks for the suggestions!

    @bio_d
    Hi,

    It is true we do not work with de novo assembly much on my team. Most of the Picard tools take in an aligned BAM, so your unaligned SAM file is not going to work. Is there a reason you are against aligning then running QC?

    Thanks,
    Sheila

  • @Sheila
    I am not against aligning. The problem is that I can't figure out how to align when, as I understand it, no reference genome is available for a de novo assembly.

  • Hi,

    I am still stuck on collecting the insert size metrics. I followed the suggestions you (@SkyWarrior and @Sheila) gave and did the following.

    I used an assembly of contigs (from CLC Workbench) as the reference genome to align a paired-end library with a predicted insert size of 200 bp (information from the colleague who did the experiments), and used BBMap to create an aligned file (AlignedPairedEnd_7.sam.gz). However, when I run

    java -jar ./picard.jar CollectInsertSizeMetrics I=AlignedPairedEnd_7.sam.gz O=PEinsertsizemetrics.txt H=PEinsertsizemetrics.pdf M=0.5

    the job throws a lot of errors and stops. I can't figure out what the warning "SinglePassSamProgram File reports sort order 'unsorted', assuming it's coordinate sorted anyway." means. Could you please suggest a way forward?

    11:56:29.480 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/user1/DE_NOVO_ASSEMBLY/picard.jar!/com/intel/gkl/native/libgkl_compression.so
    11:56:29.481 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/user1/DE_NOVO_ASSEMBLY/picard.jar!/com/intel/gkl/native/libgkl_compression.so
    [Thu Oct 19 11:56:29 CDT 2017] CollectInsertSizeMetrics HISTOGRAM_FILE=PEinsertsizemetrics.pdf MINIMUM_PCT=0.5 INPUT=AlignedPairedEnd_7.sam.gz OUTPUT=PEinsertsizemetrics.txt DEVIATIONS=10.0 METRIC_ACCUMULATION_LEVEL=[ALL_READS] INCLUDE_DUPLICATES=false ASSUME_SORTED=true STOP_AFTER=0 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false

    [Thu Oct 19 11:56:29 CDT 2017] Executing as [email protected] on Linux 3.10.0-514.26.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_144-b01; Deflater: Intel; Inflater: Intel; Picard version: 2.12.1-SNAPSHOT
    WARNING 2017-10-19 11:56:39 SinglePassSamProgram File reports sort order 'unsorted', assuming it's coordinate sorted anyway.
    INFO 2017-10-19 11:56:39 RExecutor Executing R script via command: Rscript /tmp/script70690755969894718.R /home/user1/DE_NOVO_ASSEMBLY/PEinsertsizemetrics.txt /home/user1/DE_NOVO_ASSEMBLY/PEinsertsizemetrics.pdf AlignedPairedEnd_7.sam.gz
    [[email protected] DE_NOVO_ASSEMBLY]$ cat PEinsertsizemetrics_err.txt
    11:56:29.480 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/user1/DE_NOVO_ASSEMBLY/picard.jar!/com/intel/gkl/native/libgkl_compression.so
    11:56:29.481 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/user1/DE_NOVO_ASSEMBLY/picard.jar!/com/intel/gkl/native/libgkl_compression.so
    [Thu Oct 19 11:56:29 CDT 2017] CollectInsertSizeMetrics HISTOGRAM_FILE=PEinsertsizemetrics.pdf MINIMUM_PCT=0.5 INPUT=AlignedPairedEnd_7.sam.gz OUTPUT=PEinsertsizemetrics.txt DEVIATIONS=10.0 METRIC_ACCUMULATION_LEVEL=[ALL_READS] INCLUDE_DUPLICATES=false ASSUME_SORTED=true STOP_AFTER=0 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
    [Thu Oct 19 11:56:29 CDT 2017] Executing as [email protected] on Linux 3.10.0-514.26.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_144-b01; Deflater: Intel; Inflater: Intel; Picard version: 2.12.1-SNAPSHOT
    WARNING 2017-10-19 11:56:39 SinglePassSamProgram File reports sort order 'unsorted', assuming it's coordinate sorted anyway.
    INFO 2017-10-19 11:56:39 RExecutor Executing R script via command: Rscript /tmp/script70690755969894718.R /home/user1/DE_NOVO_ASSEMBLY/PEinsertsizemetrics.txt /home/user1/DE_NOVO_ASSEMBLY/PEinsertsizemetrics.pdf AlignedPairedEnd_7.sam.gz
    [Thu Oct 19 11:56:39 CDT 2017] picard.analysis.CollectInsertSizeMetrics done. Elapsed time: 0.17 minutes.
    Runtime.totalMemory()=6565134336
    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
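
    On the sort-order warning in the log above: the sort order it reports is the SO value declared on the @HD header line of the SAM/BAM file, and the logged option ASSUME_SORTED=true shows that the tool was told to proceed as if the file were coordinate sorted regardless. The Python sketch below is an illustration only, not part of Picard; the function name header_sort_order and the sample header lines are hypothetical. It reads the declared value from SAM text so it can be checked before running a metrics tool.

```python
def header_sort_order(sam_lines):
    """Return the SO value from the @HD header line, or 'unknown' if absent."""
    for line in sam_lines:
        if not line.startswith("@"):
            break  # header lines come first; stop at the first alignment record
        if line.startswith("@HD"):
            for field in line.rstrip("\n").split("\t")[1:]:
                tag, _, value = field.partition(":")
                if tag == "SO":
                    return value
    # The SAM specification treats a missing SO tag as 'unknown'.
    return "unknown"


# Hypothetical header, similar to what an aligner writing unsorted output emits.
example_header = [
    "@HD\tVN:1.4\tSO:unsorted",
    "@SQ\tSN:contig_1\tLN:1000",
]

print(header_sort_order(example_header))  # unsorted
```

    A file whose header reports a value other than coordinate can be sorted by coordinate first (for example with Picard SortSam or samtools sort) so that the declared order matches what the tool assumes.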

  • Begali (Germany, Member ✭✭)
    edited July 2018

    @SkyWarrior
    Hi,
    How long should it take to generate a SAM file with this command?
    bwa mem Homo_sapiens_assembly38.fasta /home/pathology/Desktop/My_Data/after/new109-17.fq -> new109-17.sam
    [M::bwa_idx_load_from_disk] read 0 ALT contigs

    The machine has about 33 GB of RAM.
    Is it normal that after about 40 minutes it is still at the first line?

    My FASTQ was generated by NGS sequencing of cfDNA from pancreatic cyst fluid.

    Thanks in advance

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @Begali
    Hi,

    Did the aligner run to completion, or are you still having problems?

    -Sheila

  • Begali (Germany, Member ✭✭)

    @Sheila

    Thanks for your reply. The problem is solved; the command suddenly worked.

    -Begali
