Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

(How to) Map and clean up short read sequence data efficiently

2»

Comments

  • macko.tipormacko.tipor HungaryMember

    I didn't create any log-file intentionally. If it is created automatically, where can it be found?
    Anyway, I can invoke the pipe again and save a log-file. Do you mean by it the info shown in the terminal while the command is running? If not, what shall I do to save one?
    One piece of the info shown in the terminal is the following:

    INFO    2018-08-16 15:56:10 SortingCollection   Creating merging iterator from 32 files
    INFO    2018-08-16 15:58:53 AbstractAlignmentMerger Written in coordinate order to output    10,000,000 records.  Elapsed time: 00:02:42s.  Time for last 10,000,000:  161s.  Last read position: 14:81,609,540
    INFO    2018-08-16 16:00:24 AbstractAlignmentMerger Wrote 15475840 alignment records and 0 unmapped reads.
    [Thu Aug 16 16:00:24 CEST 2018] picard.sam.MergeBamAlignment done. Elapsed time: 26.48 minutes.
    

    Shall I post the whole standard-out?

  • yfarjounyfarjoun ✭✭✭ Broad InstituteDev ✭✭✭

    I'd like to see the first few lines of that output please.

  • macko.tipormacko.tipor HungaryMember

    In the experiment running now:

    The sort order of the input file:

    [email protected]:~/NextSeq_Runs/Evaluations/QIASeq/Results/QIASeq_01/RUN_analysis/Picard/uBAM/MergedFromLanes/Downsampled$ samtools view -H 559T_S6.downsampled.bam | grep '@HD' 
    @HD VN:1.5  GO:none SO:queryname
    

    The start and end of the info shown in the terminal:

    15:33:55.641 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/vaszko/Softwares/picard-2.18.11/picard.jar!/com/intel/gkl/native/libgkl_compression.so
    [Thu Aug 16 15:33:55 CEST 2018] SamToFastq INPUT=/home/vaszko/NextSeq_Runs/Evaluations/QIASeq/Results/QIASeq_01/RUN_analysis/Picard/uBAM/MergedFromLanes/Downsampled/559T_S6.downsampled.bam FASTQ=/dev/stdout INTERLEAVE=true CLIPPING_ATTRIBUTE=XT CLIPPING_ACTION=2 TMP_DIR=[/home/vaszko/tempstore]  OUTPUT_PER_RG=false COMPRESS_OUTPUTS_PER_RG=false RG_TAG=PU RE_REVERSE=true INCLUDE_NON_PF_READS=false CLIPPING_MIN_LENGTH=0 READ1_TRIM=0 READ2_TRIM=0 INCLUDE_NON_PRIMARY_ALIGNMENTS=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
    [Thu Aug 16 15:33:55 CEST 2018] Executing as [email protected] on Linux 4.15.0-30-generic amd64; Java  HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13; Deflater: Intel; Inflater: Intel; Provider GCS is not available;  Picard version: 2.18.11-SNAPSHOT
    15:33:55.950 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/vaszko /Softwares/picard-2.18.11/picard.jar!/com/intel/gkl/native/libgkl_compression.so
    [Thu Aug 16 15:33:56 CEST 2018] MergeBamAlignment UNMAPPED_BAM=/home/vaszko/NextSeq_Runs/Evaluations/QIASeq/Results/QIASeq_01/RUN_analysis/Picard/uBAM/MergedFromLanes/Downsampled/559T_S6.downsampled.bam ALIGNED_BAM=[/dev/stdin] OUTPUT=/home/vaszko/NextSeq_Runs/Evaluations/QIASeq/Results/QIASeq_01/RUN_analysis/Picard/BAM/Merged/559T_S6.merged.bam ALIGNED_READS_ONLY=true MAX_INSERTIONS_OR_DELETIONS=-1 ADD_MATE_CIGAR=true CREATE_INDEX=false REFERENCE_SEQUENCE=/home/vaszko/Refseqs/b37_2013-12-08/human_g1k_v37.fasta  ADD_PG_TAG_TO_READS=true PAIRED_RUN=true CLIP_ADAPTERS=true IS_BISULFITE_SEQUENCE=false ATTRIBUTES_TO_REVERSE=[OQ, U2] ATTRIBUTES_TO_REVERSE_COMPLEMENT=[E2, SQ] READ1_TRIM=0 READ2_TRIM=0 ALIGNER_PROPER_PAIR_FLAGS=false SORT_ORDER=coordinate PRIMARY_ALIGNMENT_STRATEGY=BestMapq CLIP_OVERLAPPING_READS=true INCLUDE_SECONDARY_ALIGNMENTS=true UNMAP_CONTAMINANT_READS=false MIN_UNCLIPPED_BASES=32 MATCHING_DICTIONARY_TAGS=[M5, LN] UNMAPPED_READ_STRATEGY=DO_NOT_CHANGE VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
    [Thu Aug 16 15:33:56 CEST 2018] Executing as [email protected] on Linux 4.15.0-30-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.18.11-SNAPSHOT
    INFO    2018-08-16 15:33:56 SamAlignmentMerger  Processing SAM file(s): [/dev/stdin]
    .
    .
    .
    INFO    2018-08-16 16:02:39 SortingCollection   Creating merging iterator from 56 files
    INFO    2018-08-16 16:04:45 AbstractAlignmentMerger Written in coordinate order to output     10,000,000 records.  Elapsed time: 00:02:05s.  Time for last 10,000,000:  125s.  Last read position: 7:6,776,787
    INFO    2018-08-16 16:06:50 AbstractAlignmentMerger Written in coordinate order to output    20,000,000 records.  Elapsed time: 00:04:10s.  Time for last 10,000,000:  124s.  Last read position: 17:29,663,821
    INFO    2018-08-16 16:08:24 AbstractAlignmentMerger Wrote 27042371 alignment records and 0 unmapped reads.
    [Thu Aug 16 16:08:24 CEST 2018] picard.sam.MergeBamAlignment done. Elapsed time: 34.47minutes.
    Runtime.totalMemory()=3066560512
    

    The sort order of the output file:

    [email protected]:~/NextSeq_Runs/Evaluations/QIASeq/Results/QIASeq_01/RUN_analysis/Picard/BAM/Merged$ samtools view -H 559T_S6.merged.bam | grep '@HD'
    @HD VN:1.5  SO:coordinate
    

    As I reviewed the command line, I noticed further strange things:
    The arguments for PRIMARY_ALIGNMENT_STRATEGY and INCLUDE_SECONDARY_ALIGNMENTS were "MostDistant" and "false", respectively. The same kind of problem as with SORT_ORDER.

  • yfarjounyfarjoun ✭✭✭ Broad InstituteDev ✭✭✭

    from your logs: "SORT_ORDER=coordinate"

  • macko.tipormacko.tipor HungaryMember

    Definitely. But the argument used for SORT_ORDER was "queryname".

    So do you think the source of the problem is in my function? I don't yet see where it is. Do you have any idea? Anyway, I will test it in several different forms again.

    Thanks again for your help!

  • yfarjounyfarjoun ✭✭✭ Broad InstituteDev ✭✭✭

    could you "echo" your commandline before you execute it? that way you should be able to see the exact value the parameters have....I suspect that it isn't "coordinate"

  • macko.tipormacko.tipor HungaryMember

    I echoed it and found that all the arguments were seen right inside the function.

    Then I rewrote the whole function character by character.
    Finally I found the reason. I have recently put some notes in the working version. One of them was in wrong position and "shadowed" all the options in the following lines. However, there was no error or exeption since the invisible options were either of not required status or had some default value. That is why the default values came into effect instead of the actual arguments.

    So it works well now!

    Sorry for the trouble and thank you very much for your help once again!

  • jejacobs23jejacobs23 Portland, ORMember

    Thanks for a great tutorial.

    I have a question regarding the final comments on multiplexed samples. I'm not sure I understand why it is necessary to run MarkDuplicates twice (once prior to merging the cleaned .bam files from different lanes and once after merging). From my reading of the GATK Tool Documentation, MarkDuplicates is capable of detecting both optical and PCR duplicates. Would it be OK to merge all the cleaned .bam files first and then run MarkDuplicates once?

    Thanks.

  • shleeshlee ✭✭✭✭✭ CambridgeMember, Broadie ✭✭✭✭✭

    Hi @jejacobs23,

    You'll find an explanation to your question in the discussion thread following Tutorial#6747. Basically, the twice-marking is of interest to sequencing centers who care about keeping track of the quality of samples and runs and delivering on promised complexity and coverage, as the Broad Genomics Platforms does. The rest of the research community probably will only want to mark duplicates once, at sample aggregation.

Sign In or Register to comment.