Service notice: Several of our team members are on vacation so service will be slow through at least July 13th, possibly longer depending on how much backlog accumulates during that time. This means that for a while it may take us more time than usual to answer your questions. Thank you for your patience.

merge multi-sample bam files

nansnans Member
edited January 2015 in Ask the GATK team

Hi,
I have with me 1000 samples that were pooled together in 5 different libraries and sequenced for genomic regions. So first, for each library, I independantly performed alignment, marked duplicates,realigned the bam files and also did recalibration and separated the bam file for each individual. Now I have 1000 individual bam files and I wanted to check the target coverage and perform variant calling.
I wanted to know if it makes sense to merge the bam files according to cases:controls and perform the analysis with 2 bam files or is it better to carry on with single individual bam files ?

Many thanks.

Post edited by nans on

Best Answer

Answers

  • So to run HaplotypeCaller, does it accept a list of bam files or we need to give the path of each bam file with the -I option ?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @nans‌

    Hi,

    Haplotype Caller can accept a list of bam files, however, if you think more samples may be added to your experiment in the future, you should run Haplotype Caller in GVCF mode. GVCF mode only accepts one bam file at a time. The articles Geraldine referred you to will help clarify the workflow.

    -Sheila

  • Hi, I have been trying to run the HaplotypeCaller command

        java -Xmx6g -jar /GATK/GenomeAnalysisTK.jar \
                -T HaplotypeCaller \
                -R /GATK/GATK_resources/ucsc.hg19.fasta \
                -I ${file}_sample.bam \   ##sample names(n=1000)
                -ERC GVCF \
                --variant_index_type LINEAR \
                --variant_index_parameter 128000 \
                -D /GATK/GATK_resources/dbsnp_138.hg19.vcf \
                -L $interval \
                -o $output/${file}.raw.snps.indels.g.vcf
    

    but running into an error

    INFO  14:38:06,240 HelpFormatter - -------------------------------------------------------------------------------- 
    INFO  14:38:06,242 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22 
    INFO  14:38:06,242 HelpFormatter - Copyright (c) 2010 The Broad Institute 
    INFO  14:38:06,242 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
    INFO  14:38:06,247 HelpFormatter - Program Args: -T HaplotypeCaller -R /GATK/GATK_resources/ucsc.hg19.fasta -I 1_sample.bam -ERC GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -D /GATK/GATK_resources/dbsnp_138.hg19.vcf -L /gatk_regions.intervals -o /samples/1.raw.snps.indels.g.vcf 
    INFO  14:38:06,252 HelpFormatter - Executing as [email protected] on Linux 2.6.32-504.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0-b132. 
    INFO  14:38:06,252 HelpFormatter - Date/Time: 2015/01/19 14:38:06 
    INFO  14:38:06,253 HelpFormatter - -------------------------------------------------------------------------------- 
    INFO  14:38:06,253 HelpFormatter - -------------------------------------------------------------------------------- 
    INFO  14:38:07,062 GenomeAnalysisEngine - Strictness is SILENT 
    INFO  14:38:07,160 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250 
    INFO  14:38:07,170 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
    INFO  14:38:07,224 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.05 
    INFO  14:38:07,251 HCMappingQualityFilter - Filtering out reads with MAPQ < 20 
    INFO  14:38:07,494 IntervalUtils - Processing 94588 bp from intervals 
    INFO  14:38:07,576 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files 
    INFO  14:38:07,602 GenomeAnalysisEngine - Done preparing for traversal 
    INFO  14:38:07,602 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
    INFO  14:38:07,603 ProgressMeter -                 |      processed |    time |         per 1M |           |   total | remaining 
    INFO  14:38:07,603 ProgressMeter -        Location | active regions | elapsed | active regions | completed | runtime |   runtime 
    INFO  14:38:07,603 HaplotypeCaller - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output 
    INFO  14:38:07,603 HaplotypeCaller - All sites annotated with PLs forced to true for reference-model confidence output 
    INFO  14:38:08,147 HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Network is unreachable 
    INFO  14:38:08,147 HttpMethodDirector - Retrying request 
    INFO  14:38:08,149 HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Network is unreachable 
    INFO  14:38:08,149 HttpMethodDirector - Retrying request 
    INFO  14:38:08,151 HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Network is unreachable 
    INFO  14:38:08,151 HttpMethodDirector - Retrying request 
    INFO  14:38:08,152 HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Network is unreachable 
    INFO  14:38:08,153 HttpMethodDirector - Retrying request 
    INFO  14:38:08,154 HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Network is unreachable 
    INFO  14:38:08,154 HttpMethodDirector - Retrying request 
    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR A USER ERROR has occurred (version 3.3-0-g37228af): 
    ##### ERROR
    ##### ERROR This means that one or more arguments or inputs in your command are incorrect.
    ##### ERROR The error message below tells you what is the problem.
    ##### ERROR
    ##### ERROR If the problem is an invalid argument, please check the online documentation guide
    ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ##### ERROR
    ##### ERROR Visit our website and forum for extensive documentation and answers to 
    ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ##### ERROR
    ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ##### ERROR
    ##### ERROR MESSAGE: Invalid command line: Argument emitRefConfidence has a bad value: Can only be used in single sample mode currently. Use the sample_name argument to run on a single sample out of a multi-sample BAM file.
    ##### ERROR ------------------------------------------------------------------------------------------
    

    I am not understanding why I get the error "Network is unreachable" and "Invalid command line: Argument emitRefConfidence has a bad value". My input is single sample bam files and not the merged bam file.

    Many thanks,

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @nans You can ignore the network error message. Assuming you're running on a single bam file, you should check the header to make sure all your read groups have the same sample identifier (SM tag).

  • robertbrobertb torontoMember

    Hello,

    I'm trying to add to my high coverage ( 30X ) control samples by pooling low coverage 1000GP samples to reach the desired coverage. I ran the GATK pipeline as usual for each 1000GP sample and now have finished the BQSR step for the individual sample data bams. Normally, the next step would be haplotype caller. However, I was thinking about merging these low coverage 1000GP sample bams to create high coverage meta-sample bams that approach the high coverage of my other samples. Is this the right approach given what I want to do? Thanks.

    Issue · Github
    by Sheila

    Issue Number
    518
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • robertbrobertb torontoMember

    I'm aware that extreme allele allele frequencies are not handled well by the software. This is obviously not an ideal situation but I need more high coverage samples. Presumably in a meta-sample those variants that are common to the individual samples that comprise the meta-sample would likely be called whereas rare variants would either not be called or not pass quality filter.

    My only concern is how the software will handle these meta-sample bams.
    My plan was to change the sample ID in the read group information for each of the sample bams that I want to merge. Once they all have the same sample ID I would then merge the sample BAMs together the same way you would for example if you had separate bams for multiple sequencing runs of the same sample.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @robertb I think I see what you're trying to do, but I wouldn't advise continuing on that track. These meta-samples are going to confuse the heck out of the variant calling and genotyping algorithms (not to mention filtering, because the technical differences and pooling are going to wreak havoc on the annotations clustering trends). For pure joint variant calling purposes you'd actually be better off just adding those low-coverage samples as they are, without any pooling. Admittedly for filtering, if you want to use recalibration, the coverage distribution differences could be a source of noise -- but not as bad as you'd have with the meta-samples pooling.

Sign In or Register to comment.