The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Get notifications!


You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

Formatting tip!


Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ``` ) each to make a code block as demonstrated here.

Jump to another community
Picard 2.9.4 is now available. Download and read release notes here.
GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

Collected FAQs about input files for sequence read data (BAM/CRAM)

Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
edited November 2015 in FAQs

1. What file formats do you support for sequence data input?

The GATK supports the BAM format for reads, quality scores, alignments, and metadata (e.g. the lane of sequencing, center of origin, sample name, etc.). Starting with version 3.5, the CRAM format is supported as well. SAM format is not supported but can be easily converted with Picard tools.


2. How do I get my data into BAM format?

The GATK doesn't have any tools for getting data into BAM format, but many other toolkits exist for this purpose. We recommend you look at Picard and Samtools for creating and manipulating BAM files. Also, many aligners are starting to emit BAM files directly. See BWA for one such aligner.


3. What are the formatting requirements for my BAM file(s)?

All BAM/CRAM files must satisfy the following requirements:

  • It must be aligned to one of the references described here.
  • It must be sorted in coordinate order (not by queryname and not "unsorted").
  • It must list the read groups with sample names in the header.
  • Every read must belong to a read group.
  • The BAM file must pass Picard ValidateSamFile validation.

See the official BAM specification for more information on what constitutes a valid BAM file.


4. What is the canonical ordering of human reference contigs in a BAM file?

It depends on whether you're using the NCBI/GRC build 36/build 37 version of the human genome, or the UCSC hg18/hg19 version of the human genome. While substantially equivalent, the naming conventions are different. The canonical ordering of contigs for these genomes is as follows:

Human genome reference consortium standard ordering and names (b3x):
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT...

UCSC convention (hg1x):
chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY...


5. How can I tell if my BAM file is sorted properly?

The easiest way to do it is to download Samtools and run the following command to examine the header of your file:

$ samtools view -H /path/to/my.bam
@HD     VN:1.0  GO:none SO:coordinate
@SQ     SN:1    LN:247249719
@SQ     SN:2    LN:242951149
@SQ     SN:3    LN:199501827
@SQ     SN:4    LN:191273063
@SQ     SN:5    LN:180857866
@SQ     SN:6    LN:170899992
@SQ     SN:7    LN:158821424
@SQ     SN:8    LN:146274826
@SQ     SN:9    LN:140273252
@SQ     SN:10   LN:135374737
@SQ     SN:11   LN:134452384
@SQ     SN:12   LN:132349534
@SQ     SN:13   LN:114142980
@SQ     SN:14   LN:106368585
@SQ     SN:15   LN:100338915
@SQ     SN:16   LN:88827254
@SQ     SN:17   LN:78774742
@SQ     SN:18   LN:76117153
@SQ     SN:19   LN:63811651
@SQ     SN:20   LN:62435964
@SQ     SN:21   LN:46944323
@SQ     SN:22   LN:49691432
@SQ     SN:X    LN:154913754
@SQ     SN:Y    LN:57772954
@SQ     SN:MT   LN:16571
@SQ     SN:NT_113887    LN:3994
...

If the order of the contigs here matches the contig ordering specified above, and the SO:coordinate flag appears in your header, then your contig and read ordering satisfies the GATK requirements.


6. My BAM file isn't sorted that way. How can I fix it?

Picard offers a tool called SortSam that will sort a BAM file properly. A similar utility exists in Samtools, but we recommend the Picard tool because SortSam will also set a flag in the header that specifies that the file is correctly sorted, and this flag is necessary for the GATK to know it is safe to process the data. Also, you can use the ReorderSam command to make a BAM file SQ order match another reference sequence.


7. How can I tell if my BAM file has read group and sample information?

A quick Unix command using Samtools will do the trick:

$ samtools view -H /path/to/my.bam | grep '^@RG'
@RG ID:0    PL:solid    PU:Solid0044_20080829_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP   LB:Lib1 PI:2750 DT:2008-08-28T20:00:00-0400 SM:NA12414  CN:bcm
@RG ID:1    PL:solid    PU:0083_BCM_20080719_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP    LB:Lib1 PI:2750 DT:2008-07-18T20:00:00-0400 SM:NA12414  CN:bcm
@RG ID:2    PL:LS454    PU:R_2008_10_02_06_06_12_FLX01080312_retry  LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
@RG ID:3    PL:LS454    PU:R_2008_10_02_06_07_08_rig19_retry    LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
@RG ID:4    PL:LS454    PU:R_2008_10_02_17_50_32_FLX03080339_retry  LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
...

The presence of the @RG tags indicate the presence of read groups. Each read group has a SM tag, indicating the sample from which the reads belonging to that read group originate.

In addition to the presence of a read group in the header, each read must belong to one and only one read group. Given the following example reads,

$ samtools view /path/to/my.bam | grep '^@RG'
EAS139_44:2:61:681:18781    35  1   1   0   51M =   9   59  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA B<>;==?=?<==?=?=>>?>><=<?=?8<=?>?<:=?>?<==?=>:;<?:= RG:Z:4  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
EAS139_44:7:84:1300:7601    35  1   1   0   51M =   12  62  TAACCCTAAGCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA G<>;==?=?&=>?=?<==?>?<>>?=?<==?>?<==?>?1==@>?;<=><; RG:Z:3  MF:i:18 Aq:i:0  NM:i:1  UQ:i:5  H0:i:0  H1:i:85
EAS139_44:8:59:118:13881    35  1   1   0   51M =   2   52  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>;<=?=?==>?>?<==?=><=>?-?;=>?:><==?7?;<>?5?<<=>:; RG:Z:1  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
EAS139_46:3:75:1326:2391    35  1   1   0   51M =   12  62  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>==>?>@???B>A>?>A?A>??A?@>?@A?@;??A>@7>?>>@:>=@;@ RG:Z:0  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
...

membership in a read group is specified by the RG:Z:* tag. For instance, the first read belongs to read group 4 (sample NA11881), while the last read shown here belongs to read group 0 (sample NA12414).


8. My BAM file doesn't have read group and sample information. Do I really need it?

Yes! Many algorithms in the GATK need to know that certain reads were sequenced together on a specific lane, as they attempt to compensate for variability from one sequencing run to the next. Others need to know that the data represents not just one, but many samples. Without the read group and sample information, the GATK has no way of determining this critical information. You can use Picard's AddOrReplaceReadGroups tool to add read group information.


11. What's the best way to create a subset of my BAM file containing only reads over a small interval?

You can use the GATK to do the following:

java -jar GenomeAnalysisTK.jar -R reference.fasta -I full_input.bam -T PrintReads -L chr1:10-20 -o subset_input.bam

and you'll get a BAM file containing only reads overlapping those points. This operation retains the complete BAM header from the full file (this was the reference aligned to, after all) so that the BAM remains easy to work with. We routinely use these features for testing and high-performance analysis with the GATK.

Post edited by Geraldine_VdAuwera on
Tagged:

Issue · Github
by Sheila

Issue Number
368
State
closed
Last Updated
Milestone
Array
Closed By
vdauwera

Comments

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Questions and comments up to August 2014 have been moved to an archival thread here:

    http://gatkforums.broadinstitute.org/discussion/4558/questions-about-bam-files

  • everestial007everestial007 GreensboroMember

    HI @Geraldine_VdAuwera @Sheila

    After we do the bwa alignment > sorting of the bam > and mark-duplicates, should we get rid of unmapped reads and/or reads with low mapQ (based on SAM specification) for downstream analyses.

    I hunch is both yes and no based on what my overall goal is (that is what best practices recommend). But, how would it affect my choice if my main goal is SNP vs. Indel calling.

    Also, https://www.broadinstitute.org/gatk/guide/article?id=2799 link is down.

    Thanks,

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    You can get rid of those reads if you want, it won't hurt the analysis. The downside is that those reads will be gone so you can't change your mind. We don't remove anything.

    That document has been replaced by https://www.broadinstitute.org/gatk/guide/article?id=6747

    I'll fix the link so that it doesn't just break.

  • sqanbarsqanbar UppsalaMember

    I am feeding a list of BAM.gz files into PrintReads to get a subset of BAM files for a region. I get this error message !

    ERROR MESSAGE: SAM/BAM/CRAM file BL01.sorted.dedup.rlgn.bam.gz is malformed. Please see http://gatkforums.broadinstitute.org/discussion/1317/collected-faqs-about-input-files-for-sequence-read-data-bam-cramfor more information. Error details: SAM file doesn't have any read groups defined in the header. The GATK no longer supports SAM files without read groups

    I do not see any problem with BAM files. the question is can I feed "bam.gz" files into PrintReads ?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @sqanbar The error message says that the header section of your bam file does not have Read Group information (lines starting with @RG). If that is true, you must add these before proceeding with analysis. See Picard AddOrReplaceReadGroups. If this is not true and you can show us the RG lines, I will check why you would get this error.

  • JuttaJutta Bonn, GermanyMember

    Hi Geraldine,

    i run again into a new problem with my RNA-seq data set. The following is my program code for HaplotypeCaller according to the best practice pipeline. The previous steps run without any problems for this sample.

    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R Zea_mays.AGPv3.31.dna.genome.edit.fa -I sample_SplitNCigar.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 -o sample_HaploCaller.vcf

    I obtained the following error message:

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 3.6-0-g89b7209):
    ERROR
    ERROR This means that one or more arguments or inputs in your command are incorrect.
    ERROR The error message below tells you what is the problem.
    ERROR
    ERROR If the problem is an invalid argument, please check the online documentation guide
    ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ERROR
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions https://www.broadinstitute.org/gatk
    ERROR
    ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ERROR
    ERROR MESSAGE: SAM/BAM/CRAM file /media/storage/Serverdaten/GATK-SNP_calling/SplitNTrim/A554-II-2_SplitNCigar.bam is malformed. Please see https://www.broadinstitute.org/gatk/guide/article?id=1317for more information. Error details: Unrecognized tag type: :
    ERROR ------------------------------------------------------------------------------------------

    As described on the website, I checked my reorder.bam file, the one which I reorder using ReorderSam after removing duplicates using MarkDuplicate, with ValidateSamFile:
    java -jar /opt/picard/picard.jar ValidateSamFile I=sample_reordered.bam MODE=SUMMARY

    HISTOGRAM java.lang.String

    Error Type Count
    ERROR:MATE_NOT_FOUND 4378094

    Furthermore, I re-run MarkDuplicates and reorderSam and checked my .bam file again. Still I obtained the same ERROR message.

    I also checked ValidateSamFile for another sample for which Haplotype-calling step worked fine without any error messages. For this sample I obtained the same SUMMARY output of ValidateSAMFile as for the first sample, were I run into the problem with the "unrecognized tag type".

    Now, I am wondering, if the I should correct the error "MATE_NOT_FOUND", although it seems, that this error is not the cause for the failure of HaplotypeCaller.
    In case it is, how do I correct for this error?
    And otherwise, how can I remove the "Unrecognized tag type: : ". What is the cause for this error message?

    I am happy for any recommendations :)
    Best wishes
    Jutta

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
    Hmm, that's one I have not seen before. What did you use to align your data?
  • JuttaJutta Bonn, GermanyMember

    I used TopHat2

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
    I would recommend using STAR instead, as recommended in our best practices. In our hands it is the aligner that is most well-behaved.
  • JuttaJutta Bonn, GermanyMember

    Hm, I have in total 180 RNA-samples and for 179 the whole pipeline worked quite well, just for this one sample not. To redo the mapping with STAR for all the samples would take to much time and I have to do a lot of other analyses again. I guess, just mapping the one error-prone sample with STAR and keeping the other samples mapped with TopHat is not recommendable?!

    So, I would very appreciate it, if anyone has a solution how to fix the problem with the "unrecognized tag type"

    Thanks for your help,
    Jutta

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @Jutta,

    If I understand correctly, only one of 180 files is causing an error for you. All other samples processed successfully. So, let's try to figure out what is different about this one file compared to the other 179 and then figure out what may be the actual cause of error--missing mates or unrecognized tag types. I think it may be more likely that the former is the case since I would assume all your samples run with the same pipeline would give you the same set of tags. Let's confirm.

    First, have you tried rerunning this particular sample through your pipeline? Do you get the same exact output? We should rule out whether the data was truncated due to some random glitch that would cause mates to go missing.

    If you can reproduce the error, then we should figure out why this one file is different from your other 179.

    To list all tags within a BAM, use this command:

    samtools view input.bam | cut -f 12- | tr '\t' '\n' | cut -d ':' -f 1 | awk '{ if(!x[$1]++) { print }}' 
    

    Do you see any unusual tags not in your other files?

    To obtain summary metrics of properly paired mates, use this command:

    samtools flagstat input.bam
    

    Do you see an unexpected proportion of unpaired mates?

    If unpaired mates is the problem, then you should know that HaplotypeCaller does not use mate information in genotyping. So one quick solution I think is for you to unpair your reads by removing the 0x1 bitwise flag. The following command will create SE data:

    samtools view -h input_pe.bam | gawk '{printf "%s\t", $1; if(and($2,0x1))
    {t=$2-0x1}else{t=$2}; printf "%s\t" , t; for (i=3; i<NF; i++){printf "%s\t", $i} ; 
    printf "%s\n",$NF}'| samtools view -Sb - > output_se.bam
    

    Let us know what you find and any solutions that work or don't work.

  • shleeshlee CambridgeMember, Broadie, Moderator

    @Jutta,

    I need to clarify that HaplotypeCaller does use mate information to determine whether it will use reads in genotyping. If reads are paired, then the mates must align to the same contig for HaplotypeCaller to consider them in genotyping. If the mates align to different contigs, HaplotypeCaller ignores the reads in genotyping. What I meant to say above was that the mate information does not contribute to better genotyping, e.g. phasing. The last tentative solution I suggest above is one way for you to test whether unpaired mates is the reason for the error.

  • JuttaJutta Bonn, GermanyMember

    Hi Shlee,

    First, yes I rerun the sample with the same pipeline again and I could reproduce the mentioned error message. Therefore, i performed all the three suggestions you made:

    1.check the tags within the bam file:
    I applied to provided code to each bam file I produced within the pipeline based on the GATK Best practice.
    The output for the raw mapped reads (direct unprocessed output of TopHat2) is following:
    samtools view accepted_hits.bam | cut -f 12- | tr '\t' '\n' | cut -d ':' -f 1 | awk '{ if(!x[$1]++) { print }}'

    AS
    XM
    XO
    XG
    MD
    NM
    XS
    NH
    XN
    YT
    CC
    CP
    HI

    This output is the same as for other samples which I checked randomly.
    The output for the deupped and reordered bam file of this samples looks similar and is also the same for other samples, which I checked:
    samtools view dedupped.bam | cut -f 12- | tr '\t' '\n' | cut -d ':' -f 1 | awk '{ if(!x[$1]++) { print }}'

    MD
    PG
    RG
    XG
    NH
    NM
    XM
    XN
    XO
    AS
    YT
    HI
    CC
    CP
    XS

    A problem occurred, when I checked the tags for the bam file resulted from SplitNCigarReads:
    samtools view SplitNCigar.bam | cut -f 12- | tr '\t' '\n' | cut -d ':' -f 1 | awk '{ if(!x[$1]++) { print }}'

    MD
    PG
    RG
    XG
    NH
    NM
    XM
    XN
    XO
    AS
    YT
    HI
    CC
    CP
    XS
    XX
    1
    20
    �Y
    ZU
    �Y
    Y
    Here I observed these last 7 "tags", which I didn't observe in other samples, for which the whole pipeline worked fine! Seems like there is a problem with SplitNCigar? The code I used for all samples is:
    java -jar .../GenomeAnalysisTK.jar -T SplitNCigarReads -R Zea_mays.AGPv3.31.dna.genome.edit.fa -I reordered.bam -o SplitNCigar.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

    2.unexpected proportion of unpaired reads?
    samtools flagstat SplitNCigar.bam

    84282659 + 0 in total (QC-passed reads + QC-failed reads)
    8821999 + 0 secondary
    0 + 0 supplementary
    29552683 + 0 duplicates
    84282659 + 0 mapped (100.00% : N/A)
    74719435 + 0 paired in sequencing
    37500927 + 0 read1
    37218508 + 0 read2
    63033090 + 0 properly paired (84.36% : N/A)
    69735729 + 0 with itself and mate mapped
    4983706 + 0 singletons (6.67% : N/A)
    1355867 + 0 with mate mapped to a different chr
    863143 + 0 with mate mapped to a different chr (mapQ>=5)

    I can't observe here any unexpected proportions? Or do you?

    3.I unpaired the mates with the code you provided and used the output file as input for Haplotype caller, but I again obtained the following error. Not sure, if I used the right input file or if I can somehow index the file prior to inputting it to Haplotypecaller?!
    samtools view -h SplitNCigar.bam | gawk '{printf "%s\t", $1; if(and($2,0x1)) {t=$2-0x1}else{t=$2}; printf "%s\t" , t; for (i=3; i<NF; i++){printf "%s\t", $i} ; printf "%s\n",$NF}' | samtools view -Sb - > SplitNCigar_se.bam

    [E::sam_parse1] incomplete aux field
    [W::sam_read1] parse error at line 12778196
    [main_samview] truncated file.

    I don't unterstand what this output tells me and if the "unpairing" of the mates was successfully. Nevertheless, I just used SplitNCigar_se.bam as input for Haplotypecaller with the following output:
    ...
    INFO 11:08:12,732 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 11:08:12,810 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.08
    INFO 11:08:12,847 HCMappingQualityFilter - Filtering out reads with MAPQ < 20
    INFO 11:08:13,039 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 3.6-0-g89b7209):
    ERROR
    ERROR This means that one or more arguments or inputs in your command are incorrect.
    ERROR The error message below tells you what is the problem.
    ERROR
    ERROR If the problem is an invalid argument, please check the online documentation guide
    ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ERROR
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions https://www.broadinstitute.org/gatk
    ERROR
    ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ERROR
    ERROR MESSAGE: Invalid command line: Cannot process the provided BAM/CRAM file(s) because they were not indexed. The GATK does offer limited processing of unindexed BAM/CRAMs in --unsafe mode, but this feature is unsupported -- use it at your own risk!

    Sorry, for this long post!

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @Jutta,

    All this information is helpful so no need to apologize for the long post. Thanks for going through all the steps and also figuring out where in the pipeline the odd tags come from. Your SplitNCigarReads command must be fine because it works for the other samples without error. For the single problem sample, the extra tags (XX, 1, 20, �Y, ZU, �Y, Y) are definitely unexpected and likely the cause of error. Since you can recapitulate the error with the same sample file, my guess is that this is something different about the original sample data file that is causing SplitNCigarReads to produce strangely tagged records.

    Results from (2) are less interesting so let's consider the first results from (3). Samtools is saying the file is truncated (incomplete aux field, truncated file) and let's assume for now that this relates to the odd tags. If I google [E::sam_parse1] incomplete aux field, I come upon this github thread: https://github.com/chapmanb/bcbio-nextgen/issues/1639. To quote from it:

    It's typically due to some sort of memory issue during the [pipeline] run.

    The thread suggests this message appears due to a memory allocation issue in a pipeline. Errors due to memory constraints is not an area I am familiar with so please let me know if this makes sense to you. Is it possible your particular sample requires more memory to process than the other samples? Is the file size larger than the others or is there reason to believe it may have other unusual properties that necessitate more compute (for SplitNCigarReads or prior steps)? If you process this sample through the same steps manually, outside the pipeline, with more memory, or within the pipeline with more memory, do you get the same odd tagging?

  • shleeshlee CambridgeMember, Broadie, Moderator

    P.S. @Jutta. Can you also run your problem BAM through ValidateSamFile? This is an easier way to suss out potential issues with BAMs. Hopefully, our tool catches such odd files as yours.

Sign In or Register to comment.