Forum Login Issue:
Currently the "Log in with Google" button redirects you to a "Page not found." Our forum vendors have implemented a fix, and now we are just waiting on a patch to be released. In the meantime, while on the "Page not found" you can edit the URL to delete the second gatk, firecloud, or wdl (depending on what subforum you are acessing).
ex: https://gatkforums.broadinstitute.org/gatk/gatk/entry/...

Collected FAQs about input files for sequence read data (BAM/CRAM)

Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
edited November 2015 in Frequently Asked Questions

1. What file formats do you support for sequence data input?

The GATK supports the BAM format for reads, quality scores, alignments, and metadata (e.g. the lane of sequencing, center of origin, sample name, etc.). Starting with version 3.5, the CRAM format is supported as well. SAM format is not supported but can be easily converted with Picard tools.


2. How do I get my data into BAM format?

The GATK doesn't have any tools for getting data into BAM format, but many other toolkits exist for this purpose. We recommend you look at Picard and Samtools for creating and manipulating BAM files. Also, many aligners are starting to emit BAM files directly. See BWA for one such aligner.


3. What are the formatting requirements for my BAM file(s)?

All BAM/CRAM files must satisfy the following requirements:

  • It must be aligned to one of the references described here.
  • It must be sorted in coordinate order (not by queryname and not "unsorted").
  • It must list the read groups with sample names in the header.
  • Every read must belong to a read group.
  • The BAM file must pass Picard ValidateSamFile validation.

See the official BAM specification for more information on what constitutes a valid BAM file.


4. What is the canonical ordering of human reference contigs in a BAM file?

It depends on whether you're using the NCBI/GRC build 36/build 37 version of the human genome, or the UCSC hg18/hg19 version of the human genome. While substantially equivalent, the naming conventions are different. The canonical ordering of contigs for these genomes is as follows:

Human genome reference consortium standard ordering and names (b3x):
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT...

UCSC convention (hg1x):
chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY...


5. How can I tell if my BAM file is sorted properly?

The easiest way to do it is to download Samtools and run the following command to examine the header of your file:

$ samtools view -H /path/to/my.bam
@HD     VN:1.0  GO:none SO:coordinate
@SQ     SN:1    LN:247249719
@SQ     SN:2    LN:242951149
@SQ     SN:3    LN:199501827
@SQ     SN:4    LN:191273063
@SQ     SN:5    LN:180857866
@SQ     SN:6    LN:170899992
@SQ     SN:7    LN:158821424
@SQ     SN:8    LN:146274826
@SQ     SN:9    LN:140273252
@SQ     SN:10   LN:135374737
@SQ     SN:11   LN:134452384
@SQ     SN:12   LN:132349534
@SQ     SN:13   LN:114142980
@SQ     SN:14   LN:106368585
@SQ     SN:15   LN:100338915
@SQ     SN:16   LN:88827254
@SQ     SN:17   LN:78774742
@SQ     SN:18   LN:76117153
@SQ     SN:19   LN:63811651
@SQ     SN:20   LN:62435964
@SQ     SN:21   LN:46944323
@SQ     SN:22   LN:49691432
@SQ     SN:X    LN:154913754
@SQ     SN:Y    LN:57772954
@SQ     SN:MT   LN:16571
@SQ     SN:NT_113887    LN:3994
...

If the order of the contigs here matches the contig ordering specified above, and the SO:coordinate flag appears in your header, then your contig and read ordering satisfies the GATK requirements.


6. My BAM file isn't sorted that way. How can I fix it?

Picard offers a tool called SortSam that will sort a BAM file properly. A similar utility exists in Samtools, but we recommend the Picard tool because SortSam will also set a flag in the header that specifies that the file is correctly sorted, and this flag is necessary for the GATK to know it is safe to process the data. Also, you can use the ReorderSam command to make a BAM file SQ order match another reference sequence.


7. How can I tell if my BAM file has read group and sample information?

A quick Unix command using Samtools will do the trick:

$ samtools view -H /path/to/my.bam | grep '^@RG'
@RG ID:0    PL:solid    PU:Solid0044_20080829_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP   LB:Lib1 PI:2750 DT:2008-08-28T20:00:00-0400 SM:NA12414  CN:bcm
@RG ID:1    PL:solid    PU:0083_BCM_20080719_1_Pilot1_Ceph_12414_B_lib_1_2Kb_MP_Pilot1_Ceph_12414_B_lib_1_2Kb_MP    LB:Lib1 PI:2750 DT:2008-07-18T20:00:00-0400 SM:NA12414  CN:bcm
@RG ID:2    PL:LS454    PU:R_2008_10_02_06_06_12_FLX01080312_retry  LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
@RG ID:3    PL:LS454    PU:R_2008_10_02_06_07_08_rig19_retry    LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
@RG ID:4    PL:LS454    PU:R_2008_10_02_17_50_32_FLX03080339_retry  LB:HL#01_NA11881    PI:0    SM:NA11881  CN:454MSC
...

The presence of the @RG tags indicate the presence of read groups. Each read group has a SM tag, indicating the sample from which the reads belonging to that read group originate.

In addition to the presence of a read group in the header, each read must belong to one and only one read group. Given the following example reads,

$ samtools view /path/to/my.bam | grep '^@RG'
EAS139_44:2:61:681:18781    35  1   1   0   51M =   9   59  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA B<>;==?=?<==?=?=>>?>><=<?=?8<=?>?<:=?>?<==?=>:;<?:= RG:Z:4  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
EAS139_44:7:84:1300:7601    35  1   1   0   51M =   12  62  TAACCCTAAGCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA G<>;==?=?&=>?=?<==?>?<>>?=?<==?>?<==?>?1==@>?;<=><; RG:Z:3  MF:i:18 Aq:i:0  NM:i:1  UQ:i:5  H0:i:0  H1:i:85
EAS139_44:8:59:118:13881    35  1   1   0   51M =   2   52  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>;<=?=?==>?>?<==?=><=>?-?;=>?:><==?7?;<>?5?<<=>:; RG:Z:1  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
EAS139_46:3:75:1326:2391    35  1   1   0   51M =   12  62  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA @<>==>?>@???B>A>?>A?A>??A?@>?@A?@;??A>@7>?>>@:>=@;@ RG:Z:0  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
...

membership in a read group is specified by the RG:Z:* tag. For instance, the first read belongs to read group 4 (sample NA11881), while the last read shown here belongs to read group 0 (sample NA12414).


8. My BAM file doesn't have read group and sample information. Do I really need it?

Yes! Many algorithms in the GATK need to know that certain reads were sequenced together on a specific lane, as they attempt to compensate for variability from one sequencing run to the next. Others need to know that the data represents not just one, but many samples. Without the read group and sample information, the GATK has no way of determining this critical information. You can use Picard's AddOrReplaceReadGroups tool to add read group information.


11. What's the best way to create a subset of my BAM file containing only reads over a small interval?

You can use the GATK to do the following:

java -jar GenomeAnalysisTK.jar -R reference.fasta -I full_input.bam -T PrintReads -L chr1:10-20 -o subset_input.bam

and you'll get a BAM file containing only reads overlapping those points. This operation retains the complete BAM header from the full file (this was the reference aligned to, after all) so that the BAM remains easy to work with. We routinely use these features for testing and high-performance analysis with the GATK.

Post edited by Geraldine_VdAuwera on
Tagged:

Issue · Github
by Sheila

Issue Number
368
State
closed
Last Updated
Milestone
Array
Closed By
vdauwera

Comments

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Questions and comments up to August 2014 have been moved to an archival thread here:

    http://gatkforums.broadinstitute.org/discussion/4558/questions-about-bam-files

  • everestial007everestial007 GreensboroMember

    HI @Geraldine_VdAuwera @Sheila

    After we do the bwa alignment > sorting of the bam > and mark-duplicates, should we get rid of unmapped reads and/or reads with low mapQ (based on SAM specification) for downstream analyses.

    I hunch is both yes and no based on what my overall goal is (that is what best practices recommend). But, how would it affect my choice if my main goal is SNP vs. Indel calling.

    Also, https://www.broadinstitute.org/gatk/guide/article?id=2799 link is down.

    Thanks,

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    You can get rid of those reads if you want, it won't hurt the analysis. The downside is that those reads will be gone so you can't change your mind. We don't remove anything.

    That document has been replaced by https://www.broadinstitute.org/gatk/guide/article?id=6747

    I'll fix the link so that it doesn't just break.

  • sqanbarsqanbar UppsalaMember

    I am feeding a list of BAM.gz files into PrintReads to get a subset of BAM files for a region. I get this error message !

    ERROR MESSAGE: SAM/BAM/CRAM file BL01.sorted.dedup.rlgn.bam.gz is malformed. Please see http://gatkforums.broadinstitute.org/discussion/1317/collected-faqs-about-input-files-for-sequence-read-data-bam-cramfor more information. Error details: SAM file doesn't have any read groups defined in the header. The GATK no longer supports SAM files without read groups

    I do not see any problem with BAM files. the question is can I feed "bam.gz" files into PrintReads ?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @sqanbar The error message says that the header section of your bam file does not have Read Group information (lines starting with @RG). If that is true, you must add these before proceeding with analysis. See Picard AddOrReplaceReadGroups. If this is not true and you can show us the RG lines, I will check why you would get this error.

  • JuttaJutta Bonn, GermanyMember

    Hi Geraldine,

    i run again into a new problem with my RNA-seq data set. The following is my program code for HaplotypeCaller according to the best practice pipeline. The previous steps run without any problems for this sample.

    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R Zea_mays.AGPv3.31.dna.genome.edit.fa -I sample_SplitNCigar.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 -o sample_HaploCaller.vcf

    I obtained the following error message:

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 3.6-0-g89b7209):
    ERROR
    ERROR This means that one or more arguments or inputs in your command are incorrect.
    ERROR The error message below tells you what is the problem.
    ERROR
    ERROR If the problem is an invalid argument, please check the online documentation guide
    ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ERROR
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions https://www.broadinstitute.org/gatk
    ERROR
    ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ERROR
    ERROR MESSAGE: SAM/BAM/CRAM file /media/storage/Serverdaten/GATK-SNP_calling/SplitNTrim/A554-II-2_SplitNCigar.bam is malformed. Please see https://www.broadinstitute.org/gatk/guide/article?id=1317for more information. Error details: Unrecognized tag type: :
    ERROR ------------------------------------------------------------------------------------------

    As described on the website, I checked my reorder.bam file, the one which I reorder using ReorderSam after removing duplicates using MarkDuplicate, with ValidateSamFile:
    java -jar /opt/picard/picard.jar ValidateSamFile I=sample_reordered.bam MODE=SUMMARY

    HISTOGRAM java.lang.String

    Error Type Count
    ERROR:MATE_NOT_FOUND 4378094

    Furthermore, I re-run MarkDuplicates and reorderSam and checked my .bam file again. Still I obtained the same ERROR message.

    I also checked ValidateSamFile for another sample for which Haplotype-calling step worked fine without any error messages. For this sample I obtained the same SUMMARY output of ValidateSAMFile as for the first sample, were I run into the problem with the "unrecognized tag type".

    Now, I am wondering, if the I should correct the error "MATE_NOT_FOUND", although it seems, that this error is not the cause for the failure of HaplotypeCaller.
    In case it is, how do I correct for this error?
    And otherwise, how can I remove the "Unrecognized tag type: : ". What is the cause for this error message?

    I am happy for any recommendations :)
    Best wishes
    Jutta

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
    Hmm, that's one I have not seen before. What did you use to align your data?
  • JuttaJutta Bonn, GermanyMember

    I used TopHat2

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
    I would recommend using STAR instead, as recommended in our best practices. In our hands it is the aligner that is most well-behaved.
  • JuttaJutta Bonn, GermanyMember

    Hm, I have in total 180 RNA-samples and for 179 the whole pipeline worked quite well, just for this one sample not. To redo the mapping with STAR for all the samples would take to much time and I have to do a lot of other analyses again. I guess, just mapping the one error-prone sample with STAR and keeping the other samples mapped with TopHat is not recommendable?!

    So, I would very appreciate it, if anyone has a solution how to fix the problem with the "unrecognized tag type"

    Thanks for your help,
    Jutta

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @Jutta,

    If I understand correctly, only one of 180 files is causing an error for you. All other samples processed successfully. So, let's try to figure out what is different about this one file compared to the other 179 and then figure out what may be the actual cause of error--missing mates or unrecognized tag types. I think it may be more likely that the former is the case since I would assume all your samples run with the same pipeline would give you the same set of tags. Let's confirm.

    First, have you tried rerunning this particular sample through your pipeline? Do you get the same exact output? We should rule out whether the data was truncated due to some random glitch that would cause mates to go missing.

    If you can reproduce the error, then we should figure out why this one file is different from your other 179.

    To list all tags within a BAM, use this command:

    samtools view input.bam | cut -f 12- | tr '\t' '\n' | cut -d ':' -f 1 | awk '{ if(!x[$1]++) { print }}' 
    

    Do you see any unusual tags not in your other files?

    To obtain summary metrics of properly paired mates, use this command:

    samtools flagstat input.bam
    

    Do you see an unexpected proportion of unpaired mates?

    If unpaired mates is the problem, then you should know that HaplotypeCaller does not use mate information in genotyping. So one quick solution I think is for you to unpair your reads by removing the 0x1 bitwise flag. The following command will create SE data:

    samtools view -h input_pe.bam | gawk '{printf "%s\t", $1; if(and($2,0x1))
    {t=$2-0x1}else{t=$2}; printf "%s\t" , t; for (i=3; i<NF; i++){printf "%s\t", $i} ; 
    printf "%s\n",$NF}'| samtools view -Sb - > output_se.bam
    

    Let us know what you find and any solutions that work or don't work.

  • shleeshlee CambridgeMember, Broadie, Moderator

    @Jutta,

    I need to clarify that HaplotypeCaller does use mate information to determine whether it will use reads in genotyping. If reads are paired, then the mates must align to the same contig for HaplotypeCaller to consider them in genotyping. If the mates align to different contigs, HaplotypeCaller ignores the reads in genotyping. What I meant to say above was that the mate information does not contribute to better genotyping, e.g. phasing. The last tentative solution I suggest above is one way for you to test whether unpaired mates is the reason for the error.

  • JuttaJutta Bonn, GermanyMember

    Hi Shlee,

    First, yes I rerun the sample with the same pipeline again and I could reproduce the mentioned error message. Therefore, i performed all the three suggestions you made:

    1.check the tags within the bam file:
    I applied to provided code to each bam file I produced within the pipeline based on the GATK Best practice.
    The output for the raw mapped reads (direct unprocessed output of TopHat2) is following:
    samtools view accepted_hits.bam | cut -f 12- | tr '\t' '\n' | cut -d ':' -f 1 | awk '{ if(!x[$1]++) { print }}'

    AS
    XM
    XO
    XG
    MD
    NM
    XS
    NH
    XN
    YT
    CC
    CP
    HI

    This output is the same as for other samples which I checked randomly.
    The output for the deupped and reordered bam file of this samples looks similar and is also the same for other samples, which I checked:
    samtools view dedupped.bam | cut -f 12- | tr '\t' '\n' | cut -d ':' -f 1 | awk '{ if(!x[$1]++) { print }}'

    MD
    PG
    RG
    XG
    NH
    NM
    XM
    XN
    XO
    AS
    YT
    HI
    CC
    CP
    XS

    A problem occurred, when I checked the tags for the bam file resulted from SplitNCigarReads:
    samtools view SplitNCigar.bam | cut -f 12- | tr '\t' '\n' | cut -d ':' -f 1 | awk '{ if(!x[$1]++) { print }}'

    MD
    PG
    RG
    XG
    NH
    NM
    XM
    XN
    XO
    AS
    YT
    HI
    CC
    CP
    XS
    XX
    1
    20
    �Y
    ZU
    �Y
    Y
    Here I observed these last 7 "tags", which I didn't observe in other samples, for which the whole pipeline worked fine! Seems like there is a problem with SplitNCigar? The code I used for all samples is:
    java -jar .../GenomeAnalysisTK.jar -T SplitNCigarReads -R Zea_mays.AGPv3.31.dna.genome.edit.fa -I reordered.bam -o SplitNCigar.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

    2.unexpected proportion of unpaired reads?
    samtools flagstat SplitNCigar.bam

    84282659 + 0 in total (QC-passed reads + QC-failed reads)
    8821999 + 0 secondary
    0 + 0 supplementary
    29552683 + 0 duplicates
    84282659 + 0 mapped (100.00% : N/A)
    74719435 + 0 paired in sequencing
    37500927 + 0 read1
    37218508 + 0 read2
    63033090 + 0 properly paired (84.36% : N/A)
    69735729 + 0 with itself and mate mapped
    4983706 + 0 singletons (6.67% : N/A)
    1355867 + 0 with mate mapped to a different chr
    863143 + 0 with mate mapped to a different chr (mapQ>=5)

    I can't observe here any unexpected proportions? Or do you?

    3.I unpaired the mates with the code you provided and used the output file as input for Haplotype caller, but I again obtained the following error. Not sure, if I used the right input file or if I can somehow index the file prior to inputting it to Haplotypecaller?!
    samtools view -h SplitNCigar.bam | gawk '{printf "%s\t", $1; if(and($2,0x1)) {t=$2-0x1}else{t=$2}; printf "%s\t" , t; for (i=3; i<NF; i++){printf "%s\t", $i} ; printf "%s\n",$NF}' | samtools view -Sb - > SplitNCigar_se.bam

    [E::sam_parse1] incomplete aux field
    [W::sam_read1] parse error at line 12778196
    [main_samview] truncated file.

    I don't unterstand what this output tells me and if the "unpairing" of the mates was successfully. Nevertheless, I just used SplitNCigar_se.bam as input for Haplotypecaller with the following output:
    ...
    INFO 11:08:12,732 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 11:08:12,810 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.08
    INFO 11:08:12,847 HCMappingQualityFilter - Filtering out reads with MAPQ < 20
    INFO 11:08:13,039 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 3.6-0-g89b7209):
    ERROR
    ERROR This means that one or more arguments or inputs in your command are incorrect.
    ERROR The error message below tells you what is the problem.
    ERROR
    ERROR If the problem is an invalid argument, please check the online documentation guide
    ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ERROR
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions https://www.broadinstitute.org/gatk
    ERROR
    ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ERROR
    ERROR MESSAGE: Invalid command line: Cannot process the provided BAM/CRAM file(s) because they were not indexed. The GATK does offer limited processing of unindexed BAM/CRAMs in --unsafe mode, but this feature is unsupported -- use it at your own risk!

    Sorry, for this long post!

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @Jutta,

    All this information is helpful so no need to apologize for the long post. Thanks for going through all the steps and also figuring out where in the pipeline the odd tags come from. Your SplitNCigarReads command must be fine because it works for the other samples without error. For the single problem sample, the extra tags (XX, 1, 20, �Y, ZU, �Y, Y) are definitely unexpected and likely the cause of error. Since you can recapitulate the error with the same sample file, my guess is that this is something different about the original sample data file that is causing SplitNCigarReads to produce strangely tagged records.

    Results from (2) are less interesting so let's consider the first results from (3). Samtools is saying the file is truncated (incomplete aux field, truncated file) and let's assume for now that this relates to the odd tags. If I google [E::sam_parse1] incomplete aux field, I come upon this github thread: https://github.com/chapmanb/bcbio-nextgen/issues/1639. To quote from it:

    It's typically due to some sort of memory issue during the [pipeline] run.

    The thread suggests this message appears due to a memory allocation issue in a pipeline. Errors due to memory constraints is not an area I am familiar with so please let me know if this makes sense to you. Is it possible your particular sample requires more memory to process than the other samples? Is the file size larger than the others or is there reason to believe it may have other unusual properties that necessitate more compute (for SplitNCigarReads or prior steps)? If you process this sample through the same steps manually, outside the pipeline, with more memory, or within the pipeline with more memory, do you get the same odd tagging?

  • shleeshlee CambridgeMember, Broadie, Moderator

    P.S. @Jutta. Can you also run your problem BAM through ValidateSamFile? This is an easier way to suss out potential issues with BAMs. Hopefully, our tool catches such odd files as yours.

  • Hi @Geraldine_VdAuwera

    In the section ”7. How can I tell if my BAM file has read group and sample information?“,you give a command samtools view /path/to/my.bam | grep '^@RGto print the read wich include "RG". I think it is useless to match "RG" in reads, because reads don't start with "RG". May you want to type samtools view /path/to/my.bam | grep 'RG', right?

    If I am wrong, Please tell me.

    yours, gossie

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    No, the command looks specifically for lines that start with @RG, which is specific of header lines that define read groups. You could restrict the samtools view command to just the header, for efficiency, but aside from that the command is correct.

  • @Geraldine_VdAuwera,

    Thanks for your reply. Yes, the command is good for header lines, but the same command is ok for read lines?

    Following content from your original:

    In addition to the presence of a read group in the header, each read must belong to one and only one read group. Given the following example reads,

    $ samtools view /path/to/my.bam | grep '^@RG'
    EAS139_44:2:61:681:18781    35  1   1   0   51M =   9   59  TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA B<>;==?=?><==?=?=>>?>><=<?=?8<=?>?<:=?>?<==?=>:;<?:= RG:Z:4  MF:i:18 Aq:i:0  NM:i:0  UQ:i:0  H0:i:85 H1:i:31
    EAS139_44:7:84:1300:7601    35  1   1   0   51M =   12  62  TAACCCTAAGCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA G<>;==?=?&=>?=?<==?>?<>>?=?<==?>?<==?>?1==@>?;<=><; RG:Z:3  MF:i:18 Aq:i:0  NM:i:1  UQ:i:5  H0:i:0  H1:i:85
    
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Ah, well that’s a different purpose. What we’re talking about here is a quick check to see if read group definitions are present or not. If you actually want to verify that each read has a read group, you’re better off running ValidateSamFile which will report the results more sanely.

  • It is you that use the command samtools view /path/to/my.bam | grep '^@RG to match read lines with "RG" and It's wrong, although it's a small mistake. I advise you to read the section in your original.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Gossie
    Hi,

    I just re-read the section, and it looks fine. I am not sure where you are seeing "match read lines with "RG" "?

    -Sheila

  • @Geraldine_VdAuwera @Sheila
    Hi, when I run GATK HaplotypeCaller command I got an issue,but I got a result raw.vcf contain some information:

    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR A USER ERROR has occurred (version 3.8-0-ge9d806836):
    ##### ERROR
    ##### ERROR This means that one or more arguments or inputs in your command are incorrect.
    ##### ERROR The error message below tells you what is the problem.
    ##### ERROR
    ##### ERROR If the problem is an invalid argument, please check the online documentation guide
    ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ##### ERROR
    ##### ERROR Visit our website and forum for extensive documentation and answers to
    ##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
    ##### ERROR
    ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ##### ERROR
    ##### ERROR MESSAGE: SAM/BAM/CRAM file /media/taoyan/cotton-zhangxianlong/analysis/bwa-bam/SRR1580583.marked.bam is malformed. Please see https://software.broadinstitute.org/gatk/documentation/article?id=1317for more information. Error details: Did not inflate expected amount
    ##### ERROR ------------------------------------------------------------------------------------------
    

    And my command is:

    #!/bin/bash
    java -Xmx32g -jar /home/taoyan/biosoft/gatk/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar \
         -T HaplotypeCaller \
         -R /media/taoyan/reference-seq/biodata/cotton-ref/NBI_Gossypium_hirsutum_v1.1.fa \
         -I /media/taoyan/cotton-zhangxianlong/analysis/bwa-bam/SRR1580583.marked.bam \
         --genotyping_mode DISCOVERY \
         -o /media/taoyan/cotton-zhangxianlong/data/seqs/SRR1580583.raw.snps.indels.vcf
    

    first I aligh the sequence with BWA,then sort 、reorder and Mrkduplicate with picard.My bam is like that:

    $ samtools view -H SRR1580583.marked.bam|grep '^@RG'
    @RG     ID:583  LB:583  PL:Illumina     PU:583  SM:583
    $ samtools view -H SRR1580583.marked.bam |head
    @HD     VN:1.5  SO:coordinate
    @SQ     SN:A01  LN:99884700     M5:12e28e0311e760f4fcee0d72006a2ab3     UR:file:/media/taoyan/reference-seq/biodata/cotton-ref/NBI_Gossypium_hirsutum_v1.1.fa
    @SQ     SN:A02  LN:83447906     M5:8c578a1af12657fa12427cf7f9048498     UR:file:/media/taoyan/reference-seq/biodata/cotton-ref/NBI_Gossypium_hirsutum_v1.1.fa
    @SQ     SN:A03  LN:100263045    M5:e152719f9d4abdb6754e69e0c83b5478     UR:file:/media/taoyan/reference-seq/biodata/cotton-ref/NBI_Gossypium_hirsutum_v1.1.fa
    @SQ     SN:A04  LN:62913772     M5:a8356a76049c178286d25786388baab9     UR:file:/media/taoyan/reference-seq/biodata/cotton-ref/NBI_Gossypium_hirsutum_v1.1.fa
    @SQ     SN:A05  LN:92047023     M5:2e21804703461bdad8f79727d964ca4a     UR:file:/media/taoyan/reference-seq/biodata/cotton-ref/NBI_Gossypium_hirsutum_v1.1.fa
    @SQ     SN:A06  LN:103170444    M5:68d43f50938bd83b112c446b9a4dc410     UR:file:/media/taoyan/reference-seq/biodata/cotton-ref/NBI_Gossypium_hirsutum_v1.1.fa
    @SQ     SN:A07  LN:78251018     M5:41f31563d8dc05dcd7fdf31000c2d393     UR:file:/media/taoyan/reference-seq/biodata/cotton-ref/NBI_Gossypium_hirsutum_v1.1.fa
    @SQ     SN:A08  LN:103626341    M5:aaea5b5a9ca8f8ad6b894b381bcfee1f     UR:file:/media/taoyan/reference-seq/biodata/cotton-ref/NBI_Gossypium_hirsutum_v1.1.fa
    @SQ     SN:A09  LN:74999931     M5:eb20c9cea068cd638a2e35e4226cada8     UR:file:/media/taoyan/reference-seq/biodata/cotton-ref/NBI_Gossypium_hirsutum_v1.1.fa
    

    So is there any suggestion how to fix it? I have run the command to validate it, and here it's the result:

    $ java -jar ~/biosoft/picard/picard.jar ValidateSamFile I=/media/taoyan/cotton-zhangxianlong/analysis/bwa-bam/SRR1580583.marked.bam MODE=SUMMARY
    14:19:22.327 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/taoyan/biosoft/picard/picard.jar!/com/intel/gkl/native/libgkl_compression.so
    [Sat Dec 16 14:19:22 CST 2017] ValidateSamFile INPUT=/media/taoyan/cotton-zhangxianlong/analysis/bwa-bam/SRR1580583.marked.bam MODE=SUMMARY    MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 SKIP_MATE_VALIDATION=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
    [Sat Dec 16 14:19:22 CST 2017] Executing as taoyan@davinci on Linux 4.10.0-37-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_121-b15; Deflater: Intel; Inflater: Intel; Picard version: 2.15.0-SNAPSHOT
    INFO    2017-12-16 14:20:05     SamFileValidator        Validated Read    10,000,000 records.  Elapsed time: 00:00:41s.  Time for last 10,000,000:   41s.  Last read position: A04:24,798,585
    INFO    2017-12-16 14:20:45     SamFileValidator        Validated Read    20,000,000 records.  Elapsed time: 00:01:21s.  Time for last 10,000,000:   40s.  Last read position: A07:76,082,514
    INFO    2017-12-16 14:21:26     SamFileValidator        Validated Read    30,000,000 records.  Elapsed time: 00:02:02s.  Time for last 10,000,000:   40s.  Last read position: A11:35,688,146
    INFO    2017-12-16 14:22:06     SamFileValidator        Validated Read    40,000,000 records.  Elapsed time: 00:02:42s.  Time for last 10,000,000:   40s.  Last read position: D02:35,617,652
    INFO    2017-12-16 14:22:47     SamFileValidator        Validated Read    50,000,000 records.  Elapsed time: 00:03:23s.  Time for last 10,000,000:   40s.  Last read position: D08:24,990,584
    ERROR   2017-12-16 14:23:18     ValidateSamFile java.util.zip.DataFormatException: invalid distance too far back
    [Sat Dec 16 14:23:18 CST 2017] picard.sam.ValidateSamFile done. Elapsed time: 3.93 minutes.
    Runtime.totalMemory()=6561464320
    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
    

    Thanks!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @taoyan
    Hi,

    Did you ever compress your file? It seems like something went wrong during compression or decompression. Can you go back through your steps and find the step where the error may have occurred? You can try running ValidateSamFile on each output after each step.

    -Sheila

  • @Sheila said:
    @taoyan
    Hi,

    Did you ever compress your file? It seems like something went wrong during compression or decompression. Can you go back through your steps and find the step where the error may have occurred? You can try running ValidateSamFile on each output after each step.

    -Sheila

    @Sheila yeah,I am going back through every step to see where cause the error.And anyway thank you very much for your answer!

Sign In or Register to comment.