Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Attention:
We will be out of the office on November 11th and 13th 2019, due to the U.S. holiday(Veteran's day) and due to a team event(Nov 13th). We will return to monitoring the GATK forum on November 12th and 14th respectively. Thank you for your patience.

Identifying contaminant reads in MergeBamAlignment

I noticed that several of my WGS runs in GATK4's MergeBamAlignment step had high numbers of contaminant reads. A snippet of one of the stderr logs is shown below:

INFO    2018-07-02 02:12:29 AbstractAlignmentMerger 86389692 Reads have been unmapped due to being suspected of being Cross-species contamination.
INFO    2018-07-02 02:12:36 AbstractAlignmentMerger Wrote 689846082 alignment records and 119640074 unmapped reads.
[Mon Jul 02 02:12:36 UTC 2018] picard.sam.MergeBamAlignment done. Elapsed time: 434.47 minutes.
Runtime.totalMemory()=3040870400
Using GATK jar /gatk/build/libs/gatk-package-4.0.4.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Dsamjdk.compression_level=5 -Xms3000m -jar /gatk/build/libs/gatk-package-4.0.4.0-local.jar MergeBamAlignment --VALIDATION_STRINGENCY SILENT --EXPECTED_ORIENTATIONS FR --ATTRIBUTES_TO_RETAIN X0 --ALIGNED_BAM /cromwell_root/fc-4449151c-8501-4474-a203-83d0c4dbd051/6d545972-e762-4770-9e8c-7703f11301dc/PreProcessingForVariantDiscovery_GATK4/cc708444-df39-4766-aec7-92536ad81ea7/call-SamToFastqAndBwaMem/shard-0/916_3063_1_3_h5nlwalxx_3.unmapped.unmerged.bam --UNMAPPED_BAM /cromwell_root/fc-4449151c-8501-4474-a203-83d0c4dbd051/uBAMs/916_3063_1_3/916_3063_1_3_h5nlwalxx_3.unmapped.bam --OUTPUT 916_3063_1_3_h5nlwalxx_3.unmapped.aligned.unsorted.bam --REFERENCE_SEQUENCE /cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.fasta --PAIRED_RUN true --SORT_ORDER unsorted --IS_BISULFITE_SEQUENCE false --ALIGNED_READS_ONLY false --CLIP_ADAPTERS false --MAX_RECORDS_IN_RAM 2000000 --ADD_MATE_CIGAR true --MAX_INSERTIONS_OR_DELETIONS -1 --PRIMARY_ALIGNMENT_STRATEGY MostDistant --PROGRAM_RECORD_ID bwamem --PROGRAM_GROUP_VERSION 0.7.15-r1140 --PROGRAM_GROUP_COMMAND_LINE bwa mem -K 100000000 -p -v 3 -t 16 -Y /cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.fasta --PROGRAM_GROUP_NAME bwamem --UNMAPPED_READ_STRATEGY COPY_TO_TAG --ALIGNER_PROPER_PAIR_FLAGS true --UNMAP_CONTAMINANT_READS true

I would like to be able to extract these contaminant reads and BLAST them to investigate where they came from. However, I didn't see a way to pull out these contaminant reads using command line options, so I tried parsing the bam files.

To extract unmapped reads, I used samtools to filter for the '0x4' flag in the analysis ready bam file. The reads I pulled out from this process seemed to be reads with 0 mapping quality, and not necessarily contaminant reads. After some more digging, I found out that reads are marked for contamination when they are 1) soft clipped on both ends and 2) align with less than 32 bases. As a result, I wrote a simple "contaminant-finder" that parses my analysis ready bam file, and here's an example of a read I personally marked as contaminant:

E00489:73:H7VLWALXX:2:1121:2290:14635   83  chr1    10166   6   95S11M44S   =   10159   -18 ACCTAAACCTAACCCTAGCCCTAAACCTAACCCTAACTCTAACCCTATCCTAACCCTACCCCTAACCTAACCCTACCCCCCAACCCTAACCCTAAACCTAACCCTAGCCAAAAGCCTAAACCTAACCCTAGCCCTAGCCCTAGCCAAAAC  5?????5?????????????????????5????????????5??????????5??5?5+??????????5?5???+???????????5???????5????????55???5??5??????5?????????????????????????????5  MC:Z:18M132S    MD:Z:0C10   PG:Z:MarkDuplicates RG:Z:H7VLWALXX.1    NM:i:1  MQ:i:10 UQ:i:12 AS:i:42

According to the cigar string "95S11M44S", this read clearly matches the criteria for contamination. Strangely, the flag "83" does not have the '0x4' bit marked, which means it was not unmapped according to SAM file format specifications.

In summary, here are my questions:
1. How can I extract reads marked as contaminant from my analysis ready bam file (produced after alignment, duplicate marking, and BQSR)?
2. How exactly does MergeBamAlignment "mark" reads as contaminant without raising the '0x4' bit?

Thank you for your time!
Lee

Best Answer

  • lraolrao
    edited July 2018 Accepted Answer

    Quite the opposite actually; most of my samples have small amounts of contamination, with a couple of files have as high as 11% of reads marked as contaminant, such as the example in the stderr log of my original post.

    I think I've solved my problem though. For whatever reason, it looks like contaminant reads are being tagged with CO:Z:Cross-species contamination, rather than FT. Here's an example:

    E00489:73:H7VLWALXX:8:2224:9993:72701   95      *       0       0       16S22M112S      *       0       0       TTGCATGGGCAGCTGGCTTTCCCTTCTCTTCCCTTCCTGGGAATACTTGAGAGGTTCAGGGTAGAGAATGGGAGTTGAGGTATCCAGGGGGGCTGGATGCTGTGCTTGTCTTTCCCTAAACTTGCTGTGTGAGTTAGTGAGTTGAGTGGC     ??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????  PA:Z:chr7,136517937,16S22M112S,0,0;        MD:Z:22 PG:Z:MarkDuplicates     RG:Z:H7VLWALXX.1        NM:i:0  CO:Z:Cross-species contamination        AS:i:22
    

    I initially missed these reads when I was scanning through the 0x4-filtered sam file because the first occurrence happens at the 2-millionth line. Woops.

    Thank you for the help!
    Lee

Answers

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @lrao,

    I document MergeBamAlignment's UNMAP_CONTAMINANT_READS=true option in Article#7899. Here is the relevant passage:

    The UNMAP_CONTAMINANT_READS=true option applies to likely cross-species contamination, e.g. bacterial contamination. MergeBamAlignment identifies reads that are (i) softclipped on both ends and (ii) map with less than 32 basepairs as contaminant. For a similar feature in GATK, see OverclippedReadFilter. If MergeBamAlignment determines a read is contaminant, then the mate is also considered contaminant. MergeBamAlignment unmaps the pair of reads by (i) setting the 0x4 flag bit, (ii) replacing column 3's contig name with an asterisk *, (iii) replacing columns 4 and 5 (POS and MAPQ) with zeros, and (iv) adding the FT tag to indicate the reason for unmapping the read, e.g. FT:Z:Cross-species contamination. The records retain their CIGAR strings. Note other processes also use the FT tag, e.g. to indicate reasons for setting the QCFAIL 0x200 flag bit, and will use different tag descriptions.

    You should be able to grep out contaminant reads with FT:Z:. You should be able to pull out all unmapped reads (whether due to aligner or by being unmapped subsequently) with PrintReads -L unmapped.

  • lraolrao Member
    edited July 2018

    Hi @shlee,

    Thank you for the article link! I'm not sure how I missed that the first time around.

    However, I am unable to find contaminant reads in my analysis ready sam file after grepping for "FT". When I look through a sam file that filtered for reads with the '0x4' flag set, there are no cigar strings in their "normal" column (just an "=", although the mate cigar string still exists in the optional fields). Moreover, all contig names and positions are still present, which seems to suggest that none of these are contaminant reads according to your linked article. Here's an example of a typical read from this '0x4' filtered sam file:

    E00489:73:H7VLWALXX:3:2121:22546:29613  1093    chr1    629529  0       *       =       629529  0       TGTTGCTGCTTCAGTTGATCGTGGGTTTTTTTTGTTGATTAGTATGGGGATAATTGCTAGTAGGCTGAATTCCAGGCCTACTCATATTAGTATTAGGTTGGTGCTGGATATTGTGATTACAGGACCTAAGAAGATTGTGAAGTAGATGAT     ????????????????????5?????????????????5?????????????????????????+?????????????????????????????????????????????????????????????????????????????????????  MC:Z:20S27M1D15M1I49M1I27M10S      PG:Z:MarkDuplicates     RG:Z:H7VLWALXX.1        MQ:i:4
    

    Since I noticed that the bam file size became considerably smaller after the SortAndFixTags step, to really make sure no contaminant reads were being filtered out, I checked the sam file created directly from MergeBamAlignment, and still saw no signs of the FT tag.

    So, here's my new set of questions:
    1. Why are there no FT tags in my sam files? I noticed that the article you linked is for GATK3, and the FireCloud pipeline I'm running these analyses on is GATK4; perhaps the documentation is slightly outdated?
    2. (less important) My aligned, unsorted, duplicate marked BAM file is 90GB prior to SortAndFixTags, but becomes 66GB in the output file. The article you linked says SortAndFixTags "sorts reads by coordinate and then (ii) corrects the NM tag values", and makes no mention of any sort of filtering. If this tool is just sorting and changing tags, why does the file size get massively reduced?

    Let me know if there's any more information I can provide to help you understand this issue, such as privately sharing one of my output bams.

    Best,
    Lee

    Issue · Github
    by shlee

    Issue Number
    3127
    State
    open
    Last Updated
  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
    edited July 2018

    Hi @lrao,

    There are many types of unmapped reads, not just those that are unmapped by the contamination filter. I talk about some of them in Dictionary Article#11039. I also write about how MergeBamAlignment handles a type of unmapped read in Article#6483. Documents are what they are--appropriate for the tools being documented at the time of writing. Changes and updates are typically noted in the Comments section.

    Remember that there are ten different levels of BAM compression. GATK3 and Samtools set this level to 5 by default. In GATK4, depending on the version, last I checked this compression level was set to either 1 or 2. Please check your pipeline tasks for versions of tools and compression settings. Look for parameters such as --COMPRESSION_LEVEL 2, --USE_JDK_DEFLATER false and --USE_JDK_INFLATER false. The latter two options relate to performance and come with their own compression levels.

    As for not seeing any MergeBamAlignment contaminant filtered reads, again, be sure to check the default settings and your command settings for the version of the tool you are using. For example, I'm looking at the Picard version available in GATK4.0.5.1 and I see:

    --UNMAP_CONTAMINANT_READS,-UNMAP_CONTAM:Boolean
                                  Detect reads originating from foreign organisms (e.g. bacterial DNA in a non-bacterial
                                  sample),and unmap + label those reads accordingly.  Default value: false. Possible values:
                                  {true, false} 
    
    --UNMAPPED_READ_STRATEGY:UnmappingReadStrategy
                                  How to deal with alignment information in reads that are being unmapped (e.g. due to
                                  cross-species contamination.) Currently ignored unless UNMAP_CONTAMINANT_READS = true 
                                  Default value: DO_NOT_CHANGE. Possible values: {COPY_TO_TAG, DO_NOT_CHANGE, MOVE_TO_TAG} 
    

    So it looks like unmapping contaminant reads is no longer on by default. I discuss some implications of these contaminant reads in section 6 of Article#8017.

    I hope this is helpful.

  • lraolrao Member
    edited July 2018

    Hi @shlee,

    I checked my WDL file, and the compression level remains at 5 throughout the script, and there's no mention of JDK_INFLATER/DEFLATER. However, I did find an answer to my question here; sorting by coordinates results in a more compressible format, because each block that is compressed will contain more similar sequences.

    Back to the contaminants issue though. Although the default is false, the WDL script I'm running sets --UNMAP_CONTAMINANT_READS true, and this is also present in the stderr log of my original post. Any other ideas as to why there are no FT tags?

    Thanks for the useful article links!
    Lee

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Is it possible @lrao that your samples are just the ideal case of being contaminant-free? You can test your pipeline with the small data provided in Article#8017 to make sure it generates contaminant reads that the tutorial mentions.

  • lraolrao Member
    edited July 2018 Accepted Answer

    Quite the opposite actually; most of my samples have small amounts of contamination, with a couple of files have as high as 11% of reads marked as contaminant, such as the example in the stderr log of my original post.

    I think I've solved my problem though. For whatever reason, it looks like contaminant reads are being tagged with CO:Z:Cross-species contamination, rather than FT. Here's an example:

    E00489:73:H7VLWALXX:8:2224:9993:72701   95      *       0       0       16S22M112S      *       0       0       TTGCATGGGCAGCTGGCTTTCCCTTCTCTTCCCTTCCTGGGAATACTTGAGAGGTTCAGGGTAGAGAATGGGAGTTGAGGTATCCAGGGGGGCTGGATGCTGTGCTTGTCTTTCCCTAAACTTGCTGTGTGAGTTAGTGAGTTGAGTGGC     ??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????  PA:Z:chr7,136517937,16S22M112S,0,0;        MD:Z:22 PG:Z:MarkDuplicates     RG:Z:H7VLWALXX.1        NM:i:0  CO:Z:Cross-species contamination        AS:i:22
    

    I initially missed these reads when I was scanning through the 0x4-filtered sam file because the first occurrence happens at the 2-millionth line. Woops.

    Thank you for the help!
    Lee

Sign In or Register to comment.