If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Identifying contaminant reads in MergeBamAlignment
I noticed that several of my WGS runs in GATK4's MergeBamAlignment step had high numbers of contaminant reads. A snippet of one of the stderr logs is shown below:
INFO 2018-07-02 02:12:29 AbstractAlignmentMerger 86389692 Reads have been unmapped due to being suspected of being Cross-species contamination. INFO 2018-07-02 02:12:36 AbstractAlignmentMerger Wrote 689846082 alignment records and 119640074 unmapped reads. [Mon Jul 02 02:12:36 UTC 2018] picard.sam.MergeBamAlignment done. Elapsed time: 434.47 minutes. Runtime.totalMemory()=3040870400 Using GATK jar /gatk/build/libs/gatk-package-126.96.36.199-local.jar Running: java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Dsamjdk.compression_level=5 -Xms3000m -jar /gatk/build/libs/gatk-package-188.8.131.52-local.jar MergeBamAlignment --VALIDATION_STRINGENCY SILENT --EXPECTED_ORIENTATIONS FR --ATTRIBUTES_TO_RETAIN X0 --ALIGNED_BAM /cromwell_root/fc-4449151c-8501-4474-a203-83d0c4dbd051/6d545972-e762-4770-9e8c-7703f11301dc/PreProcessingForVariantDiscovery_GATK4/cc708444-df39-4766-aec7-92536ad81ea7/call-SamToFastqAndBwaMem/shard-0/916_3063_1_3_h5nlwalxx_3.unmapped.unmerged.bam --UNMAPPED_BAM /cromwell_root/fc-4449151c-8501-4474-a203-83d0c4dbd051/uBAMs/916_3063_1_3/916_3063_1_3_h5nlwalxx_3.unmapped.bam --OUTPUT 916_3063_1_3_h5nlwalxx_3.unmapped.aligned.unsorted.bam --REFERENCE_SEQUENCE /cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.fasta --PAIRED_RUN true --SORT_ORDER unsorted --IS_BISULFITE_SEQUENCE false --ALIGNED_READS_ONLY false --CLIP_ADAPTERS false --MAX_RECORDS_IN_RAM 2000000 --ADD_MATE_CIGAR true --MAX_INSERTIONS_OR_DELETIONS -1 --PRIMARY_ALIGNMENT_STRATEGY MostDistant --PROGRAM_RECORD_ID bwamem --PROGRAM_GROUP_VERSION 0.7.15-r1140 --PROGRAM_GROUP_COMMAND_LINE bwa mem -K 100000000 -p -v 3 -t 16 -Y /cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.fasta --PROGRAM_GROUP_NAME bwamem --UNMAPPED_READ_STRATEGY COPY_TO_TAG --ALIGNER_PROPER_PAIR_FLAGS true --UNMAP_CONTAMINANT_READS true
I would like to be able to extract these contaminant reads and BLAST them to investigate where they came from. However, I didn't see a way to pull out these contaminant reads using command line options, so I tried parsing the bam files.
To extract unmapped reads, I used samtools to filter for the '0x4' flag in the analysis ready bam file. The reads I pulled out from this process seemed to be reads with 0 mapping quality, and not necessarily contaminant reads. After some more digging, I found out that reads are marked for contamination when they are 1) soft clipped on both ends and 2) align with less than 32 bases. As a result, I wrote a simple "contaminant-finder" that parses my analysis ready bam file, and here's an example of a read I personally marked as contaminant:
E00489:73:H7VLWALXX:2:1121:2290:14635 83 chr1 10166 6 95S11M44S = 10159 -18 ACCTAAACCTAACCCTAGCCCTAAACCTAACCCTAACTCTAACCCTATCCTAACCCTACCCCTAACCTAACCCTACCCCCCAACCCTAACCCTAAACCTAACCCTAGCCAAAAGCCTAAACCTAACCCTAGCCCTAGCCCTAGCCAAAAC 5?????5?????????????????????5????????????5??????????5??5?5+??????????5?5???+???????????5???????5????????55???5??5??????5?????????????????????????????5 MC:Z:18M132S MD:Z:0C10 PG:Z:MarkDuplicates RG:Z:H7VLWALXX.1 NM:i:1 MQ:i:10 UQ:i:12 AS:i:42
According to the cigar string "95S11M44S", this read clearly matches the criteria for contamination. Strangely, the flag "83" does not have the '0x4' bit marked, which means it was not unmapped according to SAM file format specifications.
In summary, here are my questions:
1. How can I extract reads marked as contaminant from my analysis ready bam file (produced after alignment, duplicate marking, and BQSR)?
2. How exactly does MergeBamAlignment "mark" reads as contaminant without raising the '0x4' bit?
Thank you for your time!