Cross Species Validation

lordjoe Member
edited July 2012 in Ask the GATK team

I have a number of BAM files that were aligned against hg19 and then run through UnifiedGenotyper to find variants. The data comes from a human/mouse chimera. We have a few dozen high-quality and interesting SNPs. Now I need to verify that the supporting reads are not mouse sequences that happen to align to the human genome. I am not exactly sure how to do this, but I am looking at:
1) Running a walker (LocusWalker??) over the interesting locations - gathering the SamRecords and writing them to a SAM file or just keeping them in memory - the set is not large.
2) Running a ReadWalker over the BAM representing the fit to the mouse genome and seeing if the reads fit a consistent location and if the detected SNPs are present in the mouse reference.
There may be a better way to do this, but I am pretty new to GATK. I suspect writing a couple of custom walkers is the simplest approach. To say the published samples, say CountLocusWalker, are crude is an understatement. Assuming I write my own walkers, how do I access the reference genome in the region of a GenomeLocation, and how do I go from a PileupElement to a SAMRecord?
Or am I doing this all wrong and there is a better way?


  • ebanks Broad Institute Member, Broadie, Dev ✭✭✭✭

    Hi there,

    For the programming question, I'd recommend just looking at other walkers to see what they do. Perhaps something like IndelRealigner, which needs to pull out reference bases that aren't already in the reference context.

    That being said, why can't you use PrintReads with the -L (intervals) argument that is your SNP VCF file? That will output a new bam file with just the reads overlapping your SNPs of interest. Then you can use e.g. BWA to align just those reads to the mouse genome. (Or something like that.)
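
As a rough illustration of the PrintReads-then-realign suggestion above, here is a minimal Python sketch that only builds the three command lines and prints them; it does not run anything. Everything specific in it is an assumption rather than something stated in this thread: the file names are hypothetical, the GATK call assumes the `GenomeAnalysisTK.jar -T PrintReads` style of invocation, the BAM-to-FASTQ step assumes `samtools fastq` is available, and the alignment step assumes a BWA version that provides `bwa mem`. Reads are treated as single-end for simplicity, since selecting reads by SNP position can leave mates behind.

```python
import shlex
from typing import List


def print_reads_cmd(gatk_jar: str, human_ref: str, in_bam: str,
                    snp_vcf: str, out_bam: str) -> List[str]:
    """Build a PrintReads call that keeps only reads overlapping the SNP sites."""
    return [
        "java", "-jar", gatk_jar,
        "-T", "PrintReads",
        "-R", human_ref,   # reference the BAM was aligned to (hg19 here)
        "-I", in_bam,
        "-L", snp_vcf,     # the SNP VCF doubles as the interval list
        "-o", out_bam,
    ]


def bam_to_fastq_cmd(in_bam: str) -> List[str]:
    """Build a samtools call that writes the reads as FASTQ on stdout."""
    return ["samtools", "fastq", in_bam]


def bwa_mouse_cmd(mouse_ref: str, fastq: str) -> List[str]:
    """Build a bwa mem call against a mouse reference (SAM goes to stdout)."""
    return ["bwa", "mem", mouse_ref, fastq]


if __name__ == "__main__":
    # All paths below are hypothetical placeholders.
    steps = [
        (print_reads_cmd("GenomeAnalysisTK.jar", "hg19.fasta", "sample.bam",
                         "interesting_snps.vcf", "snp_reads.bam"), None),
        (bam_to_fastq_cmd("snp_reads.bam"), "snp_reads.fastq"),
        (bwa_mouse_cmd("mouse_ref.fasta", "snp_reads.fastq"), "snp_reads.mouse.sam"),
    ]
    for cmd, stdout_file in steps:
        line = shlex.join(cmd)
        if stdout_file is not None:
            line += " > " + shlex.quote(stdout_file)
        print(line)
```

Reads that align as well or better to the mouse reference than they did to hg19 would be the ones to treat with suspicion; how to score that comparison is left open here, as it is in the reply above.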
