The current GATK version is 3.2-2

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

# ReassignOneMappingQualityFilter

NYPosts: 16Member
edited August 2013

Hi I am using this version of GATK which is java -jar /apps1/gatk/GenomeAnalysisTK-2.6-5-gba531bd/GenomeAnalysisTK.jar

I have obtained my bam files using bowtie as aligner. So the quality score is 255 in the 5th column. I want to use UnifiedGenotyper and when i use it it gives me an error saying MappingQualityUnavailable.

So I see read filter ReassignOneMappingQualityFilter which can convert 255 to 60 and then it would be easily taken by UnifiedGenotyper to do the snp calling.

So what command line should i be using

I have just started to explore GATK. Hope to hear from you soon

Regards Varun

Post edited by gupta567varun on
Tagged:

Hi Varun,

To use a read filter, you don't call the filter directly, you add it to the command of another tool. In your case, the simplest way to proceed would be to use PrintReads with the read filter, like this:

java -jar GenomeAnalysisTK.jar -T PrintReads -R reference.fa -I input.bam -o output.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60


This will tell the PrintReads tool to make a copy of your input file, and apply the read filter to adjust the MAPQ values. This will create a new file called output.bam with the adjusted MAPQ values.

Geraldine Van der Auwera, PhD

• NYPosts: 16Member
edited August 2013

Hi Geraldine So I used your command to generate the new output.bam file.

I used UnifiedGenotyper on this new output.bam and got raw vcf files with around 9460 snps.

I also used this command

java -jar GenomeAnalysisTK.jar -nt 6 -T UnifiedGenotyper -R /GENOMES/sacCer3.fa -I sacCer3.alignment.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -dcov 200 -o snp.vcf

The bam file sacCer3.alignment.bam is basically the one which needs to be converted to score of 60 and we converted it and named output.bam

When I run this command, my vcf output is created BUT when I compare number of snps obtained in this case with previous one where I ran output.bam with UnifiedGenotyper I get around 9448 snps.

Why is there a difference in number of snps??

Regards Varun

Post edited by gupta567varun on

That is also an acceptable way to use the read filter.

The small difference in number of called snps is because you are using multithreading (the nt argument). This causes some marginal calls to be gained or lost because of differences in the downsampling seeds. It is expected.

However what is not so good is that you are skipping indel realignment and base quality score recalibration. You should perform those steps before calling variants, according to our Best Practices.

Geraldine Van der Auwera, PhD

• NYPosts: 16Member

Since I am still getting hold of this let me describe you the way I understand it and should be done and then probably you can give me your help(with inputs)

1.Since I have my bam file with 255 i should first convert it using ReassignOneMappingQuality read filter.

Command

java -jar GenomeAnalysisTK.jar -T PrintReads -R reference.fa -I input.bam -o output.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -

2.After this it should be indel realignment i guess http://www.broadinstitute.org/gatk/guide/tagged?tag=indel-realignment

I don't have the known indel snp.vcf file for cerevisiae so have to skip that..

Lets say after indel realignment we get bam file named as test.bam

3.After this i will have to do BQSR

Can you give me the command line for it . I am looking it though.

4.Then i can just call UnifiedGenotyper.

Let me know if i miss something

Regards

Varun

Hi Varun,

Hopefully they will help you understand what needs to be done and how. Note that you don't need to skip indel realignment; it will work even if you don't have a list of known indels.

Geraldine Van der Auwera, PhD

• NYPosts: 16Member
edited August 2013

Hi Geraldine So i started following the indel realignment and base quality score recalibration steps.

I did indel realignment on my bam file but it gave me

INFO  12:14:47,507 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 12:14:47,510 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.00
INFO  12:15:13,492 ProgressMeter -            done        1.22e+07   26.0 s        2.0 s    100.0%        26.0 s     0.0 s
INFO  12:15:13,492 ProgressMeter - Total runtime 26.01 secs, 0.43 min, 0.01 hours
INFO  12:15:13,492 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 7591116 total reads (0.00%)
INFO  12:15:13,493 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter
INFO  12:15:13,493 MicroScheduler -   -> 0 reads (0.00% of total) failing BadMateFilter
INFO  12:15:13,493 MicroScheduler -   -> 0 reads (0.00% of total) failing DuplicateReadFilter
INFO  12:15:13,493 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO  12:15:13,493 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO  12:15:13,494 MicroScheduler -   -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
INFO  12:15:13,494 MicroScheduler -   -> 0 reads (0.00% of total) failing MappingQualityZeroFilter
INFO  12:15:13,494 MicroScheduler -   -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter
INFO  12:15:13,494 MicroScheduler -   -> 0 reads (0.00% of total) failing Platform454Filter
INFO  12:15:13,494 MicroScheduler -   -> 0 reads (0.00% of total) failing UnmappedReadFilter
INFO  12:15:14,277 GATKRunReport - Uploaded run statistics report to AWS S3

So my file realigner.intervals was empty but it ran properly.

Did the next step involving IndelRealigner

INFO  12:20:36,935 ProgressMeter -        Location processed.reads  runtime per.1M.reads completed total.runtime remaining
INFO  12:20:36,989 ReadShardBalancer$1 - Loading BAM index data INFO 12:20:36,990 ReadShardBalancer$1 - Done loading BAM index data
INFO  12:21:06,948 ProgressMeter -    chrIX:294800        1.59e+06   30.0 s       18.0 s     26.2%       114.0 s    84.0 s
INFO  12:21:36,960 ProgressMeter -  chrVIII:157854        3.32e+06   60.0 s       18.0 s     45.4%         2.2 m    72.0 s
INFO  12:22:06,971 ProgressMeter -   chrXII:463169        5.05e+06   90.0 s       17.0 s     64.1%         2.3 m    50.0 s
INFO  12:22:36,983 ProgressMeter -    chrXV:454498        6.76e+06  120.0 s       17.0 s     87.0%         2.3 m    17.0 s
INFO  12:23:06,994 ProgressMeter -   chrXVI:948058        7.47e+06    2.5 m       20.0 s    100.0%         2.5 m     0.0 s
INFO  12:23:13,071 ProgressMeter -            done        8.64e+06    2.6 m       18.0 s    100.0%         2.6 m     0.0 s
INFO  12:23:13,071 ProgressMeter - Total runtime 156.14 secs, 2.60 min, 0.04 hours
INFO  12:23:13,160 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 8637506 total reads (0.00%)
INFO  12:23:13,160 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO  12:23:13,928 GATKRunReport - Uploaded run statistics report to AWS S3

So I guess since it ran properly there was no indel realignment.

I ran BQSR step but it requires a vcf file for snp to run. I am working on yeast and it does not have that.

My question is "Is there a dbsnp vcf file for S.cerevisiae??" and if no then is there any other way to run BQSR without it??

Hope to hear from you soon

Regards Varun

Post edited by gupta567varun on

Hi Varun,

That realignment step seems to have run awfully fast, I'd like to check your command line if you don't mind posting it. Is the genome you're working with very small?

For the BQSR step, I don't know if there is a dbsnp file available for S. cerevisiae; we only deal with humans. I recommend you search online to make sure. If there is not, then you can generate a set yourself; have a look at our documentation for details on how to do this.

Geraldine Van der Auwera, PhD

• NYPosts: 16Member

Hi Geraldine

Command line

java -jar /apps1/gatk/GenomeAnalysisTK-2.6-5-gba531bd/GenomeAnalysisTK.jar -nt 4 -T RealignerTargetCreator -R /cork/vgupta12/S.cerevisiae/GENOMES/sacCer3.fa -I output.bam -o realigner.intervals

and then

java -jar /apps1/gatk/GenomeAnalysisTK-2.6-5-gba531bd/GenomeAnalysisTK.jar -nt 4 -T IndelRealigner -R /cork/vgupta12/S.cerevisiae/GENOMES/sacCer3.fa -I output.bam -targetIntervals realigner.intervals -o realigned.bam

Yeah yeast genome is very very small 12M in size.

Well that all looks good! Sorry to doubt you, but we don't see those kinds of speeds on our usual human data, so I wanted to make sure.

Geraldine Van der Auwera, PhD

• NYPosts: 16Member

Hi Geraldine

I started using bwa as my aligner and then want to do GATK for my INDELS and SNP's.

When I map the raw reads with bwa-0.7.5a using bwa mem algorithm, i get bam file with reads which does not have read sequence, instead have a *. Ofcourse this will create problem downstream, so I want to remove them. But first question , why would bwa report such reads in the bam file. Secondly I used this command to PRINT READS without * in the seq column but it is giving me error.

It gave me this error

##### ERROR ------------------------------------------------------------------------------------------

can you see what is happening

Regards Varun

Hi Varun,

I can't answer your first question because bwa is not our tool. If you have any problems with it you should ask the authors of bwa for support.

To your second question, this is a minor bug in GATK that affects how read filters are handled. The MalformedRead filter is applied by default, so when you specify it at the command line, it ends up listed twice in the internal arguments list. This causes the error you see. Just remove the filter from your command line and it should work fine. The filter will be applied automatically.

Geraldine Van der Auwera, PhD

• NYPosts: 16Member

Hi Geraldine But I have to remove those reads which have * in the read seq column. So what should be my command line?? I did this and it gave me an error java -jar /apps1/gatk/GenomeAnalysisTK-2.6-5-gba531bd/GenomeAnalysisTK.jar -T PrintReads -R /cork/GENOMES/sacCer3.fa -I final_sorted.bam -o output.bam

INFO 11:33:54,623 HelpFormatter - -------------------------------------------------------------------------------- INFO 11:33:54,627 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.6-5-gba531bd, Compiled 2013/07/18 18:05:31 INFO 11:33:54,627 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 11:33:54,627 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 11:33:54,635 HelpFormatter - Program Args: -T PrintReads -R /cork/vgupta12/S.cerevisiae/GENOMES/sacCer3.fa -I final_sorted.bam -o output.bam INFO 11:33:54,635 HelpFormatter - Date/Time: 2014/04/08 11:33:54 INFO 11:33:54,635 HelpFormatter - -------------------------------------------------------------------------------- INFO 11:33:54,635 HelpFormatter - -------------------------------------------------------------------------------- INFO 11:33:55,495 GenomeAnalysisEngine - Strictness is SILENT INFO 11:33:55,642 GenomeAnalysisEngine - Downsampling Settings: No downsampling INFO 11:33:55,657 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 11:33:55,685 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.03 INFO 11:33:55,814 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files INFO 11:33:55,823 GenomeAnalysisEngine - Done preparing for traversal INFO 11:33:55,823 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 11:33:55,824 ProgressMeter - Location processed.reads runtime per.1M.reads completed total.runtime remaining INFO 11:33:55,829 ReadShardBalancer$1 - Loading BAM index data INFO 11:33:55,831 ReadShardBalancer$1 - Done loading BAM index data INFO 11:33:56,924 GATKRunReport - Uploaded run statistics report to AWS S3

##### ERROR ------------------------------------------------------------------------------------------

Regards Varun

• NYPosts: 16Member

Thanks Geraldine

That worked like a charm

Also now I would want to run Unified genotyper for getting snps and indels(Already got the realign intervals for indels).

I am using this command java -jar /apps1/gatk/GenomeAnalysisTK-2.6-5-gba531bd/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /cork/vgupta12/S.cerevisiae/GENOMES/sacCer3.fa -glm BOTH -I realigned.bam -dcov 200 -ploidy 1 -o snpcall.vcf

I am using ploidy 1 since my genome which happens to be yeast is haploid. Can you tell me the use of dcov here?? What is the default value and what does it mean.

Thanks for all the help

Regards

Varun