The current GATK version is 3.8-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Get notifications!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

Got a problem?

1. Search using the upper-right search box, e.g. using the error message.
2. Try the latest version of tools.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

Did we ask for a bug report?

Then follow instructions in Article#1894.

Formatting tip!

Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ``` ) each to make a code block as demonstrated here.

Jump to another community
Download the latest Picard release at
GATK version 4.beta.3 (i.e. the third beta release) is out. See the GATK4 beta page for download and details.

How does -L or filtering N cigars in earlier steps affect UnifiedGenotyper

stechenstechen University of PennsylvaniaMember

Dear GATK team,

I'm trying to get genotype calls for whole genome data sequenced on Illumina HiSeq. I have ~1.2B reads from one individual and would like to test the effect of varying the number of reads used as input on GATK calls. I first divide the unsorted reads into parcels of ~200M. Then I use either 1 parcel or 2 parcels or 3 parcels and so on as input to the GATK pipeline to simulate having different numbers of reads. When I ran this process on hg18 aligned data, the vcf files increased in size as number of input parcels increased, as expected. However, when I ran the same process on hg19 aligned data, the vcf files sizes were all similar and the loci and read numbers in all the files were similar. The input files to UnifiedGenotyper varied in size as expected so the problem is likely at the last step.

The only difference between the hg18 and hg19 runs was that for hg18, I used the -L target.intervals option at the RealignerTargetCreator step (and later on whenever required) which solved the filter N cigar problem. Somehow that didn't work for the hg19 run and so I had to use the filter N cigar option. Could this affect the genotyping step?

A diagram of the workflow is attached.

Thank you!

Best Answers


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi Stephanie,

    I can't really comment on your experiment as there are many things that could affect the results, like how exactly you're parceling out data and how you're feeding it in at the various steps. If you want to make your life easier, another way to test the effect of number of reads on calling is to vary the downsampling settings. And I would make sure, if you want to compare results, to use exactly the same arguments in your command lines.

  • stechenstechen University of PennsylvaniaMember
    edited October 2013

    Hi Geraldine,

    Thank you for the reply and for the advice! The documentation says this: "By default, the downsampler is limited to 1000 reads per sample. This value can be adjusted either per-walker or per-run." Considering that I have ~1.27 billion reads (~70x coverage) from one human individual, how does the default downsampler handle the data? Should I use the -dcov option and set it to numbers below 70?

    Also, the purpose of down-sampling would be to simulate Illumina sequencing runs with fewer lanes. If this tool "attempts to downsample uniformly across the range spanned by the reads in the pileup," would it make the downsized data have more uniform coverage than if it came from the machine directly?

    Thank you!

  • stechenstechen University of PennsylvaniaMember

    Thank you so much!

  • stechenstechen University of PennsylvaniaMember

    Sorry, one more question: should the -dfrac option be applied only while using the RealignerTargetCreator tool or at all steps thereafter as well?

  • stechenstechen University of PennsylvaniaMember

    Yes, thank you again! I'll give it a try.

  • stechenstechen University of PennsylvaniaMember

    Sorry to ask yet another question, but I was wondering how to use the -L option to address the N cigar operator problem. Apparently N cigar means "skipped region from the reference.*" May I use -L with a file containing the list of chromosomes that are found in the reference file to fix this?

    Thanks again!


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi @stechen,

    I'm not sure what you are trying to accomplish. Passing a list of chromosomes via -L will just make the program run on all chromosomes, which is what it does by default anyway. The N cigar signifies an intron. What you may want to do (if the point is to skip intron regions) is pass a list of exons with -L.

  • stechenstechen University of PennsylvaniaMember

    Hi Geraldine,

    Ultimately, I'm trying to run GATK on full genome sequence including introns without having to filter out reads with N cigar operators. Is there away to do this?

    Thank you,

Sign In or Register to comment.