The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

#### ☞ Did you remember to?

1. Search using the upper-right search box, e.g. using the error message.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

#### ☞ Formatting tip!

Surround blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ` ) each to make a code block.
GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

# How does -L or filtering N cigars in earlier steps affect UnifiedGenotyper

University of PennsylvaniaMember Posts: 23

Dear GATK team,

I'm trying to get genotype calls for whole genome data sequenced on Illumina HiSeq. I have ~1.2B reads from one individual and would like to test the effect of varying the number of reads used as input on GATK calls. I first divide the unsorted reads into parcels of ~200M. Then I use either 1 parcel or 2 parcels or 3 parcels and so on as input to the GATK pipeline to simulate having different numbers of reads. When I ran this process on hg18 aligned data, the vcf files increased in size as number of input parcels increased, as expected. However, when I ran the same process on hg19 aligned data, the vcf files sizes were all similar and the loci and read numbers in all the files were similar. The input files to UnifiedGenotyper varied in size as expected so the problem is likely at the last step.

The only difference between the hg18 and hg19 runs was that for hg18, I used the -L target.intervals option at the RealignerTargetCreator step (and later on whenever required) which solved the filter N cigar problem. Somehow that didn't work for the hg19 run and so I had to use the filter N cigar option. Could this affect the genotyping step?

A diagram of the workflow is attached.

Thank you!
Stephanie

Tagged:

Hi Stephanie,

I can't really comment on your experiment as there are many things that could affect the results, like how exactly you're parceling out data and how you're feeding it in at the various steps. If you want to make your life easier, another way to test the effect of number of reads on calling is to vary the downsampling settings. And I would make sure, if you want to compare results, to use exactly the same arguments in your command lines.

Geraldine Van der Auwera, PhD

• University of PennsylvaniaMember Posts: 23
edited October 2013

Hi Geraldine,

Thank you for the reply and for the advice! The documentation says this: "By default, the downsampler is limited to 1000 reads per sample. This value can be adjusted either per-walker or per-run." Considering that I have ~1.27 billion reads (~70x coverage) from one human individual, how does the default downsampler handle the data? Should I use the -dcov option and set it to numbers below 70?

Also, the purpose of down-sampling would be to simulate Illumina sequencing runs with fewer lanes. If this tool "attempts to downsample uniformly across the range spanned by the reads in the pileup," would it make the downsized data have more uniform coverage than if it came from the machine directly?

Thank you!
Stephanie

• University of PennsylvaniaMember Posts: 23

Thank you so much!

• University of PennsylvaniaMember Posts: 23

Sorry, one more question: should the -dfrac option be applied only while using the RealignerTargetCreator tool or at all steps thereafter as well?

• University of PennsylvaniaMember Posts: 23

Yes, thank you again! I'll give it a try.

• University of PennsylvaniaMember Posts: 23

Sorry to ask yet another question, but I was wondering how to use the -L option to address the N cigar operator problem. Apparently N cigar means "skipped region from the reference.*" May I use -L with a file containing the list of chromosomes that are found in the reference file to fix this?

Thanks again!

Hi @stechen,

I'm not sure what you are trying to accomplish. Passing a list of chromosomes via -L will just make the program run on all chromosomes, which is what it does by default anyway. The N cigar signifies an intron. What you may want to do (if the point is to skip intron regions) is pass a list of exons with -L.

Geraldine Van der Auwera, PhD

• University of PennsylvaniaMember Posts: 23

Hi Geraldine,

Ultimately, I'm trying to run GATK on full genome sequence including introns without having to filter out reads with N cigar operators. Is there away to do this?

Thank you,
Stephanie