How does -L or filtering N cigars in earlier steps affect UnifiedGenotyper

stechenstechen University of PennsylvaniaPosts: 23Member

Dear GATK team,

I'm trying to get genotype calls for whole genome data sequenced on Illumina HiSeq. I have ~1.2B reads from one individual and would like to test the effect of varying the number of reads used as input on GATK calls. I first divide the unsorted reads into parcels of ~200M. Then I use either 1 parcel or 2 parcels or 3 parcels and so on as input to the GATK pipeline to simulate having different numbers of reads. When I ran this process on hg18 aligned data, the vcf files increased in size as number of input parcels increased, as expected. However, when I ran the same process on hg19 aligned data, the vcf files sizes were all similar and the loci and read numbers in all the files were similar. The input files to UnifiedGenotyper varied in size as expected so the problem is likely at the last step.

The only difference between the hg18 and hg19 runs was that for hg18, I used the -L target.intervals option at the RealignerTargetCreator step (and later on whenever required) which solved the filter N cigar problem. Somehow that didn't work for the hg19 run and so I had to use the filter N cigar option. Could this affect the genotyping step?

A diagram of the workflow is attached.

Thank you!
Stephanie

Best Answers

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,365Administrator, GATK Developer admin

    Hi Stephanie,

    I can't really comment on your experiment as there are many things that could affect the results, like how exactly you're parceling out data and how you're feeding it in at the various steps. If you want to make your life easier, another way to test the effect of number of reads on calling is to vary the downsampling settings. And I would make sure, if you want to compare results, to use exactly the same arguments in your command lines.

    Geraldine Van der Auwera, PhD

  • stechenstechen University of PennsylvaniaPosts: 23Member
    edited October 2013

    Hi Geraldine,

    Thank you for the reply and for the advice! The documentation says this: "By default, the downsampler is limited to 1000 reads per sample. This value can be adjusted either per-walker or per-run." Considering that I have ~1.27 billion reads (~70x coverage) from one human individual, how does the default downsampler handle the data? Should I use the -dcov option and set it to numbers below 70?

    Also, the purpose of down-sampling would be to simulate Illumina sequencing runs with fewer lanes. If this tool "attempts to downsample uniformly across the range spanned by the reads in the pileup," would it make the downsized data have more uniform coverage than if it came from the machine directly?

    Thank you!
    Stephanie

    Post edited by stechen on
  • stechenstechen University of PennsylvaniaPosts: 23Member

    Thank you so much!

  • stechenstechen University of PennsylvaniaPosts: 23Member

    Sorry, one more question: should the -dfrac option be applied only while using the RealignerTargetCreator tool or at all steps thereafter as well?

  • stechenstechen University of PennsylvaniaPosts: 23Member

    Yes, thank you again! I'll give it a try.

  • stechenstechen University of PennsylvaniaPosts: 23Member

    Sorry to ask yet another question, but I was wondering how to use the -L option to address the N cigar operator problem. Apparently N cigar means "skipped region from the reference.*" May I use -L with a file containing the list of chromosomes that are found in the reference file to fix this?

    Thanks again!

    *http://picard.sourceforge.net/javadoc/net/sf/samtools/CigarOperator.html

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,365Administrator, GATK Developer admin

    Hi @stechen,

    I'm not sure what you are trying to accomplish. Passing a list of chromosomes via -L will just make the program run on all chromosomes, which is what it does by default anyway. The N cigar signifies an intron. What you may want to do (if the point is to skip intron regions) is pass a list of exons with -L.

    Geraldine Van der Auwera, PhD

  • stechenstechen University of PennsylvaniaPosts: 23Member

    Hi Geraldine,

    Ultimately, I'm trying to run GATK on full genome sequence including introns without having to filter out reads with N cigar operators. Is there away to do this?

    Thank you,
    Stephanie

Sign In or Register to comment.