Bug Bulletin: we have identified a bug that affects indexing when producing gzipped VCFs. This will be fixed in the upcoming 3.2 release; in the meantime you need to reindex gzipped VCFs using Tabix.

UnifiedGenotyper functions available when using the -sample_ploidy option

shelbirussellshelbirussell Posts: 3Member

Hi,

I trying to detect variants in a population of haploid genomes with UnifiedGenotyper, using the -sample_ploidy option to specify the number of haploid genomes in the pool. I would like to be able to specify other options, but so far the ones I have tried do not seem to work. I have tried: -out_mode EMIT_ALL_SITES --sample_ploidy 50 -minIndelFrac 0.05 --genotype_likelihoods_model BOTH -pnrm EXACT_GENERAL_PLOIDY. Could you please tell me which, if any, of the other commands are available when calling variants on non-diploid samples? Specifically, I would really like to be able to call indels too.

Thanks for your time and help.

Shelbi

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,276Administrator, GSA Member admin

    Hi Shelbi,

    Could you tell me what you mean when you say it's not working? Are you getting an error or just no variant calls?

    Geraldine Van der Auwera, PhD

  • shelbirussellshelbirussell Posts: 3Member
    edited February 2013

    Hi Geraldine,

    Sorry I was vague. I didn't get any variant calls and the program didn't run very long. The output summary at the end seemed truncated as well. I pasted one of the outputs below. I am running the program right now using only the -sample_ploidy option, and it has been running for nearly two hours.

    Shelbi

    Your job looked like:

    ------------------------------------------------------------
    # LSBATCH: User input
    java -Xmx2g -jar /n/sw/GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /n/nobackup2/cavanaugh/shelbirussell/alignments/reference/Solemya_velum_sym_genome_all_scaffolds.fasta -I Unamp_DML1gill_realigned.bam -o Unamp_DML1_snps_raw.vcf -out_mode EMIT_ALL_SITES --sample_ploidy 50 -minIndelFrac 0.05 --genotype_likelihoods_model BOTH -pnrm EXACT_GENERAL_PLOIDY -pcr_error 1.0E-20 -refsample WHOI_Svelum_sym_metagenome
    ------------------------------------------------------------
    
    Successfully completed.
    
    Resource usage summary:
    
       CPU time   :    298.04 sec.
       Max Memory :       778 MB
       Max Swap   :      2566 MB
    
       Max Processes  :         3
       Max Threads    :        24
    
    The output (if any) follows:
    
    INFO  19:38:17,044 HelpFormatter - -------------------------------------------------------------------------------- 
    INFO  19:38:17,047 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.3-9-ge5ebf34, Compiled 2013/01/11 22:43:14 
    INFO  19:38:17,047 HelpFormatter - Copyright (c) 2010 The Broad Institute 
    INFO  19:38:17,047 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
    INFO  19:38:17,051 HelpFormatter - Program Args: -T UnifiedGenotyper -R /n/nobackup2/cavanaugh/shelbirussell/alignments/reference/Solemya_velum_sym_genome_all_scaffolds.fasta -I Unamp_DML1gill_realigned.bam -o Unamp_DML1_snps_raw.vcf -out_mode EMIT_ALL_SITES --sample_ploidy 50 -minIndelFrac 0.05 --genotype_likelihoods_model BOTH -pnrm EXACT_GENERAL_PLOIDY -pcr_error 1.0E-20 -refsample WHOI_Svelum_sym_metagenome 
    INFO  19:38:17,051 HelpFormatter - Date/Time: 2013/02/21 19:38:17 
    INFO  19:38:17,060 HelpFormatter - -------------------------------------------------------------------------------- 
    INFO  19:38:17,060 HelpFormatter - -------------------------------------------------------------------------------- 
    INFO  19:38:17,146 GenomeAnalysisEngine - Strictness is SILENT 
    INFO  19:38:17,681 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250, Using the new downsampling implementation 
    INFO  19:38:17,687 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
    INFO  19:38:17,741 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.05 
    INFO  19:38:17,794 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
    INFO  19:38:17,794 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
    WARN  19:38:17,909 UnifiedGenotyper - WARNING: note that the EMIT_ALL_SITES option is intended only for point mutations (SNPs) in DISCOVERY mode or generally when running in GENOTYPE_GIVEN_ALLELES mode; it will by no means produce a comprehensive set of indels in DISCOVERY mode 
    INFO  19:38:47,806 ProgressMeter - Contig10580:31076        2.99e+05   30.0 s      100.4 s     10.5%         4.7 m     4.2 m 
    INFO  19:39:17,830 ProgressMeter - Contig10580:189787        4.63e+05   60.0 s        2.2 m     15.9%         6.3 m     5.3 m 
    INFO  19:39:47,840 ProgressMeter - Contig10580:356240        6.27e+05   90.0 s        2.4 m     21.5%         7.0 m     5.5 m 
    INFO  19:40:17,849 ProgressMeter - Contig10580:573285        8.40e+05    2.0 m        2.4 m     28.8%         7.0 m     5.0 m 
    INFO  19:40:47,858 ProgressMeter - Contig10580:763041        1.04e+06    2.5 m        2.4 m     35.1%         7.1 m     4.6 m 
    INFO  19:41:17,912 ProgressMeter - Contig10583:38307        1.21e+06    3.0 m        2.5 m     40.8%         7.4 m     4.4 m 
    INFO  19:41:47,922 ProgressMeter - Contig10583:238769        1.41e+06    3.5 m        2.5 m     47.5%         7.4 m     3.9 m 
    INFO  19:42:17,932 ProgressMeter - Contig10583:444763        1.62e+06    4.0 m        2.5 m     54.5%         7.3 m     3.3 m 
    INFO  19:42:47,942 ProgressMeter - Contig10583:609462        1.78e+06    4.5 m        2.5 m     60.0%         7.5 m     3.0 m 
    INFO  19:43:17,952 ProgressMeter - Contig10583:762834        1.93e+06    5.0 m        2.6 m     65.1%         7.7 m     2.7 m 
    INFO  19:43:47,962 ProgressMeter - Contig10583:911132        2.08e+06    5.5 m        2.6 m     70.1%         7.8 m     2.3 m 
    INFO  19:44:17,993 ProgressMeter - Contig10583:1047635        2.21e+06    6.0 m        2.7 m     74.7%         8.0 m     2.0 m 
    INFO  19:44:48,002 ProgressMeter - Contig10585:44382        2.35e+06    6.5 m        2.8 m     79.2%         8.2 m   102.3 s 
    INFO  19:45:18,039 ProgressMeter - Contig10590:111218        2.49e+06    7.0 m        2.8 m     84.1%         8.3 m    79.6 s 
    INFO  19:45:48,104 ProgressMeter - Contig10590:308414        2.69e+06    7.5 m        2.8 m     90.7%         8.3 m    46.2 s 
    INFO  19:46:18,133 ProgressMeter - Contig10590:499660        2.88e+06    8.0 m        2.8 m     97.1%         8.2 m    14.2 s 
    INFO  19:46:31,952 ProgressMeter -            done        2.98e+06    8.2 m        2.8 m    100.0%         8.2 m     0.0 s 
    INFO  19:46:31,953 ProgressMeter - Total runtime 494.16 secs, 8.24 min, 0.14 hours 
    INFO  19:46:31,953 MicroScheduler - 85507 reads were filtered out during traversal out of 15196956 total (0.56%) 
    INFO  19:46:31,953 MicroScheduler -   -> 85507 reads (0.56% of total) failing BadMateFilter 
    INFO  19:46:33,075 GATKRunReport - Uploaded run statistics report to AWS S3 
    
    Post edited by Geraldine_VdAuwera on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,276Administrator, GSA Member admin

    Let me know what results you get from the new run. It looks like you've got it running properly now.

    Geraldine Van der Auwera, PhD

  • delangeldelangel Posts: 71GSA Member mod

    Just as a general recommendation/warning: if your 50 haploid genomes are truly pooled and then sequenced then the -ploidy 50 is correct. But if your samples are individually barcoded (or sequenced separately) and hence are marked as separate samples in your BAM then that would be incorrect: you should set -ploidy 1 and you'll get in your output your individual genotypes for each sample.

  • shelbirussellshelbirussell Posts: 3Member

    Yes, my samples are truly pooled, thanks for inquiring. The genomes are from the intracellular symbiont population of the marine bivalve, Solemya velum. So, essentially the symbionts should be dealt with in a manner similar to organelles.

    As for running the UnifiedGenotyper with the -ploidy option, I have found that it works with -minIndelFrac, --genotype_likelihoods_model, and -pnrm. I know for sure that -refsample causes a problem. I'm rerunning it, adding an option at a time to find out which options I can specify. The above output is one where the program didn't output any variants. When the program works, the output is much longer, and details the progress meter for every contig and position. The above example is missing several contigs and most of the positions. The output of a successful run is very long, but I can post one if it would help.

    Shelbi

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,276Administrator, GSA Member admin

    Ah yes, -refsample is an argument that's only really meant for internal use; it will no longer be available in the next version since it is not useful for general users and apparently causes confusion.

    If it's working for you now then we don't need anything further. Good luck with your work!

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.