Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

UnifiedGenotyper functions available when using the -sample_ploidy option

Hi,

I trying to detect variants in a population of haploid genomes with UnifiedGenotyper, using the -sample_ploidy option to specify the number of haploid genomes in the pool. I would like to be able to specify other options, but so far the ones I have tried do not seem to work. I have tried: -out_mode EMIT_ALL_SITES --sample_ploidy 50 -minIndelFrac 0.05 --genotype_likelihoods_model BOTH -pnrm EXACT_GENERAL_PLOIDY. Could you please tell me which, if any, of the other commands are available when calling variants on non-diploid samples? Specifically, I would really like to be able to call indels too.

Thanks for your time and help.

Shelbi

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,834Administrator, GATK Developer admin

    Hi Shelbi,

    Could you tell me what you mean when you say it's not working? Are you getting an error or just no variant calls?

    Geraldine Van der Auwera, PhD

  • shelbirussellshelbirussell Posts: 3Member
    edited February 2013

    Hi Geraldine,

    Sorry I was vague. I didn't get any variant calls and the program didn't run very long. The output summary at the end seemed truncated as well. I pasted one of the outputs below. I am running the program right now using only the -sample_ploidy option, and it has been running for nearly two hours.

    Shelbi

    Your job looked like:

    ------------------------------------------------------------
    # LSBATCH: User input
    java -Xmx2g -jar /n/sw/GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /n/nobackup2/cavanaugh/shelbirussell/alignments/reference/Solemya_velum_sym_genome_all_scaffolds.fasta -I Unamp_DML1gill_realigned.bam -o Unamp_DML1_snps_raw.vcf -out_mode EMIT_ALL_SITES --sample_ploidy 50 -minIndelFrac 0.05 --genotype_likelihoods_model BOTH -pnrm EXACT_GENERAL_PLOIDY -pcr_error 1.0E-20 -refsample WHOI_Svelum_sym_metagenome
    ------------------------------------------------------------

    Successfully completed.

    Resource usage summary:

    CPU time : 298.04 sec.
    Max Memory : 778 MB
    Max Swap : 2566 MB

    Max Processes : 3
    Max Threads : 24

    The output (if any) follows:

    INFO 19:38:17,044 HelpFormatter - --------------------------------------------------------------------------------
    INFO 19:38:17,047 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.3-9-ge5ebf34, Compiled 2013/01/11 22:43:14
    INFO 19:38:17,047 HelpFormatter - Copyright (c) 2010 The Broad Institute
    INFO 19:38:17,047 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    INFO 19:38:17,051 HelpFormatter - Program Args: -T UnifiedGenotyper -R /n/nobackup2/cavanaugh/shelbirussell/alignments/reference/Solemya_velum_sym_genome_all_scaffolds.fasta -I Unamp_DML1gill_realigned.bam -o Unamp_DML1_snps_raw.vcf -out_mode EMIT_ALL_SITES --sample_ploidy 50 -minIndelFrac 0.05 --genotype_likelihoods_model BOTH -pnrm EXACT_GENERAL_PLOIDY -pcr_error 1.0E-20 -refsample WHOI_Svelum_sym_metagenome
    INFO 19:38:17,051 HelpFormatter - Date/Time: 2013/02/21 19:38:17
    INFO 19:38:17,060 HelpFormatter - --------------------------------------------------------------------------------
    INFO 19:38:17,060 HelpFormatter - --------------------------------------------------------------------------------
    INFO 19:38:17,146 GenomeAnalysisEngine - Strictness is SILENT
    INFO 19:38:17,681 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250, Using the new downsampling implementation
    INFO 19:38:17,687 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 19:38:17,741 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.05
    INFO 19:38:17,794 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
    INFO 19:38:17,794 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
    WARN 19:38:17,909 UnifiedGenotyper - WARNING: note that the EMIT_ALL_SITES option is intended only for point mutations (SNPs) in DISCOVERY mode or generally when running in GENOTYPE_GIVEN_ALLELES mode; it will by no means produce a comprehensive set of indels in DISCOVERY mode
    INFO 19:38:47,806 ProgressMeter - Contig10580:31076 2.99e+05 30.0 s 100.4 s 10.5% 4.7 m 4.2 m
    INFO 19:39:17,830 ProgressMeter - Contig10580:189787 4.63e+05 60.0 s 2.2 m 15.9% 6.3 m 5.3 m
    INFO 19:39:47,840 ProgressMeter - Contig10580:356240 6.27e+05 90.0 s 2.4 m 21.5% 7.0 m 5.5 m
    INFO 19:40:17,849 ProgressMeter - Contig10580:573285 8.40e+05 2.0 m 2.4 m 28.8% 7.0 m 5.0 m
    INFO 19:40:47,858 ProgressMeter - Contig10580:763041 1.04e+06 2.5 m 2.4 m 35.1% 7.1 m 4.6 m
    INFO 19:41:17,912 ProgressMeter - Contig10583:38307 1.21e+06 3.0 m 2.5 m 40.8% 7.4 m 4.4 m
    INFO 19:41:47,922 ProgressMeter - Contig10583:238769 1.41e+06 3.5 m 2.5 m 47.5% 7.4 m 3.9 m
    INFO 19:42:17,932 ProgressMeter - Contig10583:444763 1.62e+06 4.0 m 2.5 m 54.5% 7.3 m 3.3 m
    INFO 19:42:47,942 ProgressMeter - Contig10583:609462 1.78e+06 4.5 m 2.5 m 60.0% 7.5 m 3.0 m
    INFO 19:43:17,952 ProgressMeter - Contig10583:762834 1.93e+06 5.0 m 2.6 m 65.1% 7.7 m 2.7 m
    INFO 19:43:47,962 ProgressMeter - Contig10583:911132 2.08e+06 5.5 m 2.6 m 70.1% 7.8 m 2.3 m
    INFO 19:44:17,993 ProgressMeter - Contig10583:1047635 2.21e+06 6.0 m 2.7 m 74.7% 8.0 m 2.0 m
    INFO 19:44:48,002 ProgressMeter - Contig10585:44382 2.35e+06 6.5 m 2.8 m 79.2% 8.2 m 102.3 s
    INFO 19:45:18,039 ProgressMeter - Contig10590:111218 2.49e+06 7.0 m 2.8 m 84.1% 8.3 m 79.6 s
    INFO 19:45:48,104 ProgressMeter - Contig10590:308414 2.69e+06 7.5 m 2.8 m 90.7% 8.3 m 46.2 s
    INFO 19:46:18,133 ProgressMeter - Contig10590:499660 2.88e+06 8.0 m 2.8 m 97.1% 8.2 m 14.2 s
    INFO 19:46:31,952 ProgressMeter - done 2.98e+06 8.2 m 2.8 m 100.0% 8.2 m 0.0 s
    INFO 19:46:31,953 ProgressMeter - Total runtime 494.16 secs, 8.24 min, 0.14 hours
    INFO 19:46:31,953 MicroScheduler - 85507 reads were filtered out during traversal out of 15196956 total (0.56%)
    INFO 19:46:31,953 MicroScheduler - -> 85507 reads (0.56% of total) failing BadMateFilter
    INFO 19:46:33,075 GATKRunReport - Uploaded run statistics report to AWS S3
    Post edited by Geraldine_VdAuwera on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,834Administrator, GATK Developer admin

    Let me know what results you get from the new run. It looks like you've got it running properly now.

    Geraldine Van der Auwera, PhD

  • delangeldelangel Posts: 71GATK Developer mod

    Just as a general recommendation/warning: if your 50 haploid genomes are truly pooled and then sequenced then the -ploidy 50 is correct. But if your samples are individually barcoded (or sequenced separately) and hence are marked as separate samples in your BAM then that would be incorrect: you should set -ploidy 1 and you'll get in your output your individual genotypes for each sample.

  • shelbirussellshelbirussell Posts: 3Member

    Yes, my samples are truly pooled, thanks for inquiring. The genomes are from the intracellular symbiont population of the marine bivalve, Solemya velum. So, essentially the symbionts should be dealt with in a manner similar to organelles.

    As for running the UnifiedGenotyper with the -ploidy option, I have found that it works with -minIndelFrac, --genotype_likelihoods_model, and -pnrm. I know for sure that -refsample causes a problem. I'm rerunning it, adding an option at a time to find out which options I can specify. The above output is one where the program didn't output any variants. When the program works, the output is much longer, and details the progress meter for every contig and position. The above example is missing several contigs and most of the positions. The output of a successful run is very long, but I can post one if it would help.

    Shelbi

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,834Administrator, GATK Developer admin

    Ah yes, -refsample is an argument that's only really meant for internal use; it will no longer be available in the next version since it is not useful for general users and apparently causes confusion.

    If it's working for you now then we don't need anything further. Good luck with your work!

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.