We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
We will be out of the office for a Broad Institute event from Dec 10th to Dec 11th 2019. We will be back to monitor the GATK forum on Dec 12th 2019. In the meantime we encourage you to help out other community members with their queries.
Thank you for your patience!
variant calling with biological replicates
I am new to use GATK pipeline for SNP calling. I am currently working on four different populations (RNASEq data) with 6 clones. Each clone has 3 biological replicates. How do I combine the variant calling step for the replicates?
I went through the GATK documentation on cohorts and stuff, but I am not sure if I should continue this.
Does combining the resultant vcf file also a possibility?
please help!
Best Answers
-
Sheila Broad Institute admin
@suda_ravindran
Hi Suda,Thanks for the clarification.
The best way to do this is to process the biological replicate samples as the same sample from different libraries. So, A1a, A1b, A2c would have the same sample name but a different library name.For variant calling, you can combine all the same sample reads into one bam file. For example, after pre-processing steps, A1a, A1b, and A1c will be merged into one bam file. Then you can run Haplotype Caller on each sample in GVCF mode. Then run Genotype GVCFs on all sample GVCFs. https://www.broadinstitute.org/gatk/guide/article?id=3893
-Sheila
-
Sheila Broad Institute admin
@suda_ravindran
Hi Suda,Just a quick note about how to merge the same sample bams.
When you have multiple libraries (or read groups) for a sample, there are several options for organizing the processing.If you'd like to produce a combined per-sample bam file to feed to Haplotype Caller, the simplest thing to do is to input all the bam files that belong to the sample, either at the indel realignment step or the BQSR step. The choice depends mostly on how deep the coverage is, because high depth means lots of data to process at the same time, which slows down indel realignment. BQSR doesn't suffer from that problem because it processes read groups separately. In either case, when you input all samples together, the bam that gets written out with the processed data will include all the libraries / read groups in one per-sample file.
Another option is to keep the sample bam files separate until variant calling, and then input them to Haplotype Caller together. As long as they are identified with the same SM tag, Haplotype Caller should recognize that it is a single-sample run and run in GVCF mode.
I hope this helps.
-Sheila
Answers
I am by no means an expert on this type of problem. But it makes more sense to me to merge your reads rather than merging after variant calling. You better wait for the answer from someone else other than a fellow user.
@suda_ravindran
Hi,
Can you give us some more background information? What organism are you working with? By clones, do you mean separate colonies (assuming you are working with bacteria)?
Can you explain how the samples were prepared?
Thanks,
Sheila
@Sheila:
Hello Ms.Sheila,
I work on Daphnia.
By clones, I meant the genotypes.
Daphnia from 4 different location were sampled and 6 genotypes from each location were selected using microsatellite markers. For each of the genotype, 3 biological replicates were considered. The cDNA was extracted and sequenced using Illumina. Then, the reads were assembled and mapped.
Hope this helps!
@suda_ravindran
Hi,
Thanks. What exactly is your end goal? Do you want to compare the Daphnia from each location?
-Sheila
@Sheila:
Hello,
Yes, I want to compare the daphnids from each location, I am trying to understand the local adaptation
We need to understand your experimental design a bit better to answer your question.
Regarding the clones / genotypes that you are using as biological replicates, do you expect them to be genetically identical? Or just very closely related?
In your downstream analysis,what exactly do you want to compare? How do you plan to set up the analysis?
@Geraldine_VdAuwera
1) Daphnia have the capabitiy to produce clones of themselves. Hence, our samples are genetically identical biological replicates.
So for example:
Lets say we have for populations: A,B,C and D.
Each of the populations have 6 genotypes labelled from 1-6. (eg: A1,A2,A3,A4,A5,A6 and so on..)
Each genotype has 3 biological replicates labelled as a,b,c. (eg: A1a, A1b, A1c, A2a, A2b, A2c and so on..)
So A1a, A1b, A1c are gentically identical. Similarly for all other genotypes.
2) With the read count data, we have already done a differential gene expression analysis and have a list of genes that are differentially expressed for each of the population (example: We have 100,150,125,145 genes that are differentially expressed in Population A,B,C and D respectively). Now we want to study the sequence level variation among the four populations. We would like to annotate them understand their function. Based on this functional interpretation, we would carry forward our study.
Hope this helps!
Suda
Issue · Github
by Sheila
@suda_ravindran
Hi Suda,
Thanks for the clarification.
The best way to do this is to process the biological replicate samples as the same sample from different libraries. So, A1a, A1b, A2c would have the same sample name but a different library name.
For variant calling, you can combine all the same sample reads into one bam file. For example, after pre-processing steps, A1a, A1b, and A1c will be merged into one bam file. Then you can run Haplotype Caller on each sample in GVCF mode. Then run Genotype GVCFs on all sample GVCFs. https://www.broadinstitute.org/gatk/guide/article?id=3893
-Sheila
Hi @Sheila
Thank you very much!
Suda
@suda_ravindran
Hi Suda,
Just a quick note about how to merge the same sample bams.
When you have multiple libraries (or read groups) for a sample, there are several options for organizing the processing.
If you'd like to produce a combined per-sample bam file to feed to Haplotype Caller, the simplest thing to do is to input all the bam files that belong to the sample, either at the indel realignment step or the BQSR step. The choice depends mostly on how deep the coverage is, because high depth means lots of data to process at the same time, which slows down indel realignment. BQSR doesn't suffer from that problem because it processes read groups separately. In either case, when you input all samples together, the bam that gets written out with the processed data will include all the libraries / read groups in one per-sample file.
Another option is to keep the sample bam files separate until variant calling, and then input them to Haplotype Caller together. As long as they are identified with the same SM tag, Haplotype Caller should recognize that it is a single-sample run and run in GVCF mode.
I hope this helps.
-Sheila
@Sheila
Hello,
After using the same SM tag in the AddorReplaceReadGroups step, I tried running the HaplotypeCaller using both methods individually viz.,
1) producing a combined per-sample bam file where I combined them at the indel realignment step.
2) I kept the bam files separate until variant calling.
After variant calling, when I try to do the Variant filtration, I started getting errors like:
Interpreter - ![0,2]: 'QD < 2.0;' undefined variable QD
Interpreter - ![0,2]: 'FS > 30.0;' undefined variable FS
I checked the vcf file for multiple-allelic sites as mentioned http://gatkforums.broadinstitute.org/discussion/2334/undefined-variable-variantfiltration
I am not sure what is happening here in the REF field (find attached jpeg file):
Please help!!
@suda_ravindran
Hi Suda,
Can you post some actual records from the VCF? I cannot see many annotations in the image you posted. Also, which version of GATK are you using and what was the exact command you ran for VariantFiltration?
Thanks,
Sheila
@Sheila
Hi Sheila,
I am using the latest version of GATK : -3.4-46
The command I used for running the Variant Filtration is:
java -jar GenomeAnalysisTK.jar -T VariantFiltration -R input_files/reference/allaugust.okay.tr.fasta -V output_files/indel_realignment/G/G111_new.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o output_files/indel_realignment/G/G111_filt_again_2.vcf
Please find attached the same regions posted in the screenshot as a text file.
@Sheila
Hi Sheila,
Please find attached the same regions posted in the screenshot as a text file.
Suda
@suda_ravindran
Hi Suda,
I think you did not attach the image.
-Sheila
@suda_ravindran
Thanks.
Sorry I spoke to soon! I see the .txt file
@suda_ravindran
Hi Suda,
This is odd. The VCF looks fine. Can you check and make sure QD and FS are defined in the VCF header? Also, please try running ValidateVariants? https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_ValidateVariants.php
I tried running the same command you ran on some data I have, and it ran perfectly.
-Sheila
@Sheila:
Hi Sheila,
This is my header from the VCF file:
fileformat=VCFv4.1
FILTER=<ID=LowQual,Description="Low quality">
FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
GATKCommandLine.HaplotypeCaller=<ID=HaplotypeCaller,Version=3.4-0-g7e26428,Date="Wed Aug 26 11:04:51 CEST 2015",Epoch=1440579891180,CommandLineOptions="analysis_type=HaplotypeCaller input_file=[output_files/indel_realignment/G/G111_reads.bam] showFullBamList=false read_buffer_size=null phone_home=AWS gatk_key=null tag=NA read_filter=[] disable_read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=input_files/reference/allaugust.okay.tr.fasta nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=500 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_rods=false no_cmdline_in_header=false sites_only=false never_trim_vcf_format_field=false bcf=false bam_compression=null simplifyBAM=false disable_bam_indexing=false generate_md5=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK variant_index_parameter=-1 logging_level=INFO log_to_file=null help=false version=false out=/media/mathilde/daten2/Suda/SNP_Calling/GATK/output_files/indel_realignment/G/G111.vcf likelihoodCalculationEngine=PairHMM heterogeneousKmerSizeResolution=COMBO_MIN dbsnp=(RodBinding name= source=UNBOUND) dontTrimActiveRegions=false maxDiscARExtension=25 maxGGAARExtension=300 paddingAroundIndels=150 paddingAroundSNPs=20 comp=[] annotation=[ClippingRankSumTest, DepthPerSampleHC] excludeAnnotation=[] debug=false useFilteredReadsForAnnotations=false emitRefConfidence=NONE bamOutput=null bamWriterType=CALLED_HAPLOTYPES disableOptimizations=false annotateNDA=false heterozygosity=0.001 indel_heterozygosity=1.25E-4 standard_min_confidence_threshold_for_calling=20.0 standard_min_confidence_threshold_for_emitting=20.0 max_alternate_alleles=6 input_prior=[] sample_ploidy=2 genotyping_mode=DISCOVERY alleles=(RodBinding name= source=UNBOUND) contamination_fraction_to_filter=0.0 contamination_fraction_per_sample_file=null p_nonref_model=null exactcallslog=null output_mode=EMIT_VARIANTS_ONLY allSitePLs=false gcpHMM=10 pair_hmm_implementation=VECTOR_LOGLESS_CACHING pair_hmm_sub_implementation=ENABLE_ALL always_load_vector_logless_PairHMM_lib=false phredScaledGlobalReadMismappingRate=45 noFpga=false sample_name=null kmerSize=[10, 25] dontIncreaseKmerSizesForCycles=false allowNonUniqueKmersInRef=false numPruningSamples=1 recoverDanglingHeads=false doNotRecoverDanglingBranches=false minDanglingBranchLength=4 consensus=false maxNumHaplotypesInPopulation=128 errorCorrectKmers=false minPruning=2 debugGraphTransformations=false allowCyclesInKmerGraphToGeneratePaths=false graphOutput=null kmerLengthForReadErrorCorrection=25 minObservationsForKmerToBeSolid=20 GVCFGQBands=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 70, 80, 90, 99] indelSizeToEliminateInRefModel=10 min_base_quality_score=10 includeUmappedReads=false useAllelesTrigger=false doNotRunPhysicalPhasing=true keepRG=null justDetermineActiveRegions=false dontGenotype=false dontUseSoftClippedBases=true captureAssemblyFailureBAM=false errorCorrectReads=false pcr_indel_model=CONSERVATIVE maxReadsInRegionPerSample=10000 minReadsPerAlignmentStart=10 mergeVariantsViaLD=false activityProfileOut=null activeRegionOut=null activeRegionIn=null activeRegionExtension=null forceActive=false activeRegionMaxSize=null bandPassSigma=null maxProbPropagationDistance=50 activeProbabilityThreshold=0.002 min_mapping_quality_score=20 filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">
INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
This is what I obtained after running the ValidateVariants step:
java -jar GenomeAnalysisTK.jar -T ValidateVariants -R input_files/reference/allaugust.okay.tr.fasta -V output_files/indel_realignment/G/G111.vcf
INFO 16:41:52,908 HelpFormatter - ---------------------------------------------------------------------------------
INFO 16:41:52,910 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.4-46-gbc02625, Compiled 2015/07/09 17:38:12
INFO 16:41:52,911 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 16:41:52,911 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 16:41:52,915 HelpFormatter - Program Args: -T ValidateVariants -R input_files/reference/allaugust.okay.tr.fasta -V output_files/indel_realignment/G/G111.vcf
INFO 16:41:52,921 HelpFormatter - Executing as [email protected] on Linux 3.13.0-61-generic amd64; OpenJDK 64-Bit Server VM 1.7.0_79-b14.
INFO 16:41:52,921 HelpFormatter - Date/Time: 2015/08/28 16:41:52
INFO 16:41:52,921 HelpFormatter - ---------------------------------------------------------------------------------
INFO 16:41:52,921 HelpFormatter - ---------------------------------------------------------------------------------
INFO 16:41:52,984 GenomeAnalysisEngine - Strictness is SILENT
INFO 16:41:53,852 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 16:42:01,746 GenomeAnalysisEngine - Preparing for traversal
INFO 16:42:01,798 GenomeAnalysisEngine - Done preparing for traversal
INFO 16:42:01,799 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 16:42:01,799 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 16:42:01,800 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
INFO 16:42:03,198 ValidateVariants - Reference allele is too long (146) at position abyss10353:109; skipping that record.
INFO 16:42:03,683 ValidateVariants - Reference allele is too long (118) at position oasesvelvLoc2997d15643t2:151; skipping that record.
INFO 16:42:04,896 ValidateVariants - Reference allele is too long (129) at position trinitytrinloc16123c0t2:23; skipping that record.
INFO 16:42:04,982 ValidateVariants - Reference allele is too long (103) at position trinitytrinloc20144c2t5:2427; skipping that record.
Successfully validated the input file. Checked 150907 records with no failures.
INFO 16:42:05,561 ProgressMeter - done 168838.0 3.0 s 22.0 s 100.0% 3.0 s 0.0 s
INFO 16:42:05,562 ProgressMeter - Total runtime 3.76 secs, 0.06 min, 0.00 hours
INFO 16:42:07,083 GATKRunReport - Uploaded run statistics report to AWS S3
Suda
@Sheila:
Just to make sure, I have done things right:
1) In the AddorReplaceReadGroups step, I had the same sample name and different library name for each of my replicates (as you have mentioned previously)
2) I then Sorted, indexed the bam files, then checked for duplicates and also ran the SplitNCigarReads step using the default paramters.
3) Then, I ran the Indel Relaignment step.
4) I went to VariantCalling step after this as I do not have snp db for the BQSR step using the command:
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R input_files/reference/allaugust.okay.tr.fasta -I output_files/indel_realignment/G/G111_reads.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 -o output_files/indel_realignment/G/G111_new.vcf -log output_files/indel_realignment/G/G111_report.log
After this I did the VariantFiltration step as mentioned previously.
Is this right?
@suda_ravindran
Thanks Suda. Everything looks fine. Can you try removing those sites where the reference alleles are too long and try running VariantFiltration again? I may need you to submit a bug report.
Thanks,
Sheila
@Sheila
Hi Sheila,
I removed those variants that were troubling and rand the VariantFiltration step:
This is what I got:
java -jar GenomeAnalysisTK.jar -T VariantFiltration -R input_files/reference/allaugust.okay.tr.fasta -V output_files/indel_realignment/G/G111_removed.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o output_files/indel_realignment/G/G111_new_filt_removed.vcf
INFO 17:12:34,468 HelpFormatter - ---------------------------------------------------------------------------------
INFO 17:12:34,471 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.4-46-gbc02625, Compiled 2015/07/09 17:38:12
INFO 17:12:34,471 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 17:12:34,471 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 17:12:34,475 HelpFormatter - Program Args: -T VariantFiltration -R input_files/reference/allaugust.okay.tr.fasta -V output_files/indel_realignment/G/G111_removed.vcf -window 35 -cluster 3 -filterName FS -filter FS > 30.0 -filterName QD -filter QD < 2.0 -o output_files/indel_realignment/G/G111_new_filt_removed.vcf
INFO 17:12:34,481 HelpFormatter - Executing as [email protected] on Linux 3.13.0-61-generic amd64; OpenJDK 64-Bit Server VM 1.7.0_79-b14.
INFO 17:12:34,481 HelpFormatter - Date/Time: 2015/08/28 17:12:34
INFO 17:12:34,482 HelpFormatter - ---------------------------------------------------------------------------------
INFO 17:12:34,482 HelpFormatter - ---------------------------------------------------------------------------------
INFO 17:12:34,562 GenomeAnalysisEngine - Strictness is SILENT
INFO 17:12:35,392 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 17:12:43,986 GATKRunReport - Uploaded run statistics report to AWS S3
ERROR ------------------------------------------------------------------------------------------
ERROR stack trace
java.lang.NumberFormatException: For input string: "280,77"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1250)
at java.lang.Double.valueOf(Double.java:504)
at htsjdk.variant.vcf.AbstractVCFCodec.parseQual(AbstractVCFCodec.java:511)
at htsjdk.variant.vcf.AbstractVCFCodec.parseVCFLine(AbstractVCFCodec.java:317)
at htsjdk.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:279)
at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:257)
at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:60)
at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:79)
at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:41)
at htsjdk.tribble.AbstractFeatureCodec.decodeLoc(AbstractFeatureCodec.java:40)
at htsjdk.tribble.index.IndexFactory$FeatureIterator.readNextFeature(IndexFactory.java:502)
at htsjdk.tribble.index.IndexFactory$FeatureIterator.(IndexFactory.java:403)
at htsjdk.tribble.index.IndexFactory.createDynamicIndex(IndexFactory.java:312)
at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.createIndexInMemory(RMDTrackBuilder.java:401)
at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.loadIndex(RMDTrackBuilder.java:287)
at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.getFeatureSource(RMDTrackBuilder.java:224)
at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.createInstanceOfTrack(RMDTrackBuilder.java:147)
at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedQueryDataPool.(ReferenceOrderedDataSource.java:208)
at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedDataSource.(ReferenceOrderedDataSource.java:88)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.getReferenceOrderedDataSources(GenomeAnalysisEngine.java:1047)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.initializeDataSources(GenomeAnalysisEngine.java:828)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:286)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)
ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.4-46-gbc02625):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: For input string: "280,77"
ERROR ------------------------------------------------------------------------------------------
@Sheila
Hi Sheila,
Any updates on how I could proceed on this?
Thanks,
Suda
@suda_ravindran
Hi Suda,
It looks like there is a 280,77 that is not defined somewhere in your VCF. The format looks like an AD field. Can you check if for a 280,77 in your VCF? Please post the record here. How did you remove the offending records?
Thanks,
Sheila
@Sheila
Hi Shiela,
I am not sure what 280,77 means.
I removed the offending records manually (just deleted the variants from the file) and created the new file.
Suda
@suda_ravindran
Hi Suda,
Can you please try running ValidateVariants on your VCF without the offending records? If that runs with no errors, can you submit a bug report. Instructions are here: http://gatkforums.broadinstitute.org/discussion/1894/how-do-i-submit-a-detailed-bug-report
Thanks,
Sheila
@Sheila
Hi Sheila,
Thanks, I just ran the ValidateVaraints and obtained the same error message: " For input string: 280,77.
The command that I used:
java -jar GenomeAnalysisTK.jar -T ValidateVariants -R input_files/reference/allaugust.okay.tr.fasta -V output_files/indel_realignment/G/G111_removed.vcf
INFO 15:46:37,573 HelpFormatter - ---------------------------------------------------------------------------------
INFO 15:46:37,575 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.4-46-gbc02625, Compiled 2015/07/09 17:38:12
INFO 15:46:37,575 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 15:46:37,575 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 15:46:37,578 HelpFormatter - Program Args: -T ValidateVariants -R input_files/reference/allaugust.okay.tr.fasta -V output_files/indel_realignment/G/G111_removed.vcf
INFO 15:46:37,583 HelpFormatter - Executing as [email protected] on Linux 3.13.0-61-generic amd64; OpenJDK 64-Bit Server VM 1.7.0_79-b14.
INFO 15:46:37,584 HelpFormatter - Date/Time: 2015/08/31 15:46:37
INFO 15:46:37,584 HelpFormatter - ---------------------------------------------------------------------------------
INFO 15:46:37,584 HelpFormatter - ---------------------------------------------------------------------------------
INFO 15:46:37,635 GenomeAnalysisEngine - Strictness is SILENT
INFO 15:46:38,336 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 15:46:46,936 GATKRunReport - Uploaded run statistics report to AWS S3
ERROR ------------------------------------------------------------------------------------------
ERROR stack trace
java.lang.NumberFormatException: For input string: "280,77"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1250)
at java.lang.Double.valueOf(Double.java:504)
at htsjdk.variant.vcf.AbstractVCFCodec.parseQual(AbstractVCFCodec.java:511)
at htsjdk.variant.vcf.AbstractVCFCodec.parseVCFLine(AbstractVCFCodec.java:317)
at htsjdk.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:279)
at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:257)
at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:60)
at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:79)
at htsjdk.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:41)
at htsjdk.tribble.AbstractFeatureCodec.decodeLoc(AbstractFeatureCodec.java:40)
at htsjdk.tribble.index.IndexFactory$FeatureIterator.readNextFeature(IndexFactory.java:502)
at htsjdk.tribble.index.IndexFactory$FeatureIterator.(IndexFactory.java:403)
at htsjdk.tribble.index.IndexFactory.createDynamicIndex(IndexFactory.java:312)
at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.createIndexInMemory(RMDTrackBuilder.java:401)
at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.loadIndex(RMDTrackBuilder.java:287)
at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.getFeatureSource(RMDTrackBuilder.java:224)
at org.broadinstitute.gatk.utils.refdata.tracks.RMDTrackBuilder.createInstanceOfTrack(RMDTrackBuilder.java:147)
at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedQueryDataPool.(ReferenceOrderedDataSource.java:208)
at org.broadinstitute.gatk.engine.datasources.rmd.ReferenceOrderedDataSource.(ReferenceOrderedDataSource.java:88)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.getReferenceOrderedDataSources(GenomeAnalysisEngine.java:1047)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.initializeDataSources(GenomeAnalysisEngine.java:828)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:286)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)
ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.4-46-gbc02625):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: For input string: "280,77"
ERROR ------------------------------------------------------------------------------------------
Do I still submit a bug report?
Thanks, Suda
@suda_ravindran
Hi Suda,
Okay somehow the VCF must have gotten corrupted when you removed the sites. Can you try using -XL on the sites rather than removing them? https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_engine_CommandLineGATK.php#--excludeIntervals
-Sheila
****@Sheila
Hi Sheila,
I used the -XL option to remove those sites which had long REF position. I still keep getting this warning message:
My command was: ****
**java -jar GenomeAnalysisTK.jar -T VariantFiltration -R input_files/reference/allaugust.okay.tr.fasta -V output_files/indel_realignment/G/G112.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o output_files/indel_realignment/G/G112_filt.vcf -XL abyssk40_j_668347:698 -XL abyssk64_j_517734:263 -XL abyssk76_j_468075:250 -XL abyss1756:1344 -XL abyss2055:1638 -XL abyss12232:82 -XL oasesvelvLoc3048t2:1925 -XL oasesvelvLoc3048t2:2170 -XL oasesvelvLoc4278t1:324 -XL oasesvelvLoc4t8884:3534 -XL oasesvelvLoc6t676:53 -XL oasesvelvLoc4723d10616t1:637 -XL oasesvelvLoc5598d10760t1:60 -XL oasesvelvLoc359t2:370 -XL oasesvelvLoc1228t2:927 -XL oasesvelvLoc3068t3:448 -XL oasesvelvLoc11581t1:146 -XL oasesvelvLoc12128t3:1376 -XL oasesvelvLoc1026d26235t1:3695 -XL oasesvelvLoc2282d27162t3:334 -XL oasesvelvLoc5370d28990t1:817 -XL oasesvelvLoc16759t1:2457 -XL oasesvelvLoc21253t2:237 -XL oasesvelvLoc23883t1:467 -XL oasesvelvLoc886d36038t3:4137 -XL oasesvelvLoc894d36043t1:74 -XL oasesvelvLoc1408d36364t2:2167 -XL oasesvelvLoc12188d39561t1:189 -XL oasesvelvLoc13807d47288t1:444 -XL soapsoap196131:106 -XL soapsoapd3723321453:85 -XL soapsoap362703:372 -XL soapsoap385976:970 -XL soapsoapd20450231149:98 -XL soapsoapd25964281441:72 -XL soapsoap385691:67 -XL soapsoap386007:2010 -XL soapsoap425053:1616 -XL trinitytrinloc10376c0t4:1143 -XL trinitytrinloc12960c0t2:1952 -XL trinitytrinloc18203c0t14:2436 -XL trinitytrinloc18717c0t1:1105 -XL trinitytrinloc19230c0t1:261 -XL trinitytrinloc19230c0t1:319 -XL trinitytrinloc19862c0t4:762 -XL trinitytrinloc20144c2t5:2427 -XL trinitytrinloc20146c0t4:2046 -XL trinitytrinloc21018c0t1:1235 -XL trinitytrinloc21314c0t2:698 -XL trinitytrinloc21409c1t3:1096 -XL trinitytrinloc22065c0t1:2308 -XL trinitytrinloc22067c0t1:900 -XL trinitytrinloc22331c0t1:444 -XL trinitytrinloc22510c0t13:1465 -XL trinitytrinloc22867c0t1:2162 -XL trinitytrinloc23124c0t14:1009 -XL trinitytrinloc23314c0t5:774 -XL trinitytrinloc23357c0t4:101 -XL trinitytrinloc23597c0t1:1431 -XL trinitytrinloc23632c0t11:171 -XL trinitytrinloc23694c0t1:394 -XL trinitytrinloc23969c1t1:1909 -XL trinitytrinloc23972c1t41:2308 -XL trinitytrinloc24346c0t3:2962 -XL trinitytrinloc24511c0t1:367 -XL trinitytrinloc24791c0t2:2249 -XL trinitytrinloc25114c0t38:3566 -XL trinitytrinloc25199c0t1:1077 -XL trinitytrinloc25306c1t3:2396 -XL trinitytrinloc25550c0t2:2283 -XL trinitytrinloc25716c1t13:1527 -XL trinitytrinloc25721c1t2:1454
**
INFO 15:39:57,354 HelpFormatter - ---------------------------------------------------------------------------------
INFO 15:39:57,356 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.4-46-gbc02625, Compiled 2015/07/09 17:38:12
INFO 15:39:57,356 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 15:39:57,357 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 15:39:57,360 HelpFormatter - Program Args: -T VariantFiltration -R input_files/reference/allaugust.okay.tr.fasta -V output_files/indel_realignment/G/G112.vcf -window 35 -cluster 3 -filterName FS -filter FS > 30.0 -filterName QD -filter QD < 2.0 -o output_files/indel_realignment/G/G112_filt.vcf -XL abyssk40_j_668347:698 -XL abyssk64_j_517734:263 -XL abyssk76_j_468075:250 -XL abyss1756:1344 -XL abyss2055:1638 -XL abyss12232:82 -XL oasesvelvLoc3048t2:1925 -XL oasesvelvLoc3048t2:2170 -XL oasesvelvLoc4278t1:324 -XL oasesvelvLoc4t8884:3534 -XL oasesvelvLoc6t676:53 -XL oasesvelvLoc4723d10616t1:637 -XL oasesvelvLoc5598d10760t1:60 -XL oasesvelvLoc359t2:370 -XL oasesvelvLoc1228t2:927 -XL oasesvelvLoc3068t3:448 -XL oasesvelvLoc11581t1:146 -XL oasesvelvLoc12128t3:1376 -XL oasesvelvLoc1026d26235t1:3695 -XL oasesvelvLoc2282d27162t3:334 -XL oasesvelvLoc5370d28990t1:817 -XL oasesvelvLoc16759t1:2457 -XL oasesvelvLoc21253t2:237 -XL oasesvelvLoc23883t1:467 -XL oasesvelvLoc886d36038t3:4137 -XL oasesvelvLoc894d36043t1:74 -XL oasesvelvLoc1408d36364t2:2167 -XL oasesvelvLoc12188d39561t1:189 -XL oasesvelvLoc13807d47288t1:444 -XL soapsoap196131:106 -XL soapsoapd3723321453:85 -XL soapsoap362703:372 -XL soapsoap385976:970 -XL soapsoapd20450231149:98 -XL soapsoapd25964281441:72 -XL soapsoap385691:67 -XL soapsoap386007:2010 -XL soapsoap425053:1616 -XL trinitytrinloc10376c0t4:1143 -XL trinitytrinloc12960c0t2:1952 -XL trinitytrinloc18203c0t14:2436 -XL trinitytrinloc18717c0t1:1105 -XL trinitytrinloc19230c0t1:261 -XL trinitytrinloc19230c0t1:319 -XL trinitytrinloc19862c0t4:762 -XL trinitytrinloc20144c2t5:2427 -XL trinitytrinloc20146c0t4:2046 -XL trinitytrinloc21018c0t1:1235 -XL trinitytrinloc21314c0t2:698 -XL trinitytrinloc21409c1t3:1096 -XL trinitytrinloc22065c0t1:2308 -XL trinitytrinloc22067c0t1:900 -XL trinitytrinloc22331c0t1:444 -XL trinitytrinloc22510c0t13:1465 -XL trinitytrinloc22867c0t1:2162 -XL trinitytrinloc23124c0t14:1009 -XL trinitytrinloc23314c0t5:774 -XL trinitytrinloc23357c0t4:101 -XL trinitytrinloc23597c0t1:1431 -XL trinitytrinloc23632c0t11:171 -XL trinitytrinloc23694c0t1:394 -XL trinitytrinloc23969c1t1:1909 -XL trinitytrinloc23972c1t41:2308 -XL trinitytrinloc24346c0t3:2962 -XL trinitytrinloc24511c0t1:367 -XL trinitytrinloc24791c0t2:2249 -XL trinitytrinloc25114c0t38:3566 -XL trinitytrinloc25199c0t1:1077 -XL trinitytrinloc25306c1t3:2396 -XL trinitytrinloc25550c0t2:2283 -XL trinitytrinloc25716c1t13:1527 -XL trinitytrinloc25721c1t2:1454
INFO 15:39:57,366 HelpFormatter - Executing as [email protected] on Linux 3.13.0-61-generic amd64; OpenJDK 64-Bit Server VM 1.7.0_79-b14.
INFO 15:39:57,367 HelpFormatter - Date/Time: 2015/09/01 15:39:57
INFO 15:39:57,367 HelpFormatter - ---------------------------------------------------------------------------------
INFO 15:39:57,367 HelpFormatter - ---------------------------------------------------------------------------------
INFO 15:39:57,445 GenomeAnalysisEngine - Strictness is SILENT
INFO 15:39:58,286 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 15:40:08,122 IntervalUtils - Initial include intervals span 39118833 loci; exclude intervals span 72 loci
INFO 15:40:08,125 IntervalUtils - Excluding 72 loci from original intervals (0.00% reduction)
INFO 15:40:08,126 IntervalUtils - Processing 39118761 bp from intervals
INFO 15:40:08,467 GenomeAnalysisEngine - Preparing for traversal
INFO 15:40:08,490 GenomeAnalysisEngine - Done preparing for traversal
INFO 15:40:08,491 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 15:40:08,491 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 15:40:08,492 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
WARN 15:40:20,665 Interpreter - ![0,2]: 'QD < 2.0;' undefined variable QD
WARN 15:40:21,055 Interpreter - ![0,2]: 'QD < 2.0;' undefined variable QD
WARN 15:40:21,056 Interpreter - ![0,2]: 'QD < 2.0;' undefined variable QD
WARN 15:40:21,127 Interpreter - ![0,2]: 'QD < 2.0;' undefined variable QD
WARN 15:40:21,353 Interpreter - ![0,2]: 'QD < 2.0;' undefined variable QD
WARN 15:40:21,356 Interpreter - ![0,2]: 'QD < 2.0;' undefined variable QD
WARN 15:40:21,398 Interpreter - ![0,2]: 'QD < 2.0;' undefined variable QD
INFO 15:40:22,542 ProgressMeter - done 156790.0 14.0 s 89.0 s 100.0% 14.0 s 0.0 s
INFO 15:40:22,543 ProgressMeter - Total runtime 14.05 secs, 0.23 min, 0.00 hours
INFO 15:40:23,874 GATKRunReport - Uploaded run statistics report to AWS S3
Thanks,
Suda
@suda_ravindran
Hi Suda,
Oh. So you are getting an output VCF with no error message? You just get a WARN message? That is no reason to be concerned. The WARN messages may just be telling you that some of the sites in your file do not have the QD annotation or some of the sites may be homozygous reference. Have a look at this thread for more information: http://gatkforums.broadinstitute.org/discussion/2334/undefined-variable-variantfiltration
-Sheila
@Sheila
Hi Sheila,
Now I get only a warning message. I did have a look at the link you have mentioned. My question is:
Why does GATK provide such long stretch of nucleotides in the Reference field (Before using the -XL option) ? Sequences of upto say (10bp) could be considered as indels. But can indels also be 100bp long as shown below? I don't understand.
abyssk76_j_468075 250 . AAGGTTGTATATTTATTATACATTCAACATGAGTTTAGATTTGTATGAAGCTACTGTAACACCAATTTCCCTTTTATT A 112,73 . AC=1;AF=0.500;AN=2;BaseQRankSum=1.203;ClippingRankSum=-0.248;DP=27;FS=3.332;MLEAC=1;MLEAF=0.500;MQ=37.27;MQRankSum=-2.054;QD=4.18;ReadPosRankSum=-0.727;SOR=2.833 GT:AD:DP:GQ:PL 0/1:6,3:9:99:150,0,1995
abyss1756 1344 . ACTGGCGAACACAAAACAAAGTTAAATTGAATTTGACAATATTTTTAAAAGTTATTAC A 631,73 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.745;ClippingRankSum=1.317;DP=40;FS=4.416;MLEAC=1;MLEAF=0.500;MQ=34.21;MQRankSum=-3.950;QD=15.79;ReadPosRankSum=-0.047;SOR=1.863 GT:AD:DP:GQ:PL 0/1:8,16:24:99:669,0,342
abyss10353 109 . ATGTGGATTAGTCATGATGCCGGCAAGTTATTCAATAACACATCGAGGCAGTTTGTTTGTTTTCCAACTTTATTTAGCTTCTCCAGGTGGTATTTTTTTGGTAGGTAAATATTAACCTCGGCAGCATCGCCCGCCATTGCAAGTCT A 21,76 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.049;ClippingRankSum=0.832;DP=46;FS=1.768;MLEAC=1;MLEAF=0.500;MQ=26.68;MQRankSum=3.441;QD=0.47;ReadPosRankSum=-3.593;SOR=1.292 GT:AD:DP:GQ:PL 0/1:6,40:46:59:59,0,127
oasesvelvLoc359t2 370 . AACCCTTTGAACGTTAGGTCTGATCATGAAGGTGTGAAATGTAATCAAATAATAATAATTATTACACAAAATGTCTTGCGTTTT A 2034,84 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.413;ClippingRankSum=1.310;DP=83;FS=1.788;MLEAC=1;MLEAF=0.500;MQ=35.98;MQRankSum=1.451;QD=24.52;ReadPosRankSum=-0.089;SOR=1.463 GT:AD:DP:GQ:PL 0/1:6,77:83:16:2072,0,16
oasesvelvLoc11581t1 146 . AGTCATGATGGATCCAAGAGAAGAAGAAGGAGGTGTTGAT A 265,73 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.067;ClippingRankSum=0.762;DP=46;FS=6.532;MLEAC=1;MLEAF=0.500;MQ=38.01;MQRankSum=-0.762;QD=5.78;ReadPosRankSum=-0.747;SOR=4.174 GT:AD:DP:GQ:PL 0/1:2,7:9:49:303,0,49
oasesvelvLoc23883t1 467 . ACCTAAGGTATTTTTCAAAATTTGTAATTCTTTTAACTGCGACTTGGTAAAATTGCATGTCAGAAAAATATGTTTGACACTTG A 1157,73 . AC=1;AF=0.500;AN=2;BaseQRankSum=2.635;ClippingRankSum=-0.746;DP=52;FS=1.386;MLEAC=1;MLEAF=0.500;MQ=35.80;MQRankSum=-3.961;QD=22.26;ReadPosRankSum=-0.125;SOR=0.997 GT:AD:DP:GQ:PL 0/1:10,28:38:99:1195,0,2877
oasesvelvLoc3975d37489t1 14 . AGTAAGGTTGAATTCAGGGGTATTATTTGTTCCGTTTCCGCCTTTTTTCAGCTGCCTGTCTTTAACTTTAATAAG A 48,73 . AC=1;AF=0.500;AN=2;BaseQRankSum=2.510;ClippingRankSum=-0.271;DP=45;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=37.00;MQRankSum=-2.329;QD=1.08;ReadPosRankSum=-2.899;SOR=0.223 GT:AD:DP:GQ:PL 0/1:40,5:45:86:86,0,1656
soapsoap196131 106 . ATCTTCTTCATCCTCGTCTTCTTCATCCGCATCTACTTCG A 65,74 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.727;ClippingRankSum=-0.727;DP=10;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=28.47;MQRankSum=0.727;QD=6.57;ReadPosRankSum=0.727;SOR=1.609 GT:AD:DP:GQ:PL 0/1:1,3:4:29:103,0,29
Thanks,
Suda
@suda_ravindran
Hi Suda,
Yes, it is possible for Haplotype Caller to call such a long deletion. You will have to check your bam and bamout files for more information about why exactly they were called.
-Sheila
@Sheila
Hi Sheila,
Thanks. I will look into it
Suda
@Sheila
Dear Sheila,
I have a followup question here.
Wouldn't the simplest way forward to treat it as one sample from the beginning if possible (as long as the coverage does not slow everything down) - with the beginning I mean read preprocessing and bwa mapping - followed by GATK preprocessing and finally calling. So could I just treat all runs for a sample even if they come from different biological replicates as one sample from the start or is there something I did not consider?
Thank you for your help!
Julia
@Juls For performance reasons (like speed and memory requirements) it is better to do the early processing steps separately.
@Geraldine_VdAuwera
Thank you very much! I have very very low coverage data so I don't think it would change much performance-wise.
So it would be fine to throw everything together from the start if I don't mind the slight drop in performance?
Thanks again!
If the coverage is very low then that's reasonable.
Thanks!