Loss of RSIDs for GenotypeGVCFs, possibly an issue with dpsnp filter.vcf file from SelectVariants

Hello,

I've run into this problem a few times now having attempted to debug the issue in various ways. The first time it occurred I was using a vcf file containing the rsids I wished to genotype on a given gvcf file. There was no error message that I could discern, however the resultant vcf file only contained <1/5th of the rsids of interest in the filter file.

I assumed this issue related to the gvcf file and so acquired the bam file to begin the workflow from the beginning and stick to best practices, however I have now run into the same issue using haplotype caller and the same filter.vcf file as the -L and -D argument.

This leads me to believe the filter file is the issue, but when I look at the file I can see all of the rsids there so I'm not sure what is causing the loss. Of lesser concern is the loss of 7 rsids when I used SelectVariants to produce the filter initially, but I will also have to address that.

I have attempted to reduce the minimum base quality reads to 1 but it has not resulted in any increase in variant calling. I also looked into generating a bamout file which I will return to when possible, however I am currently working remotely and unable to install IGV on this device.

Many thanks for any tips

Answers

  • kingpgkingpg Member

    I have been working with IGV to investigate the bamout file. While I still have to familiarise myself with the toolkit, from what I can see the original bam file does contain the data for the regions in question.

  • SChaluvadiSChaluvadi Member, Broadie, Moderator admin

    @kingpg Sorry for the delay in response but I am looking into it and will confer with the team to see if we can help debug this issue!

  • SChaluvadiSChaluvadi Member, Broadie, Moderator admin
    edited December 6

    @kingpg
    Are there any sites in the vcf you wish to filter (with your filter.vcf) that have more than 1 rsID? For example, if you look at the issue listed in this link, there is an example vcf record that is annotated with 2 rsIDs, instead of 1. This issue was resolved recently - did you do the above listed troubleshooting before or after? Let me know if this helps get back the missing rsIDs of interest.

    If this is not the case, would you be able to provide the code snippets that you used to generate your filter as well as the other 2 troubleshooting steps? This will help myself and the team determine the issue faster :)

    Post edited by SChaluvadi on
  • kingpgkingpg Member

    Hi @SChaluvadi, thanks for getting back to me!

    I've had a look at the potentially similar issue you mentioned and I'll look for other ways to verify if that's the problem I'm having, however I do not think that it is as I am experiencing it both when searching for the rsids with the filter.list file as well as when I am simply trying to populate the bam -> vcf file with said rsids using HaplotypeCaller. If this is not a reasonable conclusion to draw let me know, I should also not the rsid list does not contain sets of ids that refer to the same sites.

    I have again changed device so will begin downloading the toolkit and repeating the steps to provide the code.

    Thanks again

  • kingpgkingpg Member

    To start with here's the command I ran with the filter file to try and produce the parsed vcf from the bam file using HaplotypeCaller. As before while the filter contains 108 rsids, the output file contains only 20. Sorry about the long file paths, I am using an external hard-drive for the large files.

    ./gatk HaplotypeCaller -I /Volumes/Macintosh\ HD\ 1/Jordan/N1_S1_L002_R1_001_dedup.realigned.recalibrated.bam -R /Volumes/Macintosh\ HD\ 1/Jordan/human_g1k_v37.fasta -L filter.vcf -D filter.vcf -O N1_parsed.vcf

  • kingpgkingpg Member

    In repeating the creation of my filter.vcf file using SelectVariants I am being delayed by having to download another vcf file from which to select said variants using the rsid file, and currently the server is at the max 25 users limit. It does occur to me however that my choice of file to use as an input here may have been the issue, I am currently aiming to use the 1000G_omni2.5.b36.vcf.gz. However I am uncertain as to what I used before and am still less clear on the potential impacts of an incorrect choice of input. Regardless I will follow up with the commands and results when I have completed this step.

  • kingpgkingpg Member

    I just did the above and produced a filter_2.vcf file which was already missing a larger number of rsids than the previous filter I created using the forgotten vcf input file (the file is discoverable as it is on my device but I don't have access to it currently). The commands I used to create filter_2 are below.

    ./gatk SelectVariants -V /Volumes/Macintosh\ HD\ 1/Jordan/1000G_omni2.5.b36.vcf -ids rsids.list -O filter_2.vcf

  • SChaluvadiSChaluvadi Member, Broadie, Moderator admin

    @kingpg
    This troubleshooting idea is only based on the first command you posted that runs HaplotypeCaller. It looks like you are using the same filter.vcf file as input for the -L and -D parameters. The L parameter is looking for a text file that contains a list of intervals, only over which the tool will operate. Here is a small post that has some more information on the -L argument. The -D parameter will use the filter.vcf to annotate variants with the corresponding reference ID. Would you be able to look into the intervals parameter to make sure that you are not isolating intervals where the tool will not look, which could result in the lesser than expected number of rsids.

  • kingpgkingpg Member

    Thanks @SChaluvadi,
    I'm currently trying to use the interval command as the post describes, I can see how using the vcf file could fail to include target sites. I am currently running into the following error however, first seen using the interval.list file, and now when pasting in the -L chr1 example argument.

    A USER ERROR has occurred: Badly formed genome unclippedLoc: Query interval "chr1" is not valid for this input.

    I will see if I can work out what's going on before the weekend is over.

  • bhanuGandhambhanuGandham Member, Administrator, Broadie, Moderator admin

    Hi @kingpg

    Have you been able to figure out the error? Please let us know if there is anyway we can help.

    Regards
    Bhanu

  • SChaluvadiSChaluvadi Member, Broadie, Moderator admin

    @kingpg While you are troubleshooting the intervals list option, can you confirm that the reference build is consistent across all your files. This could be causing the error that you are seeing.

  • kingpgkingpg Member

    I am quite confidant that the reference build is consistent between the -I input, I am less certain regarding the file used to construct the filter.vcf file, as expressed earlier. The g1k_v37 build was not accompanied by a vcf file from which to generate a shorter list, and I was unable to find the right guide on the matter.

    That said I was not using the filter.vcf file for the -L option in the command that generated the last error.

    I will double check the build question and troubleshoot the -L command problem as best I can.

    Thanks again!

  • SChaluvadiSChaluvadi Member, Broadie, Moderator admin

    @kingpg Great! If you can share your findings with us that would be appreciated and as always any follow up questions as well!

  • kingpgkingpg Member

    The question I'm still unsure about which may not be the cause of the -L errors but which would be good to know anyway is how to determine which files to use in the creation of a .vcf file containing the rsids of interest, and what issues can result from using different files.

  • kingpgkingpg Member

    Also, having ensured that the reference build is consistent, are there any other possible sources for the error - 'Badly formed genome unclippedLoc: Query interval "chr1" is not valid for this input.'?

  • kingpgkingpg Member

    I will see if this file has gone through the second half of the data pre-processing stage as I realised this was an unwarranted assumption.

  • SChaluvadiSChaluvadi Member, Broadie, Moderator admin

    @kingpg I am looking for other sources of error and will get back to you.

  • kingpgkingpg Member

    Thanks @SChaluvadi, I should mention I think I solved the -L issue by not including chr in chr1 and just using 1. I have yet to properly go over the results of that command though.

  • kingpgkingpg Member

    With that covered I have unfortunately returned to my initial issue, within the resultant gvcf from HaplotypeCaller there are a number of positions intended for genotyping within the defined ranges that don't appear. HaplotypeCaller didn't populate the desired positions with the -D argument from what I could see, but GenotypeGVCFs did, at least those that were present. After using SelectVariants I outputted the the same list I had produced before starting again with the BAM file. I attempted to reduce the minimum base quality score and repeat the process but got the same results.

  • kingpgkingpg Member

    ./gatk HaplotypeCaller -I /Volumes/Macintosh\ HD\ 1/Jordan/N1_S1_L002_R1_001_dedup.realigned.recalibrated.bam -R /Volumes/Macintosh\ HD\ 1/Jordan/human_g1k_v37.fasta -L intervals_2.intervals -D filter.vcf -O N1_parsed_lowq.gvcf -ERC GVCF -mbq 1
    Using GATK jar /Users/brigittedelamalene/Desktop/Patrick/gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar
    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /Users/brigittedelamalene/Desktop/Patrick/gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar HaplotypeCaller -I /Volumes/Macintosh HD 1/Jordan/N1_S1_L002_R1_001_dedup.realigned.recalibrated.bam -R /Volumes/Macintosh HD 1/Jordan/human_g1k_v37.fasta -L intervals_2.intervals -D filter.vcf -O N1_parsed_lowq.gvcf -ERC GVCF -mbq 1
    23:24:26.938 INFO NativeLibraryLoader - Loading libgkl_compression.dylib from jar:file:/Users/brigittedelamalene/Desktop/Patrick/gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar!/com/intel/gkl/native/libgkl_compression.dylib
    23:24:29.283 INFO HaplotypeCaller - ------------------------------------------------------------
    23:24:29.284 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.0.11.0
    23:24:29.284 INFO HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
    23:24:34.294 INFO HaplotypeCaller - Executing as [email protected] on Mac OS X v10.14.1 x86_64
    23:24:34.297 INFO HaplotypeCaller - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_191-b12
    23:24:34.302 INFO HaplotypeCaller - Start Date/Time: 12 December 2018 23:24:26 GMT
    23:24:34.303 INFO HaplotypeCaller - ------------------------------------------------------------
    23:24:34.303 INFO HaplotypeCaller - ------------------------------------------------------------
    23:24:34.313 INFO HaplotypeCaller - HTSJDK Version: 2.16.1
    23:24:34.313 INFO HaplotypeCaller - Picard Version: 2.18.13
    23:24:34.313 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    23:24:34.314 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    23:24:34.314 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    23:24:34.314 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    23:24:34.315 INFO HaplotypeCaller - Deflater: IntelDeflater
    23:24:34.315 INFO HaplotypeCaller - Inflater: IntelInflater
    23:24:34.316 INFO HaplotypeCaller - GCS max retries/reopens: 20
    23:24:34.316 INFO HaplotypeCaller - Requester pays: disabled
    23:24:34.318 INFO HaplotypeCaller - Initializing engine
    23:24:36.469 INFO FeatureManager - Using codec VCFCodec to read file file:///Users/brigittedelamalene/Desktop/Patrick/gatk-4.0.11.0/filter.vcf
    23:24:36.606 INFO IntervalArgumentCollection - Processing 1484820 bp from intervals
    23:24:36.681 INFO HaplotypeCaller - Done initializing engine
    23:24:36.690 INFO HaplotypeCallerEngine - Tool is in reference confidence mode and the annotation, the following changes will be made to any specified annotations: 'StrandBiasBySample' will be enabled. 'ChromosomeCounts', 'FisherStrand', 'StrandOddsRatio' and 'QualByDepth' annotations have been disabled
    23:24:36.712 INFO HaplotypeCallerEngine - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output
    23:24:36.712 INFO HaplotypeCallerEngine - All sites annotated with PLs forced to true for reference-model confidence output
    23:24:36.743 INFO NativeLibraryLoader - Loading libgkl_utils.dylib from jar:file:/Users/brigittedelamalene/Desktop/Patrick/gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar!/com/intel/gkl/native/libgkl_utils.dylib
    23:24:36.748 WARN NativeLibraryLoader - Unable to find native library: native/libgkl_pairhmm_omp.dylib
    23:24:36.749 INFO PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
    23:24:36.749 INFO NativeLibraryLoader - Loading libgkl_pairhmm.dylib from jar:file:/Users/brigittedelamalene/Desktop/Patrick/gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm.dylib
    23:24:36.803 WARN IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
    23:24:36.805 WARN IntelPairHmm - Ignoring request for 4 threads; not using OpenMP implementation
    23:24:36.805 INFO PairHMM - Using the AVX-accelerated native PairHMM implementation
    23:24:36.842 WARN GATKVariantContextUtils - Can't determine output variant file format from output file extension "gvcf". Defaulting to VCF.
    23:24:36.937 INFO ProgressMeter - Starting traversal
    23:24:36.938 INFO ProgressMeter - Current Locus Elapsed Minutes Regions Processed Regions/Minute
    23:24:48.080 INFO ProgressMeter - 1:97550499 0.2 30 161.6
    23:24:58.202 INFO ProgressMeter - 1:97566267 0.4 100 282.2
    23:25:08.308 INFO ProgressMeter - 1:97625980 0.5 370 707.7
    23:25:19.086 INFO ProgressMeter - 1:97653982 0.7 510 726.0
    23:25:29.144 INFO ProgressMeter - 1:97675099 0.9 610 701.1
    23:25:39.966 INFO ProgressMeter - 1:97693730 1.1 700 666.4
    23:25:50.736 INFO ProgressMeter - 1:97721826 1.2 840 683.0
    23:26:01.014 INFO ProgressMeter - 1:97759599 1.4 1060 756.5
    23:26:11.167 INFO ProgressMeter - 1:97830142 1.6 1420 904.2
    23:26:21.360 INFO ProgressMeter - 1:97921472 1.7 1840 1057.2
    23:26:31.496 INFO ProgressMeter - 1:98009260 1.9 2270 1189.0
    23:26:41.672 INFO ProgressMeter - 1:98108228 2.1 2660 1279.9
    23:26:51.644 INFO ProgressMeter - 1:98226438 2.2 3110 1385.2
    23:27:01.749 INFO ProgressMeter - 1:98312077 2.4 3520 1458.5
    23:27:11.926 INFO ProgressMeter - 3:165519632 2.6 3860 1494.3
    23:27:18.982 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
    23:27:18.982 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
    23:27:18.983 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
    23:27:18.983 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
    23:27:22.014 INFO ProgressMeter - 7:99306250 2.8 4320 1570.2
    23:27:32.018 INFO ProgressMeter - 10:96427248 2.9 4780 1638.1
    23:27:42.073 INFO ProgressMeter - 10:96497052 3.1 5120 1659.3
    23:27:52.212 INFO ProgressMeter - 10:96603823 3.3 5620 1726.8
    23:28:02.339 INFO ProgressMeter - 10:96681338 3.4 6050 1767.3
    23:28:12.422 INFO ProgressMeter - 10:96762233 3.6 6500 1809.9
    23:28:22.607 INFO ProgressMeter - 11:46760000 3.8 6880 1829.2
    23:28:30.591 INFO HaplotypeCaller - 25092 read(s) filtered by: ((((((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter) AND GoodCigarReadFilter) AND WellformedReadFilter)
    25092 read(s) filtered by: (((((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter) AND GoodCigarReadFilter)
    25092 read(s) filtered by: ((((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter)
    25092 read(s) filtered by: (((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter)
    25092 read(s) filtered by: ((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter)
    5759 read(s) filtered by: (((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter)
    5549 read(s) filtered by: ((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter)
    5549 read(s) filtered by: (MappingQualityReadFilter AND MappingQualityAvailableReadFilter)
    5549 read(s) filtered by: MappingQualityReadFilter
    210 read(s) filtered by: NotSecondaryAlignmentReadFilter
    19333 read(s) filtered by: NotDuplicateReadFilter

    23:28:30.592 INFO ProgressMeter - X:153761996 3.9 7109 1825.5
    23:28:30.594 INFO ProgressMeter - Traversal complete. Processed 7109 total regions in 3.9 minutes.
    23:28:30.927 INFO VectorLoglessPairHMM - Time spent in setup for JNI call : 0.104715343
    23:28:30.927 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 6.6614969870000005
    23:28:30.927 INFO SmithWatermanAligner - Total compute time in java Smith-Waterman : 15.56 sec
    23:28:30.927 INFO HaplotypeCaller - Shutting down engine
    [12 December 2018 23:28:30 GMT] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 4.07 minutes.
    Runtime.totalMemory()=437780480

    This is the haplotype caller command and output in full, the desired loci do not all fall within the range of the error messages.

  • SChaluvadiSChaluvadi Member, Broadie, Moderator admin

    @kingpg Thanks for the details and the update! I am conferring with the team on possible troubleshooting and they are looking into it

  • kingpgkingpg Member

    Thanks again @SChaluvadi, I am going to start the process from the beginning again as it seems likely that with the confusion of using multiple devices and my own lack of experience I made an alignment error early on that is causing these problems. I'll make sure to stick to best practices to the letter this time around.

Sign In or Register to comment.