How to get rid of false positives? And troubles with VQSR

RoosS (Nijmegen, The Netherlands), Member

For the last couple of months we have been trying to analyze an exome data set of ~1500 samples.
Within the data there are two sets: ~1200 control samples sequenced at one site, and ~300 case samples from another sequencing facility, but with the same capture kit; both sets have similar coverage.

We have thrown all BAMs on one big pile and followed the Best Practices, performing base recalibration, joint genotyping, and variant recalibration on the whole dataset together. When we performed downstream analysis on the data, we found a lot of false positives (having, for example, 20% frequency in the controls and never in the cases, or vice versa).

We initially thought that the main problem was the difference in sequencing location. Therefore, to solve this, we performed filtering on DP>8, GQ>20, HWE, average GQ>35, and call rate>85% on the 2 separate batches, as recommended by: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-125
After that we repeated the VQSR on the whole dataset together, and the results already improved significantly; however, we still had some strange results. When we looked at the BAM files of people carrying these variants, mapping seemed to be the main problem in some of them. So we then added the MQCap option to the VQSR, which further improved the results.
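For concreteness, the per-genotype and per-site thresholds above (DP > 8, GQ > 20, call rate > 85%, average GQ > 35; the HWE test is omitted here) can be sketched as follows. This is a toy illustration of ours, not code from GATK or the cited paper, and the tuples are simplified stand-ins for real VCF records:

```python
# Toy sketch of the pre-VQSR hard filters described above:
# genotype-level DP/GQ filtering, then site-level call-rate and mean-GQ checks.
# In a real pipeline these would be applied per batch with a VCF-aware tool.

def passes_site(genotypes, min_dp=8, min_gq=20, min_call_rate=0.85, min_mean_gq=35):
    """genotypes: list of (DP, GQ) tuples, one per sample; None means no call."""
    # Genotype-level filter: drop calls failing the DP/GQ thresholds.
    called = [g for g in genotypes
              if g is not None and g[0] > min_dp and g[1] > min_gq]
    # Site-level filters: call rate, then mean GQ of the surviving calls.
    call_rate = len(called) / len(genotypes)
    if call_rate <= min_call_rate:
        return False
    mean_gq = sum(g[1] for g in called) / len(called)
    return mean_gq > min_mean_gq

# A site where most samples are well covered passes:
good = [(30, 99)] * 90 + [(5, 10)] * 10
# A site with poor genotypes in half the samples fails the call-rate check:
bad = [(30, 99)] * 50 + [(5, 10)] * 50
print(passes_site(good), passes_site(bad))  # True False
```

Note that a batch-specific site failure like the `bad` example above is exactly the kind of artifact that can masquerade as a case/control frequency difference.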

1. Is the way we set up this experiment correct, or is there anything we should change? Are we being too stringent/not stringent enough? We still have false positives, but what is the chance of missing something by filtering before the VQSR?

The next question is about using different versions of dbSNP to perform the VQSR. In our whole pipeline we initially used dbSNP version 144. However, a lot of the variants from the downstream analysis could not be confirmed with Sanger sequencing. When we perform the VQSR with dbSNP 138, most of these false positive variants are no longer in the dataset (except 2 that remain).

2. How is it possible that using a different dbSNP version has such a big impact on the output of the VQSR? And which dbSNP version should we use? (Because 2 variants (and most likely more) still remain that cannot be verified but are associated with our trait in downstream analysis.)

If we look at the HaplotypeCaller bamout files of a person with and a person without one of the variants that could not be confirmed with Sanger sequencing, we see something peculiar. There seems to be something strange going on in how HaplotypeCaller looks at this site in the individual supposedly carrying the variant (sample 2) compared to one that does not carry it (sample 1), even though the coverage in the samples is the same and the region looks very messy in both. We could not confirm this variant in sample 2 with Sanger sequencing.

3. What is going on? And could this explain the false positives we find?

We also have some questions about mixing versions: the haplotype calling and joint genotyping were performed with GATK 3.4 by a collaborator, and repeating them would be very cumbersome.
The VQSR was performed with GATK 3.5 on our local cluster and could be repeated with 3.7.

4. How advisable is it to mix versions of GATK? And will the new options of 3.7 still function on g.vcfs that were generated with 3.4?

We’re kind of running out of ideas, and we do not have much experience with running this kind of analysis. After consulting the forum a lot, we did not find the answers to our questions, so we hoped someone could answer them.
Thank you very much in advance.

Command:

        java -Xmx32G -jar /opt/GATK-3.5/GenomeAnalysisTK.jar \
            -T VariantRecalibrator \
            -R {path}References/hs_ref_GRCh37.p5_all_contigs.fasta \
            -input {path}/Complete_withXY_sorted.vcf \
            -nt 6 \
            -resource:hapmap,known=false,training=true,truth=true,prior=15.0 {path}References/hapmap_3.3.b37_withchr.vcf \
            -resource:omni,known=false,training=true,truth=true,prior=12.0 {path}References/1000G_omni2.5.b37_withchr.vcf \
            -resource:1000G,known=false,training=true,truth=false,prior=10.0 {path}References/1000G_phase1.snps.high_confidence_withchr.b37.vcf \
            -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 {path}References/snp144-All.vcf \
            -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an InbreedingCoeff \
            -MQCap 70 \
            -tranche 100.0 -tranche 99.9 -tranche 99.8 -tranche 99.0 -tranche 90.0 \
            -recalFile {path}/Complete_withXY_snps_dbsnp144.recal \
            -tranchesFile {path}/Complete_withXY_snps_dbsnp144.tranches \
            -rscriptFile {path}/Complete_withXY_snps_dbsnp144.R \
            -mode SNP

        #indels
        java -Xmx8G -jar /opt/GATK-3.5/GenomeAnalysisTK.jar \
            -T VariantRecalibrator \
            -R {path}References/hs_ref_GRCh37.p5_all_contigs.fasta \
            -input {path}/Complete_withXY_sorted.vcf \
            -recalFile {path}/Complete_withXY_indels_dbsnp144.recal \
            -tranchesFile {path}/Complete_withXY_indels_dbsnp144.tranches \
            -rscriptFile {path}/Complete_withXY_indels_dbsnp144.R \
            -nt 3 \
            -resource:mills,known=false,training=true,truth=true,prior=12.0 {path}References/Mills_and_1000G_gold_standard.indels.b37.vcf \
            -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 {path}References/snp144-All.vcf \
            -an QD -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an InbreedingCoeff \
            -MQCap 70 \
            --maxGaussians 4 \
            -tranche 100.0 -tranche 99.9 -tranche 99.8 -tranche 99.0 -tranche 90.0 \
            -mode INDEL

    #apply recalibration
    java -jar /opt/GATK-3.5/GenomeAnalysisTK.jar \
        -T ApplyRecalibration \
        -R {path}References/hs_ref_GRCh37.p5_all_contigs.fasta \
        -input {path}/Complete_withXY_sorted.vcf \
        --ts_filter_level 99.9 \
        -recalFile {path}/Complete_withXY_snps_dbsnp144.recal \
        -tranchesFile {path}/Complete_withXY_snps_dbsnp144.tranches \
        -o {path}/Complete_withXY_snps_dbsnp144.vcf \
        -mode SNP

    java -jar /opt/GATK-3.5/GenomeAnalysisTK.jar \
        -T ApplyRecalibration \
        -R {path}References/hs_ref_GRCh37.p5_all_contigs.fasta \
        -input {path}/Complete_withXY_sorted.vcf \
        --ts_filter_level 99.9 \
        -recalFile {path}/Complete_withXY_indels_dbsnp144.recal \
        -tranchesFile {path}/Complete_withXY_indels_dbsnp144.tranches \
        -o {path}/Complete_withXY_indels_dbsnp144.vcf \
        -mode INDEL    

Answers

  • Sheila (Broad Institute), Member, Broadie, Moderator

    @RoosS

    Hi,

    1) In general, your workflow looks fine. The MQCap option should definitely have some impact. As for the additional filters, they are not our recommendations, but they seem reasonable; we have not evaluated them, so we can neither endorse nor reject them. Please note, there will always be some variants that are "hard".

    There is a balance between sensitivity and specificity that is hard to strike: the greater the sensitivity, the lower the specificity, and vice versa. VQSR is one filtering technique we find appropriate. If you are seeing many false positives, you may need a more stringent sensitivity level when choosing the tranche. It is up to you to determine which is most appropriate for your goals. If you notice problems in some samples and not others, look at quality control metrics. Is there a reason you are applying the hard filters before VQSR? Is there any difference if you apply them after VQSR? How are you evaluating the false positive rate?
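    To illustrate the tradeoff, here is a rough conceptual sketch (ours, not GATK internals) of how a truth-sensitivity tranche becomes a score cutoff: the VQSLOD threshold is chosen so that the requested fraction of truth-set variants is retained, so a lower sensitivity level gives a stricter filter.

```python
# Conceptual sketch of truth-sensitivity tranches: the cutoff is the lowest
# VQSLOD score still needed to retain the requested fraction of truth sites.
# Every call below the cutoff gets filtered; lower ts -> stricter filtering.

def vqslod_cutoff(truth_scores, ts_level):
    """Return the VQSLOD threshold retaining ts_level fraction of truth sites."""
    ranked = sorted(truth_scores, reverse=True)        # best scores first
    keep = max(1, int(round(ts_level * len(ranked))))  # truth sites to keep
    return ranked[keep - 1]                            # lowest score still kept

truth = [10.0, 8.0, 6.0, 4.0, 2.0, 0.0, -2.0, -4.0, -6.0, -8.0]
print(vqslod_cutoff(truth, 0.99))  # permissive cutoff: nearly all truth kept
print(vqslod_cutoff(truth, 0.90))  # stricter cutoff: sensitivity traded away
```

    This is the essence of what the --ts_filter_level argument to ApplyRecalibration controls; the toy scores above are invented for illustration.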

    2) dbSNP version should not have an effect unless you are basing sensitivity level on it. Can you confirm you have run the exact same commands but only changed the dbSNP version to get the different results? Can you confirm this happens with the latest version?

    3) That is a very messy region. Those regions tend to be hard to produce confident calls in. You are also trying to call at the edge of an exon which makes it harder. From the looks of it, there may be a structural variant there. Notice the major difference in coverage in the region. HaplotypeCaller is not designed to call such large structural variants. You may try GenomeSTRiP.

    4) This article should help.

    -Sheila

  • RoosS (Nijmegen, The Netherlands), Member

    Hi Sheila,

    thank you for your answers!
    1) We decided to do hard filtering because, when we ran an association analysis on the data, we found a lot of variants that were not present in one dataset but present at quite a high minor allele frequency in the other. This didn't seem right to us, and after hard filtering before the VQSR according to the publication, most of these variants were removed. We could of course try the hard filter after the VQSR as well; do you think this is preferable?
    So we assess the false positive rate by looking at the association analysis, and we have also tried to confirm variants with Sanger sequencing.

    2) We have indeed run the same command as stated, only with the different dbSNP version. We haven't checked whether this also happens with the newest version, and probably should not, as the answer to our question 4 was that we should not mix GATK versions; we are thus stuck with version 3.4 for now.

    3) Is there a way to exclude these regions? We have now included flanking regions in our analysis instead of the exact regions the SureSelect kit is targeting; would you advise this? Or is it better to focus on the targeted regions specifically, to remove these regions?

    4) Ok then we have to stick with 3.4 for now.

    Thanks again!

    Roos

  • zejunyan (Edinburgh), Member

    Hi, Sheila

    I am trying to run VQSR on my callsets for the chicken genome. As far as I know, unlike the resources available for human, there are no resources for chicken that can be used to train my model. In this case, can I use hard-filtered callsets to train my model? I do have dbSNP as a resource, but I got the following errors:

    MESSAGE: Invalid argument value 'training=true,' at position 8.

    ERROR Invalid argument value 'truth=true,' at position 9.
    ERROR Invalid argument value 'prior=12.0' at position 10.
    ERROR Invalid argument value 'training=false,' at position 13.
    ERROR Invalid argument value 'truth=false,' at position 14.
    ERROR Invalid argument value 'prior=2.0' at position 15.
    ERROR Invalid argument value '../../../../snp_db/vcf_chr_1-28_30_32.vcf.gz' at position 16.

    My commandline: java -jar ../../../../tools/GenomeAnalysisTK.jar -T VariantRecalibrator -R ../../../../pdata/testdata/RefGenome/GCF_000002315.4_Gallus_gallus-5.0_genomic.fa -input genotyped.trio.cohort.g.vcf -resource:HFSNPs, known=false, training=true, truth=true, prior=12.0 /mnt/Pella/Processing/NGS/zyan/CA_TREE/realn_temp/AVSEQ1300040/gVCF/Chrom_filtered/final.NC_006088.4.biallelic.g.vcf -resource:dbsnp, known=true, training=false, truth=false, prior=2.0 ../../../../snp_db/vcf_chr_1-28_30_32.vcf.gz -an DP -an QD -an FS -an SOR -an MQ -an MQRankSum -an ReadPosRankSum -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -recalFile recal_SNP.recal -tranchesFile recal_SNP.tranches -rscriptFile recal_SNP_plots.R

    Cheers

    Dr Yan

  • Sheila (Broad Institute), Member, Broadie, Moderator

    @RoosS
    Hi Roos,

    1) So, you do not expect major differences in the two datasets? Why did you decide to call variants on the two different datasets separately? You could always try the hard filtering after VQSR and see if that makes a difference.

    3) Ah, if you mean you have included off-target sequence in your variant calling, that may indeed have an effect. Have a look at this thread.
    We recommend using -L when running on exome data. Did you use an intervals list when calling variants? If not, you can use SelectVariants with -L to select only the sites in your targeted regions.
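    In essence, selecting on-target sites is an interval-overlap operation; a minimal sketch of what SelectVariants -L does (with simplified tuples standing in for VCF records and an interval list) could look like:

```python
# Toy illustration of restricting calls to target intervals, the essence
# of SelectVariants -L. Real code would parse a VCF and an interval list;
# the tuples here are simplified stand-ins.

def on_target(variants, intervals):
    """variants: (chrom, pos) tuples; intervals: (chrom, start, end), 1-based inclusive."""
    return [v for v in variants
            if any(c == v[0] and s <= v[1] <= e for c, s, e in intervals)]

targets = [("1", 1000, 1200), ("2", 5000, 5100)]
calls = [("1", 1050), ("1", 1350), ("2", 5050), ("3", 99)]
print(on_target(calls, targets))  # [('1', 1050), ('2', 5050)]
```

    Off-target calls like the ('1', 1350) record above come from poorly captured flanking sequence and are a common source of spurious sites.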

    I hope those suggestions help.

    -Sheila

  • RoosS (Nijmegen, The Netherlands), Member

    Hi Sheila,

    1) Sorry for the misunderstanding: we did call the variants in both datasets together and only afterwards compared the frequencies. We do expect to see variation between the datasets (cases vs. controls), but not as extreme as currently observed.
    3) We did use an interval list, but it was the target intervals plus 200 bp of flanking sequence. I am now running it with the target intervals specifically!

    I am now repeating the analysis with GATK 3.4 and the specified targets on the unfiltered data, to see what happens after implementing your suggestions and whether hard filtering is still necessary afterwards. Will let you know how it turns out!

    Thank you for your help!

    Best,
    Roos

  • mmokrejs (Czech Republic), Member

    Hi @zejunyan ,
    you broke the command syntax with the extra spaces after the commas in the -resource arguments; kindly return to https://software.broadinstitute.org/gatk/documentation/article.php?id=2805 .
    Martin

  • zejunyan (Edinburgh), Member

    @mmokrejs

    Thank you very much! My command line works now. But, as far as I know, there is no truth-site training resource I can use for training the recalibration model for chicken (do you know of one?). Can I use the hard-filtered VCF files for this purpose? I know hard filtering is an alternative to this approach, but is it valid to do so?

    Bests

    Dr Yan

  • mmokrejs (Czech Republic), Member
    edited April 2017

    @zejunyan

    can I use the hard-filtered vcf files for this purpose?

    As far as I have read the GATK docs, yes, and there are several documents on this. Just read them all, including the comments in their associated discussions.

  • zejunyan (Edinburgh), Member

    @mmokrejs

    If I need to create my own truth/training resource, as mentioned in this forum, I need to do a first round of variant calling and then select the sites with the highest quality scores. In this case, what is the recommended threshold for filtering on the highest QUAL scores?
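    The bootstrapping idea described above can be sketched as follows; this is a toy illustration, and the 10% cutoff is our assumption for the example, not a GATK recommendation:

```python
# Sketch of a bootstrapped truth set: after a first-round callset, keep
# only the top-scoring fraction of sites as a provisional truth resource.
# The top_fraction value is illustrative and should be tuned against
# whatever validation data is available for the organism.

def top_sites(sites, top_fraction=0.10):
    """sites: list of (site_id, QUAL); return the best-scoring fraction."""
    ranked = sorted(sites, key=lambda s: s[1], reverse=True)
    keep = max(1, int(top_fraction * len(ranked)))
    return ranked[:keep]

first_round = [("site%d" % i, q) for i, q in
               enumerate([1500, 1200, 900, 600, 450, 300, 200, 120, 60, 30])]
print(top_sites(first_round))  # [('site0', 1500)]
```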

    Bests

    Dr yan

  • mmokrejs (Czech Republic), Member
    edited April 2017

    @RoosS
    I think you should use hg38 and run recent enough 'bwa mem' in ATL-aware mode against ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/hs38DH.fa with its indexes andhs38DH.fa.alt as created by bwa.kit bundle . In my experience I accomodated in this way 3-4% of exome reads from my samples, and more importantly, I had reads assigned to primary chromosomes no matter the AS was a bit lower than when aligned to some _alt location.

    From my experience with a non-human reference I can also say that 'bwa mem' is too lazy to do the alignment properly and typically soft- or hard-clips regions which did not align easily. Provided I trimmed away sequencing adapters before the alignment step, there was no reason to soft- or hard-clip any sequence. Indeed, BBMap did a much better job and aligned whole 250 nt reads from a RapidSeq run. Unfortunately, BBMap is not able to work with ALT-containing references.

    So, instead of switching to hg38 with ALT contigs, you could try BBMap, which performs a global alignment and is more thorough than 'bwa mem'.

    Or at least, check the problematic reads in IGV in the region you showed and try to understand why they were aligned there, whether there are alternative locations recorded in the XA tag (and how many), and check the CIGAR string for soft and hard clips.
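    A quick way to quantify the clipping is to compute the clipped fraction from the CIGAR string; the helper below is our own sketch, independent of IGV or GATK:

```python
# Flag reads whose CIGAR strings show heavy soft (S) or hard (H) clipping.
import re

def clipped_fraction(cigar):
    """Fraction of the read's original bases that are soft- or hard-clipped."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    clipped = sum(int(n) for n, op in ops if op in "SH")
    # Original read length: sequence-consuming ops plus hard clips.
    length = sum(int(n) for n, op in ops if op in "MIS=XH")
    return clipped / length if length else 0.0

print(clipped_fraction("100M"))    # 0.0 -> fully aligned
print(clipped_fraction("60S90M"))  # 0.4 -> suspiciously heavy soft clipping
```

    Reads where a large fraction of the sequence is clipped away, or that carry many XA alternatives, are exactly the kind that produce messy pileups like the one shown.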
