The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Get notifications!


You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

Did you remember to?


1. Search using the upper-right search box, e.g. using the error message.
2. Try the latest version of tools.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

Did we ask for a bug report?


Then follow instructions in Article#1894.

Formatting tip!


Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ``` ) each to make a code block as demonstrated here.

Jump to another community
Picard 2.9.0 is now available. Download and read release notes here.
GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

Select SNP and reference monomorphic sites from a large callset

zzqzzq ChinaMember Posts: 62
edited February 8 in Ask the GATK team

Dear all,

Recently, I have called genotype using GenotypeGVCFs with --includeNonVariantSites. I want to pull out the sites with one allele or . in ALT column. I found the following command will also report the sites with * or more than one allele listed in ALT column. I also tried to add --restrictAllelesTo BIALLELIC, but the non-variant sites (sites with . listed in ALT column) will be excluded. I do not know how to set these parameters to meet my requirements.

-T SelectVariants \
--selectTypeToInclude NO_VARIATION \
--selectTypeToInclude SNP \

Hope for your help.

Best
Zhuqing

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator Posts: 4,694 admin

    @zzq
    Hi Zhuqing,

    I think you can use --selectTypeToExclude. If you add --selectTypeToExclude INDEL --selectTypeToExclude MIXED --selectTypeToExclude MNP --selectTypeToExclude SYMBOLIC that should give you what you are looking for.

    -Sheila

  • zzqzzq ChinaMember Posts: 62

    Hi @Sheila

    Thank you. I have run with --selectTypeToExclude, but the results also report the sites with * or multiple allele listed in ALT column (like following). The version of GATK 3.5-0-g36282e4

    NC_005044.2     1229    .       A       .       .       .       AN=424;DP=39131 GT:DP:RGQ       0/0:2:6 0/0:43:99       0/0:49
    NC_005044.2     1230    .       A       .       .       .       AN=426;DP=39198 GT:DP:RGQ       0/0:2:6 0/0:43:99       0/0:49
    NC_005044.2     1232    .       A       .       .       .       AN=442;DP=39386 GT:DP:RGQ       0/0:2:6 0/0:43:99       0/0:49
    NC_005044.2     1233    .       A       .       .       .       AN=446;DP=39415 GT:DP:RGQ       0/0:3:9 0/0:43:99       0/0:49
    NC_005044.2     1234    .       A       .       .       .       AN=446;DP=39512 GT:DP:RGQ       0/0:3:9 0/0:43:99       0/0:49
    NC_005044.2     1237    .       G       A,*     605604.72       .       AC=5,32;AF=0.011,0.071;AN=448;BaseQRankSum=0.979;Clipp
    NC_005044.2     1238    .       A       .       .       .       AN=416;DP=39597 GT:DP:RGQ       0/0:3:9 0/0:43:99       0/0:49
    NC_005044.2     1239    .       A       .       .       .       AN=448;DP=38210 GT:DP:RGQ       0/0:3:9 0/0:43:99       0/0:49
    NC_005044.2     1240    .       T       .       .       .       AN=448;DP=38200 GT:DP:RGQ       0/0:3:9 0/0:43:99       0/0:49
    NC_005044.2     1241    .       T       .       .       .       AN=448;DP=38201 GT:DP:RGQ       0/0:3:9 0/0:43:99       0/0:49
    NC_005044.2     1242    .       A       .       .       .       AN=448;DP=38192 GT:DP:RGQ       0/0:3:9 0/0:43:99       0/0:49
    NC_005044.2     5558    .       G       .       .       .       AN=448;DP=13759 GT:DP:RGQ       0/0:8:24        0/0:44:99
    NC_005044.2     5559    .       G       A,C     19917   .       AC=150,8;AF=0.336,0.018;AN=446;BaseQRankSum=0.361;ClippingRank
    NC_005044.2     5560    .       T       .       .       .       AN=448;DP=13810 GT:DP:RGQ       0/0:9:27        0/0:44:99
    NC_005044.2     5561    .       T       .       .       .       AN=448;DP=13790 GT:DP:RGQ       0/0:9:27        0/0:44:99
    NC_005044.2     5562    .       T       .       .       .       AN=440;DP=13786 GT:DP:RGQ       0/0:9:27        ./.:45:0
    NC_005044.2     5563    .       G       .       .       .       AN=448;DP=13774 GT:DP:RGQ       0/0:9:27        0/0:45:99
    NC_005044.2     5564    .       G       .       .       .       AN=446;DP=13766 GT:DP:RGQ       0/0:10:30       0/0:45:99
    NC_005044.2     5565    .       C       .       .       .       AN=440;DP=13752 GT:DP:RGQ       0/0:10:30       ./.:45:0

    Maybe I have missed something.

    Best
    Zhuqing

  • shleeshlee CambridgeMember, Broadie, Moderator Posts: 524 admin
    edited February 17

    Hi @zzq,

    Can you reproduce this result with the latest GATK (v3.7)? If so, would you mind sharing a snippet of your data so we can debug this? Instructions for sending us data are at http://gatkforums.broadinstitute.org/gatk/discussion/1894/how-do-i-submit-a-detailed-bug-report. Thanks.

    Post edited by shlee on
  • zzqzzq ChinaMember Posts: 62

    Dear @shlee

    Yes, the same problem reproduced with the latest GATK.
    The command

    java -Xmx100g /bin/GATK3.7/GenomeAnalysisTK.jar \
        -T SelectVariants \
        -R ref.fa \
        --selectTypeToExclude INDEL --selectTypeToExclude MIXED --selectTypeToExclude MNP --selectTypeToExclude SYMBOLIC \
        -V $VCF \
        -o $OUT

    It will keep the MNP sites like following
    1       5969    .       C       .       84.60   .   
    1       6033    .       C       T       623.14  .   
    1       6085    .       A       .       72.87   .   
    1       6093    .       A       .       72.87   .   
    1       6097    .       T       .       66.87   .   
    1       6357    .       C       A,G     4303.95 .   
    1       6374    .       C       .       45.87   .   
    1       6417    .       C       .       51.87   .   
    1       6418    .       G       .       51.87   .   
    1       6508    .       C       .       42.87   .   
    1       6540    .       C       .       51.87   .   
    1       6547    .       G       A       2612.19 .   
    1       6597    .       C       .       66.87   .   
    1       6753    .       A       .       72.80   .   
    1       6799    .       C       .       77.90   .   
    1       6890    .       G       .       72.86   .

    Best
    Zhuqing
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie Posts: 11,643 admin

    @zzq none of these would be considered a MNP in GATK -- this list is a mix of SNPs (including a multiallelic SNP) and reference sites. A MNP would be e.g. AAT -> ACC.

    Geraldine Van der Auwera, PhD

  • zzqzzq ChinaMember Posts: 62
    Hi @Geraldine_VdAuwera

    thanks, but how can i filter these sites using SelectVariants. Hope your help.

    best
    Zhuqing
  • SheilaSheila Broad InstituteMember, Broadie, Moderator Posts: 4,694 admin

    @zzq
    Hi Zhuqing,

    Ah, I see. I think you are asking to remove sites that have more than one alternate allele. In that case, you need to use --restrictAllelesTo BIALLELIC.

    -Sheila

  • zzqzzq ChinaMember Posts: 62

    Hi @Sheila

    Yes, I have tried with --restrictAllelesTo BIALLELIC, but the non-variant sites (sites with . listed in ALT column) will also be excluded. But
    I want to keep them.

    Best
    Zhuqing

  • shleeshlee CambridgeMember, Broadie, Moderator Posts: 524 admin

    Hi @zzq,

    The team may have more suggestions for using SelectVariants or JEXL expressions. In the meanwhile, perhaps you can move forward by using awk to exclude rows where column 5 contains a comma. This should then give you rows where the ALT allele is singular or a ..

    For example, try this command:

    cat input.vcf | awk '{if ($5 !~ /,/) print $0}' > output.vcf
    

    Be sure to review the resulting file to make sure it is as you expect.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator Posts: 4,694 admin

    @zzq
    Hi,

    So, adding --selectTypeToInclude NO_VARIATION and --restrictAllelesTo BIALLELIC does not get you what you want? If so, what esle do you want to keep and/or not keep?

    -Sheila

  • zzqzzq ChinaMember Posts: 62

    Dear @Sheila

    Sorry for the late reply. When using your recommended parameters, there will be nothing in the output. I just want to keep the sites with one allele or . in ALT column.

    Best
    Zhuqing

  • zzqzzq ChinaMember Posts: 62

    Dear @Sheila

    I think following can get I want. Thank you.

    -T SelectVariants \
       -R ref.fa \
       --selectTypeToExclude INDEL --selectTypeToExclude MIXED --selectTypeToExclude MNP --selectTypeToExclude SYMBOLIC \
       --selectTypeToInclude NO_VARIATION \
       --maxNOCALLfraction 0 --excludeFiltered \
       -V $VCF \
       -o $OUT.Monomorphic.vcf.gz
    -T SelectVariants \
        -R ref.fa \
        --selectTypeToExclude INDEL --selectTypeToExclude MIXED --selectTypeToExclude MNP --selectTypeToExclude SYMBOLIC \
        --selectTypeToInclude SNP \
        --maxNOCALLfraction 0 --excludeFiltered \
        --restrictAllelesTo BIALLELIC \
        -V $VCF \
        -o $OUT.SNP.vcf.gz
    -T CombineVariants \
        -R ref.fa \
        -V $OUT.Monomorphic.vcf.gz \
        -V $OUT.SNP.vcf.gz \
        -o $OUT.Clean.vcf.gz \
        --assumeIdenticalSamples

    Best wishes
    Zhuqing
Sign In or Register to comment.