We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
We will be out of the office for a Broad Institute event from Dec 10th to Dec 11th 2019. We will be back to monitor the GATK forum on Dec 12th 2019. In the meantime we encourage you to help out other community members with their queries.
Thank you for your patience!
remove any variant with a ./.

Does anyone know how to remove a whole variant record in which the call for any of the samples is ./. ?
It should be simple but have been trying filter-select and select+jexl for a while now but to no avail.
Best Answers
-
pdexheimer ✭✭✭✭
Get the first working before bothering with the second (which is more complicated by a long shot)
The error fragments you pasted most likely mean that the shell is not dealing with your quotes properly - instead of seeing the entire 'AN = 438' as a single argument, GATK is seeing it broken up by whitespace. I can think of two possibilities: 1) Either you're using some strange (most likely non-Unix) environment, or 2) Something in your aliases/environment (like the 'GenomeAnalysisTK' at the beginning of your command) is broken and not handling the quotes correctly.
Just a thought, but you're not trying to call this from within another program, are you?
Answers
Why don't you select on AN? If you only select the variants where AN=(ploidy*nSamples), it will only grab the records where all of the samples are called
Thanks @pdexheimer Filtering by AN is a good hard-coded solution.
Unfortunately my code:
GenomeAnalysisTK -R local_reference/dm6.fa -T SelectVariants -V gd_Samps_raw_SNPs.vcf -select 'AN = 438' -o aajexeltest.vcf
Is returning:
Invalid argument value '=' at position 8.
Invalid argument value '438' at position 9.
..whether or not I use one 'equals' sign or two. I'm not sure if I'm missing something.
Also the other related query I have is how to remove any variant records than have a 1/1 call for any of the samples.... It would have to a different command to altogether.
Get the first working before bothering with the second (which is more complicated by a long shot)
The error fragments you pasted most likely mean that the shell is not dealing with your quotes properly - instead of seeing the entire 'AN = 438' as a single argument, GATK is seeing it broken up by whitespace. I can think of two possibilities: 1) Either you're using some strange (most likely non-Unix) environment, or 2) Something in your aliases/environment (like the 'GenomeAnalysisTK' at the beginning of your command) is broken and not handling the quotes correctly.
Just a thought, but you're not trying to call this from within another program, are you?
Filtering by AN now works well (on desktop) with:
java -jar -Xmx2g ~/SOFTWARE/GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -R ../reference_sequences/dmel/v6.0/dm6.fa -T VariantFiltration -V gd_Samps_raw_SNPs.vcf --filterExpression "AN < 438" --filterName "lowCallRate" -o aa_marked_test.vcf
I'll try on the cluster in a minute. Suspect you are correct about command syntax error. Not trying to call from another program, just running as a job in SGE.
Thanks ever so much for the help.
I don't suppose you have an answer for my second problem ? i.e. filtering variants that have a 1/1 call in an sample... I've done this in bash but I think it creates problems with the .idx
It will cause problems for the .idx, yes - but you can just delete that and allow GATK to regenerate it next time you need it.
The only way I know to do this in GATK is to iterate over every single sample, which is not something you want to do with 219 samples (--filterExpression 'vc.getGenotype("Sample 1").isHomAlt() | vc.getGenotype("Sample 2").isHomAlt() | ... - that's from memory, so the syntax may be a bit off, but you get the idea). With that many samples, you may be better off just using grep
I found that when running --filterExpression from a shell script (like submitting to SGE). I have to replace double quotes with single quotes in my environment to get it to work.
$JAVA_1_7/java -jar $GATK_DIR/GenomeAnalysisTK.jar \ -T VariantFiltration \ -R $REF_GENOME \ --variant $CORE_PATH/$PROJECT/TEMP/$SM_TAG"_QC_RAW_OnBait_SNV_ANNOTATED.vcf" \ --filterExpression 'QD < 2.0' \ --filterName 'QD' \ --filterExpression 'MQ < 30.0' \ --filterName 'MQ' \ --filterExpression 'FS > 40.0' \ --filterName 'FS' \ --filterExpression 'MQRankSum < -12.5' \ --filterName 'MQRankSum' \ --filterExpression 'ReadPosRankSum < -8.0' \ --filterName 'ReadPosRankSum' \ --filterExpression 'DP < 8.0' \ --filterName 'DP' \ --logging_level ERROR \ -o $CORE_PATH/$PROJECT/TEMP/$SM_TAG"_QC_FILTERED_OnBait_SNV.vcf"
@kurt @pdexheimer Both ['] and ["] work fine for filter expression, from my mac, but neither are accepted on the linux SGE cluster. I've run other gatk commands in the same script prior to this error, and gatk is the only program used to generate the vcf files. I've tried moving the filtration command up the work-flow, in case the input vcf was malformed. I've re-typed GenomeAnalysisTK, checked the shell script headers and module environment, and made a shorter version of the whole script. I've simplified in the input file names.
#!/bin/sh
$ -N dummy_gatk
$ -pe openmp 1
$ -S /bin/sh
$ -cwd
$ -j y
$ -q bioinf.q
. /etc/profile.d/modules.sh
module load jre/1.7.0_25
module load gatk/3.4-0
GenomeAnalysisTK -R ./local_reference/dm6.fa \
-T SelectVariants -V comb_SNP_indel.vcf \
-restrictAllelesTo BIALLELIC \
-XL dmel6_repMask.gatk.intervals \
-o bi_gd_Samps.vcf
GenomeAnalysisTK -R ./local_reference/dm6.fa \ ##### problem line
-T VariantFiltration -V bi_gd_Samps.vcf \
--filterExpression 'AN < 438' --filterName 'lowCallRate' \
-o bi_gd_Samps_lowCall_marked.vcf
In the log is that there are no INFO-time-HelpFormatter rows before the error lines start which seems like a gatk loading problem (normal for previous and subsequent commands). I've asked sysadmin if there's anything odd regarding expressions on SGE.
A USER ERROR has occurred (version 3.4-46-gbc02625):
##### ERROR MESSAGE: Invalid argument value '<' at position 8.
##### ERROR Invalid argument value '438' at position 9.
@Will_Gilks There is a tool called bcftools, which can do this:
The problem in this case was space.
Bad:
--filterExpression "QUAL < 1000.0" --filterName "LowQual"
Good:
--filterExpression "QUAL<1000.0" --filterName "LowQual"
@Will_Gilks
Wow. Thank you for posting your solution
-Sheila