Select INDELs using an interval file

Hi, I used SelectVariants (GATK 4.0) to extract INDELs by passing the start positions of the desired loci through the -L option, using an interval file with start positions in GATK format. However, the tool extracts the desired loci plus extra INDELs that I did not specify. It clearly selects the INDELs that I want, but why are these extra loci selected as well?

Answers

  • The following is the format of the intervals that I specified.

    Chr01:14736
    Chr01:18598
    Chr01:18684
    Chr01:44409
    Chr01:44636
    Chr01:44683
    Chr01:45107
    Chr01:47832
    Chr01:49529
    Chr01:49532
    Chr01:51390
    Chr01:71288
    Chr01:72934
    Chr01:73022
    Chr01:77479
    Chr01:139798
    Chr01:140125
    Chr01:165932
    Chr01:171305
    Chr01:172306
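    For comparing tools on the same loci, single-position entries like the ones above can be rewritten as the tab-separated chromosome and position lines that vcftools `--positions` reads. The following is a minimal sketch; the function name `gatk_to_vcftools_positions` is hypothetical, and it assumes every entry is a single `contig:position` pair rather than a range.

```python
def gatk_to_vcftools_positions(lines):
    """Turn 'contig:position' entries into tab-separated 'contig<TAB>position' lines."""
    out = []
    for raw in lines:
        entry = raw.strip()
        if not entry:
            continue  # skip blank lines
        # Split on the last colon so contig names containing ':' still work.
        contig, _, pos = entry.rpartition(":")
        # A range such as Chr01:100-200 has no single-position equivalent.
        if not contig or not pos.isdigit():
            raise ValueError(f"not a single-position interval: {entry!r}")
        out.append(f"{contig}\t{pos}")
    return out


if __name__ == "__main__":
    example = ["Chr01:14736", "Chr01:18598", "Chr01:18684"]
    print("\n".join(gatk_to_vcftools_positions(example)))
```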

  • SkyWarrior, Turkey, Member

    What is your select command and are you using some kind of interval padding?

  • gatk --java-options "-Xmx30g" SelectVariants \
    -R ${ref} \
    -L ${PBS_O_WORKDIR}/final_data_set_per_GATK_format_sorted_INDELs.intervals \
    -V ${inputPath}/combined_contigs_genotype_hardfiltered_biallelic_INDELs.vcf \
    -O ${outputPath}/INDEL_truthdataset.vcf \
    --select-type-to-include INDEL \
    --restrict-alleles-to BIALLELIC \
    --exclude-filtered

  • bshifaw, Member, Broadie, Moderator, admin

    Would you mind providing an example of INDELs that were output but should not have been, given the specified intervals? A snippet of your input and output data would help.
    Was the interval format post from earlier the entire content of the interval file?
    What reference file are you using? Does it contain the Chr01 format?

  • Was the interval format post from earlier the entire content of the interval file? - No, this is not the entire file.
    What reference file are you using? Does it contain the Chr01 format? - The reference is Populus trichocarpa v3.0, and it contains the Chr01 format.
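    The contig-name question can also be checked mechanically. The sketch below compares contig names in an interval list against the first column of a FASTA index (`.fai`) for the reference. The function name `missing_contigs` and the example index line (including its length and offsets) are hypothetical and for illustration only.

```python
def missing_contigs(interval_lines, fai_lines):
    """Return interval contig names that do not appear in the .fai index."""
    # The first tab-separated column of a .fai line is the sequence name.
    ref_names = {line.split("\t", 1)[0].strip() for line in fai_lines if line.strip()}
    wanted = set()
    for raw in interval_lines:
        entry = raw.strip()
        if not entry:
            continue
        # 'Chr01:14736' -> 'Chr01'; a bare contig name is kept as-is.
        contig = entry.rpartition(":")[0] or entry
        wanted.add(contig)
    return sorted(wanted - ref_names)


if __name__ == "__main__":
    fai = ["Chr01\t1000000\t7\t60\t61"]  # hypothetical index line
    print(missing_contigs(["Chr01:14736", "chr1:18598"], fai))
```

    An empty result means every contig named in the interval file exists in the reference index under exactly the same spelling.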

  • Just to add and clarify, the interval-based extraction of INDELs works with vcftools.

    vcftools --vcf ${inputPath}/combined_contigs_genotype_improved_hardfiltered_biallelic_INDELs.vcf \
    --out ${outputPath}/INDEL_truthdataset_using_vcftools.vcf \
    --positions ${PBS_O_WORKDIR}/final_truthdata_set_per_VCFTOOLS_format_sorted_INDELs_2.intervals \
    --recode \
    --recode-INFO-all
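
    One possible explanation for the difference, not confirmed anywhere in this thread, is that the two tools match records differently: an interval filter may keep any record whose reference span overlaps the requested position, while a positions list keeps only records whose POS equals it. The sketch below illustrates that distinction on made-up records. The records, function names, and the assumption that a record spans POS through POS + len(REF) - 1 are all illustrative, not taken from the poster's data.

```python
def overlaps(pos, ref, target):
    """True if the reference span [pos, pos + len(ref) - 1] covers target."""
    return pos <= target <= pos + len(ref) - 1


def exact(pos, target):
    """True only if the record starts exactly at target."""
    return pos == target


# Hypothetical (POS, REF, ALT) records: a 9 bp deletion starting upstream
# of the target, and a 1 bp deletion starting at the target itself.
records = [(14730, "ACGTACGTAC", "A"), (14736, "TG", "T")]
target = 14736

by_overlap = [r for r in records if overlaps(r[0], r[1], target)]
by_position = [r for r in records if exact(r[0], target)]

print("overlap match:", by_overlap)
print("exact match:  ", by_position)
```

    If this is what is happening, the extra INDELs in the SelectVariants output would be records that start upstream of a listed position but whose deleted bases extend across it. Comparing the POS column of the extra records against the interval file would confirm or rule this out.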
