Raw VCF has significantly more records than the raw SNP and raw INDEL VCFs combined

Nilaksha

I successfully ran a variant calling pipeline and got a raw.vcf file from HaplotypeCaller that contains exactly 4,484,688 records. I then processed it further by extracting SNPs and INDELs into separate VCF files using SelectVariants, which yielded a VCF of raw SNPs with 85,239 records and a VCF of raw INDELs with 14,501 records. Together that is only 99,740 records, so about 4.38 million records (4,384,948) from the raw VCF appear in neither file. Could you please explain what happens during this step and why so many records are dropped?
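To narrow down where the unaccounted records go, one option is to tally the raw VCF by record type before running SelectVariants. The following is a minimal sketch, not GATK's own classification logic: the function names `classify` and `tally` are hypothetical, it assumes an uncompressed, tab-delimited VCF, and it treats records whose only ALT allele is `.`, `*` or `<NON_REF>` as carrying no variation (the latter would only appear if the file was produced in GVCF mode, which is an assumption here).

```python
from collections import Counter


def classify(ref, alt):
    """Roughly classify one VCF record from its REF and ALT columns.

    This is an approximation for counting purposes only; GATK's own
    variant typing may differ in edge cases.
    """
    # Ignore placeholder alleles that do not describe a concrete variant.
    alts = [a for a in alt.split(",") if a not in (".", "*", "<NON_REF>")]
    if not alts:
        return "NO_VARIATION"
    if any(a.startswith("<") for a in alts):
        return "SYMBOLIC"
    kinds = set()
    for a in alts:
        if len(a) == len(ref):
            kinds.add("SNP" if len(ref) == 1 else "MNP")
        else:
            kinds.add("INDEL")
    # Records with both SNP-like and INDEL-like ALT alleles are "MIXED".
    return kinds.pop() if len(kinds) == 1 else "MIXED"


def tally(lines):
    """Count records per rough type, skipping header and blank lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        counts[classify(fields[3], fields[4])] += 1
    return counts
```

For example, `with open("raw.vcf") as fh: print(tally(fh))` would print the counts per category. If most records fall outside `SNP` and `INDEL`, that would account for the gap between the raw VCF and the two extracted files.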


