Large vcf files after running the GATK SNV + indel pipeline
Simple question: Why do I get large vcf files after filtering variant calls?
I am following your best practice pipeline (SNV + indel), with some minor modifications suggested in another thread (with bug fixes for Mutect2).
In brief (leaving out the Funcotator step), here is the pipeline for one whole-exome sequencing sample (run in a Docker container):
gatk Mutect2 -R my_data/reference/hg19/hg19.fa -I my_data/input/CRF.sorted.bam -O my_data/output/CRF_unfiltered.vcf --independent-mates
gatk GetPileupSummaries -I my_data/input/CRF.sorted.bam -V my_data/reference/ExAc_r1/ExAC_hg19_BiallelicOnly.r1.sites.vep.vcf.gz -L my_data/reference/ExAc_r1/ExAC_hg19_BiallelicOnly.r1.sites.vep.vcf.gz -O my_data/output/CRFpileups.table
gatk CalculateContamination -I my_data/output/CRFpileups.table -O my_data/output/CRFcontamination.table
gatk FilterMutectCalls -R my_data/reference/hg19/hg19.fa -V my_data/output/CRF_unfiltered.vcf --contamination-table CRFcontamination.table --tumor-segmentation CRFsegments.tsv -O my_data/output/CRF_filtered.vcf
Here is the size of each output file generated (only those specified on the command lines):
The CRF_filtered.vcf file won't even open in a text editor (e.g. Atom) for inspection. Also, although not included here, the Funcotator-annotated output file was very large as well (4.6 GB).
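A file too large for an editor can still be summarized from the shell. Below is a minimal sketch, assuming a POSIX shell with awk and sort available; `count_filters` is a hypothetical helper name, not a GATK tool. It tallies the FILTER column (column 7 of a VCF record), which would show how many records in the filtered file are marked PASS and how many still carry a filter name.

```shell
# Hypothetical helper (not part of GATK): tally the FILTER column of an
# uncompressed VCF. Prints one "<filter value><TAB><count>" line per value.
count_filters() {
    # Skip header lines (those starting with '#'), then count column 7.
    awk -F '\t' '!/^#/ { n[$7]++ }
                 END   { for (f in n) printf "%s\t%d\n", f, n[f] }' "$1" | sort
}
```

For the file above this would be called as `count_filters my_data/output/CRF_filtered.vcf`. To look at just the first few records instead, `grep -v '^#' my_data/output/CRF_filtered.vcf | head` avoids reading the whole file.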
Sorry for the lay question. Is anything missing here?
Thanks a lot in advance.
Edit: I noticed that in the tutorial posted here, the output is not gz-compressed. Can one still designate a vcf.gz output file?
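Whatever the answer for the GATK side, a gz-compressed VCF can still be inspected from the shell, because bgzip-compressed files are readable by ordinary gzip. The sketch below assumes only a POSIX shell with gzip and grep; `vcf_cat` and `vcf_record_count` are hypothetical helper names introduced for illustration.

```shell
# Hypothetical helper: stream a VCF to stdout whether it is plain text
# or gzip/bgzip-compressed (decided by the .gz file extension).
vcf_cat() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac
}

# Hypothetical helper: count the data records (non-header lines) in a VCF.
vcf_record_count() {
    vcf_cat "$1" | grep -vc '^#'
}
```

With these, `vcf_record_count my_data/output/CRF_unfiltered.vcf` and the same call on the filtered file would show whether the two hold the same number of records. An existing VCF can also be compressed after the fact with `bgzip` and indexed with `tabix -p vcf`, assuming htslib is installed in the container.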