Large vcf files after running the GATK SNV + indel pipeline

dodausp, Denmark, Member
edited November 2019 in Ask the GATK team

Simple question: why do I get large VCF files after filtering variant calls?
I am following your Best Practices pipeline (SNV + indel), with some minor modifications suggested in another thread (with bug fixes for Mutect2).

In brief (omitting LearnReadOrientationModel and Funcotator), here is the pipeline for one whole-exome sequencing sample, run in a Docker container:

gatk Mutect2 -R my_data/reference/hg19/hg19.fa -I my_data/input/CRF.sorted.bam -O my_data/output/CRF_unfiltered.vcf --independent-mates
gatk GetPileupSummaries -I my_data/input/CRF.sorted.bam -V my_data/reference/ExAc_r1/ExAC_hg19_BiallelicOnly.r1.sites.vep.vcf.gz -L my_data/reference/ExAc_r1/ExAC_hg19_BiallelicOnly.r1.sites.vep.vcf.gz -O my_data/output/CRFpileups.table
gatk CalculateContamination -I my_data/output/CRFpileups.table --tumor-segmentation my_data/output/CRFsegments.tsv -O my_data/output/CRFcontamination.table
gatk FilterMutectCalls -R my_data/reference/hg19/hg19.fa -V my_data/output/CRF_unfiltered.vcf --contamination-table my_data/output/CRFcontamination.table --tumor-segmentation my_data/output/CRFsegments.tsv -O my_data/output/CRF_filtered.vcf

Here are the sizes of the input and of each output file named on the command lines:
CRF.sorted.bam (12.9GB)
CRF_unfiltered.vcf (432.6MB)
CRFpileups.table (1.1MB)
CRFcontamination.table (80B)
CRFsegments.tsv (989B)
CRF_filtered.vcf (558.7MB)

The CRF_filtered.vcf file won't even open in a text editor (e.g. Atom) for inspection. Also, although not included here, the Funcotator-annotated output file was very large (4.6GB) as well.
Sorry for the basic question, but is there anything missing here?

Thanks a lot in advance.

Edit: I notice that in the tutorial posted here, the output is not gz-compressed. Can one still designate a .vcf.gz output file?
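
For readers with the same question, here is a minimal sketch of how the FILTER column (column 7 of a VCF) could be inspected without opening the file in an editor. It assumes the usual FilterMutectCalls behaviour of marking records in FILTER rather than deleting them, which would explain why the filtered file is no smaller than the unfiltered one. The miniature VCF, its records, and the example_* file names are made up for illustration; only standard awk, grep, sort and gzip are used.

```shell
#!/bin/sh
# Hypothetical miniature VCF standing in for CRF_filtered.vcf (all records are made up).
tmpdir=$(mktemp -d)
vcf="$tmpdir/example_filtered.vcf"

printf '##fileformat=VCFv4.2\n' > "$vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$vcf"
printf 'chr1\t100\t.\tA\tT\t.\tPASS\t.\n' >> "$vcf"
printf 'chr1\t200\t.\tG\tC\t.\tweak_evidence\t.\n' >> "$vcf"
printf 'chr2\t300\t.\tC\tT\t.\tPASS\t.\n' >> "$vcf"
printf 'chr2\t400\t.\tT\tA\t.\tgermline;weak_evidence\t.\n' >> "$vcf"

# Tally records by FILTER value (column 7), skipping header lines that start with '#'.
filter_counts=$(awk -F'\t' '!/^#/ { n[$7]++ } END { for (f in n) print n[f] "\t" f }' "$vcf" | sort -k2)
echo "$filter_counts"

# Keep the header plus PASS records only. This is a plain-text subset, not a GATK step.
pass_vcf="$tmpdir/example_pass.vcf"
awk -F'\t' '/^#/ || $7 == "PASS"' "$vcf" > "$pass_vcf"
pass_count=$(grep -vc '^#' "$pass_vcf")
echo "PASS records: $pass_count"

# A gzip-compressed copy can be inspected the same way by streaming it through gzip -dc.
gzip -c "$vcf" > "$vcf.gz"
gz_total=$(gzip -dc "$vcf.gz" | grep -vc '^#')
echo "Records in compressed copy: $gz_total"
```

On the real data, the same awk tally could be pointed at my_data/output/CRF_filtered.vcf. As for compressed output, GATK tools generally write a compressed VCF when the -O file name ends in .vcf.gz, though that is worth confirming against the documentation for the GATK version in use.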

Best Answer


  • dodausp, Denmark, Member

    Thank you @bhanuGandham!
    I am only puzzled that, although I am running these WES data on a CPU with 16 GB of memory, it is taking the same amount of time you describe here (1-2 days).

    Many thanks again!

  • bhanuGandham, Cambridge MA, Member, Administrator, Broadie, Moderator

    Hi @dodausp

    Additional info: Usually, with 1 CPU and ~2 GB of memory, a whole-genome sample at 100x coverage (tumor + normal) takes about 1-2 days to run through the entire Mutect2 pipeline.

    This is the minimum CPU requirement. Adding more memory will not change the run time.