The current GATK version is 3.8-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

#### ☞ Got a problem?

1. Search using the upper-right search box, e.g. using the error message.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

#### ☞ Formatting tip!

Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ` ) each to make a code block as demonstrated here.

GATK version 4.beta.3 (i.e. the third beta release) is out. See the GATK4 beta page for download and details.

USAMember

Hello

I am working with canine genomes and donwloaded the reference file (ftp://ftp.ensembl.org/pub/release-73/fasta/canis_familiaris/dna/Canis_familiaris.CanFam3.1.73.dna.toplevel.fa.gz) and variation VCF file (ftp://ftp.ensembl.org/pub/release-73/variation/vcf/canis_familiaris/Canis_familiaris.vcf.gz) from Ensembl. I was able to index the reference files and perform the alignment and Mark duplicate steps of the pipeline for the canine genomes.

I extracted SNPs that were marked either deleteions or duplications from the variation VCF file using grep -E "^#|deletion|insertion" Canis_familiaris.vcf > Canis_familiaris.indels.vcf to use in the Indel Realigner steps. The header of the file is as follows:
##fileformat=VCFv4.1 ##fileDate=20130826 ##source=ensembl;version=73;url=http://e73.ensembl.org/canis_familiaris ##reference=ftp://ftp.ensembl.org/pub/release-73/fasta/canis_familiaris/dna/ ##INFO=<ID=TSA,Number=0,Type=String,Description="Type of sequence alteration. Child of term sequence_alteration as defined by the sequence ontology project."> ##INFO=<ID=E_MO,Number=0,Type=Flag,Description="Multiple_observations.http://www.ensembl.org/info/docs/variation/data_description.html#evidence_status"> ##INFO=<ID=E_ESP,Number=0,Type=Flag,Description="Exome_Sequencing_Project.http://www.ensembl.org/info/docs/variation/data_description.html#evidence_status"> ##INFO=<ID=E_1000G,Number=0,Type=Flag,Description="1000Genomes.http://www.ensembl.org/info/docs/variation/data_description.html#evidence_status"> ##INFO=<ID=E_HM,Number=0,Type=Flag,Description="HapMap.http://www.ensembl.org/info/docs/variation/data_description.html#evidence_status"> ##INFO=<ID=E_Freq,Number=0,Type=Flag,Description="Frequency.http://www.ensembl.org/info/docs/variation/data_description.html#evidence_status"> ##INFO=<ID=E_C,Number=0,Type=Flag,Description="Cited.http://www.ensembl.org/info/docs/variation/data_description.html#evidence_status"> ##INFO=<ID=dbSNP_131,Number=0,Type=Flag,Description="Variants (including SNPs and indels) imported from dbSNP [remapped from build CanFam2.0]"> 1 8252 rs8962840 C CT . . dbSNP_131;TSA=insertion 1 9702 rs8457244 CA C . . dbSNP_131;TSA=deletion 1 18289 rs8444620 GT G . . dbSNP_131;TSA=deletion 1 36381 rs8471229 GT G . . dbSNP_131;TSA=deletion 1 52452 rs8469855 AT A . . dbSNP_131;TSA=deletion 1 55767 rs8285977 G GA . . dbSNP_131;TSA=insertion 1 60065 rs8723538 TC T . . dbSNP_131;TSA=deletion 1 62067 rs8395051 TA T . . dbSNP_131;TSA=deletion

However, when I run the filrst step of Indel Realignment
-T RealignerTargetCreator -nt 16 -R canFam3.1_bwa075/Canis_familiaris.CanFam3.1.73.dna.toplevel.fa -I AF23.rg.prmdup.bam -l INFO -S SILENT -o AF23.indel.intervals --known canFam3.1_bwa075/Canis_familiaris.indels.vcf

I get the error message:

I guess it is because the VCF file is missing the standard header (#CHROM POS ID ...). I was wondering if there was any way I could add the header line to the VCF file to be able to use in GATK? Would it also be possible to use the original VCF with all variations in the Indel Realignment steps or would you recommend I subset it for insertions/deletions as I did in this case?

Thank you very much for your help and suggestions!!!

Yours sincerely
Shanker Swaminathan

Tagged:

You can add in the line manually, but be sure to check there's nothing else wrong with your file. And you can use the original VCF, no need to subset.

• USAMember

Hi Geraldine
Thanks for the reply! Is there an easy way to add the line manually on the command line without having to actually open the file and insert the line? Could you also please explain as what I need to check to make the file is ok?
Thanks
Shanker

Hi Shanker,

You can do that with some Unix magic, but not through the GATK, so I recommend you look up how to do it somewhere like StackOverflow.com.

To validate your VCF you can use ValidateVariants.

• USAMember

Hi Geraldine
Thank you for the suggestion! I created a file with the header information and using the command sed '12r header.txt' Canis_familiaris.vcf > Canis_familiaris_header.vcf was able to add the header information.
I then attempted to run ValidateVariants using the following command
-R canFam3.1_bwa075/Canis_familiaris.CanFam3.1.73.dna.toplevel.fa -T ValidateVariants --variant Canis_familiaris_header.vcf --warnOnErrors

##### ERROR MESSAGE: Input files canFam3.1_bwa075/TEST/Canis_familiaris_header.vcf and reference have incompatible contigs: Relative ordering of overlapping contigs differs, which is unsafe.

I am not sure why I got the error message since I downloaded the reference and variation files from Ensembl release 73.
I then tried to reorder the VCF using a perl program vcfsorter.pl (https://code.google.com/p/vcfsorter/) and reran the above command on the reordered VCF. I received the warnings
WARN 10:54:12,527 ValidateVariants - ***** the REF allele is incorrect for the record at position 1:8113, fasta says A vs. VCF says T ***** WARN 10:54:12,528 ValidateVariants - ***** the REF allele is incorrect for the record at position 1:8290, fasta says C vs. VCF says G ***** WARN 10:54:12,529 ValidateVariants - ***** the REF allele is incorrect for the record at position 1:10374, fasta says T vs. VCF says A ***** .... Found 36208 records with failures.

I don't understand why there might be many mismatches. I downloaded the reference and variation files from Ensembl release 73. Could you please advise if I am doing anything incorrectly? How would you suggest I proceed?

Thanks again for the help and suggestions!

Shanker

Hi Shanker,

It's difficult to comment on this since we don't work with those genomes ourselves. I know that in humans there are several different versions of builds of the reference genome; perhaps there is the same for dogs, and somehow you got mismatching files? I would recommend checking with Ensembl.

• USAMember

Hi Geraldine

Thank you for the suggestion. I checked with Ensembl and it appears that the variation file contained variants that should not have been present in the file as they had failed QC. They give me a list of variants to exclude and I removed those variants from the file. On rerunning the ValidateVariants tool, it came back with no failures.

I then performed the Indel Realignment steps. When I tried to do the first step of Base Recalibration I received the following error:

##### ERROR stack trace

java.lang.ArrayIndexOutOfBoundsException
at java.util.Arrays.copyOfRange(Unknown Source)
at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:228) at org.broadinstitute.sting.gatk.traversals.TraverseReadsNano$TraverseReadsMap.apply(TraverseReadsNano.java:216)

##### ERROR ------------------------------------------------------------------------------------------

The command used for the base recalibration is:

Program Args: -T BaseRecalibrator -I AF23.rg.prmdup.jir.bam -R canFam3.1_bwa075/Canis_familiaris.CanFam3.1.73.dna.toplevel.fa -knownSites canFam3.1_bwa075/Canis_familiaris.CanFam3.1.73.variations.sorted.clean.recode.vcf -o AF23.rg.prmdup.recal_data.grp

Could you please tell me what could be causing this error and how to correct it?

Thank you once again from your help!

Shanker

Hi Shanker,

I'm glad your original problem was resolved. Regarding your new problem, can you please validate your BAM file (with Picard's ValidateSAMFile)? If the file is valid, then we may have a bug; if so please post it as a new discussion thread since it is unrelated to the original problem that is the topic of this thread. Thanks!

• USAMember

Hi Geraldine

I ran Picard ValidateSamFile on the file which had been Indel Realigned as follows:

Here is the output from the summary file:

## HISTOGRAM java.lang.String Error Type Count ERROR:INVALID_CIGAR 497409 ERROR:MATES_ARE_SAME_END 4316 ERROR:MISMATCH_FLAG_MATE_NEG_STRAND 6120 ERROR:MISMATCH_FLAG_MATE_UNMAPPED 78510 ERROR:MISMATCH_MATE_ALIGNMENT_START 10

Do you think there is a problem with a BAM file? What would you suggest we do next?

I can start this post on another thread if you would like, I just wanted to make sure that this may be a different problem.

Thanks

Shanker

This line

ERROR:INVALID_CIGAR 497409

is alarming. It means that many of your reads have a malformed CIGAR string. You need to either fix them or run with -rf BadCigar so that GATK can skip them. If you do that, make sure to check the read filter summary report (in the console output) to see what proportion of data is getting excluded.

• USAMember

Hi Geraldine

Could you please tell me what could be causing this problem and how I should go about trying to fix it? Do I have to realign the data again?

Thanks
Shanker