About the haplotypes in hg19.fasta and the GTF file

Dear all,

I am new to GATK and want to call variants for RNA-seq according to the best practice (https://software.broadinstitute.org/gatk/documentation/article.php?id=3891). I have three questions.

  1. The UCSC.hg19.fasta downloaded from ftp://ftp.broadinstitute.org/bundle/hg19/ contains several haplotypes: chr4_ctg9_hap1, chr6_apd_hap1, chr6_ssto_hap7, chr17_ctg5_hap1, ... while the STAR manual (2.5.3a) says "> Generally, patches and alternative haplotypes should not be included in the genome." (section 2.2.1, page 5). So should I use the hg19 fasta without halotypes downloaded from UCSC as instead?

  2. The STAR manual also points out that "using annotations is highly recommended whenever they are available. " The current 2-pass mapping part of best practice does not refer to GTF file. I searched the forum and found the relevant topic (https://gatkforums.broadinstitute.org/gatk/discussion/comment/33853#Comment_33853); I think the GTF file should be involved to improve the mapping quality.

  3. In the step of "Split'N'Trim and reassign mapping qualities" and "variant calling" the "ref.fasta" are used:

    java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I input.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -o output.vcf

Do they mean the "hg19.fasta"?

Many thanks,
Stanley

Tagged:

Issue · Github
by Sheila

Issue Number
2886
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
sooheelee

Best Answer

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @swlx08
    Hi Stanley,

    I need to check with the team on 1 and 2.

    3) Yes, the ref.fasta is used to represent the reference FASTA file you are using. Sometimes people use hg38 or b37, so we just use "ref.fasta"

    -Sheila

  • Dear Sheila,

    Thank you very much for your answer. May I ask one more question?

    The current 2-pass mapping steps used the default options; however, to analyze different expression I need to use the command as below:

    STAR --runMode alignReads --runThreadN 4 --genomeDir ~/hg19/star_ref/ --readFilesIn a_1.fq a_2.fq --quantMode TranscriptomeSAM GeneCounts --outSAMtype BAM Unsorted --outSAMreadID Number --outSAMattrIHstart 0 --outSAMattributes All --outFilterMismatchNoverReadLmax 0.05 --outSAMstrandField intronMotif --outFilterIntronMotifs RemoveNoncanonical --outFilterMultimapNmax 1 --outReadsUnmapped Fastx --outFileNamePrefix ./a

    So may I use my self_defined options instead of the default ones in both the 1st and 2nd mapping?

    Best regards,
    Stanley

Sign In or Register to comment.