GATK4 pipeline in easy bash scripts, please

Hi, I asked this question a while ago, and more than once. I know there is the wonderful WDL platform and the FireCloud tooling to run things in parallel and check this and that. But for someone who is used to a series of simple BASH commands, could you please kindly provide an example script like the one shown here: https://gencore.bio.nyu.edu/variant-calling-pipeline/?

Right after I found the page above, I realized that it has not been updated for GATK4, and I would hate to use a pipeline based on an outdated version of the engine.

I will still say "Thank You So Much, GATK". But after almost a year, I still could not get GATK running on my own server, even though a million pieces of documentation, tutorials, and slide decks can be googled everywhere.

Answers

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    Hello @jiehuang001

    A complete list of up-to-date tutorials can be found here.

    The page you pointed to is a little out of date; I would run through a pipeline using the tutorials and Best Practices guidelines on our page instead. Some of our documentation is out of date, but it will usually say so.

    Good luck in your research.

  • Thanks, Adelaide. Again, I am looking for a pipeline like this one: https://informatics.fas.harvard.edu/whole-genome-resquencing-for-population-genomics-fastq-to-vcf.html. I tried to follow the instructions to use WDL, but it required me to install too many things and it did not work. I wish someone from the GATK team were listening to our needs ....

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    You can run the commands from the tutorials as written in the code blocks (you can copy and paste directly) by downloading and installing GATK on your local machine. The instructions for installing GATK can be found here.

    Did you look at the tutorials? Copying the code from the code blocks is exactly the same as using the website you posted; that website actually copied its code blocks from our documentation.

  • Thanks, Adelaide! Sorry if I gave you the impression that I am so lazy or clueless that I do not even know how to copy and paste a well-written example script and simply run it...

    For the tutorial link that you pointed out (https://software.broadinstitute.org/gatk/documentation/topic.php?name=tutorials), the first two are on “Sensitively detect copy ratio alterations and allelic segments”, which is about germline CNV calling using the latest GATK 4.1, I assume. I did not realize that I needed to go through so many steps in order to call CNVs. I will try to follow these 8 steps then, from the first step (“Collect raw counts data with PreprocessIntervals and CollectFragmentCounts”) to the last step (“Plot modeled segments and allelic copy ratios with PlotModeledSegments”).

    Then, the next tutorials listed on that page include “Consolidate GVCFs for joint calling with GenotypeGVCFs” and “Filter variants either with VQSR or by hard-filtering”, etc. I think this covers the usual “Best Practices” SNV calling. As far as I know, for this part we usually start with a bwa command, then “gatk MarkDuplicates”, then “BaseRecalibrator”, then “ApplyBQSR”, then “BaseRecalibrator” again, then “HaplotypeCaller”, etc. This is what I mean by a clear example, like the ones I showed you on the NYU and Harvard Informatics websites. However, right now I don’t see such an end-to-end script in the tutorials. It would be great if you could share one, based on the latest GATK4.

    The tutorial from NYU is really much easier to follow (https://gencore.bio.nyu.edu/variant-calling-pipeline/), if you don’t mind me saying so. You are the people who made GATK and produced so much nice documentation, so I really wish I could get a clearly written, easy-to-follow example pipeline/script from you. Again, I also tried WDL, but it requires installing so many things and the scripts become so long; it is really not working for me...

    Your help would be deeply appreciated!
    Jie
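
    For reference, the chain of steps named above (bwa, MarkDuplicates, BaseRecalibrator, ApplyBQSR, HaplotypeCaller) can be sketched as a short bash script. All file and resource names below are placeholders, the options are trimmed to the essentials, and this is a sketch rather than an official GATK pipeline:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Sketch only: ref.fasta, dbsnp.vcf.gz, and the FASTQ names are placeholders.
# Wrapped in a function so it can be sourced and inspected before use.
run_germline_pipeline() {
  local ref=ref.fasta sample=sample1

  # 1. Align and sort
  bwa mem -t 8 -R "@RG\tID:${sample}\tSM:${sample}\tPL:ILLUMINA" \
      "$ref" ${sample}_R1.fastq.gz ${sample}_R2.fastq.gz \
    | samtools sort -o ${sample}.sorted.bam -
  samtools index ${sample}.sorted.bam

  # 2. Mark duplicates
  gatk MarkDuplicates -I ${sample}.sorted.bam -O ${sample}.dedup.bam \
      -M ${sample}.dup_metrics.txt

  # 3. Base quality score recalibration
  gatk BaseRecalibrator -R "$ref" -I ${sample}.dedup.bam \
      --known-sites dbsnp.vcf.gz -O ${sample}.recal.table
  gatk ApplyBQSR -R "$ref" -I ${sample}.dedup.bam \
      --bqsr-recal-file ${sample}.recal.table -O ${sample}.recal.bam

  # 4. Call variants per sample in GVCF mode
  gatk HaplotypeCaller -R "$ref" -I ${sample}.recal.bam \
      -O ${sample}.g.vcf.gz -ERC GVCF
}
```

    To use it, substitute real paths and then call run_germline_pipeline from the same script.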

  • manolismanolis Member ✭✭
    edited February 22

    Hi @jiehuang001,

    In general, if you go into the .wdl files and read the order of the "call" statements (the workflow, which is what you need, I think) and then copy the commands from each "task", you can create your bash pipeline...

    Here you can find more pipelines.

    Also, in your GATK directory you can find more:

    /gatk-4.1.0.0/scripts/cnv_wdl$ tree
    .
    ├── cnv_common_tasks.wdl
    ├── germline
    │   ├── cnv_germline_case_scattered_workflow.wdl
    │   ├── cnv_germline_case_workflow.wdl
    │   ├── cnv_germline_cohort_workflow.wdl
    │   └── README.md
    └── somatic
        ├── cnv_somatic_oncotator_workflow.wdl
        ├── cnv_somatic_pair_workflow.wdl
        ├── cnv_somatic_panel_workflow.wdl
        └── README.md

  • Thanks! Instead of extracting the BASH commands from the .wdl files, maybe I should try the WDL way again, so that we are always on the same page.

    Best regards,
    Jie

  • manolismanolis Member ✭✭
    edited February 22

    In my case I cannot use WDL pipelines for technical reasons, so I'm working with bash... I still think WDL is the best solution, because in many cases the pipelines are "ready-to-use"...

    Best!

  • Well, although you "cannot use" WDL, you still think it is "ready-to-use". That is cool :-)

    I just feel BASH is so clean and it can be easily embedded into other pipelines....

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    I felt the urge to step in; however, there are several issues with the bash script that you are requesting.

    There are 3 important rules to remember.

    Rule number 1: Never fully trust anyone's script (regardless of language, bash or WDL).
    Rule number 2: In case of confusion, remember rule number 1.
    Rule number 3: There is none...
    

    Anyone working in this field knows their own requirements for proper data analysis, and no two people in this area work under the same conditions. That's why even the Broadies tell everyone that the Best Practices are not something to be followed blindly, but critically, and even modified carefully at the first sign of trouble with your own setup.

    Writing bash commands or scripts is not magic. The Best Practices pages include basic commands that are 100% compatible with bash scripts. It is up to you to select and combine the protocols, methods, and practices to get the best results.

    My suggestion is to start with a simple script that only aligns and sorts your reads. Then you can go ahead and add more QC and analysis steps to your own pipeline.

    I myself have a pipeline consisting of bash and perl scripts that I built over years of practice and experience, and I don't share it with anyone, simply because I don't want to be responsible for anyone's losses in the process (not that I am a person who keeps everything to himself).
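
    In the spirit of the suggestion above, a minimal align-and-sort starting block; the file names and read group are placeholders, and the function wrapper just keeps the sketch inert until it is pointed at real data:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Sketch with placeholder inputs (ref.fasta, sample1_R1/R2.fastq.gz);
# adjust threads and the read-group fields to your own run.
align_and_sort() {
  local ref=ref.fasta sample=sample1
  bwa mem -t 4 -R "@RG\tID:run1\tSM:${sample}\tPL:ILLUMINA" \
      "$ref" ${sample}_R1.fastq.gz ${sample}_R2.fastq.gz \
    | samtools sort -o ${sample}.sorted.bam -
  samtools index ${sample}.sorted.bam
}
```

    Once this runs cleanly on your data, QC and further analysis steps can be appended one at a time.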

  • jiehuang001jiehuang001 Member
    edited March 2

    Dear all:

    I appreciate all the kind responses. I understand that technical work involves trying and retrying, and that it might be difficult for someone else to provide a simple BASH script that I need only click to finish. So, I will make the effort to figure out the technical scripts myself.

    Now, I have come up with a list of "scientific" questions. Answers to these would help me greatly. The list is rather long and some of the questions might be naive. I did spend a lot of time on them myself but still could not fully resolve them. Therefore, I do need help, and I would deeply appreciate it.

    1. First, on the reference genome. The raw data for the reference genome is FASTA, while the raw data for a re-sequenced genome is FASTQ. I understand that “Q” means quality, but is “FASTA” an acronym for something? Why does the FASTA file come with a .fai index file, while the re-sequenced genome's FASTQ file does not? I used two resources to download the reference genome: one from Illumina (https://support.illumina.com/sequencing/sequencing_software/igenome.html) and the other from the Broad Institute (https://software.broadinstitute.org/gatk/download/bundle). I found that the reference genome FASTA files are all DNA sequence, without quality information. Is this because all DNA sequence in FASTA passed QC? The FASTA files have 50 or 60 bases per line, while the re-sequenced genome's FASTQ file has, for example, 150 columns per line. Why 50 (or 60)?

    2. Still on the reference genome. The .fai file from Illumina has only 25 lines, for chr1-22, X, Y, and M. However, the .fai file from the GATK download (human_g1k_v37.fasta.fai) has 84 lines. Besides chr1-22, X, Y, and M, there are 59 records starting with “GL”, such as “GL000217.1”. These “GL” entries are contigs that can help the alignment, correct? There are even two more lines in the human_g1k_v37_decoy.fasta.fai file (86 lines in total): besides the “GL” records, the two extra records are “NC_007605” and “hs37d5”. So, are these two the “decoy”? I actually don’t know where these “contigs” or “decoys” come from, since there are only chr1-22, X, Y, and M in the human genome. Previously, I only worked with VCF files, while the FASTQ and BAM files were handled by a data production team. Apparently, I did not see these “contigs” or “decoys” in the VCF files.

    3. Now, on BAM files. I got both FASTQ and BAM files from sequencing vendors, and since I usually trust the data, I did not actually run GATK to convert the FASTQ to BAM files myself. Some vendors used the Illumina reference genome, while others used the GATK reference genome. Fortunately, so far they are all based on GRCh37 or hg19. Now I need to use GATK for some downstream processing, such as running “HaplotypeCaller”, and I want to use just one reference genome file for everything, to make life easier. This raises two issues. First, some BAM files include a “chr” prefix while others don’t. Is there a quick way to remove or add the “chr” prefix in a BAM file? For a VCF file, I could use “bcftools annotate --rename-chrs” to do this. Someone recommended converting the BAM to SAM and then editing it with tools such as sed; I hope there is a better way, since the SAM file would be too big. Second, when I use GATK 4.1 or other tools such as ERDS to call CNVs, they require tens of WGS samples besides my own BAM file. Now, let’s say that the reference WGS data from the 1000 Genomes Project is based on the Illumina reference genome, therefore without the extra contigs, but my own BAM file is based on the GATK reference genome, therefore with the contigs. How can I resolve this without re-mapping the BAM files so that they are all based on the same reference genome?

    4. For a BAM file generated from paired-end sequencing, the first part of the first line looks like this: “A00199:163:HCKFTDMXX:1:1159:25427:11678 177 1 9996 0 83S8M1D59M X 155260267”. Does this mean that for this pair, one read is on chr1 while the other is on chrX? How could the two "ends" of a sequenced DNA fragment/library span two different chromosomes?

    5. I am trying to better understand the BAM header. The BAM documentation says that “LN” means “Reference sequence length”, and that it ranges from 1 to (2^31 - 1). This is the header of a BAM file for my own sample; why does it need the “reference” genome length? I used “samtools view -H my.bam” and saw that the first 3 lines are as follows: @HD VN:1.5 SO:coordinate; @SQ SN:1 LN:249250621; @SQ SN:2 LN:243199373. Does this mean that in this BAM file chr1 starts at line 249250621 and chr2 at line 243199373? BTW, I can use a text editor to view the .fasta.fai file, but how do I view the .bam.bai file?

    6. I have a female BAM file, and when I use “samtools view my.bam Y”, it is surprising that chrY data comes out. The vendor told me to use “bcftools view -t ^Y” to remove the chrY data. I don't think that is a good approach; after all, this female could indeed be XXY. So, what is the standard procedure in GATK for removing false chrY data from females? And suppose this person actually is XXY: is there a way to detect XXY, or a trisomy, from the FASTQ or BAM file?

    7. Finally, on CNV calling. CNVs include duplications and deletions; INDELs include deletions and insertions. So both have “deletions”. The only difference is that a deletion in a CNV is said to be >50bp while a deletion in an INDEL is <50bp. Then there are also STRs, which are more like CNV duplications but at a much smaller size. So, why couldn’t we use the same GATK Best Practices to find a deletion of 49bp (an INDEL) vs. 51bp (a CNV)? Similarly, can we use the GATK 4.1 CNV caller to call STRs?

    Thank you & best regards,
    Jie
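
    On question 3 above: the contig rename itself does not require a BAM-to-SAM round trip, since `samtools reheader` can rewrite just the header. A sketch; `my.bam` and the header file name are placeholders, and note this only helps when the two references differ in contig naming alone, not in sequence content or length:

```shell
# Sketch: strip the "chr" prefix from contig names in a BAM header
# without converting the whole BAM to SAM (placeholder file names):
#
#   samtools view -H my.bam | sed 's/SN:chr/SN:/' > new_header.sam
#   samtools reheader new_header.sam my.bam > my.nochr.bam
#
# The sed expression itself, demonstrated on one @SQ header line:
printf '@SQ\tSN:chr1\tLN:249250621\n' | sed 's/SN:chr/SN:/'
# prints the line with the prefix stripped: @SQ  SN:1  LN:249250621
```

    Renaming cannot reconcile two genuinely different references, which is why re-mapping is the safe answer whenever contig lengths disagree.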

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    You have some very fundamental questions here that need answers beyond GATK-specific ones.

    1- FASTA and FASTQ are two different things. FASTQ is the generic format for HTS reads; it includes the machine name, read name, read coordinates, and read-pair information, along with read quality. FASTA is a generic format for any type of sequence file and contains the reference sequence only; it could be DNA, RNA, or protein. There is no quality information because it is a reference format.

    2- Some vendors use the default UCSC hg19 without the additional random contigs, so the index file only includes the 24 chromosomes plus the mitochondria. The UCSC genome includes a chr prefix in front of all contigs. On the other hand, there is b37, widely used by the 1000 Genomes Project as a reference, which does not have the chr prefix and includes the random unmapped contigs by default. The choice of reference genome is up to you. If you are not going to map your own reads and produce a proper BAM file yourself, you have to rely on what the vendor provides, and you have to obtain the same reference genome resources your vendor uses; otherwise your downstream analyses will go in vain, with lots of errors and incompatibility messages.

    It is usually not standard practice to call variants from decoys or unmapped contigs unless that is your utmost priority. Those unmapped contigs and decoys in b37 are used to siphon off reads that do not align properly to the main contigs. When it comes to hg38, things change substantially: hg38 includes alt-mapping information that may help resolve variants in alternate contigs.

    One thing to remember: if you are working with mitochondrial variants, you need to select a reference that includes the most appropriate mitochondrial sequence. UCSC hg19 has an outdated chrM reference that does not fit most downstream software for proper annotation, whereas b37 and hg38 include the most accepted and most compatible mitochondrial reference. I have seen some vendors using a hybrid hg19 modified to carry the same mitochondrial reference as b37 and hg38; however, that itself is not compatible with IGV, so you have to generate your own genome file for IGV to analyze sequences mapped to hybrid hg19.

    Best practice: never fully trust any vendor's reference genome. Go with a standard set that fits all downstream applications, make that your own default, and never deviate from it. That will give you peace of mind and consistency.

    Also pay attention to masked Y chromosome sequences. Default UCSC hg19 does not mask the PAR on Y, so you may miss all the variants in those regions; b37 and hg38 in the resource bundles already mask those regions.

    3- DON'T DO THIS! Get your fastq and re-do everything from the beginning.

    4-

    A00199:163:HCKFTDMXX:1:1159:25427:11678 177 1 9996 0 83S8M1D59M X 155260267
    

    This is a read that is mapped to the X chromosome at position 155260267. The string at the beginning is the read name. Please read the SAM format manual thoroughly to understand what the format represents.
    https://samtools.github.io/hts-specs/SAMv1.pdf

    5- The BAM header includes the reference sequence dictionary for downstream compatibility. Do not mess with any of those lines manually. The .fai file is a text format, but the .bai format is binary and compressed; you cannot open it, and you also don't need to.

    6- The Y chromosome contains regions that also occur in autosomes (CD54). So reads from those regions mistakenly map to Y, and there is no way to avoid that. You can use the GATK CNV tools to check for allosomal aneuploidies. You may also implement your own analysis to find the ratios of X and Y reads with respect to the autosomes using read counting, etc., but that is totally up to you.

    7- The GATK CNV tools are not designed to detect short indels, and HaplotypeCaller is not designed to detect CNVs. They are different tools with different purposes. STR analysis, on the other hand, is a totally different thing, and performance depends on the read technology you are using. Short-read technology cannot detect most STRs; you need long-read technology, or there are other molecular biology techniques to analyze them.

  • jiehuang001jiehuang001 Member
    edited March 3

    Dear SkyWarrior:

    Thank you very much for your detailed explanation! I would deeply appreciate it if others could comment as well. I hope that the answers to these questions can benefit more people in this community, not just myself.

    Before following up further, I have a quite important extra question: the VCF file that I got from a sequencing vendor contains only genotypes, not haplotypes. That is, the genotype data uses "/" instead of "|". Since GATK "HaplotypeCaller" is named a haplotype caller, I am curious why it ends up without haplotypes in the final VCF file. Can GATK read and piece together directly sequenced haplotypes from the BAM file, so that I don't need to run statistical imputation through tools such as BEAGLE? It is haplotypes that are sequenced, after all.

    Now, below is my follow-up to your responses.

    1. Is FASTA an acronym for something; maybe the "F" means "format"? I could not find this information through Google.

    2. "Decoys" and "contigs" are chunks of DNA on chromosomes, correct? For example, my VCF file has this line: "##contig=<ID=chr1_gl000191_random,length=106433,assembly=hg19>". Does this mean a 106433bp DNA chunk on chr1 of the reference genome? If so, the reference genome already has the full sequence of chr1, so why does it need to provide another sequence for a ~100kb chunk of it?

    3. I did read the SAM tutorial. For the example that I gave, "1 9996" is the position, and "X 155260267" is the mate information. So, how can this read and its "mate" be on two different chromosomes?

    4. You said that the Y chromosome contains regions that also occur in autosomes. I know that, but only very few regions are like that. However, I have a male and a female BAM file, and when I use "samtools view the.bam chrY | wc -l", I get 12,903,917 and 3,519,554 lines for the male and the female, respectively. So there are over 3 million chrY reads from a female. I think there must be some other reason (it is not due to the PAR regions either). I don't know how GATK deals with this so that those chrY reads are correctly excluded from the final VCF file.

    5. You also said that I can use the GATK CNV tools to check for aneuploidies. But the GATK CNV tools only became available recently and are still in a beta phase. Let's say that I have a WGS dataset for someone with Down syndrome; isn't there an easier and more mature way to confirm his trisomy status? I am not talking about NIPT, where things are difficult because the fetus's DNA is diluted by the pregnant mother's DNA. Here there is no dilution or mixture: the WGS is based on whole-blood DNA from the person with the chr21 trisomy.

    6. Why do we need different methods to call a deletion of 49bp (an INDEL) vs. a deletion of 51bp (a CNV)?

    Thank you very much & best regards,
    Jie

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭
    edited March 4

    The reasons for function naming are beyond my understanding, yet HaplotypeCaller still fits its name when considered with all its perks and outputs, including GVCFs and joint genotyping. Joint genotyping results include phasing of alleles by default. Still, naming is the least of my concerns.

    1- Don't drown in acronyms and jargon. Just use them effectively.

    2- That contig contains sequence that could not be properly mapped onto the main chr1, yet its source is known to be chr1. Future releases of the human genome may eliminate this ambiguity.

    3- My bad. This is possible for several reasons: (a) mismapping, or (b) a real SV. It is up to you to decide with further evidence.

    4- Again, there is no way that Y chromosome reads can be eliminated from female samples unless you do it manually with the -XL parameter. Short-read technology has this kind of limitation when it comes to the most accurate mapping of reads to the reference. Still, the ratio of reads on Y and X with respect to the autosomes is informative. My experience is that if you have a Y-to-autosome ratio around 0.01 - 0.05, this is perfectly normal XX female data. So having 3 million reads on Y when you have 300 million reads on all autosomes does not mean that you have an XXY anomaly.

    5- You seem to be stuck on the definitions of INDEL and CNV. A 51bp indel is not a CNV if it can be captured by your sequencing technology, and a 49bp indel could be a CNV if it is located in a place where your read technology cannot capture it. The CNV tools in GATK discover CNVs based on multiple parameters around coverage and read depth. For genome sequencing, it is up to the user to set up the intervals for CNV calling in a BED file as separate regions of the genome; the manual clearly states this.

    Normally, your resolution for capturing indels is around half your sequencing read length. So if you are using paired-end 2x150bp reads, you can capture indels of around 75 to 100bp, depending on the quality of your reads and library. Of course there may be exceptions. dbSNP's policy of not accepting indels beyond 50bp is beyond my understanding, but I believe this policy will fade with the recent developments in long-read sequencing technologies.
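
    The Y-to-autosome ratio check described in point 4 can be scripted from `samtools idxstats` output, which prints one line per contig: name, length, mapped reads, unmapped reads. A sketch using made-up counts in place of real `samtools idxstats female.bam` output (adjust the contig regex if your reference uses the chr prefix):

```shell
# Made-up idxstats-style counts; in practice: samtools idxstats female.bam
printf '1\t249250621\t8000000\t0\n2\t243199373\t7800000\t0\nX\t155270560\t7900000\t0\nY\t59373566\t90000\t0\n' \
  | awk -F'\t' '
      $1 == "Y" { y = $3 }                          # mapped reads on Y
      $1 ~ /^([1-9]|1[0-9]|2[0-2])$/ { auto += $3 } # sum over autosomes 1-22
      END { printf "Y/autosome ratio: %.4f\n", y / auto }'
# prints: Y/autosome ratio: 0.0057
```

    A value in the 0.01 - 0.05 range would fall inside the "normal XX" band described above; the thresholds themselves are empirical, not a GATK-published standard.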

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    One more thing, about the example read that you gave. I overlooked the issue, but there is a very simple explanation: that read belongs to the telomeric regions. So there is really nothing funky going on; it is just a simple mismapping due to telomeric repeats.

  • jiehuang001jiehuang001 Member

    I just found it!

    It is ReadBackedPhasing (https://software.broadinstitute.org/gatk/documentation/tooldocs/3.6-0/org_broadinstitute_gatk_tools_walkers_phasing_ReadBackedPhasing.php). It is a little surprising that this cool method is NOT enabled in HaplotypeCaller by default.

    For the questions that I raised above, I still appreciate if someone else could kindly give a reply, so that I could be "double" sure of the answers.

    Best regards,
    Jie

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @jiehuang001

    Support for outputting phased variants in HaplotypeCaller has been available since the GATK 4.0.12.0 release.

    Also, most questions in the thread above are not directly GATK-related, and we do not provide support for those. If you could filter them out for me and post only the GATK-related questions, I will help you with those. Thank you :)

  • jiehuang001jiehuang001 Member

    Yes, sure!

    Below are my 3 GATK related questions:

    1. You said that support for outputting phased variants in HaplotypeCaller has been available since the GATK 4.0.12.0 release. However, the VCF file that I got from the sequencing vendor still has unphased genotypes ("/"), while they told me that they did use GATK. So, can you please let me know how to turn on "ReadBackedPhasing" in HaplotypeCaller? The vendor is not willing to share their GATK pipeline scripts..

    2. I have a female BAM file, and when I use “samtools view my.bam Y”, it is surprising that there are about 3.5 million lines in the output, and they are NOT in the PAR-1 or PAR-2 regions. The vendor told me to use “bcftools view -t ^Y” to remove the chrY data. I don't think that is a good approach; after all, this female could indeed be XXY. So, what is the standard procedure in GATK for removing false chrY data from females? And suppose this person actually is XXY: is there a way to detect XXY, or a trisomy, from the FASTQ or BAM file?

    3. I used two resources to download the reference genome: one from Illumina (https://support.illumina.com/sequencing/sequencing_software/igenome.html) and the other from the GATK website (https://software.broadinstitute.org/gatk/download/bundle). The .fai file from Illumina has only 25 lines, for chr1-22, X, Y, and M. However, the .fai file from the GATK download (human_g1k_v37.fasta.fai) has 84 lines: besides chr1-22, X, Y, and M, there are 59 records starting with “GL”, such as “GL000217.1”. There are even two more lines in the human_g1k_v37_decoy.fasta.fai file (86 lines in total); besides the “GL” records, the two extra records are “NC_007605” and “hs37d5”. So, are these two the “decoy”? I actually don’t know where these “contigs” or “decoys” come from, since there are only chr1-22, X, Y, and M in the human genome. In a VCF file generated by the GATK pipeline there are lines such as "##contig=<ID=chr1_gl000191_random,length=106433,assembly=hg19>", yet in the body of the VCF file there is no "chr1_gl000191_random". Also, some BAM files have the "chr" prefix while others don't, due to the different references used. Is there a GATK command to convert between them? Please simply forward me a learning resource on this topic if you feel this is too basic and not so relevant a question.

    I would deeply appreciate your explanation!

    Thank you & best regards,
    Jie

  • jiehuang001jiehuang001 Member
    1. Also, to add a GATK CNV question: I got a few BAM files from the 1000 Genomes Project to use as a reference for CNV calling. However, the 1000GP BAM files and my own WGS data were mapped using different references, one hg19 and the other GRCh37. Therefore, I got the error message "input file reference and reads have incompatible contigs. contig reference = chr1 / 248956422. contig reads = chr1 / 249250621 ...".
      Must I re-map both BAM files using the same reference genome to resolve this error, or is there a quick way to fix the BAM file, since the difference between hg19 and GRCh37 is modest?

    Thank you very much!
    Jie

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @jiehuang001

    1) I will need to see the version of GATK and the exact command used to tell why phasing information is not provided; without that, I am sorry, but I am unable to help you. The latest version, GATK 4.1, provides phasing information by default unless it is turned off.
    2) To detect copy ratio alterations, read this doc: https://software.broadinstitute.org/gatk/documentation/article?id=11682
    3) > Please simply forward me a learning resource on this topic if you feel this is a too basic and not so relevant question.
    You can post your question on www.biostar.org or www.seqanswers.com
    4) You must re-map both BAM files using the same reference genome in order to resolve this error.

  • jiehuang001jiehuang001 Member

    Hi, I use GATK 4.1.0.0.

    I first ran the usual "gatk MarkDuplicates, gatk BaseRecalibrator, gatk ApplyBQSR", and then ran: gatk HaplotypeCaller -R human_g1k_v37c_decoy.fasta -I sample1.bam -O sample1.gvcf -ERC GVCF --dbsnp dbsnp_138.b37.vcf.gz

    The output sample1.gvcf file has about 440 million lines. However, the genotypes are not phased; please see the screenshot attached below. So, can you please let me know how to make GATK generate a phased gVCF file?

    It seems that I have two ways to create the VCF file: one is through "bcftools convert --gvcf2vcf", and the other is to run the above "gatk HaplotypeCaller" command without the "-ERC GVCF" option. Ideally, I would like to use bcftools, since I could then get both the gVCF and VCF files. I don't know if I could skip the bcftools step by asking GATK to output both the VCF and gVCF files directly.

    Thank you very much & best regards,
    Jie
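
    For what it's worth, the GATK-native way to end up with both files is to keep the GVCF from HaplotypeCaller and derive the final VCF from it with GenotypeGVCFs, which avoids the bcftools conversion. A sketch with placeholder file names, wrapped in a function rather than executed:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Sketch: GenotypeGVCFs turns the per-sample GVCF into a final VCF,
# so a single HaplotypeCaller run yields both files.
gvcf_then_vcf() {
  local ref=human_g1k_v37_decoy.fasta sample=sample1
  gatk HaplotypeCaller -R "$ref" -I ${sample}.bam \
      -O ${sample}.g.vcf.gz -ERC GVCF
  gatk GenotypeGVCFs -R "$ref" -V ${sample}.g.vcf.gz \
      -O ${sample}.vcf.gz
}
```

    For multiple samples, the GVCFs would first be merged (e.g. with CombineGVCFs or GenomicsDBImport) before the GenotypeGVCFs step.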

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
    edited April 9

    Hi @jiehuang001

    In the snapshot I do not see any variants, hence there is no phasing information. If the ALT field shows <NON_REF> only and no variants, that means there is no variant at that position. Read this blog post for more info: https://software.broadinstitute.org/gatk/blog?id=4741

  • jiehuang001jiehuang001 Member

    Thanks. Please see the updated screenshot below. This time, I did not use "-ERC GVCF" and got a VCF file directly by using "gatk HaplotypeCaller -I -O --dbsnp". As you can see, the output VCF file is not phased either.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    @jiehuang001

    Can you try to run the same command with "-ERC GVCF" like you did before and see if you are able to see any phasing information?

  • jiehuang001jiehuang001 Member
    Accepted Answer

    Of course, I did use "-ERC GVCF" like I did before, but there is still NO phasing information. Did you get phasing information? If so, can you please share the gatk HaplotypeCaller command and a screenshot of the output?

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    HI @jiehuang001

    I checked with our dev team and this is what they had to say:

    HaplotypeCaller and M2 automatically emit phased genotypes (i.e., the PGT and PID tags, and 0|1 instead of 0/1 genotypes) in any mode. However, the phasing algorithm is very conservative, and a huge amount of phasing is not found.

    The way the phasing algorithm decides to phase is by checking whether two variants always occur on the same haplotype or always occur on different haplotypes. The excess haplotypes severely dilute the signal.

    This raises the question of whether we could do better, and the answer is yes, easily. The current code is very naive. The dev team is looking into making this process better.

  • jiehuang001jiehuang001 Member

    The output VCF file from GATK HaplotypeCaller has ~5 million rows. I used fgrep "|" but found no row with a "|" in it.
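
    A note on this check: "|" can also appear in VCF header lines and in FORMAT tags such as PGT, so a bare fgrep can mislead in either direction. A sketch that inspects only the GT value of the sample column, shown on two made-up single-sample records:

```shell
# Two made-up single-sample VCF records: one phased GT, one unphased.
printf '1\t100\t.\tA\tG\t50\tPASS\t.\tGT:PGT\t0|1:0|1\n1\t200\t.\tC\tT\t50\tPASS\t.\tGT\t0/1\n' \
  | awk -F'\t' '
      !/^#/ { split($10, gt, ":")             # GT is the first FORMAT subfield
              if (gt[1] ~ /\|/) phased++; else unphased++ }
      END   { printf "phased=%d unphased=%d\n", phased, unphased }'
# prints: phased=1 unphased=1
```

    Run against a real VCF it would be `grep -v '^#' my.vcf | awk ...` with the same awk body (`my.vcf` is a placeholder).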

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
    edited April 22

    @jiehuang001

    My comment above still holds.

  • jiehuang001jiehuang001 Member

    Thanks. But for whatever reason, "GATK HaplotypeCaller" does NOT actually produce haplotypes; that is a great pity. Above you mentioned that "the dev team is looking into making this process better". I hope that happens very soon!

    Best regards,
    Jie

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @jiehuang001 Four of us on the GATK team are implementing an algorithm for restoring phasing beyond the size of a single kmer into the assembly de Bruijn graph, which will greatly improve the output phasing.

  • jiehuang001jiehuang001 Member

    That's great! When can I expect it to be implemented?

    best regards,
    Jie

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    Hard to say. More than 2 months, less than 8 months.

  • jiehuang001jiehuang001 Member

    Sure, no problem; I fully understand that you guys are busy. So, with the new method implemented, will we be able to get "|" for the vast majority of genotypes in the VCF files generated by GATK HaplotypeCaller?

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    I would expect | for most pairs of variants that fall within a read length of one another, and perhaps longer if we manage to account for pairing. Longer-range phasing really requires a dedicated phasing tool that infers haplotypes from lots of samples.

  • jiehuang001jiehuang001 Member

    For a VCF file generated by GATK HaplotypeCaller, let's say that 50% of the genotypes are phased with "|" while the other 50% are unphased with "/". This might require a new phasing tool that phases only the genotypes with "/"; I think current phasing tools such as SHAPEIT or EAGLE will ignore the existing "|" and phase everything from scratch.

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    Phasing is a huge area of research even today regardless of the technology being used for analysis.

    Read supported phasing, phasing by transmission, phasing by imputation are all important aspects of this.

    One cannot expect a single tool to perform all the perfect phasing using only a single methodology; therefore, insisting on perfection for this kind of task is pretty much pointless.

    What is more important is to get accuracy and precision in genotyping and reducing errors.

    Current short-read technologies, as @davidben indicated, can support phasing up to a few hundred bases, maybe a little more with better assembly algorithms. Better phasing will only come from long-read technologies, which are better suited for getting phased variants up to kilobases or even megabases apart.

    When working with large cohorts with founders and offspring together, it is not too hard to find proper phased genotypes, so I would suggest you read more about the Li & Stephens HMM.

    By the way, GATK, when used with joint genotyping, performs phasing, although not in the sense of SHAPEIT or BEAGLE; the VCF file produced contains PID and PGT flags in the FORMAT section that indicate which alleles are in phase with each other.
