Questions about Indel Realignment (2013-2015)

shlee Cambridge, Member, Broadie ✭✭✭✭✭
edited May 2016 in Ask the GATK team
This discussion was created from comments split from: Local Realignment around Indels.

Previous comments in this thread were posted at http://gatkforums.broadinstitute.org/gatk/discussion/7639/questions-about-indel-realignment-2012-2013

Comments

  • Geraldine_VdAuwera Cambridge, MA, Member, Administrator, Broadie admin

    Hi Vivek,

    Technically the output of CountReads is not what you're looking at; that's just the filtering summary. The actual output of CountReads should be on a line that looks like this

    INFO  10:11:15,268 CountReads - CountReads counted 33 reads in the traversal
    

    in your console output.

    Now, the two numbers should be the same, but until recently we had a bug in the traversal counter (the number you're looking at) so that may not be accurate. You should use the number explicitly output by CountReads.

    If that count is still very low, then maybe you do have a problem. You should certainly have more than half of your reads mapped to the exonic regions, otherwise you either have poor quality sequence data or the alignment did not go well. What processing steps have you applied to your data and what aligner did you use? Have you checked the number of duplicates?

  • vivekdas_1987 Milan, Member

    Dear @Geraldine_VdAuwera ,
    Thank you for the reply. I used BWA to align the reads to the reference genome, and about 98% of the reads aligned, which comes to roughly 80 million aligned reads for this sample. There were no duplicates among the aligned reads; the duplicates I found were among the unmapped reads, so the quality of the alignment is high.

    My next quality-control step is to work out how many of those 80 million reads fall within the exome regions, using the target-region BED file provided by the company. If I cannot get that number from CountReads, I was thinking of using the DepthOfCoverage walker instead: its sample_summary output reports total, mean, median, quartiles, and threshold proportions aggregated over all bases. Does the first column (total) refer to the number of bases covered in the exome regions? If so, I could take that total number of bases, restricted to the regions in the BED file, and divide it by 100 to estimate the number of reads, since each read is 100 bases long. But since this is a paired-end BAM file, should I divide by 100 (one read) or by 200 (one pair)?

    There is another catch I'm thinking of: the bases counted by DepthOfCoverage may be spread out across the exome, and reads may overlap each other or extend beyond the targets, so simply dividing by 100 or 200 may not give a correct read count. Is this approach valid at all for estimating the number of reads on the exome from the reported base totals? If you can clarify these doubts, I should be able to retrieve the number of reads for my sample that lie in the exome regions. Please let me know if I've explained myself clearly.
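
    (For illustration only, since this approach is discouraged in the reply below: a minimal sketch of the bases-to-reads estimate described here, assuming the sample_summary "total" column is the number of aligned bases falling inside the target intervals. The figures are made up.)

    # Hypothetical figures, for illustration only.
    TOTAL_BASES_ON_TARGET=8000000000   # "total" column of the DepthOfCoverage sample_summary
    READ_LENGTH=100                    # one read contributes ~100 bases, whether or not it is paired
    echo $(( TOTAL_BASES_ON_TARGET / READ_LENGTH ))   # rough read count: 80000000
    # Caveat: reads that straddle a target boundary contribute only part of their
    # bases, so this estimate is imprecise; counting reads directly with
    # CountReads -L is the cleaner approach.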

  • Geraldine_VdAuwera Cambridge, MA, Member, Administrator, Broadie admin

    Hi Vivek,

    I think I understand you, yes, but I wouldn't recommend following that approach. It would be very imprecise, for the reasons you rightly speculate about.

    You should be able to get the result you need using CountReads, there's no reason why it wouldn't work. Did you look for the line I indicated in your CountReads output? If you post the command line you used and the console output I can help you check that it worked properly.

  • vivekdas_1987 Milan, Member
    edited December 2013

    Dear @Geraldine_VdAuwera,

    Thank you for your prompt answers. I am using GATK version 2.3-4. I am not getting the output line you mentioned above; instead I am getting something like the lines below. Sorry for the formatting, please accept my apologies.

    INFO 12:19:10,894 Walker - [REDUCE RESULT] Traversal result is: 119673919

    INFO 12:19:10,900 ProgressMeter - done 1.20e+08 7.6 m 3.8 s 100.0% 7.6 m 0.1 s
    INFO 12:19:10,900 ProgressMeter - Total runtime 457.16 secs, 7.62 min, 0.13 hours
    INFO 12:19:11,066 MicroScheduler - 0 reads were filtered out during traversal out of 460476 total (0.00%)
    INFO 12:19:12,347 GATKRunReport - Uploaded run statistics report to AWS S3

    So I can see that 0 reads were filtered out of 460476 total, but that is not the output read count, right? I can also see this line:

    INFO 12:19:10,894 Walker - [REDUCE RESULT] Traversal result is: 119673919

    Does this number give the actual count of reads? My total number of reads mapped to the reference genome was 147033026. If the count of 119673919 is the number of reads on target, then I am in a favorable position for QC. Please let me know whether I am correct and whether I have explained myself clearly. The command I used is below:

    java -Xmx14g -jar /data/PGP/gmelloni/GenomeAnalysisTK-2.3-4-g57ea19f/GenomeAnalysisTK.jar -R /scratch/GT/vdas/test_exome/exome/hg19.fa -T CountReads -I /scratch/GT/vdas/pietro/exome_seq/results/T_S7998/T_S7998.realigned.bam -L /scratch/GT/vdas/referenceBed/hg19/ss_v4/S03723314_Regions.bed

    Sorry again for the formatting. Please let me know whether this is clear.
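
    (As an aside, a minimal sketch for pulling the count out of the console output, assuming the log lines look like the ones quoted in this thread; the shortened file names are placeholders for the full paths above.)

    # Capture the GATK console output and extract the read count reported by CountReads.
    java -Xmx14g -jar GenomeAnalysisTK.jar -T CountReads \
        -R hg19.fa \
        -I T_S7998.realigned.bam \
        -L S03723314_Regions.bed 2>&1 | tee countreads.log
    # Newer GATK versions print "CountReads counted N reads in the traversal";
    # this 2.3-4 build reports the count on the "[REDUCE RESULT]" line instead.
    grep -E "counted .* reads in the traversal|Traversal result is" countreads.log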

  • Geraldine_VdAuwera Cambridge, MA, Member, Administrator, Broadie admin

    Ah, great, this confirms what I thought: the version you are using had a bug in the traversal counter, but the CountReads result is correct (even if the formatting is a little different). You have 100M+ reads mapped to your exonic regions, which is great. Good job :)

    I would recommend you upgrade your version of GATK, but otherwise you're good to go.

  • vivekdas_1987 Milan, Member

    Dear @Geraldine_VdAuwera,

    Thank you very much for the guidance. Finally I can smile, as I am able to figure out the read counts that map to the exonic regions; I was unable to work out this walker command earlier. Yes, I can see I have 100M+ reads mapped, which is a good QC result. It's true that I am using the old version. Once I finish this whole round of samples with the old version I will mirror everything to the new one, as I am new to this and need to build up my confidence as well. Thank you once again for all the support; you are doing a great job indeed.

  • I am running IndelRealigner on my RNA-seq data set in -knownsOnly mode, but it is extremely slow. I guess this is because the regions where very highly expressed genes are located have very high read depth. I noticed that there are two arguments, --maxConsensuses and --maxReadsForConsensuses / -greedy. What thresholds should I use for these two, and are there any other ways I can speed up the local realignment? Any suggestions would be highly appreciated.

    Liping

  • Geraldine_VdAuwera Cambridge, MA, Member, Administrator, Broadie admin

    Hi @houliping,

    The optimal thresholds are used by default as noted in the tool documentation. If you find that this is too slow, you can try reducing the values from the defaults, but you will need to experiment a little as we do not provide detailed guidelines for this.

    I would usually recommend downsampling in order to get rid of some of the excess coverage, but I don't know what impact that might have on RNAseq analysis, since that is not something we currently support.
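
    (For what it's worth, a minimal sketch of what that might look like on the command line; the values and file names are made up, and the exact flag spellings should be checked against the documentation for your GATK version, since the knowns-only mode is spelled --consensusDeterminationModel KNOWNS_ONLY in later releases.)

    # Hypothetical example: restrict consensus determination to known indels and
    # shrink the consensus search space to speed things up.
    java -jar GenomeAnalysisTK.jar -T IndelRealigner \
        -R ref.fa \
        -I rnaseq.bam \
        -targetIntervals targets.intervals \
        -known Mills_and_1000G_gold_standard.indels.vcf \
        --consensusDeterminationModel KNOWNS_ONLY \
        --maxConsensuses 10 \
        --maxReadsForConsensuses 60 \
        -o rnaseq.realigned.bam
    # If downsampling is needed, one option is to subsample the BAM beforehand
    # (for example with samtools view -s) rather than dropping reads during realignment.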

  • Thanks, I will try downsampling. I did try reducing the values of those two arguments to half of the defaults, but the local realignment still takes a very long time (~3 weeks) to finish. Thanks again,

    Liping

  • Blue Member

    I was just wondering what you think of my realignment interval length distribution.
    This is 30 Mb from a single diploid sample, without prior indel position information. Approximately 60,000 events, i.e. one every fifty bases, seems like a lot.
    How indicative of true indels are the targets from RealignerTargetCreator and IndelRealigner? I guess I'll have to check against the UnifiedGenotyper VCF calls...
    Across the genome, the distribution of all events is uniform.
    Does multi-sample realignment improve the accuracy or efficiency of the realignment process?
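
    (For context, a minimal sketch of how such a length distribution might be tabulated from the RealignerTargetCreator output, assuming the plain one-entry-per-line chr:start-end .intervals format; the file name is a placeholder.)

    # Tabulate realignment target lengths from a RealignerTargetCreator .intervals file.
    # Entries are either "chr:start-end" or a bare "chr:pos" for single-base targets.
    awk -F'[:-]' 'NF >= 3 {print $NF - $(NF-1) + 1} NF == 2 {print 1}' targets.intervals \
        | sort -n | uniq -c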

  • Hello:

    I am using IndelRealigner v2.2-16 with the Mills dataset of known indels. The odd thing is that the results are identical with and without the Mills dataset. Are there any scenarios in which the dataset of known indels would be ignored (wrong format, versioning problem, etc.)?

    Thanks,
    Nik

  • Sheila Broad Institute, Member, Broadie, Moderator admin

    @NikTuzov
    Hi Nik,

    The best thing to do is use the latest version of GATK. 2.2 is really old.

    -Sheila

  • Geraldine_VdAuwera Cambridge, MA, Member, Administrator, Broadie admin

    No, the knowns won't be ignored, the program would throw an error if there was a problem. What you're observing just means that the consensus in your data agrees with the calls in the Mills dataset. That's generally a good sign :)

  • Is there a way to say whether the results are truly identical or just appear identical?

  • Sheila Broad Institute, Member, Broadie, Moderator admin

    @NikTuzov
    Hi Nik,

    You can grep for the OC tag. Have a look at this thread for more information: http://gatkforums.broadinstitute.org/discussion/1868/view-the-local-realignment-around-indels-by-gatk

    -Sheila
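
    (A minimal sketch of that check, assuming IndelRealigner stores the original CIGAR of every read it moved in an OC tag; the BAM file name is a placeholder.)

    # Count reads whose alignment was actually changed by IndelRealigner;
    # such reads carry an OC tag holding the pre-realignment CIGAR string.
    samtools view realigned.bam | grep -c 'OC:Z:'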
