Best approach for RealignerTargetCreator and IndelRealigner

steve1980 Posts: 1 Member

Hi,

I am trying to decide between two approaches to realignment around indels. I have ~600 samples that have been aligned to a very fragmented draft genome assembly.
Which is better:
1. Create a list of targets for each sample, then realign each sample separately.
2. Combine all samples into one large BAM file, create a single list of targets, then realign that same large BAM file.

Also, would there be any advantages in terms of speed with either approach?

Cheers,

Steve

Answers

  • Geraldine_VdAuwera Posts: 8,056 Administrator, GATK Dev admin

    Hi Steve,

    Actually, you don't need to combine the samples into a single BAM file to process them together; you can just pass them all as inputs in a list file.

    Realigning them all together is best because the realignment will then be consistent across all of them. That said, it can be a lengthy process with a lot of samples, so if you find performance is an issue you can do the target creation on the full list and then realign in batches. As long as you use the same target intervals file, that is completely fine. You can also look into multithreading to speed things up.

    Geraldine Van der Auwera, PhD
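The workflow described in this answer (targets created once over the full input list, then realignment in batches against that single intervals file) might be sketched roughly as follows. This assumes GATK 3.x-style invocation. The jar path, reference name, sample names, thread count, and batch size are all placeholders rather than values from this thread, and by default the script only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch only: every file name, the jar path, THREADS and BATCH_SIZE are placeholders.
RUN=${RUN-echo}      # dry run: commands are printed. Invoke with RUN= to execute them.
GATK="java -jar GenomeAnalysisTK.jar"
REF=draft_assembly.fasta
THREADS=4
BATCH_SIZE=2

# One BAM path per line. GATK 3.x accepts a .list file as the value of -I.
printf '%s\n' sample_001.bam sample_002.bam sample_003.bam sample_004.bam \
    > all_samples.list

# Step 1: create the realignment targets once, over all samples.
# -nt enables multithreading for RealignerTargetCreator.
create_targets() {
    $RUN $GATK -T RealignerTargetCreator -R "$REF" -nt "$THREADS" \
        -I all_samples.list -o all_samples.intervals
}

# Step 2: realign in batches, reusing the same intervals file every time.
# -nWayOut writes one realigned BAM per input BAM, named with the given suffix.
realign_batches() {
    # Split the full list into batch_001.list, batch_002.list, ...
    awk -v n="$BATCH_SIZE" '{
        f = sprintf("batch_%03d.list", int((NR - 1) / n) + 1)
        print > f
    }' all_samples.list
    for batch in batch_*.list; do
        $RUN $GATK -T IndelRealigner -R "$REF" -I "$batch" \
            -targetIntervals all_samples.intervals -nWayOut .realigned.bam
    done
}

create_targets
realign_batches
```

The batches can be submitted as independent jobs, since each one only reads the shared intervals file.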

  • stechen University of Pennsylvania Posts: 23 Member

    Greetings!

    To pass multiple BAM files as inputs to RealignerTargetCreator, I used "-I file_1.bam -I file_2.bam -I file_3.bam" in the command. This was with GATK 2.2. Is the syntax the same for the most recent version of GATK as well?

    Thank you!
    Stephanie

  • Geraldine_VdAuwera Posts: 8,056 Administrator, GATK Dev admin

    Hi Stephanie,

    Yes, the syntax is still the same. Keep in mind you can also pass in a list of files in a text file (with .list extension) to make it easier. FYI, our Best Practices recommendation is now to realign per lane or per sample only. We have found that multisample realignment, while it never hurts results, yields little benefit compared to its computational cost.

    Geraldine Van der Auwera, PhD
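For illustration, the two input styles mentioned here (repeated -I flags versus a .list file) and the per-sample recommendation could look something like this. It is a sketch assuming GATK 3.x-style invocation. The jar path, reference.fasta, and the cohort.* names are hypothetical, and the script prints the commands rather than running them unless RUN is set to empty.

```shell
#!/bin/sh
# Sketch only: jar path, reference.fasta and cohort.* names are placeholders.
RUN=${RUN-echo}      # dry run: commands are printed. Invoke with RUN= to execute them.
GATK="java -jar GenomeAnalysisTK.jar"
REF=reference.fasta

# Style 1: repeat -I once per BAM file.
targets_from_flags() {
    $RUN $GATK -T RealignerTargetCreator -R "$REF" \
        -I file_1.bam -I file_2.bam -I file_3.bam -o cohort.intervals
}

# Style 2: the same inputs, one path per line in a text file ending in .list.
targets_from_list() {
    printf '%s\n' file_1.bam file_2.bam file_3.bam > cohort.list
    $RUN $GATK -T RealignerTargetCreator -R "$REF" \
        -I cohort.list -o cohort.intervals
}

# Per-sample realignment: each BAM gets its own targets and its own output.
realign_per_sample() {
    for bam in "$@"; do
        sample=${bam%.bam}     # strip the .bam extension
        $RUN $GATK -T RealignerTargetCreator -R "$REF" \
            -I "$bam" -o "$sample.intervals"
        $RUN $GATK -T IndelRealigner -R "$REF" -I "$bam" \
            -targetIntervals "$sample.intervals" -o "$sample.realigned.bam"
    done
}

targets_from_flags
targets_from_list
realign_per_sample file_1.bam file_2.bam file_3.bam
```

Because the per-sample runs are independent of each other, they can be run in parallel if resources allow.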

  • Irantzu Posts: 14 Member
    edited January 8

    Hi Geraldine,
    This is a general question, but I cannot find the "correct" way to proceed. My question is related to this post, so I thought I would write here instead of posting a new question. I have 15 samples (15 BAM files). I used human_g1k_v37.fasta, available in the bundle, to align them. Now I want to run the "RealignerTargetCreator" and "IndelRealigner" steps.

    1) Which file should I pass to the "-known" argument: Mills_and_1000G_gold_standard.indels.b37.vcf or 1000G_phase1.indels.b37.vcf? Maybe both? Or should I create my own file?

    2) Do I have to run RealignerTargetCreator for each BAM?

    3) For IndelRealigner, do I have to pass the same -known file(s) as in the RealignerTargetCreator step?

    Any help is greatly appreciated :smile:

  • Geraldine_VdAuwera Posts: 8,056 Administrator, GATK Dev admin

    @Irantzu

    1) You can use both files.

    2) Normally you run RTC on each bam, but some people who have many samples just run RTC on a representative subset and apply the resulting intervals to all the samples. The assumption is that the subset will capture most if not all regions that need to be realigned. This is a valid approach but it depends on the subset being representative. If you have rare indels in a sample that is not in the subset, they may be missed. But if you are using HaplotypeCaller, there is a good chance it will rescue them. So it is worth considering as an approach to optimize runtime when you have a large number of samples.

    3) Yes.

    Geraldine Van der Auwera, PhD
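The three answers above can be combined into one rough sketch: both known-indel files passed via repeated -known flags, targets created from a representative subset, and the resulting intervals applied to every sample with the same -known files. This assumes GATK 3.x-style invocation. The jar path, the sample_NN.bam names, the choice of subset, and subset.intervals are hypothetical; the reference and VCF names are the ones mentioned in the question. Commands are printed, not executed, unless RUN is set to empty.

```shell
#!/bin/sh
# Sketch only: jar path, sample names and subset.intervals are placeholders.
RUN=${RUN-echo}      # dry run: commands are printed. Invoke with RUN= to execute them.
GATK="java -jar GenomeAnalysisTK.jar"
REF=human_g1k_v37.fasta
# Both known-indel resources, each with its own -known flag.
# Left unquoted on purpose below, so this breaks on paths containing spaces.
KNOWN="-known Mills_and_1000G_gold_standard.indels.b37.vcf -known 1000G_phase1.indels.b37.vcf"

# Create targets from a representative subset of BAMs (given as arguments).
subset_targets() {
    inputs=""
    for bam in "$@"; do
        inputs="$inputs -I $bam"
    done
    $RUN $GATK -T RealignerTargetCreator -R "$REF" $inputs $KNOWN \
        -o subset.intervals
}

# Realign every BAM (given as arguments) against the subset-derived targets,
# passing the same -known files as in the target creation step.
realign_with_subset_targets() {
    for bam in "$@"; do
        $RUN $GATK -T IndelRealigner -R "$REF" -I "$bam" $KNOWN \
            -targetIntervals subset.intervals -o "${bam%.bam}.realigned.bam"
    done
}

subset_targets sample_01.bam sample_05.bam sample_10.bam
realign_with_subset_targets sample_01.bam sample_02.bam sample_03.bam
```

As noted in the answer, rare indels present only in samples outside the subset may be missed by this shortcut, so how the subset is chosen matters.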
