The current GATK version is 3.7-0

Best approach for RealignerTargetCreator and IndelRealigner

steve1980 (Member; Posts: 1)

Hi,

I am trying to decide between two approaches for performing realignment around indels. I have ~600 samples that have been aligned to a very fragmented draft genome assembly.
Which is best:
1. Take each sample, create a list of targets, and then realign each sample separately.
2. Combine all samples into one large BAM file, create a list of targets, and then realign that same large BAM file.

Also, would either approach have an advantage in terms of speed?

Cheers,

Steve

Answers

  • Geraldine_VdAuwera (Administrator, Dev; Posts: 11,127)

    Hi Steve,

    Actually, you don't need to combine the samples into a single BAM file to process them together; you can just pass them all as inputs in a list file.

    Realigning them all together is best because then the realignment will be consistent across all of them. That said, it can be a lengthy process with a lot of samples, so if you find performance is an issue you can do the target creation on the full list, then realign in batches. As long as you use the same target intervals file, that is completely fine. You can also look into multithreading to speed things up.

    Geraldine Van der Auwera, PhD
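
A minimal sketch of the workflow described above: create targets once on the full list, then realign in batches against the same intervals file. It assumes GATK 3.x invoked as `java -jar GenomeAnalysisTK.jar`; the file names, the batch size and the helper functions are hypothetical. The script only builds the command lines and does not run them. The `-nWayOut` suffix behaviour (one output per input BAM) and `-nt` support in RealignerTargetCreator should be checked against the tool documentation for your version.

```python
import tempfile
from pathlib import Path

GATK = ["java", "-jar", "GenomeAnalysisTK.jar"]  # assumed location of the GATK 3.x jar


def write_bam_list(bams, list_path):
    """Write one BAM path per line; the file name should end in .list."""
    Path(list_path).write_text("\n".join(bams) + "\n")
    return str(list_path)


def target_creator_cmd(ref, bam_list, intervals, threads=1):
    """RealignerTargetCreator over every BAM in the list, giving one shared intervals file."""
    cmd = GATK + ["-T", "RealignerTargetCreator", "-R", ref,
                  "-I", bam_list, "-o", intervals]
    if threads > 1:
        cmd += ["-nt", str(threads)]  # multithreading for target creation
    return cmd


def realigner_cmds(ref, bams, intervals, batch_size):
    """One IndelRealigner command per batch, all reusing the same intervals file."""
    cmds = []
    for start in range(0, len(bams), batch_size):
        cmd = GATK + ["-T", "IndelRealigner", "-R", ref,
                      "-targetIntervals", intervals]
        for bam in bams[start:start + batch_size]:
            cmd += ["-I", bam]
        cmd += ["-nWayOut", ".realigned.bam"]  # assumed: one output per input BAM
        cmds.append(cmd)
    return cmds


if __name__ == "__main__":
    bams = [f"sample_{i}.bam" for i in range(1, 6)]  # hypothetical file names
    with tempfile.TemporaryDirectory() as tmp:
        bam_list = write_bam_list(bams, Path(tmp) / "cohort.list")
        print(" ".join(target_creator_cmd("ref.fasta", bam_list,
                                          "cohort.intervals", threads=4)))
    for cmd in realigner_cmds("ref.fasta", bams, "cohort.intervals", batch_size=2):
        print(" ".join(cmd))
```

The point of the design is that `cohort.intervals` is created once and passed unchanged to every batch, which is what keeps the realignment consistent.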

  • stechen (University of Pennsylvania; Member; Posts: 23)

    Greetings!

    To pass multiple BAM files as inputs to RealignerTargetCreator, I used "-I file_1.bam -I file_2.bam -I file_3.bam" in the command. This was with GATK 2.2. Is the syntax the same in the most recent version of GATK?

    Thank you!
    Stephanie

  • Geraldine_VdAuwera (Administrator, Dev; Posts: 11,127)

    Hi Stephanie,

    Yes, the syntax is still the same. Keep in mind you can also pass in a list of files in a text file (with .list extension) to make it easier. FYI our Best Practice recommendation now is to realign per lane or per sample only. We have found that multisample realignment, while it never hurts results, yields little benefit compared to its computational cost.

    Geraldine Van der Auwera, PhD
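
To illustrate the two points in this answer, here is a small sketch that builds the repeated `-I` arguments and a per-sample pair of commands (target creation, then realignment). It assumes GATK 3.x invoked as `java -jar GenomeAnalysisTK.jar`; the output naming scheme and the helper functions are hypothetical, and nothing is executed.

```python
from pathlib import Path

GATK = ["java", "-jar", "GenomeAnalysisTK.jar"]  # assumed location of the GATK 3.x jar


def input_args(bams):
    """Repeat -I once per BAM: ['-I', 'a.bam', '-I', 'b.bam', ...]."""
    args = []
    for bam in bams:
        args += ["-I", bam]
    return args


def per_sample_cmds(ref, bam):
    """Target creation followed by realignment for a single BAM."""
    stem = Path(bam).with_suffix("")  # file_1.bam -> file_1
    intervals = f"{stem}.intervals"   # hypothetical naming scheme
    create = GATK + ["-T", "RealignerTargetCreator", "-R", ref,
                     "-I", bam, "-o", intervals]
    realign = GATK + ["-T", "IndelRealigner", "-R", ref, "-I", bam,
                      "-targetIntervals", intervals,
                      "-o", f"{stem}.realigned.bam"]
    return create, realign


if __name__ == "__main__":
    bams = ["file_1.bam", "file_2.bam", "file_3.bam"]
    print(" ".join(input_args(bams)))
    for bam in bams:
        for cmd in per_sample_cmds("ref.fasta", bam):
            print(" ".join(cmd))
```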

  • Irantzu (Member; Posts: 16)
    edited January 2015

    Hi Geraldine,
    This is a general question, but I cannot find the "correct" way to proceed. My question is related to this post, so I thought I would write here instead of posting a new question. I have 15 samples (15 BAM files). I used human_g1k_v37.fasta, available in the bundle, to align them. Now I want to run the "RealignerTargetCreator" and "IndelRealigner" steps.

    1) Which file should I pass to the "-known" argument? Mills_and_1000G_gold_standard.indels.b37.vcf or 1000G_phase1.indels.b37.vcf? Maybe both? Or should I create my own file?

    2) Do I have to run RealignerTargetCreator on each BAM?

    3) For IndelRealigner, do I have to pass the same -known file as in the RealignerTargetCreator step?

    Any help is greatly appreciated :smile:

  • Geraldine_VdAuwera (Administrator, Dev; Posts: 11,127)

    @Irantzu

    1) You can use both files.

    2) Normally you run RTC on each bam, but some people who have many samples just run RTC on a representative subset and apply the resulting intervals to all the samples. The assumption is that the subset will capture most if not all regions that need to be realigned. This is a valid approach but it depends on the subset being representative. If you have rare indels in a sample that is not in the subset, they may be missed. But if you are using HaplotypeCaller, there is a good chance it will rescue them. So it is worth considering as an approach to optimize runtime when you have a large number of samples.

    3) Yes.

    Geraldine Van der Auwera, PhD
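
The three answers above can be sketched as follows: both known-indel files are passed with repeated `-known` arguments, target creation runs on a subset of BAMs, and the same `-known` files and the resulting intervals are then applied to every sample. This assumes GATK 3.x invoked as `java -jar GenomeAnalysisTK.jar` and the two b37 bundle files named in the question; the sample names, the choice of subset and the helper functions are hypothetical, and nothing is executed.

```python
GATK = ["java", "-jar", "GenomeAnalysisTK.jar"]  # assumed location of the GATK 3.x jar

KNOWN = [
    "Mills_and_1000G_gold_standard.indels.b37.vcf",
    "1000G_phase1.indels.b37.vcf",
]


def known_args(vcfs):
    """Repeat -known once per VCF of known indels."""
    args = []
    for vcf in vcfs:
        args += ["-known", vcf]
    return args


def rtc_cmd(ref, bams, intervals, known=KNOWN):
    """RealignerTargetCreator on the given BAMs (all of them, or a subset)."""
    cmd = GATK + ["-T", "RealignerTargetCreator", "-R", ref]
    for bam in bams:
        cmd += ["-I", bam]
    return cmd + known_args(known) + ["-o", intervals]


def realign_cmd(ref, bam, intervals, out_bam, known=KNOWN):
    """IndelRealigner with the same -known files used for target creation."""
    return (GATK + ["-T", "IndelRealigner", "-R", ref, "-I", bam,
                    "-targetIntervals", intervals]
            + known_args(known) + ["-o", out_bam])


if __name__ == "__main__":
    samples = [f"sample_{i:02d}.bam" for i in range(1, 16)]  # 15 hypothetical BAMs
    subset = samples[:5]  # arbitrary subset, assumed to be representative
    print(" ".join(rtc_cmd("human_g1k_v37.fasta", subset, "subset.intervals")))
    for bam in samples:
        out = bam.replace(".bam", ".realigned.bam")
        print(" ".join(realign_cmd("human_g1k_v37.fasta", bam, "subset.intervals", out)))
```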

  • armarkes (Lisbon; Member; Posts: 16)

    Regarding this last issue, you mentioned that we can use a subset of samples in RTC to produce the intervals.

    If the FASTQ files for all samples were obtained using the same protocol and a single flowcell lane, what is the best number of samples to use for RTC?
    Say I have 95 samples (each BAM file is from a different sample). Should I use 10 of these BAM files for RTC, or is it necessary to use 50 (half of the cohort)? How can I know what is best, assuming the files are selected at random?

    Can you help me with this issue?

    Thanks
    Ana Marques

  • Geraldine_VdAuwera (Administrator, Dev; Posts: 11,127)

    We have never benchmarked this so we can't give you a universal answer. It will depend on how similar your samples are. Frankly in production we just run it independently on each sample because it's simpler to have the pipeline set up that way.

    Geraldine Van der Auwera, PhD
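
Since no benchmark exists, the subset size has to be chosen empirically. A minimal sketch of drawing a reproducible random subset of BAMs for RTC is shown below; the sample names, the subset size of 10 and the seed are arbitrary assumptions, not recommendations.

```python
import random


def pick_subset(bams, k, seed=0):
    """Return k BAM paths chosen at random; a fixed seed makes the draw reproducible."""
    if k > len(bams):
        raise ValueError("subset size exceeds cohort size")
    # Sort first so the draw does not depend on the order of the input list.
    return random.Random(seed).sample(sorted(bams), k)


if __name__ == "__main__":
    cohort = [f"sample_{i:02d}.bam" for i in range(1, 96)]  # 95 hypothetical BAMs
    for bam in pick_subset(cohort, 10, seed=42):
        print(bam)
```

Drawing subsets of several sizes and comparing the resulting intervals files would be one way to see how quickly the target list stabilises for a given cohort.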

  • WVNicholson (Warwick University, Coventry; Member; Posts: 18)

    @Geraldine_VdAuwera said:
    Hi Steve,

    Actually, you don't need to combine the samples into a single BAM file to process them together; you can just pass them all as inputs in a list file.

    Realigning them all together is best because then the realignment will be consistent across all of them. That said, it can be a lengthy process with a lot of samples, so if you find performance is an issue you can do the target creation on the full list, then realign in batches. As long as you use the same target intervals file, that is completely fine. You can also look into multithreading to speed things up.

    I have a question about a related problem. Recently I did realignment around indels on some data by running RealignerTargetCreator on all of the BAM files and then doing the actual realignment for the individual BAM files with IndelRealigner. However, I now have some new BAM files which I want to analyse together with the older BAM files. I will have to create a new list of targets for the combined set of BAM files. Does this imply that, as well as realigning the new BAM files, I will have to re-run IndelRealigner on the old BAM files for consistency? (The two sets of samples may be somewhat different. The old set came from datasets obtained in our lab with DNA capture array technology from ancient and historical specimens. The new set came from archived whole-genome datasets, aligned against the same custom reference used with the earlier dataset, which I borrowed in order to have some samples from modern specimens.)

    William

  • Geraldine_VdAuwera (Administrator, Dev; Posts: 11,127)

    Hi @WVNicholson, if you're going to generate new intervals, re-running the old files would be great for consistency. However, if you're using HaplotypeCaller for variant calling, indel realignment is no longer necessary. You could therefore choose to just run the new files with the old intervals, to bring them in line with what was done on your older samples, and let HaplotypeCaller do the rest.

    Geraldine Van der Auwera, PhD

  • WVNicholson (Warwick University, Coventry; Member; Posts: 18)

    @Geraldine_VdAuwera said:
    Hi @WVNicholson, if you're going to generate new intervals, re-running the old files would be great for consistency. However, if you're using HaplotypeCaller for variant calling, indel realignment is no longer necessary. You could therefore choose to just run the new files with the old intervals, to bring them in line with what was done on your older samples, and let HaplotypeCaller do the rest.

    Unfortunately, I'm using a "legacy pipeline" that uses VarScan for variant calling, and I wanted to improve the quality of the variant calls by adding realignment around indels using GATK. Following my previous question, I went ahead and re-ran the old files together with the then-new files from modern data. It looks like I may want to do the same thing again with a much larger number of new samples, though, and it could be painful to re-run the earlier files yet again. So I'm wondering about the ramifications of generating target intervals for the new set of samples only, merging the old and new sets of intervals, and then doing realignment around indels for the new samples only. (Ideally I would want to use HaplotypeCaller, but a downstream program of ours currently expects the non-standard VarScan output format rather than VCF, although VarScan did have support added for outputting results in VCF format. Time is too limited at the moment to think about adding full support for VCF as an input format for our program.)

    William

  • Geraldine_VdAuwera (Administrator, Dev; Posts: 11,127)
    Hi William, if you can help it at all, I would really recommend modernizing your pipeline *before* running a larger number of samples. But if that's not possible, then either stick to the "old list" of intervals and run that on the new samples, or re-run all the samples. That's the best way to avoid batch effects in variant representation.

    Geraldine Van der Auwera, PhD

  • shlee (Cambridge; Member, Administrator, Broadie, Moderator, Dev; Posts: 424)

    Hi William (@WVNicholson),

    If you want to assess the ramifications of your two approaches, then perhaps you can compare variant call concordance using two different variant evaluators. One concordance calculation would require identical genomic coordinates for concordance, e.g. GATK's VariantEval. The second concordance calculation would allow for different representations of a variant by pinning them back to the reference, e.g. RTG-Tools' vcfeval (not a GATK tool). You could then contrast the two calculations and thereby quantitate any differences.
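
A rough sketch of how the two concordance runs suggested above might be set up. It assumes both call sets are available as VCF, GATK 3.x invoked as `java -jar GenomeAnalysisTK.jar`, and an `rtg` executable on the PATH; for vcfeval it also assumes the VCFs are bgzipped and indexed and that the reference has already been converted to RTG's SDF format. All file names and helper functions are hypothetical, and the commands are only built, not run.

```python
GATK = ["java", "-jar", "GenomeAnalysisTK.jar"]  # assumed location of the GATK 3.x jar


def variant_eval_cmd(ref, eval_vcf, comp_vcf, report):
    """GATK VariantEval: compares records at identical genomic coordinates."""
    return GATK + ["-T", "VariantEval", "-R", ref,
                   "--eval", eval_vcf, "--comp", comp_vcf, "-o", report]


def vcfeval_cmd(template_sdf, baseline_vcf, calls_vcf, out_dir):
    """RTG Tools vcfeval: matches variants even when they are represented differently."""
    return ["rtg", "vcfeval", "--template", template_sdf,
            "--baseline", baseline_vcf, "--calls", calls_vcf,
            "--output", out_dir]


if __name__ == "__main__":
    # Approach A: all samples re-run; approach B: merged intervals, new samples only.
    print(" ".join(variant_eval_cmd("ref.fasta", "approach_B.vcf",
                                    "approach_A.vcf", "coordinate_concordance.grp")))
    print(" ".join(vcfeval_cmd("ref_sdf", "approach_A.vcf.gz",
                               "approach_B.vcf.gz", "vcfeval_out")))
```

Variants counted as discordant by the first run but matched by the second would point to differences in representation rather than differences in the calls themselves.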
