
Poor SNP calling results

ajliv · Liverpool · Member

Hello, I'm new to the world of bioinformatics so apologies if this is not as detailed as some of the other questions!

I have run HaplotypeCaller (for joint genotyping) on PCR-free Illumina sequencing of bacterial cells against a reference I produced with PacBio. I've performed BWA-MEM alignment, duplicate removal and indel realignment as described in your Best Practices; I can't do base recalibration due to lack of data. The BAM files produced at these stages look normal, and the log files I created at each stage say everything ran successfully. However, when I run HaplotypeCaller, very few SNPs and indels are identified in the g.vcf (4 or 5), and the bamout file looks visually empty (although it is ~170 KB in size).

I'm running GATK version 3.3. The command I'm using is:
java -jar gatk.jar -R reference.fasta -T HaplotypeCaller -I preprocessed_reads.bam --pcr_indel_model NONE --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -o sample1.g.vcf -bamout bamout.bam

Any help anyone could give me would be amazing as I'm really stuck!


Answers

  • Sheila · Broad Institute · Member, Broadie, Moderator, admin

    Hi @ajliv,

    Can you post some IGV screenshots of the BAM files where you think variants should be called? Have you checked the base qualities and mapping qualities at those sites? Can you try lowering the --standard_min_confidence_threshold_for_calling as well?

    Thanks,
    Sheila
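On checking base qualities: in FASTQ/SAM records they are Phred scores stored as ASCII characters (Sanger encoding, offset 33), so a quick way to eyeball them is to decode the QUAL string. A minimal sketch, not GATK code; the quality string below is an invented example:

```python
def phred_scores(qual_string, offset=33):
    """Convert an ASCII quality string to per-base Phred scores."""
    return [ord(c) - offset for c in qual_string]

def error_probability(q):
    """A Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-q / 10)

scores = phred_scores("IIIIFFFF##")   # invented example string
print(scores)                  # [40, 40, 40, 40, 37, 37, 37, 37, 2, 2]
print(error_probability(40))   # 0.0001
```

Bases around Q30 or above are generally trustworthy; runs of very low scores (like the trailing `#` characters, Q2) are the kind of thing that can make a caller discard evidence at a site.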

  • ajliv · Liverpool · Member

    Hi Sheila,

    Thank you for your response. In the screenshot you can see the pre-processed BAM used as input to HaplotypeCaller in the bottom panel, the bamout BAM generated by HaplotypeCaller in the middle, and the VCF generated by HaplotypeCaller at the top. The pre-processed BAM shows the mapping quality, which looks good (I'm not sure about the base quality). What troubles me is that the bamout BAM appears completely empty, so I think something is going wrong.

    Thank you,

    Anjeet

  • EADG · Kiel · Member ✭✭✭

    Hi @ajliv,

    HaplotypeCaller applies a bunch of read filters to your input. Normally you see a note in the log saying how many reads were filtered out by each specific filter.

    Also, maybe you can use a newer version of GATK (a little bird told me we'll get GATK 3.7 soon :) ).

    In GATK 3.6 you can use --emitDroppedReads to also see the reads that were dropped by HaplotypeCaller for various reasons.

    Greetings
    EADG

  • ajliv · Liverpool · Member

    Hello!

    Firstly, thank you for your responses! I upgraded GATK to 3.6, but to no avail. I have since managed to get a populated bamout BAM file and numerous variants called by adjusting the ploidy.

    I just want to check that adjusting the ploidy was correct, as I'm not sure. I am working with bacterial cells with a ploidy of 1; however, I'm looking at a population of bacteria that I evolved against a specific bactericide. As they are not clonal, these cells may not be homogeneous and could contain different alleles in different individuals. I used ~4 ml of 10^9 CFU/ml cells to create my PCR-free Illumina libraries, which is 10^36 cells in total, so if I used the (ploidy per individual) x (individuals in the pool) calculation it would come out at a very large number indeed.

    I found this explanation from Geraldine of how to reflect allele frequency as a ploidy from this thread (http://gatkforums.broadinstitute.org/gatk/discussion/8407/analysis-of-genome-evolution-over-time-with-gatk)

    "Note however that this assumes that you are working with genetically homogeneous samples. If you mean to say that you're actually sequencing evolved colonies and that you're interested in capturing the allelic divergence within a colony, then you should treat each colony as a pool of samples, with a theoretical ploidy set to reflect the smallest allele frequency you hope to capture. For example, if you want to be able to capture alleles present in 10% of the cells in the colony, then you would set your ploidy to 10. If you want to go deeper, e.g. to 5%, you would set it to 20. Note that GATK cannot currently go past a ploidy of 21 without some severe performance decay; this is not a hard cap but an effective limitation due to some code inefficiencies."

    What I don't understand here is this: I would have thought that to go deeper, i.e. capture more, you would increase the allele frequency captured, yet here it says that to go deeper you decrease the captured allele frequency to 5% while increasing the ploidy to 20. Could someone explain this to me?

    Many thanks,

    Anjeet

  • ajliv · Liverpool · Member

    Sorry, that should read 10^9 colony forming units per ml and 10^36 colony forming units total. Thanks, A
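On the "go deeper" wording in Geraldine's quote: deeper means capturing rarer alleles, i.e. a smaller minimum allele frequency, and the theoretical ploidy is roughly the reciprocal of that frequency (one chromosome copy out of 1/f). A minimal sketch of that relationship; the function names are made up for illustration:

```python
import math

def ploidy_for_min_frequency(f_min):
    """Theoretical pool ploidy needed so an allele present in a fraction
    f_min of cells corresponds to at least one chromosome copy."""
    return math.ceil(1 / f_min)

def min_frequency_for_ploidy(ploidy):
    """Smallest allele fraction representable (one copy) at this ploidy."""
    return 1 / ploidy

print(ploidy_for_min_frequency(0.10))  # 10 -> captures alleles at >= 10%
print(ploidy_for_min_frequency(0.05))  # 20 -> "deeper": rarer alleles need higher ploidy
print(min_frequency_for_ploidy(20))    # 0.05
```

So decreasing the target frequency (10% to 5%) is what "going deeper" means, and the ploidy rises to match: 1/0.10 = 10, 1/0.05 = 20.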

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie, admin
    I think you would benefit from the allele culling functionality that was just released in the new version 3.7. Have a look at the version highlights (posted on the blog) and let us know if that makes sense.
  • ajliv · Liverpool · Member

    Hi Geraldine,

    Thank you so much for getting back to me. I've had a look at the version highlights, and I understand the theory; it's just really confusing how I apply all this to my pooled dataset, which has a vastly larger estimated ploidy, number of alleles and number of genotypes.

    So ploidy = (ploidy per individual) x (individuals in pool)
    = 1 x (4 x 10^9) colony forming units, which isn't feasible!

    If I used a ploidy of 20 (which is the limit without encountering performance decay) and reduced the population size by the same factor, then the allele frequency in the population I'm investigating would be 0.00005%, which is really low (if allele frequency = number of copies of allele / population size: 20 / 40,000,000 = 0.00005%).

    I read in this thread that there is a way to set a theoretical ploidy to reflect the smallest allele frequency for a population; could you explain this a bit more to me, please?

    "Note however that this assumes that you are working with genetically homogeneous samples. If you mean to say that you're actually sequencing evolved colonies and that you're interested in capturing the allelic divergence within a colony, then you should treat each colony as a pool of samples, with a theoretical ploidy set to reflect the smallest allele frequency you hope to capture. For example, if you want to be able to capture alleles present in 10% of the cells in the colony, then you would set your ploidy to 10. If you want to go deeper, e.g. to 5%, you would set it to 20. Note that GATK cannot currently go past a ploidy of 21 without some severe performance decay; this is not a hard cap but an effective limitation due to some code inefficiencies."

    Many thanks,

    A
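One way to read the quoted advice is that the pool size drops out of the calculation entirely: the ploidy parameter sets how many chromosome copies the genotype model considers per site, so the finest allele fraction it can represent is 1/ploidy, no matter how many cells went into the library. A small sketch of that arithmetic, reusing the 4 x 10^9 cell figure from this thread:

```python
ploidy = 20
cells_in_pool = 4 * 10**9   # ~4 ml at 10^9 CFU/ml, as described above

# The genotype model partitions each site into `ploidy` chromosome copies,
# so the smallest representable alt-allele fraction is one copy out of 20:
min_detectable_fraction = 1 / ploidy
print(min_detectable_fraction)   # 0.05, i.e. 5%

# The pool size never enters the calculation; each modelled copy simply
# stands in for a (cells_in_pool / ploidy)-sized subpopulation of cells.
print(cells_in_pool // ploidy)   # 200000000
```

On this reading, dividing the ploidy by the population size is the step that goes wrong above: ploidy is not "copies per cell times cells", but a resolution knob for the allele fractions you want to distinguish.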

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie, admin
    The new feature should solve your "too many PLs" warning problem. Try the new version on a subset of data with the same ploidy setting (20) as you did previously, and tell us what happens.
  • ajliv · Liverpool · Member

    Hi Geraldine,

    I ran the same sample with GATK 3.7, ploidy 20 and the original AFCalculator. The PL warning is still there, along with HaplotypeCallerGenotypingEngine warnings and a few others which I think I can ignore:


    Done. There were 7 WARN messages, the first 7 are repeated below.
    WARN 14:17:32,509 InbreedingCoeff - Annotation will not be calculated. InbreedingCoeff requires at least 10 unrelated samples.
    WARN 14:17:38,258 AFCalculator - Maximum allowed number of PLs (100) exceeded for sample sample1 at gnl|CGR|ADLGDCPD_1:9571-9571 with 231 possible genotypes. No PLs will be output for these genotypes (which may cause incorrect results in subsequent analyses) unless the --max_num_PL_values argument is increased accordingly. Unless the DEBUG logging level is used, this warning message is output just once per run and further warnings are suppressed.
    WARN 14:17:38,384 HaplotypeScore - Annotation will not be calculated, must be called from UnifiedGenotyper, not org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller
    WARN 14:37:40,184 HaplotypeCallerGenotypingEngine - Removed alt alleles where ploidy is 20 and original allele count is 4, whereas after trimming the allele count becomes 3. Alleles kept are:[T*, A, G]
    WARN 14:41:03,261 HaplotypeCallerGenotypingEngine - Removed alt alleles where ploidy is 20 and original allele count is 4, whereas after trimming the allele count becomes 3. Alleles kept are:[T*, A, C]
    WARN 15:14:58,645 HaplotypeCallerGenotypingEngine - Removed alt alleles where ploidy is 20 and original allele count is 4, whereas after trimming the allele count becomes 3. Alleles kept are:[CA*, AA, CAA]

    WARN 15:45:43,145 AFCalculator - Maximum allowed number of PLs (100) was exceeded 17943 time(s); the largest number of PLs found was 1771. No PLs will be output for these genotypes (which may cause incorrect results in subsequent analyses) unless the --max_num_PL_values argument is increased accordingly

    Do I need to increase the allele count and the maximum allowed number of PLs?

    Thanks,

    A
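For what it's worth, the genotype counts in the warnings above are consistent with the standard multiset ("stars and bars") count: a site with A alleles at ploidy P has C(P + A - 1, A - 1) unordered genotypes, and there is one PL value per genotype. A quick sketch (the helper name is made up):

```python
from math import comb

def genotype_count(ploidy, num_alleles):
    """Unordered genotypes (= PL values) at a site with `num_alleles`
    alleles and the given ploidy: multichoose = C(P + A - 1, A - 1)."""
    return comb(ploidy + num_alleles - 1, num_alleles - 1)

# At ploidy 20 the counts grow quickly with the number of alleles:
print(genotype_count(20, 3))   # 231   -> the "231 possible genotypes" warning
print(genotype_count(20, 4))   # 1771  -> the largest PL count in the log
print(genotype_count(20, 5))   # 10626 -> a fifth allele at the same ploidy
```

This is why trimming a site from 4 alleles to 3 (as the HaplotypeCallerGenotypingEngine warnings report) shrinks the PL list from 1771 to 231 values.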

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie, admin

    OK, so the new feature helped bring the ceiling for PLs down from 10626 to 1771; that's a good sign. At this point you can decide to allow more PLs; I think 1771 should be a tractable number. I wouldn't increase the allele count, though.

  • ajliv · Liverpool · Member

    Hi Geraldine,

    Thanks for your help! I did some reading on pooled sampling (I did PCR-free sequencing), and it suggests that I should not run the de-duplication step on my data. Is that correct?

    Thanks,

    A

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie, admin

    It depends how the pooled sampling is done. In cases where you prep multiple libraries from separate biological specimens, then pool them together before barcoding, then you could argue that there is a risk of marking things that are not duplicates, across libraries. But if you're prepping a library from a single bacterial population, then this does not apply, and you should NOT skip marking duplicates.

    Next time if you have a new question that's not related to the current thread, please post it as a new question, ok?

  • ajliv · Liverpool · Member

    Yes - Sorry! Thank you!
