Problems with Mate-Pairs in 1000G CRAM files.

MehulS Member
edited January 30 in Ask the GATK team

I downloaded the WGS Phase 3 GRCh38 CRAM files from 1000 Genomes and created the reference directory as per their [instructions](http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/README_using_1000genomes_cram.md "instructions"). When I ran ValidateSamFile on the CRAMs, and on the BAMs I subsequently converted them to, I got many errors regarding mate pairs.

Running FixMateInformation with Add_MC_tag=true apparently fixed the errors, but I still don't understand what caused them. I asked 1000G, but they said they couldn't replicate the errors.

GATK version - 4.0.11
Java version - 1.8

Answers

  • AdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @MehulS

    Add_MC_tag=true just adds the MC tag, which records the CIGAR string of the mate read, to each record in the bam/cram/sam file.

    That information can be found here

    There is a discussion on why this is important here
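
To make the MC tag concrete, here is a minimal Python sketch of what adding it amounts to on plain-text SAM records. It is an illustration only, not how Picard or GATK implement FixMateInformation; the function name `add_mate_cigar` and the two example records are made up for this sketch.

```python
# Sketch only: mimics the effect of adding MC tags on plain-text SAM lines.
# This is NOT the Picard/GATK implementation; names and records are hypothetical.

def add_mate_cigar(read_line, mate_line):
    """Return read_line with an MC:Z tag holding the mate's CIGAR string."""
    read = read_line.rstrip("\n").split("\t")
    mate = mate_line.rstrip("\n").split("\t")
    mate_flag = int(mate[1])
    mate_cigar = mate[5]
    # Keep the 11 mandatory columns and drop any existing MC tag
    # so the tag is never duplicated.
    fields = read[:11] + [f for f in read[11:] if not f.startswith("MC:")]
    # An unmapped mate (flag bit 0x4) or a "*" CIGAR has nothing to record.
    if not (mate_flag & 0x4) and mate_cigar != "*":
        fields.append("MC:Z:" + mate_cigar)
    return "\t".join(fields)

# Hypothetical read pair (SEQ and QUAL left as "*" for brevity).
r1 = "\t".join(["pair1", "99", "chr1", "100", "60", "50M", "=", "300", "240", "*", "*"])
r2 = "\t".join(["pair1", "147", "chr1", "300", "60", "40M10S", "=", "100", "-240", "*", "*"])

fixed_r1 = add_mate_cigar(r1, r2)
fixed_r2 = add_mate_cigar(r2, r1)
print(fixed_r1)
print(fixed_r2)
```

Each record ends up carrying its mate's CIGAR, so a validator can compare the tag against the mate's actual CIGAR column.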

    As for why the downloaded data set does not contain this tag, I cannot figure out what makes that data set different.

    Was the data set you downloaded from the Broad resource bundle? Most of those references have been cleaned to avoid these errors when using GATK.

    I tried your links, but they did not work for me.

  • MehulS Member

    Thank you for your reply, Adelaide. My data is from the 1000 Genomes FTP server. Reposting the links:

    http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/data/

    README doc for CRAM file usage instructions: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/README_using_1000genomes_cram.md

  • MehulS Member
    edited January 31

    I'm primarily trying to figure out the general reasons why a file throws these mate pair errors. Is there any chance that ValidateSamFile is reporting an error where none exists?
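
As a rough illustration of the general reasons asked about here: mate errors usually mean that the fields a record stores about its mate disagree with what the mate's own record says. The Python sketch below checks a few such fields on plain-text SAM lines. It is loosely modeled on this kind of validation, not a reimplementation of ValidateSamFile; the function name `mate_mismatches`, the labels it returns, and the example records are all assumptions made for this sketch.

```python
# Sketch only: a few mate-consistency checks on plain-text SAM lines.
# Not the ValidateSamFile implementation; labels and records are hypothetical.

def mate_mismatches(read_line, mate_line):
    """List the ways read_line's mate fields disagree with mate_line."""
    read = read_line.rstrip("\n").split("\t")
    mate = mate_line.rstrip("\n").split("\t")
    if read[0] != mate[0]:
        raise ValueError("records have different read names")
    r_flag, m_flag = int(read[1]), int(mate[1])
    problems = []
    # RNEXT "=" means "same reference as this read's RNAME".
    rnext = read[2] if read[6] == "=" else read[6]
    if rnext != mate[2]:
        problems.append("mate reference")
    # PNEXT should equal the mate's POS.
    if read[7] != mate[3]:
        problems.append("mate start")
    # Flag 0x20 (mate reverse strand) should mirror the mate's 0x10.
    if bool(r_flag & 0x20) != bool(m_flag & 0x10):
        problems.append("mate strand")
    # Flag 0x8 (mate unmapped) should mirror the mate's 0x4.
    if bool(r_flag & 0x8) != bool(m_flag & 0x4):
        problems.append("mate unmapped")
    # If an MC tag is present, it should equal the mate's CIGAR.
    mc = [f[5:] for f in read[11:] if f.startswith("MC:Z:")]
    if mc and mc[0] != mate[5]:
        problems.append("mate CIGAR")
    return problems

# Hypothetical records: r1_ok agrees with r2, r1_stale carries outdated mate info.
r2 = "\t".join(["pair1", "147", "chr1", "300", "60", "40M10S", "=", "100", "-240", "*", "*"])
r1_ok = "\t".join(["pair1", "99", "chr1", "100", "60", "50M", "=", "300", "240", "*", "*"])
r1_stale = "\t".join(["pair1", "67", "chr1", "100", "60", "50M", "=", "250", "240", "*", "*"])

ok_result = mate_mismatches(r1_ok, r2)
stale_result = mate_mismatches(r1_stale, r2)
print(ok_result)
print(stale_result)
```

Stale mate fields of this kind can arise when one read of a pair is realigned, clipped or filtered after alignment without its mate's record being updated, which is the situation FixMateInformation is meant to repair.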

  • AdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @MehulS

    So, when you run ValidateSamFile on the fixed file, does it still show this error?

    I think this is something unique to the mate pair information in these files. I think Broad recommends using their version of these files because whatever read groups or CIGAR strings are causing these errors have been cleaned up before the files are put in the Resource Bundle.

    Did you get this error when using the Broad version of the 1000 genomes file?

    I can't really speak to how the UK group generated the BAMs. Have you tried comparing a few lines from the file before and after running FixMateInformation to see exactly where in the BAM the change is made?
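
One way to do the before/after comparison suggested here is to diff the same record field by field. The Python sketch below does this for two plain-text SAM lines, for example lines taken from `samtools view` output of the original and the fixed file. The function name `diff_sam_records` and the example records are hypothetical, and the tag parsing assumes the standard TAG:TYPE:VALUE layout.

```python
# Sketch only: field-by-field diff of one SAM record before and after a fix.
# The example records are hypothetical.

SAM_COLUMNS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

def diff_sam_records(before_line, after_line):
    """Return {field_or_tag: (old, new)} for everything that differs."""
    before = before_line.rstrip("\n").split("\t")
    after = after_line.rstrip("\n").split("\t")
    changes = {}
    # Compare the 11 mandatory columns by name.
    for name, old, new in zip(SAM_COLUMNS, before[:11], after[:11]):
        if old != new:
            changes[name] = (old, new)
    # Optional fields look like TAG:TYPE:VALUE; key them by the two-letter tag.
    old_tags = {f[:2]: f[3:] for f in before[11:]}
    new_tags = {f[:2]: f[3:] for f in after[11:]}
    for tag in sorted(set(old_tags) | set(new_tags)):
        if old_tags.get(tag) != new_tags.get(tag):
            # None means the tag is absent on that side.
            changes[tag] = (old_tags.get(tag), new_tags.get(tag))
    return changes

# Hypothetical record before and after the fix.
before = "\t".join(["pair1", "99", "chr1", "100", "60", "50M", "=", "250", "240",
                    "*", "*", "NM:i:0"])
after = "\t".join(["pair1", "99", "chr1", "100", "60", "50M", "=", "300", "240",
                   "*", "*", "NM:i:0", "MC:Z:40M10S"])

record_changes = diff_sam_records(before, after)
print(record_changes)
```

If the only difference reported is a new MC entry, the fix merely added the tag; changes in FLAG, RNEXT, PNEXT or TLEN would instead point to mate information that was actually wrong in the downloaded file.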

  • AdelaideR Unconfirmed, Member, Broadie, Moderator admin

    Hi @MehulS, I haven't heard from you in a few days, so I imagine you found the answer. If you get a chance, please chime in on what worked for you.

    In the meantime, I am closing this ticket. Just post here to reopen it.
