sample_gender.report.txt file is empty

jason.harris (Menlo Park, CA; Member)

I am getting a large number of errors like this from SVPreprocess:
SVQScript-1000.out:Exception in thread "main" org.broadinstitute.sv.commandline.ArgumentException: Gender map file not found: /hpc/research/users/jharris/SVModule/GeisingerGold/Genome_STRiP/URB875A1_metadata/sample_gender.report.txt

That is not completely accurate: the sample_gender.report.txt file is present, but empty.
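To pin down what "empty" means for a report like this, a small check along the following lines may help. It is only a sketch: the function name `describe_report` is made up for illustration, and it assumes that the first non-blank line of a populated report is a header row, which is not confirmed anywhere in this thread.

```python
from pathlib import Path


def describe_report(path):
    """Classify a report file by how much content it holds.

    Returns one of: 'missing', 'zero-length', 'whitespace-only',
    'header-only', 'has-data'.
    Assumption (not confirmed by the docs): the first non-blank line is a header.
    """
    report = Path(path)
    if not report.is_file():
        return "missing"
    if report.stat().st_size == 0:
        return "zero-length"
    # Ignore blank lines so trailing newlines do not count as data.
    lines = [ln for ln in report.read_text().splitlines() if ln.strip()]
    if not lines:
        return "whitespace-only"
    if len(lines) == 1:
        return "header-only"
    return "has-data"
```

Calling `describe_report()` with the path from the error message would distinguish a file that was never written from one that was created but received no data lines.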

I gather from reading docs and tutorials that in previous versions of Genome STRiP, the user had to provide a "gender map file", but that now in v2.0.0, it does its own estimation using the genderMaskBedFile. It seems for whatever reason, this estimation process failed in my run.

I also found this sentence in the documentation (http://www.broadinstitute.org/software/genomestrip/org_broadinstitute_sv_qscript_SVPreprocess.html):
A report file is produced with sample gender and sex chromosome "dosage", but this information is currently not used in downstream processing by default. The user must explicitly specify a file containing the gender of each sample, which can be based on the read depth gender estimation or on the reported gender of each sample

First of all, it's simply not true that this report is not used downstream, as my dozens of error messages attest.

So, for a short term workaround, can I just fill in the required data in the empty sample_gender.report.txt file? Does this have the same format as I saw for the previous version of the program (i.e., a 2-column file with Sample name and either "M" or "F")?
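If the format is indeed the two-column layout described above, a hand-made file could be sanity-checked before use with something like the sketch below. The details are assumptions rather than documented behavior: whitespace-separated columns, no header line, and only "M" or "F" as values. The function names are made up for illustration.

```python
def parse_gender_map(text):
    """Parse lines of the form 'SAMPLE M' or 'SAMPLE F' into a dict.

    Assumptions (not confirmed by the documentation): columns are separated
    by whitespace, there is no header line, and only 'M' and 'F' are valid.
    Raises ValueError on malformed lines or duplicate sample names.
    """
    genders = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split()
        if len(fields) != 2 or fields[1] not in ("M", "F"):
            raise ValueError(f"line {lineno}: expected 'SAMPLE M|F', got {line!r}")
        sample, gender = fields
        if sample in genders:
            raise ValueError(f"line {lineno}: duplicate sample {sample!r}")
        genders[sample] = gender
    return genders


def format_gender_map(genders):
    """Render a {sample: 'M'|'F'} dict as tab-separated lines, sorted by sample."""
    return "".join(f"{sample}\t{gender}\n" for sample, gender in sorted(genders.items()))
```

Running the parser over the file first would catch a missing column or a stray value before another 1000+ jobs are launched against it.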

Of course if I want to use this program going forward, I will need to try to understand why the gender estimation failed and how to mitigate it. Since all of the 1000+ log files have similar names ("SVQScript-NNNN.out"), what is the most efficient way for me to figure out which of those files represents the gender-estimation job, so I can learn more about why it failed?

If the internal gender estimation is unreliable, I could also work around the problem by providing my own gender report file up front. It looks like the previous version of the program had a genderMapFile option for this purpose, but the current documentation does not mention this option. Is there a way to override gender estimation if I can't easily fix whatever issue led to its failure?


Answers

  • bhandsaker (Member, Broadie)

    Thanks for the detailed report.

    When you say sample_gender.report.txt is "empty", is it zero length or is there a header with column names but no data lines?

    You are right that the documentation is out of date - I will fix it. We used to require the user to supply a gender map (if you are going to call on sex chromosomes), and you still can using the same format as previously. Just supply your file with -genderMapFile.

    Because manual gender determination was an extra step (and it was sometimes not accurate), we have now changed the code so that if the user does not explicitly supply a gender map file, then we use the default gender calculations in metadata/sample_gender.report.txt. We also made this change after evaluating the gender calling accuracy in a number of cohorts. In addition, recent releases also detect unusual gender karyotypes (e.g. XXY males with Klinefelter syndrome). Our handling of unusual gender karyotypes is not perfect yet (we just withhold them from discovery on the sex chromosomes), but it's improved over the earlier behavior.

    But I also agree that you should try to get to the bottom of your processing problems. You don't say this explicitly, but I'm assuming that preprocessing completed with no errors (i.e. the Queue script said everything ran without error). In this case, it sounds like the failure was something the software couldn't detect - for example if the Java code wrote the file but NFS got an error which wasn't reported to the application (or we swallowed the exception, but we try to be careful never to do that). In general, it feels like there is some problem in your environment where writes to the shared file system from your compute cluster are not always reliable.

    To find the log for the job that did the gender calling, grep for CallSampleGender in the SVPreprocess-*.out files.
    Check to see if there was any error reported in this file. If not, then try deleting the file metadata/.sample_gender.report.txt.done (does this file exist?) and then rerun preprocessing.
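The log search suggested above can also be scripted, which may be convenient with 1000+ log files. The following is a rough sketch and roughly equivalent to `grep -l`. The function name is made up, and the default `*.out` pattern is an assumption chosen so that it covers both the SVPreprocess-*.out names mentioned in the reply and the SVQScript-NNNN.out names from the original post.

```python
from pathlib import Path


def find_logs_mentioning(log_dir, needle="CallSampleGender", pattern="*.out"):
    """Return the names of log files in log_dir whose text contains needle.

    The glob pattern is an assumption; narrow it (e.g. 'SVQScript-*.out')
    if the directory holds unrelated .out files.
    """
    matches = []
    for path in sorted(Path(log_dir).glob(pattern)):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a log holds odd bytes.
        with open(path, errors="replace") as handle:
            if any(needle in line for line in handle):
                matches.append(path.name)
    return matches
```

Calling `find_logs_mentioning()` on the log directory should return the short list of files to inspect for errors from the gender-calling job.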
