
GATK4: CollectAllelicCounts output & ModelSegments

Hi everyone,
I am trying to run the CNV discovery pipeline and have noticed that the header of sample.allelicCounts.tsv (produced by CollectAllelicCounts) causes problems when the file is used to run ModelSegments.

Indeed, ModelSegments gives me this error:
"A USER ERROR has occurred: Bad input: Bad header in file. Not all mandatory columns are present. Missing: POSITION, REF_COUNT, REF_NUCLEOTIDE, ALT_NUCLEOTIDE, ALT_COUNT"

I think it's because of the CollectAllelicCounts TSV header format:

Is there any specific option to modify the column order? Can I directly parse the file?




  • slee Member, Broadie, Dev ✭✭✭
    edited January 2018

    Hi @alegasp89,

    ModelSegments expects the output of CollectAllelicCounts, so you should not be running into this issue unless there is some other unexpected formatting problem with your file (perhaps due to a nonstandard sample name). The order of the columns in the error message is arbitrary and does not need to match the order of the columns in the file.

    Could you attach a snippet of the rest of the header in your sample.allelicCounts.tsv file (being careful to preserve tabs, etc.)?

  • slee Member, Broadie, Dev ✭✭✭
    edited January 2018

    Just following up, @alegasp89, did you figure out if your file contained a nonstandard sample name? If not, it would be great if we could get a bug fix in if necessary. Perhaps we could throw a more informative message even if the sample name was to blame.

  • dcampo Los Angeles, Member

    Hi, I am having the same issue and don't know how to fix it.
    Attached is a snippet of the sample file.


  • slee Member, Broadie, Dev ✭✭✭
    edited June 2019

    @dcampo I didn't run into any issue with parsing your snippet when running GATK ModelSegments. Can you post your command line and version number?

    EDIT: It occurs to me that one possible cause is that you are inadvertently passing some other file to the --allelic-counts argument (perhaps a denoised copy ratios file, which contains a CONTIG column but none of the other columns listed in the exception).
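
A quick way to test this hypothesis is to look at the column line of each file before handing it to ModelSegments. The sketch below is not part of GATK; the function names `tsv_columns` and `check_allelic_counts` are made up for illustration. It assumes the file starts with SAM-style header lines beginning with `@`, followed by one tab-separated line of column names, and it checks for CONTIG plus the five columns named in the error message.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not GATK commands.

# Print the column line of a GATK-style TSV: the first line that is not
# part of the SAM-style header (lines starting with '@').
tsv_columns() {
    grep -v '^@' "$1" | head -n 1
}

# Return 0 if the file has CONTIG plus the columns named in the
# ModelSegments error; otherwise list the missing ones and return 1.
check_allelic_counts() {
    local file=$1 col header missing=""
    header=$(tsv_columns "$file")
    for col in CONTIG POSITION REF_COUNT ALT_COUNT REF_NUCLEOTIDE ALT_NUCLEOTIDE; do
        # Split the header on tabs and match whole field names only.
        if ! printf '%s\n' "$header" | tr '\t' '\n' | grep -qx "$col"; then
            missing="$missing $col"
        fi
    done
    if [ -n "$missing" ]; then
        echo "$file: missing columns:$missing"
        return 1
    fi
    echo "$file: looks like an allelic counts file"
}

# Example (paths are placeholders):
#   check_allelic_counts path/to/tumor.allelicCounts.tsv
#   check_allelic_counts path/to/normal.allelicCounts.tsv
```

Running the check on every file passed to --allelic-counts and --normal-allelic-counts should show which argument, if any, points at a file with the wrong set of columns.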

  • dcampo Los Angeles, Member

    Hi, thanks for your reply. Here's my command line:

    java -jar $GATK ModelSegments \
    --denoised-copy-ratios ${name}_dir/${name}_CTC.denoisedCR.tsv \
    --allelic-counts ${name}_dir/${name}_CTC_clean.allelicCounts.tsv \
    --normal-allelic-counts ${name}_dir/${name}_WBC.denoisedCR.tsv \
    --output ${name}_dir \
    --output-prefix ${name}_CTC_clean

    I am using gatk-, and the snippet I sent is from the same ${name}_CTC_clean.allelicCounts.tsv file that I am passing to --allelic-counts.
    I've also checked the paths in the scripts, and everything seems to be in order. I also made sure that the field separator is a tab, as expected.
    And I tried to run one of the samples alone on a virtual node (as opposed to as a batch submission to the cluster), specifying full paths and all, but I am getting the same error.
    Now, the denoisedCR.tsv files have different headers, right? (After the SAM-style header, they have the fields CONTIG, START, END, and LOG2_COPY_RATIO.) So the error only refers to the allelicCounts.tsv file?
    These allelicCounts files come from CollectAllelicCounts, and I am following the tutorials 11682 and 11683.
    I am completely lost here, so any help will be much appreciated.

  • slee Member, Broadie, Dev ✭✭✭

    @dcampo looks like you are passing a denoisedCR.tsv file to --normal-allelic-counts. You should be passing the result of running CollectAllelicCounts on the matched normal to this argument. Hope that resolves the issue!
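
For reference, the corrected call might look like the sketch below. Only the --normal-allelic-counts value differs from the command posted above. The file name `${name}_WBC.allelicCounts.tsv` is an assumption: it stands for whatever CollectAllelicCounts wrote for the matched normal, and the actual name may differ. The wrapper function `run_model_segments` and the `DRY_RUN` variable are likewise hypothetical conveniences, added so the command can be printed and inspected before it is run.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the ModelSegments call from this thread.
# Usage: run_model_segments <sample-name>
# Set DRY_RUN=echo to print the command instead of executing it.
run_model_segments() {
    local name=$1
    # The WBC allelic counts file name below is assumed, not taken from
    # the thread; use the CollectAllelicCounts output for the matched normal.
    ${DRY_RUN:-} java -jar "$GATK" ModelSegments \
        --denoised-copy-ratios "${name}_dir/${name}_CTC.denoisedCR.tsv" \
        --allelic-counts "${name}_dir/${name}_CTC_clean.allelicCounts.tsv" \
        --normal-allelic-counts "${name}_dir/${name}_WBC.allelicCounts.tsv" \
        --output "${name}_dir" \
        --output-prefix "${name}_CTC_clean"
}
```

With `DRY_RUN=echo` set, calling `run_model_segments` with a sample name prints the full command line, which makes it easy to confirm that both allelic-counts arguments point at allelicCounts.tsv files rather than denoisedCR.tsv files.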

  • dcampo Los Angeles, Member

    Oh, yes, that's the issue...it's running now.
    I knew it had to be something silly, but could not see it. Rookie mistake :)
