

GermlineCNVCaller, some samples do not have read depth metadata

manolis Member ✭✭
edited March 1 in Ask the GATK team

GATK 4.1.0.0, Linux server, bash

Hi, I'm testing the following commands:

  ${GATK4} --java-options "${javaOpt}" DetermineGermlineContigPloidy \
  -L ${fol7}/"${INTERVAL}.preprocessed.interval_list" \
  --interval-merging-rule "OVERLAPPING_ONLY" \
  ${INPUT.hdf5} \
  --contig-ploidy-priors ${fol8}/ContigPloidyPriors.tsv \
  -O ${fol8}/"Karyo"/ \
  --output-prefix "Karyo_cohort" \
  --verbosity "DEBUG" \
  --tmp-dir ${tmp}/

  ${GATK4} --java-options "${javaOpt}" GermlineCNVCaller \
  --run-mode COHORT \
  -L ${fol7}/"${INTERVAL}.preprocessed.interval_list" \
  --interval-merging-rule "OVERLAPPING_ONLY" \
  --contig-ploidy-calls ${fol8}/"Karyo"/ \
  ${INPUT.hdf5} \
  --output ${fol8}/"germCNV"/ \
  --output-prefix "germCNV_cohort" \
  --verbosity "DEBUG" \
  --tmp-dir ${tmp}/

In both tools (DetermineGermlineContigPloidy, GermlineCNVCaller) I used the same interval list and the same input files... I cannot figure out why the second command, GermlineCNVCaller, fails with the error: "Some samples do not have read depth metadata"

Part of the logs:

  Traceback (most recent call last):
    File "/home/manolis/GATK4/tmp/cohort_denoising_calling.6605166682471840218.py", line 133, in <module>
      n_st, sample_names, sample_metadata_collection)
    File "/share/apps/bio/miniconda2/envs/gatk4100/lib/python3.6/site-packages/gcnvkernel/models/model_denoising_calling.py", line 379, in __init__
      sample_metadata_collection, sample_names, self.contig_list)
    File "/share/apps/bio/miniconda2/envs/gatk4100/lib/python3.6/site-packages/gcnvkernel/models/model_denoising_calling.py", line 618, in _get_baseline_copy_number_and_read_depth
      "Some samples do not have read depth metadata"
  AssertionError: Some samples do not have read depth metadata
  17:51:21.529 DEBUG ScriptExecutor - Result: 1
  17:51:21.531 INFO  GermlineCNVCaller - Shutting down engine
  [March 1, 2019 5:51:21 PM CET] org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller done. Elapsed time: 0.77 minutes.
  Runtime.totalMemory()=3591372800
  org.broadinstitute.hellbender.utils.python.PythonScriptExecutorException: 
  python exited with 1
  Command Line: python ...
  ....
          at org.broadinstitute.hellbender.utils.python.PythonExecutorBase.getScriptException(PythonExecutorBase.java:75)
          at org.broadinstitute.hellbender.utils.runtime.ScriptExecutor.executeCuratedArgs(ScriptExecutor.java:126)
          at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeArgs(PythonScriptExecutor.java:170)

...

Many thanks for any help!


Answers

  • shlee Cambridge, Member, Broadie ✭✭✭✭✭

    Hi @manolis,

    I think this code may be relevant. Read depth metadata include the sample name (RGID), a global read depth measurement, and an average ploidy measurement. Is it possible that some of your samples have missing or duplicate sample names and/or have no read depth for a given shard's interval regions?
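    To make that concrete, here is a rough sketch of the kind of check that assertion implies (my own illustration, not gcnvkernel's actual code; it assumes a global_read_depth.tsv with "@"-prefixed header lines followed by a GLOBAL_READ_DEPTH / AVERAGE_PLOIDY table):

```python
# Rough sketch (not gcnvkernel's actual code) of the read depth metadata
# check: each sample needs a positive global read depth and a positive
# average ploidy. Assumes "@"-prefixed header lines followed by a
# GLOBAL_READ_DEPTH / AVERAGE_PLOIDY table, as in global_read_depth.tsv.
def parse_global_read_depth(tsv_text):
    lines = [l for l in tsv_text.splitlines()
             if l.strip() and not l.startswith("@")]
    header, values = lines[0].split(), lines[1].split()
    record = dict(zip(header, map(float, values)))
    assert record["GLOBAL_READ_DEPTH"] > 0, "Some samples do not have read depth metadata"
    assert record["AVERAGE_PLOIDY"] > 0, "Some samples do not have read depth metadata"
    return record

example = "@RG\tID:GATKCopyNumber\tSM:sample1\nGLOBAL_READ_DEPTH\tAVERAGE_PLOIDY\n68.602\t1.965\n"
print(parse_global_read_depth(example))
```

    If any sample's calls directory is missing this file, or the values are not positive, the assertion you hit would be the expected symptom.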

  • manolis Member ✭✭
    edited March 2

    Hi @shlee,

    here some data (WES, male sample, excluded chrY):

    BAM file post BQSR step:
    @RG ID:A00125.27 SM:NA LB:LibXXX PL:Illumina

    I opened the NA_count.hdf5 (created with CollectReadCounts) with the tool HDF Compass:

    sample_metadata/sample_name: 0 | value NA

    Then, after DetermineGermlineContigPloidy

    contig_ploidy.tsv
    @RG ID:GATKCopyNumber SM:nan
    CONTIG PLOIDY PLOIDY_GQ
    chr1 2 121.58691665493308
    chr2 2 66.929042624026991
    chr3 2 125.72254121893937
    chr4 2 52.236504339151097
    chr5 2 129.29489779907942
    chr6 2 127.38939419262584
    chr7 2 126.1217246171709
    chr8 2 128.89776283927995
    chr9 2 127.18715358361072
    chr10 2 121.5834944974841
    chr11 2 119.35155491390761
    chr12 2 128.62071609172676
    chr13 2 64.176144741117383
    chr14 2 126.61902246145345
    chr15 2 126.34695492890827
    chr16 2 123.03293102822965
    chr17 2 113.35438809363458
    chr18 2 65.708345616618104
    chr19 2 20.120048966382647
    chr20 2 71.58939009824131
    chr21 2 115.8369431245886
    chr22 2 48.030234930280152
    chrX 1 49.31690049141848

    global_read_depth.tsv
    @RG ID:GATKCopyNumber SM:nan
    GLOBAL_READ_DEPTH AVERAGE_PLOIDY
    68.60203951133911 1.9654484693898575

    mu_psi_s_log__.tsv
    @shape=(1,)
    @dtype=float64
    @RG ID:GATKCopyNumber SM:nan
    -11.057087681461256

    sample_name.txt
    nan

    std_psi_s_log__.tsv
    @shape=(1,)
    @dtype=float64
    @RG ID:GATKCopyNumber SM:nan
    0.91610996962618341

    The sample name changed from "NA" to "nan"... please, could you confirm that running a sample named "NA" causes this conversion?

    For the other samples I do not have problems with the sample name; it is always correct.

    The @RG information is there (ID:, SM:); I do not need to add anything more, correct?

    There are not duplicate sample names in my cohort.

    If I had samples with no read depth for a given shard's interval regions, wouldn't I get an error or warning at the DetermineGermlineContigPloidy step? Is there any check for this at that step?

    Moreover, about "no read depth for a given shard's interval regions": a region with no reads could be a possible CNV=0 (homozygous deletion) or, of course, a region not covered for other reasons... That means GermlineCNVCaller must be tolerant of this, correct?
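    Side note: the NA → nan change looks consistent with how pandas (which, as far as I can tell, the gcnvkernel Python layer uses for reading these TSVs) treats the literal string "NA" as a missing value by default. A small illustration (my own example, not GATK code):

```python
# Illustration: pandas parses the literal string "NA" as a missing value by
# default, so a sample named "NA" round-trips to NaN ("nan").
import io
import pandas as pd

tsv = "SAMPLE_NAME\nNA\n"
df = pd.read_csv(io.StringIO(tsv), sep="\t")
print(pd.isna(df["SAMPLE_NAME"].iloc[0]))  # True: the string "NA" became NaN

# keep_default_na=False would preserve the literal name:
df_keep = pd.read_csv(io.StringIO(tsv), sep="\t", keep_default_na=False)
print(df_keep["SAMPLE_NAME"].iloc[0])  # "NA"
```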

  • manolis Member ✭✭

    An additional question: what is the GQ cut-off for a good PLOIDY_GQ?

  • manolis Member ✭✭
    edited March 2

    I tested another cohort and I got the same error :( I do not think the name change (NA to nan) is the main problem...

    Looking at the code (I do not fully understand Python), I saw:

        mandatory_tsv_columns = {
            io_consts.global_read_depth_column_name,
            io_consts.average_ploidy_column_name
        }
    
        def __init__(self,
                     sample_name: str,
                     global_read_depth: float,
                     average_ploidy: float):
            assert global_read_depth > 0
            assert average_ploidy > 0
            self.sample_name = sample_name
            self.global_read_depth = global_read_depth
            self.average_ploidy = average_ploidy
    

    As you told me, the code needs a sample name (RGID), a global read depth measurement, and an average ploidy measurement. That information is in the file global_read_depth.tsv, correct?

    Example:
    @RG ID:GATKCopyNumber SM:12-928
    GLOBAL_READ_DEPTH AVERAGE_PLOIDY
    80.49866491402294 2.0

    How can I check for "no read depth for a given shard's interval regions"? What I can tell you is that I started from the provider's BED file, processed it into .interval_list format with the PreprocessIntervals tool (--padding 250, --bin-length 0), and then used it in the germline CNV pipeline...

    With CollectReadCounts I created a .tsv file and checked it; for example, I found:

    CONTIG  START   END COUNT
    chr1    11831   12423   0
    chr1    12424   12983   2
    chr1    12984   13908   60
    chr1    14371   15265   4
    chr1    15546   16164   0
    chr1    16494   17173   9
    chr1    17174   18169   133
    

    By "no read depth for a given shard's interval regions", do you mean COUNT = 0? Or that the number of intervals in the .tsv file must match the number of intervals in the .interval_list? Just so I understand what kind of check I have to do...
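    In the meantime, this is the kind of cross-check I could run on my side (a sketch of mine; both file layouts are simplified here to "@"-prefixed headers plus whitespace-separated columns starting with contig/start/end):

```python
# Sketch of a cross-check between a CollectReadCounts .tsv and the
# preprocessed .interval_list: do both contain the same
# (contig, start, end) intervals? Layouts simplified for illustration.
def intervals(text, skip_headers=("CONTIG",)):
    out = set()
    for line in text.splitlines():
        if not line.strip() or line.startswith("@"):
            continue  # skip SAM-style header lines
        fields = line.split()
        if fields[0] in skip_headers:
            continue  # skip the column-header row of the counts table
        out.add(tuple(fields[:3]))  # (contig, start, end)
    return out

counts_tsv = "@HD\nCONTIG START END COUNT\nchr1 11831 12423 0\nchr1 12424 12983 2\n"
interval_list = "@HD\nchr1 11831 12423 + .\nchr1 12424 12983 + .\nchr1 12984 13908 + .\n"

missing = intervals(interval_list) - intervals(counts_tsv)
print(len(missing))  # intervals with no count row; prints 1 for this example
```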

    Thanks in advance for the help!

  • manolis Member ✭✭
    edited March 5

    Maybe today is my lucky day...

    After DetermineGermlineContigPloidy and before GermlineCNVCaller, I added the following steps:

    tar czf ${cohort_entity_id}-contig-ploidy-model.tar.gz -C ${output_dir_}/${cohort_entity_id}-model .
    tar czf ${cohort_entity_id}-contig-ploidy-calls.tar.gz -C ${output_dir_}/${cohort_entity_id}-calls .

    mkdir contig-ploidy-calls-dir
    tar xzf ${contig_ploidy_calls_tar} -C contig-ploidy-calls-dir

    It started running 30 minutes ago (I'm testing the second cohort reported above, 37 female samples) and it is still running... no errors so far...

    I think the point was that in the first command reported in this thread I used the contig-ploidy output directory and not the contig-ploidy calls directory...

  • manolis Member ✭✭
    edited March 5

    Yes, now it is running... that was the main problem. I still have to check the change of the sample name from "NA" to "nan"; I will update you. Many thanks!

    I think the variation of my PLOIDY_GQs may depend on the provider's capture design, because I see the same distribution in over 60 samples with the same design and instrument... I have to check this.

  • lakhujanivijay India, Member

    Hi

    I am running the following command:

    time gatk GermlineCNVCaller --run-mode COHORT -L scatter-sm/twelve_1of2.interval_list -I cvg/HG00096.tsv -I cvg/HG00268.tsv -I cvg/HG00419.tsv -I cvg/HG00759.tsv -I cvg/HG01051.tsv -I cvg/HG01112.tsv -I cvg/HG01500.tsv -I cvg/HG01565.tsv -I cvg/HG01583.tsv -I cvg/HG01595.tsv -I cvg/HG01879.tsv -I cvg/HG02568.tsv -I cvg/HG02922.tsv -I cvg/HG03006.tsv -I cvg/HG03052.tsv -I cvg/HG03642.tsv -I cvg/HG03742.tsv -I cvg/NA18525.tsv -I cvg/NA18939.tsv -I cvg/NA19017.tsv -I cvg/NA19625.tsv -I cvg/NA19648.tsv -I cvg/NA20502.tsv -I cvg/NA20845.tsv --contig-ploidy-calls ploidy-case-calls --annotated-intervals twelveregions.annotated.tsv --interval-merging-rule OVERLAPPING_ONLY --output cohort24-twelve/ --output-prefix cohort24-twelve_1of2 --verbosity DEBUG
    
    

    **And I got the following error** (last few lines of the execution):

    /home/bioinfo/miniconda2/envs/gatk/lib/python3.6/site-packages/h5py/__init__.py:68: UserWarning: h5py is running against HDF5 1.8.16 when it was built against 1.8.18, this may cause problems
      '{0}.{1}.{2}'.format(*version.hdf5_built_version_tuple)
    Traceback (most recent call last):
      File "/tmp/cohort_denoising_calling.1538985324736744138.py", line 133, in <module>
        n_st, sample_names, sample_metadata_collection)
      File "/home/bioinfo/miniconda2/envs/gatk/lib/python3.6/site-packages/gcnvkernel/models/model_denoising_calling.py", line 379, in __init__
        sample_metadata_collection, sample_names, self.contig_list)
      File "/home/bioinfo/miniconda2/envs/gatk/lib/python3.6/site-packages/gcnvkernel/models/model_denoising_calling.py", line 618, in _get_baseline_copy_number_and_read_depth
        "Some samples do not have read depth metadata"
    AssertionError: Some samples do not have read depth metadata
    17:06:08.242 DEBUG ScriptExecutor - Result: 1
    17:06:08.243 INFO  GermlineCNVCaller - Shutting down engine
    [17 July, 2019 5:06:08 PM IST] org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller done. Elapsed time: 0.25 minutes.
    Runtime.totalMemory()=1497890816
    org.broadinstitute.hellbender.utils.python.PythonScriptExecutorException: 
    python exited with 1
    Command Line: python /tmp/cohort_denoising_calling.1538985324736744138.py --ploidy_calls_path=/media/bioinfo/primary2/20_basespace_data/with_demo_data/7_germline_cnv_caller_cohort_mode/ploidy-case-calls --output_calls_path=/media/bioinfo/primary2/20_basespace_data/with_demo_data/7_germline_cnv_caller_cohort_mode/cohort24-twelve/cohort24-twelve_1of2-calls --output_tracking_path=/media/bioinfo/primary2/20_basespace_data/with_demo_data/7_germline_cnv_caller_cohort_mode/cohort24-twelve/cohort24-twelve_1of2-tracking --modeling_interval_list=/tmp/intervals4030242154297661888.tsv --output_model_path=/media/bioinfo/primary2/20_basespace_data/with_demo_data/7_germline_cnv_caller_cohort_mode/cohort24-twelve/cohort24-twelve_1of2-model --enable_explicit_gc_bias_modeling=True --read_count_tsv_files /tmp/sample-02859526290181937611.tsv /tmp/sample-13203986626442421821.tsv /tmp/sample-21101821462697366461.tsv /tmp/sample-35327672039943924342.tsv /tmp/sample-45587001116585423025.tsv /tmp/sample-5177014631055481666.tsv /tmp/sample-61186963140861297727.tsv /tmp/sample-76931927280769201652.tsv /tmp/sample-83936152174870464380.tsv /tmp/sample-91677119363227105179.tsv /tmp/sample-10248884061334988341.tsv /tmp/sample-117704187159218473218.tsv /tmp/sample-121989362654148846573.tsv /tmp/sample-137028185212672198490.tsv /tmp/sample-144755818761423918677.tsv /tmp/sample-153692880347548013038.tsv /tmp/sample-165785095469194328602.tsv /tmp/sample-17108787667432637013.tsv /tmp/sample-182442868298900155211.tsv /tmp/sample-194765242652018908678.tsv /tmp/sample-203006103746924190830.tsv /tmp/sample-211198757025399580347.tsv /tmp/sample-222522297656770914995.tsv /tmp/sample-236265279669300744828.tsv --psi_s_scale=1.000000e-04 --mapping_error_rate=1.000000e-02 --depth_correction_tau=1.000000e+04 --q_c_expectation_mode=hybrid --max_bias_factors=5 --psi_t_scale=1.000000e-03 --log_mean_bias_std=1.000000e-01 --init_ard_rel_unexplained_variance=1.000000e-01 --num_gc_bins=20 
--gc_curve_sd=1.000000e+00 --active_class_padding_hybrid_mode=50000 --enable_bias_factors=True --disable_bias_factors_in_active_class=False --p_alt=1.000000e-06 --cnv_coherence_length=1.000000e+04 --max_copy_number=5 --p_active=0.010000 --class_coherence_length=10000.000000 --learning_rate=1.000000e-02 --adamax_beta1=9.000000e-01 --adamax_beta2=9.900000e-01 --log_emission_samples_per_round=50 --log_emission_sampling_rounds=10 --log_emission_sampling_median_rel_error=5.000000e-03 --max_advi_iter_first_epoch=5000 --max_advi_iter_subsequent_epochs=200 --min_training_epochs=10 --max_training_epochs=50 --initial_temperature=1.500000e+00 --num_thermal_advi_iters=2500 --convergence_snr_averaging_window=500 --convergence_snr_trigger_threshold=1.000000e-01 --convergence_snr_countdown_window=10 --max_calling_iters=10 --caller_update_convergence_threshold=1.000000e-03 --caller_internal_admixing_rate=7.500000e-01 --caller_external_admixing_rate=1.000000e+00 --disable_caller=false --disable_sampler=false --disable_annealing=false
        at org.broadinstitute.hellbender.utils.python.PythonExecutorBase.getScriptException(PythonExecutorBase.java:75)
        at org.broadinstitute.hellbender.utils.runtime.ScriptExecutor.executeCuratedArgs(ScriptExecutor.java:126)
        at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeArgs(PythonScriptExecutor.java:170)
        at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeScript(PythonScriptExecutor.java:151)
        at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeScript(PythonScriptExecutor.java:121)
        at org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller.executeGermlineCNVCallerPythonScript(GermlineCNVCaller.java:441)
        at org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller.doWork(GermlineCNVCaller.java:288)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:138)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
        at org.broadinstitute.hellbender.Main.main(Main.java:291)
    
    real    0m16.409s
    user    0m17.735s
    sys 0m2.632s
    
    

    I have downloaded the data from this link

    What could possibly be wrong? What additional information would you require?

  • slee Member, Broadie, Dev ✭✭✭
    edited July 17

    @lakhujanivijay looks like you are following the gCNV tutorial. I think the command you want is:

    gatk GermlineCNVCaller --run-mode COHORT -L scatter-sm/twelve_1of2.interval_list -I cvg/HG00096.tsv -I cvg/HG00268.tsv -I cvg/HG00419.tsv -I cvg/HG00759.tsv -I cvg/HG01051.tsv -I cvg/HG01112.tsv -I cvg/HG01500.tsv -I cvg/HG01565.tsv -I cvg/HG01583.tsv -I cvg/HG01595.tsv -I cvg/HG01879.tsv -I cvg/HG02568.tsv -I cvg/HG02922.tsv -I cvg/HG03006.tsv -I cvg/HG03052.tsv -I cvg/HG03642.tsv -I cvg/HG03742.tsv -I cvg/NA18525.tsv -I cvg/NA18939.tsv -I cvg/NA19017.tsv -I cvg/NA19625.tsv -I cvg/NA19648.tsv -I cvg/NA20502.tsv -I cvg/NA20845.tsv --contig-ploidy-calls ploidy-calls --annotated-intervals twelveregions.annotated.tsv --interval-merging-rule OVERLAPPING_ONLY --output cohort24-twelve --output-prefix cohort24-twelve_1of2 --verbosity DEBUG
    

    Note that we specify --contig-ploidy-calls ploidy-calls because the ploidy-calls directory contains the DetermineGermlineContigPloidy calls for the cohort samples; looks like you are accidentally passing ploidy-case-calls, which contains the calls for the case sample that is run later in the tutorial.
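    A quick way to tell the two directories apart: the calls directory written by DetermineGermlineContigPloidy contains one SAMPLE_<index> subdirectory per cohort sample, each holding files such as global_read_depth.tsv (as shown earlier in this thread). A small sketch on my side for checking this (the exact layout is an assumption from those outputs):

```python
# Sketch: a contig-ploidy *calls* directory should contain one
# SAMPLE_<index> subdirectory per cohort sample; passing a directory
# without them is what triggers "Some samples do not have read depth
# metadata". Layout assumed from the outputs shown in this thread.
import os

def sample_call_dirs(calls_dir):
    return sorted(d for d in os.listdir(calls_dir)
                  if d.startswith("SAMPLE_")
                  and os.path.isdir(os.path.join(calls_dir, d)))

# e.g. sample_call_dirs("ploidy-calls") should list SAMPLE_0 ... SAMPLE_23
# for a 24-sample cohort; an empty list suggests the wrong directory.
```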

  • lakhujanivijay India, Member

    Thanks, that solved the issue!
