Hi GATK Users,

Happy Thanksgiving!
Our staff will be observing the holiday and will be unavailable from 22nd to 25th November. This will cause a delay in reaching out to you and answering your questions immediately. Rest assured we will get back to it on Monday November 26th. We are grateful for your support and patience.
Have a great holiday everyone!!!

Regards
GATK Staff

Dynamically pass multiple input to Picard's MarkDuplicates (multiplexed data)

To pass multiple BAM files to MarkDuplicates we use the following syntax:

java -jar picard.jar MarkDuplicates \
    INPUT=lane1.bam \
    INPUT=lane2.bam \
    OUTPUT=dedup.bam \
    METRICS_FILE=dedub_metrics.txt

However, this syntax doesn't requires us to know the number of inputs beforehand. This is not particularly practical. Is there no way this can be done dynamically, for example, some GATK functions take files containing a list of input files (but MarkDuplicates doesn't).

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    The established way to do this is to use a script that generates the command appropriately.

  • jlorsaljlorsal Tenerife, Canary islands, SPAINMember
    edited January 19

    Dear olavur,

    This is how I do it using BASH resorces.

    Prepare a 'bam-list' text file grabbing all BAM filenames corresponding to the same sample (not shown; if you need more help with this, let me know). You may do it manually or just dinamically grab either the list of directories and BAM filenames (assuming you have a number of directories hosting BAM files to be deduped per each-sample per lane/run) or grab the content of a directory containing the BAM filenames to be dedupped).

    Let's name it 'bam_list':

    bamlist="path-to-your/bam_list-file"
    

    In any case, this 'bam-list' file will contain, in separate lines, the filenames of your inputs. An example of this file follows showing three BAM files to be merged by PICARD at the same time duplicates are marked (i.e. a sample 'sampleA' pooled through three lanes [1, 2, and 3] in the same run):

    /server/hiseq4000/run1/wgs/sampleA_S1_L001/sampleA_S1_L001.lanebam
    /server/hiseq4000/run1/wgs/sampleA_S1_L002/sampleA_S1_L002.lane.bam
    /server/hiseq4000/run1/wgs/sampleA_S1_L003/sampleA_S1_L003.lane.bam
    

    Parse each line of 'bam_list' file, add the string "INPUT" as suffix, and save it into a temp file:

    awk '{print "INPUT="$1}' ${bam_list} > ${bam_list_temp}
    

    'bam_list_temp' will display this content:

    INPUT=/server/hiseq4000/run1/wgs/sampleA_S1_L001/sampleA_S1_L001.lanebam
    INPUT=/server/hiseq4000/run1/wgs/sampleA_S1_L002/sampleA_S1_L002.lane.bam
    INPUT=/server/hiseq4000/run1/wgs/sampleA_S1_L003/sampleA_S1_L003.lane.bam
    

    Pass each line of the temporal 'bam_list_temp' file into a BASH array named "bam_array":

    readarray -t bam_array < ${bam_list_temp}
    

    To see the content of this array, run this:

    echo ${array[@]}
    

    and will see this content:

    INPUT=/server/hiseq4000/run1/wgs/sampleA_S1_L001/sampleA_S1_L001.lanebam INPUT=/server/hiseq4000/run1/wgs/sampleA_S1_L002/sampleA_S1_L002.lane.bam INPUT=/server/hiseq4000/run1/wgs/sampleA_S1_L003/sampleA_S1_L003.lane.bam
    

    Then run PICARD MarkDuplicates command (example for a HiSeq4000, using patterned flowcells, pixel distance 2500). Notice the '$(echo ${bam_array[@]})' argument: it will add the content of echoing the whole BASH array to PICARD command line:

    java -jar picard.jar MarkDuplicates \
        $(echo ${bam_array[@]}) \
        OUTPUT=${outfile} \
        CREATE_INDEX=true \
        MAX_RECORDS_IN_RAM=4000000 \
        METRICS_FILE=${metrics} \
        REMOVE_DUPLICATES=false \
        OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 \
        VALIDATION_STRINGENCY=SILENT \
        ASSUME_SORT_ORDER="coordinate" \
        CREATE_MD5_FILE=true \
        TMP_DIR=${temporal}
    
  • CPCCPC Member
    edited April 4

    Found this today.

    I am trying a simpler solution:

    params=$(cat bams.list.file | while read bam; do printf "I=$bam "; done);
    
    cmd="java -Xmx10g -jar picard.jar MergeSamFiles "$params" O=picard.merged.bam USE_THREADING=true"
    
    $cmd
    

    Best

  • alaabadrealaabadre FranceMember

    Or you could use bash for a simple loop:

    declare -a arr=( "sample1" "sample2" "sample3") # keep the quotes

    for i in "${arr[@]}"; do java -jar path-to-picard/picard.jar MarkDuplicates \
        INPUT=path-to-sample/"$i"_lane1.bam \
        INPUT=path-to-sample/"$i"_lane2.bam \
        OUTPUT="$i"_dedup.bam \
        METRICS_FILE="$i"_dedup_metrics.txt; done
    

    This will run each job one by one if your PC doesn't have much resources. However, if your PC has lots of resources or you are using a cluster, you could parallelize the jobs by simply adding a & at the end:

    for i in "${arr[@]}"; do java -jar path-to-picard/picard.jar MarkDuplicates \
        INPUT=path-to-sample/"$i"_lane1.bam \
        INPUT=path-to-folder-sample/"$i"_lane2.bam \
        OUTPUT=(path-to-output-folder)/"$i"_dedup.bam \
        METRICS_FILE="$i"_dedup_metrics.txt; done &
    

    Best regards,
    Alaa

Sign In or Register to comment.