To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

Dynamically pass multiple input to Picard's MarkDuplicates (multiplexed data)

To pass multiple BAM files to MarkDuplicates we use the following syntax:

java -jar picard.jar MarkDuplicates \
    INPUT=lane1.bam \
    INPUT=lane2.bam \
    OUTPUT=dedup.bam \
    METRICS_FILE=dedub_metrics.txt

However, this syntax doesn't requires us to know the number of inputs beforehand. This is not particularly practical. Is there no way this can be done dynamically, for example, some GATK functions take files containing a list of input files (but MarkDuplicates doesn't).

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    The established way to do this is to use a script that generates the command appropriately.

  • jlorsaljlorsal Tenerife, Canary islands, SPAINMember
    edited January 19

    Dear olavur,

    This is how I do it using BASH resorces.

    Prepare a 'bam-list' text file grabbing all BAM filenames corresponding to the same sample (not shown; if you need more help with this, let me know). You may do it manually or just dinamically grab either the list of directories and BAM filenames (assuming you have a number of directories hosting BAM files to be deduped per each-sample per lane/run) or grab the content of a directory containing the BAM filenames to be dedupped).

    Let's name it 'bam_list':

    bamlist="path-to-your/bam_list-file"
    

    In any case, this 'bam-list' file will contain, in separate lines, the filenames of your inputs. An example of this file follows showing three BAM files to be merged by PICARD at the same time duplicates are marked (i.e. a sample 'sampleA' pooled through three lanes [1, 2, and 3] in the same run):

    /server/hiseq4000/run1/wgs/sampleA_S1_L001/sampleA_S1_L001.lanebam
    /server/hiseq4000/run1/wgs/sampleA_S1_L002/sampleA_S1_L002.lane.bam
    /server/hiseq4000/run1/wgs/sampleA_S1_L003/sampleA_S1_L003.lane.bam
    

    Parse each line of 'bam_list' file, add the string "INPUT" as suffix, and save it into a temp file:

    awk '{print "INPUT="$1}' ${bam_list} > ${bam_list_temp}
    

    'bam_list_temp' will display this content:

    INPUT=/server/hiseq4000/run1/wgs/sampleA_S1_L001/sampleA_S1_L001.lanebam
    INPUT=/server/hiseq4000/run1/wgs/sampleA_S1_L002/sampleA_S1_L002.lane.bam
    INPUT=/server/hiseq4000/run1/wgs/sampleA_S1_L003/sampleA_S1_L003.lane.bam
    

    Pass each line of the temporal 'bam_list_temp' file into a BASH array named "bam_array":

    readarray -t bam_array < ${bam_list_temp}
    

    To see the content of this array, run this:

    echo ${array[@]}
    

    and will see this content:

    INPUT=/server/hiseq4000/run1/wgs/sampleA_S1_L001/sampleA_S1_L001.lanebam INPUT=/server/hiseq4000/run1/wgs/sampleA_S1_L002/sampleA_S1_L002.lane.bam INPUT=/server/hiseq4000/run1/wgs/sampleA_S1_L003/sampleA_S1_L003.lane.bam
    

    Then run PICARD MarkDuplicates command (example for a HiSeq4000, using patterned flowcells, pixel distance 2500). Notice the '$(echo ${bam_array[@]})' argument: it will add the content of echoing the whole BASH array to PICARD command line:

    java -jar picard.jar MarkDuplicates \
        $(echo ${bam_array[@]}) \
        OUTPUT=${outfile} \
        CREATE_INDEX=true \
        MAX_RECORDS_IN_RAM=4000000 \
        METRICS_FILE=${metrics} \
        REMOVE_DUPLICATES=false \
        OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 \
        VALIDATION_STRINGENCY=SILENT \
        ASSUME_SORT_ORDER="coordinate" \
        CREATE_MD5_FILE=true \
        TMP_DIR=${temporal}
    
Sign In or Register to comment.