Attention:
The frontline support team will be unavailable to answer questions on April 15th and 17th 2019. We will be back soon after. Thank you for your patience and we apologize for any inconvenience!

Ref: De novo assembled transcriptome; Growth conditions ; replicates;

npd75npd75 AustraliaMember

Hi,

I am working on a project which requires me to identify Variants in a marine species using RNAseq data.

I have assembled the transcriptome (using all available RNASeq data sets for the experiment) for my organism (no Reference is available).

I then referred to the link 'http://gatkforums.broadinstitute.org/gatk/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference' and created the '.dict' file and indexed the reference.

java -Xmx3g -jar /share/apps/picard/1.115/CreateSequenceDictionary.jar R=seaurchin_topHit_Ids.fasta O=seaurchin_topHit_Ids.dict
samtools faidx seaurchin_topHit_Ids.fasta

I have the following RNASeq data-sets available:
Two growth Conditions : Control (C) vs Bubble (B)
At two Growth Times: 8h and 48h

RNASeq data is generated in replicates (R1 and R2)

So in all I have the following read-sets
C1-8h , C2-8h
C1-48h, C2-48h

B1-8h, B2-8h
B1-48h, B2-48h

The questions we might want to ask are :
Identify variants :
Q1)Across conditions : Condition C (Control) vs Condition B
Q2)Across times : Growth times 8h vs 48h

I am following the best practice guidelines for GATK for RNASeq data..

1) Mapping individual data-sets using bwa (I will repeat the pipeline using STAR later, if I can get through all steps)..I have been using bwa for a while, so felt more comfortable using the tool.

2) sort the individual 'sam files' and convert into 'bam'

3) Mark Duplicates for individual bams to get the '.dedup.bam files'

4) Add Read groups
I am keeping the RGSM tag common across datasets belonging to the same sample and separating the replicates by the RGID tag..
e.g.
java -Xmx3g -jar /share/apps/picard/1.115/AddOrReplaceReadGroups.jar \
I= B1-BB-8h.dedup.bam \
O= B1-BB-8h.dedup_RG.bam \
SORT_ORDER=coordinate \
RGID=B1-BB-8h.dedup \
RGLB=BB8h \
RGPL=illumina \
RGSM=BB8h \
CREATE_INDEX=True \
RGPU=run_barcode

java -Xmx3g -jar /share/apps/picard/1.115/AddOrReplaceReadGroups.jar \
I= B2-BB-8h.dedup.bam \
O= B2-BB-8h.dedup_RG.bam \
SORT_ORDER=coordinate \
RGID=B2-BB-8h.dedup \
RGLB=BB8h \
RGPL=illumina \
RGSM=BB8h \
CREATE_INDEX=True \
RGPU=run_barcode

java -Xmx3g -jar /share/apps/picard/1.115/AddOrReplaceReadGroups.jar \
I= B1-BC-8h.dedup.bam \
O= B1-BC-8h.dedup_RG.bam \
SORT_ORDER=coordinate \
RGID=B1-BC-8h.dedup \
RGLB=BB8h \
RGPL=illumina \
RGSM=BC8h \
CREATE_INDEX=True \
RGPU=run_barcode

Can someone confirm if the above steps look alright?

I am a bit unsure as to what I do further!!

Should I merge all dedup.bam files and do the downstream processing ? Since I have the read groups marked (associating samples but separating the conditions and growth times), can I expect to identify SNPs/indels etc for Q1 and Q2 ?

I will appreciate if someone can guide me through this,

regards,

Nandan

Best Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MA admin
    Accepted Answer

    Hi Nandan,

    If you use the same SM for all your sequence sets, the variant caller and downstream tools will not be able to distinguish them and will treat them all as one sample. I recommend you give different SMs to anything you want to contrast at the variant level.

    By the way, we are coming to Australia to give a couple of workshops (in Sydney and in Melbourne) in a few weeks (feb 1-4).if you haven't signed up yet you should definitely consider coming. Let me know if you need the registration webpage address.

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    Accepted Answer

    Hi Nandan,

    If you use the same SM for all your sequence sets, the variant caller and downstream tools will not be able to distinguish them and will treat them all as one sample. I recommend you give different SMs to anything you want to contrast at the variant level.

    By the way, we are coming to Australia to give a couple of workshops (in Sydney and in Melbourne) in a few weeks (feb 1-4).if you haven't signed up yet you should definitely consider coming. Let me know if you need the registration webpage address.

  • npd75npd75 AustraliaMember

    Hi Geraldine,

    Thanks for the comments..

    Actually I am the same Nandan coordinating with you and Katherine Champ (Bioplatforms, Australia) with reference to setting up the GATK workshop at UNSW Sydney (Feb 1-2) :smile: ..

    So I will be attendi the course. I have been working on this project and thought it was as good idea to try and get started with GATK before the course .

    regards,

    Nandan

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Oh hah, I thought the name was familiar... and uncommon enough in my environment that I should have realized you are one and the same :)

    Looking fwd to meeting you in person!

  • npd75npd75 AustraliaMember

    Hi Geraldine,

    A quick question..

    I have now obtained a .vcf file which includes replicate datasets from two samples..

    Sample 1:

    BB8h
    The two Replicates are marked by common sample tag and separate library tags

    RGSM=BB8h RGLB=B1-BB-8h
    RGSM=BB8h RGLB=B2-BB-8h

    And sample 2:

    BC8h

    The two Replicates marked by common sample tag and separate library tags

    RGSM=BC8h RGLB=B1-BC-8h
    RGSM=BC8h RGLB=B2-BC-8h

    As far as I understand, the raw .vcf file contains Variants identified by comparing with the "Reference.fasta file" and I now need to further filter variants of my specific interest using the "-T SelectVariants" option..

    My question is :
    If I want to separate out the variants which show DIFFERENCE in alleles between my above samples; I was thinking of somehow using the "PL" TAG to differentiate ?
    Just wanted to check with you if I am thinking in the right direction or I am completely off-track :-(

    regards,

    Nandan

  • npd75npd75 AustraliaMember
    edited January 2016

    Thanks Sheila.

    As you suggested, I separated my .vcf into two separate .vcfs and then proceeded with the GenotypeConcordance step.
    I am pasting the ".details_metrics" file from the Picard-GenotypeConcordance.

    htsjdk.samtools.metrics.StringHeader
    picard.vcf.GenotypeConcordance TRUTH_VCF=raw_snps_indels_Q20_B_UG_part_BC8h.vcf CALL_VCF=raw_snps_indels_Q20_B_UG_part_BB8h.vcf OUTPUT=BC_vs_BB TRUTH_SAMPLE=BC8h CALL_SAMPLE=BB8h OUTPUT_ALL_ROWS=false INTERSECT_INTERVALS=true MIN_GQ=0 MIN_DP=0 USE_VCF_INDEX=false MISSING_SITES_HOM_REF=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
    htsjdk.samtools.metrics.StringHeader
    Started on: Thu Jan 14 12:07:45 EST 2016

    METRICS CLASS picard.vcf.GenotypeConcordanceDetailMetrics
    VARIANT_TYPE TRUTH_SAMPLE CALL_SAMPLE TRUTH_STATE CALL_STATE COUNT CONTINGENCY_VALUES
    SNP BC8h BB8h HOM_REF HET_REF_VAR1 46258 FP,TN
    SNP BC8h BB8h HOM_REF HOM_VAR1 1679 FP
    SNP BC8h BB8h HET_REF_VAR1 HOM_REF 43309 TN,FN
    SNP BC8h BB8h HET_REF_VAR1 HET_REF_VAR1 333892 TP,TN
    SNP BC8h BB8h HET_REF_VAR1 HET_VAR1_VAR2 264 TP,FP
    SNP BC8h BB8h HET_REF_VAR1 HOM_VAR1 10698 TP,FP
    SNP BC8h BB8h HET_REF_VAR1 HET_REF_VAR2 664 FP,TN,FN
    SNP BC8h BB8h HET_REF_VAR1 NO_CALL 1552 EMPTY
    SNP BC8h BB8h HET_VAR1_VAR2 HOM_REF 1 FN
    SNP BC8h BB8h HET_VAR1_VAR2 HET_REF_VAR1 259 TP,FN
    SNP BC8h BB8h HET_VAR1_VAR2 HET_VAR1_VAR2 2161 TP
    SNP BC8h BB8h HET_VAR1_VAR2 HOM_VAR1 104 TP,FN
    SNP BC8h BB8h HET_VAR1_VAR2 HET_VAR1_VAR3 2 TP,FP,FN
    SNP BC8h BB8h HET_VAR1_VAR2 NO_CALL 3 EMPTY
    SNP BC8h BB8h HOM_VAR1 HOM_REF 1692 FN
    SNP BC8h BB8h HOM_VAR1 HET_REF_VAR1 10841 TP,FN
    SNP BC8h BB8h HOM_VAR1 HET_VAR1_VAR2 97 TP,FP,FN
    SNP BC8h BB8h HOM_VAR1 HOM_VAR1 35457 TP
    SNP BC8h BB8h HOM_VAR1 HET_REF_VAR2 4 FP,FN
    SNP BC8h BB8h HOM_VAR1 HOM_VAR2 1 FP,FN
    SNP BC8h BB8h HOM_VAR1 NO_CALL 3462 EMPTY
    SNP BC8h BB8h NO_CALL HET_REF_VAR1 1743 EMPTY
    SNP BC8h BB8h NO_CALL HET_VAR1_VAR2 1 EMPTY
    SNP BC8h BB8h NO_CALL HOM_VAR1 3655 EMPTY

    Looking at the above file, I assume that the tool summarizes (and counts) the types of concordances across the two sample .vcfs. But I could be wrong.

    I was actually planning to load my "bam" files for my two samples in IGV (across the reference) and try then to visualize those specific SNPs which display differences in their allele content .
    This was more for my and my collaborators confirmation that the SNPs we identified do display a different allele across the samples; before we do any kind of downstream analysis..

    So my question is : Based on the summary we obtained from above with reference to concordance across the samples, is there any way to actually filter SPECIFIC SNPs (with their IDs and positions in a separate .vcf) which SHOW allele differences?

    regards,

    Nandan

  • npd75npd75 AustraliaMember

    Hi Sheila,

    Actually I think I was a bit too hasty in putting forth the question..:-)

    I have now used -T VariantsToTable option to get all required fields and I should be able to compare/contrast across samples..

    java -Xmx2g -jar /projects/u71/nandan/tools_db/tools/GATK_3.5/GenomeAnalysisTK.jar \
    -R /projects/u71/nandan/Sven_Uthicke_populationGenomics/analysis/SNP_SFG/all_bams/seaurchin_topHit_Ids.fasta \
    -T VariantsToTable \
    -V raw_snps_indels_Q20_B_UG.vcf \
    -F CHROM -F POS -F ID -F REF -F ALT -F QUAL -F MULTI-ALLELIC -GF GT -GF AD -GF DP -GF GQ -GF PL \
    -o raw_snps_indels_Q20_B_UG1.table

    regards,

    Nandan

Sign In or Register to comment.