We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

How to build population frequency file based on GRCh38 for ContEst

shenglaishenglai ChicagoMember

Hi all,

This is Shenglai from the Univ of Chicago. Recently I have a question about how to build a population frequency file for ContEst. I visited CGA website, but there's limited information about how to build one on different ref. I have several following questions:
1. Can I use picard liftovervcf to upgrade provided hg19.hapmap3. input file to hg38? Although I tried and failed in this way, I was wondering if I did something wrong or it's not possible.
2. If not, Should I build the file on hapmap 2010-08_phaseII+III? There are allele and genotype frequencies, which one should I use? I noticed that in hapmap 2010-08, there are 11 population groups, but in ContEst provided hg19.hapmap3 file there are 17 population groups. Where can I find other groups' information or do I choose the wrong hapmap?
Any piece of advice would be helpful. Thank you.

Best regards
Shenglai

Tagged:

Best Answers

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Shenglai,

    We are discussing this internally and will try to get you a complete answer soon. In the meantime, can you please describe what you mean by "tried and failed" to liftover the hg19 hapmap file to hg38?

  • shenglaishenglai ChicagoMember

    Hi Geraldine

    I used following command for picard liftovervcf function:

    java -Xmx4g -jar picard.jar LiftoverVcf \
    I=hg19_population_stratified_af_hapmap_3.3.vcf \ ### Got from here : http://www.broadinstitute.org/cancer/cga/contest_download
    O=hg38_population_stratified_af_hapmap_3.3.vcf \
    C=hg19ToHg38.over.chain \ ### Got from UCSC golden path : ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/liftOver/
    REJECT=reject.txt \
    R=GRCh38.d1.vd1.fa ### Which has a .dict and .fai file

    The error message I got is :
    [Tue Jun 02 15:47:30 UTC 2015] picard.vcf.LiftoverVcf done. Elapsed time: 0.15 minutes.
    Runtime.totalMemory()=4151836672
    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
    Exception in thread "main" java.lang.IllegalArgumentException: Must specify file or stream output type.
    at htsjdk.variant.variantcontext.writer.VariantContextWriterBuilder.build(VariantContextWriterBuilder.java:370)
    at picard.vcf.LiftoverVcf.doWork(LiftoverVcf.java:124)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:206)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

    Since it is a "IllegalArgumentException", I believe that one of the arguments, most likely hg19_population_stratified_af_hapmap_3.3.vcf, due to its "INFO" column, is in an improper format.

    I haven't tried to liftover hapmap files yet, because there are only 11 genotypes other than 17 in hg19_population_stratified_af_hapmap_3.3.vcf.

    Hope these information would do help. Please let me know if I should do some other try out.

    Best
    SL

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    That command looks fine. The error message suggests that the program is not getting all the arguments, as if it stopped parsing after the input argument. I know it sounds silly, but try typing the command from scratch on a single line and see if that changes anything.

  • shenglaishenglai ChicagoMember

    Just tried 2 different ways, and got the same error message.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Can you post the ways you tried please?

  • shenglaishenglai ChicagoMember

    Sorry, my bad.

    I used :
    java -Xmx4g -jar picard.jar LiftoverVcf I=hg19_population_stratified_af_hapmap_3.3.vcf O=./hg38_population_stratified_af_hapmap_3.3.vcf C=hg19ToHg38.over.chain REJECT=./reject.txt R=GRCh38.d1.vd1.fa
    in one single command line.
    I got :
    [Wed Jun 03 15:13:06 UTC 2015] picard.vcf.LiftoverVcf done. Elapsed time: 0.18 minutes.
    Runtime.totalMemory()=4151836672
    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
    Exception in thread "main" java.lang.IllegalArgumentException: Must specify file or stream output type.
    at htsjdk.variant.variantcontext.writer.VariantContextWriterBuilder.build(VariantContextWriterBuilder.java:370)
    at picard.vcf.LiftoverVcf.doWork(LiftoverVcf.java:124)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:206)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

    and also used:
    java -Xmx4g -jar picard.jar LiftoverVcf \
    I=hg19_population_stratified_af_hapmap_3.3.vcf \
    O=hg38_population_stratified_af_hapmap_3.3.vcf \
    C=hg19ToHg38.over.chain \
    REJECT=reject.txt \
    R=GRCh38.d1.vd1.fa
    and I got :
    [Wed Jun 03 15:20:46 UTC 2015] picard.vcf.LiftoverVcf done. Elapsed time: 0.32 minutes.
    Runtime.totalMemory()=4151836672
    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
    Exception in thread "main" java.lang.IllegalArgumentException: Must specify file or stream output type.
    at htsjdk.variant.variantcontext.writer.VariantContextWriterBuilder.build(VariantContextWriterBuilder.java:370)
    at picard.vcf.LiftoverVcf.doWork(LiftoverVcf.java:124)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:206)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

  • shenglaishenglai ChicagoMember

    To run contest, I have questions on two required arguments, population frequency file and genotype vcf from array data.
    I also want to ask if we do not have array data, is it still possible for us to run the contest?
    For example, I want to do some research on TCGA data set, and want to check the cross-individual contamination level, but not all TCGA data have the array data where we can get the genotype information. Is it still possible to run the contest?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    That is odd. Out of curiosity, can you substitute a different hg19 VCF file into the liftover command, and see if it works? Maybe it is the hapmap file that's got something wrong in it.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    We have a version that can run ContEst without array data internally but it seems like the public version does not support this usage. I'll have to check with the cancer group if we can distribute the array-free version.

  • shenglaishenglai ChicagoMember

    I tried several times about using picard to liftover different vcf files.
    FYI, following results are based on picard 1.129 which I built from github clone, and under java 7 environment.
    java -jar picard-tools/picard.jar LiftoverVcf \

    I= common_all_20150416.vcf \
    O= ~/test_liftovervcf_1.vcf \
    C= hg38ToHg19.over.chain \
    REJECT=~/reject.txt \
    R= hg19.fa

    Tried to liftover large size NCBI SNP vcf file from hg38 to hg19

    java -jar picard-tools/picard.jar LiftoverVcf \

    I= clinvar_20150504.vcf\
    O= ~/test_liftovervcf_2.vcf \
    C= hg38ToHg19.over.chain \
    REJECT=~/reject.txt \
    R= hg19.fa

    Tried to liftover small size NCBI clinvar vcf file from hg38 to hg19

    However, I still got the same error message.
    Exception in thread "main" java.lang.IllegalArgumentException: Must specify file or stream output type.
    at htsjdk.variant.variantcontext.writer.VariantContextWriterBuilder.build(VariantContextWriterBuilder.java:370)
    at picard.vcf.LiftoverVcf.doWork(LiftoverVcf.java:124)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:206)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

    Then I got the latest released picard from github which is 1.133 version, and also installed java 6, but I got the same error message using the same commands.

  • shenglaishenglai ChicagoMember

    And also, would you be able to explain how contest gets its result without taking array data from the sample?

  • shenglaishenglai ChicagoMember

    I liftedover hg19_population_stratified_af_hapmap_3.3.vcf to hg38 today by GATK liftovervariants.
    If I use hg19_population_stratified_af_hapmap_3.3.vcf downloaded from CGA website, I will have following error message:

    ERROR MESSAGE: The provided VCF file is malformed at approximately line number 4: The VCF specification does not allow for whitespace in the INFO field. Offending field value was "AC=66;AF=0.02369;ALL={C=0.97629, T=0.02371};AN=2786;ASW={C=1.00000, T=0.00000};CEU={C=1.00000, T=0.00000};CHB={C=1.00000, T=0.00000};CHD={C=1.00000, T=0.00000};CHS={C=0.00000, T=0.00000};CLM={C=0.00000, T=0.00000};FIN={C=0.00000, T=0.00000};GBR={C=0.00000, T=0.00000};GIH={C=1.00000, T=0.00000};IBS={C=0.00000, T=0.00000};JPT={C=1.00000, T=0.00000};LWK={C=1.00000, T=0.00000};MKK={C=0.82044, T=0.17956};MXL={C=1.00000, T=0.00000};PUR={C=0.00000, T=0.00000};TSI={C=1.00000, T=0.00000};YRI={C=0.99752, T=0.00248};set=MKK-YRI GT"

    When I removed all unwanted whitespace in the info column, I got the following error message:

    ERROR MESSAGE: The provided VCF file is malformed at approximately line number 4: The VCF specification does not allow for whitespace in the INFO field. Offending field value was "AC=66;AF=0.02369;ALL={C=0.97629,T=0.02371};AN=2786;ASW={C=1.00000,T=0.00000};CEU={C=1.00000,T=0.00000};CHB={C=1.00000,T=0.00000};CHD={C=1.00000,T=0.00000};CHS={C=0.00000,T=0.00000};CLM={C=0.00000,T=0.00000};FIN={C=0.00000,T=0.00000};GBR={C=0.00000,T=0.00000};GIH={C=1.00000,T=0.00000};IBS={C=0.00000,T=0.00000};JPT={C=1.00000,T=0.00000};LWK={C=1.00000,T=0.00000};MKK={C=0.82044,T=0.17956};MXL={C=1.00000,T=0.00000};PUR={C=0.00000,T=0.00000};TSI={C=1.00000,T=0.00000};YRI={C=0.99752,T=0.00248};set=MKK-YRI GT"

    So I guess it counts the tab before "GT" as whitespaces, although it is the 9th column.
    When I put the vcf file without 9th "GT" column, I got the following error message:

    ERROR MESSAGE: Input files /Users/Li/1_working_directory/2_population_vcf/ref/hg19_v3.vcf and reference have incompatible contigs: No overlapping contigs found.
    ERROR /Users/Li/1_working_directory/2_population_vcf/ref/hg19_v3.vcf contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X]
    ERROR reference contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY]

    When I put new vcf file with "chr", finally it works:
    INFO 13:53:41,091 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
    INFO 13:53:41,092 ProgressMeter - | processed | time | per 1M | | total | remaining
    INFO 13:53:41,093 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
    INFO 13:54:11,119 ProgressMeter - chr4:115449290 391010.0 30.0 s 76.0 s 26.0% 115.0 s 85.0 s
    INFO 13:54:41,130 ProgressMeter - chr10:10997313 836444.0 60.0 s 71.0 s 54.6% 109.0 s 49.0 s
    INFO 13:55:11,141 ProgressMeter - chr18:38945735 1294848.0 90.0 s 69.0 s 84.6% 106.0 s 16.0 s
    Converted 1455695 records; failed to convert 594 records.
    INFO 13:55:23,034 ProgressMeter - done 1456289.0 101.0 s 70.0 s 98.1% 102.0 s 1.0 s
    INFO 13:55:23,034 ProgressMeter - Total runtime 101.94 secs, 1.70 min, 0.03 hours
    INFO 13:55:24,820 GATKRunReport - Uploaded run statistics report to AWS S3

    However, it failed to convert 594 records, I would like to know how I can look those rejected calls? Do we have a REJECT parameter in GATK liftovervariants? I tried using -REJECT, but I got

    ERROR MESSAGE: Argument with name 'REJECT' isn't defined.
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Ack, just ran one more test to double-check something -- and it worked: just change the extension of the rejection file to .vcf and it works just fine.

  • shenglaishenglai ChicagoMember

    Besides the population frequency data, ContEst also requires a array based genotype data. In the CGA website, it is said we can convert birdseed data (output from CEL data) to vcf, and use the vcf as the input.
    The birdseed data I got from TCGA data portal looks like :
    Hybridization REF TRIBE_p_TCGAaffx_B1_2_GBM_Nsp_GenomeWideSNP_6_D08_155834 TRIBE_p_TCGAaffx_B1_2_GBM_Nsp_GenomeWideSNP_6_D08_155834
    Composite Element REF Call Confidence
    SNP_A-2131660 2 0.0059
    SNP_A-1967418 2 0.0072
    SNP_A-1969580 2 0.0104
    SNP_A-4263484 1 0.0123
    SNP_A-1978185 0 0.0018
    SNP_A-4264431 1 0.0065
    SNP_A-1980898 2 0.0041

    When I ran BirdseedToVCF.py,

    python BirdseedToVCF.py \
    --birdseed birdseed.calls.txt \ ###downloaded from TCGA, lvl2 data. Changed header in "probeset_id \t sample name". (the default header is Hybridization REF \t sample name \t sample name)
    --output_vcf GBM-Native-02-0047-Normal.vcf \
    --snp_annotation_file GenomeWideSNP_6.na30.annot.hg19.csv.pickle ###downloaded from CGA
    --array_sample TRIBE_p_TCGAaffx_B1_2_GBM_Nsp_GenomeWideSNP_6_D08_155834 \
    --vcf_sample GBM-Native-02-0047-Normal \
    --fasta /SCRATCH/ref/HG19_Broad_variant.fa

    I got the following error message:

    [Status] loading the FASTA file...
    [Status] loading the annotation file...
    [Status] loading from pickled annotation file...
    Traceback (most recent call last):
    File "BirdseedToVCF.py", line 271, in
    raise NameError("The genotypes (birdseed file) seems to have an unexpected header; we expect probeset_id followed by tab seperated sample names")
    NameError: The genotypes (birdseed file) seems to have an unexpected header; we expect probeset_id followed by tab seperated sample names

    Line 271 in python script is "if len(header) < 2 or header[0] != "probeset_id":".
    So I changed the header to "probeset_id TRIBE_p_TCGAaffx_B1_2_GBM_Nsp_GenomeWideSNP_6_D08_155834" and it worked perfectly.

    To run ContEst, this vcf file also needs to be sorted.

    I think so far I am able to use ContEst. Thank you so much for the help.

    Would you please answer my very first question, why we have 17 genotypes instead of 11 in our population allele frequency file, whenever you get answer?

    Regards
    Shenglai

  • shenglaishenglai ChicagoMember

    Actually, I should not ask why. I should ask how I can find those additional population allele frequency information. The reason I asked is that I am willing to build a more detailed population allele frequency file. In phase 3 I guess, there are 26 population groups, but NCBI only provide CAF information instead of telling us which group it belongs.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @shenglai, sorry for the delayed response. I'm not sure I'm the right person to answer your question -- that is out of scope of what we provide, as it's usually done as a downstream analysis. If you haven't yet found another source to get that answered I can try asking the dev team for advice, just let me know if you still need help with this.

  • shenglaishenglai ChicagoMember

    Hi @Geraldine_VdAuwera

    I am wondering if I should follow the steps that I found on 1k genome project web page: http://www.1000genomes.org/faq/how-can-i-get-allele-frequency-my-variant , to get the population allele frequency information.
    It would be super nice if you could ask the dev team whether I am in the right way. Thank you so much!

  • SheilaSheila Broad InstituteMember, Broadie admin

    @shenglai
    Hi,

    Yes, the link you have provided is appropriate to use.

    -Sheila

Sign In or Register to comment.