VCF Files

VCF is a text file format that contains information about variants found at specific positions in a reference genome. The file format consists of meta-information lines, a header line, and then data lines. Each data line contains information about a single variant.

More information is available here: www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41.
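The three-part layout described above (meta-information lines, a header line, then data lines) can be illustrated with a short Python sketch. It assumes the usual VCF conventions that meta-information lines start with `##`, the header line starts with a single `#`, and fields are tab-separated. The function name `split_vcf` and the example record are hypothetical and are not taken from any Illumina output.

```python
def split_vcf(lines):
    """Split VCF lines into meta-information lines, the header, and data lines."""
    meta, header, records = [], None, []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith("##"):
            meta.append(line)  # meta-information line
        elif line.startswith("#"):
            header = line.lstrip("#").split("\t")  # column names
        else:
            records.append(line.split("\t"))  # one variant per data line
    return meta, header, records


# Hypothetical three-line VCF used only for illustration.
example = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSample1\n",
    "chr1\t12345\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\n",
]
meta, header, records = split_vcf(example)
```

For production use, an established VCF parser is generally a safer choice than hand-written splitting; the sketch is only meant to make the file layout concrete.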

VCF File Format

The file naming convention for VCF files is as follows: SampleName_S#.vcf (where # is the sample number determined by ordering in the sample sheet).
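As a minimal sketch, the naming convention above can be checked with a regular expression. The pattern and the function name `parse_vcf_name` are assumptions made for illustration, and the sample name in the usage line is hypothetical.

```python
import re

# Assumed pattern for SampleName_S#.vcf, where # is the sample number.
VCF_NAME = re.compile(r"^(?P<sample>.+)_S(?P<number>\d+)\.vcf$")


def parse_vcf_name(filename):
    """Return (sample_name, sample_number), or None if the name does not match."""
    match = VCF_NAME.match(filename)
    if match is None:
        return None
    return match.group("sample"), int(match.group("number"))


result = parse_vcf_name("MySample_S3.vcf")  # hypothetical file name
```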

The header of the VCF file describes the tags used in the remainder of the file. A description of the tags is also provided here and on www.broadinstitute.org/gatk/guide/article?id=1268.

Setting

Description

CHROM

The chromosome of the reference genome. Chromosomes appear in the same order as the reference FASTA file (generally karyotype order).

POS

The 1-based position of this variant in the reference chromosome. The convention for *.vcf files is that, for SNPs, this base is the reference base with the variant. For insertions or deletions (indels), this base is the reference base immediately before the variant. Variants are ordered by position.

ID

The rs number for the SNP obtained from dbSNP. If there are multiple rs numbers at this location, the list is semicolon-delimited. If no dbSNP entry exists at this position, the missing value ('.') is used.

REF

The reference allele (one or more bases). For example, a deletion of a single T can be represented as reference TT and alternate T.

ALT

The alleles that differ from the reference. For example, an insertion of a single T can be represented as reference A and alternate AT.

QUAL

A Phred-scaled quality score assigned by the variant caller. Higher scores indicate higher confidence in the variant (and lower probability of errors). For a quality score of Q, the estimated probability of an error is 10^(-Q/10). For example, the set of Q30 calls has a 0.1% error rate. Many variant callers assign quality scores (based on their statistical models) which are high relative to the error rate observed in practice.

FILTER

See VCF FILTER Entries for possible entries.

FORMAT

See VCF FORMAT Entries for possible entries.

INFO

See VCF INFO Entries for possible entries.

INFO

Annotations provided by the Illumina Annotation Service (IAS) are:

CSQT – Transcript consequence as predicted by Variant Effect Predictor (www.ensembl.org/info/docs/tools/vep/index.html) version 72. Only canonical transcripts are included in the VCF file to maintain readability. The ANT file contains consequences for all affected transcripts. This binary file can be loaded into VariantStudio for viewing; see www.illumina.com/clinical/clinical_informatics/illumina-variantstudio.ilmn.

A comma-separated list for each affected gene is provided. Each entry in the list includes the HGNC gene symbol (when available), transcript ID, and functional consequences in a delimited format: HGNC|TranscriptID|Consequence. If the annotation source selected was RefSeq, then many of the TranscriptIDs begin with NM_. If the selected annotation source was Ensembl, then the TranscriptIDs begin with ENST. The consequences are indicated using valid Sequence Ontology (SO) terms (www.ensembl.org/info/genome/variation/predicted_data.html#consequences).

CSQR – Regulatory consequence as predicted by Variant Effect Predictor (www.ensembl.org/info/docs/tools/vep/index.html) version 72. A comma-separated list for each affected regulatory region (including transcription factor binding sites) is provided using the following delimited format: RegulatoryID|Consequence. The annotations provided in this field come from the Ensembl database of regulatory features even if RefSeq was selected as the annotation source. Many of the RegulatoryIDs begin with ENSR. The consequences are indicated using valid Sequence Ontology (SO) terms (www.ensembl.org/info/genome/variation/predicted_data.html#consequences) and typically are either regulatory_region_variant or TF_binding_site_variant.
AF – The allele frequency from all populations of 1000 genomes data
AA – The inferred allele ancestral to the chimpanzee/human lineage
GMAF – Global minor allele frequency (GMAF); technically, the frequency of the second most frequent allele. Format: GlobalMinorAllele|AlleleFreqGlobalMinor
EVS – Allele frequency, coverage, and sample count taken from the Exome Variant Server (EVS). Format: AlleleFreqEVS|EVSCoverage|EVSSamples
cosmic – The numeric identifier for the variant in the Catalogue of Somatic Mutations in Cancer (COSMIC) database (cancer.sanger.ac.uk/cancergenome/projects/cosmic/).
clinvar – Clinical significance from the ClinVar database (www.ncbi.nlm.nih.gov/clinvar/).
phastCons – Indicates whether the variant lies in a conserved sequence, that is, an identical or similar sequence that occurs across species and has been maintained throughout evolution

SAMPLE

The sample column gives the values specified in the FORMAT column. One MAXGT sample column is provided for the normal genotyping (assuming the reference). For reference, a second column is provided for genotyping assuming the site is polymorphic.
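Several of the column conventions in the table above lend themselves to a short worked example. The Python sketch below converts a QUAL score to an estimated error probability using 10^(-Q/10), splits a semicolon-delimited ID field, and classifies a single REF/ALT pair by comparing allele lengths. All function names are hypothetical. The classification is a simplification that assumes one alternate allele per call; it does not handle comma-separated ALT lists or symbolic alleles.

```python
def phred_to_error_probability(qual):
    """Estimated error probability for a Phred-scaled score Q: 10^(-Q/10)."""
    return 10 ** (-qual / 10)


def split_ids(id_field):
    """Return the rs numbers in the ID column; '.' means no dbSNP entry."""
    return [] if id_field == "." else id_field.split(";")


def classify_allele(ref, alt):
    """Classify one REF/ALT pair by allele length (simplified assumption)."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "other"


q30_error = phred_to_error_probability(30)   # about 0.001, that is, 0.1%
deletion = classify_allele("TT", "T")        # REF TT, ALT T
insertion = classify_allele("A", "AT")       # REF A, ALT AT
```

The two REF/ALT pairs in the usage lines are the deletion and insertion examples given in the REF and ALT descriptions.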

Note

The full set of provided annotations from the Illumina Annotation Service (IAS) can be accessed through a binary annotation (ANT) file that accompanies the VCF file.
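The delimited annotation formats described for the INFO column can be unpacked with a few lines of Python. The sketch below assumes the standard VCF layout in which INFO is a semicolon-separated list of key=value pairs, and it reads CSQT as a comma-separated list of HGNC|TranscriptID|Consequence entries, as described above. The function names, gene symbols, and transcript IDs are hypothetical values chosen for illustration.

```python
def parse_info(info_field):
    """Turn a semicolon-separated INFO string into a dictionary."""
    info = {}
    if info_field == ".":
        return info  # missing value
    for item in info_field.split(";"):
        key, sep, value = item.partition("=")
        info[key] = value if sep else True  # keys without '=' are flags
    return info


def parse_csqt(value):
    """Split a CSQT value into one dictionary per listed transcript."""
    entries = []
    for entry in value.split(","):
        parts = entry.split("|")
        if len(parts) != 3:
            raise ValueError("unexpected CSQT entry: " + entry)
        gene, transcript, consequence = parts
        entries.append(
            {"gene": gene, "transcript": transcript, "consequence": consequence}
        )
    return entries


# Hypothetical INFO string; gene symbols and transcript IDs are made up.
info = parse_info(
    "AF=0.05;CSQT=GENE1|NM_000001.1|missense_variant,"
    "GENE2|NM_000002.1|synonymous_variant"
)
transcripts = parse_csqt(info["CSQT"])
```

The same split-on-pipe approach would apply to the GMAF and EVS fields, with field names taken from the formats listed above.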


© 2014 Illumina, Inc. All rights reserved.

15050953 Rev. B