File Format

The file naming convention for VCF files is as follows: SampleName_S#.vcf (where # is the sample number determined by ordering in the sample sheet).
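The naming convention above can be checked programmatically. The following is a minimal sketch, assuming the sample number is written as plain decimal digits; the function name and the sample name NA12878 are hypothetical and used only for illustration.

```python
import re

# Pattern for the documented convention SampleName_S#.vcf.
# Assumption: the sample number (#) is one or more decimal digits.
VCF_NAME = re.compile(r"^(?P<sample>.+)_S(?P<number>\d+)\.vcf$")


def parse_vcf_filename(filename):
    """Return (sample_name, sample_number), or None if the name does not match."""
    match = VCF_NAME.match(filename)
    if match is None:
        return None
    return match.group("sample"), int(match.group("number"))


# Hypothetical file name, used only to exercise the function.
print(parse_vcf_filename("NA12878_S1.vcf"))
```

Names that do not follow the convention return None instead of raising an error, so a caller can skip unrelated files in the same folder.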

The header of the VCF file describes the tags used in the remainder of the file and ends with the column header line:

##fileformat=VCFv4.1
##fileDate=20120317
##source=SequenceAnalysisReport.vshost.exe
##reference=
##phasing=none
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=TI,Number=.,Type=String,Description="Transcript ID">
##INFO=<ID=GI,Number=.,Type=String,Description="Gene ID">
##INFO=<ID=CD,Number=0,Type=Flag,Description="Coding Region">
##FILTER=<ID=q20,Description="Quality below 20">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE

The following is a sample data line from a VCF file. The data used to populate each column is described under VCF File Setting Descriptions.

chr22 16285888 rs76548004 T C 17 d15;q20 DP=11;TI=NM_001136213;GI=POTEH;CD GT:GQ 1/0:17
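To illustrate how the sample line maps onto the column header, here is a minimal sketch that splits one data line into named fields. The function name is hypothetical. VCF files are tab-delimited, but the line is shown here with spaces, so the sketch splits on any whitespace; that is an assumption that holds only when no field contains a space.

```python
# The nine fixed columns from the column header line; any further
# columns are treated as per-sample columns.
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]


def parse_record(line):
    """Split one VCF data line into a dict keyed by column name."""
    # Assumption: fields contain no embedded whitespace, so splitting on
    # any whitespace works for both tab- and space-separated text.
    fields = line.split()
    if len(fields) < len(FIXED_COLUMNS) + 1:
        raise ValueError("expected at least 10 columns, got %d" % len(fields))
    record = dict(zip(FIXED_COLUMNS, fields))
    record["SAMPLES"] = fields[len(FIXED_COLUMNS):]
    return record


line = "chr22 16285888 rs76548004 T C 17 d15;q20 DP=11;TI=NM_001136213;GI=POTEH;CD GT:GQ 1/0:17"
record = parse_record(line)
print(record["CHROM"], record["POS"], record["FILTER"], record["SAMPLES"])
```

All values are left as strings; converting POS or QUAL to numbers is a separate step for the caller.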

VCF File Setting Descriptions

Setting

Description

ALT

The alleles that differ from the reference. For example, an insertion of a single T could show reference A and alternate AT.

CHROM

The chromosome of the reference genome. Chromosomes appear in the same order as in the reference FASTA file (generally karyotype order).

FILTER

If all filters are passed, 'PASS' is written. The possible filters are as follows:

q20 – The variant score is less than 20. (Configurable using the VariantFilterQualityCutoff setting in the config file.)
r8 – For an indel, the number of repeats in the reference (of a 1- or 2-base repeat) is greater than 8. (Configurable using the IndelRepeatFilterCutoff setting in the config file.)
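As a minimal sketch of how the FILTER column might be read, the following helper returns the filters a variant failed. The function name is hypothetical. Semicolon separation of multiple filters follows the sample line shown earlier.

```python
def failed_filters(filter_field):
    """Return the list of filter IDs a variant failed (empty if it passed)."""
    if filter_field == "PASS":
        return []
    # Multiple failed filters are separated by semicolons, e.g. "d15;q20".
    return filter_field.split(";")


print(failed_filters("PASS"))
print(failed_filters("d15;q20"))
print("q20" in failed_filters("d15;q20"))
```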

FORMAT

The FORMAT column lists fields (separated by colons), for example, "GT:GQ". The list of fields provided depends on the variant caller used. The available fields are as follows:

AD – Entry of the form X,Y where X is the number of reference calls, Y the number of alternate calls

GQ – Genotype quality

GT – Genotype. 0 corresponds to the reference base, 1 corresponds to the first entry in the ALT column, 2 corresponds to the second entry in the ALT column, etc. The '/' indicates that there is no phasing information.

NL – Noise level; an estimate of base calling noise at this position

SB – Strand bias at this position. Larger negative values indicate less bias; values near zero indicate more strand bias.

VF – Variant frequency. The percentage of reads supporting the alternate allele.
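The FORMAT keys line up, position by position, with the colon-separated values in the sample column. The sketch below pairs them and translates GT indices into alleles; the function names are hypothetical, and it assumes an unphased genotype written with '/', as described above.

```python
def decode_sample(format_field, sample_field):
    """Pair each FORMAT key with the value at the same position in the sample column."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    return dict(zip(keys, values))


def genotype_alleles(gt, ref, alt_field):
    """Translate GT indices into allele strings (0 = REF, 1 = first ALT, ...)."""
    # Multiple ALT alleles are comma-separated in VCF.
    alleles = [ref] + alt_field.split(",")
    # '/' separates unphased alleles; a '.' index means the call is missing.
    return [None if index == "." else alleles[int(index)] for index in gt.split("/")]


sample = decode_sample("GT:GQ", "1/0:17")
print(sample)
print(genotype_alleles(sample["GT"], "T", "C"))
```

For the sample line shown earlier (REF T, ALT C, GT 1/0), the genotype decodes to one C allele and one T allele.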

ID

The rs number for the SNP obtained from dbSNP. If there are multiple rs numbers at this location, the list is semicolon-delimited. If no dbSNP entry exists at this position, the missing value ('.') is used.

INFO

The possible entries in the INFO column:

AD – Entry of the form X,Y where X is the number of reference calls, Y the number of alternate calls.
CD – A flag indicating that the SNP occurs within the coding region of at least one RefGene entry
DP – The depth (number of base calls aligned to this position)
GI – A comma-separated list of gene IDs read from RefGene
NL – Noise level; an estimate of base calling noise at this position.
TI – A comma-separated list of transcript IDs read from RefGene
SB – Strand bias at this position.
VF – Variant frequency. The percentage of reads supporting the alternate allele.
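The INFO column mixes key=value entries with bare flags such as CD. A minimal sketch of reading it into a dictionary follows; the function name is hypothetical, and values are kept as strings so that comma-separated lists (GI, TI) can be split by the caller.

```python
def parse_info(info_field):
    """Parse a semicolon-separated INFO string into a dict.

    key=value entries map to their string value; bare flags map to True.
    """
    info = {}
    if info_field == ".":  # missing value: no INFO entries
        return info
    for entry in info_field.split(";"):
        key, separator, value = entry.partition("=")
        info[key] = value if separator else True
    return info


info = parse_info("DP=11;TI=NM_001136213;GI=POTEH;CD")
print(int(info["DP"]))        # depth as an integer
print(info["TI"].split(","))  # transcript IDs as a list
print(info.get("CD", False))  # coding-region flag
```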

POS

The 1-based position of this variant in the reference chromosome. The convention for VCF files is that, for SNPs, this base is the reference base with the variant. For insertions or deletions, this base is the reference base immediately before the variant. Variants are in order of position.

QUAL

A Phred-scaled quality score assigned by the variant caller. Higher scores indicate higher confidence in the variant (and lower probability of errors). For a quality score of Q, the estimated probability of an error is 10^(-Q/10). For example, the set of Q30 calls has a 0.1% error rate. Many variant callers assign quality scores (based on their statistical models) that are high relative to the error rate observed in practice.
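The Phred relationship can be worked through numerically. The sketch below converts between a quality score Q and the estimated error probability 10^(-Q/10); the function names are hypothetical.

```python
import math


def phred_to_error_probability(q):
    """Estimated probability of error for Phred-scaled quality q."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred-scaled quality for error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


for q in (17, 20, 30):
    print(f"Q{q}: estimated error probability {phred_to_error_probability(q):.4f}")
```

A QUAL of 20 corresponds to an estimated error probability of 1%, which is the default cutoff for the q20 filter.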

REF

The reference allele. For example, a deletion of a single T can read reference TT and alternate T.
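The REF and ALT examples in this table (A to AT for an insertion, TT to T for a deletion) suggest a simple length comparison for telling variant types apart. The following is a rough sketch for a single ALT allele, not the variant caller's own logic; the function name is hypothetical.

```python
def classify_variant(ref, alt):
    """Classify a single REF/ALT pair by comparing allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "other"  # same length, more than one base


print(classify_variant("T", "C"))   # sample line in this document
print(classify_variant("A", "AT"))  # insertion example from the ALT row
print(classify_variant("TT", "T"))  # deletion example from the REF row
```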

SAMPLE

The sample column gives the values specified in the FORMAT column. One MAXGT sample column is provided for the normal genotyping (assuming the reference). For reference, a second column is provided for genotyping assuming the site is polymorphic. See the Starling documentation for more details.

Note

Variant files for Isaac also contain off-target variant calls, with filter.