File Format
The file naming convention for VCF files is as follows: SampleName_S#.vcf (where # is the sample number determined by ordering in the sample sheet).
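As a minimal sketch of the naming convention above, the following Python snippet splits a file name into its sample name and sample number. The helper name `parse_vcf_filename` and the example file name `NA12878_S1.vcf` are hypothetical, chosen only for illustration.

```python
import re

# Pattern for the convention described above: SampleName_S#.vcf
# The greedy ".+" means only the last "_S<digits>" is treated as the sample number.
VCF_NAME = re.compile(r"^(?P<sample_name>.+)_S(?P<sample_number>\d+)\.vcf$")


def parse_vcf_filename(filename):
    """Return (sample_name, sample_number), or None if the name does not match."""
    match = VCF_NAME.match(filename)
    if match is None:
        return None
    return match.group("sample_name"), int(match.group("sample_number"))


# Hypothetical file name, used only as an example.
print(parse_vcf_filename("NA12878_S1.vcf"))  # ('NA12878', 1)
```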
The header of the VCF file describes the tags used in the remainder of the file and ends with the column header line:
##fileformat=VCFv4.1
##fileDate=20120317
##source=SequenceAnalysisReport.vshost.exe
##reference=
##phasing=none
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=TI,Number=.,Type=String,Description="Transcript ID">
##INFO=<ID=GI,Number=.,Type=String,Description="Gene ID">
##INFO=<ID=CD,Number=0,Type=Flag,Description="Coding Region">
##FILTER=<ID=q20,Description="Quality below 20">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
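The header lines above can be read programmatically. The following Python sketch collects the `##INFO`, `##FILTER`, and `##FORMAT` declarations into dictionaries keyed by ID and reads the column names from the `#CHROM` line. The function name `parse_header` and the shortened `HEADER` string are assumptions made for illustration; this is not a full VCF parser.

```python
import re

# A shortened copy of the header shown above, used only as example input.
HEADER = """##fileformat=VCFv4.1
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=CD,Number=0,Type=Flag,Description="Coding Region">
##FILTER=<ID=q20,Description="Quality below 20">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE"""

# key=value pairs inside <...>; a value is either a quoted string or runs to the next comma.
PAIR = re.compile(r'(\w+)=("[^"]*"|[^,]*)')


def parse_header(text):
    """Return (meta, columns) for the header text (hypothetical helper)."""
    meta = {"INFO": {}, "FILTER": {}, "FORMAT": {}}
    columns = []
    for line in text.splitlines():
        if line.startswith("##"):
            key, _, value = line[2:].partition("=")
            if key in meta and value.startswith("<") and value.endswith(">"):
                fields = {k: v.strip('"') for k, v in PAIR.findall(value[1:-1])}
                if "ID" in fields:
                    meta[key][fields["ID"]] = fields
        elif line.startswith("#"):
            # Column header line; split() accepts tabs as well as spaces.
            columns = line[1:].split()
    return meta, columns


meta, columns = parse_header(HEADER)
print(meta["INFO"]["DP"]["Description"])  # Total Depth
print(columns)
```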
A sample line of the VCF file follows, with the data used to populate each column described after it:
chr22 16285888 rs76548004 T C 17 d15;q20 DP=11;TI=NM_001136213;GI=POTEH;CD GT:GQ 1/0:17
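To make the column layout concrete, here is a rough Python sketch that splits the sample line above into named fields. The helper `parse_record` is hypothetical and assumes a single sample column; files with additional sample columns would need extra handling.

```python
COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", "SAMPLE"]


def parse_record(line):
    """Split one VCF data line into a dict keyed by column name (illustrative only)."""
    # VCF columns are tab-delimited; split() with no argument also accepts
    # the space-separated rendering of the sample line shown above.
    raw = dict(zip(COLUMNS, line.split()))

    info = {}
    if raw["INFO"] != ".":
        for entry in raw["INFO"].split(";"):
            key, sep, value = entry.partition("=")
            info[key] = value if sep else True  # entries without '=' are flags, e.g. CD

    return {
        "CHROM": raw["CHROM"],
        "POS": int(raw["POS"]),
        "ID": [] if raw["ID"] == "." else raw["ID"].split(";"),
        "REF": raw["REF"],
        "ALT": raw["ALT"].split(","),
        "QUAL": None if raw["QUAL"] == "." else float(raw["QUAL"]),
        "FILTER": [] if raw["FILTER"] == "PASS" else raw["FILTER"].split(";"),
        "INFO": info,
        # Pair each FORMAT key with the value at the same position in SAMPLE.
        "SAMPLE": dict(zip(raw["FORMAT"].split(":"), raw["SAMPLE"].split(":"))),
    }


record = parse_record(
    "chr22 16285888 rs76548004 T C 17 d15;q20 "
    "DP=11;TI=NM_001136213;GI=POTEH;CD GT:GQ 1/0:17"
)
print(record["FILTER"])        # ['d15', 'q20']
print(record["INFO"]["CD"])    # True
print(record["SAMPLE"]["GT"])  # 1/0
```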
Setting | Description
---|---
ALT | The alleles that differ from the reference read. For example, an insertion of a single T could show reference A and alternate AT.
CHROM | The chromosome of the reference genome. Chromosomes appear in the same order as in the reference FASTA file (generally karyotype order).
FILTER | If all filters are passed, 'PASS' is written. The possible filters are as follows:
FORMAT | The format column lists fields (separated by colons), for example, "GT:GQ". The list of fields provided depends on the variant caller used. The available fields are as follows. AD – Entry of the form X,Y, where X is the number of reference calls and Y is the number of alternate calls. GQ – Genotype quality. GT – Genotype. 0 corresponds to the reference base, 1 corresponds to the first entry in the ALT column, 2 corresponds to the second entry in the ALT column, and so on. The '/' indicates that there is no phasing information. NL – Noise level; an estimate of base calling noise at this position. SB – Strand bias at this position. Larger negative values indicate less bias; values near zero indicate more strand bias. VF – Variant frequency; the percentage of reads supporting the alternate allele.
ID | The rs number for the SNP obtained from dbSNP. If there are multiple rs numbers at this location, the list is semicolon-delimited. If no dbSNP entry exists at this position, the missing value ('.') is used.
INFO | The possible entries in the INFO column:
POS | The 1-based position of this variant in the reference chromosome. The convention for VCF files is that, for SNPs, this base is the reference base with the variant. For insertions or deletions, this base is the reference base immediately before the variant. Variants are in order of position.
QUAL | A Phred-scaled quality score assigned by the variant caller. Higher scores indicate higher confidence in the variant (and lower probability of errors). For a quality score of Q, the estimated probability of an error is 10^(-Q/10). For example, the set of Q30 calls has a 0.1% error rate. Many variant callers assign quality scores (based on their statistical models) that are high relative to the error rate observed in practice.
REF | The reference genotype. For example, a deletion of a single T can read reference TT and alternate T.
SAMPLE | The sample column gives the values specified in the FORMAT column. One MAXGT sample column is provided for the normal genotyping (assuming the reference). For reference, a second column is provided for genotyping assuming the site is polymorphic. See the Starling documentation for more details.
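Two of the descriptions above involve a small calculation: converting a Phred-scaled QUAL score to an error probability, and mapping GT indices back to alleles. The Python sketch below illustrates both. The function names are hypothetical, and the handling of the `|` separator follows the general VCF convention for phased genotypes rather than anything specific to this software.

```python
def phred_to_error_probability(q):
    """Estimated error probability for a Phred-scaled score Q: 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_genotype(gt, ref, alt):
    """Translate GT indices into alleles: 0 is REF, 1 is the first ALT entry, and so on."""
    alleles = [ref] + alt.split(",")
    # '/' means no phasing information; '|' is the VCF separator for phased genotypes.
    separator = "|" if "|" in gt else "/"
    return [None if index == "." else alleles[int(index)] for index in gt.split(separator)]


# Q30 corresponds to an estimated error probability of about 0.001 (0.1%).
print(phred_to_error_probability(30))

# Values from the sample line above: REF=T, ALT=C, GT=1/0.
print(decode_genotype("1/0", "T", "C"))  # ['C', 'T']
```

For the sample line above, QUAL is 17, which corresponds to an estimated error probability of roughly 0.02. That is consistent with the line carrying the q20 filter.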
Note:
Variant files for Isaac also contain off-target variant calls, with filter.