Field Descriptions
The fixed fields #CHROM, POS, ID, REF, ALT, and QUAL are defined in the VCF 4.1 standard provided by the 1000 Genomes Project. The fields ID, INFO, FORMAT, and sample are described in the meta-information lines of the file header.
• CHROM: Chromosome. An identifier from the reference genome or an angle-bracketed ID string ("<ID>") pointing to a contig.
• POS: Position. The reference position, with the first base having position 1. Positions are sorted numerically, in increasing order, within each reference sequence CHROM. There can be multiple records with the same POS. Telomeres are indicated by using positions 0 or N+1, where N is the length of the corresponding chromosome or contig.
• ID: Semicolon-separated list of unique identifiers, where available. If the variant is in dbSNP, use of the rs number is encouraged. No identifier is present in more than one data record. If no identifier is available, the missing value is used.
• REF: Reference bases: A, C, G, T, N; there can be multiple bases. The value in the POS field refers to the position of the first base in the string. For simple insertions and deletions in which either the REF or one of the ALT alleles would otherwise be null/empty, the REF and ALT strings include the base before the event. This modification is reflected in the POS field. The exception is when the event occurs at position 1 on the contig, in which case they include the base after the event. If any of the ALT alleles is a symbolic allele (an angle-bracketed ID string "<ID>"), the padding base is required. In that case, POS denotes the coordinate of the base preceding the polymorphism.
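The padding rule for REF and ALT can be illustrated with a simple deletion. The following is a minimal sketch, not part of any Illumina tool: the function name `padded_deletion` and its arguments are hypothetical, and it assumes the reference is available as a plain string and that the deletion does not remove the entire contig.

```python
def padded_deletion(reference, start, length):
    """Return (POS, REF, ALT) for a deletion of `length` bases.

    `start` is the 1-based position of the first deleted base.
    Hypothetical helper for illustration only.
    """
    deleted = reference[start - 1:start - 1 + length]
    if start > 1:
        # Usual case: pad with the base before the event; POS moves back by one.
        anchor = reference[start - 2]
        return start - 1, anchor + deleted, anchor
    # Event at position 1: pad with the base after the event instead.
    anchor = reference[length]
    return 1, deleted + anchor, anchor
```

For the reference string `ATCG`, deleting the `C` at position 3 gives POS 2, REF `TC`, ALT `T`. Deleting the first two bases gives POS 1, REF `ATC`, ALT `C`.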
• ALT: Comma-separated list of alternate nonreference alleles called on at least one of the samples. Options are:
  • Base strings made up of the bases A, C, G, T, N
  • Angle-bracketed ID string ("<ID>")
  • Break-end replacement string, as described in the section on break-ends.
  • If there are no alternative alleles, the missing value is used.
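The ALT options listed above can be told apart with a few string checks. This is a rough sketch under stated assumptions: the function name `classify_alt` and the kind labels are hypothetical, base strings are assumed to be uppercase, and break-end replacement strings are recognized only by the square brackets they contain.

```python
import re

# Assumption: base strings are uppercase A, C, G, T, N only.
BASES = re.compile(r"[ACGTN]+")


def classify_alt(alt_field):
    """Return a list of (allele, kind) pairs for one ALT column value."""
    if alt_field == ".":
        # Missing value: no alternative alleles.
        return [(".", "missing")]
    result = []
    for allele in alt_field.split(","):
        if allele.startswith("<") and allele.endswith(">"):
            kind = "symbolic"
        elif "[" in allele or "]" in allele:
            kind = "breakend"
        elif BASES.fullmatch(allele):
            kind = "bases"
        else:
            kind = "unknown"
        result.append((allele, kind))
    return result
```

Anything that matches none of the checks is reported as `unknown` rather than rejected, so a caller can decide how strict to be.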
• QUAL: Phred-scaled quality score for the assertion made in ALT, that is, -10 log10 p(call in ALT is wrong). If ALT is "." (no variant), this score is -10 log10 p(variant). If ALT is not ".", this score is -10 log10 p(no variant). High QUAL scores indicate high-confidence calls. Although Phred scores are traditionally integers, this field can be a floating-point value to enable higher resolution for low-confidence calls. If unknown, the missing value is specified. (Numeric)
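The Phred relationship behind QUAL can be sketched in a few lines. The function names below are hypothetical and only illustrate the conversion between an error probability and a Phred-scaled score, plus handling of the missing value.

```python
import math


def phred(p_error):
    """Phred-scale an error probability: -10 * log10(p). Requires p > 0."""
    return -10.0 * math.log10(p_error)


def error_probability(qual):
    """Invert the Phred scale: p = 10 ** (-QUAL / 10)."""
    return 10.0 ** (-qual / 10.0)


def parse_qual(field):
    """Parse the QUAL column; the missing value '.' becomes None."""
    return None if field == "." else float(field)
```

An error probability of 0.001 corresponds to a score of about 30, and a score of 20 corresponds to an error probability of about 0.01.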
• FILTER: PASS marks positions that have passed all filters. Otherwise, a semicolon-separated list of codes for filters that failed is provided. gVCF files use the following values:
  • PASS: Position has passed all filters.
  • IndelConflict: Locus is in a region with conflicting indel calls.
  • SiteConflict: Site genotype conflicts with a proximal indel call. This is typically a heterozygous SNV call made inside a heterozygous deletion.
  • LowGQX: Locus GQX (minimum of {Genotype quality assuming variant position, Genotype quality assuming nonvariant position}) is less than 30 or not present.
  • HighDPFRatio: The fraction of base calls filtered out at a site is greater than 0.3.
  • HighSNVSB: SNV strand bias value (SNVSB) exceeds 10. High strand bias indicates a potential high false-positive rate for SNVs.
  • HighSNVHPOL: SNV contextual homopolymer length (SNVHPOL) exceeds 6.
  • HighREFREP: Indel contains an allele that occurs in a homopolymer or dinucleotide tract with a reference repeat greater than 8.
  • HighDepth: Locus depth is greater than 3x the mean chromosome depth.
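The threshold-based filters above can be expressed as a small function. This is a sketch, not the variant caller's implementation. It makes several assumptions: the function name and arguments are hypothetical, the filtered fraction is computed as DPF / (DP + DPF), "locus depth" is taken to be DP, and IndelConflict and SiteConflict are left out because they depend on neighboring calls.

```python
def site_filters(gqx=None, dp=0, dpf=0, snvsb=None, snvhpol=None,
                 refrep=None, mean_chrom_depth=None):
    """Return a FILTER string for one site from the thresholds listed above."""
    failed = []
    if gqx is None or gqx < 30:
        failed.append("LowGQX")
    total = dp + dpf
    # Assumption: fraction of filtered base calls = DPF / (DP + DPF).
    if total > 0 and dpf / total > 0.3:
        failed.append("HighDPFRatio")
    if snvsb is not None and snvsb > 10:
        failed.append("HighSNVSB")
    if snvhpol is not None and snvhpol > 6:
        failed.append("HighSNVHPOL")
    if refrep is not None and refrep > 8:
        failed.append("HighREFREP")
    # Assumption: locus depth is the DP value of the site.
    if mean_chrom_depth is not None and dp > 3 * mean_chrom_depth:
        failed.append("HighDepth")
    return ";".join(failed) if failed else "PASS"
```

A site with GQX 45, DP 30, and DPF 2 yields `PASS`. A site with GQX 12, DP 10, DPF 10, and SNVSB 12.5 yields `LowGQX;HighDPFRatio;HighSNVSB`.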
• INFO: Additional information. INFO fields are encoded as a semicolon-separated series of short keys with optional values in the format: <key>=<data>[,data]. gVCF files use the following values:
  • END: End position of the region described in this record.
  • BLOCKAVG_min30p3a: Nonvariant site block. All sites in a block are constrained to be nonvariant, have the same filter value, and have all sample values in range [x,y], where y ≤ max(x+3, x*1.3). All printed site block sample values are the minimum observed in the region spanned by the block.
  • SNVSB: SNV site strand bias.
  • SNVHPOL: SNV contextual homopolymer length.
  • CIGAR: CIGAR alignment for each alternate indel allele.
  • RU: Smallest repeating sequence unit extended or contracted in the indel allele relative to the reference. RUs longer than 20 bases are not reported.
  • REFREP: Number of times RU is repeated in the reference.
  • IDREP: Number of times RU is repeated in the indel allele.
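The INFO encoding and the block range constraint can be sketched as follows. Both function names are hypothetical. `parse_info` assumes that a key without a value (such as BLOCKAVG_min30p3a) is a flag, and `fits_block` checks only the range condition y ≤ max(x+3, x*1.3), not the nonvariant or same-filter requirements.

```python
def parse_info(info_field):
    """Parse an INFO column into a dict of key -> list of values, or True for flags."""
    if info_field == ".":
        return {}
    info = {}
    for entry in info_field.split(";"):
        key, sep, value = entry.partition("=")
        # Keys with no '=' are treated as flags.
        info[key] = value.split(",") if sep else True
    return info


def fits_block(values):
    """Check the range constraint for one sample value across a candidate block."""
    x, y = min(values), max(values)
    return y <= max(x + 3, x * 1.3)
```

For example, sample values of 30 and 33 can share a block, because 33 ≤ max(33, 39), while values of 10 and 14 cannot, because 14 > max(13, 13).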
• FORMAT: Format of the sample field. FORMAT specifies the data types and order of the subfields. gVCF files use the following values:
  • GQX: Minimum of {Genotype quality assuming variant position, Genotype quality assuming nonvariant position}.
  • DP: Filtered base call depth used for site genotyping.
  • DPF: Base calls filtered from input before site genotyping.
  • AD: Allelic depths for the ref and alt alleles in the order listed. For indels, this value only includes reads that confidently support each allele (posterior probability 0.999 or higher that read contains indicated allele vs all other intersecting indel alleles).
  • DPI: Read depth associated with indel, taken from the site preceding the indel.
• SAMPLE: Sample fields as defined by the header.
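Because FORMAT gives the order of the subfields, a sample column can be read by pairing the two colon-separated lists. The sketch below is illustrative only: `parse_sample` is a hypothetical name, the numeric subfields are assumed to be integers, and the example values are made up.

```python
def parse_sample(format_field, sample_field):
    """Pair FORMAT keys with the values of one sample column."""
    record = {}
    keys = format_field.split(":")
    values = sample_field.split(":")
    # zip stops at the shorter list, so dropped trailing subfields are skipped.
    for key, raw in zip(keys, values):
        if raw == ".":
            record[key] = None  # missing value
        elif key == "AD":
            # Allelic depths: ref first, then alt alleles in the order listed.
            record[key] = [int(v) for v in raw.split(",")]
        elif key in ("GQX", "DP", "DPF", "DPI"):
            record[key] = int(raw)  # assumption: integer-valued
        else:
            record[key] = raw
    return record
```

With FORMAT `GQX:DP:DPF:AD` and a sample column of `48:31:2:15,16`, the result maps GQX to 48, DP to 31, DPF to 2, and AD to the list [15, 16].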