FASTQ files explained

Dec 1, 2017

Illumina sequencing technology uses cluster generation and sequencing by synthesis (SBS) chemistry to sequence millions or billions of clusters on a flow cell, depending on the sequencing platform.  During SBS chemistry, for each cluster, base calls are made and stored for every cycle of sequencing by the Real-Time Analysis (RTA) software on the instrument. RTA stores the base call data in the form of individual base call (or BCL) files. When sequencing completes, the base calls in the BCL files must be converted into sequence data. This process is called BCL to FASTQ conversion.

A FASTQ file is a text file that contains the sequence data from the clusters that pass filter on a flow cell (for more information on clusters passing filter, see the “additional information” section of this bulletin). If samples were multiplexed, the first step in FASTQ file generation is demultiplexing.  Demultiplexing assigns clusters to a sample, based on the cluster’s index sequence(s). After demultiplexing, the assembled sequences are written to FASTQ files per sample. If samples were not multiplexed, the demultiplexing step does not occur, and, for each flow cell lane, all clusters are assigned to a single sample.
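
The demultiplexing step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any Illumina software implements it: the index sequences, sample names, and the "Undetermined" bucket name are made up for the example, and only exact index matches are handled, whereas real conversion software typically offers more options.

```python
from collections import defaultdict

# Hypothetical sample sheet: index sequence -> sample name (made-up values).
SAMPLE_BY_INDEX = {
    "ATCACG": "sample_A",
    "CGATGT": "sample_B",
}


def demultiplex(clusters, sample_by_index):
    """Group (index_sequence, read_sequence) pairs by sample.

    Clusters whose index is not listed go to an "Undetermined" bucket.
    Only exact index matches are considered in this sketch.
    """
    reads_by_sample = defaultdict(list)
    for index_seq, read_seq in clusters:
        sample = sample_by_index.get(index_seq, "Undetermined")
        reads_by_sample[sample].append(read_seq)
    return dict(reads_by_sample)


# Made-up clusters: (index sequence, read sequence).
clusters = [
    ("ATCACG", "GATTACA"),
    ("CGATGT", "TTGACCA"),
    ("ATCACG", "CCGTAAG"),
    ("NNNNNN", "ACGTACG"),
]
result = demultiplex(clusters, SAMPLE_BY_INDEX)
for sample, reads in result.items():
    print(sample, reads)
```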

For a single-read run, one Read 1 (R1) FASTQ file is created for each sample per flow cell lane. For a paired-end run, one R1 and one Read 2 (R2) FASTQ file are created for each sample per lane. FASTQ files are gzip-compressed and written with the extension *.fastq.gz.
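
As a quick sanity check, the number of per-sample FASTQ files to expect follows directly from the rule above. The helper below is a hypothetical sketch: it counts only the R1/R2 files described in this bulletin and assumes every sample is present in every lane, so any additional files a particular conversion tool may write are not included.

```python
def expected_fastq_count(n_samples, n_lanes, paired_end):
    """Return the number of per-sample FASTQ files for a run.

    One R1 file per sample per lane, plus one R2 file per sample
    per lane when the run is paired-end.
    """
    files_per_sample_per_lane = 2 if paired_end else 1
    return n_samples * n_lanes * files_per_sample_per_lane


# Example: 3 samples sequenced across 2 lanes.
print(expected_fastq_count(3, 2, paired_end=False))
print(expected_fastq_count(3, 2, paired_end=True))
```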

What does a FASTQ file look like?

For each cluster that passes filter, a single sequence is written to the corresponding sample’s R1 FASTQ file, and, for a paired-end run, a single sequence is also written to the sample’s R2 FASTQ file. Each entry in a FASTQ file consists of 4 lines:

  1. A sequence identifier with information about the sequencing run and the cluster. The exact contents of this line vary based on the BCL to FASTQ conversion software used.
  2. The sequence (the base calls; A, C, T, G and N).
  3. A separator, which is simply a plus (+) sign.
  4. The base call quality scores. These are Phred +33 encoded, using ASCII characters to represent the numerical quality scores.
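
To make the Phred +33 encoding in line 4 concrete, the sketch below converts a quality string back to numerical scores by subtracting 33 from each character's ASCII code. It also applies the standard Phred relationship, Q = -10 * log10(P), to turn a score into an estimated error probability. The function names and the sample quality string are chosen for illustration only.

```python
def phred33_to_scores(quality_line):
    """Decode a Phred +33 quality string into a list of integer Q scores."""
    return [ord(ch) - 33 for ch in quality_line]


def error_probability(q_score):
    """Estimated probability that a base call with this Q score is wrong."""
    return 10 ** (-q_score / 10)


# Made-up quality string: '!' is ASCII 33 (Q0) and 'I' is ASCII 73 (Q40).
scores = phred33_to_scores("!+5?I")
print(scores)
print([error_probability(q) for q in scores])
```

Under this encoding, a Q score of 30 corresponds to an estimated 1 in 1,000 chance that the base call is incorrect.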

Here is an example of a single entry in an R1 FASTQ file:
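
As a rough illustration of the 4-line layout, the Python sketch below holds a made-up entry and splits it into its four parts. The identifier, bases, and quality string are invented for this example and do not come from a real run; as noted above, the exact identifier contents depend on the conversion software. The FastqEntry and parse_entry names are hypothetical.

```python
from typing import NamedTuple


class FastqEntry(NamedTuple):
    identifier: str  # line 1, starts with '@'
    sequence: str    # line 2, base calls
    separator: str   # line 3, starts with '+'
    quality: str     # line 4, Phred +33 quality characters


def parse_entry(text):
    """Split one 4-line FASTQ entry into its parts, with basic checks."""
    lines = text.strip().splitlines()
    if len(lines) != 4:
        raise ValueError("a FASTQ entry must have exactly 4 lines")
    identifier, sequence, separator, quality = lines
    if not identifier.startswith("@"):
        raise ValueError("identifier line must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("separator line must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lines must be the same length")
    return FastqEntry(identifier, sequence, separator, quality)


# Made-up entry for illustration only.
EXAMPLE_ENTRY = """@EXAMPLE:1:FLOWCELL:1:1101:1000:2000 1:N:0:1
GATTACAN
+
IIIIIII#
"""

entry = parse_entry(EXAMPLE_ENTRY)
print(entry.sequence, entry.quality)
```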

More detailed information on the FASTQ format can be found here.

How to view a FASTQ file

FASTQ files can contain millions of entries and range from several megabytes to several gigabytes in size, which often makes them too large to open in a normal text editor. Generally, it is not necessary to view FASTQ files, because they are intermediate output files used as input for tools that perform downstream analysis, such as alignment to a reference or de novo assembly.

If you need to view a FASTQ file for troubleshooting purposes or out of curiosity, you will need either a text editor that can handle very large files, or access to a Unix or Linux system where large files can be viewed via the command line.
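
Another option, where Python is available, is to read only the first few entries so that the whole file is never loaded into memory. The sketch below uses the standard-library gzip module to stream a compressed FASTQ file. The head_fastq name and the file path in the usage comment are hypothetical.

```python
import gzip
from itertools import islice


def head_fastq(path, n_entries=2):
    """Return the first n_entries entries of a gzipped FASTQ file.

    Each entry is returned as a list of its 4 lines. The file is read
    as a stream, so only the requested lines are decompressed.
    """
    with gzip.open(path, "rt") as handle:
        lines = [line.rstrip("\n") for line in islice(handle, 4 * n_entries)]
    return [lines[i:i + 4] for i in range(0, len(lines), 4)]


# Usage (hypothetical path; replace with a real *.fastq.gz file):
# for entry in head_fastq("my_sample_R1.fastq.gz", n_entries=3):
#     print("\n".join(entry))
```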

How to generate FASTQ files

FASTQ file generation is the first step for all analysis workflows used by MiSeq Reporter on the MiSeq and Local Run Manager on the MiniSeq.  When analysis completes, the FASTQ files are located in <run folder>\Data\Intensities\BaseCalls on the MiSeq and <output folder>\Alignment_#\<subfolder>\Fastq on the MiniSeq.

For all runs uploaded to BaseSpace Sequence Hub, FASTQ file generation automatically occurs after the run is completely uploaded, and the FASTQ files are used as input for the various analysis apps on BaseSpace Sequence Hub.  On BaseSpace Sequence Hub, you can find your FASTQ files in the project(s) associated with your run.

The bcl2fastq conversion software can be used to generate FASTQ files from data generated on all current Illumina sequencing systems.

For information on the different settings that can be applied during FASTQ file generation, see the software user guides below.

MiSeq Reporter

Local Run Manager


Additional information

A description of clusters passing filter, and the requirements for a cluster to pass filter, can be found in section 1.5.8 of the MiSeq: Imaging and Base Calling online training course.

See 2-Channel SBS Technology for more information about base calling on NovaSeq, NextSeq and MiniSeq systems.

See Illumina Sequencing Technology for more information about base calling on MiSeq and HiSeq systems.