Taxonomic Database

The taxonomic database used is an Illumina-curated version of the May 2013 release of the Greengenes Consortium Database (greengenes.secondgenome.com/downloads).

Here are the current statistics for that database:

Taxonomic Level    # of classifications
Kingdoms           3
Phyla              33
Classes            74
Orders             148
Families           321
Genera             1086
Species            6466

To resolve taxonomies down to the species level, we used the Greengenes SQL database files (gg_13_5.sql.gz). Specifically, our database started with every entry in the Greengenes clones, isolates, and symbionts tables. From there, we applied a set of filters:

1 Filter out all entries whose 16S sequence is shorter than 1250 bp.
2 Filter out all entries that have more than 50 wobble bases (i.e. M, R, W, S, Y, K, V, H, D, B, N).
3 Filter out all entries that are only partially classified (no classification for genus or species).
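The three filters above can be sketched in Python as follows. This is only an illustration, not the program actually used: the Entry record and its field names are hypothetical and do not reflect the Greengenes SQL schema, and filter 3 is read here as "both genus and species are empty", which is one possible interpretation of the wording.

```python
from dataclasses import dataclass
from typing import Optional

# Wobble bases as listed in filter 2.
WOBBLE_BASES = set("MRWSYKVHDBN")

MIN_LENGTH_BP = 1250      # filter 1 threshold
MAX_WOBBLE_BASES = 50     # filter 2 threshold


@dataclass
class Entry:
    """Hypothetical record; field names are illustrative only."""
    sequence: str
    genus: Optional[str]
    species: Optional[str]


def count_wobble_bases(sequence: str) -> int:
    """Count the bases in the sequence that are wobble bases."""
    return sum(1 for base in sequence.upper() if base in WOBBLE_BASES)


def passes_filters(entry: Entry) -> bool:
    """Return True if the entry survives all three filters."""
    # Filter 1: drop sequences shorter than 1250 bp.
    if len(entry.sequence) < MIN_LENGTH_BP:
        return False
    # Filter 2: drop sequences with more than 50 wobble bases.
    if count_wobble_bases(entry.sequence) > MAX_WOBBLE_BASES:
        return False
    # Filter 3 (assumed reading): drop entries with neither genus nor species.
    if not entry.genus and not entry.species:
        return False
    return True


def apply_filters(entries: list[Entry]) -> list[Entry]:
    """Keep only the entries that pass every filter."""
    return [entry for entry in entries if passes_filters(entry)]
```

Keeping the thresholds as named constants makes it easy to see how each number in the list maps onto the code.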

The Greengenes database had a number of classifications placed in the wrong field, e.g. improper genus or species names, or clone or strain IDs placed in the species field. We developed a program to help identify and clean up these entries.

Ambiguous epithets and classifications (sp, aff, cf, genosp, genomosp) were removed because they are effectively equivalent to an empty taxonomic level.
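As a rough sketch of this step, the function below treats a taxonomic level as empty when it contains one of the ambiguous epithets. The matching rules (whole-word, case-insensitive, trailing period ignored) and the function name clean_level are assumptions made for illustration; the source does not describe how the actual cleanup program matches these terms.

```python
from typing import Optional

# Ambiguous epithets listed in the text.
AMBIGUOUS_EPITHETS = {"sp", "aff", "cf", "genosp", "genomosp"}


def clean_level(name: Optional[str]) -> Optional[str]:
    """Return the name, or None if it is empty or contains an ambiguous epithet."""
    stripped = name.strip() if name else ""
    if not stripped:
        return None
    # Compare whole tokens only, so names such as "sphaericus" are kept.
    tokens = [token.rstrip(".").lower() for token in stripped.split()]
    if any(token in AMBIGUOUS_EPITHETS for token in tokens):
        return None
    return stripped
```

Matching on whole tokens rather than substrings avoids discarding legitimate names that merely begin with "sp" or "cf".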

Listeria monocytogenes (GenBank entry X56153.1), Listeria innocua (GenBank entry FJ774235.1), and PhiX (NCBI reference sequence: NC_001422) were added to the database to support internal research projects.

© 2014 Illumina, Inc. All rights reserved.

15055861 A