ABSTRACT
We report high-quality closed reference genomes for 1 bovine strain and 10 human Shiga toxin (Stx)-producing Escherichia coli (STEC) strains from serogroups O26, O45, O91, O103, O104, O111, O113, O121, O145, and O157. We also report draft assemblies, with standardized metadata, for 360 STEC strains isolated from watersheds, animals, farms, food, and human infections.
ANNOUNCEMENT
Shiga toxin (Stx)-producing Escherichia coli (STEC) strains cause significant human enteric disease (1–3). Among >129 O serogroups, O157, O26, O45, O111, O103, O121, and O145 cause most infections (4–6). Non-O157 STEC strains are increasingly reported (5–7), with recent widespread STEC O121 and O103 outbreaks in Canada and the United States sourced to flour (8) and ground beef, respectively (https://www.cdc.gov/ecoli/2019/o103-04-19/).
To catalogue pangenomic diversity, we generated closed reference genomes for 11 routinely used lab control strains from the “top seven” STEC O serogroups, plus O91 and O104, and 360 draft assemblies for 129 distinct O serogroups from STEC culture collections (1980 to 2013) originating from watersheds, farms, or foods (n = 238), human infections (n = 74), proficiency panels (n = 27), and unknown sources (n = 32). Prior to selection, isolates were traditionally serotyped at national or provincial reference labs, and stx gene presence/subtype was assessed by preestablished generic and differentiating stx PCR assays (9–11).
DNA was extracted with MasterPure complete DNA purification kits (Epicentre Technologies Corp., Chicago, IL, USA) from 1-ml Luria-Bertani broth cultures grown overnight at 37°C and was fragmented with an E210 ultrasonicator (Covaris, Inc., Woburn, MA, USA). Libraries prepared with the TruSeq DNA library preparation v2 kit (Illumina, San Diego, CA, USA) were shotgun sequenced using an Illumina GAIIx system (2 × 150-bp paired-end cluster generation kit v4 and TruSeq SBS kit v5) or MiSeq platform (2 × 300 bp; v3 chemistry). Illumina reads were managed in the Integrated Rapid Infectious Disease Analysis (IRIDA) platform (12), assessed for quality (Q > 30) using FastQC (13), and trimmed using Trimmomatic v0.34 (14). Overlapping reads merged with FLASH v1.2.11 (15) were de novo assembled using SPAdes v3.8.2 (16)/Shovill v0.9.0 (17). Postassembly quality control was achieved using QUAST v5.0.0 (number of contigs, < 500; reference coverage, 70% [closest polished genome of FWSEC0001-0011]) (18). Reference strains were augmented with 2 × 300-bp MiSeq (v3 chemistry) reads from mate pair (∼8 kb) TruSeq libraries and with MinION Mk1b long reads from the rapid barcoding sequencing kit (SQK-RBK004; Oxford Nanopore Technologies Ltd., Oxford, UK) libraries. Long reads that were base-called and quality filtered with Albacore v2.3.0 were de novo assembled using Canu v1.7 (19) and, together with quality-controlled Illumina mate pair reads, as hybrid assemblies using Unicycler v0.4.4.0 (20). When assemblies appeared congruent in Mauve v20150226Build10 (21), Unicycler assemblies were used. Otherwise, mate pair reads were mapped to both assemblies using Bowtie 2 v2.3.4.1 (22), and BAM files were assessed for coverage/connections using GAP5 v.1.2.14-r (23); long reads were mapped with BWA-MEM v0.7.17.1 (24) and assessed using Tablet v1.17.08.17 (25). Canu contigs were employed to scaffold/correct Unicycler contigs using the Staden package GAP4 (26, 27); all contigs were circularized and trimmed.
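The postassembly quality control thresholds stated above (number of contigs, < 500; reference coverage, 70%) can be illustrated with a minimal Python sketch. This is not the authors' script. It assumes that the 70% reference coverage figure is a minimum, that it corresponds to QUAST's "Genome fraction (%)" metric, and that the metrics are read from a tab-separated table with the column names "Assembly", "# contigs", and "Genome fraction (%)" (as in QUAST's transposed report; verify against the QUAST version in use). The function names are hypothetical.

```python
import csv
import io

MAX_CONTIGS = 500            # from the text: number of contigs < 500
MIN_GENOME_FRACTION = 70.0   # from the text: reference coverage, 70% (assumed to be a minimum)


def passes_qc(n_contigs, genome_fraction):
    """Return True if an assembly meets both thresholds."""
    return n_contigs < MAX_CONTIGS and genome_fraction >= MIN_GENOME_FRACTION


def filter_assemblies(tsv_text):
    """Return the names of assemblies that pass QC.

    tsv_text is tab-separated text with a header row; the column names
    used here are assumptions about the QUAST report layout.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = []
    for row in reader:
        n_contigs = int(row["# contigs"])
        genome_fraction = float(row["Genome fraction (%)"])
        if passes_qc(n_contigs, genome_fraction):
            kept.append(row["Assembly"])
    return kept


# Illustrative (made-up) metrics for three hypothetical assemblies.
example = (
    "Assembly\t# contigs\tGenome fraction (%)\n"
    "asm_A\t153\t92.5\n"
    "asm_B\t612\t95.0\n"
    "asm_C\t200\t64.1\n"
)
print(filter_assemblies(example))
```

In this made-up example, only asm_A satisfies both thresholds; asm_B has too many contigs and asm_C falls below the assumed minimum genome fraction.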
Assemblies were Illumina read polished (5 rounds) using Bowtie 2/Pilon v1.20.1 (28). NCBI’s default Web BLASTN (29) identified plasmid contigs and confirmed that in silico O-serogroup determinations were congruent with traditional lab determinations. Read depth was assessed using SAMtools idxstats (30). After functional annotation using NCBI’s Prokaryotic Genome Annotation Pipeline (31), assemblies were reoriented to replication origin (dnaA) using Circlator v1.1.5 (32).
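As an illustration of how read depth can be approximated from SAMtools idxstats output, the following Python sketch multiplies the number of mapped reads per sequence by an assumed fixed read length and divides by the sequence length. The announcement states only that idxstats was used, so the formula, the fixed read length, and the function name are assumptions rather than the authors' procedure. The idxstats columns (sequence name, sequence length, mapped reads, unmapped reads) follow the SAMtools documentation.

```python
def depth_from_idxstats(idxstats_text, read_length):
    """Estimate mean depth per reference sequence from idxstats text.

    depth ~= mapped_reads * read_length / sequence_length
    This is an approximation: it ignores trimming, soft clipping, and
    variable read lengths.
    """
    depths = {}
    for line in idxstats_text.strip().splitlines():
        name, length, mapped, _unmapped = line.split("\t")
        length = int(length)
        # The final "*" row collects unplaced reads and has length 0.
        if name == "*" or length == 0:
            continue
        depths[name] = int(mapped) * read_length / length
    return depths


# Illustrative (made-up) idxstats output for a chromosome and one plasmid.
example_idxstats = (
    "chromosome\t5000000\t2000000\t0\n"
    "plasmid_1\t100000\t50000\t10\n"
    "*\t0\t0\t1234\n"
)
for name, depth in depth_from_idxstats(example_idxstats, read_length=250).items():
    print(f"{name}\t{depth:.1f}x")
```

A per-base estimate (for example, from a depth track) would be more accurate where read lengths vary after trimming; the idxstats-based figure is a quick summary per contig.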
Illumina reference genome coverage ranged from 65.6× to 130.7× (average, 96.1×); MinION coverage ranged from 51.2× to 325.1× (average, 149.6×) (Table 1). Of 360 draft assemblies, 357 yielded scaffolds (average contigs, 153.0; average coverage depth, 111.7×). Eleven reference chromosomes and all plasmids but one were circularized (0 to 3 plasmids per strain). The reference chromosomes (4,955,402 to 5,697,154 bp) contained 4,967 to 5,833 coding sequences (CDS), 22 rRNAs, 90 to 103 tRNAs, and 8 to 11 noncoding RNAs (ncRNAs), as well as bacteriophages. These genomic resources augment available data and are ideal for pathogenomics applications and machine learning.
TABLE 1 Characteristics and accession numbers for 11 high-quality STEC reference genomes and 360 STEC draft genome assemblies from 129 distinct serogroups sequenced for this study
Data availability. The standardized strain descriptions and accession numbers are presented in Table 1; the genomic data are publicly available in DDBJ/ENA/GenBank under BioProject no. PRJNA287560 and in the Sequence Read Archive under accession no. SRP155537. The versions described are the first versions.
ACKNOWLEDGMENTS
Many strains were actively collected during the Genomics Research and Development Initiative national shared priorities project on Food and Water Safety (GRDI-FWS); otherwise, they were acquired from the culture collections of T. Alexander, P. Delaquis, T. Edge, A. Gill, C. Gyles, C. Nadon, A. Scott, E. Topp, L. Tschetter, G. Wang, and the GRDI-FWS project partner organizations (namely, Agriculture and Agri-Food Canada, the Canadian Food Inspection Agency, Environment and Climate Change Canada, Health Canada, and the Public Health Agency of Canada [PHAC]). The National Microbiology Laboratory (NML)-Division of Enteric Diseases performed STEC serotyping (under direction by K. Tabor and K. Ziebell). C. Jokinen and R. Wang provided lab support. The NML Genomics Core (C. Bonner, B. Kaplen, V. Laminman, E. Landry, K. Melnychuk, T. Murphy, and G. Peters) performed sequencing. F. Pollari and K. Pintar collated metadata for FoodNet Canada isolates. IRIDA’s development team provided data management. The NML’s Bioinformatics Core and Scientific Informatics Services Division provided analysis capacity and infrastructure, respectively.
We thank the NCBI for all data assistance.
A.O. and T.L. were supported by the Government of Canada’s Federal Genomics Research and Development Initiative (GRDI) national shared priorities project on Food and Water Safety (GRDI-FWS). The work was funded by GRDI-FWS and an intramural GRDI to V.G. (for a portion of STEC draft genome assemblies and strain metadata), the Public Health Agency of Canada (for STEC reference genome closures), and Genome Canada/Genome BC (for metadata standardization).
The funders had no role in the study design, data collection, interpretation, public repository submission, or the decision to submit the work for publication.
FOOTNOTES
- Received 28 May 2019.
- Accepted 31 August 2019.
- Published 10 October 2019.
- © Crown copyright 2019.
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license.