Introduction

Fig. 1: Exponential decrease of bacteria searchability by BLAST.
The challenge
Comprehensive collections approaching millions of sequenced genomes have become central information sources in the life sciences. However, their size grows exponentially, at a faster rate than computational capacities. This makes it effectively impossible to search these data using tools such as BLAST and its successors. For instance, the proportion of searchable bacteria decreases exponentially over time (Fig. 1).
Phylogenetic compression
Phylogenetic compression is a technique using evolutionary history to guide compression and search of large collections of microbial genomes using existing algorithms and data structures. This improves the compression ratios of assemblies, de Bruijn graphs, and 𝑘-mer indexes by 1–2 orders of magnitude. Consequently, this enables BLAST-like alignment to all sequenced bacteria until 2019 on ordinary desktop and laptop computers within a few hours.
How does it work?
Phylogenetic compression combines four ingredients:
  1. clustering of genomes into phylogenetically related groups, followed by
  2. inference of a compressive phylogeny that acts as a template for
  3. data reordering, prior to
  4. an application of a calibrated low-level compressor or indexer.
This general scheme can be instantiated to individual protocols for various data types and for diverse application use cases, such as genome compression and data search.
Application 1: Efficient parallelized compression of arbitrarily sized genome collections
MiniPhy (Minimization via Phylogenetic compression) implements phylogenetic compression for large bacterial datasets. Examples of miniphied collections are provided in the List of Compressed Genome Collections. For instance, the 661k collection was recompressed from the original 805 GB (GZip) to 17.5 GB (MiniPhy-MBGC).
Application 2: BLAST-like alignments across all pre-2019 bacteria
MOF (Microbes on a Flash Drive) implements BLAST-like search across all pre-2019 bacteria on ordinary laptop and desktop computers. For instance, the time to search 2,826 EBI plasmids was reduced from the original 2,120 CPU hours (BIGSI, pres./abs. only) to 44 CPU hours (MOF, pres./abs. and alignments).
Want to learn more about the science behind it?
See the main paper about phylogenetic compression.

How-To’s

You’re a user

BLAST-like search across all pre-2019 bacteria from ENA
MOF (Microbes on a Flash Drive) is a tool based on phylogenetic compression that aligns queries to all high-quality genomes from the 661k collection on standard desktop and laptop computers, in a fashion similar to BLAST. All documentation and instructions for users are provided in the README of MOF.
Downloading phylogenetically compressed 661k and BIGSIdata collections
Phylogenetic compression makes it possible to compress existing large genome collections by 1–2 orders of magnitude compared to state-of-the-art protocols. The two main collections provided for users are the 661k and BIGSIdata collections.

For a comprehensive list of all compressed collections and additional details, see the List of Compressed Genome Collections.

Phylogenetic compression of custom genome collections by MiniPhy
Phylogenetic compression can in principle be extremely straightforward and based entirely on simple dataset-specific scripts that order the genomes according to a phylogenetic tree and compress them in that order.

MiniPhy implements this specifically with MashTree and XZ (possibly followed by MBGC), and is suitable for most practical use cases. All documentation and instructions for users are provided in the README of MiniPhy, including information on batching in case of very large collections.
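As a rough illustration of the simple script-based variant described above (not of MiniPhy itself), the following sketch reads the left-to-right leaf order from a Newick tree and compresses the genomes in that order with XZ. The function names, the `.fa` suffix, and the assumption that leaf names are unquoted and equal to the FASTA file names without suffix are all hypothetical; GNU tar is assumed for reading the file list from standard input.

```shell
#!/usr/bin/env bash
# Sketch only: assumes unquoted Newick leaf names that match FASTA file names.

# Print the leaf names of a Newick tree in left-to-right order, one per line.
# A leaf label directly follows "(" or ","; internal labels follow ")"
# and are therefore skipped.
leaf_order() {
    tr -d '\n\r\t ' < "$1" | grep -oE '[(,][^(),:;]+' | tr -d '(,'
}

# Archive <dir>/<leaf>.fa in tree order and compress the stream with xz.
# Usage: compress_in_tree_order tree.nw genomes_dir out.tar.xz
compress_in_tree_order() {
    local tree=$1 dir=$2 out=$3
    leaf_order "$tree" | sed 's/$/.fa/' \
        | tar -C "$dir" -cf - -T - \
        | xz -9 -T1 > "$out"
}
```

Placing similar genomes next to each other in the archive is what lets a general-purpose compressor such as XZ find long repeated matches within its window; the tree only serves to define that order.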

You’re a method developer

Evaluating your own low-level compressor in connection with phylogenetic compression
Recompress published tar archives, for instance, from the phylogenetically compressed 661k collection. If your compressor supports arbitrary content, just recompress a given TAR file, e.g., by
xzcat neisseria_gonorrhoeae__01.tar.xz \
  | your_general_compressor \
  > neisseria_gonorrhoeae__01.tar.compressed
If your compressor supports only the FASTA format, merge all the content (in the same file order) and recompress it, e.g., by
tar -xOvf neisseria_gonorrhoeae__01.tar.xz \
  | your_fasta_compressor \
  > neisseria_gonorrhoeae__01.fa.compressed
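To put the result of such an experiment into perspective, one might compare the size of the new output with the size of the uncompressed stream and of the published archive. A minimal sketch follows; the helper names `size_bytes` and `size_ratio` are hypothetical, and the file names simply follow the examples above.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are illustrative, not part of MiniPhy or MOF.

# Size of a file in bytes.
size_bytes() {
    wc -c < "$1" | tr -d ' '
}

# Ratio of two sizes, printed with two decimals.
# Usage: size_ratio <uncompressed_size> <compressed_size>
size_ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Example (commented out; requires the archive and your own output):
# raw=$(xzcat neisseria_gonorrhoeae__01.tar.xz | wc -c)
# size_ratio "$raw" "$(size_bytes neisseria_gonorrhoeae__01.tar.xz)"
# size_ratio "$raw" "$(size_bytes neisseria_gonorrhoeae__01.tar.compressed)"
```

The same helper reproduces the headline figure for the 661k collection: `size_ratio 805 17.5` prints `46.00`, i.e., the MiniPhy-MBGC archive is 46 times smaller than the original GZip files.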
Evaluating your own phylogeny inference methods in connection with phylogenetic compression

Download one or more batches from the phylogenetically compressed 661k collection, infer their phylogeny using your method, and finally re-compress the genomes using MiniPhy with your phylogeny. This can be achieved by placing both {batch}.txt and {batch}.nw into the input/ directory.

Evaluating your own genome indexer in connection with phylogenetic compression

Download one or more batches from the phylogenetically compressed 661k collection and index them in the order in which they appear in the archive. Indexing can be done either per individual batch (resulting in many small indexes), or by merging all the genome batches together while preserving their order (resulting in one large index).

Genome order can be determined from a .tar.xz file by
tar tf {batch}.tar.xz
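When several batches are merged into one large index, it can help to record the global genome order first. The following sketch prints one line per genome, prefixed by its batch name, in the order in which the archives are given and the genomes are stored. The function name and the tab-separated output format are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: prints "<batch>\t<genome>" for every file in every archive,
# keeping the order of the arguments and the order inside each archive.
# Usage: genome_order batch1.tar.xz batch2.tar.xz ... > order.tsv
genome_order() {
    local archive batch
    for archive in "$@"; do
        batch=$(basename "$archive")
        batch=${batch%%.tar*}          # strip ".tar", ".tar.xz", ...
        tar tf "$archive" \
            | grep -v '/$' \
            | awk -v b="$batch" '{ print b "\t" $0 }'
    done
}
```

Directory entries (names ending in `/`) are filtered out so that only genome files remain; whether a given archive contains such entries is not something this sketch assumes either way.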

List of Compressed Genome Collections

The following phylogenetically compressed genome collections are provided for download on Zenodo. Supplementary metadata for all the datasets can be found in a dedicated repository.

661k

Genomes: 661,405 Illumina draft assemblies of (est.) 2,336 bacterial species
Length: 2.58 Tbp
Diversity: 44.3 G distinct canonical 31-mers
Original size: assemblies: 805 GB (750 GiB, GZip); k-mer index: 936 GB (872 GiB, COBS Compact index)
Significance: All pre-2019 Illumina-sequenced bacterial isolates from ENA, all assembled using a unified pipeline

Assemblies

  • MiniPhy-XZ: 29.0 GB, production-ready
  • MiniPhy-MBGCv1: 20.7 GB, experimental
  • MiniPhy-MBGCv2: 17.5 GB, experimental

K-mer indexes

661k-HQ

Significance: Only those assemblies from the 661k collection that passed quality control (i.e., excluding contaminated samples)

K-mer indexes

BIGSIdata

Genomes: 425,160 de Bruijn graphs of (est.) 1,443 microbial species
Length: 1.68 Tbp (total unitig length)
Diversity: 41.1 G distinct canonical 31-mers
Original size: 16.7 TB after McCortex cleaning
Significance: All pre-2016 ENA bacterial and viral genomes

de Bruijn graphs

  • Prototype of MiniPhy(P3)-XZ: 74.4 GB, production-ready
  • MiniPhy(P3)-XZ: 52.3 GB, experimental

NCTC3k

Genomes: 1,065 near-complete assemblies of 259 bacterial species
Length: 4.35 Gbp
Diversity: 992 M distinct canonical 31-mers
Original size: 1.25 GB after gzip compression
Significance: A high-quality collection of diverse, nearly-complete bacterial genomes

Assemblies

GISP

Genomes: 1,102 Illumina draft assemblies of N. gonorrhoeae from this paper
Length: 2.36 Gbp
Diversity: 4.18 M distinct canonical 31-mers
Original size: 726 MB after gzip compression
Significance: A high-quality collection of draft assemblies of a single bacterial species of low diversity

Assemblies

SC2

Genomes: 590,779 complete assemblies of SARS-CoV-2
Length: 17.6 Gbp
Diversity: 1.85 M distinct canonical 31-mers
Original size: 201 MB after xz compression
Significance: An extremely large collection of genomes of the same viral species

Assemblies

  • Equiv. of MiniPhy-XZ: 10.7 MB
    • Technique: Genomes sorted left-to-right with respect to GISAID phylogeny and compressed by XZ
    • Files: Upon request (due to the licensing policies)

List of Software Packages

Core packages for phylogenetic compression

MOF
BLAST-like search on laptops across all pre-2019 bacteria (the 661k-HQ collection). Implemented as a Snakemake pipeline.
MiniPhy
The main package for phylogenetic compression of individual genome batches.

Auxiliary packages for phylogenetic compression

MiniPhy-COBS
Building phylogenetically compressed COBS indexes from the output of MiniPhy.
de-MiniPhy-BIGSIdata
Download and extraction of de Bruijn graphs from the miniphied BIGSIdata collection.

Low-level tools particularly adapted for phylogenetic compression

ProPhyle
Metagenomic classifier, based on 𝑘-mer propagation, simplitigs, and 𝑘-mer indexing using the Burrows-Wheeler Transform. ProPhyle is used by MiniPhy as the underlying engine for 𝑘-mer propagation to compress de Bruijn graphs and for computing the phylogenetically explained data redundancy in genome collections. ProPhyle was modified for the purpose of MOF by adding a parameter that stops the indexing step after 𝑘-mer propagation.
COBS
High-performance k-mer index based on inverted indexes and Bloom filters; an efficient re-implementation of BIGSI with additional ideas. To use COBS in MOF, we implemented functionality for reading indexes from data streams and support for OS X (versions 0.2.0 and 0.2.1).

Cite

Main paper

More information about phylogenetic compression, MiniPhy, and MOF can be found in the main phylogenetic compression paper [1].

  [1]  K. Břinda, L. Lima, S. Pignotti, N. Quinones-Olvera, K. Salikhov, R. Chikhi, G. Kucherov, Z. Iqbal, and M. Baym, Efficient and robust search of microbial genomes via phylogenetic compression, bioRxiv 2023.04.15.536996, 2023. https://doi.org/10.1101/2023.04.15.536996

Low-level techniques

MiniPhy and MOF build upon several low-level computational techniques that we developed previously, including simplitigs [2], COBS [3], 𝑘-mer propagation and ProPhyle [4], and the linkage between alignment scores and 𝑘-mer matches [5].

  [2]  K. Břinda, M. Baym, and G. Kucherov, Simplitigs as an efficient and scalable representation of de Bruijn graphs, Genome Biology 22(96), 2021. https://doi.org/10.1186/s13059-021-02297-z
  [3] T. Bingmann, P. Bradley, F. Gauger, and Z. Iqbal, COBS: A Compact Bit-Sliced Signature Index, SPIRE 2019, 2019. https://doi.org/10.1007/978-3-030-32686-9_21
  [4] K. Břinda, Novel computational techniques for mapping and classification of Next-Generation Sequencing data. PhD thesis, University of Paris-Est, 2016. https://doi.org/10.5281/zenodo.1045317
  [5] K. Břinda, M. Sykulski, and G. Kucherov, Spaced seeds improve 𝑘-mer-based metagenomic classification, Bioinformatics 31(22), 2015. https://doi.org/10.1093/bioinformatics/btv419

Authors

The project originally started in the Baym lab at Harvard Medical School and has been continuing in the Břinda group at Inria GenScale. The project was developed in collaboration with the Iqbal group at EMBL-EBI.