Metabuli: sensitive and specific metagenomic classification via joint analysis of amino acid and DNA (2024)

Data availability

The raw data for Fig. 2 and Extended Data Figs. 1–4 are provided as Source data. Performance measures at different ranks for the benchmarks of Fig. 2a,b and Extended Data Fig. 1 are available in Supplementary Tables 1–3. The assemblies used for read simulation and database creation in synthetic benchmarks are listed in Supplementary Table 4, and the simulated reads are available via Zenodo at https://doi.org/10.5281/zenodo.10250585 (ref. 28). More detailed results and the accessions used for Fig. 2c,d are provided in Supplementary Tables 5 and 6. The databases used in Fig. 2c,d were built using viral genomes (release 212) and a human genome (GCF_009914755.1) downloaded from NCBI RefSeq, and accessions of the genomes of the analyzed SARS-CoV-2 variants are listed in the ‘Pathogen detection tests’ section in Methods. Performance measures at different ranks for Fig. 2e and Extended Data Fig. 2 are provided in Supplementary Tables 7–9. Precision and recall for Extended Data Fig. 4 are available in Supplementary Table 10. The accessions of the real data analyzed in Fig. 2g,h and Extended Data Fig. 3 are listed in the ‘Benchmarks with real metagenomes’ section in Methods. CAMI2-provided datasets and taxonomy used in Fig. 2e,f and Extended Data Fig. 2 can be downloaded from https://data.cami-challenge.org/participate. Source data are provided with this paper.

Code availability

Metabuli is GPLv3-licensed free open-source software. The source code and ready-to-use binaries, as well as precomputed databases (Supplementary Table 11), can be downloaded at metabuli.steineggerlab.com. The scripts used for benchmarks and plots are available at https://github.com/jaebeom-kim/metabuli-analysis and https://github.com/jaebeom-kim/metabuli-plots.

References

  1. Ye, S. H., Siddle, K. J., Park, D. J. & Sabeti, P. C. Benchmarking metagenomics tools for taxonomic classification. Cell 178, 779–794 (2019).

  2. Nooij, S., Schmitz, D., Vennema, H., Kroneman, A. & Koopmans, M. P. Overview of virus metagenomic classification methods and their biological applications. Front. Microbiol. 9, 749 (2018).

  3. Mirdita, M., Steinegger, M., Breitwieser, F., Söding, J. & Levy Karin, E. Fast and sensitive taxonomic assignment to metagenomic contigs. Bioinformatics 37, 3029–3031 (2021).

  4. Wood, D. E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol. 20, 1–13 (2019).

  5. Breitwieser, F. P., Baker, D. N. & Salzberg, S. L. KrakenUniq: confident and fast metagenomics classification using unique k-mer counts. Genome Biol. 19, 198 (2018).

  6. Menzel, P., Ng, K. L. & Krogh, A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat. Commun. 7, 11257 (2016).

  7. Kim, D., Song, L., Breitwieser, F. P. & Salzberg, S. L. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. 26, 1721–1729 (2016).

  8. Dilthey, A. T., Jain, C., Koren, S. & Phillippy, A. M. Strain-level metagenomic assignment and compositional estimation for long reads with MetaMaps. Nat. Commun. 10, 3066 (2019).

  9. Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinform. 11, 119 (2010).

  10. Parks, D. H. et al. GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Res. 50, D785–D794 (2022).

  11. Watson, M. & Warr, A. Errors in long-read assemblies can critically affect protein prediction. Nat. Biotechnol. 37, 124–126 (2019).

  12. Meyer, F. et al. Critical assessment of metagenome interpretation: the second round of challenges. Nat. Methods 19, 429–440 (2022).

  13. Nasko, D. J., Koren, S., Phillippy, A. M. & Treangen, T. J. RefSeq database growth influences the accuracy of k-mer-based lowest common ancestor species identification. Genome Biol. 19, 1–10 (2018).

  14. Lu, J. et al. Metagenome analysis using the Kraken software suite. Nat. Protoc. 17, 2815–2839 (2022).

  15. Holtgrewe, M. Mason - A Read Simulator for Second Generation Sequencing Data. Technical Report (FU Berlin, 2010).

  16. Ono, Y., Hamada, M. & Asai, K. PBSIM3: a simulator for all types of PacBio and ONT long reads. NAR Genom. Bioinform. 4, lqac092 (2022).

  17. de la Cuesta-Zuluaga, J., Ley, R. E. & Youngblut, N. D. Struo: a pipeline for building custom databases for common metagenome profilers. Bioinformatics 36, 2314–2315 (2020).

  18. Youngblut, N. & Shen, W. nick-youngblut/gtdb_to_taxdump: Zenodo release. Zenodo https://doi.org/10.5281/zenodo.3696964 (2020).

  19. Frith, M. C. A new repeat-masking method enables specific detection of homologous sequences. Nucleic Acids Res. 39, e23 (2011).

  20. Rahaman, M. M. et al. Genomic characterization of the dominating Beta, V2 variant carrying vaccinated (Oxford-AstraZeneca) and nonvaccinated COVID-19 patient samples in Bangladesh: a metagenomics and whole-genome approach. J. Med. Virol. 94, 1670–1688 (2022).

  21. Lentini, A., Pereira, A., Winqvist, O. & Reinius, B. Monitoring of the SARS-CoV-2 Omicron BA.1/BA.2 lineage transition in the Swedish population reveals increased viral RNA levels in BA.2 cases. Med 3, 636–643 (2022).

  22. Desai, N. et al. Temporal and spatial heterogeneity of host response to SARS-CoV-2 pulmonary infection. Nat. Commun. 11, 6319 (2020).

  23. Gehrig, J. L. et al. Finding the right fit: evaluation of short-read and long-read sequencing approaches to maximize the utility of clinical microbiome data. Microb. Genom. 8, 000794 (2022).

  24. Liu, L., Yang, Y., Deng, Y. & Zhang, T. Nanopore long-read-only metagenomics enables complete and high-quality genome reconstruction from mock and complex metagenomes. Microbiome 10, 209 (2022).

  25. Barnes, S. J. et al. Metagenome-assembled genomes from photo-oxidized and nonoxidized oil-degrading marine microcosms. Microbiol. Resour. Announc. 12, 6 (2023).

  26. Priest, T., Orellana, L. H., Huettel, B., Fuchs, B. M. & Amann, R. Microbial metagenome-assembled genomes of the Fram Strait from short and long read sequencing platforms. PeerJ 9, e11721 (2021).

  27. Huang, R. et al. Long-read metagenomics of marine microbes reveals diversely expressed secondary metabolites. Microbiol. Spectr. 11, e0150123 (2023).

  28. Kim, J. Simulated query reads used for benchmarks in Metabuli publication. Zenodo https://doi.org/10.5281/zenodo.10250585 (2023).

Acknowledgements

The authors thank E. Levy Karin for the valuable scientific feedback and the careful review and revision of the paper; J. Söding for the discussions on metamer encoding; M. Mirdita for the usability improvements of the software; H. Kim for the improvement of figures; S. Jaenicke for the voluntary examination of the software; and M. Kim for the feedback on the paper. M.S. acknowledges support by the National Research Foundation of Korea grants (2020M3-A9G7-103933, 2021-R1C1-C102065 and 2021-M3A9-I4021220), the Samsung DS research fund, and the Creative-Pioneering Researchers Program and AI-Bio Research Grant through Seoul National University.

Author information

Authors and Affiliations

  1. Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea

    Jaebeom Kim & Martin Steinegger

  2. School of Biological Sciences, Seoul National University, Seoul, Republic of Korea

    Martin Steinegger

  3. Institute of Molecular Biology and Genetics, Seoul National University, Seoul, Republic of Korea

    Martin Steinegger

  4. Artificial Intelligence Institute, Seoul National University, Seoul, Republic of Korea

    Martin Steinegger

Authors

  1. Jaebeom Kim

  2. Martin Steinegger

Contributions

J.K. and M.S. designed the research, developed the software, performed analysis and wrote the paper.

Corresponding author

Correspondence to Martin Steinegger.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks André Soares and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Lei Tang, in collaboration with the Nature Methods team. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Synthetic benchmark results.

Simulated short (Illumina) and long (PacBio HiFi, ONT, and PacBio Sequel II) reads were used for performance evaluation based on GTDB genomes and taxonomy. Hybrid = (x, y) is the result of applying the DNA-based tool x, followed by the AA-based tool y, where both are the best-performing. a–d, Subspecies-level classification tests. Reads were simulated from subspecies present in databases, and precision and recall were measured at subspecies rank. a) Hybrid = (KrakenUniq, Kraken2X). b–d) Hybrid = (MetaMaps, Kraken2X). Raw data for performance measurements at subspecies, species, genus, and family ranks are available in Supplementary Table 1. e–h, Species-level classification tests. Not the queried subspecies but their sibling subspecies were contained in databases to measure species-level classification. Hybrid = (KrakenUniq, Kraken2X). Raw data for performance measurements at species, genus, family, and order ranks are available in Supplementary Table 2. i–l, Genus-level classification tests. Not the queried species but their sibling species were contained in databases, so how well each tool can detect homology within the same genus was measured. i) Hybrid = (Kraken2, MMseqs2). j–l) Hybrid = (Kraken2, Kraken2X). Raw data for performance measurements at genus, family, order, and class ranks are available in Supplementary Table 3.

Source data

Extended Data Fig. 2 Benchmarks using CAMI2’s strain-madness, marine, and plant-associated datasets.

GTDB genomes and the CAMI2-provided taxonomy were used for the database creation. CAMI2-provided short reads of strain-madness (a), marine (b), and plant-associated (c) datasets were classified by each tool, and the average values of the metrics that were measured at the species and genus rank for each sample were plotted. Raw data and metrics for each sample are available in Supplementary Tables 7–9.

Source data

Extended Data Fig. 3 Comparison of Metabuli to best performing AA- and DNA-based tools on real long-read metagenomic samples.

In contrast to Fig. 2g–h, Kraken2X instead of Kaiju is utilized due to its superior performance on long reads. The databases were built using GTDB genomes and a human genome (T2T-CHM13v2.0) based on GTDB taxonomy edited to include a human taxon. Real nanopore sequencing data from human gut (a) and marine (b) environments, as well as PacBio HiFi reads from human gut (c) and marine (d) environments, were classified by each tool. The area is proportional to the number of reads within each panel. The proportion of reads classified by each tool is denoted in parentheses.

Source data

Extended Data Fig. 4 Subspecies-level classification performance by clade size.

All 2,382 query subspecies used in Extended Data Fig. 1a were divided into groups according to the number of subspecies siblings they had in the reference database, that is, by their species clade size. The average F1 score for queries in each group decreases as the clade’s size increases, indicating that more sibling subspecies pose a harder classification challenge to all tools. Precision and recall are available in Supplementary Table 10.
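The grouping described in this caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' analysis script: it assumes F1 is the harmonic mean of precision and recall, that each query has its own precision and recall, and that groups are keyed by exact clade size (the published figure may bin sizes differently). The function names, the input layout and the sample records are hypothetical.

```python
from collections import defaultdict


def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mean_f1_by_clade_size(queries):
    """Average F1 per clade size.

    `queries` is an iterable of (clade_size, precision, recall) tuples,
    one per query subspecies (hypothetical input layout).
    """
    groups = defaultdict(list)
    for clade_size, precision, recall in queries:
        groups[clade_size].append(f1_score(precision, recall))
    # Sort by clade size so the trend can be read off in order.
    return {size: sum(vals) / len(vals) for size, vals in sorted(groups.items())}


# Made-up records for illustration only; not data from the paper.
example = [(2, 1.0, 1.0), (2, 0.5, 0.5), (8, 0.5, 0.25)]
result = mean_f1_by_clade_size(example)
```

Reading the resulting dictionary in key order gives the per-group averages from the smallest clade to the largest, which is the relationship the figure plots.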

Source data

Supplementary information

Supplementary Information

Supplementary Figs. 1–7.

Supplementary Tables

Supplementary Tables 1–10. Raw data and utilized accessions of Fig. 2a–e and Extended Data Figs. 1, 2 and 4. Supplementary Table 11. A list of provided prebuilt databases.

Source data

Source Data Fig. 2

Statistical source data.

Source Data Extended Data Fig. 1

Statistical source data.

Source Data Extended Data Fig. 2

Statistical source data.

Source Data Extended Data Fig. 3

Statistical source data.

Source Data Extended Data Fig. 4

Statistical source data.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Kim, J., Steinegger, M. Metabuli: sensitive and specific metagenomic classification via joint analysis of amino acid and DNA. Nat Methods (2024). https://doi.org/10.1038/s41592-024-02273-y

  • DOI: https://doi.org/10.1038/s41592-024-02273-y
