Bacterial Genome Downloading Made Easy with NCBI Tools

daivessparcernstum
Aug 2, 2023

How to use ncbi-genome-download to download bacterial genomes from NCBI

If you need to download bacterial genomes from the National Center for Biotechnology Information (NCBI) FTP servers, the ncbi-genome-download tool is a convenient option. It is a Python command-line script that selects genomes by criteria such as NCBI taxonomy ID, assembly accession, assembly level, RefSeq category, and organism name. You can also choose which file formats to download, such as GenBank, FASTA, protein FASTA, or the assembly report. This article shows how to install ncbi-genome-download and use it to download bacterial genomes from NCBI.

What is ncbi-genome-download and why use it?

A brief introduction to ncbi-genome-download

ncbi-genome-download is a Python script created by Kai Blin, a bioinformatician and software developer at the Novo Nordisk Foundation Center for Biosustainability. The idea was inspired by Mick Watson's Kraken downloader scripts, which are written in Perl and geared toward building a Kraken database. ncbi-genome-download, by contrast, focuses on the genome downloads themselves and supports a range of formats and filters. The tool is open source and available on GitHub.

The benefits of using ncbi-genome-download

There are several benefits of using ncbi-genome-download over other methods of downloading genomes from NCBI. Some of them are:

It is easy to install and use. You can install it using pip or conda, and run it from the command line with simple options.

It is flexible and customizable. You can filter genomes by NCBI taxonomy ID, assembly accession, assembly level, RefSeq category, genus or organism name, and more. You can also choose which file formats to download, such as GenBank, FASTA, protein FASTA, or the assembly report.

It is fast and efficient. You can run multiple downloads in parallel using the --parallel option. After an interruption, rerunning the same command picks up where it left off, because files that are already present with a matching checksum are not downloaded again.

It is actively maintained. The tool is updated to track changes in the NCBI FTP layout and the available genome data, and you can report issues or suggest features on GitHub.

How to install ncbi-genome-download

Using pip

If you have Python installed on your system, you can use pip to install ncbi-genome-download. Pip is a package manager for Python that installs packages from PyPI, the Python Package Index. To install ncbi-genome-download using pip, run the following command:

pip install ncbi-genome-download

If this fails on older versions of Python, try updating your pip tool first:

pip install --upgrade pip

and then rerun the ncbi-genome-download install.

Using conda

If you prefer conda, a package manager for Python and other languages that installs packages from channels, you can install ncbi-genome-download with it instead. Conda ships with Anaconda, a distribution of Python and other tools for data science and machine learning. To install ncbi-genome-download using conda, run the following command:

conda install -c bioconda ncbi-genome-download

This installs ncbi-genome-download from bioconda, a community-driven channel that provides bioinformatics packages for conda.
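After either install route, it can be handy to confirm that the executable is reachable before scripting around it. The sketch below uses Python's standard shutil.which; the helper name find_tool is made up for this example.

```python
import shutil


def find_tool(name="ncbi-genome-download"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool()
    if path is None:
        print("ncbi-genome-download not found; install it with pip or conda first")
    else:
        print(f"found ncbi-genome-download at {path}")
```

If the tool is reported as missing right after a pip install, the usual cause is that pip's script directory is not on your PATH.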

How to download bacterial genomes by different criteria

Using taxonomy IDs

One of the most common ways to select bacterial genomes is by NCBI taxonomy ID. ncbi-genome-download does not accept a taxon name such as "Firmicutes" directly, and the group (here, bacteria) is given as a positional argument at the end of the command rather than through an option. For example, to download all RefSeq genomes that belong to the species Escherichia coli (species taxonomy ID 562), you can use the following command:

ncbi-genome-download --section refseq --species-taxids 562 bacteria

The refseq section contains genomes that NCBI has curated and annotated as reference sequences. You can use --section genbank instead to download from the genbank section, which contains all the genome assemblies submitted to NCBI. Note, however, that some genomes in the genbank section may be duplicated or incomplete.

The --taxids option matches the taxonomy ID assigned to the assembly itself rather than its species. For example, E. coli K-12 MG1655 has the taxonomy ID 511145, so you can use the following command:

ncbi-genome-download --section refseq --taxids 511145 bacteria

Both options accept a comma-separated list of IDs. Because --taxids matches exact IDs, selecting a whole higher-rank group such as the phylum Firmicutes (taxonomy ID 1239) requires the IDs of its descendant taxa; the project ships a helper script, gimme_taxa.py, for collecting them. You can find the taxonomy ID of any group by using the NCBI Taxonomy Browser.
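When the list of taxonomy IDs comes from a spreadsheet or another script, it helps to validate it before building the command. The following is a minimal sketch, assuming the --taxids and --species-taxids filters and the positional group argument listed in the tool's --help; the function name taxid_command is hypothetical.

```python
def taxid_command(taxids, group="bacteria", section="refseq", species_level=False):
    """Build the argument list for a taxonomy-ID-filtered download."""
    cleaned = []
    for taxid in taxids:
        text = str(taxid).strip()
        # NCBI taxonomy IDs are plain positive integers.
        if not text.isdigit():
            raise ValueError(f"not a numeric NCBI taxonomy ID: {taxid!r}")
        cleaned.append(text)
    if not cleaned:
        raise ValueError("at least one taxonomy ID is required")
    # Assumption: --species-taxids filters on the species, --taxids on the assembly's own ID.
    flag = "--species-taxids" if species_level else "--taxids"
    return ["ncbi-genome-download", "--section", section, flag, ",".join(cleaned), group]


if __name__ == "__main__":
    print(" ".join(taxid_command([562], species_level=True)))
```

Returning a list rather than a string means the result can be handed to subprocess.run without any shell quoting.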

Using assembly accession

If you want to download a specific genome or set of genomes, you can pass their assembly accessions to the --assembly-accessions option, separated by commas. For example, Escherichia coli K-12 MG1655 has the assembly accession GCF_000005845.2, so you can download it with this command:

ncbi-genome-download --section refseq --assembly-accessions GCF_000005845.2 bacteria

This downloads only the genome of E. coli K-12 MG1655 from the refseq section. The tool has no option for filtering by BioProject accession (PRJNA57779 for this genome); to work from a BioProject, first look up its assembly accessions in the NCBI Assembly Database or the NCBI BioProject Database, then pass them to --assembly-accessions.
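Assembly accessions carry a prefix that tells you which section they live in: GCF_ for RefSeq assemblies and GCA_ for GenBank assemblies. The sketch below checks the accession format and picks the matching --section value; the helper name section_for_accession is made up for illustration.

```python
import re

# Prefix, underscore, nine digits, a dot, and a version number.
ACCESSION_RE = re.compile(r"^(GCF|GCA)_\d{9}\.\d+$")


def section_for_accession(accession):
    """Return 'refseq' for GCF_ accessions and 'genbank' for GCA_ accessions."""
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        raise ValueError(f"not a versioned assembly accession: {accession!r}")
    return "refseq" if match.group(1) == "GCF" else "genbank"


if __name__ == "__main__":
    print(section_for_accession("GCF_000005845.2"))
```

Requesting a GCA_ accession from the refseq section (or the reverse) is an easy mistake to make, so a check like this is worth running before a long batch.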

Using assembly level or refseq category

If you want to filter the genomes by their assembly level or RefSeq category, you can use the --assembly-levels or --refseq-categories options. The assembly level indicates how complete and contiguous a genome assembly is, and it can be one of these values: complete, chromosome, scaffold, or contig. The RefSeq category indicates whether NCBI has designated the assembly as a reference or representative genome, and it can be one of these values: reference, representative, or na. For example, if you want to download only the complete genomes of bacteria that are reference sequences from the refseq section, you can use this command:

ncbi-genome-download --section refseq --assembly-levels complete --refseq-categories reference bacteria

This downloads only the genomes that meet both criteria. Each of these two options also accepts multiple values separated by commas. The --section option, however, takes a single value (refseq or genbank), so covering both sections means running the command once per section. For example, if you want to download all the bacterial genomes that are either complete or chromosome-level assemblies from the refseq section, you can use this command:

ncbi-genome-download --section refseq --assembly-levels complete,chromosome bacteria

This downloads every genome whose assembly level is either complete or chromosome. Repeat the command with --section genbank to fetch the GenBank assemblies as well.
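To cover both sections with the same filters, one approach is to generate one command per section. This is a sketch under the assumption that --section takes a single value per run and that the level filter is spelled --assembly-levels, which is worth checking against --help for your installed version; commands_per_section is a hypothetical helper.

```python
ASSEMBLY_LEVELS = ("complete", "chromosome", "scaffold", "contig")
SECTIONS = ("refseq", "genbank")


def commands_per_section(levels, group="bacteria", sections=SECTIONS):
    """Return one argument list per section, all sharing the same level filter."""
    unknown = [level for level in levels if level not in ASSEMBLY_LEVELS]
    if unknown:
        raise ValueError(f"unknown assembly level(s): {unknown}")
    level_arg = ",".join(levels)
    return [
        ["ncbi-genome-download", "--section", section,
         "--assembly-levels", level_arg, group]
        for section in sections
    ]


if __name__ == "__main__":
    for cmd in commands_per_section(["complete", "chromosome"]):
        print(" ".join(cmd))
```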

Using genus or species name

If you want to download genomes by genus or species name, you can use the --genera option. For example, if you want to download all the genomes of bacteria that belong to the genus Bacillus from the refseq section, you can use this command:

ncbi-genome-download --section refseq --genera Bacillus bacteria

This downloads all the genomes of Bacillus species from the refseq section. There is no separate --species option; --genera performs a simple string match on the organism name recorded by NCBI, so you can pass a species name as well. For example, if you want to download the genomes of Bacillus subtilis, a model organism for bacterial genetics and physiology, you can use this command:

ncbi-genome-download --section refseq --genera "Bacillus subtilis" bacteria

This downloads every RefSeq genome whose organism name matches Bacillus subtilis. Note that you need quotation marks around a name that contains spaces. To fetch a single strain such as B. subtilis 168, select it by assembly accession or by its taxonomy ID instead. You can find the genus and species names of any genome by using the NCBI Genome Database.

How to choose the formats and files to download

Using the --formats option

By default, ncbi-genome-download downloads the GenBank format file for each genome, which contains the nucleotide sequences and annotations of the genomic features. You can choose other formats, such as fasta, protein-fasta, gff, or assembly-report, with the --formats option, giving one or more format names separated by commas. For example, if you want to download both the GenBank and FASTA format files for each genome, you can use this command:

ncbi-genome-download --section refseq --formats genbank,fasta bacteria

This downloads both the .gbff.gz and .fna.gz files for each genome from the refseq section; the files are gzip-compressed. The .gbff files contain the GenBank format data, and the .fna files contain the FASTA format data, which holds only the nucleotide sequences without annotations. You can find the full list of supported formats in the output of ncbi-genome-download --help or on GitHub.
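Once FASTA files have been downloaded, a quick sanity check is to count the sequence records in each one. The sketch below assumes the files are gzip-compressed and named with the NCBI suffix _genomic.fna.gz somewhere under an output directory; the exact directory layout is an assumption, so the search is recursive. The function names are hypothetical.

```python
import gzip
from pathlib import Path


def count_fasta_records(path):
    """Count header lines (starting with '>') in a gzip-compressed FASTA file."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


def summarize_downloads(root):
    """Map each genomic FASTA file name under root to its record count."""
    summary = {}
    for path in sorted(Path(root).rglob("*_genomic.fna.gz")):
        # Skip CDS and RNA files, whose names end in *_from_genomic.fna.gz.
        if "_from_genomic" in path.name:
            continue
        summary[path.name] = count_fasta_records(path)
    return summary


if __name__ == "__main__":
    for name, count in summarize_downloads(".").items():
        print(f"{name}\t{count}")
```

A complete bacterial assembly typically has only a few records (chromosome plus plasmids), whereas a contig-level assembly can have hundreds, so the counts give a rough feel for assembly quality.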

Downloading feature tables and other files

Sometimes you may want additional files beyond the sequence data, such as feature tables or assembly statistics. These are also selected through --formats rather than through a separate option; the feature table, for instance, is the features format. For example, if you want to download both the GenBank format files and the feature table files for each genome, you can use this command:

ncbi-genome-download --section refseq --formats genbank,features bacteria

This downloads both the .gbff.gz file and the _feature_table.txt.gz file for each genome from the refseq section. The feature table is a tab-delimited file that summarizes the genomic features and their locations. You can find the list of all available format names in the output of ncbi-genome-download --help or on GitHub.

How to run multiple downloads in parallel

Using the --parallel option

To speed things up, you can run several downloads in parallel with the --parallel option, which takes the number of parallel processes. For example, if you want to download all the genomes of bacteria from the refseq section using 8 parallel processes, you can use this command:

ncbi-genome-download --section refseq --parallel 8 bacteria

Parallel downloads increase network bandwidth usage and CPU load, so choose a value that suits your system and connection.
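One way to pick the process count is to cap it at the number of CPUs the machine reports. The sketch below does that and, by default, adds the tool's --dry-run flag so the first run only lists what would be fetched. The cap is a rule of thumb rather than a recommendation from the tool's authors, and parallel_command is a hypothetical helper.

```python
import os


def parallel_command(group="bacteria", section="refseq", max_workers=8, dry_run=True):
    """Build a download command with a worker count capped at the CPU count."""
    workers = max(1, min(max_workers, os.cpu_count() or 1))
    cmd = ["ncbi-genome-download", "--section", section, "--parallel", str(workers)]
    if dry_run:
        # --dry-run only checks which files would be downloaded.
        cmd.append("--dry-run")
    cmd.append(group)
    return cmd


if __name__ == "__main__":
    print(" ".join(parallel_command()))
```

Because downloads are mostly limited by the network rather than the CPU, it can be worth experimenting with smaller values as well.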

Conclusion and FAQs

In this article, we have shown you how to use ncbi-genome-download to download bacterial genomes from NCBI by various criteria, such as taxonomy ID, assembly accession, assembly level, RefSeq category, and genus or species name. We have also shown you how to choose the file formats to download, such as GenBank, FASTA, and feature tables. Finally, we have shown you how to run multiple downloads in parallel using the --parallel option.

We hope that this article has been helpful and informative for you. If you have any questions or comments about ncbi-genome-download or downloading bacterial genomes from NCBI in general, please feel free to leave them below. We will try our best to answer them as soon as possible.

Here are some frequently asked questions (FAQs) about ncbi-genome-download and downloading bacterial genomes from NCBI:

Q: How can I update ncbi-genome-download?

A: If you installed ncbi-genome-download using pip or conda, you can update it with the same tool: pip takes the --upgrade option, while conda has a separate update command. For example:

pip install --upgrade ncbi-genome-download

conda update -c bioconda ncbi-genome-download

This will update ncbi-genome-download to the latest version available on PyPI or bioconda.
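To see which version is currently installed before or after updating, you can query the package metadata from Python. This is a small sketch using the standard importlib.metadata module (Python 3.8 or newer); installed_version is a made-up helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="ncbi-genome-download"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(found if found is not None else "ncbi-genome-download is not installed")
```

Note that this inspects the Python environment running the script, so run it with the same interpreter or conda environment that holds the tool.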

Q: How can I resume interrupted downloads?

A: ncbi-genome-download has no dedicated resume option. If your download is interrupted, for example by a network failure or a system crash, rerun the same command. For example, if you were downloading all the genomes of bacteria from the refseq section using 8 parallel processes, run this command again:

ncbi-genome-download --section refseq --parallel 8 bacteria

Files that are already present with a matching checksum are skipped, so only the missing files are downloaded. You can also add the --retries option so that the tool retries a download when the connection to NCBI fails.
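For unattended runs, rerunning after a failure can be automated with a small wrapper. The sketch below reruns any command until it exits successfully, up to a fixed number of attempts; it assumes that a rerun of ncbi-genome-download skips files it has already fetched, and run_with_retries is a hypothetical helper rather than part of the tool.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=3, wait_seconds=30):
    """Run cmd until it exits with status 0; return the attempt number that succeeded."""
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            # Give a flaky connection some time to recover before retrying.
            time.sleep(wait_seconds)
    raise RuntimeError(f"command still failing after {attempts} attempt(s): {cmd}")


# Example call (not executed here, since it would start a large download):
# run_with_retries(["ncbi-genome-download", "--section", "refseq",
#                   "--parallel", "8", "bacteria"])
```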

Q: How can I download genomes from other groups, such as archaea, fungi, viruses, etc.?

A: You can download genomes from other groups by giving the group name as the positional argument at the end of the command. Supported groups include archaea, fungi, viral, plant, and protozoa. For example, if you want to download all the genomes of archaea from the refseq section, you can use this command:

ncbi-genome-download --section refseq archaea

This downloads all the genomes of archaea from the refseq section. You can find the list of all supported groups in the output of ncbi-genome-download --help or on GitHub.

Q: How can I download genomes from other domains, such as eukaryotes or prokaryotes?

A: There is no option for selecting a whole domain. Instead, pass a comma-separated list of groups. Prokaryotes correspond to the bacteria and archaea groups, while eukaryotes are split across groups such as fungi, invertebrate, plant, protozoa, vertebrate_mammalian, and vertebrate_other. For example, if you want to download all the prokaryotic genomes from the refseq section, you can use this command:

ncbi-genome-download --section refseq bacteria,archaea

This downloads all the genomes of bacteria and archaea from the refseq section. You can find the list of all supported groups on GitHub.
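If you often think in terms of domains, a small lookup can expand them into group names. This is a sketch, assuming the tool selects genomes by group names passed as one comma-separated positional argument; the "prokaryotes" and "eukaryotes" keys are conveniences invented for this example, not values the tool itself understands.

```python
SUPPORTED_GROUPS = (
    "all", "archaea", "bacteria", "fungi", "invertebrate", "metagenomes",
    "plant", "protozoa", "vertebrate_mammalian", "vertebrate_other", "viral",
)

# Hypothetical shortcuts that expand to real group names.
DOMAIN_GROUPS = {
    "prokaryotes": ("archaea", "bacteria"),
    "eukaryotes": ("fungi", "invertebrate", "plant", "protozoa",
                   "vertebrate_mammalian", "vertebrate_other"),
}


def groups_argument(names):
    """Expand shortcuts, validate group names, and join them with commas."""
    groups = []
    for name in names:
        for group in DOMAIN_GROUPS.get(name, (name,)):
            if group not in SUPPORTED_GROUPS:
                raise ValueError(f"unsupported group: {group!r}")
            if group not in groups:
                groups.append(group)
    if not groups:
        raise ValueError("at least one group is required")
    return ",".join(groups)


if __name__ == "__main__":
    print(groups_argument(["prokaryotes"]))
```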

Q: How can I download only representative or reference genomes?

A: Representative and reference are RefSeq categories, not sections; the --section option accepts only refseq or genbank. To select them, use the --refseq-categories option. For example, if you want to download all the bacterial genomes that are representative sequences, you can use this command:

ncbi-genome-download --section refseq --refseq-categories representative bacteria

This downloads the bacterial genomes that NCBI has selected as representative of their taxonomic groups. Use --refseq-categories reference instead to download reference genomes, which NCBI selects as reference standards for their species. You can find the list of supported options in the output of ncbi-genome-download --help or on GitHub.
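Before starting a download, you can estimate how many genomes fall into each category from NCBI's assembly_summary.txt file, which lists one assembly per tab-separated row with the RefSeq category in the fifth column. The sketch below tallies that column from any iterable of lines; tally_refseq_categories is a hypothetical helper, and the rows in the demo are invented placeholders rather than real records.

```python
from collections import Counter

REFSEQ_CATEGORY_COLUMN = 4  # zero-based index of the refseq_category column


def tally_refseq_categories(lines):
    """Count assemblies per refseq_category, skipping comment and blank lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > REFSEQ_CATEGORY_COLUMN:
            counts[fields[REFSEQ_CATEGORY_COLUMN]] += 1
    return counts


if __name__ == "__main__":
    # Made-up rows for illustration only.
    demo = [
        "# assembly_accession\tbioproject\tbiosample\twgs_master\trefseq_category",
        "GCF_000000001.1\tPRJNA1\tSAMN1\t\treference genome",
        "GCF_000000002.1\tPRJNA2\tSAMN2\t\tna",
    ]
    print(dict(tally_refseq_categories(demo)))
```

To use it on a real file, open the summary with open(path, encoding="utf-8") and pass the file object to the function.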