How to use ncbi-genome-download to download bacterial genomes from NCBI
If you are interested in downloading bacterial genomes from the National Center for Biotechnology Information (NCBI) FTP servers, you might find the ncbi-genome-download tool very useful. This tool is a Python script that allows you to download genomes from NCBI by various criteria, such as taxonomic name, assembly accession, assembly level, refseq category, and more. You can also choose the formats and files to download, such as GenBank, FASTA, protein, assembly report, etc. In this article, we will show you how to install and use ncbi-genome-download to download bacterial genomes from NCBI.
What is ncbi-genome-download and why use it?
A brief introduction to ncbi-genome-download
ncbi-genome-download is a Python script that was created by Kai Blin, a bioinformatician and software developer at the Novo Nordisk Foundation Center for Biosustainability. The idea was inspired by Mick Watson's Kraken downloader scripts, which are written in Perl and specific to building a Kraken database. However, ncbi-genome-download focuses on the actual genome downloading and supports different formats and criteria. The tool is open source and available on GitHub.
The benefits of using ncbi-genome-download
There are several benefits of using ncbi-genome-download over other methods of downloading genomes from NCBI. Some of them are:
It is easy to install and use. You can install it using pip or conda, and run it from the command line with simple options.
It is flexible and customizable. You can download genomes by different criteria, such as taxonomic name, assembly accession, assembly level, refseq category, genera, species, etc. You can also choose the formats and files to download, such as GenBank, FASTA, protein, assembly report, etc.
It is fast and efficient. You can run multiple downloads in parallel using the --parallel option. If a download is interrupted, you can run the same command again, and files that are already present and match the checksums published by NCBI should not be downloaded a second time.
It is updated and maintained. The tool is regularly updated to reflect the changes in the NCBI FTP servers and the available genome data. You can also report issues or suggest features on GitHub.
How to install ncbi-genome-download
Using pip
If you have Python installed on your system, you can use pip to install ncbi-genome-download. Pip is a package manager for Python that allows you to install packages from PyPI, the Python Package Index. To install ncbi-genome-download using pip, run the following command:
pip install ncbi-genome-download
If this fails on older versions of Python, try updating your pip tool first:
pip install --upgrade pip
and then rerun the ncbi-genome-download install.
Using conda
If you prefer to use conda, a package manager for Python and other languages that allows you to install packages from various channels, you can also install ncbi-genome-download using conda. Conda ships with Anaconda, a distribution of Python and other tools for data science and machine learning, and with the smaller Miniconda installer. To install ncbi-genome-download using conda, run the following command:
conda install -c bioconda ncbi-genome-download
This will install ncbi-genome-download from the bioconda channel, which is a community-driven channel that provides bioinformatics packages for conda.
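Whichever installer you used, it can help to confirm that the command is available before starting a large download. The following is a minimal sketch: ngd_status is only an illustrative variable name, and the check assumes that the tool provides a --version flag.

```shell
# Check that ncbi-genome-download is on the PATH.
# ngd_status is an illustrative variable name, not part of the tool.
if command -v ncbi-genome-download >/dev/null 2>&1; then
    ngd_status="installed"
    # Print the installed version (assumes the --version flag).
    ncbi-genome-download --version
else
    ngd_status="missing"
    echo "ncbi-genome-download not found; install it with pip or conda first" >&2
fi
echo "ncbi-genome-download is $ngd_status"
```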
How to download bacterial genomes by different criteria
Using taxonomic name or ID
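One of the most common ways to download bacterial genomes from NCBI is by the taxonomy ID of the organism of interest. ncbi-genome-download has no option for filtering by an arbitrary taxonomic name, but it offers two taxonomy ID filters: --species-taxids matches the species-level ID, and --taxids matches the ID of the individual organism or strain. The taxonomic group (here bacteria) is given as a positional argument at the end of the command. For example, if you want to download all the genomes of Escherichia coli, whose species taxonomy ID is 562, you can use the following command:
ncbi-genome-download --section refseq --species-taxids 562 bacteria
This will download every E. coli genome from the refseq section of the NCBI FTP servers. The refseq section contains the curated and annotated genomes that NCBI maintains as reference sequences, and it is the default, so --section refseq can be omitted. You can also use the --section genbank option to download genomes from the genbank section, which contains all the genome assemblies submitted to NCBI. However, note that genbank assemblies are not curated to the same degree, so some of them may be incomplete or unannotated.
To download a single strain, pass its own taxonomy ID to --taxids. For example, the taxonomy ID of E. coli K-12 MG1655 is 511145, so you can use the following command:
ncbi-genome-download --section refseq --taxids 511145 bacteria
Both options accept a comma-separated list of IDs. Neither of them expands a higher rank on its own, so passing the ID of the phylum Firmicutes (1239) is not expected to match the species beneath it. For that case, the project's contrib directory on GitHub provides a helper script, gimme_taxa.py, that lists the taxonomy IDs below a given rank so that you can pass them to --taxids. You can find the taxonomic ID of any group by using the NCBI Taxonomy Browser.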
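When you need several taxonomy IDs, a file can be easier to maintain than a long comma-separated list. The sketch below assumes, following the project's README, that --species-taxids also accepts the path of a file with one ID per line, and that --dry-run only lists the matching genomes without downloading them. The file name, the preview_species helper, and the second ID (1423, Bacillus subtilis) are illustrative choices.

```shell
# Write one species taxonomy ID per line (562 = E. coli, 1423 = B. subtilis).
taxid_file="species_taxids.txt"
printf '%s\n' 562 1423 > "$taxid_file"

# Hypothetical helper: preview the matching RefSeq genomes for a taxid file.
# $1: path of a file with one species taxonomy ID per line
preview_species() {
    ncbi-genome-download --section refseq --species-taxids "$1" --dry-run bacteria
}

# Usage: preview_species "$taxid_file"
```

Once the preview looks right, the same command without --dry-run starts the actual download.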
Using assembly accession or BioProject accession
If you want to download a specific genome or a set of genomes by their assembly accession, you can use the --assembly-accessions option. For example, if you want to download the genome of Escherichia coli K-12 MG1655, which has the assembly accession GCF_000005845.2, you can use this command:
ncbi-genome-download --section refseq --assembly-accessions GCF_000005845.2 bacteria
This will download only the genome of E. coli K-12 MG1655 from the refseq section. To download several genomes at once, pass a comma-separated list of accessions. The tool does not appear to offer a BioProject filter, so if you start from a BioProject accession such as PRJNA57779, look up its assembly accessions first and pass those instead. You can find the assembly accession and BioProject accession of any genome by using the NCBI Assembly Database or the NCBI BioProject Database.
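Because the accession prefix tells you which section an assembly lives in (GCF_ for RefSeq, GCA_ for GenBank), a small check before downloading can catch typos and mismatched --section values. This is a sketch only; both function names are made up for illustration.

```shell
# Succeeds if $1 looks like an assembly accession, e.g. GCF_000005845.2.
is_assembly_accession() {
    printf '%s\n' "$1" | grep -Eq '^GC[AF]_[0-9]{9}\.[0-9]+$'
}

# Prints the section an accession belongs to: GCF_ is RefSeq, GCA_ is GenBank.
section_for_accession() {
    case "$1" in
        GCF_*) echo refseq ;;
        GCA_*) echo genbank ;;
        *) return 1 ;;
    esac
}

acc="GCF_000005845.2"
if is_assembly_accession "$acc"; then
    echo "$acc -> --section $(section_for_accession "$acc")"
fi
```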
Using assembly level or refseq category
If you want to filter the genomes by their assembly level or refseq category, you can use the --assembly-levels or --refseq-categories options (some older releases spelled these in the singular, so check ncbi-genome-download --help for your version). The assembly level indicates how complete and contiguous a genome assembly is, and it can be one of these values: complete, chromosome, scaffold, or contig. The refseq category indicates how representative and reliable a genome sequence is, and it can be one of these values: reference, representative, or na. For example, if you want to download only the complete genomes of bacteria that are reference sequences from the refseq section, you can use this command:
ncbi-genome-download --section refseq --assembly-levels complete --refseq-categories reference bacteria
This will download only the genomes that meet both criteria. You can also pass multiple values to each of these two options by separating them with commas. For example, if you want to download all the genomes of bacteria that are either complete or chromosome level assemblies, you can use this command:
ncbi-genome-download --section refseq --assembly-levels complete,chromosome bacteria
This will download all the genomes that match either assembly level. Note that --section takes a single value, refseq or genbank, so to cover both sections you need to run the command once for each.
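Assuming that --section accepts only one value per run (refseq or genbank), covering both sections means running the tool twice. The loop below is one way to sketch that. The NGD variable and the download_both_sections function are illustrative names; setting NGD=echo prints the arguments instead of running the tool.

```shell
# NGD can be overridden (e.g. NGD=echo) to print the arguments instead of downloading.
NGD="${NGD:-ncbi-genome-download}"

# Run the same assembly-level filter against refseq and genbank in turn.
# Extra arguments (such as --dry-run) are passed through unchanged.
download_both_sections() {
    for section in refseq genbank; do
        "$NGD" --section "$section" --assembly-levels complete,chromosome "$@" bacteria || return 1
    done
}

# Usage: download_both_sections --dry-run
```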
Using genera or species name
If you want to download genomes by their genus or species name, you can use the --genera option. For example, if you want to download all the genomes of bacteria that belong to the genus Bacillus from the refseq section, you can use this command:
ncbi-genome-download --section refseq --genera Bacillus bacteria
This will download all the genomes of Bacillus species from the refseq section. The option is a simple string match on the organism name provided by NCBI, so you can also narrow it down to a species by giving the full species name. For example, if you want to download only the genomes of Bacillus subtilis, which is a model organism for bacterial genetics and physiology, you can use this command:
ncbi-genome-download --section refseq --genera "Bacillus subtilis" bacteria
This will download the B. subtilis genomes from the refseq section. Note that you need to use quotation marks around the name if it contains spaces. To download a single strain such as B. subtilis 168, it is more reliable to pass its taxonomy ID (224308) to --taxids, because strain names are often written differently in NCBI's organism names. You can find the genera and species names of any genome by using the NCBI Genome Database.
How to choose the formats and files to download
Using the --formats option
By default, ncbi-genome-download will download the GenBank format files for each genome, which contain the nucleotide sequences and annotations of the genomic features. However, you can also choose other formats to download, such as fasta, protein-fasta, gff, or assembly-report. To do this, you can use the --formats option and specify one or more formats separated by commas, or use all to download every format. For example, if you want to download both the GenBank and FASTA format files for each genome, you can use this command:
ncbi-genome-download --section refseq --formats genbank,fasta bacteria
This will download both the .gbff.gz and .fna.gz files for each genome from the refseq section. The .gbff files contain the GenBank format data, and the .fna files contain the FASTA format data, that is, the nucleotide sequences without annotations. Both are gzip-compressed, so decompress them before passing them to tools that cannot read gzip input. You can find a list of all the supported formats in the output of ncbi-genome-download --help or on GitHub.
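After a download finishes, you will usually want to locate and decompress the files. The sketch below assumes the default layout, in which each genome lands in its own directory below the section and group (for example refseq/bacteria/GCF_000005845.2/), and that the files are gzip-compressed. Both function names are illustrative.

```shell
# List downloaded files with a given suffix below an output directory.
# $1: directory to search (e.g. refseq/bacteria), $2: suffix (e.g. _genomic.fna.gz)
list_downloaded() {
    find "$1" -type f -name "*$2" | sort
}

# Decompress every genomic FASTA file, keeping the compressed originals.
decompress_fasta() {
    list_downloaded "$1" "_genomic.fna.gz" | while IFS= read -r f; do
        gunzip -c "$f" > "${f%.gz}"
    done
}

# Usage: decompress_fasta refseq/bacteria
```

Keeping the .gz originals means that a later run of the downloader can still compare them against the checksums from NCBI.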
Downloading feature tables and other files
Sometimes, you may want to download additional files that go beyond sequence data, such as feature tables or assembly reports. ncbi-genome-download has no separate --include option for these; they are requested through --formats like everything else. For example, if you want to download both the GenBank format files and the feature table files for each genome, you can use this command:
ncbi-genome-download --section refseq --formats genbank,features bacteria
This will download both the _genomic.gbff.gz and _feature_table.txt.gz files for each genome from the refseq section. The feature tables are tab-delimited files that summarize the genomic features and their locations. Run ncbi-genome-download --help to see every format name that your installed version accepts.
How to run multiple downloads in parallel
Using the --parallel option
If you want to speed up your downloads by running multiple downloads in parallel, you can use the --parallel option and specify the number of parallel downloads to run. For example, if you want to download all the genomes of bacteria from the refseq section using 8 parallel downloads, you can use this command:
ncbi-genome-download --section refseq --parallel 8 bacteria
This will download all the genomes of bacteria from the refseq section using 8 parallel downloads. Note that this is a very large download, and that parallel downloads increase your network bandwidth usage and CPU load, so use the option with caution and according to your system resources.
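If you are unsure which value to pass, one conservative approach is to use the number of available CPUs but never more than a fixed cap. The sketch below does that. The pick_parallel name and the default cap of 4 are arbitrary assumptions rather than recommendations from the tool, and your network connection may well be the real limit.

```shell
# Suggest a --parallel value: the number of online CPUs, capped at $1 (default 4).
pick_parallel() {
    cap="${1:-4}"
    cpus="$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)"
    # Fall back to 1 if the CPU count is missing or not a number.
    case "$cpus" in
        ''|*[!0-9]*) cpus=1 ;;
    esac
    if [ "$cpus" -gt "$cap" ]; then
        echo "$cap"
    else
        echo "$cpus"
    fi
}

echo "suggested: --parallel $(pick_parallel)"
```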
Conclusion and FAQs
In this article, we have shown you how to use ncbi-genome-download to download bacterial genomes from NCBI by various criteria, such as taxonomic name, assembly accession, assembly level, refseq category, genera, species, etc. We have also shown you how to choose the formats and files to download, such as GenBank, FASTA, protein, assembly report, etc. Finally, we have shown you how to run multiple downloads in parallel using the --parallel option.
We hope that this article has been helpful and informative for you. If you have any questions or comments about ncbi-genome-download or downloading bacterial genomes from NCBI in general, please feel free to leave them below. We will try our best to answer them as soon as possible.
Here are some frequently asked questions (FAQs) about ncbi-genome-download and downloading bacterial genomes from NCBI:
Q: How can I update ncbi-genome-download?
A: If you have installed ncbi-genome-download using pip or conda, you can update it using the same tool. For example:
pip install --upgrade ncbi-genome-download
conda update -c bioconda ncbi-genome-download
This will update ncbi-genome-download to the latest version available on PyPI or bioconda.
Q: How can I resume interrupted downloads?
A: There is no dedicated --resume option. If your download is interrupted for some reason, such as network failure or system crash, simply run the same command again. The tool compares the local files against the checksums published by NCBI, so files that were already downloaded completely should be skipped. For example, if you were downloading all the genomes of bacteria from the refseq section using 8 parallel downloads, and your download was interrupted, you can continue it using this command:
ncbi-genome-download --section refseq --parallel 8 bacteria
For flaky connections, the --retries option tells the tool how many times to retry a failed download before giving up.
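For long unattended downloads, you can wrap the command in a small retry loop. The sketch below assumes that running the same command again only fetches files that are missing or incomplete. NGD, RETRY_DELAY, and download_with_retries are illustrative names, not part of the tool.

```shell
# NGD and RETRY_DELAY can be overridden from the environment.
NGD="${NGD:-ncbi-genome-download}"
RETRY_DELAY="${RETRY_DELAY:-60}"   # seconds to wait between attempts

# Re-run the same download until it succeeds or the attempts run out.
# $1: maximum number of attempts; the remaining arguments go to the tool.
download_with_retries() {
    max="$1"; shift
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        if "$NGD" "$@"; then
            return 0
        fi
        echo "attempt $attempt of $max failed" >&2
        attempt=$((attempt + 1))
        # Wait before the next attempt, but not after the last one.
        if [ "$attempt" -le "$max" ]; then
            sleep "$RETRY_DELAY"
        fi
    done
    return 1
}

# Usage: download_with_retries 3 --section refseq --parallel 8 bacteria
```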
Q: How can I download genomes from other groups, such as archaea, fungi, viruses, etc.?
A: You can download genomes from other groups by changing the group name given at the end of the command. For example, if you want to download all the genomes of archaea from the refseq section, you can use this command:
ncbi-genome-download --section refseq archaea
This will download all the genomes of archaea from the refseq section. Several groups can be combined in a comma-separated list, such as bacteria,viral. You can find a list of all the supported groups in the output of ncbi-genome-download --help or on GitHub.
Q: How can I download genomes from other domains, such as eukaryotes or prokaryotes?
A: There is no --domain option; the tool organizes genomes by the groups that NCBI uses on its FTP servers. Eukaryotic genomes are spread across the fungi, invertebrate, plant, protozoa, vertebrate_mammalian, and vertebrate_other groups, and prokaryotic genomes across archaea and bacteria. For example, if you want to download all the prokaryotic genomes from the refseq section, you can use this command:
ncbi-genome-download --section refseq archaea,bacteria
This will download all the genomes of archaea and bacteria from the refseq section. You can find a list of all the supported groups on GitHub.
Q: How can I download only representative or reference genomes?
A: Representative and reference are not sections. The --section option accepts only refseq or genbank, and the representative and reference labels are refseq categories, which you select with the --refseq-categories option. For example, if you want to download all the genomes of bacteria that are representative sequences, you can use this command:
ncbi-genome-download --section refseq --refseq-categories representative bacteria
This will download the bacterial genomes that NCBI has selected as representative of their taxonomic groups. You can also use --refseq-categories reference to download the reference genomes, which NCBI selects as reference standards for their species.