
Bacterial Genome Downloading Made Easy with NCBI Tools



How to use ncbi-genome-download to download bacterial genomes from NCBI




If you want to download bacterial genomes from the National Center for Biotechnology Information (NCBI) FTP servers, the ncbi-genome-download tool is very useful. It is a Python script that downloads genomes from NCBI by various criteria, such as NCBI taxonomy ID, genus name, assembly accession, assembly level, and RefSeq category. You can also choose which file formats to download, such as GenBank, FASTA, protein FASTA, and assembly reports. In this article, we will show you how to install and use ncbi-genome-download to download bacterial genomes from NCBI.


What is ncbi-genome-download and why use it?




A brief introduction to ncbi-genome-download




ncbi-genome-download is a Python script created by Kai Blin, a bioinformatician and software developer at the Novo Nordisk Foundation Center for Biosustainability. The idea was inspired by Mick Watson's Kraken downloader scripts, which are written in Perl and specific to building a Kraken database. ncbi-genome-download, by contrast, focuses on the genome downloading itself and supports different formats and selection criteria. The tool is open source and available on GitHub.







The benefits of using ncbi-genome-download




There are several benefits of using ncbi-genome-download over other methods of downloading genomes from NCBI. Some of them are:


  • It is easy to install and use. You can install it using pip or conda, and run it from the command line with simple options.



  • It is flexible and customizable. You can select genomes by different criteria, such as NCBI taxonomy ID, assembly accession, assembly level, RefSeq category, and genus or species name. You can also choose which file formats to download, such as GenBank, FASTA, protein FASTA, GFF, and assembly reports.



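  • It is fast and efficient. You can run multiple downloads in parallel using the --parallel option. There is no dedicated resume option; if a run is interrupted, rerunning the same command continues the job, because files that are already present and match the checksums published by NCBI are not fetched again.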



  • It is updated and maintained. The tool is regularly updated to reflect changes in the NCBI FTP servers and the available genome data. You can also report issues or suggest features on GitHub.



How to install ncbi-genome-download




Using pip




If you have Python installed on your system, you can use pip to install ncbi-genome-download. Pip is a package manager for Python that installs packages from PyPI, the Python Package Index. To install ncbi-genome-download using pip, run the following command:


pip install ncbi-genome-download


If this fails on older versions of Python, try updating your pip tool first:


pip install --upgrade pip


and then rerun the ncbi-genome-download install.


Using conda




If you prefer conda, a package manager for Python and other languages that installs packages from channels, you can install ncbi-genome-download with it as well. Conda ships with Anaconda and Miniconda, distributions of Python and other tools for data science and machine learning. To install ncbi-genome-download using conda, run the following command:


conda install -c bioconda ncbi-genome-download


This will install ncbi-genome-download from the bioconda channel, a community-driven channel that provides bioinformatics packages for conda.
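Either installation route should leave an ncbi-genome-download executable on your PATH. The following is a minimal Python sketch for checking that from a script. The helper name find_tool is made up for this example, and the --version flag is an assumption about your installed release.

```python
import shutil
import subprocess


def find_tool(name="ncbi-genome-download"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool()
    if path is None:
        print("ncbi-genome-download not found; install it with pip or conda first")
    else:
        # Assumption: the installed release accepts --version.
        result = subprocess.run([path, "--version"], capture_output=True, text=True)
        print(path, (result.stdout or result.stderr).strip())
```

If the tool is not found right after a pip install, the directory pip installs scripts into may be missing from your PATH.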


How to download bacterial genomes by different criteria




Using taxonomy ID




ncbi-genome-download selects organisms by NCBI taxonomy ID rather than by free-text taxon name, and the taxonomic group (bacteria, archaea, fungi, and so on) is given as a positional argument at the end of the command rather than through an option. For example, if you want to download all the genomes of Escherichia coli, whose species-level taxonomy ID is 562, you can use the following command:


ncbi-genome-download --section refseq --species-taxids 562 bacteria


This will download all the E. coli genomes from the refseq section of the NCBI FTP servers. The refseq section contains curated and annotated genomes that are considered reference sequences. You can also use the --section genbank option to download genomes from the genbank section, which contains all the genomes submitted to NCBI. However, note that some genomes may be duplicated or incomplete in the genbank section.


To select one particular organism, use --taxids with its own taxonomy ID instead. For example, the taxonomy ID of E. coli K-12 MG1655 is 511145, so you can use the following command:


ncbi-genome-download --section refseq --taxids 511145 bacteria


Both options compare against the IDs recorded for each assembly, so the ID of a higher rank, such as 1239 for the phylum Firmicutes, does not by itself select every genome below it. You first have to expand it into the taxonomy IDs of its descendants; the project's GitHub repository includes a helper script, gimme_taxa.py, for that purpose. You can find the taxonomy ID of any group by using the NCBI Taxonomy Browser.
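Before starting a large download, it can help to preview what a filter selects. The sketch below assembles such a command in Python. build_command is a hypothetical helper written for this article, and it assumes that your installed release accepts --species-taxids and --dry-run and takes the group as the final positional argument; check ncbi-genome-download --help if unsure.

```python
def build_command(group, section="refseq", species_taxids=None, dry_run=True):
    """Assemble an argument list for ncbi-genome-download (illustrative helper)."""
    cmd = ["ncbi-genome-download", "--section", section]
    if species_taxids:
        # Several IDs are joined with commas into a single option value.
        cmd += ["--species-taxids", ",".join(str(t) for t in species_taxids)]
    if dry_run:
        # With --dry-run the tool only reports what it would fetch.
        cmd.append("--dry-run")
    cmd.append(group)  # the taxonomic group comes last, as a positional argument
    return cmd


if __name__ == "__main__":
    # 562 is the species-level taxonomy ID of Escherichia coli.
    print(" ".join(build_command("bacteria", species_taxids=[562])))
```

The resulting list can be passed to subprocess.run on a machine where the tool is installed and the network is reachable. Setting dry_run=False gives the command for the real download.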


Using assembly accession




If you want to download a specific genome or a set of genomes by their assembly accession, you can use the --assembly-accessions option, which takes one accession or a comma-separated list. For example, if you want to download the genome of Escherichia coli K-12 MG1655, which has the assembly accession GCF_000005845.2, you can use this command:


ncbi-genome-download --section refseq --assembly-accessions GCF_000005845.2 bacteria


This will download only the genome of E. coli K-12 MG1655 from the refseq section. Accessions that start with GCF_ belong to the refseq section and those that start with GCA_ to the genbank section, so choose --section to match. The tool has no option for filtering by BioProject accession; to fetch the genomes of a BioProject, look up their assembly accessions first. You can find the assembly accession of any genome in the NCBI Assembly Database, and the assemblies that belong to a project through the NCBI BioProject Database.
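When the accessions come from a spreadsheet or a text file, a short script can turn the list into a single command. This is a minimal sketch, not part of the tool: the function names are invented for illustration, and it assumes that --assembly-accessions accepts a comma-separated list and that the group is given as the last, positional argument.

```python
def parse_accessions(text):
    """Return the accessions in text, one per line, skipping blanks and # comments."""
    accessions = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            accessions.append(line)
    return accessions


def section_for(accession):
    """GCF_ accessions are RefSeq assemblies, GCA_ accessions are GenBank ones."""
    if accession.startswith("GCF_"):
        return "refseq"
    if accession.startswith("GCA_"):
        return "genbank"
    raise ValueError("not an assembly accession: " + accession)


def accession_command(accessions, group="bacteria"):
    """Build one command for accessions that all live in the same section."""
    sections = {section_for(a) for a in accessions}
    if len(sections) != 1:
        raise ValueError("expected accessions from exactly one section")
    return ["ncbi-genome-download", "--section", sections.pop(),
            "--assembly-accessions", ",".join(accessions), group]


listing = """
# accessions to fetch, one per line
GCF_000005845.2
"""
print(" ".join(accession_command(parse_accessions(listing))))
# prints: ncbi-genome-download --section refseq --assembly-accessions GCF_000005845.2 bacteria
```

Deriving the section from the accession prefix avoids the common mistake of asking the refseq section for a GCA_ accession, which selects nothing.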


Using assembly level or refseq category




If you want to filter the genomes by their assembly level or RefSeq category, you can use the --assembly-levels and --refseq-categories options (older releases spell both in the singular, so check ncbi-genome-download --help for your installed version). The assembly level indicates how complete and contiguous a genome assembly is, and it can be one of these values: complete, chromosome, scaffold, or contig. The RefSeq category indicates how representative and reliable a genome sequence is, and it can be one of these values: reference, representative, or na. For example, if you want to download only the complete genomes of bacteria that are reference sequences from the refseq section, you can use this command:


ncbi-genome-download --section refseq --assembly-levels complete --refseq-categories reference bacteria


This will download only the genomes that meet both criteria. You can also give several values to one of these options by separating them with commas. For example, if you want to download all the genomes of bacteria that are either complete or chromosome level assemblies from the refseq section, you can use this command:


ncbi-genome-download --section refseq --assembly-levels complete,chromosome bacteria


This will download every genome whose assembly level is one of the listed values. The --section option itself takes a single value, refseq or genbank, so to cover both sections you run the command once for each.
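How the filters combine can be pictured with a small Python sketch: values listed for one option are alternatives, while different options must all be satisfied. This is not the tool's own code; the records and the selected helper are made up purely for illustration.

```python
def selected(record, levels=None, categories=None):
    """Values within one filter are alternatives (OR); all filters must pass (AND)."""
    level_ok = levels is None or record["level"] in levels
    category_ok = categories is None or record["category"] in categories
    return level_ok and category_ok


# Three invented assemblies standing in for entries of an NCBI listing.
records = [
    {"name": "genome-A", "level": "complete", "category": "reference"},
    {"name": "genome-B", "level": "chromosome", "category": "na"},
    {"name": "genome-C", "level": "contig", "category": "representative"},
]

both = [r["name"] for r in records if selected(r, {"complete"}, {"reference"})]
either_level = [r["name"] for r in records if selected(r, {"complete", "chromosome"})]

print(both)          # ['genome-A']
print(either_level)  # ['genome-A', 'genome-B']
```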


Using genera or species name




If you want to download genomes by their genus or species name, you can use the --genera option. For example, if you want to download all the genomes of bacteria that belong to the genus Bacillus from the refseq section, you can use this command:


ncbi-genome-download --section refseq --genera Bacillus bacteria


This will download all the genomes of Bacillus species from the refseq section. The value is compared with the beginning of each organism name, so you can narrow the selection to a species by giving the full binomial; there is no separate --species option. For example, if you want to download only the genomes of Bacillus subtilis, which is a model organism for bacterial genetics and physiology, you can use this command:


ncbi-genome-download --section refseq --genera "Bacillus subtilis" bacteria


This will download the B. subtilis genomes from the refseq section. Note that you need quotation marks around the name if it contains spaces. To fetch one particular strain, such as B. subtilis 168, it is more reliable to use its assembly accession or taxonomy ID. You can find the genus and species names of any genome by using the NCBI Genome Database.


How to choose the formats and files to download




Using the --formats option




By default, ncbi-genome-download downloads the GenBank format file for each genome, which contains the nucleotide sequences and annotations of the genomic features. You can also choose other formats, such as fasta, protein-fasta, gff, or assembly-report. To do this, use the --formats option and specify one or more formats separated by commas. For example, if you want to download both the GenBank and FASTA format files for each genome, you can use this command:


ncbi-genome-download --section refseq --formats genbank,fasta bacteria


This will download both the _genomic.gbff.gz and _genomic.fna.gz files for each genome from the refseq section. The .gbff files contain the GenBank format data, and the .fna files contain the FASTA format data, that is, only the nucleotide sequences without annotations. Both are gzip-compressed. You can find a list of all the supported formats on GitHub or in the output of ncbi-genome-download --help.
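Once the files are on disk, a few lines of Python are enough to locate and read them. This is a sketch under two assumptions: the file-name endings listed in SUFFIXES follow NCBI's naming for these formats, and the downloads sit somewhere below a folder named after the section (refseq here) in the working directory. The helper names are hypothetical.

```python
import gzip
from pathlib import Path

# Assumed file-name endings for a few --formats values, following NCBI naming.
SUFFIXES = {
    "genbank": "_genomic.gbff.gz",
    "fasta": "_genomic.fna.gz",
    "protein-fasta": "_protein.faa.gz",
    "features": "_feature_table.txt.gz",
}


def files_for_format(root, fmt):
    """List downloaded files of one format anywhere below the output folder."""
    root = Path(root)
    if not root.is_dir():
        return []
    suffix = SUFFIXES[fmt]
    # Skip names such as *_cds_from_genomic.fna.gz, which share the same ending.
    return sorted(p for p in root.rglob("*" + suffix)
                  if not p.name.endswith("_from" + suffix))


def count_fasta_records(lines):
    """Count sequences in FASTA text: every record starts with a '>' header."""
    return sum(1 for line in lines if line.startswith(">"))


if __name__ == "__main__":
    for path in files_for_format("refseq", "fasta"):
        # gzip.open in text mode decompresses on the fly.
        with gzip.open(path, "rt") as handle:
            print(path.name, count_fasta_records(handle))
```

Reading through gzip.open avoids having to decompress every file first, which matters when a download contains thousands of genomes.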


Downloading feature tables and other auxiliary files




Sometimes, you may want files beyond the sequence records, such as feature tables, protein sequences, RNA sequences, or assembly reports. These are selected through the same --formats option; there is no separate option for them. For example, if you want to download both the GenBank format files and the feature table files for each genome, you can use this command:


ncbi-genome-download --section refseq --formats genbank,features bacteria


This will download both the _genomic.gbff.gz and _feature_table.txt.gz files for each genome from the refseq section. The feature table files are tab-delimited files that summarize the genomic features and their locations. You can find a list of all the available format names on GitHub or in the output of ncbi-genome-download --help.


How to run multiple downloads in parallel




Using the --parallel option




If you want to speed up your downloads by running several of them at the same time, you can use the --parallel option and specify the number of parallel processes to use. For example, if you want to download all the genomes of bacteria from the refseq section using 8 parallel processes, you can use this command:


ncbi-genome-download --section refseq --parallel 8 bacteria


Note that this increases your network bandwidth usage and CPU load, as well as the load on the NCBI servers, so choose the number with caution and according to your system resources.


Conclusion and FAQs




In this article, we have shown you how to use ncbi-genome-download to download bacterial genomes from NCBI by various criteria, such as taxonomy ID, assembly accession, assembly level, RefSeq category, and genus or species name. We have also shown you how to choose the file formats to download, such as GenBank, FASTA, protein FASTA, and assembly reports. Finally, we have shown you how to run multiple downloads in parallel using the --parallel option.


We hope that this article has been helpful and informative for you. If you have any questions or comments about ncbi-genome-download or downloading bacterial genomes from NCBI in general, please feel free to leave them below. We will try our best to answer them as soon as possible.


Here are some frequently asked questions (FAQs) about ncbi-genome-download and downloading bacterial genomes from NCBI:


Q: How can I update ncbi-genome-download?




A: If you installed ncbi-genome-download using pip or conda, you can update it with the same tool: pip uses the --upgrade option of its install command, and conda has a separate update command. For example:


pip install --upgrade ncbi-genome-download


conda update -c bioconda ncbi-genome-download


This will update ncbi-genome-download to the latest version available on PyPI or bioconda.


Q: How can I resume interrupted downloads?




A: The tool has no dedicated resume option. If your download is interrupted for some reason, such as a network failure or system crash, simply run the same command again from the same directory. For example, if you were downloading all the genomes of bacteria from the refseq section using 8 parallel processes, and your download was interrupted, run this command again:


ncbi-genome-download --section refseq --parallel 8 bacteria


Files that are already present and match the checksums published by NCBI are skipped, so only missing or incomplete files are fetched. For unreliable connections, the --retries option tells the tool to retry a failed connection a given number of times before giving up.
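For long unattended jobs, the rerun can be automated. The following is a minimal Python sketch that repeats a command until it succeeds. run_until_done is a hypothetical helper, and it assumes the tool exits with a non-zero status when a run fails, which is worth confirming for your installed release.

```python
import subprocess
import time


def run_until_done(cmd, attempts=3, wait_seconds=30,
                   runner=subprocess.call, sleep=time.sleep):
    """Rerun cmd until it exits with status 0; return the attempt number or None."""
    for attempt in range(1, attempts + 1):
        if runner(cmd) == 0:
            return attempt
        if attempt < attempts:
            # Give a flaky connection some time before the next try.
            sleep(wait_seconds)
    return None


# Example use on a machine where the tool is installed:
# cmd = ["ncbi-genome-download", "--section", "refseq", "--parallel", "8", "bacteria"]
# if run_until_done(cmd, attempts=5) is None:
#     print("download still failing after 5 attempts")
```

The runner and sleep parameters exist only so that the loop can be tried out without touching the network, for instance by passing a function that returns a fixed exit status.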


Q: How can I download genomes from other groups, such as archaea, fungi, viruses, etc.?




A: The group is the positional argument at the end of the command, so you only need to replace bacteria with another group name. For example, if you want to download all the genomes of archaea from the refseq section, you can use this command:


ncbi-genome-download --section refseq archaea


This will download all the genomes of archaea from the refseq section. Note that group names follow the NCBI folder names; viruses, for example, are in the group viral. You can find a list of all the supported groups on GitHub or in the output of ncbi-genome-download --help.


Q: How can I download genomes from other domains, such as eukaryotes or prokaryotes?




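A: There is no option for selecting a whole domain. The domains are spread over several groups: prokaryotes are covered by archaea and bacteria, and eukaryotes by fungi, invertebrate, plant, protozoa, vertebrate_mammalian, and vertebrate_other. You can pass several groups at once as a comma-separated list. For example, if you want to download all the prokaryotic genomes from the refseq section, you can use this command:


ncbi-genome-download --section refseq archaea,bacteria


This will download all the genomes of archaea and bacteria from the refseq section. Keep in mind that whole groups, and eukaryotic ones in particular, amount to very large downloads. You can find a list of all the supported groups on GitHub.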
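If you switch between domains often, the grouping can be kept in a small lookup table. The Python sketch below is illustrative only: the table and helper names are invented for this article, and the group names are assumed to match those listed by ncbi-genome-download --help.

```python
# Illustrative mapping only: NCBI group names sorted by domain.
GROUPS_BY_DOMAIN = {
    "prokaryotes": ["archaea", "bacteria"],
    "eukaryotes": ["fungi", "invertebrate", "plant", "protozoa",
                   "vertebrate_mammalian", "vertebrate_other"],
}


def groups_argument(domain):
    """Return the comma-separated group list for one domain."""
    return ",".join(GROUPS_BY_DOMAIN[domain])


def domain_command(domain, section="refseq"):
    """Build the argument list that downloads every group of a domain."""
    return ["ncbi-genome-download", "--section", section, groups_argument(domain)]


print(" ".join(domain_command("prokaryotes")))
# prints: ncbi-genome-download --section refseq archaea,bacteria
```

Groups such as viral and metagenomes belong to neither list and have to be requested by name.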


Q: How can I download only representative or reference genomes?




A: The --section option only accepts refseq or genbank. Representative and reference are RefSeq categories, not sections, so you select them with the --refseq-categories option. For example, if you want to download all the genomes of bacteria that are representative sequences, you can use this command:


ncbi-genome-download --section refseq --refseq-categories representative bacteria


This will download all the genomes of bacteria that NCBI has selected as representative of their taxonomic groups. You can use --refseq-categories reference in the same way to download the reference genomes, which NCBI selects as reference standards for their species. You can find a list of all the supported sections and categories on GitHub.

