################################################################################
README for ftp://ftp.ncbi.nlm.nih.gov/genomes/all/

Last updated: February 8, 2016
################################################################################

==========
Background
==========
Sequence data is provided for all single organism genome assemblies that are
included in NCBI's Assembly resource (www.ncbi.nlm.nih.gov/assembly/). This
includes submissions to databases of the International Nucleotide Sequence
Database Collaboration, which are available in NCBI's GenBank database, as well
as the subset of those submissions that are included in NCBI's RefSeq Genomes
project.

Available by anonymous FTP at: ftp://ftp.ncbi.nlm.nih.gov/genomes/

Please refer to README files and the FTP FAQ for additional information:
http://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/

=====================================================================
Genome sequence and annotation data is provided in three directories:
=====================================================================
1) all: content is the union of GenBank and RefSeq assemblies. Subdirectories
   are provided per assembly accession and version.
   [This directory is not suitable for browsing because it holds many thousands
   of entries.]
   Only subdirectories for "latest" assemblies are refreshed when annotation is
   updated or when software updates are released, so new file formats or
   improvements to existing formats are not available for non-latest
   assemblies.

2) genbank: content includes primary submissions of assembled genome sequence
   and associated annotation data, if any, as exchanged among members of the
   International Nucleotide Sequence Database Collaboration, of which NCBI's
   GenBank database is a member. The GenBank directory area includes genome
   sequence data for a larger number of organisms than the RefSeq directory
   area; however, some assemblies are unannotated. The directory is further
   organized by taxonomic groups.

3) refseq: content includes assembled genome sequence and RefSeq annotation
   data. All prokaryotic and eukaryotic RefSeq genomes have annotation. RefSeq
   annotation data may be calculated by NCBI annotation pipelines or propagated
   from the GenBank submission. The RefSeq directory area includes fewer
   organisms than the GenBank directory area because not all genome assemblies
   are selected for the RefSeq project. The directory is further organized by
   taxonomic groups.

Genome assemblies of interest can be identified using the NCBI Assembly
resource (www.ncbi.nlm.nih.gov/assembly), or by using the assembly summary
report files that are provided for all GenBank and all RefSeq assemblies:

   ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt
   or
   ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt

   ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt
   or
   ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt

Assembly summary report files containing information on assemblies for a
particular taxonomic group or species are provided in the group and
Genus_species directories under the "genbank" and "refseq" directory trees.
e.g.
   ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt
   ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/Sulfolobus_islandicus/assembly_summary.txt

Search the meta-data fields, or filter the files, to find assemblies of
interest.

===========================
Data provided per assembly:
===========================
Sequence and other data files provided per assembly are named according to the
rule:
[assembly accession.version]_[assembly name]_[content type].[optional format]

File formats and content:

   *_assembly_report.txt file
       Tab-delimited text file reporting the name, role and sequence
       accession.version for objects in the assembly.
       The file header contains meta-data for the assembly including: assembly
       name, assembly accession.version, scientific name of the organism and
       its taxonomy ID, assembly submitter, and sequence release date. The
       file header also indicates whether the assembly version is "latest",
       "replaced", or "suppressed".
   *_assembly_stats.txt file
       Tab-delimited text file reporting statistics for the assembly including:
       total length, ungapped length, contig & scaffold counts, contig-N50,
       scaffold-L50, scaffold-N50, scaffold-N75, and scaffold-N90.
   *_assembly_regions.txt
       Provided for assemblies that include alternate or patch assembly units.
       Tab-delimited text file reporting the location of genomic regions and
       the alt/patch scaffolds placed within those regions.
   *_assembly_structure directory
       This directory will only be present if the assembly has internal
       structure. When present, it will contain AGP files that define how
       component sequences are organized into scaffolds and/or chromosomes.
       Other files define how scaffolds and chromosomes are organized into
       non-nuclear and other assembly-units, and how any alternate or patch
       scaffolds are placed relative to the chromosomes. Refer to the
       README.txt file in the assembly_structure directory for additional
       information.
   *_feature_table.txt.gz
       Tab-delimited text file reporting locations and attributes for a subset
       of annotated features. Included feature types are: gene, CDS, RNA (all
       types), operon, C/V/N/S_region, and V/D/J_segment. Replaces the .ptt &
       .rnt format files that were provided in the old genomes FTP directories.
       See the "Description of files" section below for details of the file
       format.
   *_genomic.fna.gz file
       FASTA format of the genomic sequence(s) in the assembly. Repetitive
       sequences in eukaryotes are masked to lower-case (see below).
       The FASTA title is formatted as sequence accession.version plus
       description.
       The genomic.fna.gz file includes all top-level sequences in the assembly
       (chromosomes, plasmids, organelles, unlocalized scaffolds, unplaced
       scaffolds, and any alternate loci or patch scaffolds). Scaffolds that
       are part of the chromosomes are not included because they are redundant
       with the chromosome sequences; sequences for these placed scaffolds are
       provided under the assembly_structure directory.
   *_genomic.gbff.gz file
       GenBank flat file format of the genomic sequence(s) in the assembly.
       This file includes both the genomic sequence and the CONTIG description
       (for CON records); hence, it replaces both the .gbk & .gbs format files
       that were provided in the old genomes FTP directories.
   *_genomic.gff.gz file
       Annotation of the genomic sequence(s) in Generic Feature Format Version
       3 (GFF3). Sequence identifiers are provided as accession.version.
       Additional information about NCBI's GFF files is available at
       ftp://ftp.ncbi.nlm.nih.gov/genomes/README_GFF3.txt.
   *_protein.faa.gz file
       FASTA format of the accessioned protein products annotated on the genome
       assembly. The FASTA title is formatted as sequence accession.version
       plus description.
   *_protein.gpff.gz file
       GenPept format of the accessioned protein products annotated on the
       genome assembly.
   *_rm.out.gz file
       RepeatMasker output. Provided for eukaryotes.
   *_rm.run file
       Documentation of the RepeatMasker version, parameters, and library.
       Provided for eukaryotes.
   *_rna.fna.gz file
       FASTA format of accessioned RNA products annotated on the genome
       assembly. Provided for RefSeq assemblies as relevant. (Note: RNA and
       mRNA products are not instantiated as separate accessioned records in
       GenBank but are provided for some RefSeq genomes, most notably the
       eukaryotes.)
       The FASTA title is provided as sequence accession.version plus
       description.
   *_rna.gbff.gz file
       GenBank flat file format of RNA products annotated on the genome
       assembly. Provided for RefSeq assemblies as relevant.
   *_wgsmaster.gbff.gz
       GenBank flat file format of the WGS master for the assembly (present
       only if a WGS master record exists for the sequences in the assembly).
   md5checksums.txt file
       File checksums are provided for all data files in the directory.

=====================
Description of files:
=====================

Masking of fasta sequences in genomic.fna.gz files
--------------------------------------------------
Repetitive sequences in eukaryotic genome assembly sequence files, as
identified by WindowMasker (Morgulis A, Gertz EM, Schaffer AA, Agarwala R.
2006. Bioinformatics 22:134-41), have been masked to lower-case.

Alignment programs typically have parameters that control whether the program
will ignore lower-case masking, treat it as soft-masking (i.e. only for finding
initial matches) or treat it as hard-masking. By default NCBI BLAST will ignore
lower-case masking but this can be changed by adding options to the blastn
command-line.

To have blastn treat lower-case masking in the query sequence as soft-masking
add:
   -lcase_masking
To have blastn treat lower-case masking in the query sequence as hard-masking
add:
   -lcase_masking -soft_masking false

Alternatively, commands such as the following can be used to generate either
unmasked sequence or sequence masked with Ns.

Example commands to remove lower-case masking:
   perl -pe '/^[^>]/ and $_=uc' genomic.fna > genomic.unmasked.fna
   -or-
   awk '{if(/^[^>]/)$0=toupper($0);print $0}' genomic.fna > genomic.unmasked.fna

Example commands to convert lower-case masking to masking with Ns
(hard-masked):
   perl -pe '/^[^>]/ and $_=~ s/[a-z]/N/g' genomic.fna > genomic.N-masked.fna
   -or-
   awk '{if(/^[^>]/)gsub(/[a-z]/,"N");print $0}' genomic.fna > genomic.N-masked.fna

*_feature_table.txt.gz
----------------------
Tab-delimited text file reporting locations and attributes for a subset of
annotated features. Included feature types are: gene, CDS, RNA (all types),
operon, C/V/N/S_region, and V/D/J_segment.

The file is tab delimited (including a #header) with the following columns:
col 1: feature: INSDC feature type
col 2: class: For ncRNA features, this is the ncRNA_class for the feature.
       For gene features, this is the gene biotype computed based on the set
       of child features for that gene. See the description of the
       gene_biotype attribute in the GFF3 documentation for more details:
       ftp://ftp.ncbi.nlm.nih.gov/genomes/README_GFF3.txt
col 3: assembly: assembly accession.version
col 4: assembly_unit: name of the assembly unit, such as "Primary Assembly",
       "ALT_REF_LOCI_1", or "non-nuclear"
col 5: seq_type: sequence type, computed from the "Sequence-Role" and
       "Assigned-Molecule-Location/Type" in the *_assembly_report.txt file.
       The value is computed as:
         if an assembled-molecule, then report the location/type value,
            e.g. chromosome, mitochondrion, or plasmid
         if an unlocalized-scaffold, then report
            "unlocalized scaffold on [type]",
            e.g. unlocalized scaffold on chromosome
         else report the role,
            e.g. alternate scaffold, fix patch, or novel patch
col 6: chromosome
col 7: genomic_accession
col 8: start: feature start coordinate (1-based).
       start is always less than end
col 9: end: feature end coordinate (1-based)
col10: strand
col11: product_accession: accession.version of the product referenced by this
       feature, if one exists
col12: non-redundant_refseq: for bacteria and archaea assemblies, the
       non-redundant WP_ protein accession corresponding to the CDS feature.
       May be the same as column 11, for RefSeq genomes annotated directly
       with WP_ RefSeq proteins, or may be different, for genomes annotated
       with genome-specific protein accessions (e.g. NP_ or YP_ RefSeq
       proteins) that reference a WP_ RefSeq accession.
col13: related_accession: for eukaryotic RefSeq annotations, the RefSeq
       protein accession corresponding to the transcript feature, or the
       RefSeq transcript accession corresponding to the protein feature.
col14: name: For genes, this is the gene description or full name. For RNA,
       CDS, and some other features, this is the product name.
col15: symbol: gene symbol
col16: GeneID: NCBI GeneID, for those RefSeq genomes included in NCBI's Gene
       resource
col17: locus_tag
col18: feature_interval_length: sum of the lengths of all intervals for the
       feature (i.e. the length without introns for a joined feature)
col19: product_length: length of the product corresponding to the
       accession.version in column 11. Protein product lengths are in amino
       acid units, and do not include the stop codon, which is included in
       column 18. Additionally, product_length may differ from
       feature_interval_length if the product contains sequence differences
       vs. the genome, as found for some RefSeq transcript and protein
       products based on mRNA sequences and also for INSDC proteins that are
       submitted to correct genome discrepancies.
col20: attributes: semi-colon delimited list of a controlled set of
       qualifiers.
       The list currently includes: partial, pseudo, pseudogene,
       ribosomal_slippage, trans_splicing, anticodon=NNN (for tRNAs),
       old_locus_tag=XXX

________________________________________________________________________________
National Center for Biotechnology Information (NCBI)
National Library of Medicine
National Institutes of Health
8600 Rockville Pike
Bethesda, MD 20894, USA
tel: (301) 496-2475
fax: (301) 480-9241
e-mail: info@ncbi.nlm.nih.gov
________________________________________________________________________________
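
Example commands to summarize a feature table (illustrative sketch only, not an
NCBI-provided tool). The *_feature_table.txt.gz column layout described above
can be explored with awk. The function names count_feature_types and list_cds
are hypothetical. The sketch assumes that header lines begin with "#" and that
columns appear in the order listed in this README (col 1 = feature,
col 7 = genomic_accession, col 8 = start, col 9 = end, col 10 = strand,
col 11 = product_accession).

```shell
# Count features by type (column 1) in a feature table read from stdin.
# Assumption: lines starting with "#" are header lines and are skipped.
count_feature_types() {
    awk -F'\t' '!/^#/ { n[$1]++ } END { for (t in n) print t "\t" n[t] }' | sort
}

# For each CDS feature, print genomic accession, start, end, strand, and
# product accession (columns 7, 8, 9, 10, and 11), tab-delimited.
# Empty columns are preserved because the field separator is a single tab.
list_cds() {
    awk -F'\t' -v OFS='\t' '!/^#/ && $1 == "CDS" { print $7, $8, $9, $10, $11 }'
}
```

Either function can be fed a decompressed file, for example
"gunzip -c FILE_feature_table.txt.gz | count_feature_types", where FILE stands
for the accession and assembly name prefix of the assembly of interest.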