#########################################
 Los Alamos Oralgen Database FTP Archive
#########################################

This FTP site contains the database files derived from the Los Alamos Oralgen
Database.

The file "oralgen.sql" was provided by Dr. Gang (Gary) Xie of Los Alamos
National Lab. This file is the direct MySQL database dump and can be used to
reconstruct the original LANL Oralgen Database. However, the original web site
and query interface files are not available, so even with a reconstructed
database it is very cumbersome to search and navigate through the genomes and
annotations. Until LANL can provide us with the original LANL Oralgen Web
interface, the temporary workaround is to examine the database tables in plain
text or MS Excel format.

To facilitate continuing access to the LANL Oralgen annotations for the
scientific community, we exported the database tables to the .csv
(comma-separated values) file format. Each database in oralgen.sql was
exported to a designated subfolder named after the genome (if available).
The database name => genome name conversion is listed below:

===============================
database name => folder name
===============================
AAD11S1  => Aggregatibacter_actinomycetemcomitans_D11S-1
AAD7S1   => Aggregatibacter_actinomycetemcomitans_D7S-1
aact     => Actinobacillus_actinomycetemcomitans_HK1651
anae     => Actinomyces_naeslundii_MG1
aodo     => Actinomyces_odontolyticus_ATCC_17982
av1      => Actinomyces_phage_Av-1
dentoti  => Oralgen_toti
dentoti2 => Streptococcus_toti
fnuc     => Fusobacterium_nucleatum_ATCC_25586
fnucp    => Fusobacterium_nucleatum_polymorphum
fnucv    => Fusobacterium_nucleatum_subsp._vincentii_ATCC_49256
frag_recr
frservice
hhv1     => Human_herpesvirus_1
hhv2     => Human_Herpesvirus_2
hhv5     => Human_Herpesvirus_5_AD169
hhv8     => Human_Herpesvirus_8_BC-1
metagenomics
omdb
oralgensearch
oralpp   => Actinobacillus_actinomycetemcomitans_VT745
pgin     => Porphyromonas_gingivalis
pgin2    => Porphyromonas_gingivalis_ATCC_33277
pintnew  => Prevotella_intermedia_17
pmic     => Parvimonas_micra_ATCC_33270
predpath
prv1     => Pseudorabies_virus_Kaplan+
saga     => Streptococcus_agalactiae_2603V-R
sgor     => Streptococcus_gordonii_Challis_substr._CH1
smit     => Streptococcus_mitis_NCTC_12261
smut     => Streptococcus_mutans_UA159
spne     => Streptococcus_pneumoniae_TIGR4
spyo     => Streptococcus_pyogenes_M1_GAS
srna
ssan     => Streptococcus_sanguinis_SK36
sther    => Streptococcus_thermophilus_CNRZ1066
streptoti
tden_new => Treponema_denticola_ATCC_35405
test
tfor     => Tannerella_forsythensis_ATCC_43037
totoweb
user_comments
===============================

Each database contains multiple tables, which have been exported to .csv files
in the corresponding folder. Users can download the .csv files and open them
directly in MS Excel. Note that some tables contain text strings that are too
long for a single cell and may therefore be split across multiple rows when
the file is opened in Excel.
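
For users who would rather not rely on Excel for tables with very long fields,
the .csv files can also be read with a script. The following is a minimal
Python sketch, not part of the original archive. It assumes the exported files
are standard comma-separated text in UTF-8 (the actual encoding is not stated
here), and the names longest_fields, columns_too_long_for_excel and
EXCEL_CELL_LIMIT are hypothetical, chosen for illustration.

```python
import csv

# Python's csv reader rejects fields longer than 131072 characters by default,
# which is too small for a whole-genome sequence stored in one field.
# 2**31 - 1 fits in a C long on all common platforms.
csv.field_size_limit(2**31 - 1)

# A single Excel cell holds at most 32,767 characters.
EXCEL_CELL_LIMIT = 32767


def longest_fields(path, encoding="utf-8"):
    """Return {column index: length of the longest field} for one .csv file.

    ASSUMPTION: the file is plain comma-separated text in the given encoding.
    """
    longest = {}
    with open(path, newline="", encoding=encoding) as handle:
        for record in csv.reader(handle):
            for index, value in enumerate(record):
                if len(value) > longest.get(index, 0):
                    longest[index] = len(value)
    return longest


def columns_too_long_for_excel(path, encoding="utf-8"):
    """Return the column indexes whose longest field exceeds one Excel cell."""
    lengths = longest_fields(path, encoding)
    return sorted(i for i, n in lengths.items() if n > EXCEL_CELL_LIMIT)
```

Calling columns_too_long_for_excel("genome_table.csv") on a downloaded file
would list the columns that Excel cannot display in a single cell.
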
For example, the table "genome_table.csv" has a field that contains a very
long genomic sequence in a single cell, which Excel will split across
multiple rows.

The .sql file is the file needed to reconstruct the database to its original
state and may be used for that purpose if desired.

#######################################################
 Matching Oralgen Gene IDs to Current NCBI Protein IDs
#######################################################

Many researchers have asked how the genes annotated by LANL Oralgen correspond
to the genes annotated by NCBI. This information is important because the LANL
Oralgen gene IDs have been published in many journals, and the unavailability
of the LANL Oralgen database makes it impossible to relate Oralgen genes cited
in publications to the current NCBI annotated genes.

Here we provide a lookup table that relates the Oralgen gene IDs to the
current NCBI gene IDs of the same genome. Because some of the genomic
sequences annotated in Oralgen differ from those in NCBI, the genes/proteins
of the two cannot be matched by genomic coordinates (e.g., the start, stop and
strand information of the genes). The safest way to correlate the genes
annotated in the two annotation systems is through protein sequence similarity
comparison between the genes/proteins of the two "supposedly" identical
genomes. This is especially the case when the genomic sequences differ, even
by a single nucleotide.

In each genome folder we provide such a lookup table, if the NCBI annotation
of the counterpart genome is available. The table is in Excel .csv format with
the file name "lanl_gene_id_2_ncbi_blastp.csv". It contains the top BLASTP hit
for each of the Oralgen annotated protein sequences.
If the BLASTP result shows an identical hit (100% identity, identical values
for q.start and s.start, and for q.end and s.end), the two proteins are
identical and the two gene IDs can be considered equivalent.

Below is an example of part of a lanl_gene_id_2_ncbi_blastp.csv table:

===================================================================================================================================================
Queryid  GI         ProteinID       LocusID   %_identity  align_len  mismatches  gap_openings  q.start  q.end  s.start  s.end  e-value    bit_score
TF1413   375254175  YP_005013342.1  BFO_0392  100         210        0           0             1        210    1        210    4.00E-110  390
TF0511   375256329  YP_005015496.1  BFO_2834  100         57         0           0             2        58     1        57     3.00E-27   112
TF1116   375255861  YP_005015028.1  BFO_2180  35.82       67         36          2             43       106    221      283    8.00E-06   42
===================================================================================================================================================

Clearly TF1413 (Oralgen ID) is the same as BFO_0392 (NCBI Locus ID), given the
100% identity and the identical start and stop positions in the alignment.
TF0511 and BFO_2834 have 100% identity, but the alignment start and stop
positions differ (2-58 for TF0511 but 1-57 for BFO_2834). Both are still
considered identical proteins; the alignment discrepancy is probably due to
different ORF calling schemes. The third example shows that TF1116 and
BFO_2180 are clearly different proteins, given the low sequence identity.

The folder "ncbi_seq" contains the NCBI protein sequence and annotation files
used in the BLASTP search.

We provide this information to the research community voluntarily to
facilitate scientific research. The information provided is either a direct
export of the original data file provided by LANL or automatically generated
by computer programs. Please examine it carefully to ensure its accuracy.

If you wish to cite the information provided here, please use the following
format:

"Chen, T. 2014.
The Los Alamos Oralgen FTP Archive. ftp://www.homd.org/lanl_oralgen"

Or simply cite the FTP link "ftp://www.homd.org/lanl_oralgen" in the text.

If you have any questions, please feel free to contact me by email.

Sincerely,

George Tsute Chen, Ph.D.
Associate Research Investigator
Department of Molecular Genetics
The Forsyth Institute
245 First Street, Cambridge, MA
Phone: (617) 892-8359
Fax: (617) 262-4021
Email: tchen@forsyth.org
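
##################################################
 Example: Screening the BLASTP Table with a Script
##################################################

The matching rules described above for lanl_gene_id_2_ncbi_blastp.csv can also
be applied with a script. The following is a minimal Python sketch, not part
of the original archive. It assumes that the file is comma-delimited and has a
header row with the column names shown in the example table (Queryid, LocusID,
%_identity, q.start, q.end, s.start, s.end). The functions classify_hit and
map_gene_ids and the three label names are hypothetical, chosen here for
illustration.

```python
import csv
import io


def classify_hit(row):
    """Label one top-hit row using the criteria described in this README."""
    if float(row["%_identity"]) < 100.0:
        return "not_identical"  # e.g. TF1116 vs. BFO_2180 (35.82% identity)
    same_start = int(row["q.start"]) == int(row["s.start"])
    same_end = int(row["q.end"]) == int(row["s.end"])
    if same_start and same_end:
        return "identical"  # e.g. TF1413 vs. BFO_0392
    # 100% identity with shifted coordinates, e.g. TF0511 vs. BFO_2834;
    # the README attributes this to different ORF calling schemes.
    return "identical_offset"


def map_gene_ids(handle):
    """Yield (Oralgen ID, NCBI locus ID, label) for each row of an open file."""
    for row in csv.DictReader(handle):
        yield row["Queryid"], row["LocusID"], classify_hit(row)


# The three example rows from this README, rewritten as comma-separated text
# (ASSUMPTION: the real file uses commas and this exact header row).
EXAMPLE = """\
Queryid,GI,ProteinID,LocusID,%_identity,align_len,mismatches,gap_openings,q.start,q.end,s.start,s.end,e-value,bit_score
TF1413,375254175,YP_005013342.1,BFO_0392,100,210,0,0,1,210,1,210,4.00E-110,390
TF0511,375256329,YP_005015496.1,BFO_2834,100,57,0,0,2,58,1,57,3.00E-27,112
TF1116,375255861,YP_005015028.1,BFO_2180,35.82,67,36,2,43,106,221,283,8.00E-06,42
"""

for oralgen_id, locus_id, label in map_gene_ids(io.StringIO(EXAMPLE)):
    print(oralgen_id, locus_id, label)
```

To run it on a downloaded table, replace io.StringIO(EXAMPLE) with a file
opened via open("lanl_gene_id_2_ncbi_blastp.csv", newline=""). The sketch does
not check how much of each protein the alignment covers, because the columns
shown in the example do not include protein lengths, so rows labeled
"identical_offset" or "not_identical" are best reviewed by hand.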