####################################################
# Bacterial 16S rRNA Sequence Identification Pipeline
# Copyright Tsute Chen, The Forsyth Institute
####################################################

####### Pipeline Description #######

1. Preprocess reads: find the unique sequences and associate them with their read IDs and sample IDs.

2. Submit all unique reads for a BLASTN search against the 16S rRNA reference sequence set, using the BLASTN parameters:
     -q -5 -r 4 -G 5 -E 5

3. Parse the BLASTN results using the following criteria:
   a. For each read, the alignment length must be >=90% of the read length.
   b. For each read, the best hit to the references is determined by the highest percent identity AND score (reflecting alignment length). If a single best hit is found, or if there are multiple equal hits to more than one reference sequence but these reference sequences all represent the same species, then that single species is recorded.
   c. If a read hits multiple reference sequences that represent multiple species with equal percent identity and score, all of the species are recorded in the original results. A consensus taxonomy level for these species is also determined (it can be Genus, Family, or a higher taxonomy level, depending on which species are hit). In this case the read cannot be assigned to a single species because of the tied percent identity and score.

4. Record the best hits in separate tables for various percent identity cutoffs: 100%, 99%, 98%, 97%, 95%, 90%, and all (no minimum percent identity cutoff).

5. Read count data from the 98% cutoff were used to calculate combined counts at different taxonomy levels, and the percent read counts by sample were used to plot the stacked column charts.

####### File Description #######

1. 00_README.txt: This file

2. reference.fasta: 16S rRNA reference sequences used in this analysis

3. reference.taxonomy: Reference sequence taxonomy in Mothur format

4. DEMO.all.raw.counts.xlsx
   Original BLASTN parsing results.
   There are 8 separate spreadsheets: 100.00%, 99.00%, 98.50%, 98.00%, 97.00%, 95.00%, 90.00%, and All. Each contains the BLASTN hits of the reads to the reference sequences with at least the indicated percent identity cutoff, together with the taxonomy assignment results.

   Columns in the Excel spreadsheets:
     Taxon_ID: The species-level taxonomy IDs that the reads matched.
     Consensus_Tax_Level: The lowest taxonomy level on which all of the reads' hits agree. DPCOFGS denotes Domain, Phylum, Class, Order, Family, Genus, and Species, respectively. For example, DPCOFGS means that all the reads hit the same species; DPCOF means that all the reads hit the same Family but multiple Genera.
     Consensus_Taxonomy: The detailed taxonomy that the reads hit. Different taxonomy levels are separated by the ";" symbol, and terms at the same taxonomy level are separated by the "|" symbol when the reads hit multiple terms at that level. For example, Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;pumilus|subtilis means that the reads hit both Bacillus pumilus and Bacillus subtilis, hence their consensus taxonomy level is DPCOFG.
     Species: The scientific names (Genus + Species) of all the species that the reads matched.
     The rest of the columns are the read counts of the individual samples.

5. DEMO.98.raw.count.xlsx
   Same as the 98.00% spreadsheet above. Data at this cutoff were used to compile the read counts at the various taxonomy levels, or at the combined level, in the next two Excel files.

6. DEMO.98.count.xlsx
   Read counts from the 98% identity cutoff at the following taxonomy levels: Phylum, Class, Order, Family, Genus, Species, and Combined (mixed level).

7. DEMO.98.per.xlsx
   The above count data were converted to percentages of the total read count (including unassigned reads) of each sample. This is a way of normalizing the data. The stacked column charts were then plotted based on these percentages.
   Since Excel allows no more than 255 rows of data in a column chart, only the first 255 rows were plotted for taxonomy levels with more than 255 rows. The rows were ordered by the total sum of each row.

8. read_mapping_files.zip: Zip-compressed file containing:
     DEMO.reads_matched.98.taxonomy.txt: read IDs with their assigned taxonomy
     DEMO.reads_unmatched.98.groups.txt: unassigned read IDs in Mothur group file format
     DEMO.reads_unmatched.98.names.txt: unassigned read IDs in Mothur name file format

9. unique_reads.tar.gz: Gzip-compressed file containing the unique sequences from the original reads:
     reads.fasta: unique read sequences; these were sent for the BLASTN search
     reads.groups: unique read sequence IDs with their associated group (sample) IDs
     reads.names: unique read sequence IDs with their associated identical read IDs

   Because a BLASTN search of millions of reads is very time consuming, identical sequences from the original read data were searched only once. Depending on the sequencing quality and the sample community, up to 1/3 of the original sample reads can be repeated sequences. Hence searching only the unique sequences saves a significant amount of time. The total read counts still account for these repeated sequences. For example, a unique read is searched only once, but if it appears multiple times in one sample or in different samples, the counts are tallied accordingly.
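
   The dereplication and count tallying described above can be sketched in Python as follows. This is only an illustrative sketch, not the pipeline's actual code: the function names dereplicate and tally_counts, the in-memory data structures, and the example read IDs, sample IDs, and sequences are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def dereplicate(reads):
    """Collapse identical sequences so that each one is searched only once.

    reads: iterable of (read_id, sample_id, sequence) tuples (hypothetical layout).
    Returns three dicts:
      unique  : representative read ID -> sequence
      names   : representative read ID -> list of all read IDs with that sequence
      samples : every read ID -> its sample ID
    """
    rep_by_seq = {}
    unique, names, samples = {}, defaultdict(list), {}
    for read_id, sample_id, seq in reads:
        # The first read seen with a given sequence becomes its representative.
        rep = rep_by_seq.setdefault(seq, read_id)
        if rep == read_id:
            unique[rep] = seq
        names[rep].append(read_id)
        samples[read_id] = sample_id
    return unique, dict(names), samples


def tally_counts(assignments, names, samples):
    """Expand per-unique-read assignments back into per-sample read counts.

    assignments: representative read ID -> assigned taxon (unassigned reads omitted).
    """
    counts = defaultdict(Counter)
    for rep, taxon in assignments.items():
        for read_id in names[rep]:
            counts[taxon][samples[read_id]] += 1
    return {taxon: dict(per_sample) for taxon, per_sample in counts.items()}


# Hypothetical example: three identical reads across two samples, one distinct read.
reads = [
    ("r1", "S1", "ACGT"),
    ("r2", "S1", "ACGT"),
    ("r3", "S2", "ACGT"),
    ("r4", "S2", "TTGA"),
]
unique, names, samples = dereplicate(reads)

# Only the two unique sequences would be searched; suppose r1 was assigned
# to a species and r4 stayed unassigned.
assignments = {"r1": "Bacillus subtilis"}
print(tally_counts(assignments, names, samples))
# -> {'Bacillus subtilis': {'S1': 2, 'S2': 1}}
```

   In this sketch only two of the four reads would be searched, yet the three identical reads are still counted in their own samples.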