####################################################
# Bacterial 16S rRNA Sequence Identification Pipeline
# Copyright Tsute Chen, The Forsyth Institute
####################################################

####### Pipeline Description ####### 

1. Preprocess reads - find unique sequences and associate them with read
IDs and sample IDs

2. Submit all unique reads for a BLASTN search against the 16S rRNA
reference sequence set, using the BLASTN parameters:
-q -5 -r 4 -G 5 -E 5

3. Parse the BLASTN results using the following criteria:

  a. for each read, the alignment length must be >=90% of the read length
  
  b. for each read, the best hit to the references is determined by the highest percent
  identity AND score (reflecting alignment length). If a single best hit is found,
  or if there are equal best hits to more than one reference sequence but these reference
  sequences all represent the same species, then that single species is recorded.
  
  c. if a read hits multiple reference sequences that represent multiple species with
  equal percent identity and score, all of the species are recorded in the original results.
  A consensus taxonomy level for these species is also determined (it can
  be Genus, Family, or a higher taxonomy level, depending on which species are hit).
  In this case, the read cannot be assigned to a single species because of the tied
  percent identity and score.
 
4. Record the best hits in separate tables for various percent identity cutoffs:
100%, 99%, 98%, 97%, 95%, 90%, and all (no minimum % cutoff)

5. Read count data from the 98% cutoff are used to calculate combined counts
   at different taxonomy levels. The percent read counts by sample are used to chart
   the stacked-column graphs.
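
The parsing rules in step 3 can be sketched in Python as below. This is only an
illustration, not the pipeline's own code: the tuple layout of a hit, the function
names best_hits and assign_species, and the ref_to_species lookup are all
assumptions, and ranking hits by (percent identity, score) is one possible reading
of "highest percent identity AND score".

```python
from collections import defaultdict

def best_hits(hits, read_lengths, min_cov=0.90):
    """Return {read_id: [ref_id, ...]} holding the tied best references.

    hits: iterable of (read_id, ref_id, pct_identity, aln_len, score)
          (hypothetical layout, not the BLASTN output format itself).
    read_lengths: {read_id: read length in bases}.
    """
    kept = defaultdict(list)
    for read_id, ref_id, pct, aln_len, score in hits:
        # Rule 3a: alignment must cover at least 90% of the read.
        if aln_len >= min_cov * read_lengths[read_id]:
            kept[read_id].append((pct, score, ref_id))

    best = {}
    for read_id, cands in kept.items():
        # Rule 3b: assumed ranking is identity first, then score.
        top = max((pct, score) for pct, score, _ in cands)
        best[read_id] = sorted(ref for pct, score, ref in cands
                               if (pct, score) == top)
    return best

def assign_species(best, ref_to_species):
    """Map each read to the sorted list of species among its best hits.

    One species in the list: rule 3b (single species recorded).
    More than one: rule 3c (tied, a consensus level is needed).
    """
    return {read_id: sorted({ref_to_species[r] for r in refs})
            for read_id, refs in best.items()}

if __name__ == "__main__":
    demo_hits = [
        ("read1", "refA", 99.0, 250, 480),
        ("read1", "refB", 99.0, 250, 480),
        ("read1", "refC", 98.0, 250, 470),
        ("read2", "refA", 100.0, 100, 200),  # fails the 90% length rule
    ]
    demo_lengths = {"read1": 253, "read2": 250}
    demo_species = {"refA": "Bacillus subtilis",
                    "refB": "Bacillus pumilus",
                    "refC": "Bacillus cereus"}
    print(assign_species(best_hits(demo_hits, demo_lengths), demo_species))
```

A read whose list contains a single species corresponds to case 3b; a longer list
corresponds to case 3c.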


####### File Description ####### 


1. 00_README.txt: This file

2. reference.fasta: 16S rRNA reference sequences used in this analysis
3. reference.taxonomy: Reference sequence taxonomy in Mothur format


4. DEMO.all.raw.counts.xlsx
  
Original BLASTN parsing results. There are 8 separate spreadsheets:
100.00%, 99.00%, 98.50%, 98.00%, 97.00%, 95.00%, 90.00%, and All
Each contains the read BLASTN hits to the reference sequences with at least
the indicated percent identity, together with the taxonomy assignment results.

Columns in the Excel spreadsheets:

  Taxon_ID: The species-level taxonomy IDs that the reads matched

  Consensus_Tax_Level: The lowest taxonomy level at which all hits of the reads agree.
    DPCOFGS denotes Domain, Phylum, Class, Order, Family, Genus and Species respectively.
    For example, DPCOFGS means all the reads hit the same species;
    DPCOF means all the reads hit the same Family, but multiple Genera.
    
  Consensus_Taxonomy: Detailed taxonomy that the reads hit. Different taxonomy levels
  are separated by the ";" symbol, and terms of the same taxonomy level are separated by
  the "|" symbol if the reads hit multiple terms at that level.
  For example,
  	Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;pumilus|subtilis
  	means the reads hit both Bacillus pumilus and Bacillus subtilis, hence their consensus
  	taxonomy level is DPCOFG.

  Species: The scientific names (Genus+Species) of all the species that the reads matched to.

The rest of the columns are the read counts of individual samples.
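
As an illustration of how the Consensus_Tax_Level and Consensus_Taxonomy columns
relate, here is a minimal Python sketch. It assumes each hit species is given as a
list of seven terms (Domain to Species); the function name consensus is
hypothetical, and joining the distinct terms of every level with "|" is an
assumption based on the Bacillus example above, not the pipeline's documented behavior.

```python
LEVEL_CODES = "DPCOFGS"  # Domain, Phylum, Class, Order, Family, Genus, Species

def consensus(lineages):
    """Return (level_code, taxonomy_string) for a list of 7-term lineages."""
    levels = list(zip(*lineages))  # one tuple of terms per taxonomy level

    # Count the leading levels on which every lineage agrees.
    agree = 0
    for terms in levels:
        if len(set(terms)) != 1:
            break
        agree += 1

    # Levels are joined with ";", distinct terms within a level with "|".
    taxonomy = ";".join("|".join(sorted(set(terms))) for terms in levels)
    return LEVEL_CODES[:agree], taxonomy

if __name__ == "__main__":
    base = ["Bacteria", "Firmicutes", "Bacilli", "Bacillales",
            "Bacillaceae", "Bacillus"]
    print(consensus([base + ["pumilus"], base + ["subtilis"]]))
```

With the two Bacillus lineages shown, the level code comes out as DPCOFG and the
taxonomy string matches the example given for Consensus_Taxonomy.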


5. DEMO.98.raw.count.xlsx
  
Same as the 98% spreadsheet above. Data from this cutoff were used to compile
the read counts at the various taxonomy levels, or at the combined level, in the
next two Excel files.

6. DEMO.98.count.xlsx
  
Read counts from the 98% identity cutoff at the following taxonomy levels:
Phylum, Class, Order, Family, Genus, Species, and Combined (mixed level).

7. DEMO.98.per.xlsx
  
The above count data were converted to percentages of the total read count (including
unassigned reads) of each sample. This is a way of normalizing the data. The stacked
column charts were then plotted from the percentages. Since Excel allows no
more than 255 rows of data in a column chart, only the first 255 rows were plotted
for those taxonomy levels with more than 255 rows. The rows were ordered
by the total sum of each row.
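
The normalization and row selection described above can be sketched as follows.
This is a rough Python illustration only: the names to_percent and top_rows and
the dictionary layout are hypothetical, and sorting the rows in descending order
of their sum is an assumption about how "ordered by the total sum" was applied.

```python
def to_percent(counts, totals):
    """Convert per-sample read counts to percentages.

    counts: {taxon: [count in sample 1, count in sample 2, ...]}
    totals: total reads per sample, including unassigned reads.
    """
    return {
        taxon: [100.0 * c / t if t else 0.0 for c, t in zip(row, totals)]
        for taxon, row in counts.items()
    }

def top_rows(table, limit=255):
    """Order rows by their sum (largest first, assumed) and keep `limit` rows."""
    ranked = sorted(table.items(), key=lambda kv: sum(kv[1]), reverse=True)
    return ranked[:limit]

if __name__ == "__main__":
    demo_counts = {"Streptococcus": [50, 10], "Veillonella": [25, 30]}
    demo_totals = [100, 40]
    for taxon, row in top_rows(to_percent(demo_counts, demo_totals)):
        print(taxon, row)
```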


8. read_mapping_files.zip: Zip-compressed file containing:

  DEMO.reads_matched.98.taxonomy.txt: read IDs with their assigned taxonomy
  DEMO.reads_unmatched.98.groups.txt: unassigned read IDs in Mothur group file format
  DEMO.reads_unmatched.98.names.txt: unassigned read IDs in Mothur name file format


9. unique_reads.tar.gz: Gzip-compressed tar archive containing the unique sequences from the original reads

  reads.fasta: unique read sequences, these were sent for BLASTN search
  reads.groups: unique read sequence IDs with associated group (sample) IDs.
  reads.names: unique read sequence IDs with associated identical read IDs.

  Since a BLASTN search of millions of reads is very time consuming, identical sequences from the
  original read data were searched only once. Depending on the sequencing quality and sample community,
  up to 1/3 of the original sample reads can be repeated sequences. Hence searching only
  the unique sequences saves a significant amount of time. The total read counts still account for
  these repeated sequences. For example, a unique read is searched only once, but if it appears
  multiple times in one sample or in different samples, the counts are tallied accordingly.
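
  The idea can be sketched in Python as below. This is an illustration under
  assumptions, not the pipeline's code: the (read_id, sample_id, sequence) input
  layout, the function names dereplicate and sample_counts, and the choice of the
  first read seen as the representative ID are all hypothetical.

```python
from collections import Counter

def dereplicate(reads):
    """Group identical sequences so each is searched only once.

    reads: iterable of (read_id, sample_id, sequence).
    Returns {representative_read_id: (sequence, [(read_id, sample_id), ...])}.
    """
    by_seq = {}
    for read_id, sample_id, seq in reads:
        by_seq.setdefault(seq, []).append((read_id, sample_id))

    unique = {}
    for seq, members in by_seq.items():
        rep_id = members[0][0]  # first read seen stands for the group (assumed)
        unique[rep_id] = (seq, members)
    return unique

def sample_counts(unique, rep_id):
    """Tally how many original reads each sample contributes to a unique read."""
    return Counter(sample_id for _, sample_id in unique[rep_id][1])

if __name__ == "__main__":
    demo_reads = [
        ("r1", "S1", "ACGT"),
        ("r2", "S1", "ACGT"),
        ("r3", "S2", "ACGT"),
        ("r4", "S2", "TTGA"),
    ]
    uniq = dereplicate(demo_reads)
    for rep in uniq:
        print(rep, dict(sample_counts(uniq, rep)))
```

  Only the representative sequences would be sent to BLASTN; the taxonomy assigned
  to a representative is then credited to every sample according to these tallies.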