16S_Soil_Microbiome2017_FA19

R Packages

The following code uses several packages including vegan for community ecology, tidyr and plyr for organizing data tables and manipulating strings of text, and ggplot for plotting data.

Project Summary

The following code is used to analyze 16S data from bulk soils collected from 47 Juglans nigra trees. Raw reads were generated on the Illumina Miseq platform (2x250) and processed into OTUs using the bioinformatics pipeline Mothur. Following sequence processing in Mothur a total of 37,604 OTUs were generated from 181,195 unique sequences (2,256,393 sequences total from 2,933,074 reads prior to processing).Taxonomic assignments were made in Mothur using the SILVA database. OTUs were generated in MOTHUR using a 97% similarity threshold for sequences aligned to the V4 region of the 16S rRNA region.

Data Import

Below raw OTU tables (data has not yet been rarefied), metadata which indicates the state from which the sample was collected and clone ID, and the taxonomic classifications for each OTU are imported into R.

Rarefaction and Singleton Removal

To evaluate coverage and differences in sequencing depth between samples, I generate rarefaction curves with sequencing depth on the x-axis and number of OTUs on the y-axis. These curves can be used to evaluate where to set the rarefaction cut-off to reduce sampling biases when calculating alpha-diversity measures such as diversity and observed richness. Additionally, to reduce the effects of rare taxa on ordinations, I remove singletons here as well.

Rarefaction Curves

Below I generate the rarefaction curves using the rarecurve function from vegan. This will take sometime to run, particularly on larger OTU tables. The output is then subset so that individual samples can be plotted in ggplot.

#I then create an object that spans the range of the OTU richness. This will be used to plot the vertical line in my ggplot rarefaction curve. 
y<-(0:6500)

#I then create objects containing the x and y coordinates values for each sample. 

sequencesoil47<-attr(rarebacsoilcurve[[47]],which="Subsample")
sequencesoil46<-attr(rarebacsoilcurve[[46]],which="Subsample")
sequencesoil45<-attr(rarebacsoilcurve[[45]],which="Subsample")
sequencesoil44<-attr(rarebacsoilcurve[[44]],which="Subsample")
sequencesoil43<-attr(rarebacsoilcurve[[43]],which="Subsample") 
sequencesoil42<-attr(rarebacsoilcurve[[42]],which="Subsample")
sequencesoil41<-attr(rarebacsoilcurve[[41]],which="Subsample")
sequencesoil40<-attr(rarebacsoilcurve[[40]],which="Subsample")
sequencesoil39<-attr(rarebacsoilcurve[[39]],which="Subsample")
sequencesoil38<-attr(rarebacsoilcurve[[38]],which="Subsample")
sequencesoil37<-attr(rarebacsoilcurve[[37]],which="Subsample")
sequencesoil36<-attr(rarebacsoilcurve[[36]],which="Subsample")
sequencesoil35<-attr(rarebacsoilcurve[[35]],which="Subsample")
sequencesoil34<-attr(rarebacsoilcurve[[34]],which="Subsample")
sequencesoil33<-attr(rarebacsoilcurve[[33]],which="Subsample")
sequencesoil32<-attr(rarebacsoilcurve[[32]],which="Subsample")
sequencesoil31<-attr(rarebacsoilcurve[[31]],which="Subsample")
sequencesoil30<-attr(rarebacsoilcurve[[30]],which="Subsample")
sequencesoil29<-attr(rarebacsoilcurve[[29]],which="Subsample")
sequencesoil28<-attr(rarebacsoilcurve[[28]],which="Subsample")
sequencesoil27<-attr(rarebacsoilcurve[[27]],which="Subsample")
sequencesoil26<-attr(rarebacsoilcurve[[26]],which="Subsample")
sequencesoil25<-attr(rarebacsoilcurve[[25]],which="Subsample")
sequencesoil24<-attr(rarebacsoilcurve[[24]],which="Subsample")
sequencesoil23<-attr(rarebacsoilcurve[[23]],which="Subsample")
sequencesoil22<-attr(rarebacsoilcurve[[22]],which="Subsample")
sequencesoil21<-attr(rarebacsoilcurve[[21]],which="Subsample")
sequencesoil20<-attr(rarebacsoilcurve[[20]],which="Subsample")
sequencesoil19<-attr(rarebacsoilcurve[[19]],which="Subsample")
sequencesoil18<-attr(rarebacsoilcurve[[18]],which="Subsample")
sequencesoil17<-attr(rarebacsoilcurve[[17]],which="Subsample")
sequencesoil16<-attr(rarebacsoilcurve[[16]],which="Subsample")
sequencesoil15<-attr(rarebacsoilcurve[[15]],which="Subsample")
sequencesoil14<-attr(rarebacsoilcurve[[14]],which="Subsample")
sequencesoil13<-attr(rarebacsoilcurve[[13]],which="Subsample")
sequencesoil12<-attr(rarebacsoilcurve[[12]],which="Subsample")
sequencesoil11<-attr(rarebacsoilcurve[[11]],which="Subsample")
sequencesoil10<-attr(rarebacsoilcurve[[10]],which="Subsample")
sequencesoil9<-attr(rarebacsoilcurve[[9]],which="Subsample")
sequencesoil8<-attr(rarebacsoilcurve[[8]],which="Subsample")
sequencesoil7<-attr(rarebacsoilcurve[[7]],which="Subsample")
sequencesoil6<-attr(rarebacsoilcurve[[6]],which="Subsample")
sequencesoil5<-attr(rarebacsoilcurve[[5]],which="Subsample")
sequencesoil4<-attr(rarebacsoilcurve[[4]],which="Subsample")
sequencesoil3<-attr(rarebacsoilcurve[[3]],which="Subsample")
sequencesoil2<-attr(rarebacsoilcurve[[2]],which="Subsample")
sequencesoil1<-attr(rarebacsoilcurve[[1]],which="Subsample")

#Here I plot the rarefaction curves. 

library(ggplot2)
library(ggpubr)
soilrarecurv<-ggplot()+
  geom_line(aes(x=sequencesoil47,y=rarebacsoilcurve[[47]]), size=1)+
  geom_line(aes(x=sequencesoil46,y=rarebacsoilcurve[[46]]),size=1)+
  geom_line(aes(x=sequencesoil45,y=rarebacsoilcurve[[45]]),size=1 )+
  geom_line(aes(x=sequencesoil44,y=rarebacsoilcurve[[44]]),size=1)+
  geom_line(aes(x=sequencesoil43,y=rarebacsoilcurve[[43]]),size=1)+
  geom_line(aes(x=sequencesoil42,y=rarebacsoilcurve[[42]]),size=1)+
  geom_line(aes(x=sequencesoil41,y=rarebacsoilcurve[[41]]),size=1)+
  geom_line(aes(x=sequencesoil40,y=rarebacsoilcurve[[40]]),size=1)+
  geom_line(aes(x=sequencesoil39,y=rarebacsoilcurve[[39]]), size=1)+  
  geom_line(aes(x=sequencesoil38,y=rarebacsoilcurve[[38]]), size=1)+  
  geom_line(aes(x=sequencesoil37,y=rarebacsoilcurve[[37]]), size=1)+
  geom_line(aes(x=sequencesoil36,y=rarebacsoilcurve[[36]]),size=1)+
  geom_line(aes(x=sequencesoil35,y=rarebacsoilcurve[[35]]),size=1 )+
  geom_line(aes(x=sequencesoil34,y=rarebacsoilcurve[[34]]),size=1)+
  geom_line(aes(x=sequencesoil33,y=rarebacsoilcurve[[33]]),size=1)+
  geom_line(aes(x=sequencesoil32,y=rarebacsoilcurve[[32]]),size=1)+
  geom_line(aes(x=sequencesoil31,y=rarebacsoilcurve[[31]]),size=1)+
  geom_line(aes(x=sequencesoil30,y=rarebacsoilcurve[[30]]),size=1)+
  geom_line(aes(x=sequencesoil29,y=rarebacsoilcurve[[29]]), size=1)+  
  geom_line(aes(x=sequencesoil28,y=rarebacsoilcurve[[28]]), size=1)+
  geom_line(aes(x=sequencesoil27,y=rarebacsoilcurve[[27]]), size=1)+
  geom_line(aes(x=sequencesoil26,y=rarebacsoilcurve[[26]]),size=1)+
  geom_line(aes(x=sequencesoil25,y=rarebacsoilcurve[[25]]),size=1 )+
  geom_line(aes(x=sequencesoil24,y=rarebacsoilcurve[[24]]),size=1)+
  geom_line(aes(x=sequencesoil23,y=rarebacsoilcurve[[23]]),size=1)+
  geom_line(aes(x=sequencesoil22,y=rarebacsoilcurve[[22]]),size=1)+
  geom_line(aes(x=sequencesoil21,y=rarebacsoilcurve[[21]]),size=1)+
  geom_line(aes(x=sequencesoil20,y=rarebacsoilcurve[[20]]),size=1)+
  geom_line(aes(x=sequencesoil19,y=rarebacsoilcurve[[19]]), size=1)+
  geom_line(aes(x=sequencesoil18,y=rarebacsoilcurve[[18]]),size=1)+
  geom_line(aes(x=sequencesoil17,y=rarebacsoilcurve[[17]]),size=1)+
  geom_line(aes(x=sequencesoil16,y=rarebacsoilcurve[[16]]),size=1)+
  geom_line(aes(x=sequencesoil15,y=rarebacsoilcurve[[15]]),size=1)+
  geom_line(aes(x=sequencesoil14,y=rarebacsoilcurve[[14]]), size=1)+
  geom_line(aes(x=sequencesoil13,y=rarebacsoilcurve[[13]]), size=1)+
  geom_line(aes(x=sequencesoil12,y=rarebacsoilcurve[[12]]),size=1)+
  geom_line(aes(x=sequencesoil11,y=rarebacsoilcurve[[11]]),size=1 )+
  geom_line(aes(x=sequencesoil10,y=rarebacsoilcurve[[10]]),size=1)+
  geom_line(aes(x=sequencesoil9,y=rarebacsoilcurve[[9]]),size=1)+
  geom_line(aes(x=sequencesoil8,y=rarebacsoilcurve[[8]]),size=1)+
  geom_line(aes(x=sequencesoil7,y=rarebacsoilcurve[[7]]),size=1)+
  geom_line(aes(x=sequencesoil6,y=rarebacsoilcurve[[6]]),size=1)+
  geom_line(aes(x=sequencesoil5,y=rarebacsoilcurve[[5]]), size=1)+
  geom_line(aes(x=sequencesoil4,y=rarebacsoilcurve[[4]]),size=1)+
  geom_line(aes(x=sequencesoil3,y=rarebacsoilcurve[[3]]),size=1)+
  geom_line(aes(x=sequencesoil2,y=rarebacsoilcurve[[2]]),size=1)+
  geom_line(aes(x=sequencesoil1,y=rarebacsoilcurve[[1]]),size=1)+
  geom_path(aes(x=37329,y=y), linetype=5)+
  #scale_color_manual(values=c("#00CCCC","#336699","#336699","#336699","#663300","#663300","#00CCCC","#CC9933","#CC9933","#CC9933","#0000FF","#0000FF","#006600","#006600"))+
  theme(panel.grid.major=element_blank(),
       panel.grid.minor=element_blank(),
       panel.background = element_blank(),
  panel.border=element_rect(color="black", size=1, fill=NA))+
 theme(axis.title.x=element_text(size=14, face="bold"))+
  theme(axis.title.y=element_text(size=14, face="bold"))+
  theme(axis.text.x=element_text(size=12, face="bold"))+
  theme(axis.text.y=element_text(size=12, face="bold"))+ 
labs(x=("Sequencing Depth"), y=expression(bold(paste("OTU Richness", sep=""))))
soilrarecurv

Sample Rarefaction

Below I rarefy my samples using the rrarefy function from vegan. Rarefaction cut-offs are determined by reviewing the rarefaction curves and identifying the lowest number of sequences that can be retained that minimize sample loss but maximize sequencing depth across samples. Based on the above rarefaction curve, a cut-off of 37,329 sequences per samples.

##     TN272ASBAC   TNWTMB19SBAC      TN55BSBAC      TNLS2SBAC  INWTMCB29SBAC 
##          37329          37684          38265          38442          39419 
## WA130BNL20SBAC  INWTMCB33SBAC     TN272CSBAC  INWTMCB28SBAC   IN55MCB6SBAC 
##          39425          39683          39851          40200          40940 
##   WA130RN4SBAC   WA132RN8SBAC    WA55RN2SBAC IN272MCB10SBAC  WA272RN10SBAC 
##          41540          42393          42612          42632          42834 
##  IN130MCB2SBAC     TN130ASBAC   IN55MCB7SBAC   WA132RN7SBAC     TN132BSBAC 
##          43301          43475          43653          43660          43673 
##      TN55CSBAC  WA55BNL19SBAC     TN130BSBAC IN132MCB17SBAC IN272MCB11SBAC 
##          44219          44795          44871          44903          45177 
##   IN55MCB8SBAC      TN55ASBAC      TNLS3SBAC IN130MCB26SBAC     TN272BSBAC 
##          45805          45902          46124          46558          46730 
##     TN132ASBAC   WA132RN9SBAC  IN130MCB9SBAC      TNLS1SBAC WA272BNL17SBAC 
##          47123          47622          48478          48700          49595 
##   TNWTMB20SBAC  WAWTBNL22SBAC     TN132CSBAC IN132MCB16SBAC  WAWTBNL23SBAC 
##          49818          50268          50973          51569          51933 
##     TN130CSBAC    WA55RN1SBAC   TNWTMB21SBAC IN272MCB24SBAC  WAWTBNL21SBAC 
##          59111          59566          68835          68965          75457 
## IN132MCB27SBAC WA272BNL18SBAC 
##          75813          76472

Relative Abundance Charts

To visualize differences in the relative abundance of different taxa, I reorganize the OTU table and taxonomy strings to create relative abundance stacked bar charts. Below, stacked bar charts are made for phylum, class, and order levels.

Taxonomy Subsetting

To begin, the taxonomy file is subsetted to include only taxonomy information for OTUs present in the rarefied OTU table with singletons removed. The OTU table is first transposed so that OTU IDs are the row names and sample names are the column names. Then the taxonomic assignments are added to the OTU table. Taxonomic strings are then split by semi-colon into individual columns by Phylum, Class, Order, Family, and Genus (SILVA does not go to species level).

#Now that samples have been rarefied I can perform alpha and beta diversity analyses on my data sets. To begin I first reformat my OTU table so that I can do some filtering and add taxonomy information.  
soil.rareotu.nosingles<-read.table("/Volumes/AaronOnufrakMac/researchprojects_2019.2022_ajo/j.nigra_microbiome_2017_ajo_gmw/soil16sstatistics_jnigra17_2019_ajo/soil16smothur_jnigra17_microbiome/soil16sotutable.rarefied.nosingleton_jnigra17_fa19_ajo.shared", header=TRUE, sep="\t")

#I now transpose my OTU table so that I can begin to add in taxonomy information. 
t.soil.rareotu.nosingles<-as.data.frame(t(soil.rareotu.nosingles))

#The below function is asking specifically for the OTU IDs. These are the row names. I will use these to subset the taxonomy file. 
t.soil.rareotu.nosingleslabs<-labels(t.soil.rareotu.nosingles)
soil.taxrare<-subset(taxonomy.soil, rownames(taxonomy.soil) %in% t.soil.rareotu.nosingleslabs[[1]])

#Now I am subsetting the new taxonomy file to include only the taxonomy information stored in column 2. 
soil.taxrareinfo<-(soil.taxrare[,2])

#Below I am using the separate function from tidyr to split the taxonomy strings into columns by semi-colon so that I can rename the OTUs to give them more meaning for downstream analyses.
library(tidyr)

#The SILVA database only provides classifications up to genus level, new columns are created only up to Genus level.
soil.taxonomylabs<-c("kingdom","phylum","class","order","family","genus")

#Below I create a new data frame with a column containing taxonomy information.
t.soil.rareotu.nosinglestax<-data.frame(taxonomy=soil.taxrareinfo,t.soil.rareotu.nosingles)

#I then separate the taxonomy strings into new columns labeled with the taxonomy labels saved in the taxonomylabs object. Strings are being separated based on the presence of semi-colons. 
t.soil.rareotu.nosinglestaxsep<-separate(t.soil.rareotu.nosinglestax,into=soil.taxonomylabs,col=taxonomy,sep=";")

Relative Abundance OTU Table

I then create a relative abundance OTU table. This table contains taxonomic information and can be searched to look at specific OTUs more easily.

#Now that samples have been rarefied I can perform alpha and beta diversity analyses on my data sets. To begin I first reformat my OTU table so that I can do some filtering and add taxonomy information.  
soil.rareotu.nosingles<-read.table("/Volumes/AaronOnufrakMac/researchprojects_2019.2022_ajo/j.nigra_microbiome_2017_ajo_gmw/soil16sstatistics_jnigra17_2019_ajo/soil16smothur_jnigra17_microbiome/soil16sotutable.rarefied.nosingleton_jnigra17_fa19_ajo.shared", header=TRUE, sep="\t")

soil.rareotu.nosingles.table<-decostand(soil.rareotu.nosingles,method="total")

#I now transpose my OTU table so that I can begin to add in taxonomy information. 
t.soil.rareotu.nosingles.table<-as.data.frame(t(soil.rareotu.nosingles.table))

#The below function is asking specifically for the OTU IDs. These are the row names. I will use these to subset the taxonomy file. 
t.soil.rareotu.nosingleslabs<-labels(t.soil.rareotu.nosingles.table)
soil.taxrare<-subset(taxonomy.soil, rownames(taxonomy.soil) %in% t.soil.rareotu.nosingleslabs[[1]])

#Now I am subsetting the new taxonomy file to include only the taxonomy information stored in column 2. 
soil.taxrareinfo<-(soil.taxrare[,2])

#Below I am using the separate function from tidyr to split the taxonomy strings into columns by semi-colon so that I can rename the OTUs to give them more meaning for downstream analyses.
library(tidyr)

#The SILVA database only provides classifications up to genus level, new columns are created only up to Genus level.
soil.taxonomylabs<-c("kingdom","phylum","class","order","family","genus")

#Below I create a new data frame with a column containing taxonomy information.
t.soil.rareotu.nosinglestax.table<-data.frame(taxonomy=soil.taxrareinfo,t.soil.rareotu.nosingles.table)

#I then separate the taxonomy strings into new columns labeled with the taxonomy labels saved in the taxonomylabs object. Strings are being separated based on the presence of semi-colons. 
t.soil.rareotu.nosinglestaxsep.table<-separate(t.soil.rareotu.nosinglestax.table,into=soil.taxonomylabs,col=taxonomy,sep=";")

library(DT)
datatable(t.soil.rareotu.nosinglestaxsep.table,fillContainer = T, height="500dpx")

Phylum Relative Abundance Chart

Below I generate a phylum relative abundance plot. I first extract the phylum information. I then clean the taxa names to remove assignment confidence values. OTU raw abundance is then summed by phylum and state, phyla that represent less than 1% of the total community are grouped into other, and relative abundance is calculated. Stacked bar charts are made using the melt function and ggplot.

#Below I create a data frame that consists of phylum assignment and the rarefied OTU table. 
t.soil.raretabphy<-data.frame(phylum=t.soil.rareotu.nosinglestaxsep$phylum,t.soil.rareotu.nosingles)

#I then remove the assignment confidence values contained in parentheses found in each taxonomic string using the sub function.
t.soil.raretabphy$phylum<-sub("\\(.*)","",x=t.soil.raretabphy$phylum)

#The following code is for the generation of phylum relative abundance stacked bar charts. I first sum each OTU by the phylum it belongs to.
library(plyr)
t.soil.rareotutabphy<-as.data.frame(ddply(t.soil.raretabphy, .(phylum),colwise(sum)))
rownames(t.soil.rareotutabphy)<-make.names(t.soil.rareotutabphy$phylum)
t.soil.rareotutabphy<-t.soil.rareotutabphy[,2:48]

#I then transpose the data frame so that sample names are now row names and column names are OTU names. 
soil.rareotutabphy<-as.data.frame(t(t.soil.rareotutabphy))
soil.rareotutabphy.state<-data.frame(state=soil.met.rare$State,soil.rareotutabphy)

#I then sum each OTU by state and subset to exclude sample names. This is because the following functions only work on matrices containing numeric data. 
soil.rareotutabphy.statesum<-as.data.frame(ddply(soil.rareotutabphy.state, .(state),colwise(sum)))
soil.phylumcols<-soil.rareotutabphy.statesum[,2:46]

#I am now creating an other category. Other includes those OTUs belonging to a phylum that comprises less than 1% of the total community. This new object is referred to as phyothers.
soil.phyoth<-soil.phylumcols[,colSums(soil.phylumcols)/sum(soil.phylumcols)<=0.01]
soil.phyother<-rowSums(soil.phyoth)

#I then create an object that contains all of the phyla that comprise more than 1% of the total community.
soil.phyreg<-soil.phylumcols[,colSums(soil.phylumcols)/sum(soil.phylumcols)>0.01]

#I then create a new dataframe containingt the state information, the other column, and the remaining phyla.
soil.phytot2<-data.frame(state=soil.rareotutabphy.statesum$state,soil.phyreg,Other=soil.phyother)

#These values are then converted to relative abundance using the decostand function from vegan.
soil.phyrelabund<-decostand(soil.phytot2[,2:13],method="total")
soil.phyrelabund<-data.frame(state=soil.rareotutabphy.statesum$state,soil.phyrelabund)

#Below I use the melt function from the data.table package to reformat the data into stacked bar graph format.
library(data.table)
library(ggplot2)
library(ggpubr)
library(RColorBrewer)

soil.phyrelabundmelt<-melt(soil.phyrelabund, id.vars="state", variable.name="Phylum")
soil.colors.n.phylum <- length(unique(soil.phyrelabundmelt[,'Phylum']))

soil.phyrel<-ggplot(soil.phyrelabundmelt, aes(x=state, y=value, fill=Phylum))+
  geom_bar(stat="identity", show.legend=TRUE, color="black")+
  scale_fill_manual(values=colorRampPalette(brewer.pal(9, 'Set1'))(soil.colors.n.phylum)) +
  xlab("State") +
  ylab("Relative Abundance") +
  theme(panel.border = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.text.x = element_text(size = 14), axis.ticks.x = element_blank(), axis.text.y = element_text(size = 14), axis.line.y.left = element_line(), axis.title.y = element_text(size = 14))+
   scale_y_continuous(breaks=c(0,0.25,0.50,0.75,1),limits=c(0,1.05))+
  geom_text(data=NULL,aes(x=0.50, y=1.05,label="16S Soil"),hjust=0,colour="black")
  
ggplotly(soil.phyrel,tooltips=c("Phylum"))

Class Relative Abundance Chart

Below I generate a class relative abundance plot. I first extract the class information from the OTU table that contains the taxonomy information. I then clean the taxa names to remove assignment confidence values and other non-taxonomic associated string portions from the class assignment. I then create an unknown group which consists of uncultured bacteria and bacteria unclassified at the class level. OTU raw abundance is then summed by class and state, classes that represent less than 1% of the total community are grouped into other, and relative abundance is calculated. Stacked bar charts are made using the melt function from ggplot.

#The following chunk creates a relative abundance bar chart at the class level. 
library(plyr)
t.soil.raretabcls<-data.frame(class=t.soil.rareotu.nosinglestaxsep$class,t.soil.rareotu.nosingles)

#What I am doing below is cleaning up the class names. These include assignment confidence levels, _or, and _.
t.soil.raretabcls$class<-sub("\\(.*)","",x=t.soil.raretabcls$class)
t.soil.raretabcls$class<-sub("_or","",x=t.soil.raretabcls$class)
t.soil.raretabcls$class<-sub("_"," ",x=t.soil.raretabcls$class)

#I then sum the OTUs by class and make the row names the class name. 
t.soil.rareotutabcls<-as.data.frame(ddply(t.soil.raretabcls, .(class),colwise(sum)))
rownames(t.soil.rareotutabcls)<-make.names(t.soil.rareotutabcls$class)
t.soil.rareotutabcls<-t.soil.rareotutabcls[,2:48]

#I then transpose the data frame and sum each class by state. 
soil.rareotutabcls<-as.data.frame(t(t.soil.rareotutabcls))
soil.rareotutabcls.state<-data.frame(state=soil.met.rare$State,soil.rareotutabcls)
soil.rareotutabcls.statesum<-as.data.frame(ddply(soil.rareotutabcls.state, .(state),colwise(sum)))

#At the class level, SILVA may classify things as unknown class, phylum.unclassified, or uncultured. These don't provide a lot of information and thus are grouped into a knew group called unknown. 
soil.clsunknown<-soil.rareotutabcls.statesum[,grep(".unclassified|Unknown.class|uncultured",names(soil.rareotutabcls.statesum))]
soil.clsunknown<-rowSums(soil.clsunknown)

#I then use the same grep function as above to pull out the class with meaningful information. By setting invert=TRUE I select items lacking the strings in the grep command.
soil.clstot<-soil.rareotutabcls.statesum[,grep(".unclassified|Unknown.class|uncultured",names(soil.rareotutabcls.statesum), invert=TRUE)]

#I then subset the data so that only numerical data is present and creating a new class of objects, other, that includes OTUs from classes that represent 1% or less of the total community. 
soil.clscols<-soil.clstot[,2:125]
soil.clsoth<-soil.clscols[,colSums(soil.clscols)/sum(soil.clscols)<=0.01]
soil.clsother<-rowSums(soil.clsoth)
soil.clsreg<-soil.clscols[,colSums(soil.clscols)/sum(soil.clscols)>0.01]

#I then create a new data frame that contains all of the class data and determine relative abundance using the decostand function from vegan. 
soil.clstot2<-data.frame(state=soil.clstot$state,soil.clsreg,Unclassified=soil.clsunknown,Other=soil.clsother)
soil.clsrelabund<-decostand(soil.clstot2[,2:23],method="total")
soil.clsrelabund<-data.frame(state=soil.rareotutabcls.statesum$state,soil.clsrelabund)

#Below I use the melt function from the data.table package to reformat the data into stacked bar graph format.
library(data.table)
library(ggplot2)
library(ggpubr)
library(RColorBrewer)
soil.clsrelabundmelt<-melt(soil.clsrelabund, id.vars="state", variable.name="class")

colorCount<- length(unique(soil.clsrelabundmelt[,'class']))
soil.cls.getPalette=colorRampPalette(brewer.pal(9,"Set1"))

#Below I generate the relative abundance bar charts in ggplot
soil.clsrel<-ggplot(soil.clsrelabundmelt, aes(x=state, y=value, fill=class))+
  geom_bar(stat="identity",show.legend=TRUE,color="black")+
  scale_fill_manual(values=soil.cls.getPalette(colorCount))+
  xlab("State") +
  ylab("Relative Abundance") +
  theme(panel.border = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.text.x = element_text(size = 14), axis.ticks.x = element_blank(), axis.text.y = element_text(size = 14), axis.line.y.left = element_line(), axis.title.y = element_text(size = 14))+
  scale_y_continuous(breaks=c(0,0.25,0.50,0.75,1),limits=c(0,1.05))+
  geom_text(data=NULL,aes(x=0.50, y=1.05,label="16S Soil"),hjust=0,colour="black")
ggplotly(soil.clsrel,tooltip=c("class"))

Order Relative Abundance Chart

Below I generate an order relative abundance plot. I first extract the order information from the OTU table that contains the taxonomy information. I then clean the taxa names to remove assignment confidence values and other non-taxonomic associated string portions from the order assignment. I then create an unknown group which consists of uncultured bacteria and bacteria unclassified at the order level. OTU raw abundance is then summed by order and state, orders that represent less than 1% of the total community are grouped into other, and relative abundance is calculated. Stacked bar charts are made using the melt function from ggplot.

#The following chunk creates a relative abundance bar chart at the order level.

#What I am doing below is cleaning up the order names. These include assignment confidence levels, _or, and _.
library(plyr)
t.soil.raretabord<-data.frame(order=t.soil.rareotu.nosinglestaxsep$order,t.soil.rareotu.nosingles)
t.soil.raretabord$order<-sub("\\(.*)","",x=t.soil.raretabord$order)
t.soil.raretabord$order<-sub("_or","",x=t.soil.raretabord$order)
t.soil.raretabord$order<-sub("_"," ",x=t.soil.raretabord$order)

#I then sum the OTUs by order and make the row names the order name. 
t.soil.rareotutabord<-as.data.frame(ddply(t.soil.raretabord, .(order),colwise(sum)))
rownames(t.soil.rareotutabord)<-make.names(t.soil.rareotutabord$order)
t.soil.rareotutabord<-t.soil.rareotutabord[,2:48]

#I then transpose the data frame and sum each order by state.
soil.rareotutabord<-as.data.frame(t(t.soil.rareotutabord))
soil.rareotutabord.state<-data.frame(state=soil.met.rare$State,soil.rareotutabord)
soil.rareotutabord.statesum<-as.data.frame(ddply(soil.rareotutabord.state, .(state),colwise(sum)))

#At the class level, SILVA may classify things as unknown class, phylum.unclassified, or uncultured. These don't provide a lot of information and thus are grouped into a knew group called unknown.
soil.ordunknown<-soil.rareotutabord.statesum[,grep(".unclassified|Unknown.Order|uncultured",names(soil.rareotutabord.statesum))]

#I then use the same grep function as above to pull out the order with meaningful information. By setting invert=TRUE I select items lacking the strings in the grep command. 
soil.ordunknownsum<-rowSums(soil.ordunknown)
soil.ordtot<-soil.rareotutabord.statesum[,grep(".unclassified|Unknown.Order|uncultured",names(soil.rareotutabord.statesum), invert=TRUE)]

#I then subset the data so that only numerical data is present and creating a new class of objects, other, that includes OTUs from order that represent 1% or less of the total community.
soil.ordcols<-soil.ordtot[,2:294]
soil.ordoth<-soil.ordcols[,colSums(soil.ordcols)/sum(soil.ordcols)<=0.01]
soil.ordother<-rowSums(soil.ordoth)
soil.ordreg<-soil.ordcols[,colSums(soil.ordcols)/sum(soil.ordcols)>0.01]
soil.pseud<-(soil.ordcols$Pseudomonadales)/sum(soil.ordcols)
#I then create a new data frame that contains all of the order data and determine relative abundance using the decostand function from vegan. 
soil.ordtot2<-data.frame(state=soil.ordtot$state,soil.ordreg,Other=soil.ordother,Unclassified=soil.ordunknownsum)
soil.ordrelabund<-decostand(soil.ordtot2[,2:24],method="total")
soil.ordrelabund<-data.frame(state=soil.rareotutabcls.statesum$state,soil.ordrelabund)

#Below I use the melt function from the data.table package to reformat the data into stacked bar graph format.
library(data.table)
library(ggplot2)
library(ggpubr)
library(RColorBrewer)
soil.ordrelabundmelt<-melt(soil.ordrelabund, id.vars="state", variable.name="order")

colorCount<- length(unique(soil.ordrelabundmelt[,'order']))
soil.ord.getPalette=colorRampPalette(brewer.pal(9,"Set1"))

#Below I generate the relative abundance bar charts in ggplot
soil.ordrel<-ggplot(soil.ordrelabundmelt, aes(x=state, y=value, fill=order))+
  geom_bar(stat="identity",show.legend=TRUE,color="black")+
  scale_fill_manual(values=soil.ord.getPalette(colorCount))+
  xlab("State") +
  ylab("Relative Abundance") +
  theme(panel.border = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.text.x = element_text(size = 14), axis.ticks.x = element_blank(), axis.text.y = element_text(size = 14), axis.line.y.left = element_line(), axis.title.y = element_text(size = 14))+
scale_y_continuous(breaks=c(0,0.25,0.50,0.75,1),limits=c(0,1.05))+
  geom_text(data=NULL,aes(x=0.50, y=1.05,label="16S Soil"),hjust=0,colour="black")
ggplotly(soil.ordrel,tooltip=c("order"))

Taxonomy Filtering and Merging

Taxonomy is first filtered to include only those OTUs present in the rarefied OTU table with no singletons. Then the OTU table is transposed so that OTU ID is the row name and sample ID is the column name. Taxonomy strings are then added to the OTU table and strings are split by semi-colon.

Principal Coordinate Analysis

To visually assess differences in the community composition of soil bacterial communities by clone and state, I perform a principal coordinate analysis (PCoA).

Relative Abundance Calculation

Prior to conducting a PCoA I first convert my OTU table from raw OTU abundances to relative abundance using the decostand function from vegan.

Principal Coordinate Analysis

I then perform the PCoA using the pcoa function from the ape package. Prior to performing the PCoA, a distance matrix needs to be calculated for the OTU table. In this case, I use the bray-curtis method.

Plotting of Principal Coordinate Analysis

Then to plot the PCoA, I use the ggplot function. To use ggplot I need to first pull out the site scores for each sample and generate ellipses.

#We need to first subset our metadata table to include only those samples that passed rarefaction.  
soil.met.rare<-subset(soil.met,soil.met$Group%in% soil.samples)

#Then we extract PCoA site scores (these will be the x and y coordinates for each sample)
soil.bac.pcoavec<-as.data.frame(soil.bac.pcoa$vectors)
soil.bac.pcoasitescores<-data.frame(PC1=soil.bac.pcoavec$Axis.1, PC2=soil.bac.pcoavec$Axis.2)

#Create a new dataframe that includes the site scores from above with metadata from your study. 
soil.bac.pcoagraph<-data.frame(soil.bac.pcoasitescores,PC1=soil.bac.pcoasitescores$PC1, PC2=soil.bac.pcoasitescores$PC2, State=soil.met.rare$State, Clone=soil.met.rare$Clone,group=soil.met$Group)

#This is where you make confidence ellipses. I don't know what all the code means. Just know where to plug in my objects. 
soil.bac.pcoaellipse<-ordiellipse(soil.bac.pcoasitescores,soil.bac.pcoagraph$State, display="sites", kind="sd",draw="none")

df_ell <- data.frame()
for(g in levels(soil.bac.pcoagraph$State)){
df_ell <- rbind(df_ell, cbind(as.data.frame(with(soil.bac.pcoagraph[soil.bac.pcoagraph$State==g,],                                                vegan:::veganCovEllipse(soil.bac.pcoaellipse[[g]]$cov,soil.bac.pcoaellipse[[g]]$center,soil.bac.pcoaellipse[[g]]$scale))) ,State=g))}

#Plot PCoA in ggplot
library(ggplot2)
soil.pcoa<-ggplot(soil.bac.pcoagraph, aes(PC1,PC2, colour=State))+
  #geom_text(aes(label=group))+
  geom_point(aes(shape=Clone), size=3.5)+
  geom_path(data=df_ell, aes(x=PC1, y=PC2, colour=State),size=1, linetype=2)+
 theme(axis.title.x=element_text(size=14, face="bold"))+
  theme(axis.title.y=element_text(size=14, face="bold"))+
  theme(axis.text.x=element_text(size=12, face="bold"))+
  theme(axis.text.y=element_text(size=12, face="bold"))+ 
  scale_color_manual(values=c("#006600","#3399FF","#FF9900"))+
  scale_shape_manual(values=c(16,10,8,7,0,2))+
  scale_x_continuous(breaks=c(-0.60,-0.30,0,0.30,0.60),limits=c(-0.60,0.60))+
  scale_y_continuous(breaks=c(-0.60,-0.30,0,0.30,0.60),limits=c(-.60,.60))+
  theme(panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),
    panel.border=element_rect(colour="black", size=1, fill=NA))+
  theme(panel.grid.major=element_blank(),
       panel.grid.minor=element_blank(),
       panel.background = element_blank(),
  panel.border=element_rect(color="black", size=1, fill=NA))+
  geom_text(data=NULL,aes(x=-0.60,y=0.60,label="16S Soil"),hjust=0,colour="black")
soil.pcoa

PERMANOVA

To formally test how well state and clone explain differences in soil bacteria community composition, I perform a PERMANOVA using the adonis function from vegan.

Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
State 2 2.0599954 1.0299977 9.2177029 0.3014686 0.0001000
Clone 4 0.3035548 0.0758887 0.6791466 0.0444235 0.8984102
Residuals 40 4.4696503 0.1117413 NA 0.6541079 NA
Total 46 6.8332005 NA NA 1.0000000 NA

Richness and Diversity

Then to assess the alpha diversity of our samples, I calculate the observed richness and shannon diversity index using the vegan package. This is done using the rarefied OTU table that contains singletons. Singletons are included to keep sampling depth standard between samples.

Richness

Observed OTU richness is first calculated for each sample. Then due to the unbalanced nature of our study design, we use a type 3 ANOVA.

Sum Sq Df F value Pr(>F)
(Intercept) 58353120.3 1 271.0432117 0.0000000
State 1307072.2 2 3.0355963 0.0620663
clone 488691.1 4 0.5674778 0.6880388
State:clone 834679.0 8 0.4846229 0.8579215
Residuals 6889306.8 32 NA NA

Sum Sq Df F value Pr(>F)
(Intercept) 123304417.80 1 638.5533064 0.0000000
State 5112004.75 2 13.2367015 0.0000387
clone 99047.17 4 0.1282332 0.9712996
Residuals 7723985.86 40 NA NA
diff lwr upr p adj
TN-IN -153.02222 -526.9365 220.8920 0.5836065
WA-IN 630.81905 233.3656 1028.2725 0.0011434
WA-TN 783.84127 402.7126 1164.9699 0.0000341
132-130 110.58360 -499.2636 720.4308 0.9850583
272-130 29.13915 -580.7080 638.9864 0.9999183
55-130 122.69471 -487.1525 732.5419 0.9780435
WT-130 91.21111 -481.6406 664.0628 0.9908401
272-132 -81.44444 -673.0831 510.1942 0.9947458
55-132 12.11111 -579.5276 603.7498 0.9999972
WT-132 -19.37249 -572.7998 534.0548 0.9999764
55-272 93.55556 -498.0831 685.1942 0.9910765
WT-272 62.07196 -491.3554 615.4993 0.9976229
WT-55 -31.48360 -584.9109 521.9437 0.9998364

Sum Sq Df F value Pr(>F)
(Intercept) 312378420 1 1756.94650 0.00e+00
State 5206781 2 14.64255 1.34e-05
Residuals 7823033 44 NA NA
diff lwr upr p adj
TN-IN -153.0222 -510.5703 204.5258 0.5571720
WA-IN 630.8190 250.7621 1010.8760 0.0006350
WA-TN 783.8413 419.3946 1148.2880 0.0000139

Diversity

Shannon diversity is first calculated for each sample. Then due to the unbalanced nature of our study design, we use a type 3 ANOVA.

Sum Sq Df F value Pr(>F)
(Intercept) 129.0868144 1 2951.0386914 0.0000000
State 0.5568037 2 6.3645125 0.0047094
clone 0.1237413 4 0.7072090 0.5929482
State:clone 0.1436735 8 0.4105628 0.9060315
Residuals 1.3997709 32 NA NA

Sum Sq Df F value Pr(>F)
(Intercept) 264.6935053 1 6859.813328 0.0000000
State 1.9610185 2 25.410939 0.0000001
clone 0.1004248 4 0.650654 0.6297693
Residuals 1.5434444 40 NA NA
diff lwr upr p adj
TN-IN 0.0807888 -0.0863575 0.2479351 0.4739357
WA-IN 0.4885664 0.3108976 0.6662351 0.0000001
WA-TN 0.4077776 0.2374063 0.5781488 0.0000025

Sum Sq Df F value Pr(>F)
(Intercept) 663.709482 1 17764.92771 0
State 1.995571 2 26.70685 0
Residuals 1.643869 44 NA NA
diff lwr upr p adj
TN-IN 0.0807888 -0.0831117 0.2446894 0.4620979
WA-IN 0.4885664 0.3143477 0.6627850 0.0000001
WA-TN 0.4077776 0.2407146 0.5748405 0.0000013

Aaron Onufrak

Fall 2019