16S_Soil_Microbiome2017_FA19

R Packages
Project Summary
- Data Import
- Data Subsetting
Rarefaction and Singleton Removal
Relative Abundance Charts
Taxonomy Filtering and Merging
- Renaming of OTUs
Principal Coordinate Analysis
Richness and Diversity
- Richness
- Diversity

R Packages

The following code uses several packages including vegan for community ecology, tidyr and plyr for organizing data tables and manipulating strings of text, and ggplot for plotting data.

#In this chunk I load all of the requisite packages.
library(vegan)
library(ggplot2)
library(ggpubr)
library(tidyr)
library(plyr)
library(data.table)
library(RColorBrewer)
library(ape)
library(car)
library(Hmisc)
library(corrplot)
library(kableExtra)
library(plotly)
library(knitr)
library(DT)

Project Summary

The following code is used to analyze 16S data from bulk soils collected from 47 Juglans nigra trees. Raw reads were generated on the Illumina Miseq platform (2x250) and processed into OTUs using the bioinformatics pipeline Mothur. Following sequence processing in Mothur a total of 37,604 OTUs were generated from 181,195 unique sequences (2,256,393 sequences total from 2,933,074 reads prior to processing).Taxonomic assignments were made in Mothur using the SILVA database. OTUs were generated in MOTHUR using a 97% similarity threshold for sequences aligned to the V4 region of the 16S rRNA region.

Data Import

Below raw OTU tables (data has not yet been rarefied), metadata which indicates the state from which the sample was collected and clone ID, and the taxonomic classifications for each OTU are imported into R.

#First I import the whole OTU table I recieved from Mothur. Note at this point taxonomy is not included, it is just raw OTU counts (columns) by sample or group (rows). Below I indicate that the row.names are in column 2 of the imported file. 
bigfile.sample.soil<-read.table("/Volumes/AaronOnufrakMac/researchprojects_2019.2022_ajo/j.nigra_microbiome_2017_ajo_gmw/soil16sstatistics_jnigra17_2019_ajo/soil16smothur_jnigra17_microbiome/soil16sotutable_jnigra17_fa19_ajo.shared",header=TRUE,sep="\t",fill=TRUE,strip.white=TRUE, row.names=2)

#Next I import the metadata for the particular habitat of interest. In this case I am working with soils. I will use this metadata to subset the larger OTU file above. 
soil.met<-read.table("/Volumes/AaronOnufrakMac/researchprojects_2019.2022_ajo/j.nigra_microbiome_2017_ajo_gmw/soil16sstatistics_jnigra17_2019_ajo/soil16smothur_jnigra17_microbiome/soil16smetadata_jnigra17_su19_ajo.txt",header=TRUE,sep="\t",fill=TRUE,strip.white=TRUE)

#Next I import the taxonomy file. In earlier versions of this code, this contained taxonomic assignments for the whole data set and not just a single habitat. Current version is only 16S soil data.  
taxonomy.soil<-read.table("/Volumes/AaronOnufrakMac/researchprojects_2019.2022_ajo/j.nigra_microbiome_2017_ajo_gmw/soil16sstatistics_jnigra17_2019_ajo/soil16smothur_jnigra17_microbiome/soil16staxonomy_jnigra17_fa19_ajo.cons.taxonomy",header=TRUE,sep="\t",fill=TRUE,strip.white=TRUE,row.names=1)

soil.names<-soil.met$Group

Data Subsetting

Below I subset the OTU table to remove mothur inserted metadata which indicates the percent identity threshold and the number of OTUs for the whole habitat.

#Here I use the subset function to pull out just the samples belonging to soils. 
soil<-subset(bigfile.sample.soil, rownames(bigfile.sample.soil) %in% soil.names)

#Here I am removing any OTU that is not present in any samples. What the below function is doing and summing the columns (OTUs) and any OTU that equals 0 is removed from the dataset. This will hopefully speed up downstream analyses. This line is important when processing multiple habitats that were processed together but analyzed spearately
soil.pure<- soil[,colSums(soil)>0]

#Here I am further subsetting the data. I am removing some MOTHUR metadata that is useless.  
soil.pureotu<-soil.pure[,3:37606]

Rarefaction and Singleton Removal

To evaluate coverage and differences in sequencing depth between samples, I generate rarefaction curves with sequencing depth on the x-axis and number of OTUs on the y-axis. These curves can be used to evaluate where to set the rarefaction cut-off to reduce sampling biases when calculating alpha-diversity measures such as diversity and observed richness. Additionally, to reduce the effects of rare taxa on ordinations, I remove singletons here as well.

Rarefaction Curves

Below I generate the rarefaction curves using the rarecurve function from vegan. This will take sometime to run, particularly on larger OTU tables. The output is then subset so that individual samples can be plotted in ggplot.

#Below I create the rarefaction curves for each sample using the rarecurve function from vegan. 
library(vegan)
rarebacsoilcurve<-rarecurve(soil.pureotu)

#I then create an object that spans the range of the OTU richness. This will be used to plot the vertical line in my ggplot rarefaction curve. 
y<-(0:6500)

#I then create objects containing the x and y coordinates values for each sample. 

sequencesoil47<-attr(rarebacsoilcurve[[47]],which="Subsample")
sequencesoil46<-attr(rarebacsoilcurve[[46]],which="Subsample")
sequencesoil45<-attr(rarebacsoilcurve[[45]],which="Subsample")
sequencesoil44<-attr(rarebacsoilcurve[[44]],which="Subsample")
sequencesoil43<-attr(rarebacsoilcurve[[43]],which="Subsample") 
sequencesoil42<-attr(rarebacsoilcurve[[42]],which="Subsample")
sequencesoil41<-attr(rarebacsoilcurve[[41]],which="Subsample")
sequencesoil40<-attr(rarebacsoilcurve[[40]],which="Subsample")
sequencesoil39<-attr(rarebacsoilcurve[[39]],which="Subsample")
sequencesoil38<-attr(rarebacsoilcurve[[38]],which="Subsample")
sequencesoil37<-attr(rarebacsoilcurve[[37]],which="Subsample")
sequencesoil36<-attr(rarebacsoilcurve[[36]],which="Subsample")
sequencesoil35<-attr(rarebacsoilcurve[[35]],which="Subsample")
sequencesoil34<-attr(rarebacsoilcurve[[34]],which="Subsample")
sequencesoil33<-attr(rarebacsoilcurve[[33]],which="Subsample")
sequencesoil32<-attr(rarebacsoilcurve[[32]],which="Subsample")
sequencesoil31<-attr(rarebacsoilcurve[[31]],which="Subsample")
sequencesoil30<-attr(rarebacsoilcurve[[30]],which="Subsample")
sequencesoil29<-attr(rarebacsoilcurve[[29]],which="Subsample")
sequencesoil28<-attr(rarebacsoilcurve[[28]],which="Subsample")
sequencesoil27<-attr(rarebacsoilcurve[[27]],which="Subsample")
sequencesoil26<-attr(rarebacsoilcurve[[26]],which="Subsample")
sequencesoil25<-attr(rarebacsoilcurve[[25]],which="Subsample")
sequencesoil24<-attr(rarebacsoilcurve[[24]],which="Subsample")
sequencesoil23<-attr(rarebacsoilcurve[[23]],which="Subsample")
sequencesoil22<-attr(rarebacsoilcurve[[22]],which="Subsample")
sequencesoil21<-attr(rarebacsoilcurve[[21]],which="Subsample")
sequencesoil20<-attr(rarebacsoilcurve[[20]],which="Subsample")
sequencesoil19<-attr(rarebacsoilcurve[[19]],which="Subsample")
sequencesoil18<-attr(rarebacsoilcurve[[18]],which="Subsample")
sequencesoil17<-attr(rarebacsoilcurve[[17]],which="Subsample")
sequencesoil16<-attr(rarebacsoilcurve[[16]],which="Subsample")
sequencesoil15<-attr(rarebacsoilcurve[[15]],which="Subsample")
sequencesoil14<-attr(rarebacsoilcurve[[14]],which="Subsample")
sequencesoil13<-attr(rarebacsoilcurve[[13]],which="Subsample")
sequencesoil12<-attr(rarebacsoilcurve[[12]],which="Subsample")
sequencesoil11<-attr(rarebacsoilcurve[[11]],which="Subsample")
sequencesoil10<-attr(rarebacsoilcurve[[10]],which="Subsample")
sequencesoil9<-attr(rarebacsoilcurve[[9]],which="Subsample")
sequencesoil8<-attr(rarebacsoilcurve[[8]],which="Subsample")
sequencesoil7<-attr(rarebacsoilcurve[[7]],which="Subsample")
sequencesoil6<-attr(rarebacsoilcurve[[6]],which="Subsample")
sequencesoil5<-attr(rarebacsoilcurve[[5]],which="Subsample")
sequencesoil4<-attr(rarebacsoilcurve[[4]],which="Subsample")
sequencesoil3<-attr(rarebacsoilcurve[[3]],which="Subsample")
sequencesoil2<-attr(rarebacsoilcurve[[2]],which="Subsample")
sequencesoil1<-attr(rarebacsoilcurve[[1]],which="Subsample")

#Here I plot the rarefaction curves. 

library(ggplot2)
library(ggpubr)
soilrarecurv<-ggplot()+
  geom_line(aes(x=sequencesoil47,y=rarebacsoilcurve[[47]]), size=1)+
  geom_line(aes(x=sequencesoil46,y=rarebacsoilcurve[[46]]),size=1)+
  geom_line(aes(x=sequencesoil45,y=rarebacsoilcurve[[45]]),size=1 )+
  geom_line(aes(x=sequencesoil44,y=rarebacsoilcurve[[44]]),size=1)+
  geom_line(aes(x=sequencesoil43,y=rarebacsoilcurve[[43]]),size=1)+
  geom_line(aes(x=sequencesoil42,y=rarebacsoilcurve[[42]]),size=1)+
  geom_line(aes(x=sequencesoil41,y=rarebacsoilcurve[[41]]),size=1)+
  geom_line(aes(x=sequencesoil40,y=rarebacsoilcurve[[40]]),size=1)+
  geom_line(aes(x=sequencesoil39,y=rarebacsoilcurve[[39]]), size=1)+  
  geom_line(aes(x=sequencesoil38,y=rarebacsoilcurve[[38]]), size=1)+  
  geom_line(aes(x=sequencesoil37,y=rarebacsoilcurve[[37]]), size=1)+
  geom_line(aes(x=sequencesoil36,y=rarebacsoilcurve[[36]]),size=1)+
  geom_line(aes(x=sequencesoil35,y=rarebacsoilcurve[[35]]),size=1 )+
  geom_line(aes(x=sequencesoil34,y=rarebacsoilcurve[[34]]),size=1)+
  geom_line(aes(x=sequencesoil33,y=rarebacsoilcurve[[33]]),size=1)+
  geom_line(aes(x=sequencesoil32,y=rarebacsoilcurve[[32]]),size=1)+
  geom_line(aes(x=sequencesoil31,y=rarebacsoilcurve[[31]]),size=1)+
  geom_line(aes(x=sequencesoil30,y=rarebacsoilcurve[[30]]),size=1)+
  geom_line(aes(x=sequencesoil29,y=rarebacsoilcurve[[29]]), size=1)+  
  geom_line(aes(x=sequencesoil28,y=rarebacsoilcurve[[28]]), size=1)+
  geom_line(aes(x=sequencesoil27,y=rarebacsoilcurve[[27]]), size=1)+
  geom_line(aes(x=sequencesoil26,y=rarebacsoilcurve[[26]]),size=1)+
  geom_line(aes(x=sequencesoil25,y=rarebacsoilcurve[[25]]),size=1 )+
  geom_line(aes(x=sequencesoil24,y=rarebacsoilcurve[[24]]),size=1)+
  geom_line(aes(x=sequencesoil23,y=rarebacsoilcurve[[23]]),size=1)+
  geom_line(aes(x=sequencesoil22,y=rarebacsoilcurve[[22]]),size=1)+
  geom_line(aes(x=sequencesoil21,y=rarebacsoilcurve[[21]]),size=1)+
  geom_line(aes(x=sequencesoil20,y=rarebacsoilcurve[[20]]),size=1)+
  geom_line(aes(x=sequencesoil19,y=rarebacsoilcurve[[19]]), size=1)+
  geom_line(aes(x=sequencesoil18,y=rarebacsoilcurve[[18]]),size=1)+
  geom_line(aes(x=sequencesoil17,y=rarebacsoilcurve[[17]]),size=1)+
  geom_line(aes(x=sequencesoil16,y=rarebacsoilcurve[[16]]),size=1)+
  geom_line(aes(x=sequencesoil15,y=rarebacsoilcurve[[15]]),size=1)+
  geom_line(aes(x=sequencesoil14,y=rarebacsoilcurve[[14]]), size=1)+
  geom_line(aes(x=sequencesoil13,y=rarebacsoilcurve[[13]]), size=1)+
  geom_line(aes(x=sequencesoil12,y=rarebacsoilcurve[[12]]),size=1)+
  geom_line(aes(x=sequencesoil11,y=rarebacsoilcurve[[11]]),size=1 )+
  geom_line(aes(x=sequencesoil10,y=rarebacsoilcurve[[10]]),size=1)+
  geom_line(aes(x=sequencesoil9,y=rarebacsoilcurve[[9]]),size=1)+
  geom_line(aes(x=sequencesoil8,y=rarebacsoilcurve[[8]]),size=1)+
  geom_line(aes(x=sequencesoil7,y=rarebacsoilcurve[[7]]),size=1)+
  geom_line(aes(x=sequencesoil6,y=rarebacsoilcurve[[6]]),size=1)+
  geom_line(aes(x=sequencesoil5,y=rarebacsoilcurve[[5]]), size=1)+
  geom_line(aes(x=sequencesoil4,y=rarebacsoilcurve[[4]]),size=1)+
  geom_line(aes(x=sequencesoil3,y=rarebacsoilcurve[[3]]),size=1)+
  geom_line(aes(x=sequencesoil2,y=rarebacsoilcurve[[2]]),size=1)+
  geom_line(aes(x=sequencesoil1,y=rarebacsoilcurve[[1]]),size=1)+
  geom_path(aes(x=37329,y=y), linetype=5)+
  #scale_color_manual(values=c("#00CCCC","#336699","#336699","#336699","#663300","#663300","#00CCCC","#CC9933","#CC9933","#CC9933","#0000FF","#0000FF","#006600","#006600"))+
  theme(panel.grid.major=element_blank(),
       panel.grid.minor=element_blank(),
       panel.background = element_blank(),
  panel.border=element_rect(color="black", size=1, fill=NA))+
 theme(axis.title.x=element_text(size=14, face="bold"))+
  theme(axis.title.y=element_text(size=14, face="bold"))+
  theme(axis.text.x=element_text(size=12, face="bold"))+
  theme(axis.text.y=element_text(size=12, face="bold"))+ 
labs(x=("Sequencing Depth"), y=expression(bold(paste("OTU Richness", sep=""))))
soilrarecurv

#ggsave(filename = "16s_soil_rarecurve.png", plot(soilrarecurv),dpi=300)

Sample Rarefaction

Below I rarefy my samples using the rrarefy function from vegan. Rarefaction cut-offs are determined by reviewing the rarefaction curves and identifying the lowest number of sequences that can be retained that minimize sample loss but maximize sequencing depth across samples. Based on the above rarefaction curve, a cut-off of 37,329 sequences per samples.

#Here we are rarefying the data using the rrarefy function from vegan. I first use the sort function to determine the lowest number of sequences in a sample. 
sort(rowSums(soil.pureotu[,1:37604]))

##     TN272ASBAC   TNWTMB19SBAC      TN55BSBAC      TNLS2SBAC  INWTMCB29SBAC 
##          37329          37684          38265          38442          39419 
## WA130BNL20SBAC  INWTMCB33SBAC     TN272CSBAC  INWTMCB28SBAC   IN55MCB6SBAC 
##          39425          39683          39851          40200          40940 
##   WA130RN4SBAC   WA132RN8SBAC    WA55RN2SBAC IN272MCB10SBAC  WA272RN10SBAC 
##          41540          42393          42612          42632          42834 
##  IN130MCB2SBAC     TN130ASBAC   IN55MCB7SBAC   WA132RN7SBAC     TN132BSBAC 
##          43301          43475          43653          43660          43673 
##      TN55CSBAC  WA55BNL19SBAC     TN130BSBAC IN132MCB17SBAC IN272MCB11SBAC 
##          44219          44795          44871          44903          45177 
##   IN55MCB8SBAC      TN55ASBAC      TNLS3SBAC IN130MCB26SBAC     TN272BSBAC 
##          45805          45902          46124          46558          46730 
##     TN132ASBAC   WA132RN9SBAC  IN130MCB9SBAC      TNLS1SBAC WA272BNL17SBAC 
##          47123          47622          48478          48700          49595 
##   TNWTMB20SBAC  WAWTBNL22SBAC     TN132CSBAC IN132MCB16SBAC  WAWTBNL23SBAC 
##          49818          50268          50973          51569          51933 
##     TN130CSBAC    WA55RN1SBAC   TNWTMB21SBAC IN272MCB24SBAC  WAWTBNL21SBAC 
##          59111          59566          68835          68965          75457 
## IN132MCB27SBAC WA272BNL18SBAC 
##          75813          76472

#Then using the rrarefy function from vegan, I determine the best rarefy my samples. The rarefaction cutoff or "sample" value is determined by evaluating my rarefaction curves and choosing the number of sequences that minimizes sample loss but maximizes sampling depth. 
library(vegan)
soil.rareotu<-as.data.frame(rrarefy(soil.pureotu, sample=37329))

#Because the rrarefy function doesn't automatically remove samples from the dataframe I subset the rarefied dataframe to remove unrarefied samples (those less than the minimum sequencing depth). 
soil.rareotu<-as.data.frame(subset(soil.rareotu,rowSums(soil.rareotu)>=37329))

Data Subsetting

Below I remove unrarefied samples (those less than minimum sequencing depth) as well as singleton OTUs (those with only one sequence following rarefaction)

#First I create an object that contains the sample names of the samples that passed through the rarefaction step. This will be used to filter the metadata file. 
soil.samples<-row.names(soil.rareotu)
soil.met.rare<-subset(soil.met,soil.met$Group%in% soil.names)

#I am now removing OTUs present in the OTU table that were only present in samples removed following rarefaction. These OTUs will have column sums of 0. 
soil.rareotu<- soil.rareotu[,colSums(soil.rareotu)>0]

#write.table(soil.rareotu, "~/Documents/microbiome2017/16s_soil_otutable_rarefied_fa19_ajo.shared",sep="\t", row.names=TRUE)

#I am now removing singleton OTUs (OTUs with only 1 representative sequence following rarefaction). This is to limit the effects of rare species on our distance matrices used during ordination.  

soil.rareotu.nosingletons<-soil.rareotu[,colSums(soil.rareotu)>1]

#write.table(soil.rareotu.nosingletons, "~/Documents/microbiome2017/16s_soil_otutable_rarefied_nosingleton_fa19_ajo.shared",sep="\t", row.names=TRUE)

Relative Abundance Charts

To visualize differences in the relative abundance of different taxa, I reorganize the OTU table and taxonomy strings to create relative abundance stacked bar charts. Below, stacked bar charts are made for phylum, class, and order levels.

Taxonomy Subsetting

To begin, the taxonomy file is subsetted to include only taxonomy information for OTUs present in the rarefied OTU table with singletons removed. The OTU table is first transposed so that OTU IDs are the row names and sample names are the column names. Then the taxonomic assignments are added to the OTU table. Taxonomic strings are then split by semi-colon into individual columns by Phylum, Class, Order, Family, and Genus (SILVA does not go to species level).

#Now that samples have been rarefied I can perform alpha and beta diversity analyses on my data sets. To begin I first reformat my OTU table so that I can do some filtering and add taxonomy information.  
soil.rareotu.nosingles<-read.table("/Volumes/AaronOnufrakMac/researchprojects_2019.2022_ajo/j.nigra_microbiome_2017_ajo_gmw/soil16sstatistics_jnigra17_2019_ajo/soil16smothur_jnigra17_microbiome/soil16sotutable.rarefied.nosingleton_jnigra17_fa19_ajo.shared", header=TRUE, sep="\t")

#I now transpose my OTU table so that I can begin to add in taxonomy information. 
t.soil.rareotu.nosingles<-as.data.frame(t(soil.rareotu.nosingles))

#The below function is asking specifically for the OTU IDs. These are the row names. I will use these to subset the taxonomy file. 
t.soil.rareotu.nosingleslabs<-labels(t.soil.rareotu.nosingles)
soil.taxrare<-subset(taxonomy.soil, rownames(taxonomy.soil) %in% t.soil.rareotu.nosingleslabs[[1]])

#Now I am subsetting the new taxonomy file to include only the taxonomy information stored in column 2. 
soil.taxrareinfo<-(soil.taxrare[,2])

#Below I am using the separate function from tidyr to split the taxonomy strings into columns by semi-colon so that I can rename the OTUs to give them more meaning for downstream analyses.
library(tidyr)

#The SILVA database only provides classifications up to genus level, new columns are created only up to Genus level.
soil.taxonomylabs<-c("kingdom","phylum","class","order","family","genus")

#Below I create a new data frame with a column containing taxonomy information.
t.soil.rareotu.nosinglestax<-data.frame(taxonomy=soil.taxrareinfo,t.soil.rareotu.nosingles)

#I then separate the taxonomy strings into new columns labeled with the taxonomy labels saved in the taxonomylabs object. Strings are being separated based on the presence of semi-colons. 
t.soil.rareotu.nosinglestaxsep<-separate(t.soil.rareotu.nosinglestax,into=soil.taxonomylabs,col=taxonomy,sep=";")

Relative Abundance OTU Table

I then create a relative abundance OTU table. This table contains taxonomic information and can be searched to look at specific OTUs more easily.

#Now that samples have been rarefied I can perform alpha and beta diversity analyses on my data sets. To begin I first reformat my OTU table so that I can do some filtering and add taxonomy information.  
soil.rareotu.nosingles<-read.table("/Volumes/AaronOnufrakMac/researchprojects_2019.2022_ajo/j.nigra_microbiome_2017_ajo_gmw/soil16sstatistics_jnigra17_2019_ajo/soil16smothur_jnigra17_microbiome/soil16sotutable.rarefied.nosingleton_jnigra17_fa19_ajo.shared", header=TRUE, sep="\t")

soil.rareotu.nosingles.table<-decostand(soil.rareotu.nosingles,method="total")

#I now transpose my OTU table so that I can begin to add in taxonomy information. 
t.soil.rareotu.nosingles.table<-as.data.frame(t(soil.rareotu.nosingles.table))

#The below function is asking specifically for the OTU IDs. These are the row names. I will use these to subset the taxonomy file. 
t.soil.rareotu.nosingleslabs<-labels(t.soil.rareotu.nosingles.table)
soil.taxrare<-subset(taxonomy.soil, rownames(taxonomy.soil) %in% t.soil.rareotu.nosingleslabs[[1]])

#Now I am subsetting the new taxonomy file to include only the taxonomy information stored in column 2. 
soil.taxrareinfo<-(soil.taxrare[,2])

#Below I am using the separate function from tidyr to split the taxonomy strings into columns by semi-colon so that I can rename the OTUs to give them more meaning for downstream analyses.
library(tidyr)

#The SILVA database only provides classifications up to genus level, new columns are created only up to Genus level.
soil.taxonomylabs<-c("kingdom","phylum","class","order","family","genus")

#Below I create a new data frame with a column containing taxonomy information.
t.soil.rareotu.nosinglestax.table<-data.frame(taxonomy=soil.taxrareinfo,t.soil.rareotu.nosingles.table)

#I then separate the taxonomy strings into new columns labeled with the taxonomy labels saved in the taxonomylabs object. Strings are being separated based on the presence of semi-colons. 
t.soil.rareotu.nosinglestaxsep.table<-separate(t.soil.rareotu.nosinglestax.table,into=soil.taxonomylabs,col=taxonomy,sep=";")

library(DT)
datatable(t.soil.rareotu.nosinglestaxsep.table,fillContainer = T, height="500dpx")

Phylum Relative Abundance Chart

Below I generate a phylum relative abundance plot. I first extract the phylum information. I then clean the taxa names to remove assignment confidence values. OTU raw abundance is then summed by phylum and state, phyla that represent less than 1% of the total community are grouped into other, and relative abundance is calculated. Stacked bar charts are made using the melt function and ggplot.

#Below I create a data frame that consists of phylum assignment and the rarefied OTU table. 
t.soil.raretabphy<-data.frame(phylum=t.soil.rareotu.nosinglestaxsep$phylum,t.soil.rareotu.nosingles)

#I then remove the assignment confidence values contained in parentheses found in each taxonomic string using the sub function.
t.soil.raretabphy$phylum<-sub("\\(.*)","",x=t.soil.raretabphy$phylum)

#The following code is for the generation of phylum relative abundance stacked bar charts. I first sum each OTU by the phylum it belongs to.
library(plyr)
t.soil.rareotutabphy<-as.data.frame(ddply(t.soil.raretabphy, .(phylum),colwise(sum)))
rownames(t.soil.rareotutabphy)<-make.names(t.soil.rareotutabphy$phylum)
t.soil.rareotutabphy<-t.soil.rareotutabphy[,2:48]

#I then transpose the data frame so that sample names are now row names and column names are OTU names. 
soil.rareotutabphy<-as.data.frame(t(t.soil.rareotutabphy))
soil.rareotutabphy.state<-data.frame(state=soil.met.rare$State,soil.rareotutabphy)

#I then sum each OTU by state and subset to exclude sample names. This is because the following functions only work on matrices containing numeric data. 
soil.rareotutabphy.statesum<-as.data.frame(ddply(soil.rareotutabphy.state, .(state),colwise(sum)))
soil.phylumcols<-soil.rareotutabphy.statesum[,2:46]

#I am now creating an other category. Other includes those OTUs belonging to a phylum that comprises less than 1% of the total community. This new object is referred to as phyothers.
soil.phyoth<-soil.phylumcols[,colSums(soil.phylumcols)/sum(soil.phylumcols)<=0.01]
soil.phyother<-rowSums(soil.phyoth)

#I then create an object that contains all of the phyla that comprise more than 1% of the total community.
soil.phyreg<-soil.phylumcols[,colSums(soil.phylumcols)/sum(soil.phylumcols)>0.01]

#I then create a new dataframe containingt the state information, the other column, and the remaining phyla.
soil.phytot2<-data.frame(state=soil.rareotutabphy.statesum$state,soil.phyreg,Other=soil.phyother)

#These values are then converted to relative abundance using the decostand function from vegan.
soil.phyrelabund<-decostand(soil.phytot2[,2:13],method="total")
soil.phyrelabund<-data.frame(state=soil.rareotutabphy.statesum$state,soil.phyrelabund)

#Below I use the melt function from the data.table package to reformat the data into stacked bar graph format.
library(data.table)
library(ggplot2)
library(ggpubr)
library(RColorBrewer)

soil.phyrelabundmelt<-melt(soil.phyrelabund, id.vars="state", variable.name="Phylum")
soil.colors.n.phylum <- length(unique(soil.phyrelabundmelt[,'Phylum']))

soil.phyrel<-ggplot(soil.phyrelabundmelt, aes(x=state, y=value, fill=Phylum))+
  geom_bar(stat="identity", show.legend=TRUE, color="black")+
  scale_fill_manual(values=colorRampPalette(brewer.pal(9, 'Set1'))(soil.colors.n.phylum)) +
  xlab("State") +
  ylab("Relative Abundance") +
  theme(panel.border = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.text.x = element_text(size = 14), axis.ticks.x = element_blank(), axis.text.y = element_text(size = 14), axis.line.y.left = element_line(), axis.title.y = element_text(size = 14))+
   scale_y_continuous(breaks=c(0,0.25,0.50,0.75,1),limits=c(0,1.05))+
  geom_text(data=NULL,aes(x=0.50, y=1.05,label="16S Soil"),hjust=0,colour="black")
  
ggplotly(soil.phyrel,tooltips=c("Phylum"))

#ggsave(filename = "16s_soil_phylrelabund.png", plot(soil.phyrel),dpi=300)

Class Relative Abundance Chart

Below I generate a class relative abundance plot. I first extract the class information from the OTU table that contains the taxonomy information. I then clean the taxa names to remove assignment confidence values and other non-taxonomic associated string portions from the class assignment. I then create an unknown group which consists of uncultured bacteria and bacteria unclassified at the class level. OTU raw abundance is then summed by class and state, classes that represent less than 1% of the total community are grouped into other, and relative abundance is calculated. Stacked bar charts are made using the melt function from ggplot.

#The following chunk creates a relative abundance bar chart at the class level. 
library(plyr)
t.soil.raretabcls<-data.frame(class=t.soil.rareotu.nosinglestaxsep$class,t.soil.rareotu.nosingles)

#What I am doing below is cleaning up the class names. These include assignment confidence levels, _or, and _.
t.soil.raretabcls$class<-sub("\\(.*)","",x=t.soil.raretabcls$class)
t.soil.raretabcls$class<-sub("_or","",x=t.soil.raretabcls$class)
t.soil.raretabcls$class<-sub("_"," ",x=t.soil.raretabcls$class)

#I then sum the OTUs by class and make the row names the class name. 
t.soil.rareotutabcls<-as.data.frame(ddply(t.soil.raretabcls, .(class),colwise(sum)))
rownames(t.soil.rareotutabcls)<-make.names(t.soil.rareotutabcls$class)
t.soil.rareotutabcls<-t.soil.rareotutabcls[,2:48]

#I then transpose the data frame and sum each class by state. 
soil.rareotutabcls<-as.data.frame(t(t.soil.rareotutabcls))
soil.rareotutabcls.state<-data.frame(state=soil.met.rare$State,soil.rareotutabcls)
soil.rareotutabcls.statesum<-as.data.frame(ddply(soil.rareotutabcls.state, .(state),colwise(sum)))

#At the class level, SILVA may classify things as unknown class, phylum.unclassified, or uncultured. These don't provide a lot of information and thus are grouped into a knew group called unknown. 
soil.clsunknown<-soil.rareotutabcls.statesum[,grep(".unclassified|Unknown.class|uncultured",names(soil.rareotutabcls.statesum))]
soil.clsunknown<-rowSums(soil.clsunknown)

#I then use the same grep function as above to pull out the class with meaningful information. By setting invert=TRUE I select items lacking the strings in the grep command.
soil.clstot<-soil.rareotutabcls.statesum[,grep(".unclassified|Unknown.class|uncultured",names(soil.rareotutabcls.statesum), invert=TRUE)]

#I then subset the data so that only numerical data is present and creating a new class of objects, other, that includes OTUs from classes that represent 1% or less of the total community. 
soil.clscols<-soil.clstot[,2:125]
soil.clsoth<-soil.clscols[,colSums(soil.clscols)/sum(soil.clscols)<=0.01]
soil.clsother<-rowSums(soil.clsoth)
soil.clsreg<-soil.clscols[,colSums(soil.clscols)/sum(soil.clscols)>0.01]

#I then create a new data frame that contains all of the class data and determine relative abundance using the decostand function from vegan. 
soil.clstot2<-data.frame(state=soil.clstot$state,soil.clsreg,Unclassified=soil.clsunknown,Other=soil.clsother)
soil.clsrelabund<-decostand(soil.clstot2[,2:23],method="total")
soil.clsrelabund<-data.frame(state=soil.rareotutabcls.statesum$state,soil.clsrelabund)

#Below I use the melt function from the data.table package to reformat the data into stacked bar graph format.
library(data.table)
library(ggplot2)
library(ggpubr)
library(RColorBrewer)
soil.clsrelabundmelt<-melt(soil.clsrelabund, id.vars="state", variable.name="class")

colorCount<- length(unique(soil.clsrelabundmelt[,'class']))
soil.cls.getPalette=colorRampPalette(brewer.pal(9,"Set1"))

#Below I generate the relative abundance bar charts in ggplot
soil.clsrel<-ggplot(soil.clsrelabundmelt, aes(x=state, y=value, fill=class))+
  geom_bar(stat="identity",show.legend=TRUE,color="black")+
  scale_fill_manual(values=soil.cls.getPalette(colorCount))+
  xlab("State") +
  ylab("Relative Abundance") +
  theme(panel.border = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.text.x = element_text(size = 14), axis.ticks.x = element_blank(), axis.text.y = element_text(size = 14), axis.line.y.left = element_line(), axis.title.y = element_text(size = 14))+
  scale_y_continuous(breaks=c(0,0.25,0.50,0.75,1),limits=c(0,1.05))+
  geom_text(data=NULL,aes(x=0.50, y=1.05,label="16S Soil"),hjust=0,colour="black")
ggplotly(soil.clsrel,tooltip=c("class"))

#ggsave(filename = "16s_soil_clsrelabund.png", plot(soil.clsrel),dpi=300)

Order Relative Abundance Chart

Below I generate an order relative abundance plot. I first extract the order information from the OTU table that contains the taxonomy information. I then clean the taxa names to remove assignment confidence values and other non-taxonomic associated string portions from the order assignment. I then create an unknown group which consists of uncultured bacteria and bacteria unclassified at the order level. OTU raw abundance is then summed by order and state, orders that represent less than 1% of the total community are grouped into other, and relative abundance is calculated. Stacked bar charts are made using the melt function from ggplot.

#The following chunk creates a relative abundance bar chart at the order level.

#What I am doing below is cleaning up the order names. These include assignment confidence levels, _or, and _.
library(plyr)
t.soil.raretabord<-data.frame(order=t.soil.rareotu.nosinglestaxsep$order,t.soil.rareotu.nosingles)
t.soil.raretabord$order<-sub("\\(.*)","",x=t.soil.raretabord$order)
t.soil.raretabord$order<-sub("_or","",x=t.soil.raretabord$order)
t.soil.raretabord$order<-sub("_"," ",x=t.soil.raretabord$order)

#I then sum the OTUs by order and make the row names the order name. 
t.soil.rareotutabord<-as.data.frame(ddply(t.soil.raretabord, .(order),colwise(sum)))
rownames(t.soil.rareotutabord)<-make.names(t.soil.rareotutabord$order)
t.soil.rareotutabord<-t.soil.rareotutabord[,2:48]

#I then transpose the data frame and sum each order by state.
soil.rareotutabord<-as.data.frame(t(t.soil.rareotutabord))
soil.rareotutabord.state<-data.frame(state=soil.met.rare$State,soil.rareotutabord)
soil.rareotutabord.statesum<-as.data.frame(ddply(soil.rareotutabord.state, .(state),colwise(sum)))

#At the class level, SILVA may classify things as unknown class, phylum.unclassified, or uncultured. These don't provide a lot of information and thus are grouped into a knew group called unknown.
soil.ordunknown<-soil.rareotutabord.statesum[,grep(".unclassified|Unknown.Order|uncultured",names(soil.rareotutabord.statesum))]

#I then use the same grep function as above to pull out the order with meaningful information. By setting invert=TRUE I select items lacking the strings in the grep command. 
soil.ordunknownsum<-rowSums(soil.ordunknown)
soil.ordtot<-soil.rareotutabord.statesum[,grep(".unclassified|Unknown.Order|uncultured",names(soil.rareotutabord.statesum), invert=TRUE)]

#I then subset the data so that only numerical data is present and creating a new class of objects, other, that includes OTUs from order that represent 1% or less of the total community.
soil.ordcols<-soil.ordtot[,2:294]
soil.ordoth<-soil.ordcols[,colSums(soil.ordcols)/sum(soil.ordcols)<=0.01]
soil.ordother<-rowSums(soil.ordoth)
soil.ordreg<-soil.ordcols[,colSums(soil.ordcols)/sum(soil.ordcols)>0.01]
soil.pseud<-(soil.ordcols$Pseudomonadales)/sum(soil.ordcols)
#I then create a new data frame that contains all of the order data and determine relative abundance using the decostand function from vegan. 
soil.ordtot2<-data.frame(state=soil.ordtot$state,soil.ordreg,Other=soil.ordother,Unclassified=soil.ordunknownsum)
soil.ordrelabund<-decostand(soil.ordtot2[,2:24],method="total")
soil.ordrelabund<-data.frame(state=soil.rareotutabcls.statesum$state,soil.ordrelabund)

#Below I use the melt function from the data.table package to reformat the data into stacked bar graph format.
library(data.table)
library(ggplot2)
library(ggpubr)
library(RColorBrewer)
soil.ordrelabundmelt<-melt(soil.ordrelabund, id.vars="state", variable.name="order")

colorCount<- length(unique(soil.ordrelabundmelt[,'order']))
soil.ord.getPalette=colorRampPalette(brewer.pal(9,"Set1"))

#Below I generate the relative abundance bar charts in ggplot
soil.ordrel<-ggplot(soil.ordrelabundmelt, aes(x=state, y=value, fill=order))+
  geom_bar(stat="identity",show.legend=TRUE,color="black")+
  scale_fill_manual(values=soil.ord.getPalette(colorCount))+
  xlab("State") +
  ylab("Relative Abundance") +
  theme(panel.border = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.text.x = element_text(size = 14), axis.ticks.x = element_blank(), axis.text.y = element_text(size = 14), axis.line.y.left = element_line(), axis.title.y = element_text(size = 14))+
scale_y_continuous(breaks=c(0,0.25,0.50,0.75,1),limits=c(0,1.05))+
  geom_text(data=NULL,aes(x=0.50, y=1.05,label="16S Soil"),hjust=0,colour="black")
ggplotly(soil.ordrel,tooltip=c("order"))

#ggsave(filename = "16s_soil_ordrelabund.png", plot(ordrelsoil),dpi=300)

Relative Abundance Panel

#This chunk creates a panel containing phylum, class, and order relative abundance charts. 
library(ggpubr)
soil.relab.panel<-ggarrange(soil.phyrel,soil.clsrel,soil.ordrel, ncol=3,nrow=1,labels=c("A","B","C"))

Taxonomy Filtering and Merging

Taxonomy is first filtered to include only those OTUs present in the rarefied OTU table with no singletons. Then the OTU table is transposed so that OTU ID is the row name and sample ID is the column name. Taxonomy strings are then added to the OTU table and strings are split by semi-colon.

#Below I am reformatting the OTU table so that I can begin to filter and merge the taxonomy information. 
#I first transpose the rarefied OTU table with no singletons.
t.soil.rareotu.nosingles<-as.data.frame(t(soil.rareotu.nosingles))

#The below function is asking specifically for the OTU IDs. These are the row names. I will use these to subset the taxonomy file. 
t.soil.rareotu.nosingleslabs<-labels(t.soil.rareotu.nosingles)
soil.taxrare<-subset(taxonomy.soil, rownames(taxonomy.soil) %in% t.soil.rareotu.nosingleslabs[[1]])
soil.taxrareinfo<-(soil.taxrare[,2])


#Below I am using the separate function from tidyr to split the taxonomy strings into columns by semi-colon so that I can rename the OTUs to give them more meaning for downstream analyses.
library(tidyr)
soil.taxonomylabs<-c("kingdom","phylum","class","order","family","genus")
t.soil.rareotutabtax<-data.frame(taxonomy=soil.taxrareinfo,t.soil.rareotu.nosingles)
t.soil.rareotutabtaxsep<-separate(t.soil.rareotutabtax,into=soil.taxonomylabs,col=taxonomy,sep=";")

Renaming of OTUs

OTUs are then renamed. Prior to official renaming, taxonomy strings are cleaned so that assignment confidence values and taxon leaders and any _unclassifed tails are removed.

#Here I am renaming the OTUs using the make.names function. This will make the OTU name something more informative. I am using genus to do this. 
t.soil.rareotutabtaxsep$genus<-sub("\\(.*)","",x=t.soil.rareotutabtaxsep$genus)
t.soil.rareotutabtaxsep$genus<-sub("(_unclassified)","",x=t.soil.rareotutabtaxsep$genus)
t.soil.rareotutabtaxsep$genus<-sub("(_ge)","",x=t.soil.rareotutabtaxsep$genus)
rownames(t.soil.rareotutabtaxsep)<-make.names(t.soil.rareotutabtaxsep$genus,unique=TRUE)
t.soil.rareotutabtax.rename<-t.soil.rareotutabtaxsep[,7:53]
soil.rareotutabtax.rename<-as.data.frame(t(t.soil.rareotutabtax.rename))

Principal Coordinate Analysis

To visually assess differences in the community composition of soil bacterial communities by clone and state, I perform a principal coordinate analysis (PCoA).

Relative Abundance Calculation

Prior to conducting a PCoA I first convert my OTU table from raw OTU abundances to relative abundance using the decostand function from vegan.

#Below I convert raw abundances to relative abundance using the decostand function from vegan. 
library(vegan)
soil.relabund<-decostand(soil.rareotutabtax.rename, method="total")

Principal Coordinate Analysis

I then perform the PCoA using the pcoa function from the ape package. Prior to performing the PCoA, a distance matrix needs to be calculated for the OTU table. In this case, I use the bray-curtis method.

#Below I calculate the distance matrix using the bray-curtis method and the perform the principal coordinate analysis on the relativized OTU table. 
soil.bacdist<-vegdist(soil.relabund, method="bray")
library(ape)
soil.bac.pcoa<-pcoa(soil.bacdist)

Plotting of Principal Coordinate Analysis

Then to plot the PCoA, I use the ggplot function. To use ggplot I need to first pull out the site scores for each sample and generate ellipses.

#We need to first subset our metadata table to include only those samples that passed rarefaction.  
soil.met.rare<-subset(soil.met,soil.met$Group%in% soil.samples)

#Then we extract PCoA site scores (these will be the x and y coordinates for each sample)
soil.bac.pcoavec<-as.data.frame(soil.bac.pcoa$vectors)
soil.bac.pcoasitescores<-data.frame(PC1=soil.bac.pcoavec$Axis.1, PC2=soil.bac.pcoavec$Axis.2)

#Create a new dataframe that includes the site scores from above with metadata from your study. 
soil.bac.pcoagraph<-data.frame(soil.bac.pcoasitescores,PC1=soil.bac.pcoasitescores$PC1, PC2=soil.bac.pcoasitescores$PC2, State=soil.met.rare$State, Clone=soil.met.rare$Clone,group=soil.met$Group)

#This is where you make confidence ellipses. I don't know what all the code means. Just know where to plug in my objects. 
soil.bac.pcoaellipse<-ordiellipse(soil.bac.pcoasitescores,soil.bac.pcoagraph$State, display="sites", kind="sd",draw="none")

df_ell <- data.frame()
for(g in levels(soil.bac.pcoagraph$State)){
df_ell <- rbind(df_ell, cbind(as.data.frame(with(soil.bac.pcoagraph[soil.bac.pcoagraph$State==g,],                                                vegan:::veganCovEllipse(soil.bac.pcoaellipse[[g]]$cov,soil.bac.pcoaellipse[[g]]$center,soil.bac.pcoaellipse[[g]]$scale))) ,State=g))}

#Plot PCoA in ggplot
library(ggplot2)
soil.pcoa<-ggplot(soil.bac.pcoagraph, aes(PC1,PC2, colour=State))+
  #geom_text(aes(label=group))+
  geom_point(aes(shape=Clone), size=3.5)+
  geom_path(data=df_ell, aes(x=PC1, y=PC2, colour=State),size=1, linetype=2)+
 theme(axis.title.x=element_text(size=14, face="bold"))+
  theme(axis.title.y=element_text(size=14, face="bold"))+
  theme(axis.text.x=element_text(size=12, face="bold"))+
  theme(axis.text.y=element_text(size=12, face="bold"))+ 
  scale_color_manual(values=c("#006600","#3399FF","#FF9900"))+
  scale_shape_manual(values=c(16,10,8,7,0,2))+
  scale_x_continuous(breaks=c(-0.60,-0.30,0,0.30,0.60),limits=c(-0.60,0.60))+
  scale_y_continuous(breaks=c(-0.60,-0.30,0,0.30,0.60),limits=c(-.60,.60))+
  theme(panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),
    panel.border=element_rect(colour="black", size=1, fill=NA))+
  theme(panel.grid.major=element_blank(),
       panel.grid.minor=element_blank(),
       panel.background = element_blank(),
  panel.border=element_rect(color="black", size=1, fill=NA))+
  geom_text(data=NULL,aes(x=-0.60,y=0.60,label="16S Soil"),hjust=0,colour="black")
soil.pcoa

#ggsave(filename = "16s_soil_pcoa.png", plot(soil.pcoa),dpi=300)

PERMANOVA

To formally test how well state and clone explain differences in soil bacteria community composition, I perform a PERMANOVA using the adonis function from vegan.

#Below I perform permanovas by state and clone using the adonis function from vegan. 
library(vegan)
soil.bac.perm.state<-adonis(soil.relabund~soil.met.rare$State+soil.met.rare$Clone, method="bray",permutations=10000)
soil.bac.perm.state.table<-soil.bac.perm.state$aov.tab
row.names(soil.bac.perm.state.table)[[1]]<-make.names("State")
row.names(soil.bac.perm.state.table)[[2]]<-make.names("Clone")

kable(soil.bac.perm.state.table)

	Df	SumsOfSqs	MeanSqs	F.Model	R2	Pr(>F)
State	2	2.0599954	1.0299977	9.2177029	0.3014686	0.0001000
Clone	4	0.3035548	0.0758887	0.6791466	0.0444235	0.8984102
Residuals	40	4.4696503	0.1117413	NA	0.6541079	NA
Total	46	6.8332005	NA	NA	1.0000000	NA

Richness and Diversity

Then to assess the alpha diversity of our samples, I calculate the observed richness and shannon diversity index using the vegan package. This is done using the rarefied OTU table that contains singletons. Singletons are included to keep sampling depth standard between samples.

Richness

Observed OTU richness is first calculated for each sample. Then due to the unbalanced nature of our study design, we use a type 3 ANOVA.

#In this chunk we calculate richness and create a data frame including richness with metadata

#I first import my rarefied OTU table that contains singletons.
soil.rareotu.wsingles<-read.table("/Volumes/AaronOnufrakMac/researchprojects_2019.2022_ajo/j.nigra_microbiome_2017_ajo_gmw/soil16sstatistics_jnigra17_2019_ajo/soil16smothur_jnigra17_microbiome/soil16sotutable.rarefied_jnigra17_fa19_ajo.shared", header=TRUE, sep="\t")

#Below I calculate observed species richness using the specnumber function from vegan. I then create a new data frame to include sample metadata. 
library(vegan)
soil.bac.rich<-specnumber(soil.rareotu.wsingles)
soil.richnesstabmet<-data.frame(State=soil.met.rare$State, clone=soil.met.rare$Clone, sample=soil.met.rare$Group,Richness=soil.bac.rich)

#I first test for a significant interaction between state and clone. Due to the unbalanced design we use a type 3 ANOVA. 
library(car)
soil.anova.typ3.rich<-aov((Richness)~State*clone, data=soil.richnesstabmet)
soil.typ3aov.rich<-Anova(soil.anova.typ3.rich,type="III")
kable(soil.typ3aov.rich)

	Sum Sq	Df	F value	Pr(>F)
(Intercept)	58353120.3	1	271.0432117	0.0000000
State	1307072.2	2	3.0355963	0.0620663
clone	488691.1	4	0.5674778	0.6880388
State:clone	834679.0	8	0.4846229	0.8579215
Residuals	6889306.8	32	NA	NA

#The below two lines generate plots to evaluate regression assumptions. 
plot(soil.anova.typ3.rich)

hist(soil.anova.typ3.rich$residuals)

#There was no significant interaction between state and clone so we drop the interaction term. 
soil.anova.add.rich<-aov((Richness)~State+clone, data=soil.richnesstabmet)
soil.add.aov.rich<-Anova(soil.anova.add.rich,type="III")
kable(soil.add.aov.rich)

	Sum Sq	Df	F value	Pr(>F)
(Intercept)	123304417.80	1	638.5533064	0.0000000
State	5112004.75	2	13.2367015	0.0000387
clone	99047.17	4	0.1282332	0.9712996
Residuals	7723985.86	40	NA	NA

#Perform Tukey test
soil.anova.add.rich.tukey<-TukeyHSD(soil.anova.add.rich)
soil.anova.add.rich.tukey.table<-rbind(soil.anova.add.rich.tukey$State,soil.anova.add.rich.tukey$clone)
kable(soil.anova.add.rich.tukey.table)

	diff	lwr	upr	p adj
TN-IN	-153.02222	-526.9365	220.8920	0.5836065
WA-IN	630.81905	233.3656	1028.2725	0.0011434
WA-TN	783.84127	402.7126	1164.9699	0.0000341
132-130	110.58360	-499.2636	720.4308	0.9850583
272-130	29.13915	-580.7080	638.9864	0.9999183
55-130	122.69471	-487.1525	732.5419	0.9780435
WT-130	91.21111	-481.6406	664.0628	0.9908401
272-132	-81.44444	-673.0831	510.1942	0.9947458
55-132	12.11111	-579.5276	603.7498	0.9999972
WT-132	-19.37249	-572.7998	534.0548	0.9999764
55-272	93.55556	-498.0831	685.1942	0.9910765
WT-272	62.07196	-491.3554	615.4993	0.9976229
WT-55	-31.48360	-584.9109	521.9437	0.9998364

#The below two lines generate plots to evaluate regression assumptions. 
plot(soil.anova.add.rich)

hist(soil.anova.add.rich$residuals)

#There was no significant effect of clone so we drop and see just state alone. 
soil.anova.sing.rich<-aov((Richness)~State, data=soil.richnesstabmet)
soil.sing.aov.rich<-Anova(soil.anova.sing.rich,type="III")
kable(soil.sing.aov.rich)

	Sum Sq	Df	F value	Pr(>F)
(Intercept)	312378420	1	1756.94650	0.00e+00
State	5206781	2	14.64255	1.34e-05
Residuals	7823033	44	NA	NA

#Perform Tukey test
soil.anova.sing.rich.tukey<-TukeyHSD(soil.anova.sing.rich)
soil.anova.sing.rich.tukey.table<-rbind(soil.anova.sing.rich.tukey$State)
kable(soil.anova.sing.rich.tukey.table)

	diff	lwr	upr	p adj
TN-IN	-153.0222	-510.5703	204.5258	0.5571720
WA-IN	630.8190	250.7621	1010.8760	0.0006350
WA-TN	783.8413	419.3946	1148.2880	0.0000139

#The below two lines generate plots to evaluate regression assumptions. I keep these muted because they're annoying if you're bulk running code. 
plot(soil.anova.sing.rich)

hist(soil.anova.sing.rich$residuals)

#I then create a box plot for the richness data by state using ggplot. 
library(Hmisc)
library(ggpubr)
soil.richness.summarized<-summarize(soil.richnesstabmet$Richness,by=soil.richnesstabmet$State, FUN=max)
soil.richness.summarized<-data.frame(State=soil.richness.summarized$`soil.richnesstabmet$State`,Richness=soil.richness.summarized$`soil.richnesstabmet$Richness`)

library(ggplot2)
library(ggpubr)
#Below I create a boxplot for richness values.
soil.rich<-ggboxplot(soil.richnesstabmet, x="State", y="Richness", outlier.shape=NA)+
  geom_jitter(data=soil.richnesstabmet,position=position_jitter(0.2), aes(shape=clone),size=3)+
  scale_shape_manual(values=c(16,10,8,7,0,2))+
  #geom_text(data=richness.summarized, aes(x=State, y = 25 + Richness, label=Group, fontface="bold"))+
  geom_text(data=NULL,aes(x=1, y=5300, label="A"))+
  geom_text(data=NULL,aes(x=2, y=5300, label="A"))+
  geom_text(data=NULL,aes(x=3, y=6005, label="B"))+
  geom_text(data=NULL,aes(x=0.50,y=6100,label="16S Soil"),hjust=0)+
  theme(axis.line.x=element_blank(), 
        axis.text.x=element_blank(),
        axis.title.x=element_blank(),
        axis.ticks.x=element_blank())
soil.rich

Diversity

Shannon diversity is first calculated for each sample. Then due to the unbalanced nature of our study design, we use a type 3 ANOVA.

#In this chunk we calculate shannon diversity and create a data frame including diversity with metadata
#I first import my rarefied OTU table that contains singletons.
soil.rareotu.wsingles<-read.table("/Volumes/AaronOnufrakMac/researchprojects_2019.2022_ajo/j.nigra_microbiome_2017_ajo_gmw/soil16sstatistics_jnigra17_2019_ajo/soil16smothur_jnigra17_microbiome/soil16sotutable.rarefied_jnigra17_fa19_ajo.shared", header=TRUE, sep="\t")

#Calculate shannon diversity using the diversity function of vegan
library(vegan)
soil.bac.div<-diversity (soil.rareotu.wsingles, index="shannon")

#I then create a new data frame to include sample metadata.
soil.diversity.tabmet<-data.frame(State=soil.met.rare$State, clone=soil.met.rare$Clone, sample=soil.met.rare$Group,Shannon=soil.bac.div)

#I first test for a significant interaction between state and clone. Due to the unbalanced design we use a type 3 ANOVA. 
library(car)
soil.anova.typ3.div<-aov((Shannon)~State*clone, data=soil.diversity.tabmet)
soil.typ3aov.div<-Anova(soil.anova.typ3.div,type="III")
kable(soil.typ3aov.div)

	Sum Sq	Df	F value	Pr(>F)
(Intercept)	129.0868144	1	2951.0386914	0.0000000
State	0.5568037	2	6.3645125	0.0047094
clone	0.1237413	4	0.7072090	0.5929482
State:clone	0.1436735	8	0.4105628	0.9060315
Residuals	1.3997709	32	NA	NA

#The below two lines generate plots to evaluate regression assumptions.
plot(soil.anova.typ3.div)

hist(soil.anova.typ3.div$residuals)

#There was no significant interaction between state and clone so we drop the interaction term. 
soil.anova.add.div<-aov((Shannon)~State+clone, data=soil.diversity.tabmet)
soil.add.aov.div<-Anova(soil.anova.add.div,type="III")
kable(soil.add.aov.div)

	Sum Sq	Df	F value	Pr(>F)
(Intercept)	264.6935053	1	6859.813328	0.0000000
State	1.9610185	2	25.410939	0.0000001
clone	0.1004248	4	0.650654	0.6297693
Residuals	1.5434444	40	NA	NA

#Perform Tukey test
soil.anova.add.div.tukey<-TukeyHSD(soil.anova.add.div)
soil.anova.add.div.tukey.table<-rbind(soil.anova.add.div.tukey$State,soil.anova.add.div.tukey$Clone)
kable(soil.anova.add.div.tukey.table)

	diff	lwr	upr	p adj
TN-IN	0.0807888	-0.0863575	0.2479351	0.4739357
WA-IN	0.4885664	0.3108976	0.6662351	0.0000001
WA-TN	0.4077776	0.2374063	0.5781488	0.0000025

#The below two lines generate plots to evaluate regression assumptions. 
plot(soil.anova.add.div)

hist(soil.anova.add.div$residuals)

#There was no significant interaction between state and clone so we drop clone from the model. 
soil.anova.sing.div<-aov((Shannon)~State, data=soil.diversity.tabmet)
soil.sing.aov.div<-Anova(soil.anova.sing.div,type="III")
kable(soil.sing.aov.div)

	Sum Sq	Df	F value	Pr(>F)
(Intercept)	663.709482	1	17764.92771	0
State	1.995571	2	26.70685	0
Residuals	1.643869	44	NA	NA

#Perform Tukey test
soil.anova.sing.div.tukey<-TukeyHSD(soil.anova.sing.div)
soil.anova.sing.div.tukey.table<-rbind(soil.anova.sing.div.tukey$State)
kable(soil.anova.sing.div.tukey.table)

	diff	lwr	upr	p adj
TN-IN	0.0807888	-0.0831117	0.2446894	0.4620979
WA-IN	0.4885664	0.3143477	0.6627850	0.0000001
WA-TN	0.4077776	0.2407146	0.5748405	0.0000013

#The below two lines generate plots to evaluate regression assumptions. 
plot(soil.anova.sing.div)

hist(soil.anova.sing.div$residuals)

#I then create a box plot for the richness data by state using ggplot. 
library(Hmisc)
library(ggpubr)
soil.diversity.summarized<-summarize(soil.diversity.tabmet$Shannon,by=soil.diversity.tabmet$State, FUN=max)
soil.diversity.summarized<-data.frame(State=soil.diversity.summarized$`soil.diversity.tabmet$State`,Diversity=soil.diversity.summarized$`soil.diversity.tabmet$Shannon`)

library(ggplot2)
library(ggpubr)
#Below I create a boxplot for richness values.
soil.div<-ggboxplot(soil.diversity.tabmet, x="State", y="Shannon", outlier.shape=NA)+
  geom_jitter(data=soil.diversity.tabmet,position=position_jitter(0.2), aes(shape=clone),size=3)+
  scale_shape_manual(values=c(16,10,8,7,0,2))+
  #geom_text(data=diversity.summarized, aes(x=State, y = 0.5 + Diversity, label=group, fontface="bold"))+
  geom_text(data=NULL,aes(x=1, y=7.1, label="A"))+
  geom_text(data=NULL,aes(x=2, y=7.1, label="A"))+
  geom_text(data=NULL,aes(x=3, y=7.3, label="B"))+
   theme(axis.line.x=element_blank(), 
        axis.text.x=element_blank(),
        axis.title.x=element_blank(),
        axis.ticks.x=element_blank())+
  geom_text(data=NULL, aes(x=0.50,y=7.4,label="16S Soil"),hjust=0)
soil.div

soilalphdiv<-ggarrange(soil.rich,soil.div,ncol=1,nrow=2, align="v")
soilalphdiv

#ggsave(filename = "16s_soil_alphdiv.png", plot(soilalphdiv),dpi=300)