January 21 & 23, 2019
[]
and name-based $
notationis.human <- TRUE
is.human == TRUE
is.human == 10
read.csv("
")
data(iris)
s.l <- iris$Sepal.Length
s.l.deviation <- iris$Sepal.Length/mean(iris$Sepal.Length)
plot(iris$Sepal.Length ~ iris$Sepal.Width)
lm(iris$Sepal.Length ~ iris$Sepal.Width)
In this course, we’ll be working with phylogenetic trees and evolutionary relationships between species. Evolution and the relatedness between species is increasingly used in conservation biology to identify the evolutionary consequences of extinction and identify whether closely related species share traits that make them more or less susceptible to extinction.
Phylogenetic tree = a representation of species’ inter-relatedness; which species share ancestors and when their lineages diverged
Ultimately, we are interested in a few key pieces of information from the phylogeny:
Within the tree, we can represent the branching of lineages as a function of time or mutations.
We are now going to dive in to using R to work with phylogenies.
Packages are pre-arranged sets of data, functions, and other R programming components. Often, when we want to do something in R like perform a statistical test or make a figure, someone has done this before. If they have published their code online in a package, you can borrow the functions they have written in order to do your own work.
See the CRAN task view website, with specific focus on the Phylogenetics task view.
Any time you need to use a specialized set of functions, you can probably find a relevant library on CRAN or through another software source like GitHub. We are going to dive into a specific library that you’ll use throughout the rest of this course called picante
.
picante
librarypicante
provides functions for phylogenetic and community analysis. You’ll be using phylogenetic analysis later in the semester for your upcoming assignments.
The first step to using a package is installation. We’ll install the picante
package from an online source - for R, most of the time you will be installing packages from CRAN, the Comprehensive R Archive Network.
We can install packages in R using the install.packages()
command. After the package has been installed, we load it into our R session using the library()
command.
# Dependencies are other packages that can be borrowed from
install.packages("picante", dependencies = TRUE)
library(picante)
Note: You only have to install the package once; every time you start a new R session, however, you have to load the package with library()
.
Just like with functions, we can use the help()
command to find out more information about our newly loaded library in the Help
tab on the right-hand panel.
Let’s take a closer look at picante and the functions it adds to our library:
help(package=picante)
This shows us individual help pages for each function that picante
provides - you can see there are a lot of available functions that come with picante
.
Inside of the help panel, we can also look at something called a vignette - a friendly introductory tutorial that is designed to give you an easy entry point to using the package.
We’ll be working together through some of the materials presented in picante
’s vignette, in fact.
Finally, some packages come with data sets just like the iris
data set we looked at last week. picante
comes with a data set called phylocom
- it includes phylogenetic, community, and trait data.
You can load this with
data("phylocom")
and inspect the contents with
help(phylocom)
# or
summary(phylocom)
## Length Class Mode
## phylo 5 phylo list
## sample 150 -none- numeric
## traits 4 data.frame list
Notice that phylocom
has three named components that we can access one-at-a-time using the $
notation: phylo, sample, and traits. These are the phylogenetic tree, a data frame of which species are found where, and the traits that those species have.
For example:
# The phylogenetic tree:
phylocom$phylo
# The community data:
phylocom$sample
# And finally, trait data:
phylocom$traits
Let’s talk about what these data types are, because you’ll be using them in your assignments throughout the rest of this course.
picante
: the phylo
classIn picante
and other phylogenetic libraries in R, there is a special class of data called the phylo
class. This data type represents phylogenetic trees in a way that the computer and R can understand and manipulate.
Ignoring the phylocom
data set that is pre-built into picante
, we’re going to start with a simple example phylogeny that has four species of owl. Let’s create a phylo
class object using the read.tree()
command that was added to our library when we installed picante
.
We’re not going to build the tree from scratch, rather we’re going to read in a pre-saved phylogeny from the owls.tre
Newick file, indicated by the .tre
file type. This line will work only if your owls.tre
file is located inside of the folder where R is running. You can check this with the dir()
command.
owl.tree <- read.tree("owls.tre")
If you don’t know where your file is located and want to find it using a pop-up file selection window, you can also read in the .tre file with:
owl.tree <- read.tree(file.choose())
Take a look at the tree:
owl.tree
##
## Phylogenetic tree with 4 tips and 3 internal nodes.
##
## Tip labels:
## [1] "Strix_aluco" "Asio_otus" "Athene_noctua" "Tyto_alba"
##
## Rooted; includes branch lengths.
R tells us there are four tips and three internal nodes (or, four species and three internal branching points).
We can take a closer look using the str[ucture]
command like this:
str(owl.tree)
## List of 4
## $ edge : int [1:6, 1:2] 5 6 7 7 6 5 6 7 1 2 ...
## $ edge.length: num [1:6] 6.3 3.1 4.2 4.2 7.3 13.5
## $ Nnode : int 3
## $ tip.label : chr [1:4] "Strix_aluco" "Asio_otus" "Athene_noctua" "Tyto_alba"
## - attr(*, "class")= chr "phylo"
## - attr(*, "order")= chr "cladewise"
phylo
objectThere are four named components in our phylogenetic tree:
edge
: a 2-dimensional matrix where each row is an edge of the treeowl.tree$edge
## [,1] [,2]
## [1,] 5 6
## [2,] 6 7
## [3,] 7 1
## [4,] 7 2
## [5,] 6 3
## [6,] 5 4
# relationships between ancestors and tips, current species
This tells us that there are six edges in total in our phylogenetic tree. The first edge, [ 5 6 ] is the branch that runs from our root, node #5, to node #6, the splitting point.
Nnode
: 3, the number of internal nodes (does not include the tips)tip.label
: a vector of names that are used on the tips of the treeowl.tree$tip.label
## [1] "Strix_aluco" "Asio_otus" "Athene_noctua" "Tyto_alba"
edge.length
: a vector of numbers that correspond to the lengths of the edges in $edge
owl.tree$edge.length
## [1] 6.3 3.1 4.2 4.2 7.3 13.5
These are the components that go into a phylo
object.
phylo
object looks like underneathLet’s take a peek at the original file we read into R as a phylo object. We can do this by opening the text file directly in R.
owls(((Strix_aluco:4.2,Asio_otus:4.2):3.1,Athene_noctua:7.3):6.3,Tyto_alba:13.5);
owls
= the name of the tree
The colon :
separates the tip labels from the edge lengths, parentheses show us which species should be connected by that branch length.
Notice that the tips are shown from bottom to top - we can compare all of this to the results we saw when we called $edge
and $edge.lengths
in the phylo
object.
Everything we see in the phylogenetic tree comes from a relatively simple file.
Plotting a phylo
object is straightforward:
plot(owl.tree)
We can add modifications like adding tip labels
plot(owl.tree)
# Adding a horizontal scale bar:
add.scale.bar()
# Adding a text label to show the units for the scale bar
text(1,1.1,"MY")
# Adding labels to the intermediate nodes
nodelabels()
phylo
objectsThere are two ways to modify the tip names in the phylogenetic tree: you can directly edit the Newick text document or you can modify the names in R itself.
Be careful about how you write these names - R, like many programming languages, cannot handle spaces in names and will have issues if you provide spaces.
When changing the Newick file directly, let’s change the name of Strix aluco to Owl One, and save as a new file, owls2.tre
:
owls(((Owl_one:4.2,Asio_otus:4.2):3.1,Athene_noctua:7.3):6.3,Tyto_alba:13.5);
owl.tree.2 <- read.tree("owls2.tre")
plot(owl.tree.2)
We can also change the name using the $tip.label
component of the phylogenetic object.
# Save a pristine copy of owl.tree first, in case we make mistakes later on
owl.tree.3 <- owl.tree
owl.tree.3$tip.label
## [1] "Strix_aluco" "Asio_otus" "Athene_noctua" "Tyto_alba"
owl.tree.3$tip.label[1] <- "Owl_one"
plot(owl.tree.3)
Similar to how we changed the tip labels for the phylogenetic tree, we can also change the edge lengths of the branches in the tree. Let’s try this by changing the edge lengths equal to 1 million years.
To do this in the Newick file, change all numbers to 1
, save as a new file owls.branch1.tre
, and load into R:
owls(((Strix_aluco:1,Asio_otus:1):1,Athene_noctua:1):1,Tyto_alba:1);
owl.tree.4 <- read.tree("owls_branch1.tre")
plot(owl.tree.4)
Notice that the edge length leading to Tyto alba is now much shorter than it used to be - this is because the phylogenetic tree’s branches are all the same size. So if we know that Owl one and Asio otus are 1MY apart, and work our way back to the root node, the phylogenetic tree believes that Tyto alba’s lineage disappeared only one million years after the root node.
Of course, we can also make this change in R using $edge.length
owl.tree.4 <- owl.tree
# How many edges are there to replace with 1's?
n.edges <- nrow(owl.tree.4$edge)
owl.tree.4$edge.length <- rep(1, n.edges)
# Another way to do this:
owl.tree.4$edge.length[1:6] <- 1
plot(owl.tree.4)
picante
: community objectsOften when we work with phylogenetic data, we’re interested in comparing the phylogenetic history of the species observed in a species community. The picante
package also uses community data, records of observations of a species at different sites.
Community data is represented as a 2D object (data-frame/matrix) where sites/samples are placed in the rows and taxa in the columns.
So let’s make a new data-frame for our owls at three imaginary sites: SiteA
, B
, and C
:
Strix_aluco | Asio_otus | Athene_noctua | Tyto_alba | |
---|---|---|---|---|
SiteA | 0 | 1 | 1 | 0 |
SiteB | 0 | 0 | 1 | 1 |
SiteC | 1 | 0 | 1 | 1 |
To make this data frame in R, we could either read in a .csv file of community data or, because we’re using a simple example, we can also make a new data frame from scratch and tell R what to put in each column:
owl.comm <- data.frame(Strix_aluco = c(0,0,1),
Asio_otus = c(1,0,0),
Athene_noctua = c(1,1,1),
Tyto_alba = c(0,1,1)
)
##### Just create this in Excel then export to .csv and read it in
rownames(owl.comm) <- c("SiteA","SiteB","SiteC")
owl.comm
## Strix_aluco Asio_otus Athene_noctua Tyto_alba
## SiteA 0 1 1 0
## SiteB 0 0 1 1
## SiteC 1 0 1 1
Now we have a community data set, showing which species were present at each site.
What picante
can do now is match the column names to the tip labels in our phylogenetic tree. Note that the column names MUST MATCH the tip labels in the phylo
object!
While column names are used to match the species in each community to that species’ position in the phylogeny, row names can be used to give informative site names.
Now we have two data objects - a phylogenetic tree for four owl species and a community data-frame showing which species were in sites A, B, and C.
The next thing we are going to do is calculated phylogenetic diversity, also called PD. Phylogenetic diversity is a measure of how much phylogenetic time is captured by an observed species assemblage. It captures the total edge length of a phylogenetic tree that is represented by a community.
Of the three sites shown above, which do you think will have the greatest phylogenetic diversity?
Basically, when we calculate phylogenetic diversity, we are calculating how much evolutionary time is represented in that community, or the phylogenetic distinctiveness of its members. If all species are very closely related to one another, PD is low; conversely, if there is a lot of evolutionary differentiation between species present in a community, PD is high.
We’re now going to calculate the phylogenetic diversity of our owl community in two ways: first by hand, then using a pre-built function from picante
.
Steps:
##### Subset to Site A first, then create list of taxa that are present and absent, drop the tips of the absent species
siteA.absent <- colnames(owl.comm)[
which(owl.comm["SiteA",]==0) # Take the time to explain this step!!!!
]
owl.tree.siteA <- drop.tip(owl.tree, siteA.absent)
siteA.pd <- sum(owl.tree.siteA$edge.length)
plot(owl.tree.siteA)
edgelabels()
### Now repeat for sites B and C
siteB.absent <- colnames(owl.comm)[
which(owl.comm["SiteB",]==0) # Take the time to explain this step!!!!
]
owl.tree.siteB <- drop.tip(owl.tree, siteB.absent)
owl.tree.siteB
##
## Phylogenetic tree with 2 tips and 1 internal nodes.
##
## Tip labels:
## [1] "Athene_noctua" "Tyto_alba"
##
## Rooted; includes branch lengths.
siteB.pd <- sum(owl.tree.siteB$edge.length)
siteC.absent <- colnames(owl.comm)[
which(owl.comm["SiteC",]==0) # Take the time to explain this step!!!!
]
owl.tree.siteC <- drop.tip(owl.tree, siteC.absent)
siteC.pd <- sum(owl.tree.siteC$edge.length)
paste0("A:", siteA.pd, " B:", siteB.pd, " C:", siteC.pd)
## [1] "A:14.6 B:27.1 C:34.4"
Site A, where the only species present are “clumped” phylogenetically, has the lowest phylogenetic diversity while Site C has the highest.
picante
functionEven though calculating PD by hand is relatively easy, it can be tedious going site-by-site. Luckily, we can use picante
’s built-in pd()
function to calculate PD for each site automatically.
##### Mention what the include.root means - on the board
all.sites.pd <- pd(samp = owl.comm, tree = owl.tree, include.root = FALSE)
all.sites.pd
## PD SR
## SiteA 14.6 2
## SiteB 27.1 2
## SiteC 34.4 3
Using the built-in function, we get the same answers as we did through hand calculations.
Notice that the function gives us a data frame as output - we can now save this data.frame or use the values within for more calculations.