Phylogenetic trees in R

I wrote this little tutorial as an introductory chapter for the NESCent Academy on Macroevolution back in July 2014. It’s meant to provide a brief overview of the basic structure of tree objects in R and illustrate some of the tree manipulation and visualization options.

Given that I’ll be teaching a module on comparative methods in our ‘Research Methods in Organismal Biology’ course soon, I figured I may as well make all scripts and exercises accessible on this blog. Once you have mastered this, explore many more options in Emmanuel Paradis’ book (Analysis of Phylogenetics and Evolution with R) and Liam Revell’s blog.

Let’s first load some libraries that we need for this exercise.


Trees as “phylo” objects

Let’s begin by simulating a tree. There are many options for doing this. We’ll use the rtree() function of the ‘ape’ package. The tree is stored as an object called “phy”.

    phy <- rtree(n=10) 
    # n specifies the number of tips we want.

How does the tree look like? We can plot it with the following line.



index01So, now we wonder how the phylogenetic information is encoded. And no, it’s not a black box. By typing “phy” in the command line we can retrieve some basic information about the object.

## Phylogenetic tree with 10 tips and 9 internal nodes.
## Tip labels:
## t3, t8, t5, t4, t7, t2, ...
## Rooted; includes branch lengths.
# A tree with n tips and (n-1) nodes.
# There are tip labels, the tree is rooted, and we have branch lengths.

But how is this information organized within the phylo object? We can find out with the str() function, which displays the structure of an R object.

## List of 4
##  $ edge       : int [1:18, 1:2] 11 12 13 13 12 14 15 15 14 16 ...
##  $ tip.label  : chr [1:10] "t3" "t8" "t5" "t4" ...
##  $ edge.length: num [1:18] 0.944 0.996 0.842 0.542 0.627 ...
##  $ Nnode      : int 9
##  - attr(*, "class")= chr "phylo"
##  - attr(*, "order")= chr "cladewise"

The output tells us that a phylo object is a list of four components (+ two attributes): edge, tip.label, edge.length, and Nnode. We can display these items separately.

##       [,1] [,2]
##  [1,]   11   12
##  [2,]   12   13
##  [3,]   13    1
##  [4,]   13    2
##  [5,]   12   14
##  [6,]   14   15
##  [7,]   15    3
##  [8,]   15    4
##  [9,]   14   16
## [10,]   16    5
## [11,]   16   17
## [12,]   17   18
## [13,]   18    6
## [14,]   18    7
## [15,]   17    8
## [16,]   11   19
## [17,]   19    9
## [18,]   19   10
##  [1] "t3"  "t8"  "t5"  "t4"  "t7"  "t2"  "t10" "t1"  "t9"  "t6"
##  [1] 0.9437 0.9960 0.8423 0.5422 0.6265 0.6445 0.1634 0.7275 0.8058 0.8304
## [11] 0.9346 0.5120 0.9839 0.7717 0.2805 0.1774 0.8578 0.3754
## [1] 9

And of course, we can also store these components as new objects with names of your choice.

    branches <- phy$edge
    species <- phy$tip.label
    brlength <- phy$edge.length
    nodes <- phy$Nnode

But what does this all mean? Let’s explore this with a hand-written tree. Assume you have a tree with 6 species (A through F). The phylogenetic relationships of species A:F can be described in bracket form (“parenthetic format”, or Newick).

    mini.phy <- read.tree(text = "((((A,B), C), (D,E)),F);")

index02The structure of the object contains only three components this time because we didn’t provide branch length:

## List of 3
##  $ edge     : int [1:10, 1:2] 7 8 9 10 10 9 8 11 11 7 ...
##  $ tip.label: chr [1:6] "A" "B" "C" "D" ...
##  $ Nnode    : int 5
##  - attr(*, "class")= chr "phylo"
##  - attr(*, "order")= chr "cladewise"

But how are the edges defined? When we type…

##       [,1] [,2]
##  [1,]    7    8
##  [2,]    8    9
##  [3,]    9   10
##  [4,]   10    1
##  [5,]   10    2
##  [6,]    9    3
##  [7,]    8   11
##  [8,]   11    4
##  [9,]   11    5
## [10,]    7    6

… we see a matrix of 10 rows and 2 columns. This matrix represents unique combinations of node- and tip numbers, defining each branch segment of the tree.

    plot(mini.phy, label.offset=0.2)  
    # the label.offset argument moves the names a bit to the right
    nodelabels() # add node numbers
    tiplabels()  # add tip numbers


index03For example, the branch (or edge) leading to species “C” is identified by row 6: 9, 3. Both nodelabel() and tiplabel() function are quite flexible and afford the opportunity to visually anhance your tree plot. Here’s an example for mini.phy object. We begin by creating an vector of different colors, with the same length as number of species in the tree.

    mycol<-c("blue", "blue", "blue", "red", "red", "red")

Now we plot the tree, moving the taxon names a bit to the right, and add the tiplabels without text, using symbols instead.

    plot(mini.phy, adj=0, label.offset=0.75, lwd=2)
    tiplabels(pch=21, col="black", adj=1, bg=mycol, cex=2)


index04Swapping sisterclades, identifying clades/tips, dropping tips

It sometimes may be useful to rotate the tree about a specific node, i.e. swap sister clades. This can be carried out with the rotate() function. Let’s continue to work with mini.phy:

    plot(mini.phy, label.offset=0.2)  


index05How about we swap clades (D, E) and (A, B, C)? Their most recent common ancestor is found at node 8.

    rot.phy <- rotate(mini.phy, node=8)

And now let’s see what happenend:

    plot(rot.phy, label.offset=0.2)   


index06It will also be very helpful to select all tips in a given clade. This is implemented in the ‘geiger’ package; the tips() function finds all descendants of a node.

    cladeABC <- tips(rot.phy, node=9) 
    # node 9 defines the clade composed of (A, B, C)
## [1] "A" "B" "C"

Another helpful command allows for tree pruning, i.e. cutting of tips or entire clades. For example, we can delete the entire cladeABC group:

    pruned.phy <- drop.tip(rot.phy, cladeABC)
    plot(pruned.phy, label.offset=0.2)    


index07Or we can drop tips (1 or multiple) randomly. Liam Revell explained how to do this nicely on his blog. To prune tips, say m=2 random tips, enter:

    pruned.phy2 <- drop.tip(rot.phy, sample(rot.phy$tip.label)[1:m]) 
    # m=1 drops 1 single tip, of course!
    plot(pruned.phy2, label.offset=0.2)   

index08It may also be useful to select all branches in a specific group of tips. This is implemented in the ‘ape’ package; the which.edge() function finds all edges in a specified group. For example, let’s identify all branches of the cladeABC group as defined above.

    cladeABCbranches <- which.edge(rot.phy, cladeABC) 
    # cladeABC was defined earlier, using the tips() function
## [1] 6 7 8 9
    # this should be a numerical vector containing 6, 7, 8, 9

And as we can see, rows 6-9 of the edge.length matrix represent the expected branches. Let’s first plot the tree, and then look at the edge matrix for cross-checking.

    plot(rot.phy, label.offset=0.2)   


    rot.phy$edge # compare edge matrix and tree plot
##       [,1] [,2]
##  [1,]    7    8
##  [2,]    8   11
##  [3,]   11    4
##  [4,]   11    5
##  [5,]    8    9
##  [6,]    9   10
##  [7,]   10    1
##  [8,]   10    2
##  [9,]    9    3
## [10,]    7    6

Here would be a good opportunity to show you how to assign different branch colors. For example, how can we emphasize the branches of the clade formed by A, B, and C? We first create a color vector, only consisting of grey colors. Then we’ll assign black to all branches of clade ABC.

    clcolr <- rep("darkgrey", dim(rot.phy$edge)[1]) 
    clcolr[cladeABCbranches] <- "black"
    plot(rot.phy, lwd=3, edge.color=clcolr)


index10Let’s conclude this section with one last exercise: combining trees. Assume you have two different phylogenies, with two different sets of taxa (no overlap). Another assumption is that you have knowledge how the trees may fit together. Then the bind.tree() function of ‘ape’ package can help. The function takes in two phylo-objects. The position of where the trees are bound is defined by tip- or node number within the first tree. Note that you can also specify the “root” as binding position.

    tree1 <- rtree(n=10); plot(tree1); nodelabels(); tiplabels()


    tree2 <- rtree(n=10); plot(tree2)


    combined.tree <- bind.tree(tree1, tree2, where=1) 


Fully resolved and polytomous trees

All tree examples so far were fully resolved, i.e. each tree was fully binary. It’s very easy to access visually for small trees, but we can also do this more formally:

## [1] TRUE

Let’s make a tree that is not fully resolved.

    poly.phy <- read.tree(text = "(((A,B,C),(D,E)),F);")


  # Is this tree binary?

## [1] FALSE

OK. Many comparative methods require fully resolved trees. But what to do if that’s not the case? The multi2di() function can resolve polytomies, either randomly or in the order in which they appear in the tree. The default setting is to resolve polytomies randomly.

    resolved.phy <- multi2di(poly.phy)
## [1] TRUE
    plot(resolved.phy) # visual inspection.

index15If you repeat the above lines a few times you will see the effect of randomly resolving the polytomy.

Modifying tree shape and other plotting options

There are many options for formatting and beautifying trees in R. Here are some basics. Let’s begin by simulating a tree once more.

    phy <- rtree(n=10) # n specifies the number of tips we want.

    # The default plot produces a rightwards tree


index16 The tree orientation can be changed by modifying the “direction”- argument. Try it out!

    plot(phy, direction="upwards") 


    # other options are "rightwards" (default), "leftwards", and "downwards".

The font size of the tip labels (species names) can be changed with the cex argument.


index18Try a few different settings! If we don’t want the species names displayed, you can do the following:


index19And if you would like thicker branches, do this:


index20Color of the branches can be controlled by the edge.color argument:


index21Try a few different settings! There are many other options… Type ?plot.phylo in the console to read more.

Another useful command I’d like to introduce is the ladderize() function. It reorganizes the tree structure, normally yielding much more readable trees. Let’s make a bigger tree to really visualize the ladderize-effect.

    big.phy <- rtree(n=50)
    ladderized.phy <- ladderize(big.phy)

Let’s create a plot with two panels, contrasting ‘before” and ’after’. With par() we can include the mfrow option, which specifies number of rows and columns in the plot matrix.

    par(mfrow=(c(1,2))) # 1 row but 2 plot panels

    plot(big.phy, direction="upwards", 

    plot(ladderized.phy, direction="upwards", 


    par(mfrow=c(1,1)) #for subsequent single-panel plots

Finally, let’s change the type of the tree. This is done by modifying the “type” argument. The default is set to “phylogram.” Other options include “cladogram”, “fan”, “unrooted”, and “radial”. Try all of them!

    plot(ladderized.phy, type="fan", 


Tons of options! And kind of fun.

This entry was posted in R and tagged , , , , , . Bookmark the permalink.

One Response to Phylogenetic trees in R

  1. Pingback: Elegant scientific graphs: Learning from examples | rmf

Comments are closed.