Tutorial

This tutorial aims to make users familiar with the key components of the Whale library. The material on this page should enable a user familiar with Bayesian statistics or optimization to make use of the package for conducting inference for models of gene family evolution from gene trees.

Warning

This tutorial will not show how to get the input data (CCDs) for running analyses with Whale. Please consider the Introduction page.

0. Introduction

Whale is a library for conducting statistical inference for phylogenetic birth-death process models of gene family evolution and statistical gene tree – species tree reconciliation using the amalgamated likelihood approximation (Szöllősi et al. 2014) to the marginal sequence likelihood.

Let us unpack the above. Consider a sequence alignment $y$, a known species tree $S$, a model of sequence evolution with parameters $\phi$ and a model of gene family evolution with parameters $\theta$. The likelihood of a sequence alignment $y$, marginalizing over gene tree topologies, is given by:

\[p(y|\phi,\theta,S) = \sum_{G \in \mathcal{G}_y}{\int}p(y|G, t, \phi) p(G, t|S, \theta)dt\]

Where we sum over the set $\mathcal{G}_y$ of all possible gene trees on the alignment $y$ and integrate over all possible node ages in the tree $t$. Clearly, this represents a very challenging inference problem.

When we are mainly interested in inference under the model of gene family evolution (e.g. inference of reconciled trees, orthology relationships, duplication and loss rates, etc.), we can adopt the ALE approximation to this marginal likelihood, which takes the following form

\[p(y|\theta,S) = \sum_{G \in \mathcal{G}_y} p(y|G) p(G|S,\theta) \approx \sum_{G \in \mathcal{G}_y} \xi_y(G) p(G|S,\theta) := q(y|\theta,S)\]

(note that $p(y|G) = \int_\phi p(y,\phi|G)$)). Here $\xi_y(G)$ serves as an approximation to the sequence likelihood, and is assumed to be proportional to $p(y|G)$.

In the ALE approach (for amalgamated likelihood estimation), we obtain a $\xi_y(G)$ by constructing a conditional clade distribution (CCD) from a sample of the posterior distribution of gene trees under the model of sequence evolution, and using the probability mass function associated with this CCD. That is $\xi_y(G) \propto p(y|G)p(G)$ under the model of sequence evolution alone, which is $\propto p(y|G)$ when a uniform prior is used for $G$. The reason we restrict ourselves to a CCD as approximating family is that this allows for efficient computation of the potentially huge sum $\sum_{G \in \mathcal{G}_y} \xi_y(G) p(G|S,\theta)$, as was shown by Szöllősi et al.

With an approximation to $p(y|\theta,S)$ in hand, we can conduct inference for the parameters of the model of gene family evolution $\theta$ using either Bayesian inference or maximum likelihood estimation (MLE). The Whale library implements inferential tools for phylogenetic birth-death process (BDP) models of gene family evolution, potentially with ancient whole-genome duplication (WGDs). Parameter inferences under the models which include ancient WGDs can be used to assess WGD hypotheses based on gene trees. With a little more work, we can also estimate (or rather sample) gene tree topologies and their reconciliations under the model of gene family evolution and the ALE approximation. This then allows us to conduct probabilistic orthology assignments and ancestral gene family reconstructions. Viewed in this regard, the ALE and Whale approach provide a methodology to conduct model-based gene tree – species tree reconciliation and tree-based orthology inference, while taking into account the uncertainty in the gene tree topology.

1. Loading data and the `WhaleModel`

Before moving on to actual inference, we illustrate the key components of the library. We start by loading the data and constructing a simple model object.

The minimally required packages to do something useful are the following:

using Whale, NewickTree

We will also load Plots.jl to have some graphics.

using Plots
default(legend=false, grid=false, framestyle=:box, size=(500,300))

We will use the land plant example data set (which is available in the github repository). We shall need a species tree

datadir = joinpath(@__DIR__, "../data/landplant")
tree = readnw(readline(joinpath(datadir, "speciestree.nw")))

((mpo:4.752,ppa:4.752):0.292,(smo:4.457,((scu:0.89,afi:0.89):3.337,((atr:2.293,(vvi:1.302,bvu:1.302):0.991):1.225,((cpa:2.852,gbi:2.852):0.326,sgi:3.178):0.34):0.7097):0.2299):0.587);

... and a model object

rates = ConstantDLWGD(λ=0.1, μ=0.2, η=1/1.5)
model = WhaleModel(rates, tree, 0.01)

WhaleModel
——————————
⋅Parameterization:
ConstantDLWGD{Float64, Float64}
  λ: Float64 0.1
  μ: Float64 0.2
  q: Array{Float64}((0,)) Float64[]
  p: Array{Float64}((0,)) Float64[]
  η: Float64 0.6666666666666666

⋅Condition:
Whale.RootCondition

⋅Model structure:
21 nodes (11 leaves, 0 WGD nodes)
node_id,wgd_id,distance,Δt,n,subtree
1,0,4.752,0.01,476,"mpo;"
2,0,4.752,0.01,476,"ppa;"
3,0,4.457,0.01,446,"smo;"
4,0,0.89,0.01,89,"scu;"
5,0,0.89,0.01,89,"afi;"
6,0,2.293,0.01,230,"atr;"
7,0,1.302,0.0099,131,"vvi;"
8,0,1.302,0.0099,131,"bvu;"
9,0,2.852,0.01,286,"cpa;"
10,0,2.852,0.01,286,"gbi;"
11,0,3.178,0.01,318,"sgi;"
12,0,0.292,0.0097,30,"(mpo,ppa);"
13,0,3.337,0.01,334,"(scu,afi);"
14,0,0.991,0.0099,100,"(vvi,bvu);"
15,0,1.225,0.01,123,"(atr,(vvi,bvu));"
16,0,0.326,0.0099,33,"(cpa,gbi);"
17,0,0.34,0.01,34,"((cpa,gbi),sgi);"
18,0,0.7097,0.01,71,"((atr,(vvi,bvu)),((cpa,gbi),sgi));"
19,0,0.2299,0.01,23,"((scu,afi),((atr,(vvi,bvu)),((cpa,gbi),sgi)));"
20,0,0.587,0.0099,59,"(smo,((scu,afi),((atr,(vvi,bvu)),((cpa,gbi),sgi))));"
21,0,0.0,0.0,0,"((mpo,ppa),(smo,((scu,afi),((atr,(vvi,bvu)),((cpa,gbi),sgi)))));"

rates represents the model parameters: here we assume DLWGD (duplication, loss and WGD) model with constant duplication and loss rates across the species tree. We assume a duplication rate $\lambda = 0.1$ and loss rate $\mu = 0.2$, with the number of genes at the root of a gene family distributed according to a Geometric distribution with mean $1.5$ (i.e. $\eta = 1/1.5$). We combine the rates which parameterize the phylogenetic BDP model with the species tree in a WhaleModel object (model). The last argument to WhaleModel (here 0.01) determines the discretization of the species tree. This entails that we assume there to be at most one represented gene duplication event in each time slice of length $0.01$. This parameter should be adjusted relative to the branch lengths of the species tree.

Note that when the model object is printed to the screen, we get quite some information. In particular, under Model structure we get a CSV formatted table with header node_id, wgd_id, distance, Δt, n, subtree, here n shows the number of slices for the relevant branch of the species tree, i.e. the maximum number of represented gene duplication events along that branch. It is best to check whether these are large enough (at least 10 or so, except for the root).

Now we can load the gene tree distributions (CCDs)

data = read_ale(joinpath(datadir, "100fams"), model)

100-element Vector{CCD{UInt16, Float64}}:
 CCD{UInt16,Float64}(Γ=349, 𝓛=31)
 CCD{UInt16,Float64}(Γ=377, 𝓛=26)
 CCD{UInt16,Float64}(Γ=169, 𝓛=25)
 CCD{UInt16,Float64}(Γ=1025, 𝓛=24)
 CCD{UInt16,Float64}(Γ=161, 𝓛=23)
 CCD{UInt16,Float64}(Γ=125, 𝓛=22)
 CCD{UInt16,Float64}(Γ=117, 𝓛=22)
 CCD{UInt16,Float64}(Γ=141, 𝓛=21)
 CCD{UInt16,Float64}(Γ=265, 𝓛=21)
 CCD{UInt16,Float64}(Γ=173, 𝓛=20)
 ⋮
 CCD{UInt16,Float64}(Γ=65, 𝓛=7)
 CCD{UInt16,Float64}(Γ=19, 𝓛=6)
 CCD{UInt16,Float64}(Γ=53, 𝓛=6)
 CCD{UInt16,Float64}(Γ=23, 𝓛=6)
 CCD{UInt16,Float64}(Γ=23, 𝓛=5)
 CCD{UInt16,Float64}(Γ=15, 𝓛=4)
 CCD{UInt16,Float64}(Γ=15, 𝓛=4)
 CCD{UInt16,Float64}(Γ=15, 𝓛=4)
 CCD{UInt16,Float64}(Γ=15, 𝓛=4)

Note that to load the CCDs (gene tree distributions), we need a WhaleModel object!

2. The loglikelihood

We can now, for instance, compute the log-likelihood for the first family

ℓ = logpdf(model, data[1])

-72.07128227445621

Or, assuming iid data, for the full data

ℓ = logpdf(model, data)

-2642.4047948842563

One can easily reparameterize the model using the following syntax:

model = model((λ=0.2,μ=0.1))

WhaleModel
——————————
⋅Parameterization:
ConstantDLWGD{Float64, Float64}
  λ: Float64 0.2
  μ: Float64 0.1
  q: Array{Float64}((0,)) Float64[]
  p: Array{Float64}((0,)) Float64[]
  η: Float64 0.6666666666666666

⋅Condition:
Whale.RootCondition

⋅Model structure:
21 nodes (11 leaves, 0 WGD nodes)
node_id,wgd_id,distance,Δt,n,subtree
1,0,4.752,0.01,476,"mpo;"
2,0,4.752,0.01,476,"ppa;"
3,0,4.457,0.01,446,"smo;"
4,0,0.89,0.01,89,"scu;"
5,0,0.89,0.01,89,"afi;"
6,0,2.293,0.01,230,"atr;"
7,0,1.302,0.0099,131,"vvi;"
8,0,1.302,0.0099,131,"bvu;"
9,0,2.852,0.01,286,"cpa;"
10,0,2.852,0.01,286,"gbi;"
11,0,3.178,0.01,318,"sgi;"
12,0,0.292,0.0097,30,"(ppa,mpo);"
13,0,3.337,0.01,334,"(afi,scu);"
14,0,0.991,0.0099,100,"(bvu,vvi);"
15,0,1.225,0.01,123,"((bvu,vvi),atr);"
16,0,0.326,0.0099,33,"(gbi,cpa);"
17,0,0.34,0.01,34,"((gbi,cpa),sgi);"
18,0,0.7097,0.01,71,"(((gbi,cpa),sgi),((bvu,vvi),atr));"
19,0,0.2299,0.01,23,"((((gbi,cpa),sgi),((bvu,vvi),atr)),(afi,scu));"
20,0,0.587,0.0099,59,"(((((gbi,cpa),sgi),((bvu,vvi),atr)),(afi,scu)),smo);"
21,0,0.0,0.0,0,"((((((gbi,cpa),sgi),((bvu,vvi),atr)),(afi,scu)),smo),(ppa,mpo));"

So that, for instance, we can easily graph a likelihood surface in the following way

plot(0:0.01:2, λ->logpdf(model((λ=λ, μ=λ)), data[1]), ylabel="\$\\ell\$", xlabel="\$\\lambda\$")

(where we assumed $\lambda = \mu$). Note that the loglikelihood can be differentiated using forward-mode automatic differentiation (AD) with ForwardDiff.jl. The library is currently not compatible with reverse mode AD or other fancy stuff.

3. Backtracking reconciled trees

We can sample a reconciled tree under the phylogenetic BDP using stochastic backtracking. That is, we sample a reconciled gene tree conditional on the unreconciled gene tree topology distribution and the likelihood under the model of gene family evolution:

ℓ = logpdf!(model, data)
G = Whale.backtrack(model, data[1])

Node(1, σ=21, t=0.0, speciation)

Note that the logpdf! step is crucial: the data is modified during the likelihood computations to enable backtracking.

The reconciled tree consists of nodes which look like this:

prewalk(G)[1:10]

10-element Vector{NewickTree.Node{UInt16, Whale.RecData{UInt16, Float64}}}:
 Node(1, σ=21, t=0.0, speciation)
 Node(2, σ=12, t=0.29, speciation)
 Node(3, σ=1, t=4.75, mpo_mpo001366)
 Node(4, σ=2, t=2.63, duplication)
 Node(5, σ=2, t=2.12, ppa_ppa004716)
 Node(6, σ=2, t=2.01, duplication)
 Node(7, σ=2, t=0.12, ppa_ppa024310)
 Node(8, σ=2, t=0.12, ppa_ppa022959)
 Node(9, σ=20, t=0.59, speciation)
 Node(10, σ=3, t=2.49, duplication)

Here σ marks the species tree branch/node to which the gene tree node is reconciled. If a gene tree node is marked as a duplication with σ=3, this means for instance that this node represents a duplication event along the branch leading to node 3 of the species tree. t marks the time point along the branch of the species tree, going from present to past, where the gene tree node is reconciled to. Speciation events have t=0 because they are mapped to the nodes in the species tree, instead of the branches (as duplication events are).

There are some plotting functions available, some better than others. The following plots the reconciled tree inside the species tree (duplication nodes in red, speciation nodes in blue)

plot(model, G, sscale=50.)

toy around with sscale if the plot doesn't look good (or implement a new function and do a pull request!). Simply plotting the reconciled tree is also possible

plot(G, right_margin=30Plots.mm, size=(500,500))

4. Inference

With likelihoods available, one has many possibilities for conducting inference. I suggest using the Turing.jl library for probabilistic programming, potentially together with Optim.jl for maximum likelihood or maximum a posteriori estimation. You might want to get familiar with the basic syntax for specifying probabilistic models using Turing, please consult the relevant docs and tutorials at https://turing.ml.

using Turing, Optim

We can for instance consider the $\lambda = \mu$ model for the first family, i.e. the problem for which we computed the likelihood curve above.

@model simplemodel(M, y, ::Type{T}=Float64) where T = begin
    λ ~ Exponential(0.2)
    y ~ M((λ=λ, μ=λ, q=T[]))
end

simplemodel (generic function with 3 methods)

Here we specify a probabilistic model with an Exponential(1) prior for the duplication (=loss) rate using Turing.jl syntax. Some minor annoyances with type stability lead to the [...] ::Type{T}=Float64) where T part and the explicit passing of q=T[] in the model object, sorry.

We can obtain a sample from the posterior using the NUTS algorithm

chain = sample(simplemodel(model, data[1]), NUTS(), 200)

Chains MCMC chain (200×13×1 Array{Float64, 3}):

Iterations        = 101:1:300
Number of chains  = 1
Samples per chain = 200
Wall duration     = 19.88 seconds
Compute duration  = 19.88 seconds
parameters        = λ
internals         = lp, n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size

Summary Statistics
  parameters      mean       std   naive_se      mcse       ess      rhat   es ⋯
      Symbol   Float64   Float64    Float64   Float64   Float64   Float64      ⋯

           λ    0.2005    0.0467     0.0033    0.0056   69.4199    1.0356      ⋯
                                                                1 column omitted

Quantiles
  parameters      2.5%     25.0%     50.0%     75.0%     97.5%
      Symbol   Float64   Float64   Float64   Float64   Float64

           λ    0.1133    0.1685    0.2006    0.2314    0.3062

That makes sense. Alternatively, we can conduct MLE using Optim and Turing:

optimize(simplemodel(model, data[1]), MLE())

ModeResult with maximized lp of -67.26
1-element Named Vector{Float64}
A  │ 
───┼─────────
:λ │ 0.184282

Using probabilistic programs we can construct complicated hierarchical models of gene family evolution. Here's a very slight elaboration of the previous model, where we now assign a hyperprior to th number of lineages at the root:

@model secondmodel(M, y, ::Type{T}=Float64) where T = begin
    λ ~ Exponential(0.2)
    η ~ Beta(4,2)
    y ~ M((λ=λ, μ=λ, q=T[], η=η))
end

secondmodel (generic function with 3 methods)

we consider the first 10 families as data this time

chain = sample(secondmodel(model, data[1:10]), NUTS(), 200)

Chains MCMC chain (200×14×1 Array{Float64, 3}):

Iterations        = 101:1:300
Number of chains  = 1
Samples per chain = 200
Wall duration     = 43.08 seconds
Compute duration  = 43.08 seconds
parameters        = λ, η
internals         = lp, n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size

Summary Statistics
  parameters      mean       std   naive_se      mcse        ess      rhat   e ⋯
      Symbol   Float64   Float64    Float64   Float64    Float64   Float64     ⋯

           λ    0.1550    0.0162     0.0011    0.0011   160.9182    0.9986     ⋯
           η    0.6451    0.1068     0.0075    0.0184    38.2423    0.9996     ⋯
                                                                1 column omitted

Quantiles
  parameters      2.5%     25.0%     50.0%     75.0%     97.5%
      Symbol   Float64   Float64   Float64   Float64   Float64

           λ    0.1253    0.1433    0.1550    0.1638    0.1928
           η    0.3984    0.5859    0.6676    0.7150    0.8246

Let's compare the $\eta$ prior and posterior:

histogram(chain[:η], normalize=true, color=:white, xlabel="\$\\eta\$", ylabel="probability density")
plot!(0:0.01:1, x->pdf(Beta(4,2), x))

5. Sampling reconciled trees from the posterior

The interface for sampling reconciled trees from the posterior by stochastic backtracking is not so user-friendly yet and is due for some updates, but it is not difficult either. We first cast the chain as a data frame:

using DataFrames
df = DataFrame(chain);

We have to define a function which takes a model and a row from the dataframe (i.e. a sample from the posterior) and returns a model parameterized by that sample.

modelfun(M, x) = M((λ=x[:λ], μ=x[:λ], η=x[:η]))

modelfun (generic function with 1 method)

Next we define the tracker

tracker = TreeTracker(model, data[1:10], df, modelfun)

TreeTracker{Float64, CCD{UInt16, Float64}, DataFrames.DataFrame}(WhaleModel
——————————
⋅Parameterization:
ConstantDLWGD{Float64, Float64}
  λ: Float64 0.2
  μ: Float64 0.1
  q: Array{Float64}((0,)) Float64[]
  p: Array{Float64}((0,)) Float64[]
  η: Float64 0.6666666666666666

⋅Condition:
Whale.RootCondition

⋅Model structure:
21 nodes (11 leaves, 0 WGD nodes)
node_id,wgd_id,distance,Δt,n,subtree
1,0,4.752,0.01,476,"mpo;"
2,0,4.752,0.01,476,"ppa;"
3,0,4.457,0.01,446,"smo;"
4,0,0.89,0.01,89,"scu;"
5,0,0.89,0.01,89,"afi;"
6,0,2.293,0.01,230,"atr;"
7,0,1.302,0.0099,131,"vvi;"
8,0,1.302,0.0099,131,"bvu;"
9,0,2.852,0.01,286,"cpa;"
10,0,2.852,0.01,286,"gbi;"
11,0,3.178,0.01,318,"sgi;"
12,0,0.292,0.0097,30,"(ppa,mpo);"
13,0,3.337,0.01,334,"(afi,scu);"
14,0,0.991,0.0099,100,"(bvu,vvi);"
15,0,1.225,0.01,123,"((bvu,vvi),atr);"
16,0,0.326,0.0099,33,"(gbi,cpa);"
17,0,0.34,0.01,34,"((gbi,cpa),sgi);"
18,0,0.7097,0.01,71,"(((gbi,cpa),sgi),((bvu,vvi),atr));"
19,0,0.2299,0.01,23,"((((gbi,cpa),sgi),((bvu,vvi),atr)),(afi,scu));"
20,0,0.587,0.0099,59,"(((((gbi,cpa),sgi),((bvu,vvi),atr)),(afi,scu)),smo);"
21,0,0.0,0.0,0,"((((((gbi,cpa),sgi),((bvu,vvi),atr)),(afi,scu)),smo),(ppa,mpo));"
, CCD{UInt16, Float64}[CCD{UInt16,Float64}(Γ=349, 𝓛=31), CCD{UInt16,Float64}(Γ=377, 𝓛=26), CCD{UInt16,Float64}(Γ=169, 𝓛=25), CCD{UInt16,Float64}(Γ=1025, 𝓛=24), CCD{UInt16,Float64}(Γ=161, 𝓛=23), CCD{UInt16,Float64}(Γ=125, 𝓛=22), CCD{UInt16,Float64}(Γ=117, 𝓛=22), CCD{UInt16,Float64}(Γ=141, 𝓛=21), CCD{UInt16,Float64}(Γ=265, 𝓛=21), CCD{UInt16,Float64}(Γ=173, 𝓛=20)], 200×16 DataFrame
 Row │ iteration  chain  λ         η         lp        n_steps  is_accept  acc ⋯
     │ Int64      Int64  Float64   Float64   Float64   Float64  Float64    Flo ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │       101      1  0.193876  0.570272  -498.082     11.0        1.0      ⋯
   2 │       102      1  0.170136  0.570952  -496.142      1.0        1.0
   3 │       103      1  0.189298  0.731714  -497.64      11.0        1.0
   4 │       104      1  0.179792  0.740302  -496.827      1.0        1.0
   5 │       105      1  0.140922  0.658847  -495.796     11.0        1.0      ⋯
   6 │       106      1  0.17125   0.696388  -495.979      7.0        1.0
   7 │       107      1  0.162896  0.380863  -498.693     15.0        1.0
   8 │       108      1  0.143386  0.396983  -498.551      3.0        1.0
  ⋮  │     ⋮        ⋮       ⋮         ⋮         ⋮         ⋮         ⋮          ⋱
 194 │       294      1  0.158166  0.685923  -495.47       5.0        1.0      ⋯
 195 │       295      1  0.167266  0.759077  -496.186     15.0        1.0
 196 │       296      1  0.17039   0.758052  -496.337      1.0        1.0
 197 │       297      1  0.13907   0.686631  -495.93       7.0        1.0
 198 │       298      1  0.147638  0.607302  -495.635     15.0        1.0      ⋯
 199 │       299      1  0.15568   0.536849  -496.032      7.0        1.0
 200 │       300      1  0.131305  0.554953  -497.109      3.0        1.0
                                                  9 columns and 185 rows omitted, Main.modelfun)

and we sample for each family 100 reconciled trees from the posterior, using the posterior sample in df.

out = track(tracker, 100)

10-element Vector{Whale.RecSummary}:
 RecSummary(# unique trees = 34)
 RecSummary(# unique trees = 83)
 RecSummary(# unique trees = 80)
 RecSummary(# unique trees = 90)
 RecSummary(# unique trees = 16)
 RecSummary(# unique trees = 8)
 RecSummary(# unique trees = 11)
 RecSummary(# unique trees = 37)
 RecSummary(# unique trees = 35)
 RecSummary(# unique trees = 18)

Each family has a RecSummary object, which among other things stores the sampled reconciled trees

out[1].trees

34-element Vector{NamedTuple}:
 (freq = 0.32, tree = Node(1, σ=21, t=0.0, speciation))
 (freq = 0.08, tree = Node(1, σ=21, t=0.0, duplication))
 (freq = 0.06, tree = Node(1, σ=21, t=0.0, speciation))
 (freq = 0.06, tree = Node(1, σ=21, t=0.0, speciation))
 (freq = 0.06, tree = Node(1, σ=21, t=0.0, speciation))
 (freq = 0.05, tree = Node(1, σ=21, t=0.0, speciation))
 (freq = 0.04, tree = Node(1, σ=21, t=0.0, speciation))
 (freq = 0.03, tree = Node(1, σ=21, t=0.0, speciation))
 (freq = 0.02, tree = Node(1, σ=21, t=0.0, speciation))
 (freq = 0.02, tree = Node(1, σ=21, t=0.0, duplication))
 ⋮
 (freq = 0.01, tree = Node(1, σ=21, t=0.0, speciation))
 (freq = 0.01, tree = Node(1, σ=21, t=0.0, speciation))
 (freq = 0.01, tree = Node(1, σ=21, t=0.0, speciation))
 (freq = 0.01, tree = Node(1, σ=21, t=0.0, duplication))
 (freq = 0.01, tree = Node(1, σ=21, t=0.0, duplication))
 (freq = 0.01, tree = Node(1, σ=21, t=0.0, duplication))
 (freq = 0.01, tree = Node(1, σ=21, t=0.0, duplication))
 (freq = 0.01, tree = Node(1, σ=21, t=0.0, duplication))
 (freq = 0.01, tree = Node(1, σ=21, t=0.0, duplication))

This is the MAP tree of the first family:

out[1].trees[1]

(freq = 0.32, tree = Node(1, σ=21, t=0.0, speciation))

We plot the MAP tree for the first family

plot(out[1].trees[1].tree, cred=true, size=(500,500), right_margin=30Plots.mm)

The numbers associated with each node indicate the estimated posterior probability that the reconciled tree contains this split reconciled as a duplication/speciation on that particular branch of the species tree.

6. Events and orthology

We can summarize the number of events on each branch of the species tree using the following function:

smry = Whale.summarize(out)
smry.sum

21 rows × 26 columns

	node	clade	loss_mean	wgd_mean	wgdloss_mean	duplication_mean	sploss_mean	speciation_mean	loss_std	wgd_std	wgdloss_std	duplication_std	sploss_std	speciation_std	loss_q1	wgd_q1	wgdloss_q1	duplication_q1	sploss_q1	speciation_q1	loss_q2	wgd_q2	wgdloss_q2	duplication_q2	sploss_q2	speciation_q2
	UInt16	String	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64
1	1	mpo	0.337	0.0	0.0	0.201	0.0	1.3	0.526717	0.0	0.0	0.43428	0.0	0.640312	0.0	0.0	0.0	0.0	0.0	1.0	1.0	0.0	0.0	1.0	0.0	3.0
2	2	ppa	0.411	0.0	0.0	1.175	0.0	2.2	0.630935	0.0	0.0	1.13946	0.0	1.16619	0.0	0.0	0.0	0.0	0.0	1.0	2.0	0.0	0.0	3.0	0.0	4.0
3	3	smo	0.567	0.0	0.0	0.44	0.0	1.5	0.528688	0.0	0.0	0.496387	0.0	0.5	0.0	0.0	0.0	0.0	0.0	1.0	1.0	0.0	0.0	1.0	0.0	2.0
4	4	scu	0.468	0.0	0.0	0.141	0.0	1.7	0.583932	0.0	0.0	0.348022	0.0	1.00499	0.0	0.0	0.0	0.0	0.0	0.0	2.0	0.0	0.0	1.0	0.0	4.0
5	5	afi	0.109	0.0	0.0	0.682	0.0	2.6	0.32729	0.0	0.0	0.644109	0.0	0.8	0.0	0.0	0.0	0.0	0.0	1.0	1.0	0.0	0.0	2.0	0.0	4.0
6	6	atr	0.311	0.0	0.0	0.0	0.0	1.4	0.467203	0.0	0.0	0.0	0.0	0.489898	0.0	0.0	0.0	0.0	0.0	1.0	1.0	0.0	0.0	0.0	0.0	2.0
7	7	vvi	0.008	0.0	0.0	0.531	0.0	2.5	0.0890842	0.0	0.0	0.91271	0.0	1.20416	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	3.0	0.0	5.0
8	8	bvu	0.377	0.0	0.0	0.5	0.0	2.1	0.490786	0.0	0.0	1.0247	0.0	1.13578	0.0	0.0	0.0	0.0	0.0	1.0	1.0	0.0	0.0	3.0	0.0	5.0
9	9	cpa	0.786	0.0	0.0	0.595	0.0	2.3	1.07991	0.0	0.0	0.795597	0.0	1.1	0.0	0.0	0.0	0.0	0.0	1.0	4.0	0.0	0.0	2.0	0.0	4.0
10	10	gbi	0.55	0.0	0.0	0.559	0.0	2.5	0.679338	0.0	0.0	0.816406	0.0	1.74642	0.0	0.0	0.0	0.0	0.0	0.0	2.0	0.0	0.0	2.0	0.0	5.0
11	11	sgi	0.519	0.0	0.0	1.324	0.0	3.4	0.511507	0.0	0.0	2.02015	0.0	2.2	0.0	0.0	0.0	0.0	0.0	1.0	1.0	0.0	0.0	6.0	0.0	7.0
12	12	ppa,mpo	0.184	0.0	0.0	0.013	0.748	0.688	0.395151	0.0	0.0	0.113274	1.10295	0.46331	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	3.0	1.0
13	13	afi,scu	0.325	0.0	0.0	0.756	0.577	1.45	0.487211	0.0	0.0	0.80403	0.629342	0.750666	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	3.0	2.0	3.0
14	14	bvu,vvi	0.132	0.0	0.0	0.398	0.385	1.592	0.344349	0.0	0.0	0.493554	0.514563	0.495516	0.0	0.0	0.0	0.0	0.0	1.0	1.0	0.0	0.0	1.0	1.0	2.0
15	15	bvu,atr	0.615	0.0	0.0	0.383	0.443	1.268	0.693379	0.0	0.0	0.631119	0.600626	0.502171	0.0	0.0	0.0	0.0	0.0	0.0	2.0	0.0	0.0	2.0	2.0	2.0
16	16	gbi,cpa	0.124	0.0	0.0	0.02	1.336	1.155	0.329582	0.0	0.0	0.146969	1.12121	0.887116	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	4.0	3.0
17	17	gbi,sgi	0.002	0.0	0.0	0.654	0.643	1.952	0.0446766	0.0	0.0	0.66954	0.671975	0.998847	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	2.0	2.0	5.0
18	18	gbi,bvu	0.071	0.0	0.0	0.418	0.617	1.326	0.264498	0.0	0.0	0.750517	0.695925	0.468747	0.0	0.0	0.0	0.0	0.0	1.0	1.0	0.0	0.0	2.025	2.0	2.0
19	19	gbi,afi	0.097	0.0	0.0	0.066	0.396	1.2	0.295958	0.0	0.0	0.248282	0.622241	0.507937	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	1.0	2.0	2.0
20	20	gbi,smo	0.106	0.0	0.0	0.126	0.664	0.963	0.341707	0.0	0.0	0.337822	0.587456	0.188762	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	1.0	2.0	1.0
21	21	gbi,ppa	0.0	0.0	0.0	0.607	0.29	1.317	0.0	0.0	0.0	0.693218	0.525262	0.469586	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	2.0	2.0	2.0

Here we summary[1,"duplication_mean"] will for instance record the posterior expected number of duplications per family along the branch leading to node 1.

Another useful output are the posterior probabilities for pairs of genes to be derived from certain events in the species tree

pairs = Whale.getpairs(out, model);

For instance, we can take a look at the first 10 gene pairs in this data frame. We first filter out all irrelevant columns:

idx = map(x->!all(x .== 0), eachcol(pairs[1:10,:]))
subset = pairs[1:10,idx]

10 rows × 6 columns

	21_speciation	12_speciation	2_duplication	21_duplication	family	pair
	Float64	Float64	Float64	Float64	String	String
1	0.0	0.78	0.0	0.22	OG0001262.fasta.mafft.msa.treesample.ale	mpo_mpo001366__ppa_ppa004716
2	0.0	0.78	0.0	0.22	OG0001262.fasta.mafft.msa.treesample.ale	mpo_mpo001366__ppa_ppa024310
3	0.0	0.0	1.0	0.0	OG0001262.fasta.mafft.msa.treesample.ale	ppa_ppa004716__ppa_ppa024310
4	0.0	0.78	0.0	0.22	OG0001262.fasta.mafft.msa.treesample.ale	mpo_mpo001366__ppa_ppa022959
5	0.0	0.0	1.0	0.0	OG0001262.fasta.mafft.msa.treesample.ale	ppa_ppa004716__ppa_ppa022959
6	0.0	0.0	1.0	0.0	OG0001262.fasta.mafft.msa.treesample.ale	ppa_ppa022959__ppa_ppa024310
7	0.96	0.0	0.0	0.04	OG0001262.fasta.mafft.msa.treesample.ale	mpo_mpo001366__smo_smo017511
8	0.76	0.0	0.0	0.24	OG0001262.fasta.mafft.msa.treesample.ale	ppa_ppa004716__smo_smo017511
9	0.76	0.0	0.0	0.24	OG0001262.fasta.mafft.msa.treesample.ale	ppa_ppa024310__smo_smo017511
10	0.76	0.0	0.0	0.24	OG0001262.fasta.mafft.msa.treesample.ale	ppa_ppa022959__smo_smo017511

The estimated posterior probabilty that gene pair

subset[1,"pair"]

"mpo_mpo001366__ppa_ppa004716"

... is derived from a duplication along the branch leading to node 21 is

subset[1,"21_duplication"]

0.22

7. Going further

For more details with regard to, for instance, the inference of ancient WGDs, or fitting more complicated model, consider having a look at the examples. If there are questions, comments, issues, or anything else, feel free to open issues on the gitub repository.

This page was generated using Literate.jl.