Bayesian inference for the DLWGD model

Note

Here it is assumed the reader is already familiar with the material outlined in the Tutorial.

using Whale, NewickTree, Distributions, Turing, DataFrames, LinearAlgebra
using Plots, StatsPlots
using Random; Random.seed!(7137);

plotting defaults

default(grid=false, size=(500,800), titlefontsize=9, title_loc=:left, guidefont=8)

Using a constant-rates model

First we will do inference for a simple constant-rates model (i.e. assuming a single duplication and loss rate for the entire species tree). First we load the species tree (using the example tree available in the Whale library)

t = deepcopy(Whale.extree)
n = length(postwalk(t))  # total number of nodes in the tree
17
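As a quick sanity check on this count (an illustrative aside, not part of the Whale API): a rooted binary tree with L leaves has 2L − 1 nodes in total, and each WGD insertion below adds one extra (degree-two) node.

```julia
# A rooted binary species tree with L leaves has L - 1 internal nodes,
# i.e. 2L - 1 nodes in total.
nleaves = 9                    # MPOL, PPAT, SMOE, GBIL, PABI, OSAT, ATHA, CPAP, ATRI
ntotal = 2 * nleaves - 1       # 17, matching length(postwalk(t))
# each insertnode! call below adds a single degree-two WGD node
nwithwgds = ntotal + 2         # 19, matching the "19 nodes" in the model summary
```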

Now we add two WGD nodes to the tree. We do this by specifying the last common ancestor node for the lineages that share the hypothetical WGD. By default, the added node is halfway between the specified node and its parent.

insertnode!(getlca(t, "PPAT", "PPAT"), name="wgd_1")
insertnode!(getlca(t, "ATHA", "ATRI"), name="wgd_2")
(((OSAT:1.555,(ATHA:0.5548,CPAP:0.5548):1.0002):0.738,ATRI:2.293):0.6125):0.6125;

and we obtain a reference model object, using the constant-rates model with two WGDs

θ = ConstantDLWGD(λ=0.1, μ=0.2, q=[0.2, 0.1], η=0.9)
w = WhaleModel(θ, t, .1, minn=10, maxn=20)
WhaleModel
——————————
⋅Parameterization:
ConstantDLWGD{Float64, Float64}
  λ: Float64 0.1
  μ: Float64 0.2
  q: Array{Float64}((2,)) [0.2, 0.1]
  p: Array{Float64}((0,)) Float64[]
  η: Float64 0.9

⋅Condition:
Whale.RootCondition

⋅Model structure:
19 nodes (9 leaves, 2 WGD nodes)
node_id,wgd_id,distance,Δt,n,subtree
1,0,4.752,0.2376,20,"MPOL;"
2,0,2.376,0.1188,20,"PPAT;"
3,0,4.457,0.2228,20,"SMOE;"
4,0,3.178,0.1589,20,"GBIL;"
5,0,3.178,0.1589,20,"PABI;"
6,0,1.555,0.0972,16,"OSAT;"
7,0,0.5548,0.0555,10,"ATHA;"
8,0,0.5548,0.0555,10,"CPAP;"
9,0,2.293,0.1146,20,"ATRI;"
18,1,2.376,0.1188,20,"(PPAT);"
10,0,0.292,0.0292,10,"(MPOL,(PPAT));"
11,0,0.34,0.034,10,"(GBIL,PABI);"
12,0,1.0002,0.0909,11,"(ATHA,CPAP);"
13,0,0.738,0.0738,10,"(OSAT,(ATHA,CPAP));"
14,0,0.6125,0.0613,10,"((OSAT,(ATHA,CPAP)),ATRI);"
19,2,0.6125,0.0613,10,"(((OSAT,(ATHA,CPAP)),ATRI));"
15,0,0.939,0.0939,10,"((GBIL,PABI),(((OSAT,(ATHA,CPAP)),ATRI)));"
16,0,0.587,0.0587,10,"(SMOE,((GBIL,PABI),(((OSAT,(ATHA,CPAP)),ATRI))));"
17,0,0.0,0.0,0,"((MPOL,(PPAT)),(SMOE,((GBIL,PABI),(((OSAT,(ATHA,CPAP)),ATRI)))));"

next we get the data (we need a model object for that)

data = joinpath(@__DIR__, "../../example/example-1/ale")
ccd = read_ale(data, w)
12-element Vector{CCD{UInt16, Float64}}:
 CCD{UInt16,Float64}(Γ=83, 𝓛=13)
 CCD{UInt16,Float64}(Γ=55, 𝓛=13)
 CCD{UInt16,Float64}(Γ=89, 𝓛=13)
 CCD{UInt16,Float64}(Γ=131, 𝓛=13)
 CCD{UInt16,Float64}(Γ=107, 𝓛=13)
 CCD{UInt16,Float64}(Γ=59, 𝓛=13)
 CCD{UInt16,Float64}(Γ=53, 𝓛=13)
 CCD{UInt16,Float64}(Γ=83, 𝓛=13)
 CCD{UInt16,Float64}(Γ=59, 𝓛=13)
 CCD{UInt16,Float64}(Γ=95, 𝓛=13)
 CCD{UInt16,Float64}(Γ=67, 𝓛=13)
 CCD{UInt16,Float64}(Γ=65, 𝓛=13)

Now we define the Turing model

@model constantrates(model, ccd) = begin
    λ  ~ Exponential()
    μ  ~ Exponential()
    η  ~ Beta(3,1)
    q1 ~ Beta()
    q2 ~ Beta()
    ccd ~ model((λ=λ, μ=μ, η=η, q=[q1, q2]))
end
constantrates (generic function with 2 methods)

In this model we have, line by line:

  • λ and μ: the duplication and loss rate, for which we assume Exponential priors.
  • η: the parameter of the geometric prior distribution on the number of genes at the root (i.e. the Whale likelihood is integrated over a geometric prior for the number of ancestral genes)
  • q1: the retention rate for the first WGD (with a uniform prior on [0,1], i.e. Beta(1,1))
  • q2: as q1
  • ccd ~ model(...): here we specify the likelihood; we assume the data (ccd) are iid draws from the duplication+loss+WGD model with the relevant parameters.
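To make the root prior concrete, here is a small base-Julia sketch of a shifted geometric distribution on the number of ancestral lineages; this assumes the parameterization with support on {1, 2, …}, which should be checked against the Whale documentation:

```julia
# shifted geometric pmf: P(N = k) = η (1 - η)^(k - 1) for k = 1, 2, ...
geompmf(k::Int, η::Float64) = η * (1 - η)^(k - 1)

η = 0.9
geompmf(1, η)                      # 0.9 -> most prior mass on a single ancestral gene
geompmf(2, η)                      # 0.09
sum(geompmf(k, η) for k in 1:50)   # ≈ 1, the pmf sums to one
```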

chain0 = sample(constantrates(w, ccd), NUTS(), 200)
Chains MCMC chain (200×17×1 Array{Float64, 3}):

Iterations        = 101:1:300
Number of chains  = 1
Samples per chain = 200
Wall duration     = 28.86 seconds
Compute duration  = 28.86 seconds
parameters        = λ, μ, η, q1, q2
internals         = lp, n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size

Summary Statistics
  parameters      mean       std   naive_se      mcse        ess      rhat   e ⋯
      Symbol   Float64   Float64    Float64   Float64    Float64   Float64     ⋯

           λ    0.1275    0.0182     0.0013    0.0013   205.3097    0.9956     ⋯
           μ    0.1564    0.0268     0.0019    0.0011   197.3678    0.9971     ⋯
           η    0.7905    0.1097     0.0078    0.0088   161.5741    1.0020     ⋯
          q1    0.3450    0.2316     0.0164    0.0173   117.0962    0.9955     ⋯
          q2    0.0871    0.0838     0.0059    0.0077    61.8036    0.9981     ⋯
                                                                1 column omitted

Quantiles
  parameters      2.5%     25.0%     50.0%     75.0%     97.5%
      Symbol   Float64   Float64   Float64   Float64   Float64

           λ    0.0944    0.1161    0.1276    0.1388    0.1628
           μ    0.1031    0.1367    0.1564    0.1755    0.2031
           η    0.5614    0.7004    0.8128    0.8771    0.9449
          q1    0.0053    0.1575    0.3225    0.5135    0.8697
          q2    0.0018    0.0288    0.0664    0.1103    0.3253

Making some trace plots is straightforward using tools from the Turing probabilistic programming ecosystem

plot(chain0, size=(700,900))

We can compute Bayes factors for the WGD hypotheses

summarize(chain0[[:q1,:q2]], Whale.bayesfactor)

  parameters   bayesfactor
      Symbol        String

          q1           0.5
          q2           0.9

This is the log10 Bayes factor in favor of the $q = 0$ model. A log10 Bayes factor smaller than -2 can be considered strong evidence in favor of the $q \ne 0$ model over the $q = 0$ model. This in itself need not say much, however, as it says nothing about how well the model actually fits the data.
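The estimator used by Whale.bayesfactor is not spelled out here, but a Savage–Dickey-style estimate can be sketched in a few lines of base Julia. This is purely an illustration, assuming a Beta(1,1) prior (whose density at q = 0 equals 1) and a crude histogram-bin density estimate; the actual Whale implementation may differ:

```julia
# crude Savage-Dickey density ratio at q = 0:
# log10 BF(q = 0 vs q ≠ 0) ≈ log10(posterior density at 0 / prior density at 0),
# with the posterior density at 0 estimated by the fraction of samples in [0, ϵ).
function log10bf_sd(qs; ϵ=0.05)
    post0 = count(q -> q < ϵ, qs) / (length(qs) * ϵ)
    log10(post0 / 1.0)   # Beta(1,1) prior density at 0 is 1
end

qs = vcat(fill(0.01, 10), fill(0.5, 90))   # toy posterior sample
log10bf_sd(qs)                             # ≈ 0.30 for this toy sample
```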

Warning

Of course the chain should be run longer than in this example! Here a short chain is presented to ensure reasonable build times for this documentation. Generally, one should strive for ESS values exceeding at least 100, although short chains can be useful for exploring and testing different models.

Now let's obtain reconciled trees

posterior = DataFrame(chain0)
ffun = (m, x)->m((λ=x[:λ], μ=x[:μ], η=x[:η], q=[x[:q1], x[:q2]]))
tt = TreeTracker(w, ccd, posterior, ffun)
trees = track(tt, 1000)
12-element Vector{Whale.RecSummary}:
 RecSummary(# unique trees = 96)
 RecSummary(# unique trees = 28)
 RecSummary(# unique trees = 77)
 RecSummary(# unique trees = 210)
 RecSummary(# unique trees = 67)
 RecSummary(# unique trees = 64)
 RecSummary(# unique trees = 58)
 RecSummary(# unique trees = 77)
 RecSummary(# unique trees = 43)
 RecSummary(# unique trees = 49)
 RecSummary(# unique trees = 84)
 RecSummary(# unique trees = 97)

Note that ffun is a function that takes the model object m and a row x from the posterior data frame, returning a model parameterized by the posterior sample in that row.

Let's have a look at the first family

trees[1].trees
96-element Vector{NamedTuple}:
 (freq = 0.451, tree = Node(1, σ=17, t=0.0, speciation))
 (freq = 0.088, tree = Node(1, σ=17, t=0.0, speciation))
 (freq = 0.059, tree = Node(1, σ=17, t=0.0, duplication))
 (freq = 0.041, tree = Node(1, σ=17, t=0.0, speciation))
 (freq = 0.034, tree = Node(1, σ=17, t=0.0, speciation))
 (freq = 0.029, tree = Node(1, σ=17, t=0.0, speciation))
 (freq = 0.026, tree = Node(1, σ=17, t=0.0, speciation))
 (freq = 0.026, tree = Node(1, σ=17, t=0.0, speciation))
 (freq = 0.022, tree = Node(1, σ=17, t=0.0, speciation))
 (freq = 0.016, tree = Node(1, σ=17, t=0.0, duplication))
 ⋮
 (freq = 0.001, tree = Node(1, σ=17, t=0.0, duplication))
 (freq = 0.001, tree = Node(1, σ=17, t=0.0, duplication))
 (freq = 0.001, tree = Node(1, σ=17, t=0.0, speciation))
 (freq = 0.001, tree = Node(1, σ=17, t=0.0, duplication))
 (freq = 0.001, tree = Node(1, σ=17, t=0.0, duplication))
 (freq = 0.001, tree = Node(1, σ=17, t=0.0, speciation))
 (freq = 0.001, tree = Node(1, σ=17, t=0.0, speciation))
 (freq = 0.001, tree = Node(1, σ=17, t=0.0, speciation))
 (freq = 0.001, tree = Node(1, σ=17, t=0.0, speciation))

We can write these to a file using Whale.writetrees("filename.nw", trees[1].trees) if we want. The maximum a posteriori tree for this family is

map1 = trees[1].trees[1]
nwstr(map1.tree)
"(((PPAT_Pp3c9_4950V3.1_Pp3c9_4950:2.3760000000000017)1.0:2.375999999999999,MPOL_Mapoly0036s0119.1_Mapoly0036s0119:4.752000000000001)0.846:0.2919999999999998,(SMOE_SMO111G0185.1_SMO111G0185:4.457000000000001,(((PABI_PAB00009793.1_PAB00009793:3.178000000000001,loss_1172007" ⋯ 731 bytes ⋯ "0.601:0.40917272727272735)0.652:0.5910272727272726,loss_4985017985652007989:1.5550000000000002)0.922:0.4796999999999998)0.964:0.2583000000000002)0.989:0.6125000000000003)0.991:0.6125000000000003)0.988:0.70425)0.952:0.23475000000000001)0.852:0.5869999999999997)0.841:0.0;"

with posterior probability

map1.freq
0.451

Or maybe all the gene pairs

ps = Whale.getpairs(trees, w);

Now let's look at the gene pairs which have a non-zero posterior probability of being derived from WGD node 18 (the P. patens WGD, execute @show w to check the model structure)

p = filter(x->x[Symbol("18_wgd")] > 0.0, ps)[!,:pair]
17-element Vector{String}:
 "PPAT_Pp3c18_4330V3.2_Pp3c18_4330__PPAT_Pp3c25_11450V3.1_Pp3c25_11450"
 "PPAT_Pp3c22_17550V3.2_Pp3c22_17550__PPAT_Pp3c25_11450V3.1_Pp3c25_11450"
 "PPAT_Pp3c18_4330V3.2_Pp3c18_4330__PPAT_Pp3c22_17550V3.2_Pp3c22_17550"
 "PPAT_Pp3c21_20710V3.3_Pp3c21_20710__PPAT_Pp3c25_11450V3.1_Pp3c25_11450"
 "PPAT_Pp3c18_4330V3.2_Pp3c18_4330__PPAT_Pp3c21_20710V3.3_Pp3c21_20710"
 "PPAT_Pp3c21_20710V3.3_Pp3c21_20710__PPAT_Pp3c22_17550V3.2_Pp3c22_17550"
 "PPAT_Pp3c25_11450V3.1_Pp3c25_11450__PPAT_Pp3c6_2690V3.3_Pp3c6_2690"
 "PPAT_Pp3c18_4330V3.2_Pp3c18_4330__PPAT_Pp3c6_2690V3.3_Pp3c6_2690"
 "PPAT_Pp3c22_17550V3.2_Pp3c22_17550__PPAT_Pp3c6_2690V3.3_Pp3c6_2690"
 "PPAT_Pp3c21_20710V3.3_Pp3c21_20710__PPAT_Pp3c6_2690V3.3_Pp3c6_2690"
 "PPAT_Pp3c11_14610V3.1_Pp3c11_14610__PPAT_Pp3c8_17600V3.1_Pp3c8_17600"
 "PPAT_Pp3c24_11620V3.3_Pp3c24_11620__PPAT_Pp3c8_17600V3.1_Pp3c8_17600"
 "PPAT_Pp3c11_14610V3.1_Pp3c11_14610__PPAT_Pp3c24_11620V3.3_Pp3c24_11620"
 "PPAT_Pp3c12_12390V3.2_Pp3c12_12390__PPAT_Pp3c8_2050V3.1_Pp3c8_2050"
 "PPAT_Pp3c12_12390V3.2_Pp3c12_12390__PPAT_Pp3c3_35300V3.2_Pp3c3_35300"
 "PPAT_Pp3c3_35300V3.2_Pp3c3_35300__PPAT_Pp3c8_2050V3.1_Pp3c8_2050"
 "PPAT_Pp3c15_24010V3.1_Pp3c15_24010__PPAT_Pp3c9_24230V3.2_Pp3c9_24230"

Now we can look at the complete posterior reconciliation distribution for these gene pairs

df18 = filter(x->x[:pair] ∈ p, ps)
df18[:,[!(all(col .== 0)) for col in eachcol(df18)]]  # filter out all zero columns...

17 rows × 7 columns

18_duplication,10_duplication,2_duplication,18_wgd,17_duplication,family,pair
0.275,0.0,0.363,0.36,0.002,OG0004533.fasta.nex.treesample.ale,PPAT_Pp3c18_4330V3.2_Pp3c18_4330__PPAT_Pp3c25_11450V3.1_Pp3c25_11450
0.257,0.0,0.402,0.339,0.002,OG0004533.fasta.nex.treesample.ale,PPAT_Pp3c22_17550V3.2_Pp3c22_17550__PPAT_Pp3c25_11450V3.1_Pp3c25_11450
0.069,0.0,0.801,0.13,0.0,OG0004533.fasta.nex.treesample.ale,PPAT_Pp3c18_4330V3.2_Pp3c18_4330__PPAT_Pp3c22_17550V3.2_Pp3c22_17550
0.618,0.112,0.039,0.099,0.132,OG0004533.fasta.nex.treesample.ale,PPAT_Pp3c21_20710V3.3_Pp3c21_20710__PPAT_Pp3c25_11450V3.1_Pp3c25_11450
0.618,0.112,0.039,0.099,0.132,OG0004533.fasta.nex.treesample.ale,PPAT_Pp3c18_4330V3.2_Pp3c18_4330__PPAT_Pp3c21_20710V3.3_Pp3c21_20710
0.618,0.112,0.039,0.099,0.132,OG0004533.fasta.nex.treesample.ale,PPAT_Pp3c21_20710V3.3_Pp3c21_20710__PPAT_Pp3c22_17550V3.2_Pp3c22_17550
0.618,0.112,0.039,0.099,0.132,OG0004533.fasta.nex.treesample.ale,PPAT_Pp3c25_11450V3.1_Pp3c25_11450__PPAT_Pp3c6_2690V3.3_Pp3c6_2690
0.618,0.112,0.039,0.099,0.132,OG0004533.fasta.nex.treesample.ale,PPAT_Pp3c18_4330V3.2_Pp3c18_4330__PPAT_Pp3c6_2690V3.3_Pp3c6_2690
0.618,0.112,0.039,0.099,0.132,OG0004533.fasta.nex.treesample.ale,PPAT_Pp3c22_17550V3.2_Pp3c22_17550__PPAT_Pp3c6_2690V3.3_Pp3c6_2690
0.13,0.001,0.523,0.341,0.005,OG0004533.fasta.nex.treesample.ale,PPAT_Pp3c21_20710V3.3_Pp3c21_20710__PPAT_Pp3c6_2690V3.3_Pp3c6_2690
0.446,0.0,0.211,0.343,0.0,OG0004544.fasta.nex.treesample.ale,PPAT_Pp3c11_14610V3.1_Pp3c11_14610__PPAT_Pp3c8_17600V3.1_Pp3c8_17600
0.446,0.0,0.211,0.343,0.0,OG0004544.fasta.nex.treesample.ale,PPAT_Pp3c24_11620V3.3_Pp3c24_11620__PPAT_Pp3c8_17600V3.1_Pp3c8_17600
0.062,0.0,0.779,0.159,0.0,OG0004544.fasta.nex.treesample.ale,PPAT_Pp3c11_14610V3.1_Pp3c11_14610__PPAT_Pp3c24_11620V3.3_Pp3c24_11620
0.437,0.001,0.199,0.361,0.002,OG0004548.fasta.nex.treesample.ale,PPAT_Pp3c12_12390V3.2_Pp3c12_12390__PPAT_Pp3c8_2050V3.1_Pp3c8_2050
0.437,0.001,0.199,0.361,0.002,OG0004548.fasta.nex.treesample.ale,PPAT_Pp3c12_12390V3.2_Pp3c12_12390__PPAT_Pp3c3_35300V3.2_Pp3c3_35300
0.053,0.0,0.778,0.169,0.0,OG0004548.fasta.nex.treesample.ale,PPAT_Pp3c3_35300V3.2_Pp3c3_35300__PPAT_Pp3c8_2050V3.1_Pp3c8_2050
0.224,0.001,0.394,0.381,0.0,OG0004585.fasta.nex.treesample.ale,PPAT_Pp3c15_24010V3.1_Pp3c15_24010__PPAT_Pp3c9_24230V3.2_Pp3c9_24230

The column names are <branch id>_<event>, and the entries are the posterior probability that the gene pair is reconciled to the relevant branch + event combination.
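A small helper (hypothetical, not part of the Whale API) to split such a column name into its branch id and event type:

```julia
# split a "<branch id>_<event>" column name at the first underscore
function splitcol(name::AbstractString)
    i = findfirst('_', name)
    parse(Int, name[1:i-1]), name[i+1:end]
end

splitcol("18_wgd")           # (18, "wgd")
splitcol("17_duplication")   # (17, "duplication")
```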

Here we have for each WGD event in the tree all gene pairs that have non-zero posterior probability (as measured by the frequency in the posterior sample) to be reconciled to the relevant WGD event.

A table summarizing events for each branch can be obtained as well

smry = Whale.summarize(trees)
smry.sum

19 rows × 26 columns

node,clade,loss_mean,wgd_mean,wgdloss_mean,duplication_mean,sploss_mean,speciation_mean,loss_std,wgd_std,wgdloss_std,duplication_std,sploss_std,speciation_std,loss_q1,wgd_q1,wgdloss_q1,duplication_q1,sploss_q1,speciation_q1,loss_q2,wgd_q2,wgdloss_q2,duplication_q2,sploss_q2,speciation_q2
1,MPOL,0.251083,0.0,0.0,0.411583,0.0,1.41667,0.437653,0.0,0.0,0.635098,0.0,0.953794,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0,0.0,4.0
2,PPAT,0.0,0.0,0.0,0.345083,0.0,1.66667,0.0,0.0,0.0,0.700596,0.0,1.31233,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,5.0
3,SMOE,0.4245,0.0,0.0,0.1665,0.0,1.08333,0.702946,0.0,0.0,0.552369,0.0,0.759203,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,2.0,0.0,3.0
4,GBIL,0.244167,0.0,0.0,0.0885,0.0,1.33333,0.433454,0.0,0.0,0.284021,0.0,0.62361,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,3.0
5,PABI,0.0531667,0.0,0.0,0.6475,0.0,2.08333,0.224737,0.0,0.0,0.690708,0.0,0.862007,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,2.0,0.0,3.0
6,OSAT,0.354417,0.0,0.0,0.498417,0.0,1.58333,0.513133,0.0,0.0,0.499997,0.0,0.493007,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,2.0
7,ATHA,0.335917,0.0,0.0,0.324333,0.0,1.5,0.643229,0.0,0.0,0.468125,0.0,0.5,0.0,0.0,0.0,0.0,0.0,1.0,2.0,0.0,0.0,1.0,0.0,2.0
8,CPAP,0.343417,0.0,0.0,0.0818333,0.0,1.25,0.493607,0.0,0.0,0.305183,0.0,0.924211,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,4.0
9,ATRI,0.51725,0.0,0.0,0.16625,0.0,1.08333,0.664607,0.0,0.0,0.551765,0.0,0.640095,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,2.0,0.0,3.0
10,"MPOL,PPAT",0.0849167,0.0,0.0,0.0181667,0.56925,0.686917,0.337203,0.0,0.0,0.135413,0.862673,0.463748,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,3.0,1.0
11,"GBIL,PABI",0.356167,0.0,0.0,0.127,0.297333,1.19167,0.630855,0.0,0.0,0.336706,0.501424,0.445642,0.0,0.0,0.0,0.0,0.0,1.0,2.0,0.0,0.0,1.0,1.0,2.0
12,"ATHA,CPAP",0.0918333,0.0,0.0,0.164083,0.679333,0.83225,0.291662,0.0,0.0,0.396434,0.883368,0.373644,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,2.0,1.0
13,"OSAT,ATHA",0.09025,0.0,0.0,0.09525,0.44625,0.993083,0.28683,0.0,0.0,0.307155,0.683821,0.416176,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,2.0,2.0
14,"OSAT,ATRI",0.0,0.0,0.0,0.0278333,0.6075,0.826833,0.0,0.0,0.0,0.165001,0.889538,0.378391,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,3.0,1.0
15,"GBIL,OSAT",0.0719167,0.0,0.0,0.44875,0.721333,0.996833,0.259637,0.0,0.0,0.689957,0.940042,0.411692,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,3.0,3.0,2.0
16,"SMOE,GBIL",0.0373333,0.0,0.0,0.05575,0.496417,0.844917,0.199431,0.0,0.0,0.249283,0.808592,0.451884,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,3.0,2.0
17,"MPOL,SMOE",0.0,0.0,0.0,0.322917,0.12225,1.20067,0.0,0.0,0.0,0.62006,0.421076,0.49689,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,2.0,2.0,3.0
18,wgd_1,0.318167,0.193083,0.935417,0.1905,0.0,0.0,0.567688,0.439472,0.631463,0.483608,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,3.0,2.0,0.0,0.0
19,wgd_2,0.365167,0.02975,1.347,0.02375,0.0,0.0,0.485956,0.169897,0.671757,0.154982,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,3.0,0.0,0.0,0.0

here we have the expected number of duplications, losses, etc. per family for the different branches in the species tree (the row associated with a node corresponds to the branch leading to that node)

Maximum-likelihood estimation

using Optim
result = optimize(constantrates(w, ccd), MLE())
ModeResult with maximized lp of -327.04
5-element Named Vector{Float64}
A   │ 
────┼────────────
:λ  │    0.126987
:μ  │    0.150525
:η  │    0.796657
:q1 │    0.201446
:q2 │ 4.22784e-19

One could now compute the likelihood using the model without WGD 1 and/or 2 and compare the likelihoods using a likelihood ratio test as in Rabier et al. (2014) to assess whether the data is compatible with the hypothesis $q = 0$ (which should represent absence of a WGD).
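As a sketch of that comparison (the log-likelihood without the WGD, lnL0 below, is a made-up placeholder; also note that since $q = 0$ lies on the boundary of the parameter space, the null distribution is often taken as a 50:50 mixture of χ²₀ and χ²₁ rather than a plain χ²₁ — see Rabier et al. (2014) for the exact recipe):

```julia
lnL1 = -327.04     # maximized log-likelihood with both WGDs (from the fit above)
lnL0 = -330.50     # hypothetical value for the model without a WGD (placeholder!)
Λ = 2 * (lnL1 - lnL0)   # likelihood-ratio test statistic
crit = 3.84             # χ²₁ critical value at α = 0.05 (conservative at the boundary)
Λ > crit                # true here -> reject q = 0 at the 5% level
```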

Using a branch-specific rates model

Now we will consider a model with branch-specific duplication and loss rates, using a more complicated hierarchical model with an uncorrelated relaxed clock model. We'll use the same tree as above. The relevant model now is the DLWGD model:

θ = DLWGD(λ=zeros(n), μ=zeros(n), q=rand(2), η=rand())
w = WhaleModel(θ, t, 0.1)
ccd = read_ale(data, w)
12-element Vector{CCD{UInt16, Float64}}:
 CCD{UInt16,Float64}(Γ=83, 𝓛=13)
 CCD{UInt16,Float64}(Γ=55, 𝓛=13)
 CCD{UInt16,Float64}(Γ=89, 𝓛=13)
 CCD{UInt16,Float64}(Γ=131, 𝓛=13)
 CCD{UInt16,Float64}(Γ=107, 𝓛=13)
 CCD{UInt16,Float64}(Γ=59, 𝓛=13)
 CCD{UInt16,Float64}(Γ=53, 𝓛=13)
 CCD{UInt16,Float64}(Γ=83, 𝓛=13)
 CCD{UInt16,Float64}(Γ=59, 𝓛=13)
 CCD{UInt16,Float64}(Γ=95, 𝓛=13)
 CCD{UInt16,Float64}(Γ=67, 𝓛=13)
 CCD{UInt16,Float64}(Γ=65, 𝓛=13)

Note that for the DLWGD model the duplication and loss rates should be specified on a log scale. We assume a Normal prior for the mean log duplication and loss rates, and let the branch-specific log rates be distributed according to a multivariate normal centered on those means with a shared standard deviation τ (i.e. an isotropic covariance matrix τ²I). We assume duplication and loss rates to be independent.
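To get a feel for what this hierarchical prior implies, one can simulate from it directly. The sketch below uses only base Julia (Exponential with mean 0.1 sampled by inversion; recall that the scalar second argument of MvNormal is a standard deviation, so τ plays that role here):

```julia
using Random
rng = MersenneTwister(42)

λ̄ = log(0.15) + 2 * randn(rng)   # mean log duplication rate
τ = -0.1 * log(rand(rng))        # Exponential with mean 0.1, via inversion sampling
λ = λ̄ .+ τ .* randn(rng, 16)     # branch-specific log rates (n - 1 = 16 branches)
rates = exp.(λ)                  # back to the natural scale
extrema(rates)                   # the spread shrinks as τ -> 0
```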

@model branchrates(model, n, ccd, ::Type{T}=Float64) where T = begin
    η ~ Beta(3,1)
    λ̄ ~ Normal(log(0.15), 2)
    μ̄ ~ Normal(log(0.15), 2)
    τ ~ Exponential(0.1)
    λ ~ MvNormal(fill(λ̄, n-1), τ)
    μ ~ MvNormal(fill(μ̄, n-1), τ)
    q1 ~ Beta()
    q2 ~ Beta()
    ccd ~ model((λ=λ, μ=μ, η=η, q=[q1, q2]))
end
branchrates (generic function with 3 methods)

... and sample (this takes a bit longer!)

chain1 = sample(branchrates(w, n, ccd), NUTS(), 200)
Chains MCMC chain (200×50×1 Array{Float64, 3}):

Iterations        = 101:1:300
Number of chains  = 1
Samples per chain = 200
Wall duration     = 61.85 seconds
Compute duration  = 61.85 seconds
parameters        = η, λ̄, μ̄, τ, λ[1], λ[2], λ[3], λ[4], λ[5], λ[6], λ[7], λ[8], λ[9], λ[10], λ[11], λ[12], λ[13], λ[14], λ[15], λ[16], μ[1], μ[2], μ[3], μ[4], μ[5], μ[6], μ[7], μ[8], μ[9], μ[10], μ[11], μ[12], μ[13], μ[14], μ[15], μ[16], q1, q2
internals         = lp, n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size

Summary Statistics
  parameters      mean       std   naive_se      mcse       ess      rhat   es ⋯
      Symbol   Float64   Float64    Float64   Float64   Float64   Float64      ⋯

           η    0.7624    0.0088     0.0006    0.0022    5.0093    1.1250      ⋯
           λ̄   -2.4873    0.0326     0.0023    0.0054   17.8020    1.0396      ⋯
           μ̄   -1.8039    0.0282     0.0020    0.0062    2.2817    1.4026      ⋯
           τ    0.0546    0.0044     0.0003    0.0012    1.3286    2.1096      ⋯
        λ[1]   -2.4700    0.0681     0.0048    0.0172    6.8383    1.0380      ⋯
        λ[2]   -2.4520    0.0482     0.0034    0.0118    5.9925    1.1298      ⋯
        λ[3]   -2.4764    0.0373     0.0026    0.0086   11.7615    1.0227      ⋯
        λ[4]   -2.5010    0.0495     0.0035    0.0111    2.1071    1.4507      ⋯
        λ[5]   -2.5082    0.0521     0.0037    0.0131    8.7465    1.0463      ⋯
        λ[6]   -2.4845    0.0338     0.0024    0.0083    7.9660    1.1339      ⋯
        λ[7]   -2.4415    0.0458     0.0032    0.0112    1.5246    1.7847      ⋯
        λ[8]   -2.4669    0.0468     0.0033    0.0098   13.3540    1.1634      ⋯
        λ[9]   -2.5028    0.0510     0.0036    0.0119    3.8259    1.2367      ⋯
       λ[10]   -2.4905    0.0542     0.0038    0.0139   10.3745    1.0277      ⋯
       λ[11]   -2.4677    0.0422     0.0030    0.0101    8.8714    1.0836      ⋯
       λ[12]   -2.4920    0.0377     0.0027    0.0092   10.6115    1.0324      ⋯
       λ[13]   -2.4785    0.0639     0.0045    0.0167    6.4747    1.0838      ⋯
      ⋮           ⋮         ⋮         ⋮          ⋮         ⋮         ⋮         ⋱
                                                    1 column and 21 rows omitted

Quantiles
  parameters      2.5%     25.0%     50.0%     75.0%     97.5%
      Symbol   Float64   Float64   Float64   Float64   Float64

           η    0.7474    0.7575    0.7638    0.7678    0.7753
           λ̄   -2.5367   -2.5124   -2.4845   -2.4731   -2.4111
           μ̄   -1.8448   -1.8231   -1.8135   -1.7888   -1.7339
           τ    0.0486    0.0509    0.0528    0.0577    0.0635
        λ[1]   -2.5987   -2.5301   -2.4645   -2.4167   -2.3462
        λ[2]   -2.5379   -2.5034   -2.4298   -2.4147   -2.3750
        λ[3]   -2.5539   -2.5046   -2.4742   -2.4484   -2.4117
        λ[4]   -2.5791   -2.5355   -2.5126   -2.4698   -2.3962
        λ[5]   -2.6020   -2.5547   -2.5175   -2.4661   -2.4251
        λ[6]   -2.5501   -2.5055   -2.4825   -2.4619   -2.4106
        λ[7]   -2.5260   -2.4640   -2.4429   -2.4029   -2.3648
        λ[8]   -2.5320   -2.5088   -2.4776   -2.4367   -2.3612
        λ[9]   -2.5959   -2.5361   -2.5146   -2.4588   -2.4032
       λ[10]   -2.5782   -2.5141   -2.4917   -2.4503   -2.3990
       λ[11]   -2.5420   -2.5039   -2.4590   -2.4472   -2.4020
       λ[12]   -2.5637   -2.5133   -2.4826   -2.4724   -2.4323
       λ[13]   -2.6138   -2.5140   -2.4743   -2.4406   -2.3621
      ⋮           ⋮         ⋮         ⋮         ⋮         ⋮
                                                 21 rows omitted

Make a plot for the retention parameters

plot(chain1[[:q1,:q2]], size=(700,300))

Posterior predictive simulations

We can do posterior predictive simulations to assess model fit. There are of course many possible posterior predictive observables that we may employ to do so. The approach below is but one that is (partially) implemented. Here we compare simulated number of events for each branch with reconstructed number of events for each branch. That is, for $N$ samples from the posterior, we (1) simulate a data set of the size of our empirical data set and (2) sample a reconciled tree for each gene family. We then compare, for instance, the number of duplications on the branch leading to node $m$ in the two simulated sets. If the model fits these should be similar.
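The logic can be illustrated with a self-contained toy example using pretend posterior draws and a Poisson observation model; this is purely illustrative, and the real work for reconciled-event counts is done by Whale.ppsims:

```julia
using Random
rng = MersenneTwister(1)

# minimal Poisson sampler (Knuth's algorithm), so the sketch needs only Base
function rpois(rng, λ)
    L, k, p = exp(-λ), 0, 1.0
    while true
        p *= rand(rng)
        p < L && return k
        k += 1
    end
end

y = 7                                            # "observed" number of events
postλ = [abs(6.5 + randn(rng)) for _ in 1:200]   # pretend posterior draws of a rate
yrep = [rpois(rng, λ) for λ in postλ]            # one replicated count per draw
ppp = count(>=(y), yrep) / length(yrep)          # posterior predictive p-value
```

An extreme ppp (near 0 or 1) would indicate that the model rarely reproduces the observed statistic, i.e. poor fit in that respect.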

We need a function to get a parameterized model from a chain iterate:

function mfun(M, x)
    q1 = get(x, :q1).q1[1]
    q2 = get(x, :q2).q2[1]
    λ = vec(vcat(get(x, :λ).λ...))
    μ = vec(vcat(get(x, :μ).μ...))
    η = get(x, :η).η[1]
    M((λ=λ, μ=μ, q=[q1,q2], η=η))
end
mfun (generic function with 1 method)

The following will then do 100 posterior predictive simulations

pps = Whale.ppsims(chain1, mfun, w, ccd, 100);

and we make a plot for duplication events

lab = "duplication"
ps = map(w.order) do mnode
    l = "$(id(mnode))_$lab"
    dots = map(x->(x[1][l], x[2][l]), pps)
    xmn, xmx = extrema(first.(dots))  # range of the observed statistic
    scatter(dots, color=:black, ms=2, alpha=0.5, legend=false,
            xlabel="\$y\$", ylabel="\$\\tilde{y}\$",
            title=Whale.cladelabel(mnode),
            xlim=(xmn-0.5,xmx+0.5), xticks=xmn:1:xmx)
    plot!(x->x, color=:lightgray)
end
plot(ps..., size=(700,600))

one can of course also compare loss events, speciation events, WGD-derived duplicates etc.

The above is somewhat more informative when analyzing more data. Using this approach on chain0 and comparing against the results displayed here, one could for instance check whether the constant-rates assumption is strongly violated. If the dots in these plots are not systematically above or below the 1:1 line, at least this aspect of the data is explained well by the model.

Note

I know that this, and the sampling methods for reconciled trees, should be implemented in a somewhat more consistent and user-friendly style. It's on the to-do list.


This page was generated using Literate.jl.