Bayesian inference for the DLWGD model
Here it is assumed the reader is already familiar with the material outlined in the Tutorial.
using Whale, NewickTree, Distributions, Turing, DataFrames, LinearAlgebra
using Plots, StatsPlots
using Random; Random.seed!(7137);
plotting defaults
default(grid=false, size=(500,800), titlefontsize=9, title_loc=:left, guidefont=8)
Using a constant-rates model
First we will do inference for a simple constant-rates model (i.e. assuming a single duplication and loss rate for the entire species tree). First we load the species tree (using the example tree available in the Whale library)
t = deepcopy(Whale.extree)
n = length(postwalk(t)) # number of internal nodes
17
Now we add two WGD nodes to the tree. We do this by specifying the last common ancestor node for the lineages that share the hypothetical WGD. By default, the added node is halfway between the specified node and its parent.
insertnode!(getlca(t, "PPAT", "PPAT"), name="wgd_1")
insertnode!(getlca(t, "ATHA", "ATRI"), name="wgd_2")
(((OSAT:1.555,(ATHA:0.5548,CPAP:0.5548):1.0002):0.738,ATRI:2.293):0.6125):0.6125;
and we obtain a reference model object, using the constant-rates model with two WGDs
θ = ConstantDLWGD(λ=0.1, μ=0.2, q=[0.2, 0.1], η=0.9)
w = WhaleModel(θ, t, .1, minn=10, maxn=20)
WhaleModel
——————————
⋅Parameterization:
ConstantDLWGD{Float64, Float64}
λ: Float64 0.1
μ: Float64 0.2
q: Array{Float64}((2,)) [0.2, 0.1]
p: Array{Float64}((0,)) Float64[]
η: Float64 0.9
⋅Condition:
Whale.RootCondition
⋅Model structure:
19 nodes (9 leaves, 2 WGD nodes)
node_id,wgd_id,distance,Δt,n,subtree
1,0,4.752,0.2376,20,"MPOL;"
2,0,2.376,0.1188,20,"PPAT;"
3,0,4.457,0.2228,20,"SMOE;"
4,0,3.178,0.1589,20,"GBIL;"
5,0,3.178,0.1589,20,"PABI;"
6,0,1.555,0.0972,16,"OSAT;"
7,0,0.5548,0.0555,10,"ATHA;"
8,0,0.5548,0.0555,10,"CPAP;"
9,0,2.293,0.1146,20,"ATRI;"
18,1,2.376,0.1188,20,"(PPAT);"
10,0,0.292,0.0292,10,"(MPOL,(PPAT));"
11,0,0.34,0.034,10,"(GBIL,PABI);"
12,0,1.0002,0.0909,11,"(ATHA,CPAP);"
13,0,0.738,0.0738,10,"(OSAT,(ATHA,CPAP));"
14,0,0.6125,0.0613,10,"((OSAT,(ATHA,CPAP)),ATRI);"
19,2,0.6125,0.0613,10,"(((OSAT,(ATHA,CPAP)),ATRI));"
15,0,0.939,0.0939,10,"((GBIL,PABI),(((OSAT,(ATHA,CPAP)),ATRI)));"
16,0,0.587,0.0587,10,"(SMOE,((GBIL,PABI),(((OSAT,(ATHA,CPAP)),ATRI))));"
17,0,0.0,0.0,0,"((MPOL,(PPAT)),(SMOE,((GBIL,PABI),(((OSAT,(ATHA,CPAP)),ATRI)))));"
next we get the data (we need a model object for that)
data = joinpath(@__DIR__, "../../example/example-1/ale")
ccd = read_ale(data, w)
12-element Vector{CCD{UInt16, Float64}}:
CCD{UInt16,Float64}(Γ=83, 𝓛=13)
CCD{UInt16,Float64}(Γ=55, 𝓛=13)
CCD{UInt16,Float64}(Γ=89, 𝓛=13)
CCD{UInt16,Float64}(Γ=131, 𝓛=13)
CCD{UInt16,Float64}(Γ=107, 𝓛=13)
CCD{UInt16,Float64}(Γ=59, 𝓛=13)
CCD{UInt16,Float64}(Γ=53, 𝓛=13)
CCD{UInt16,Float64}(Γ=83, 𝓛=13)
CCD{UInt16,Float64}(Γ=59, 𝓛=13)
CCD{UInt16,Float64}(Γ=95, 𝓛=13)
CCD{UInt16,Float64}(Γ=67, 𝓛=13)
CCD{UInt16,Float64}(Γ=65, 𝓛=13)
Now we define the Turing model
@model constantrates(model, ccd) = begin
λ ~ Exponential()
μ ~ Exponential()
η ~ Beta(3,1)
q1 ~ Beta()
q2 ~ Beta()
ccd ~ model((λ=λ, μ=μ, η=η, q=[q1, q2]))
end
constantrates (generic function with 2 methods)
In this model we have line by line:
λ
andμ
: the duplication and loss rate, for which we assume Exponential priors.η
: the parameter of the geometric prior distribution on the number of genes at the root (i.e. the Whale likelihood is integrated over a geometric prior for the number of ancestral genes)q1
: the retention rate for the first WGD (with a uniform prior on [0,1], i.e. Beta(1,1))q2
: asq1
ccd ~ model(...)
: here we specify the likelihood, we assume the data (ccd
) is iid from the duplication+loss+WGD model with the relevant parameters.
chain0 = sample(constantrates(w, ccd), NUTS(), 200)
Chains MCMC chain (200×17×1 Array{Float64, 3}):
Iterations = 101:1:300
Number of chains = 1
Samples per chain = 200
Wall duration = 28.86 seconds
Compute duration = 28.86 seconds
parameters = λ, μ, η, q1, q2
internals = lp, n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size
Summary Statistics
parameters mean std naive_se mcse ess rhat e ⋯
Symbol Float64 Float64 Float64 Float64 Float64 Float64 ⋯
λ 0.1275 0.0182 0.0013 0.0013 205.3097 0.9956 ⋯
μ 0.1564 0.0268 0.0019 0.0011 197.3678 0.9971 ⋯
η 0.7905 0.1097 0.0078 0.0088 161.5741 1.0020 ⋯
q1 0.3450 0.2316 0.0164 0.0173 117.0962 0.9955 ⋯
q2 0.0871 0.0838 0.0059 0.0077 61.8036 0.9981 ⋯
1 column omitted
Quantiles
parameters 2.5% 25.0% 50.0% 75.0% 97.5%
Symbol Float64 Float64 Float64 Float64 Float64
λ 0.0944 0.1161 0.1276 0.1388 0.1628
μ 0.1031 0.1367 0.1564 0.1755 0.2031
η 0.5614 0.7004 0.8128 0.8771 0.9449
q1 0.0053 0.1575 0.3225 0.5135 0.8697
q2 0.0018 0.0288 0.0664 0.1103 0.3253
Making some trace plots is straightforward using tools from the Turing probabilistic programming ecosystem
plot(chain0, size=(700,900))
We can compute Bayes factors for the WGD hypotheses
summarize(chain0[[:q1,:q2]], Whale.bayesfactor)
parameters bayesfactor
Symbol String
q1 0.5
q2 0.9
This is the log10 Bayes factor in favor of the $q = 0$ model. A Bayes factor smaller than -2 could be considered as evidence in favor of the $q \ne 0$ model compared to the $q=0$ model. This in itself need not say much, as it says nothing about how well the model actually fits the data.
Of course the chain should be run longer than in this example! Here a short chain is presented to ensure reasonable build times for this documentation. Generally, one should at least strive for ESS values exceeding at least 100, although short chains may be good for exploring and testing different models.
Now let's obtain reconciled trees
posterior = DataFrame(chain0)
ffun = (m, x)->m((λ=x[:λ], μ=x[:μ], η=x[:η], q=[x[:q1], x[:q2]]))
tt = TreeTracker(w, ccd, posterior, ffun)
trees = track(tt, 1000)
12-element Vector{Whale.RecSummary}:
RecSummary(# unique trees = 96)
RecSummary(# unique trees = 28)
RecSummary(# unique trees = 77)
RecSummary(# unique trees = 210)
RecSummary(# unique trees = 67)
RecSummary(# unique trees = 64)
RecSummary(# unique trees = 58)
RecSummary(# unique trees = 77)
RecSummary(# unique trees = 43)
RecSummary(# unique trees = 49)
RecSummary(# unique trees = 84)
RecSummary(# unique trees = 97)
Note that fun
is a function that takes the model object and a row from the posterior data frame, returning a model parameterized by the posterior sample in the row x
.
Let's have a look at the first family
trees[1].trees
96-element Vector{NamedTuple}:
(freq = 0.451, tree = Node(1, σ=17, t=0.0, speciation))
(freq = 0.088, tree = Node(1, σ=17, t=0.0, speciation))
(freq = 0.059, tree = Node(1, σ=17, t=0.0, duplication))
(freq = 0.041, tree = Node(1, σ=17, t=0.0, speciation))
(freq = 0.034, tree = Node(1, σ=17, t=0.0, speciation))
(freq = 0.029, tree = Node(1, σ=17, t=0.0, speciation))
(freq = 0.026, tree = Node(1, σ=17, t=0.0, speciation))
(freq = 0.026, tree = Node(1, σ=17, t=0.0, speciation))
(freq = 0.022, tree = Node(1, σ=17, t=0.0, speciation))
(freq = 0.016, tree = Node(1, σ=17, t=0.0, duplication))
⋮
(freq = 0.001, tree = Node(1, σ=17, t=0.0, duplication))
(freq = 0.001, tree = Node(1, σ=17, t=0.0, duplication))
(freq = 0.001, tree = Node(1, σ=17, t=0.0, speciation))
(freq = 0.001, tree = Node(1, σ=17, t=0.0, duplication))
(freq = 0.001, tree = Node(1, σ=17, t=0.0, duplication))
(freq = 0.001, tree = Node(1, σ=17, t=0.0, speciation))
(freq = 0.001, tree = Node(1, σ=17, t=0.0, speciation))
(freq = 0.001, tree = Node(1, σ=17, t=0.0, speciation))
(freq = 0.001, tree = Node(1, σ=17, t=0.0, speciation))
We can write these to a file using Whale.writetrees("filename.nw", trees[1].trees)
if we would want that. The maximum a posterior tree for this family is
map1 = trees[1].trees[1]
nwstr(map1.tree)
"(((PPAT_Pp3c9_4950V3.1_Pp3c9_4950:2.3760000000000017)1.0:2.375999999999999,MPOL_Mapoly0036s0119.1_Mapoly0036s0119:4.752000000000001)0.846:0.2919999999999998,(SMOE_SMO111G0185.1_SMO111G0185:4.457000000000001,(((PABI_PAB00009793.1_PAB00009793:3.178000000000001,loss_1172007" ⋯ 731 bytes ⋯ "0.601:0.40917272727272735)0.652:0.5910272727272726,loss_4985017985652007989:1.5550000000000002)0.922:0.4796999999999998)0.964:0.2583000000000002)0.989:0.6125000000000003)0.991:0.6125000000000003)0.988:0.70425)0.952:0.23475000000000001)0.852:0.5869999999999997)0.841:0.0;"
with posterior probability
map1.freq
0.451
Or maybe all the gene pairs
ps = Whale.getpairs(trees, w);
Now let's look at the gene pairs which have a non-zero posterior probability of being derived from WGD node 18 (the P. patens WGD, execute @show w
to check the model structure)
p = filter(x->x[Symbol("18_wgd")] > 0.0, ps)[!,:pair]
17-element Vector{String}:
"PPAT_Pp3c18_4330V3.2_Pp3c18_4330__PPAT_Pp3c25_11450V3.1_Pp3c25_11450"
"PPAT_Pp3c22_17550V3.2_Pp3c22_17550__PPAT_Pp3c25_11450V3.1_Pp3c25_11450"
"PPAT_Pp3c18_4330V3.2_Pp3c18_4330__PPAT_Pp3c22_17550V3.2_Pp3c22_17550"
"PPAT_Pp3c21_20710V3.3_Pp3c21_20710__PPAT_Pp3c25_11450V3.1_Pp3c25_11450"
"PPAT_Pp3c18_4330V3.2_Pp3c18_4330__PPAT_Pp3c21_20710V3.3_Pp3c21_20710"
"PPAT_Pp3c21_20710V3.3_Pp3c21_20710__PPAT_Pp3c22_17550V3.2_Pp3c22_17550"
"PPAT_Pp3c25_11450V3.1_Pp3c25_11450__PPAT_Pp3c6_2690V3.3_Pp3c6_2690"
"PPAT_Pp3c18_4330V3.2_Pp3c18_4330__PPAT_Pp3c6_2690V3.3_Pp3c6_2690"
"PPAT_Pp3c22_17550V3.2_Pp3c22_17550__PPAT_Pp3c6_2690V3.3_Pp3c6_2690"
"PPAT_Pp3c21_20710V3.3_Pp3c21_20710__PPAT_Pp3c6_2690V3.3_Pp3c6_2690"
"PPAT_Pp3c11_14610V3.1_Pp3c11_14610__PPAT_Pp3c8_17600V3.1_Pp3c8_17600"
"PPAT_Pp3c24_11620V3.3_Pp3c24_11620__PPAT_Pp3c8_17600V3.1_Pp3c8_17600"
"PPAT_Pp3c11_14610V3.1_Pp3c11_14610__PPAT_Pp3c24_11620V3.3_Pp3c24_11620"
"PPAT_Pp3c12_12390V3.2_Pp3c12_12390__PPAT_Pp3c8_2050V3.1_Pp3c8_2050"
"PPAT_Pp3c12_12390V3.2_Pp3c12_12390__PPAT_Pp3c3_35300V3.2_Pp3c3_35300"
"PPAT_Pp3c3_35300V3.2_Pp3c3_35300__PPAT_Pp3c8_2050V3.1_Pp3c8_2050"
"PPAT_Pp3c15_24010V3.1_Pp3c15_24010__PPAT_Pp3c9_24230V3.2_Pp3c9_24230"
Now we can look at the complete posterior reconciliation distribution for these gene pairs
df18 = filter(x->x[:pair] ∈ p, ps)
df18[:,[!(all(col .== 0)) for col in eachcol(df18)]] # filter out all zero columns...
17 rows × 7 columns
18_duplication | 10_duplication | 2_duplication | 18_wgd | 17_duplication | family | pair | |
---|---|---|---|---|---|---|---|
Float64 | Float64 | Float64 | Float64 | Float64 | String | String | |
1 | 0.275 | 0.0 | 0.363 | 0.36 | 0.002 | OG0004533.fasta.nex.treesample.ale | PPAT_Pp3c18_4330V3.2_Pp3c18_4330__PPAT_Pp3c25_11450V3.1_Pp3c25_11450 |
2 | 0.257 | 0.0 | 0.402 | 0.339 | 0.002 | OG0004533.fasta.nex.treesample.ale | PPAT_Pp3c22_17550V3.2_Pp3c22_17550__PPAT_Pp3c25_11450V3.1_Pp3c25_11450 |
3 | 0.069 | 0.0 | 0.801 | 0.13 | 0.0 | OG0004533.fasta.nex.treesample.ale | PPAT_Pp3c18_4330V3.2_Pp3c18_4330__PPAT_Pp3c22_17550V3.2_Pp3c22_17550 |
4 | 0.618 | 0.112 | 0.039 | 0.099 | 0.132 | OG0004533.fasta.nex.treesample.ale | PPAT_Pp3c21_20710V3.3_Pp3c21_20710__PPAT_Pp3c25_11450V3.1_Pp3c25_11450 |
5 | 0.618 | 0.112 | 0.039 | 0.099 | 0.132 | OG0004533.fasta.nex.treesample.ale | PPAT_Pp3c18_4330V3.2_Pp3c18_4330__PPAT_Pp3c21_20710V3.3_Pp3c21_20710 |
6 | 0.618 | 0.112 | 0.039 | 0.099 | 0.132 | OG0004533.fasta.nex.treesample.ale | PPAT_Pp3c21_20710V3.3_Pp3c21_20710__PPAT_Pp3c22_17550V3.2_Pp3c22_17550 |
7 | 0.618 | 0.112 | 0.039 | 0.099 | 0.132 | OG0004533.fasta.nex.treesample.ale | PPAT_Pp3c25_11450V3.1_Pp3c25_11450__PPAT_Pp3c6_2690V3.3_Pp3c6_2690 |
8 | 0.618 | 0.112 | 0.039 | 0.099 | 0.132 | OG0004533.fasta.nex.treesample.ale | PPAT_Pp3c18_4330V3.2_Pp3c18_4330__PPAT_Pp3c6_2690V3.3_Pp3c6_2690 |
9 | 0.618 | 0.112 | 0.039 | 0.099 | 0.132 | OG0004533.fasta.nex.treesample.ale | PPAT_Pp3c22_17550V3.2_Pp3c22_17550__PPAT_Pp3c6_2690V3.3_Pp3c6_2690 |
10 | 0.13 | 0.001 | 0.523 | 0.341 | 0.005 | OG0004533.fasta.nex.treesample.ale | PPAT_Pp3c21_20710V3.3_Pp3c21_20710__PPAT_Pp3c6_2690V3.3_Pp3c6_2690 |
11 | 0.446 | 0.0 | 0.211 | 0.343 | 0.0 | OG0004544.fasta.nex.treesample.ale | PPAT_Pp3c11_14610V3.1_Pp3c11_14610__PPAT_Pp3c8_17600V3.1_Pp3c8_17600 |
12 | 0.446 | 0.0 | 0.211 | 0.343 | 0.0 | OG0004544.fasta.nex.treesample.ale | PPAT_Pp3c24_11620V3.3_Pp3c24_11620__PPAT_Pp3c8_17600V3.1_Pp3c8_17600 |
13 | 0.062 | 0.0 | 0.779 | 0.159 | 0.0 | OG0004544.fasta.nex.treesample.ale | PPAT_Pp3c11_14610V3.1_Pp3c11_14610__PPAT_Pp3c24_11620V3.3_Pp3c24_11620 |
14 | 0.437 | 0.001 | 0.199 | 0.361 | 0.002 | OG0004548.fasta.nex.treesample.ale | PPAT_Pp3c12_12390V3.2_Pp3c12_12390__PPAT_Pp3c8_2050V3.1_Pp3c8_2050 |
15 | 0.437 | 0.001 | 0.199 | 0.361 | 0.002 | OG0004548.fasta.nex.treesample.ale | PPAT_Pp3c12_12390V3.2_Pp3c12_12390__PPAT_Pp3c3_35300V3.2_Pp3c3_35300 |
16 | 0.053 | 0.0 | 0.778 | 0.169 | 0.0 | OG0004548.fasta.nex.treesample.ale | PPAT_Pp3c3_35300V3.2_Pp3c3_35300__PPAT_Pp3c8_2050V3.1_Pp3c8_2050 |
17 | 0.224 | 0.001 | 0.394 | 0.381 | 0.0 | OG0004585.fasta.nex.treesample.ale | PPAT_Pp3c15_24010V3.1_Pp3c15_24010__PPAT_Pp3c9_24230V3.2_Pp3c9_24230 |
The column names are <branch id>_<event>
, and the entries are the posterior probability that the gene pair is reconciled to the relevant branch + event combination.
Here we have for each WGD event in the tree all gene pairs that have non-zero posterior probability (as measured by the frequency in the posterior sample) to be reconciled to the relevant WGD event.
A table summarizing events for each branch can be obtained as well
smry = Whale.summarize(trees)
smry.sum
19 rows × 26 columns
node | clade | loss_mean | wgd_mean | wgdloss_mean | duplication_mean | sploss_mean | speciation_mean | loss_std | wgd_std | wgdloss_std | duplication_std | sploss_std | speciation_std | loss_q1 | wgd_q1 | wgdloss_q1 | duplication_q1 | sploss_q1 | speciation_q1 | loss_q2 | wgd_q2 | wgdloss_q2 | duplication_q2 | sploss_q2 | speciation_q2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
UInt16 | String | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 1 | MPOL | 0.251083 | 0.0 | 0.0 | 0.411583 | 0.0 | 1.41667 | 0.437653 | 0.0 | 0.0 | 0.635098 | 0.0 | 0.953794 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 2.0 | 0.0 | 4.0 |
2 | 2 | PPAT | 0.0 | 0.0 | 0.0 | 0.345083 | 0.0 | 1.66667 | 0.0 | 0.0 | 0.0 | 0.700596 | 0.0 | 1.31233 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 5.0 |
3 | 3 | SMOE | 0.4245 | 0.0 | 0.0 | 0.1665 | 0.0 | 1.08333 | 0.702946 | 0.0 | 0.0 | 0.552369 | 0.0 | 0.759203 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 2.0 | 0.0 | 3.0 |
4 | 4 | GBIL | 0.244167 | 0.0 | 0.0 | 0.0885 | 0.0 | 1.33333 | 0.433454 | 0.0 | 0.0 | 0.284021 | 0.0 | 0.62361 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 |
5 | 5 | PABI | 0.0531667 | 0.0 | 0.0 | 0.6475 | 0.0 | 2.08333 | 0.224737 | 0.0 | 0.0 | 0.690708 | 0.0 | 0.862007 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 2.0 | 0.0 | 3.0 |
6 | 6 | OSAT | 0.354417 | 0.0 | 0.0 | 0.498417 | 0.0 | 1.58333 | 0.513133 | 0.0 | 0.0 | 0.499997 | 0.0 | 0.493007 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 |
7 | 7 | ATHA | 0.335917 | 0.0 | 0.0 | 0.324333 | 0.0 | 1.5 | 0.643229 | 0.0 | 0.0 | 0.468125 | 0.0 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 |
8 | 8 | CPAP | 0.343417 | 0.0 | 0.0 | 0.0818333 | 0.0 | 1.25 | 0.493607 | 0.0 | 0.0 | 0.305183 | 0.0 | 0.924211 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 4.0 |
9 | 9 | ATRI | 0.51725 | 0.0 | 0.0 | 0.16625 | 0.0 | 1.08333 | 0.664607 | 0.0 | 0.0 | 0.551765 | 0.0 | 0.640095 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 2.0 | 0.0 | 3.0 |
10 | 10 | MPOL,PPAT | 0.0849167 | 0.0 | 0.0 | 0.0181667 | 0.56925 | 0.686917 | 0.337203 | 0.0 | 0.0 | 0.135413 | 0.862673 | 0.463748 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 3.0 | 1.0 |
11 | 11 | GBIL,PABI | 0.356167 | 0.0 | 0.0 | 0.127 | 0.297333 | 1.19167 | 0.630855 | 0.0 | 0.0 | 0.336706 | 0.501424 | 0.445642 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 |
12 | 12 | ATHA,CPAP | 0.0918333 | 0.0 | 0.0 | 0.164083 | 0.679333 | 0.83225 | 0.291662 | 0.0 | 0.0 | 0.396434 | 0.883368 | 0.373644 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 2.0 | 1.0 |
13 | 13 | OSAT,ATHA | 0.09025 | 0.0 | 0.0 | 0.09525 | 0.44625 | 0.993083 | 0.28683 | 0.0 | 0.0 | 0.307155 | 0.683821 | 0.416176 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 2.0 | 2.0 |
14 | 14 | OSAT,ATRI | 0.0 | 0.0 | 0.0 | 0.0278333 | 0.6075 | 0.826833 | 0.0 | 0.0 | 0.0 | 0.165001 | 0.889538 | 0.378391 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 3.0 | 1.0 |
15 | 15 | GBIL,OSAT | 0.0719167 | 0.0 | 0.0 | 0.44875 | 0.721333 | 0.996833 | 0.259637 | 0.0 | 0.0 | 0.689957 | 0.940042 | 0.411692 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 3.0 | 3.0 | 2.0 |
16 | 16 | SMOE,GBIL | 0.0373333 | 0.0 | 0.0 | 0.05575 | 0.496417 | 0.844917 | 0.199431 | 0.0 | 0.0 | 0.249283 | 0.808592 | 0.451884 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 3.0 | 2.0 |
17 | 17 | MPOL,SMOE | 0.0 | 0.0 | 0.0 | 0.322917 | 0.12225 | 1.20067 | 0.0 | 0.0 | 0.0 | 0.62006 | 0.421076 | 0.49689 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | 2.0 | 3.0 |
18 | 18 | wgd_1 | 0.318167 | 0.193083 | 0.935417 | 0.1905 | 0.0 | 0.0 | 0.567688 | 0.439472 | 0.631463 | 0.483608 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 1.0 | 3.0 | 2.0 | 0.0 | 0.0 |
19 | 19 | wgd_2 | 0.365167 | 0.02975 | 1.347 | 0.02375 | 0.0 | 0.0 | 0.485956 | 0.169897 | 0.671757 | 0.154982 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 3.0 | 0.0 | 0.0 | 0.0 |
here we have the expected number of duplications, losses, etc. per family for the different branches in the species tree (the row associated with a node corresponds to the branch leading to that node)
Maximum-likelihood estimation
using Optim
result = optimize(constantrates(w, ccd), MLE())
ModeResult with maximized lp of -327.04
5-element Named Vector{Float64}
A │
────┼────────────
:λ │ 0.126987
:μ │ 0.150525
:η │ 0.796657
:q1 │ 0.201446
:q2 │ 4.22784e-19
One could now compute the likelihood using the model without WGD 1 and/or 2 and compare the likelihoods using a likelihood ratio test as in Rabier et al. (2014) to assess whether the data is compatible with the hypothesis $q = 0$ (which should represent absence of a WGD).
Using a branch-specific rates model
Now we will consider a model with branch-specific duplication and loss rates, using a more complicated hierarchical model with an uncorrelated relaxed clock model. We'll use the same tree as above. The relevant model now is the DLWGD model:
θ = DLWGD(λ=zeros(n), μ=zeros(n), q=rand(2), η=rand())
w = WhaleModel(θ, t, 0.1)
ccd = read_ale(data, w)
12-element Vector{CCD{UInt16, Float64}}:
CCD{UInt16,Float64}(Γ=83, 𝓛=13)
CCD{UInt16,Float64}(Γ=55, 𝓛=13)
CCD{UInt16,Float64}(Γ=89, 𝓛=13)
CCD{UInt16,Float64}(Γ=131, 𝓛=13)
CCD{UInt16,Float64}(Γ=107, 𝓛=13)
CCD{UInt16,Float64}(Γ=59, 𝓛=13)
CCD{UInt16,Float64}(Γ=53, 𝓛=13)
CCD{UInt16,Float64}(Γ=83, 𝓛=13)
CCD{UInt16,Float64}(Γ=59, 𝓛=13)
CCD{UInt16,Float64}(Γ=95, 𝓛=13)
CCD{UInt16,Float64}(Γ=67, 𝓛=13)
CCD{UInt16,Float64}(Γ=65, 𝓛=13)
Note that the duplication and loss rates should here be specified on a log-scale for the DLWGD model. We assume a Normal prior for the mean duplication and loss rates, and assume the log-scale branch-specific rates to be distributed according to a multivariate normal with diagonal covariance matrix $\tau I$. We assume duplication and loss rates to be independent.
@model branchrates(model, n, ccd, ::Type{T}=Float64) where T = begin
η ~ Beta(3,1)
λ̄ ~ Normal(log(0.15), 2)
μ̄ ~ Normal(log(0.15), 2)
τ ~ Exponential(0.1)
λ ~ MvNormal(fill(λ̄, n-1), τ)
μ ~ MvNormal(fill(μ̄, n-1), τ)
q1 ~ Beta()
q2 ~ Beta()
ccd ~ model((λ=λ, μ=μ, η=η, q=[q1, q2]))
end
branchrates (generic function with 3 methods)
... and sample (this takes a bit longer!)
chain1 = sample(branchrates(w, n, ccd), NUTS(), 200)
Chains MCMC chain (200×50×1 Array{Float64, 3}):
Iterations = 101:1:300
Number of chains = 1
Samples per chain = 200
Wall duration = 61.85 seconds
Compute duration = 61.85 seconds
parameters = η, λ̄, μ̄, τ, λ[1], λ[2], λ[3], λ[4], λ[5], λ[6], λ[7], λ[8], λ[9], λ[10], λ[11], λ[12], λ[13], λ[14], λ[15], λ[16], μ[1], μ[2], μ[3], μ[4], μ[5], μ[6], μ[7], μ[8], μ[9], μ[10], μ[11], μ[12], μ[13], μ[14], μ[15], μ[16], q1, q2
internals = lp, n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size
Summary Statistics
parameters mean std naive_se mcse ess rhat es ⋯
Symbol Float64 Float64 Float64 Float64 Float64 Float64 ⋯
η 0.7624 0.0088 0.0006 0.0022 5.0093 1.1250 ⋯
λ̄ -2.4873 0.0326 0.0023 0.0054 17.8020 1.0396 ⋯
μ̄ -1.8039 0.0282 0.0020 0.0062 2.2817 1.4026 ⋯
τ 0.0546 0.0044 0.0003 0.0012 1.3286 2.1096 ⋯
λ[1] -2.4700 0.0681 0.0048 0.0172 6.8383 1.0380 ⋯
λ[2] -2.4520 0.0482 0.0034 0.0118 5.9925 1.1298 ⋯
λ[3] -2.4764 0.0373 0.0026 0.0086 11.7615 1.0227 ⋯
λ[4] -2.5010 0.0495 0.0035 0.0111 2.1071 1.4507 ⋯
λ[5] -2.5082 0.0521 0.0037 0.0131 8.7465 1.0463 ⋯
λ[6] -2.4845 0.0338 0.0024 0.0083 7.9660 1.1339 ⋯
λ[7] -2.4415 0.0458 0.0032 0.0112 1.5246 1.7847 ⋯
λ[8] -2.4669 0.0468 0.0033 0.0098 13.3540 1.1634 ⋯
λ[9] -2.5028 0.0510 0.0036 0.0119 3.8259 1.2367 ⋯
λ[10] -2.4905 0.0542 0.0038 0.0139 10.3745 1.0277 ⋯
λ[11] -2.4677 0.0422 0.0030 0.0101 8.8714 1.0836 ⋯
λ[12] -2.4920 0.0377 0.0027 0.0092 10.6115 1.0324 ⋯
λ[13] -2.4785 0.0639 0.0045 0.0167 6.4747 1.0838 ⋯
⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
1 column and 21 rows omitted
Quantiles
parameters 2.5% 25.0% 50.0% 75.0% 97.5%
Symbol Float64 Float64 Float64 Float64 Float64
η 0.7474 0.7575 0.7638 0.7678 0.7753
λ̄ -2.5367 -2.5124 -2.4845 -2.4731 -2.4111
μ̄ -1.8448 -1.8231 -1.8135 -1.7888 -1.7339
τ 0.0486 0.0509 0.0528 0.0577 0.0635
λ[1] -2.5987 -2.5301 -2.4645 -2.4167 -2.3462
λ[2] -2.5379 -2.5034 -2.4298 -2.4147 -2.3750
λ[3] -2.5539 -2.5046 -2.4742 -2.4484 -2.4117
λ[4] -2.5791 -2.5355 -2.5126 -2.4698 -2.3962
λ[5] -2.6020 -2.5547 -2.5175 -2.4661 -2.4251
λ[6] -2.5501 -2.5055 -2.4825 -2.4619 -2.4106
λ[7] -2.5260 -2.4640 -2.4429 -2.4029 -2.3648
λ[8] -2.5320 -2.5088 -2.4776 -2.4367 -2.3612
λ[9] -2.5959 -2.5361 -2.5146 -2.4588 -2.4032
λ[10] -2.5782 -2.5141 -2.4917 -2.4503 -2.3990
λ[11] -2.5420 -2.5039 -2.4590 -2.4472 -2.4020
λ[12] -2.5637 -2.5133 -2.4826 -2.4724 -2.4323
λ[13] -2.6138 -2.5140 -2.4743 -2.4406 -2.3621
⋮ ⋮ ⋮ ⋮ ⋮ ⋮
21 rows omitted
Make a plot for the retention parameters
plot(chain1[[:q1,:q2]], size=(700,300))
Posterior predictive simulations
We can do posterior predictive simulations to assess model fit. There are of course many possible posterior predictive observables that we may employ to do so. The approach below is but one that is (partially) implemented. Here we compare simulated number of events for each branch with reconstructed number of events for each branch. That is, for $N$ samples from the posterior, we (1) simulate a data set of the size of our empirical data set and (2) sample a reconciled tree for each gene family. We then compare, for instance, the number of duplications on the branch leading to node $m$ in the two simulated sets. If the model fits these should be similar.
We need a function to get a parameterized model from a chain iterate:
function mfun(M, x)
q1 = get(x, :q1).q1[1]
q2 = get(x, :q2).q2[1]
λ = vec(vcat(get(x, :λ).λ...))
μ = vec(vcat(get(x, :μ).μ...))
η = get(x, :η).η[1]
M((λ=λ, μ=μ, q=[q1,q2], η=η))
end
mfun (generic function with 1 method)
The following will then do 100 posterior predictive simulations
pps = Whale.ppsims(chain1, mfun, w, ccd, 100);
and we make a plot for duplication events
lab = "duplication"
ps = map(w.order) do mnode
l = "$(id(mnode))_$lab"
dots = map(x->(x[1][l], x[2][l]), pps)
xmn, xmx = extrema(vcat(first.(dots), first.(dots)))
scatter(dots, color=:black, ms=2, alpha=0.5, legend=false,
xlabel="\$y\$", ylabel="\$\\tilde{y}\$",
title=Whale.cladelabel(mnode),
xlim=(xmn-0.5,xmx+0.5), xticks=xmn:1:xmx)
plot!(x->x, color=:lightgray)
end
plot(ps..., size=(700,600))
one can of course also compare loss events, speciation events, WGD-derived duplicates etc.
The above is somewhat more informative when analyzing more data. Using this approach on chain0
and comparing against the results displayed here, one could for instance check whether the constant rates assumption is strongly violated or not. If the dots in these plots are not systematically above or below the 1-1 line, at least this aspect of the data is explained well by the model.
I know that this, and the sampling methods for reconciled trees, should be implemented in a somewhat more consistent and user-friendly style. It's on the to-do list.
This page was generated using Literate.jl.