Bayesian inference for the DLWGD model

Note

Here it is assumed the reader is already familiar with the material outlined in the Tutorial.

using Whale, NewickTree, Distributions, Turing, DataFrames, LinearAlgebra
using Plots, StatsPlots
using Random; Random.seed!(7137);

plotting defaults

default(grid=false, size=(500,800), titlefontsize=9, title_loc=:left, guidefont=8)

Using a constant-rates model

First we will do inference for a simple constant-rates model (i.e. assuming a single duplication and loss rate for the entire species tree). First we load the species tree (using the example tree available in the Whale library)

t = deepcopy(Whale.extree)
n = length(postwalk(t))  # total number of nodes in the tree
17
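As a quick sanity check on this count (an illustrative aside, not part of the Whale API): a rooted binary tree with L leaves has 2L − 1 nodes in total, and each WGD insertion below adds one extra (degree-two) node.

```julia
# A rooted binary species tree with L leaves has L - 1 internal nodes,
# i.e. 2L - 1 nodes in total.
nleaves = 9                    # MPOL, PPAT, SMOE, GBIL, PABI, OSAT, ATHA, CPAP, ATRI
ntotal = 2 * nleaves - 1       # 17, matching length(postwalk(t))
# each insertnode! call below adds a single degree-two WGD node
nwithwgds = ntotal + 2         # 19, matching the "19 nodes" in the model summary
```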

Now we add two WGD nodes to the tree. We do this by specifying the last common ancestor node for the lineages that share the hypothetical WGD. By default, the added node is halfway between the specified node and its parent.

insertnode!(getlca(t, "PPAT", "PPAT"), name="wgd_1")
insertnode!(getlca(t, "ATHA", "ATRI"), name="wgd_2")
(((OSAT:1.555,(ATHA:0.5548,CPAP:0.5548):1.0002):0.738,ATRI:2.293):0.6125):0.6125;

and we obtain a reference model object, using the constant-rates model with two WGDs

θ = ConstantDLWGD(λ=0.1, μ=0.2, q=[0.2, 0.1], η=0.9)
w = WhaleModel(θ, t, .1, minn=10, maxn=20)
WhaleModel
——————————
⋅Parameterization:
ConstantDLWGD{Float64, Float64}
  λ: Float64 0.1
  μ: Float64 0.2
  q: Array{Float64}((2,)) [0.2, 0.1]
  p: Array{Float64}((0,)) Float64[]
  η: Float64 0.9

⋅Condition:
Whale.RootCondition

⋅Model structure:
19 nodes (9 leaves, 2 WGD nodes)
node_id,wgd_id,distance,Δt,n,subtree
1,0,4.752,0.2376,20,"MPOL;"
2,0,2.376,0.1188,20,"PPAT;"
3,0,4.457,0.2228,20,"SMOE;"
4,0,3.178,0.1589,20,"GBIL;"
5,0,3.178,0.1589,20,"PABI;"
6,0,1.555,0.0972,16,"OSAT;"
7,0,0.5548,0.0555,10,"ATHA;"
8,0,0.5548,0.0555,10,"CPAP;"
9,0,2.293,0.1146,20,"ATRI;"
18,1,2.376,0.1188,20,"(PPAT);"
10,0,0.292,0.0292,10,"(MPOL,(PPAT));"
11,0,0.34,0.034,10,"(GBIL,PABI);"
12,0,1.0002,0.0909,11,"(ATHA,CPAP);"
13,0,0.738,0.0738,10,"(OSAT,(ATHA,CPAP));"
14,0,0.6125,0.0613,10,"((OSAT,(ATHA,CPAP)),ATRI);"
19,2,0.6125,0.0613,10,"(((OSAT,(ATHA,CPAP)),ATRI));"
15,0,0.939,0.0939,10,"((GBIL,PABI),(((OSAT,(ATHA,CPAP)),ATRI)));"
16,0,0.587,0.0587,10,"(SMOE,((GBIL,PABI),(((OSAT,(ATHA,CPAP)),ATRI))));"
17,0,0.0,0.0,0,"((MPOL,(PPAT)),(SMOE,((GBIL,PABI),(((OSAT,(ATHA,CPAP)),ATRI)))));"

next we get the data (we need a model object for that)

data = joinpath(@__DIR__, "../../example/example-1/ale")
ccd = read_ale(data, w)
12-element Vector{CCD{UInt16, Float64}}:
 CCD{UInt16,Float64}(Γ=83, 𝓛=13)
 CCD{UInt16,Float64}(Γ=55, 𝓛=13)
 CCD{UInt16,Float64}(Γ=89, 𝓛=13)
 CCD{UInt16,Float64}(Γ=131, 𝓛=13)
 CCD{UInt16,Float64}(Γ=107, 𝓛=13)
 CCD{UInt16,Float64}(Γ=59, 𝓛=13)
 CCD{UInt16,Float64}(Γ=53, 𝓛=13)
 CCD{UInt16,Float64}(Γ=83, 𝓛=13)
 CCD{UInt16,Float64}(Γ=59, 𝓛=13)
 CCD{UInt16,Float64}(Γ=95, 𝓛=13)
 CCD{UInt16,Float64}(Γ=67, 𝓛=13)
 CCD{UInt16,Float64}(Γ=65, 𝓛=13)

Now we define the Turing model

@model constantrates(model, ccd) = begin
    λ  ~ Exponential()
    μ  ~ Exponential()
    η  ~ Beta(3,1)
    q1 ~ Beta()
    q2 ~ Beta()
    ccd ~ model((λ=λ, μ=μ, η=η, q=[q1, q2]))
end
constantrates (generic function with 2 methods)

In this model we have, line by line:

  • λ and μ: the duplication and loss rate, for which we assume Exponential priors.
  • η: the parameter of the geometric prior distribution on the number of genes at the root (i.e. the Whale likelihood is integrated over a geometric prior for the number of ancestral genes)
  • q1: the retention rate for the first WGD (with a uniform prior on [0,1], i.e. Beta(1,1))
  • q2: as q1
  • ccd ~ model(...): here we specify the likelihood; we assume the data (ccd) are iid draws from the duplication+loss+WGD model with the relevant parameters.
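To make the root prior concrete, here is a small base-Julia sketch of a shifted geometric distribution on the number of ancestral lineages; this assumes the parameterization with support on {1, 2, …}, which should be checked against the Whale documentation:

```julia
# shifted geometric pmf: P(N = k) = η (1 - η)^(k - 1) for k = 1, 2, ...
geompmf(k::Int, η::Float64) = η * (1 - η)^(k - 1)

η = 0.9
geompmf(1, η)                      # 0.9 -> most prior mass on a single ancestral gene
geompmf(2, η)                      # 0.09
sum(geompmf(k, η) for k in 1:50)   # ≈ 1, the pmf sums to one
```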

chain0 = sample(constantrates(w, ccd), NUTS(), 200)
Chains MCMC chain (200×17×1 Array{Float64, 3}):

Iterations        = 101:1:300
Number of chains  = 1
Samples per chain = 200
Wall duration     = 28.86 seconds
Compute duration  = 28.86 seconds
parameters        = λ, μ, η, q1, q2
internals         = lp, n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size

Summary Statistics
  parameters      mean       std   naive_se      mcse        ess      rhat   e ⋯
      Symbol   Float64   Float64    Float64   Float64    Float64   Float64     ⋯

           λ    0.1275    0.0182     0.0013    0.0013   205.3097    0.9956     ⋯
           μ    0.1564    0.0268     0.0019    0.0011   197.3678    0.9971     ⋯
           η    0.7905    0.1097     0.0078    0.0088   161.5741    1.0020     ⋯
          q1    0.3450    0.2316     0.0164    0.0173   117.0962    0.9955     ⋯
          q2    0.0871    0.0838     0.0059    0.0077    61.8036    0.9981     ⋯
                                                                1 column omitted

Quantiles
  parameters      2.5%     25.0%     50.0%     75.0%     97.5%
      Symbol   Float64   Float64   Float64   Float64   Float64

           λ    0.0944    0.1161    0.1276    0.1388    0.1628
           μ    0.1031    0.1367    0.1564    0.1755    0.2031
           η    0.5614    0.7004    0.8128    0.8771    0.9449
          q1    0.0053    0.1575    0.3225    0.5135    0.8697
          q2    0.0018    0.0288    0.0664    0.1103    0.3253

Making some trace plots is straightforward using tools from the Turing probabilistic programming ecosystem

plot(chain0, size=(700,900))

We can compute Bayes factors for the WGD hypotheses

summarize(chain0[[:q1,:q2]], Whale.bayesfactor)

  parameters   bayesfactor
      Symbol        String

          q1           0.5
          q2           0.9

This is the log10 Bayes factor in favor of the $q = 0$ model. A log10 Bayes factor smaller than -2 can be considered strong evidence in favor of the $q \ne 0$ model over the $q = 0$ model. This in itself need not say much, however, as it says nothing about how well the model actually fits the data.
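The estimator used by Whale.bayesfactor is not spelled out here, but a Savage–Dickey-style estimate can be sketched in a few lines of base Julia. This is purely an illustration, assuming a Beta(1,1) prior (whose density at q = 0 equals 1) and a crude histogram-bin density estimate; the actual Whale implementation may differ:

```julia
# crude Savage-Dickey density ratio at q = 0:
# log10 BF(q = 0 vs q ≠ 0) ≈ log10(posterior density at 0 / prior density at 0),
# with the posterior density at 0 estimated by the fraction of samples in [0, ϵ).
function log10bf_sd(qs; ϵ=0.05)
    post0 = count(q -> q < ϵ, qs) / (length(qs) * ϵ)
    log10(post0 / 1.0)   # Beta(1,1) prior density at 0 is 1
end

qs = vcat(fill(0.01, 10), fill(0.5, 90))   # toy posterior sample
log10bf_sd(qs)                             # ≈ 0.30 for this toy sample
```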

Warning

Of course the chain should be run longer than in this example! Here a short chain is presented to ensure reasonable build times for this documentation. Generally, one should strive for ESS values exceeding at least 100, although short chains can be useful for exploring and testing different models.

Now let's obtain reconciled trees

posterior = DataFrame(chain0)
ffun = (m, x)->m((λ=x[:λ], μ=x[:μ], η=x[:η], q=[x[:q1], x[:q2]]))
tt = TreeTracker(w, ccd, posterior, ffun)
trees = track(tt, 1000)
12-element Vector{Whale.RecSummary}:
 RecSummary(# unique trees = 96)
 RecSummary(# unique trees = 28)
 RecSummary(# unique trees = 77)
 RecSummary(# unique trees = 210)
 RecSummary(# unique trees = 67)
 RecSummary(# unique trees = 64)
 RecSummary(# unique trees = 58)
 RecSummary(# unique trees = 77)
 RecSummary(# unique trees = 43)
 RecSummary(# unique trees = 49)
 RecSummary(# unique trees = 84)
 RecSummary(# unique trees = 97)

Note that ffun is a function that takes the model object m and a row x from the posterior data frame, returning a model parameterized by the posterior sample in that row.

Let's have a look at the first family

trees[1].trees
96-element Vector{NamedTuple}:
 (freq = 0.451, tree = Node(1, σ=17, t=0.0, speciation))
 (freq = 0.088, tree = Node(1, σ=17, t=0.0, speciation))
 (freq = 0.059, tree = Node(1, σ=17, t=0.0, duplication))
 (freq = 0.041, tree = Node(1, σ=17, t=0.0, speciation))
 (freq = 0.034, tree = Node(1, σ=17, t=0.0, speciation))
 (freq = 0.029, tree = Node(1, σ=17, t=0.0, speciation))
 (freq = 0.026, tree = Node(1, σ=17, t=0.0, speciation))
 (freq = 0.026, tree = Node(1, σ=17, t=0.0, speciation))
 (freq = 0.022, tree = Node(1, σ=17, t=0.0, speciation))
 (freq = 0.016, tree = Node(1, σ=17, t=0.0, duplication))
 ⋮
 (freq = 0.001, tree = Node(1, σ=17, t=0.0, duplication))
 (freq = 0.001, tree = Node(1, σ=17, t=0.0, duplication))
 (freq = 0.001, tree = Node(1, σ=17, t=0.0, speciation))
 (freq = 0.001, tree = Node(1, σ=17, t=0.0, duplication))
 (freq = 0.001, tree = Node(1, σ=17, t=0.0, duplication))
 (freq = 0.001, tree = Node(1, σ=17, t=0.0, speciation))
 (freq = 0.001, tree = Node(1, σ=17, t=0.0, speciation))
 (freq = 0.001, tree = Node(1, σ=17, t=0.0, speciation))
 (freq = 0.001, tree = Node(1, σ=17, t=0.0, speciation))

We can write these to a file using Whale.writetrees("filename.nw", trees[1].trees) if we want. The maximum a posteriori tree for this family is

map1 = trees[1].trees[1]
nwstr(map1.tree)
"(((PPAT_Pp3c9_4950V3.1_Pp3c9_4950:2.3760000000000017)1.0:2.375999999999999,MPOL_Mapoly0036s0119.1_Mapoly0036s0119:4.752000000000001)0.846:0.2919999999999998,(SMOE_SMO111G0185.1_SMO111G0185:4.457000000000001,(((PABI_PAB00009793.1_PAB00009793:3.178000000000001,loss_1172007" ⋯ 731 bytes ⋯ "0.601:0.40917272727272735)0.652:0.5910272727272726,loss_4985017985652007989:1.5550000000000002)0.922:0.4796999999999998)0.964:0.2583000000000002)0.989:0.6125000000000003)0.991:0.6125000000000003)0.988:0.70425)0.952:0.23475000000000001)0.852:0.5869999999999997)0.841:0.0;"

with posterior probability

map1.freq
0.451

Or maybe all the gene pairs

ps = Whale.getpairs(trees, w);

Now let's look at the gene pairs which have a non-zero posterior probability of being derived from WGD node 18 (the P. patens WGD, execute @show w to check the model structure)

p = filter(x->x[Symbol("18_wgd")] > 0.0, ps)[!,:pair]
17-element Vector{String}:
 "PPAT_Pp3c18_4330V3.2_Pp3c18_4330__PPAT_Pp3c25_11450V3.1_Pp3c25_11450"
 "PPAT_Pp3c22_17550V3.2_Pp3c22_17550__PPAT_Pp3c25_11450V3.1_Pp3c25_11450"
 "PPAT_Pp3c18_4330V3.2_Pp3c18_4330__PPAT_Pp3c22_17550V3.2_Pp3c22_17550"
 "PPAT_Pp3c21_20710V3.3_Pp3c21_20710__PPAT_Pp3c25_11450V3.1_Pp3c25_11450"
 "PPAT_Pp3c18_4330V3.2_Pp3c18_4330__PPAT_Pp3c21_20710V3.3_Pp3c21_20710"
 "PPAT_Pp3c21_20710V3.3_Pp3c21_20710__PPAT_Pp3c22_17550V3.2_Pp3c22_17550"
 "PPAT_Pp3c25_11450V3.1_Pp3c25_11450__PPAT_Pp3c6_2690V3.3_Pp3c6_2690"
 "PPAT_Pp3c18_4330V3.2_Pp3c18_4330__PPAT_Pp3c6_2690V3.3_Pp3c6_2690"
 "PPAT_Pp3c22_17550V3.2_Pp3c22_17550__PPAT_Pp3c6_2690V3.3_Pp3c6_2690"
 "PPAT_Pp3c21_20710V3.3_Pp3c21_20710__PPAT_Pp3c6_2690V3.3_Pp3c6_2690"
 "PPAT_Pp3c11_14610V3.1_Pp3c11_14610__PPAT_Pp3c8_17600V3.1_Pp3c8_17600"
 "PPAT_Pp3c24_11620V3.3_Pp3c24_11620__PPAT_Pp3c8_17600V3.1_Pp3c8_17600"
 "PPAT_Pp3c11_14610V3.1_Pp3c11_14610__PPAT_Pp3c24_11620V3.3_Pp3c24_11620"
 "PPAT_Pp3c12_12390V3.2_Pp3c12_12390__PPAT_Pp3c8_2050V3.1_Pp3c8_2050"
 "PPAT_Pp3c12_12390V3.2_Pp3c12_12390__PPAT_Pp3c3_35300V3.2_Pp3c3_35300"
 "PPAT_Pp3c3_35300V3.2_Pp3c3_35300__PPAT_Pp3c8_2050V3.1_Pp3c8_2050"
 "PPAT_Pp3c15_24010V3.1_Pp3c15_24010__PPAT_Pp3c9_24230V3.2_Pp3c9_24230"

Now we can look at the complete posterior reconciliation distribution for these gene pairs

df18 = filter(x->x[:pair] ∈ p, ps)
df18[:,[!(all(col .== 0)) for col in eachcol(df18)]]  # filter out all zero columns...

17 rows × 7 columns

18_duplication,10_duplication,2_duplication,18_wgd,17_duplication,family,pair
0.275,0.0,0.363,0.36,0.002,OG0004533.fasta.nex.treesample.ale,PPAT_Pp3c18_4330V3.2_Pp3c18_4330__PPAT_Pp3c25_11450V3.1_Pp3c25_11450
0.257,0.0,0.402,0.339,0.002,OG0004533.fasta.nex.treesample.ale,PPAT_Pp3c22_17550V3.2_Pp3c22_17550__PPAT_Pp3c25_11450V3.1_Pp3c25_11450
0.069,0.0,0.801,0.13,0.0,OG0004533.fasta.nex.treesample.ale,PPAT_Pp3c18_4330V3.2_Pp3c18_4330__PPAT_Pp3c22_17550V3.2_Pp3c22_17550
0.618,0.112,0.039,0.099,0.132,OG0004533.fasta.nex.treesample.ale,PPAT_Pp3c21_20710V3.3_Pp3c21_20710__PPAT_Pp3c25_11450V3.1_Pp3c25_11450
0.618,0.112,0.039,0.099,0.132,OG0004533.fasta.nex.treesample.ale,PPAT_Pp3c18_4330V3.2_Pp3c18_4330__PPAT_Pp3c21_20710V3.3_Pp3c21_20710
0.618,0.112,0.039,0.099,0.132,OG0004533.fasta.nex.treesample.ale,PPAT_Pp3c21_20710V3.3_Pp3c21_20710__PPAT_Pp3c22_17550V3.2_Pp3c22_17550
0.618,0.112,0.039,0.099,0.132,OG0004533.fasta.nex.treesample.ale,PPAT_Pp3c25_11450V3.1_Pp3c25_11450__PPAT_Pp3c6_2690V3.3_Pp3c6_2690
0.618,0.112,0.039,0.099,0.132,OG0004533.fasta.nex.treesample.ale,PPAT_Pp3c18_4330V3.2_Pp3c18_4330__PPAT_Pp3c6_2690V3.3_Pp3c6_2690
0.618,0.112,0.039,0.099,0.132,OG0004533.fasta.nex.treesample.ale,PPAT_Pp3c22_17550V3.2_Pp3c22_17550__PPAT_Pp3c6_2690V3.3_Pp3c6_2690
0.13,0.001,0.523,0.341,0.005,OG0004533.fasta.nex.treesample.ale,PPAT_Pp3c21_20710V3.3_Pp3c21_20710__PPAT_Pp3c6_2690V3.3_Pp3c6_2690
0.446,0.0,0.211,0.343,0.0,OG0004544.fasta.nex.treesample.ale,PPAT_Pp3c11_14610V3.1_Pp3c11_14610__PPAT_Pp3c8_17600V3.1_Pp3c8_17600
0.446,0.0,0.211,0.343,0.0,OG0004544.fasta.nex.treesample.ale,PPAT_Pp3c24_11620V3.3_Pp3c24_11620__PPAT_Pp3c8_17600V3.1_Pp3c8_17600
0.062,0.0,0.779,0.159,0.0,OG0004544.fasta.nex.treesample.ale,PPAT_Pp3c11_14610V3.1_Pp3c11_14610__PPAT_Pp3c24_11620V3.3_Pp3c24_11620
0.437,0.001,0.199,0.361,0.002,OG0004548.fasta.nex.treesample.ale,PPAT_Pp3c12_12390V3.2_Pp3c12_12390__PPAT_Pp3c8_2050V3.1_Pp3c8_2050
0.437,0.001,0.199,0.361,0.002,OG0004548.fasta.nex.treesample.ale,PPAT_Pp3c12_12390V3.2_Pp3c12_12390__PPAT_Pp3c3_35300V3.2_Pp3c3_35300
0.053,0.0,0.778,0.169,0.0,OG0004548.fasta.nex.treesample.ale,PPAT_Pp3c3_35300V3.2_Pp3c3_35300__PPAT_Pp3c8_2050V3.1_Pp3c8_2050
0.224,0.001,0.394,0.381,0.0,OG0004585.fasta.nex.treesample.ale,PPAT_Pp3c15_24010V3.1_Pp3c15_24010__PPAT_Pp3c9_24230V3.2_Pp3c9_24230

The column names are <branch id>_<event>, and the entries are the posterior probability that the gene pair is reconciled to the relevant branch + event combination.
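A small helper (hypothetical, not part of the Whale API) to split such a column name into its branch id and event type:

```julia
# split a "<branch id>_<event>" column name at the first underscore
function splitcol(name::AbstractString)
    i = findfirst('_', name)
    parse(Int, name[1:i-1]), name[i+1:end]
end

splitcol("18_wgd")           # (18, "wgd")
splitcol("17_duplication")   # (17, "duplication")
```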

Here we have for each WGD event in the tree all gene pairs that have non-zero posterior probability (as measured by the frequency in the posterior sample) to be reconciled to the relevant WGD event.

A table summarizing events for each branch can be obtained as well

smry = Whale.summarize(trees)
smry.sum

19 rows × 26 columns

node,clade,loss_mean,wgd_mean,wgdloss_mean,duplication_mean,sploss_mean,speciation_mean,loss_std,wgd_std,wgdloss_std,duplication_std,sploss_std,speciation_std,loss_q1,wgd_q1,wgdloss_q1,duplication_q1,sploss_q1,speciation_q1,loss_q2,wgd_q2,wgdloss_q2,duplication_q2,sploss_q2,speciation_q2
1,MPOL,0.251083,0.0,0.0,0.411583,0.0,1.41667,0.437653,0.0,0.0,0.635098,0.0,0.953794,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0,0.0,4.0
2,PPAT,0.0,0.0,0.0,0.345083,0.0,1.66667,0.0,0.0,0.0,0.700596,0.0,1.31233,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,5.0
3,SMOE,0.4245,0.0,0.0,0.1665,0.0,1.08333,0.702946,0.0,0.0,0.552369,0.0,0.759203,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,2.0,0.0,3.0
4,GBIL,0.244167,0.0,0.0,0.0885,0.0,1.33333,0.433454,0.0,0.0,0.284021,0.0,0.62361,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,3.0
5,PABI,0.0531667,0.0,0.0,0.6475,0.0,2.08333,0.224737,0.0,0.0,0.690708,0.0,0.862007,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,2.0,0.0,3.0
6,OSAT,0.354417,0.0,0.0,0.498417,0.0,1.58333,0.513133,0.0,0.0,0.499997,0.0,0.493007,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,2.0
7,ATHA,0.335917,0.0,0.0,0.324333,0.0,1.5,0.643229,0.0,0.0,0.468125,0.0,0.5,0.0,0.0,0.0,0.0,0.0,1.0,2.0,0.0,0.0,1.0,0.0,2.0
8,CPAP,0.343417,0.0,0.0,0.0818333,0.0,1.25,0.493607,0.0,0.0,0.305183,0.0,0.924211,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,4.0
9,ATRI,0.51725,0.0,0.0,0.16625,0.0,1.08333,0.664607,0.0,0.0,0.551765,0.0,0.640095,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,2.0,0.0,3.0
10,"MPOL,PPAT",0.0849167,0.0,0.0,0.0181667,0.56925,0.686917,0.337203,0.0,0.0,0.135413,0.862673,0.463748,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,3.0,1.0
11,"GBIL,PABI",0.356167,0.0,0.0,0.127,0.297333,1.19167,0.630855,0.0,0.0,0.336706,0.501424,0.445642,0.0,0.0,0.0,0.0,0.0,1.0,2.0,0.0,0.0,1.0,1.0,2.0
12,"ATHA,CPAP",0.0918333,0.0,0.0,0.164083,0.679333,0.83225,0.291662,0.0,0.0,0.396434,0.883368,0.373644,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,2.0,1.0
13,"OSAT,ATHA",0.09025,0.0,0.0,0.09525,0.44625,0.993083,0.28683,0.0,0.0,0.307155,0.683821,0.416176,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,2.0,2.0
14,"OSAT,ATRI",0.0,0.0,0.0,0.0278333,0.6075,0.826833,0.0,0.0,0.0,0.165001,0.889538,0.378391,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,3.0,1.0
15,"GBIL,OSAT",0.0719167,0.0,0.0,0.44875,0.721333,0.996833,0.259637,0.0,0.0,0.689957,0.940042,0.411692,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,3.0,3.0,2.0
16,"SMOE,GBIL",0.0373333,0.0,0.0,0.05575,0.496417,0.844917,0.199431,0.0,0.0,0.249283,0.808592,0.451884,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,3.0,2.0
17,"MPOL,SMOE",0.0,0.0,0.0,0.322917,0.12225,1.20067,0.0,0.0,0.0,0.62006,0.421076,0.49689,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,2.0,2.0,3.0
18,wgd_1,0.318167,0.193083,0.935417,0.1905,0.0,0.0,0.567688,0.439472,0.631463,0.483608,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,3.0,2.0,0.0,0.0
19,wgd_2,0.365167,0.02975,1.347,0.02375,0.0,0.0,0.485956,0.169897,0.671757,0.154982,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,3.0,0.0,0.0,0.0

here we have the expected number of duplications, losses, etc. per family for the different branches in the species tree (the row associated with a node corresponds to the branch leading to that node)

Maximum-likelihood estimation

using Optim
result = optimize(constantrates(w, ccd), MLE())
ModeResult with maximized lp of -327.04
5-element Named Vector{Float64}
A   │ 
────┼────────────
:λ  │    0.126987
:μ  │    0.150525
:η  │    0.796657
:q1 │    0.201446
:q2 │ 4.22784e-19

One could now compute the likelihood using the model without WGD 1 and/or 2 and compare the likelihoods using a likelihood ratio test as in Rabier et al. (2014) to assess whether the data is compatible with the hypothesis $q = 0$ (which should represent absence of a WGD).
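As a sketch of that comparison (the log-likelihood without the WGD, lnL0 below, is a made-up placeholder; also note that since $q = 0$ lies on the boundary of the parameter space, the null distribution is often taken as a 50:50 mixture of χ²₀ and χ²₁ rather than a plain χ²₁ — see Rabier et al. (2014) for the exact recipe):

```julia
lnL1 = -327.04     # maximized log-likelihood with both WGDs (from the fit above)
lnL0 = -330.50     # hypothetical value for the model without a WGD (placeholder!)
Λ = 2 * (lnL1 - lnL0)   # likelihood-ratio test statistic
crit = 3.84             # χ²₁ critical value at α = 0.05 (conservative at the boundary)
Λ > crit                # true here -> reject q = 0 at the 5% level
```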

Using a branch-specific rates model

Now we will consider a model with branch-specific duplication and loss rates, using a more complicated hierarchical model with an uncorrelated relaxed clock model. We'll use the same tree as above. The relevant model now is the DLWGD model:

θ = DLWGD(λ=zeros(n), μ=zeros(n), q=rand(2), η=rand())
w = WhaleModel(θ, t, 0.1)
ccd = read_ale(data, w)
12-element Vector{CCD{UInt16, Float64}}:
 CCD{UInt16,Float64}(Γ=83, 𝓛=13)
 CCD{UInt16,Float64}(Γ=55, 𝓛=13)
 CCD{UInt16,Float64}(Γ=89, 𝓛=13)
 CCD{UInt16,Float64}(Γ=131, 𝓛=13)
 CCD{UInt16,Float64}(Γ=107, 𝓛=13)
 CCD{UInt16,Float64}(Γ=59, 𝓛=13)
 CCD{UInt16,Float64}(Γ=53, 𝓛=13)
 CCD{UInt16,Float64}(Γ=83, 𝓛=13)
 CCD{UInt16,Float64}(Γ=59, 𝓛=13)
 CCD{UInt16,Float64}(Γ=95, 𝓛=13)
 CCD{UInt16,Float64}(Γ=67, 𝓛=13)
 CCD{UInt16,Float64}(Γ=65, 𝓛=13)

Note that for the DLWGD model the duplication and loss rates should be specified on a log scale. We assume a Normal prior for the mean log duplication and loss rates, and let the branch-specific log rates be distributed according to a multivariate normal centered on those means with a shared standard deviation τ (i.e. an isotropic covariance matrix τ²I). We assume duplication and loss rates to be independent.
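To get a feel for what this hierarchical prior implies, one can simulate from it directly. The sketch below uses only base Julia (Exponential with mean 0.1 sampled by inversion; recall that the scalar second argument of MvNormal is a standard deviation, so τ plays that role here):

```julia
using Random
rng = MersenneTwister(42)

λ̄ = log(0.15) + 2 * randn(rng)   # mean log duplication rate
τ = -0.1 * log(rand(rng))        # Exponential with mean 0.1, via inversion sampling
λ = λ̄ .+ τ .* randn(rng, 16)     # branch-specific log rates (n - 1 = 16 branches)
rates = exp.(λ)                  # back to the natural scale
extrema(rates)                   # the spread shrinks as τ -> 0
```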

@model branchrates(model, n, ccd, ::Type{T}=Float64) where T = begin
    η ~ Beta(3,1)
    λ̄ ~ Normal(log(0.15), 2)
    μ̄ ~ Normal(log(0.15), 2)
    τ ~ Exponential(0.1)
    λ ~ MvNormal(fill(λ̄, n-1), τ)
    μ ~ MvNormal(fill(μ̄, n-1), τ)
    q1 ~ Beta()
    q2 ~ Beta()
    ccd ~ model((λ=λ, μ=μ, η=η, q=[q1, q2]))
end
branchrates (generic function with 3 methods)

... and sample (this takes a bit longer!)

chain1 = sample(branchrates(w, n, ccd), NUTS(), 200)
Chains MCMC chain (200×50×1 Array{Float64, 3}):

Iterations        = 101:1:300
Number of chains  = 1
Samples per chain = 200
Wall duration     = 61.85 seconds
Compute duration  = 61.85 seconds
parameters        = η, λ̄, μ̄, τ, λ[1], λ[2], λ[3], λ[4], λ[5], λ[6], λ[7], λ[8], λ[9], λ[10], λ[11], λ[12], λ[13], λ[14], λ[15], λ[16], μ[1], μ[2], μ[3], μ[4], μ[5], μ[6], μ[7], μ[8], μ[9], μ[10], μ[11], μ[12], μ[13], μ[14], μ[15], μ[16], q1, q2
internals         = lp, n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size

Summary Statistics
  parameters      mean       std   naive_se      mcse       ess      rhat   es ⋯
      Symbol   Float64   Float64    Float64   Float64   Float64   Float64      ⋯

           η    0.7624    0.0088     0.0006    0.0022    5.0093    1.1250      ⋯
           λ̄   -2.4873    0.0326     0.0023    0.0054   17.8020    1.0396      ⋯
           μ̄   -1.8039    0.0282     0.0020    0.0062    2.2817    1.4026      ⋯
           τ    0.0546    0.0044     0.0003    0.0012    1.3286    2.1096      ⋯
        λ[1]   -2.4700    0.0681     0.0048    0.0172    6.8383    1.0380      ⋯
        λ[2]   -2.4520    0.0482     0.0034    0.0118    5.9925    1.1298      ⋯
        λ[3]   -2.4764    0.0373     0.0026    0.0086   11.7615    1.0227      ⋯
        λ[4]   -2.5010    0.0495     0.0035    0.0111    2.1071    1.4507      ⋯
        λ[5]   -2.5082    0.0521     0.0037    0.0131    8.7465    1.0463      ⋯
        λ[6]   -2.4845    0.0338     0.0024    0.0083    7.9660    1.1339      ⋯
        λ[7]   -2.4415    0.0458     0.0032    0.0112    1.5246    1.7847      ⋯
        λ[8]   -2.4669    0.0468     0.0033    0.0098   13.3540    1.1634      ⋯
        λ[9]   -2.5028    0.0510     0.0036    0.0119    3.8259    1.2367      ⋯
       λ[10]   -2.4905    0.0542     0.0038    0.0139   10.3745    1.0277      ⋯
       λ[11]   -2.4677    0.0422     0.0030    0.0101    8.8714    1.0836      ⋯
       λ[12]   -2.4920    0.0377     0.0027    0.0092   10.6115    1.0324      ⋯
       λ[13]   -2.4785    0.0639     0.0045    0.0167    6.4747    1.0838      ⋯
      ⋮           ⋮         ⋮         ⋮          ⋮         ⋮         ⋮         ⋱
                                                    1 column and 21 rows omitted

Quantiles
  parameters      2.5%     25.0%     50.0%     75.0%     97.5%
      Symbol   Float64   Float64   Float64   Float64   Float64

           η    0.7474    0.7575    0.7638    0.7678    0.7753
           λ̄   -2.5367   -2.5124   -2.4845   -2.4731   -2.4111
           μ̄   -1.8448   -1.8231   -1.8135   -1.7888   -1.7339
           τ    0.0486    0.0509    0.0528    0.0577    0.0635
        λ[1]   -2.5987   -2.5301   -2.4645   -2.4167   -2.3462
        λ[2]   -2.5379   -2.5034   -2.4298   -2.4147   -2.3750
        λ[3]   -2.5539   -2.5046   -2.4742   -2.4484   -2.4117
        λ[4]   -2.5791   -2.5355   -2.5126   -2.4698   -2.3962
        λ[5]   -2.6020   -2.5547   -2.5175   -2.4661   -2.4251
        λ[6]   -2.5501   -2.5055   -2.4825   -2.4619   -2.4106
        λ[7]   -2.5260   -2.4640   -2.4429   -2.4029   -2.3648
        λ[8]   -2.5320   -2.5088   -2.4776   -2.4367   -2.3612
        λ[9]   -2.5959   -2.5361   -2.5146   -2.4588   -2.4032
       λ[10]   -2.5782   -2.5141   -2.4917   -2.4503   -2.3990
       λ[11]   -2.5420   -2.5039   -2.4590   -2.4472   -2.4020
       λ[12]   -2.5637   -2.5133   -2.4826   -2.4724   -2.4323
       λ[13]   -2.6138   -2.5140   -2.4743   -2.4406   -2.3621
      ⋮           ⋮         ⋮         ⋮         ⋮         ⋮
                                                 21 rows omitted

Make a plot for the retention parameters

plot(chain1[[:q1,:q2]], size=(700,300))

Posterior predictive simulations

We can do posterior predictive simulations to assess model fit. There are of course many possible posterior predictive observables that we may employ to do so. The approach below is but one that is (partially) implemented. Here we compare simulated number of events for each branch with reconstructed number of events for each branch. That is, for $N$ samples from the posterior, we (1) simulate a data set of the size of our empirical data set and (2) sample a reconciled tree for each gene family. We then compare, for instance, the number of duplications on the branch leading to node $m$ in the two simulated sets. If the model fits these should be similar.
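The logic can be illustrated with a self-contained toy example using pretend posterior draws and a Poisson observation model; this is purely illustrative, and the real work for reconciled-event counts is done by Whale.ppsims:

```julia
using Random
rng = MersenneTwister(1)

# minimal Poisson sampler (Knuth's algorithm), so the sketch needs only Base
function rpois(rng, λ)
    L, k, p = exp(-λ), 0, 1.0
    while true
        p *= rand(rng)
        p < L && return k
        k += 1
    end
end

y = 7                                            # "observed" number of events
postλ = [abs(6.5 + randn(rng)) for _ in 1:200]   # pretend posterior draws of a rate
yrep = [rpois(rng, λ) for λ in postλ]            # one replicated count per draw
ppp = count(>=(y), yrep) / length(yrep)          # posterior predictive p-value
```

An extreme ppp (near 0 or 1) would indicate that the model rarely reproduces the observed statistic, i.e. poor fit in that respect.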

We need a function to get a parameterized model from a chain iterate:

function mfun(M, x)
    q1 = get(x, :q1).q1[1]
    q2 = get(x, :q2).q2[1]
    λ = vec(vcat(get(x, :λ).λ...))
    μ = vec(vcat(get(x, :μ).μ...))
    η = get(x, :η).η[1]
    M((λ=λ, μ=μ, q=[q1,q2], η=η))
end
mfun (generic function with 1 method)

The following will then do 100 posterior predictive simulations

pps = Whale.ppsims(chain1, mfun, w, ccd, 100);

and we make a plot for duplication events

lab = "duplication"
ps = map(w.order) do mnode
    l = "$(id(mnode))_$lab"
    dots = map(x->(x[1][l], x[2][l]), pps)
    xmn, xmx = extrema(first.(dots))  # range of the observed statistic
    scatter(dots, color=:black, ms=2, alpha=0.5, legend=false,
            xlabel="\$y\$", ylabel="\$\\tilde{y}\$",
            title=Whale.cladelabel(mnode),
            xlim=(xmn-0.5,xmx+0.5), xticks=xmn:1:xmx)
    plot!(x->x, color=:lightgray)
end
plot(ps..., size=(700,600))

one can of course also compare loss events, speciation events, WGD-derived duplicates etc.

The above is somewhat more informative when analyzing more data. Using this approach on chain0 and comparing against the results displayed here, one could for instance check whether the constant-rates assumption is strongly violated. If the dots in these plots are not systematically above or below the 1:1 line, at least this aspect of the data is explained well by the model.

Note

I know that this, and the sampling methods for reconciled trees, should be implemented in a somewhat more consistent and user-friendly style. It's on the to-do list.


This page was generated using Literate.jl.