Drosophila
Here I illustrate the usage of the DeadBird package for fitting phylogenetic birth-death process models to data using Maximum likelihood and Bayesian inference. We will fit a simple single-rate (turnover rate λ, as in e.g. CAFE) model to the 12 Drosophila species data set.
Load the required packages
using DeadBird
using Distributions, Turing, CSV, DataFrames, NewickTree, Optim
using Random; Random.seed!(761);
Load the data
datadir = joinpath(@__DIR__, "../../example/drosophila")
tree = readnw(readline(joinpath(datadir, "tree.nw")))
data = CSV.read(joinpath(datadir, "counts-oib.csv"), DataFrame);
The data set size and number of taxa are
nrow(data), length(getleaves(tree))
(11519, 12)
We'll take a subset of the data for the sake of time.
data = data[20:10:10010,:];
first(data, 5)
5 rows × 12 columns
dana | dere | dgri | dmel | dmoj | dper | dpse | dsec | dsim | dvir | dwil | dyak | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | |
1 | 7 | 7 | 9 | 7 | 6 | 7 | 7 | 7 | 7 | 6 | 7 | 7 |
2 | 7 | 5 | 9 | 5 | 4 | 7 | 7 | 5 | 5 | 5 | 4 | 5 |
3 | 4 | 4 | 6 | 4 | 5 | 6 | 5 | 4 | 5 | 5 | 4 | 4 |
4 | 5 | 4 | 7 | 5 | 3 | 3 | 3 | 4 | 5 | 3 | 3 | 4 |
5 | 5 | 4 | 3 | 4 | 2 | 4 | 4 | 5 | 7 | 3 | 1 | 4 |
The average number of genes in non-extinct families is
m = mean(filter(x->x>0,Matrix(data)))
1.127086653804211
We can use this to parameterize the prior for the number of ancestral lineages
η = 1/m
rootprior = ShiftedGeometric(η)
ShiftedGeometric{Float64}(η=0.8872432271509378)
We will use the DAG data structure (most efficient, but admits no family-specific models).
dag, bound = CountDAG(data, tree)
(dag = CountDAG({752, 1489} directed simple Int64 graph), bound = 84)
We will define a Turing model for this simple problem
@model singlerate(dag, bound, tree, rootprior) = begin
λ ~ Turing.FlatPos(0.)
θ = ConstantDLG(λ=λ, μ=λ, κ=zero(λ))
dag ~ PhyloBDP(θ, rootprior, tree, bound)
end
singlerate (generic function with 1 method)
Maximum likelihood inference
First we show how to conduct MLE of a single parameter model for the entire data (i.e. we estimate a genome-wide parameter) using the CountDAG
data structure.
model = singlerate(dag, bound, tree, rootprior)
@time mleresult = optimize(model, MLE())
ModeResult with maximized lp of -3650.20
1-element Named Vector{Float64}
A │
───┼──────────
:λ │ 0.0950648
For the complete data set of >10000 families, this takes about 10 seconds.
It is straightforward to adapt the model definition to allow for different duplication and loss rates, non-zero gain rates (κ
) or different root priors.
Alternatively we could use the ProfileMatrix, which admits models that deal with variation across families. We can also use this to fit models independently across families. Here we will estimate the MLE of a single turnover rate for 100 families independently.
matrix, bound = ProfileMatrix(data, tree)
@model singlerate(mat, bound, tree, rootprior) = begin
λ ~ Turing.FlatPos(0.)
θ = ConstantDLG(λ=λ, μ=λ, κ=zero(λ))
mat ~ PhyloBDP(θ, rootprior, tree, bound)
end
@time results = map(1:size(matrix, 1)) do i
x = matrix[i]
model = singlerate(x, x.x[1], tree, rootprior)
mleresult = optimize(model, MLE())
mleresult.lp, mleresult.values[1]
end;
17.444568 seconds (32.42 M allocations: 7.187 GiB, 6.89% gc time, 22.91% compilation time)
Here we have fitted a single parameter model to each count vector (a phylogenetic profile) independently. Note that the MLE will be zero under this model when the profile consists of ones only. The results of the above look like this
first(results, 10)
10-element Vector{Tuple{Float64, Float64}}:
(-22.166663074268236, 0.09656456464512346)
(-24.69438549106619, 0.5010970827956874)
(-22.63610969812189, 0.29002267759742906)
(-22.49806861126835, 0.6980503796995863)
(-23.485123097196496, 1.0721987563046236)
(-20.65072764821289, 0.6336332029569388)
(-24.132043549383727, 0.5863538085743983)
(-19.54792662026289, 0.7387507977416689)
(-19.122396310095528, 0.3626784107179777)
(-10.752273172530012, 0.053150884293764106)
alternatively we can use MAP estimation to regularize the λ
estimates, for instance:
@model singlerate_ln(mat, bound, tree, rootprior) = begin
λ ~ LogNormal(log(0.1), 1)
θ = ConstantDLG(λ=λ, μ=λ, κ=zero(λ))
mat ~ PhyloBDP(θ, rootprior, tree, bound)
end
@time results_map = map(1:size(matrix, 1)) do i
x = matrix[i]
model = singlerate_ln(x, x.x[1], tree, rootprior)
mleresult = optimize(model, MAP())
mleresult.lp, mleresult.values[1]
end;
6.948932 seconds (12.80 M allocations: 1.661 GiB, 5.84% gc time, 63.21% compilation time)
MAP is also faster, since it is numerically better behaved (the MLE of 0. being on the boundary of parameter space).
Bayesian inference
Now we'll perform Bayesian inference using the No-U-turn sampler. Note that we've defined an uninformative flat prior (FlatPos(0.0)
), so we expect to find a posterior mean estimate for λ
that coincides with the MLE.
chain = sample(model, NUTS(), 100)
Chains MCMC chain (100×13×1 Array{Float64, 3}):
Iterations = 1:1:100
Number of chains = 1
Samples per chain = 100
parameters = λ
internals = acceptance_rate, hamiltonian_energy, hamiltonian_energy_error, is_accept, log_density, lp, max_hamiltonian_energy_error, n_steps, nom_step_size, numerical_error, step_size, tree_depth
Summary Statistics
parameters mean std naive_se mcse ess rhat
Symbol Float64 Float64 Float64 Missing Float64 Float64
λ 0.0953 0.0038 0.0004 missing 43.1560 1.0518
Quantiles
parameters 2.5% 25.0% 50.0% 75.0% 97.5%
Symbol Float64 Float64 Float64 Float64 Float64
λ 0.0887 0.0928 0.0954 0.0976 0.1025
Of course, it would be better to run such a chain for more iterations, e.g. 1000, but for the sake of time I'm only taking a 100 samples here. The 95% uncertainty interval for the turnover rate can be obtained as
quantile(chain; q=[0.025, 0.975])
Quantiles
parameters 2.5% 97.5%
Symbol Float64 Float64
λ 0.0887 0.1025
Other models
It is straightforward to use the Turing.jl model syntax (using the @model
macro) in combination with the various rates models and root priors defined in DeadBird (ConstantDLG
, DLG
, DLGWGM
, ShiftedBetaGeometric
, ...) to specify complicated models. A not so complicated example would be the following. First we filter the data to only allow for non-extinct families:
nonextinct = filter(x->all(Array(x) .> 0), data);
We will model the excess number of genes, i.e. the number of extra (duplicated) genes per family, instead of the total number of genes.
excessgenes = nonextinct .- 1;
Again we construct a DAG object
dag, bound = CountDAG(excessgenes, tree)
(dag = CountDAG({616, 1204} directed simple Int64 graph), bound = 72)
The model we specify is a linear birth-death and immigration process with immigration (gain) rate equal to the duplication rate, κ = λ
, and loss rate μ
. This corresponds to a model where genes duplicate at rate λ, (note that a 0 -> 1
transition is also a duplication here since the zero state corresponds to a single copy family), and where duplicated genes get lost at rate μ
. We assume λ < μ
, in which case there is a geometric stationary distribution with mean 1 - λ/μ
for the excess number of genes in a family.
bound01(η) = η <= zero(η) ? zero(η) + 1e-16 : η >= one(η) ? one(η) - 1e-16 : η
@model nonextinctmodel(dag, bound, tree) = begin
μ ~ Turing.FlatPos(0.)
η ~ Beta(1, 1) # 1 - λ/μ
η = bound01(η)
λ = μ * (1 - η)
rates = ConstantDLG(λ=λ, μ=μ, κ=λ)
rootp = Geometric(η)
dag ~ PhyloBDP(rates, rootp, tree, bound, cond=:none)
end
nonextinctmodel (generic function with 1 method)
and we sample
chain = sample(nonextinctmodel(dag, bound, tree), NUTS(), 100)
Chains MCMC chain (100×14×1 Array{Float64, 3}):
Iterations = 1:1:100
Number of chains = 1
Samples per chain = 100
parameters = η, μ
internals = acceptance_rate, hamiltonian_energy, hamiltonian_energy_error, is_accept, log_density, lp, max_hamiltonian_energy_error, n_steps, nom_step_size, numerical_error, step_size, tree_depth
Summary Statistics
parameters mean std naive_se mcse ess rhat
Symbol Float64 Float64 Float64 Missing Float64 Float64
η 0.8818 0.0061 0.0006 missing 44.1066 0.9985
μ 1.0018 0.0616 0.0062 missing 47.9508 1.0175
Quantiles
parameters 2.5% 25.0% 50.0% 75.0% 97.5%
Symbol Float64 Float64 Float64 Float64 Float64
η 0.8715 0.8772 0.8817 0.8865 0.8928
μ 0.8980 0.9578 1.0023 1.0434 1.1191
Get the posterior as a dataframe
pdf = DataFrame(chain)
μs = pdf[:, :μ]
λs = μs .* (1 .- pdf[:,:η])
100-element Vector{Float64}:
0.11146975963916175
0.10906818321660615
0.11765909519346371
0.11805565365881292
0.11696391165435056
0.11591291906429504
0.11591291906429504
0.11890776847168741
0.11954953654177786
0.11277132084679321
⋮
0.11581137098966252
0.11108414238281142
0.12367837031142703
0.11883223019451133
0.11626403941810533
0.11108193132791708
0.12464330126754311
0.10979376463741101
0.12178628604833494
The marginal posterior mean duplication rate (and 95% uncertainty interval) is
mean(λs), quantile(λs, [0.025, 0.975])
(0.1182284702618578, [0.10718703821332727, 0.1312999472431907])
The marginal posterior mean loss rate per duplicated gene is
mean(μs), quantile(μs, [0.025, 0.975])
(1.0017514768827385, [0.898036042921607, 1.1190831043328426])
This page was generated using Literate.jl.