Drosophila

Here I illustrate the usage of the DeadBird package for fitting phylogenetic birth-death process models to data using Maximum likelihood and Bayesian inference. We will fit a simple single-rate (turnover rate λ, as in e.g. CAFE) model to the 12 Drosophila species data set.

Load the required packages

using DeadBird
using Distributions, Turing, CSV, DataFrames, NewickTree, Optim
using Random; Random.seed!(761);

Load the data

datadir = joinpath(@__DIR__, "../../example/drosophila")
tree = readnw(readline(joinpath(datadir, "tree.nw")))
data = CSV.read(joinpath(datadir, "counts-oib.csv"), DataFrame);

The data set size and number of taxa are

nrow(data), length(getleaves(tree))

(11519, 12)

We'll take a subset of the data for the sake of time.

data = data[20:10:10010,:];
first(data, 5)

5 rows × 12 columns

	dana	dere	dgri	dmel	dmoj	dper	dpse	dsec	dsim	dvir	dwil	dyak
	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64
1	7	7	9	7	6	7	7	7	7	6	7	7
2	7	5	9	5	4	7	7	5	5	5	4	5
3	4	4	6	4	5	6	5	4	5	5	4	4
4	5	4	7	5	3	3	3	4	5	3	3	4
5	5	4	3	4	2	4	4	5	7	3	1	4

The average number of genes in non-extinct families is

m = mean(filter(x->x>0,Matrix(data)))

1.127086653804211

We can use this to parameterize the prior for the number of ancestral lineages

η = 1/m
rootprior = ShiftedGeometric(η)

ShiftedGeometric{Float64}(η=0.8872432271509378)

We will use the DAG data structure (most efficient, but admits no family-specific models).

dag, bound = CountDAG(data, tree)

(dag = CountDAG({752, 1489} directed simple Int64 graph), bound = 84)

We will define a Turing model for this simple problem

@model singlerate(dag, bound, tree, rootprior) = begin
    λ ~ Turing.FlatPos(0.)
    θ = ConstantDLG(λ=λ, μ=λ, κ=zero(λ))
    dag ~ PhyloBDP(θ, rootprior, tree, bound)
end

singlerate (generic function with 1 method)

Maximum likelihood inference

First we show how to conduct MLE of a single parameter model for the entire data (i.e. we estimate a genome-wide parameter) using the CountDAG data structure.

model = singlerate(dag, bound, tree, rootprior)
@time mleresult = optimize(model, MLE())

ModeResult with maximized lp of -3650.20
1-element Named Vector{Float64}
A  │ 
───┼──────────
:λ │ 0.0950648

For the complete data set of >10000 families, this takes about 10 seconds.

It is straightforward to adapt the model definition to allow for different duplication and loss rates, non-zero gain rates (κ) or different root priors.

Alternatively we could use the ProfileMatrix, which admits models that deal with variation across families. We can also use this to fit models independently across families. Here we will estimate the MLE of a single turnover rate for 100 families independently.

matrix, bound = ProfileMatrix(data, tree)

@model singlerate(mat, bound, tree, rootprior) = begin
    λ ~ Turing.FlatPos(0.)
    θ = ConstantDLG(λ=λ, μ=λ, κ=zero(λ))
    mat ~ PhyloBDP(θ, rootprior, tree, bound)
end

@time results = map(1:size(matrix, 1)) do i
    x = matrix[i]
    model = singlerate(x, x.x[1], tree, rootprior)
    mleresult = optimize(model, MLE())
    mleresult.lp, mleresult.values[1]
end;

 17.444568 seconds (32.42 M allocations: 7.187 GiB, 6.89% gc time, 22.91% compilation time)

Here we have fitted a single parameter model to each count vector (a phylogenetic profile) independently. Note that the MLE will be zero under this model when the profile consists of ones only. The results of the above look like this

first(results, 10)

10-element Vector{Tuple{Float64, Float64}}:
 (-22.166663074268236, 0.09656456464512346)
 (-24.69438549106619, 0.5010970827956874)
 (-22.63610969812189, 0.29002267759742906)
 (-22.49806861126835, 0.6980503796995863)
 (-23.485123097196496, 1.0721987563046236)
 (-20.65072764821289, 0.6336332029569388)
 (-24.132043549383727, 0.5863538085743983)
 (-19.54792662026289, 0.7387507977416689)
 (-19.122396310095528, 0.3626784107179777)
 (-10.752273172530012, 0.053150884293764106)

alternatively we can use MAP estimation to regularize the λ estimates, for instance:

@model singlerate_ln(mat, bound, tree, rootprior) = begin
    λ ~ LogNormal(log(0.1), 1)
    θ = ConstantDLG(λ=λ, μ=λ, κ=zero(λ))
    mat ~ PhyloBDP(θ, rootprior, tree, bound)
end

@time results_map = map(1:size(matrix, 1)) do i
    x = matrix[i]
    model = singlerate_ln(x, x.x[1], tree, rootprior)
    mleresult = optimize(model, MAP())
    mleresult.lp, mleresult.values[1]
end;

  6.948932 seconds (12.80 M allocations: 1.661 GiB, 5.84% gc time, 63.21% compilation time)

MAP is also faster, since it is numerically better behaved (the MLE of 0. being on the boundary of parameter space).

Bayesian inference

Now we'll perform Bayesian inference using the No-U-turn sampler. Note that we've defined an uninformative flat prior (FlatPos(0.0)), so we expect to find a posterior mean estimate for λ that coincides with the MLE.

chain = sample(model, NUTS(), 100)

Chains MCMC chain (100×13×1 Array{Float64, 3}):

Iterations        = 1:1:100
Number of chains  = 1
Samples per chain = 100
parameters        = λ
internals         = acceptance_rate, hamiltonian_energy, hamiltonian_energy_error, is_accept, log_density, lp, max_hamiltonian_energy_error, n_steps, nom_step_size, numerical_error, step_size, tree_depth

Summary Statistics
  parameters      mean       std   naive_se      mcse       ess      rhat
      Symbol   Float64   Float64    Float64   Missing   Float64   Float64

           λ    0.0953    0.0038     0.0004   missing   43.1560    1.0518

Quantiles
  parameters      2.5%     25.0%     50.0%     75.0%     97.5%
      Symbol   Float64   Float64   Float64   Float64   Float64

           λ    0.0887    0.0928    0.0954    0.0976    0.1025

Of course, it would be better to run such a chain for more iterations, e.g. 1000, but for the sake of time I'm only taking a 100 samples here. The 95% uncertainty interval for the turnover rate can be obtained as

quantile(chain; q=[0.025, 0.975])

Quantiles
  parameters      2.5%     97.5%
      Symbol   Float64   Float64

           λ    0.0887    0.1025

Other models

It is straightforward to use the Turing.jl model syntax (using the @model macro) in combination with the various rates models and root priors defined in DeadBird (ConstantDLG, DLG, DLGWGM, ShiftedBetaGeometric, ...) to specify complicated models. A not so complicated example would be the following. First we filter the data to only allow for non-extinct families:

nonextinct = filter(x->all(Array(x) .> 0), data);

We will model the excess number of genes, i.e. the number of extra (duplicated) genes per family, instead of the total number of genes.

excessgenes = nonextinct .- 1;

Again we construct a DAG object

dag, bound = CountDAG(excessgenes, tree)

(dag = CountDAG({616, 1204} directed simple Int64 graph), bound = 72)

The model we specify is a linear birth-death and immigration process with immigration (gain) rate equal to the duplication rate, κ = λ, and loss rate μ. This corresponds to a model where genes duplicate at rate λ, (note that a 0 -> 1 transition is also a duplication here since the zero state corresponds to a single copy family), and where duplicated genes get lost at rate μ. We assume λ < μ, in which case there is a geometric stationary distribution with mean 1 - λ/μ for the excess number of genes in a family.

bound01(η) = η <= zero(η) ? zero(η) + 1e-16 : η >= one(η) ? one(η) - 1e-16 : η

@model nonextinctmodel(dag, bound, tree) = begin
    μ ~ Turing.FlatPos(0.)
    η ~ Beta(1, 1)  # 1 - λ/μ
    η = bound01(η)
    λ = μ * (1 - η)
    rates = ConstantDLG(λ=λ, μ=μ, κ=λ)
    rootp = Geometric(η)
    dag ~ PhyloBDP(rates, rootp, tree, bound, cond=:none)
end

nonextinctmodel (generic function with 1 method)

and we sample

chain = sample(nonextinctmodel(dag, bound, tree), NUTS(), 100)

Chains MCMC chain (100×14×1 Array{Float64, 3}):

Iterations        = 1:1:100
Number of chains  = 1
Samples per chain = 100
parameters        = η, μ
internals         = acceptance_rate, hamiltonian_energy, hamiltonian_energy_error, is_accept, log_density, lp, max_hamiltonian_energy_error, n_steps, nom_step_size, numerical_error, step_size, tree_depth

Summary Statistics
  parameters      mean       std   naive_se      mcse       ess      rhat
      Symbol   Float64   Float64    Float64   Missing   Float64   Float64

           η    0.8818    0.0061     0.0006   missing   44.1066    0.9985
           μ    1.0018    0.0616     0.0062   missing   47.9508    1.0175

Quantiles
  parameters      2.5%     25.0%     50.0%     75.0%     97.5%
      Symbol   Float64   Float64   Float64   Float64   Float64

           η    0.8715    0.8772    0.8817    0.8865    0.8928
           μ    0.8980    0.9578    1.0023    1.0434    1.1191

Get the posterior as a dataframe

pdf = DataFrame(chain)
μs = pdf[:, :μ]
λs = μs .* (1 .- pdf[:,:η])

100-element Vector{Float64}:
 0.11146975963916175
 0.10906818321660615
 0.11765909519346371
 0.11805565365881292
 0.11696391165435056
 0.11591291906429504
 0.11591291906429504
 0.11890776847168741
 0.11954953654177786
 0.11277132084679321
 ⋮
 0.11581137098966252
 0.11108414238281142
 0.12367837031142703
 0.11883223019451133
 0.11626403941810533
 0.11108193132791708
 0.12464330126754311
 0.10979376463741101
 0.12178628604833494

The marginal posterior mean duplication rate (and 95% uncertainty interval) is

mean(λs), quantile(λs, [0.025, 0.975])

(0.1182284702618578, [0.10718703821332727, 0.1312999472431907])

The marginal posterior mean loss rate per duplicated gene is

mean(μs), quantile(μs, [0.025, 0.975])

(1.0017514768827385, [0.898036042921607, 1.1190831043328426])

This page was generated using Literate.jl.