API
DeadBird.add_internal!
— Methodadd_internal!(dag, ndata, parts, x, n)
For a species tree internal node n
, this adds the gene family nodes associated with n
to the graph and provides the bound on the number of lineages that survive to the present below n
for each gene family. Note that x
is a vector of tuples of DAG nodes that each will be joined into a newly added node. The resulting nodes are returned.
!!! note: I believe this also works for multifurcating species trees (like the Csuros Miklos algorithm does too)
DeadBird.add_leaves!
— Methodadd_leaves!(dag, ndata, parts, x, n)
For a species tree leaf node n
, this adds the vector of (gene) counts x
for that species to the graph. This returns for each gene family the corresponding node that was added to the graph
DeadBird.cm!
— Methodcm!(dag, node, model)
Compute the conditional survival probabilities at n
using the Csuros & Miklos (2009) algorithm. This assumes the model
already contains the computed transition probability matrices W
and that the partial loglikelihood vectors for the child nodes in the DAG are already computed and available.
DeadBird.conditionfactor
— Methodconditionfactor(model)
Compute the condition factor for the model for the associated data filtering strategy.
DeadBird.discretize
— Methoddiscretize(d, K)
Discretize a distribution d
in K
equal probability classes. Uses the median of each class as representative rate, and rescales the resulting vector x
so that mean(x) == mean(d)
.
Better would be to have the mean value of each class as representative I guess, but the median is much more straightforward to obtain given that we have quantile functions available.
Example
julia> discretize(Gamma(10, 0.1), 5)
5-element Array{Float64,1}:
0.6269427439826725
0.8195837806573205
0.9743503743962694
1.1475354999847722
1.4315876009789656
DeadBird.example_data
— Methodexample_data()
Get some example_data.
Example (and benchmark)
julia> x = DeadBird.example_data();
julia> @btime DeadBird.loglikelihood!(x.dag, x.model)
36.974 μs (431 allocations: 31.53 KiB)
-26.30930561857625
julia> @btime DeadBird.loglikelihood!(x.mat, x.model)
32.876 μs (420 allocations: 29.91 KiB)
-26.309305618576246
DeadBird.extp
— Methodextp(t, λ, μ, ϵ)
Compute the extinction probability of a single lineage evolving according to a linear BDP for time t
with rate λ
and μ
and with extinction probability of a single lineage at t
equal to ϵ
. This is ∑ᵢℙ{Xₜ=i|X₀=1}ϵ^i
Takes ϵ on a [0,1] scale
DeadBird.getθ
— Methodgetθ(m<:RatesModel, node)
Get the parameter values from a RatesModel
relevant for a particular node in a phylogeny. Should be implemented for each RatesModel where parameters differ across branches.
DeadBird.getϕψ
— Methodgetϕψ(t, λ, μ)
Returns ϕ = μ(eʳ - 1)/(λeʳ - μ)
where r = t*(λ-μ)
and ψ = ϕ*λ/μ
, with special cases for λ ≈ μ. These methods should be implemented as to prevent underflow/overflow issues. Note these quantities are also called p and q (in Csuros & Miklos) or α and β (in Bailey). Note that ϕ = P(Xₜ=0|X₀=1), i.e. the extinction probability for a single gene.
DeadBird.getϕψ′
— Methodgetϕψ′(ϕ, ψ, ϵ)
Adjusted ϕ and ψ for a linear BDP process with extinction probability ϵ after the process.
\[\phi' = \frac{\psi(1-\epsilon) + (1 - \psi) \epsilon}{1 - \psi \epsilon} \psi' = \frac{\psi(1-\epsilon)}{1 - \psi \epsilon}\]
Some edge cases are when ϵ is 1 or 0. Other edge cases may be relevant when ψ and or ϕ is 1 or 0.
We take ϵ on [0,1] scale.
DeadBird.insertwgms
— Methodinsertwgms(model, wgms::Dict)
Insert a bunch of WGMs in a given PhyloBDP model, will return a new model object. wgms
should be a dict with vectors of tuples, keeping for each branch a vector with (t, k) tuples. This version does not modify anything in the template model.
This is not particularly convenient for use in rjMCMC algorithms, where we want to efficiently add and remove single WGM events...
Example
x = DeadBird.example_data()
m = PhyloBDP(RatesModel(ConstantDLGWGD(q=ones(9))), x.tr, 5)
insertwgms(m, Dict(3=>[(0.1, 2)], 2=>[(0.3, 4)]))
DeadBird.loglikelihood!
— Methodloglikelihood!(dag::CountDAG, model::PhyloBDP)
Compute the log likelihood on the DAG using the Csuros & Miklos algorithm.
DeadBird.marginal_extinctionp
— Functionmarginal_extinctionp(d, logϵ)
Compute the marginal log probability of having no observed descendants for a branching process starting off with n
genes given by a probability distribution d
when the probability that a single gene goes extinct is ϵ
. This is:
\[\sum_{k=1}^\infty ϵ^k P\{X₀ = k\}\]
For many priors a closed form can be obtained by manipulating the sum so that it becomes a geometric series.
DeadBird.marginalize
— Functionmarginalize(p::ShiftedBetaGeometric, ℓvec, logϵ, imax=100)
There seems to be no closed form for this, but we can devise a recursion, analogous to the ShiftedBetaGeometric case. This could probably share code, but I'll have it separate for now.
DeadBird.marginalize
— Functionmarginalize(p::ShiftedBetaGeometric, ℓvec, logϵ, imax=100)
There seems to be no closed form for this, but we can devise a recursion and obtain a near-exact solution efficiently.
DeadBird.marginalize
— Functionmarginalize(prior, ℓvec, logϵ)
Compute the log-likelihood of the data at the root by marginalizing the partial conditional survival likelihoods at the root over the prior on the number of genes at the root. This is the following sum
\[\ell = \sum_{n=1}^{b} \ell[n] \Big( \sum_{i=0}^\infty \binom{n+i}{i} \epsilon^i (1 - \epsilon)^n P\{X_o = n+i\} \Big)\]
Where ℓ[n] = P{data|Yₒ=n}
, where Yₒ
denotes the number of genes at the root that leave observed descendants and Xₒ
denotes the total number of genes at the root, for which we specified the prior. b
is the bound on the number of surviving lineages, which is determined by the observed data. For many priors, the innner infinite sum can be simplified to a closed form after some algebraic manipulation and relying on the fact that ∑ binom(α + k - 1, k) z^k = (1 - z)^(-k)
for |z| < 1
.
DeadBird.newmodel
— Methodnewmodel(m::M, θ) where M<:RatesModel
Construct a new model of type M
by taking the parameters of m
and parameters defined in the named tuple θ
, the latter overriding the former.
DeadBird.nonextinctfromrootcondition
— Methodnonextinctfromrootcondition(model)
Compute the probability that a family existing at the root of the species tree leaves observed descendants in both clades stemming from the root, i.e. does not go extinct in any of the two clades stemming from the root. This uses the marginalization of the extinction probability over the prior distribution on the number of genes at the root using marginal_extinctionp
DeadBird.ppmf
— Methodppmf(node, x, prior, bound)
Compute the posterior pmf for the ancestral state at node node
, where x
holds the partial likelihoods somewhere, and prior
corresponds to the root or transition probability density that acts as a prior on the node state.
DeadBird.sample_ancestral
— Methodsample_ancestral(spl::AncestralSampler, x)
Sample a set of ancestral states using a pre-order traversal over the tree. This assumes the partial likelihoods are available in x
.
DeadBird.sample_ancestral_node
— Methodsample_ancestral_node(rng, node, x, prior, bound)
Sample ancestral state for node node
with prior
prior and relevant partial likelihoods computed in x
. The prior refers to either a root prior distribution or the transient probability distribution of the process given the parent state.
DeadBird.setw!
— Methodsetw!(W, θ, t)
Compute transition probability matrix for the ordinary (not conditional on survival that is) birth-death process. Using the recursive formulation of Csuros & Miklos.
DeadBird.simulate
— Functionsimulate(mfun::Function, data::DataFrame, chain, N)
Perform posterior predictive simulations. mfun
should be a function that takes an iterate of the chain
and returns a PhyloBDP
model, i.e. mfun(chain[i])
should return a parameterized model. data
is the observed data set to which posterior predictive simulations should correspond.
Example
julia> x = DeadBird.example_data();
julia> DeadBird.simulate(y->x.model((λ=y, μ=y)), x.df, ones(10), 100)
PP simulations (N = 100, n = 10)
DeadBird.simulate
— Functionsimulate(m::ModelArray)
simulate(m::MixtureModel, n)
simulate(m::PhyloBDP, n)
Simulate a set of random profiles from a phylogenetic birth-death model.
Example
julia> x = DeadBird.example_data();
julia> simulate(x.model, 5)
5×5 DataFrame
│ Row │ A │ B │ C │ rejected │ extinct │
│ │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼──────────┼─────────┤
│ 1 │ 1 │ 1 │ 1 │ 0 │ 0 │
│ 2 │ 1 │ 1 │ 1 │ 0 │ 0 │
│ 3 │ 2 │ 2 │ 2 │ 0 │ 0 │
│ 4 │ 0 │ 1 │ 1 │ 1 │ 1 │
│ 5 │ 1 │ 1 │ 1 │ 0 │ 0 │
DeadBird.wgmϵ
— Methodwgmϵ(q, k, logϵ)
Compute the log-extinction probability of a single lineage going through a k-plication event, given the extinction probability of a single lineage after the WGM event. Assumes the single parameter WGM retention model (assuming a single gene before the WGM, the number of retained genes after the WGM is a rv X' = 1 + Y where Y is Binomial(k - 1, q)).
DeadBird.wstar!
— Methodwstar!(W::Matrix, t, θ, ϵ)
Compute the transition probabilities for the conditional survival process recursively (not implemented using recursion though!). Note that the resulting transition matrix is not a stochastic matrix of some Markov chain.
DeadBird.AncestralSampler
— TypeAncestralSampler(model, bound)
A wrapper that contains the transition probability matrices for the transient distributions for the PhyloBDP (not conditioned on survival) along each branch.
DeadBird.BetaGeometric
— TypeBetaGeometric(η, ζ)
DeadBird.ConstantDLG
— TypeConstantDLG{T}
Simple constant rates duplication-loss and gain model. All nodes of the tree are associated with the same parameters (duplication rate λ, loss rate μ, gain rate κ).
DeadBird.ConstantDLGWGM
— TypeConstantDLGWGM{T}
Similar to ConstantDLG
, but with a field for whole-genome multiplication (WGM) nodes in the phylogeny, which have a single retention rate parameter q
each.
DeadBird.CountDAG
— TypeCountDAG(df::DataFrame, tree::Node)
Get a CountDAG
from a count matrix, i.e. the directed acyclic graph (DAG) representation of a phylogenetic profile for an (assumed known) species tree. This is a multitree.
Example
julia> x = DeadBird.example_data();
julia> dag = CountDAG(x.df, x.tr)
(dag = CountDAG({17, 20} directed simple Int64 graph), bound = 7)
DeadBird.DLG
— TypeDLG{T}
Simple branch-wise rates duplication-loss and gain model.
DeadBird.DLGWGM
— TypeDLGWGM{T}
Similar to DLG
, but with WGM nodes, see also ConstantDLGWGM
.
DeadBird.NodeData
— TypeNodeData{I}
Keeps some relevant information for nodes in the DAG representation of a phylogenetic profile matrix.
DeadBird.PPSim
— TypePPPSim
Container for posterior predictive simulations, constructor should not be called directly nor exported.
DeadBird.PhyloBDP
— TypePhyloBDP(ratesmodel, tree, bound)
The phylogenetic birth-death process model as defined by Csuros & Miklos (2009). The bound is exactly defined by the data under consideration.
!!! note: implemented as a <: DiscreteMultivariateDistribution
(for convenience with Turing.jl), however does not support a lot of the Distributions.jl interface.
Example
julia> x = DeadBird.example_data();
julia> rates = RatesModel(ConstantDLG(λ=0.1, μ=0.1));
julia> dag, bound = CountDAG(x.df, x.tr);
julia> rates = RatesModel(ConstantDLG(λ=0.1, μ=0.1));
julia> PhyloBDP(rates, x.tr, bound)
PhyloBDP(
~root
RatesModel with () fixed
ConstantDLG{Float64}
λ: Float64 0.1
μ: Float64 0.1
κ: Float64 0.0
η: Float64 0.66
)
DeadBird.Profile
— TypeProfile{T,I}
A phylogenetic profile, i.e. an observation of a discrete random variable associated with the leaves of a phylogenetic tree. This has a field x
for the extended profile (which records the bound on the number of lineages that survive below an internal node for internal nodes) and a field for the 'partial likelihoods' ℓ
.
DeadBird.ProfileMatrix
— TypeProfileMatrix(df::DataFrame, tree)
Obtain a ProfileMatrix struct for a count dataframe.
Example
julia> x = DeadBird.example_data();
julia> mat, bound = ProfileMatrix(x.df, x.tr)
(matrix = Profile{Float64,Int64}[2 1 … 0 1; 3 2 … 1 1; 7 3 … 0 4; 7 3 … 3 4], bound = 7)
DeadBird.RatesModel
— TypeRatesModel
Abstract type for diferent rate models for phylogenies (e.g. constant rates across the tree, branch-specific rates, models with WGD nodes, ...).
DeadBird.ShiftedBetaGeometric
— TypeShiftedBetaGeometric(η, ζ)
Beta-Geometric compound distribution on the domain [1, 2, ..., ∞). The pmf is given by
\[p_k = \frac{\mathrm{B}(\alpha + 1, \beta + k - 1)}{\mathrm{B}(\alpha, \beta)}\]
We use the alternative parameterization using the mean η = α/(α+β)
and offset 'sample size' ζ = α + β - 1
, where ζ > 0
. That is, we assume α + β > 1
.
DeadBird.ShiftedGeometric
— TypeShiftedGeometric
Geometric distribution with domain [1, 2, ..., ∞).
DeadBird.Transient
— TypeTransient
Transient distribution P(X(t)|X(0)=k). This is a simple struct for the sampler.