Funding
I gratefully acknowledge current support from NSF grants
DMS-2023239 (TRIPODS Phase II)
and
DMS-2308495,
as well as a Van Vleck Research Professor Award
and a Vilas Distinguished Achievement Professorship.
Past funding includes NSF grants
DMS-1248176,
DMS-1149312 (CAREER),
DMS-1614242,
CCF-1740707 (TRIPODS Phase I),
DMS-1902892,
and
DMS-1916378,
as well as
an
Alfred P. Sloan Research Fellowship,
a Simons Fellowship,
and a Vilas Associates Award.
Students and Postdocs
Current Ph.D. Students
Hongyi Huang
Current Postdocs
David Clancy
Former Ph.D. Students
Max Hill [graduated 2023; now at IMSI]
Yu Sun [graduated 2023; now at TikTok]
Shuqi Yu [graduated 2023; now at Metanotitia]
Brandon Legried
[graduated 2020; now postdoc at Georgia Institute of Technology]
Kun-Chieh (Jason) Wang [graduated 2017; now at Google]
Former Postdocs
Wai Tong (Louis) Fan
[2015-2018; now assistant professor at Indiana University Bloomington]
Books and Surveys
Mathematical Methods in Data Science (with Python)
To be published by Cambridge University Press.
Modern Discrete Probability: An Essential Toolkit
Cambridge University Press, 2024.
Book review of "Phylogeny-Discrete and random processes in evolution by Mike Steel"
Bulletin of the AMS, 56:527-533, 2019.
Hands-on introduction to sequence-length requirements in phylogenetics
Bioinformatics and Phylogenetics. Computational Biology, vol 29. Springer, 2019.
In this tutorial, through a series of analytical computations and numerical
simulations, we review many known insights into a fundamental question: how much
data is needed to reconstruct the Tree of Life? Code is provided in Python.
Preprints
Estimating Graph Dimension with Cross-validated Eigenvalues
Preprint. With Fan Chen, Karl Rohe, Shuqi Yu.
In applied multivariate statistics, estimating the number of latent dimensions or the number of clusters is a fundamental and recurring problem. One common diagnostic is the scree plot, which shows the largest eigenvalues of the data matrix; the user searches for a "gap" or "elbow" in the decreasing eigenvalues; unfortunately, these patterns can hide beneath the bias of the sample eigenvalues. This methodological problem is conceptually difficult because, in many situations, there is only enough signal to detect a subset of the k population dimensions/eigenvectors. In this situation, one could argue that the correct choice of k is the number of detectable dimensions. We alleviate these problems with cross-validated eigenvalues. Under a large class of random graph models, without any parametric assumptions, we provide a p-value for each sample eigenvector. It tests the null hypothesis that this sample eigenvector is orthogonal to (i.e., uncorrelated with) the true latent dimensions. This approach naturally adapts to problems where some dimensions are not statistically detectable. In scenarios where all k dimensions can be estimated, we prove that our procedure consistently estimates k. In simulations and a data example, the proposed estimator compares favorably to alternative approaches in both computational and statistical performance.
Reducing Seed Bias in Respondent-Driven Sampling by Estimating Block Transition
Probabilities
Preprint. With Yilin Zhang and Karl Rohe.
Respondent-driven sampling (RDS) is a popular approach to study marginalized or
hard-to-reach populations. It collects samples from a networked population by
incentivizing participants to refer their friends into the study. One major challenge in
analyzing RDS samples is seed bias. Seed bias refers to the fact that when the social
network is divided into multiple communities (or blocks), the RDS sample might not
provide a balanced representation of the different communities in the population, and
such unbalance is correlated with the initial participant (or the seed). In this case,
the distributions of estimators are typically non-trivial mixtures, which are determined
(1) by the seed and (2) by how the referrals transition from one block to another. This
paper shows that (1) block-transition probabilities are easy to estimate with high
accuracy, and (2) we can use these estimated block-transition probabilities to estimate
the stationary distribution over blocks and thus, an estimate of the block proportions.
This stationary distribution on blocks has previously been used in the RDS literature to
evaluate whether the sampling process has appeared to `mix'. We use these estimated
block proportions in a simple post-stratified (PS) estimator that greatly diminishes
seed bias. By aggregating over the blocks/strata in this way, we prove that the PS
estimator is $\sqrt{n}$-consistent under a Markov model, even when other estimators are
not. Simulations show that the PS estimator has smaller Root Mean Square Error (RMSE)
compared to the state-of-the-art estimators.
Journal Papers and Refereed Proceedings
Maximum Likelihood Estimation for Unrooted 3-Leaf Trees: An Analytic Solution for the CFN Model
Bull. Math. Biol., 86:106, 2024. With Max Hill, Jose Israel Rodriguez.
Maximum likelihood estimation is among the most widely-used methods for
inferring phylogenetic trees from sequence data. This paper solves the problem
of computing solutions to the maximum likelihood problem for 3-leaf trees under
the 2-state symmetric mutation model (CFN model). Our main result is a
closed-form solution to the maximum likelihood problem for unrooted 3-leaf
trees, given generic data; this result characterizes all of the ways that a
maximum likelihood estimate can fail to exist for generic data and provides
theoretical validation for predictions made in [27]. Our proof makes use of
both classical tools for studying group-based phylogenetic models such as
Hadamard conjugation and reparameterization in terms of Fourier coordinates,
as well as more recent results concerning the semi-algebraic constraints of
the CFN model. To be able to put these into practice, we also give a complete
characterization to test genericity.
Pairwise sequence alignment at arbitrarily large evolutionary distance
Ann. Appl. Probab., 34(3):2714-2732, 2024. With Brandon Legried.
Ancestral sequence reconstruction is a key task in computational biology. It consists in inferring a
molecular sequence at an ancestral species of a known phylogeny, given descendant sequences at the tip of
the tree. In addition to its many biological applications, it has played a key role in elucidating the
statistical performance of phylogeny estimation methods. Here we establish a formal connection to another
important bioinformatics problem, multiple sequence alignment, where one attempts to best align a collection
of molecular sequences under some mismatch penalty score by inserting gaps. Our result is counter-intuitive:
we show that perfect pairwise sequence alignment with high probability is possible in principle at arbitrary
large evolutionary distances - provided the phylogeny is known and dense enough. We use techniques from
ancestral sequence reconstruction in the taxon-rich setting together with the probabilistic analysis of
sequence evolution models involving insertions and deletions.
QR-STAR: A Polynomial-Time Statistically Consistent Method for Rooting Species Trees under the Coalescent
Journal of Computational Biology, 30(11):1146-1181, 2023. With Yasamin Tabatabaee and Tandy Warnow.
We address the problem of rooting an unrooted species tree given a set of unrooted gene trees, under the assumption that gene trees evolve within the model species tree under the multispecies coalescent (MSC) model. Quintet Rooting (QR) is a polynomial time algorithm that was recently proposed for this problem, which is based on the theory developed by Allman, Degnan, and Rhodes that proves the identifiability of rooted 5-taxon trees from unrooted gene trees under the MSC. However, although QR had good accuracy in simulations, its statistical consistency was left as an open problem. We present QR-STAR, a variant of QR with an additional step and a different cost function, and prove that it is statistically consistent under the MSC. Moreover, we derive sample complexity bounds for QR-STAR and show that a particular variant of it based on “short quintets” has polynomial sample complexity. Finally, our simulation study under a variety of model conditions shows that QR-STAR matches or improves on the accuracy of QR. QR-STAR is available in open-source form on github.
Expanding the class of global objective functions for dissimilarity-based
hierarchical clustering
Journal of Classification, 40:513-526, 2023.
Recent work on dissimilarity-based hierarchical clustering has led to the introduction of global objective functions for this classical problem. Several standard approaches, such as average linkage clustering, as well as some new heuristics have been shown to provide approximation guarantees. Here, we introduce a broad new class of objective functions which satisfy desirable properties studied in prior work. Many common agglomerative and divisive clustering methods are shown to be greedy algorithms for these objectives, which are inspired by related concepts in phylogenetics.
Statistically consistent rooting of species trees under the multi-species coalescent model
RECOMB 2023. With Yasamin Tabatabaee and Tandy Warnow.
Rooted species trees are used in several downstream applications of phylogenetics. Most species tree estimation methods produce unrooted trees and additional methods are then used to root these unrooted trees. Recently, Quintet Rooting (QR) (Tabatabaee et al., ISMB and Bioinformatics 2022), a polynomial-time method for rooting an unrooted species tree given unrooted gene trees, was introduced. QR, which is based on a proof of identifiability of rooted 5-taxon trees in the presence of incomplete lineage sorting, was shown to have good accuracy, improving over other methods for rooting species trees when incomplete lineage sorting was the only cause of gene tree discordance, except when gene tree estimation error was very high. However, that study left the statistical consistency of QR as an open question. We present QR-STAR, a polynomial-time variant of QR that has an additional step for determining the rooted shape of each quintet tree. We prove that QR-STAR is statistically consistent under the multi-species coalescent (MSC) model. Our simulation study under a variety of model conditions shows that QR-STAR matches or improves on the accuracy of QR. QR-STAR is available in open source form at https://github.com/ytabatabaee/Quintet-Rooting.
Inconsistency of triplet-based and quartet-based species tree estimation under intralocus recombination
Journal of Computational Biology, 29(11):1173-1197, 2022. With Max Hill.
We consider species tree estimation from multiple loci subject to intralocus
recombination. We focus on $R^*$, a summary coalescent-based method using
rooted triplets, as well as a related quartet-based inference method. We demonstrate analytically that in both cases intralocus recombination
gives rise to an inconsistency zone, in which correct inference is not assured
even in the limit of infinite amount of data. In addition, we validate and
characterize this inconsistency zone through a simulation study that suggests
that differential rates of recombination between closely related taxa can
amplify the effect of incomplete lineage sorting and contribute to
inconsistency.
Impossibility of phylogeny reconstruction from k-mer counts
Annals of Applied Probability, 32(6):4893-4913, 2022. With Wai-Tong Louis Fan and Brandon Legried.
We consider phylogeny estimation under a two-state model of sequence evolution
by site substitution on a tree. In the asymptotic regime where the sequence
lengths tend to infinity, we show that for any fixed k no statistically
consistent phylogeny estimation is possible from k-mer counts of the leaf
sequences alone. Formally, we establish that the joint leaf distributions of
k-mer counts on two distinct trees have total variation distance bounded away
from 1 as the sequence length tends to infinity. That is, the two distributions
cannot be distinguished with probability going to one in that asymptotic regime.
Our results are information-theoretic: they imply an impossibility result for
any reconstruction method using only k-mer counts at the leaves.
Species tree estimation under joint modeling of coalescence and duplication:
sample complexity of quartet methods
Annals of Applied Probability, 32(6): 4681-4705, 2022. With Max Hill and Brandon Legried.
We consider species tree estimation under a standard stochastic model of gene
tree evolution that incorporates incomplete lineage sorting (as modeled by a
coalescent process) and gene duplication and loss (as modeled by a branching
process). Through a probabilistic analysis of the model, we derive sample
complexity bounds for widely used quartet-based inference methods that
highlight the effect of the duplication and loss rates in both subcritical and
supercritical regimes.
On the Effect of Intralocus Recombination on Triplet-Based Species Tree Estimation
RECOMB 2022. With Max Hill.
We consider species tree estimation from multiple loci subject to intralocus recombination.
We focus on R*, a summary coalescent-based methods using rooted triplets.
We demonstrate analytically that intralocus recombination gives rise to an
inconsistency zone, in which correct inference is not assured even in the limit
of infinite amount of data. In addition, we validate and characterize this
inconsistency zone through a simulation study that suggests that differential
rates of recombination between closely related taxa can amplify the effect of
incomplete lineage sorting and contribute to inconsistency.
A stochastic Farris transform for genetic data under the multispecies coalescent with applications to data requirements
Journal of Mathematical Biology, 84(5):36, April 2022. With Gautam Dasarathy, Elchanan Mossel, Robert Nowak.
The reconstruction of a species phylogeny from genomic data faces two
significant hurdles: 1) the trees describing the evolution of each
individual gene--i.e., the gene trees--may differ from the species
phylogeny and 2) the molecular sequences corresponding to each gene
often provide limited information about the gene trees themselves. In
this paper we consider an approach to species tree reconstruction that
addresses both these hurdles. Specifically, we propose an algorithm for
phylogeny reconstruction under the multispecies coalescent model with a
standard model of site substitution. The multispecies coalescent is
commonly used to model gene tree discordance due to incomplete lineage
sorting, a well-studied population-genetic effect. In previous work, an
information-theoretic trade-off was derived in this context between the
number of loci, m, needed for an accurate reconstruction and the length
of the locus sequences, k. It was shown that to reconstruct an internal
branch of length f, one needs m to be of the order of $1/[f^2 \sqrt{k}]$. That
previous result was obtained under the molecular clock assumption, i.e.,
under the assumption that mutation rates (as well as population sizes)
are constant across the species phylogeny. Here we generalize this
result beyond the restrictive molecular clock assumption, and obtain a
new reconstruction algorithm that has the same data requirement (up to
log factors). Our main contribution is a novel reduction to the
molecular clock case under the multispecies coalescent. As a corollary,
we also obtain a new identifiability result of independent interest: for
any species tree with $n \geq 3$ species, the rooted species tree can be
identified from the distribution of its unrooted weighted gene trees
even in the absence of a molecular clock.
Polynomial-Time Statistical Estimation of Species Trees Under Gene Duplication and Loss
Journal of Computational Biology, 28(5):452-468, 2021. With Brandon Legried, Erin Molloy, and Tandy Warnow.
Conference version in Proceedings of RECOMB 2020, 120-135.
Phylogenomics—the estimation of species trees from multi-locus datasets—is a
common step in many biological studies. However, this estimation is challenged
by the fact that genes can evolve under processes, including incomplete lineage
sorting (ILS) and gene duplication and loss (GDL), that make their trees
different from the species tree. In this paper, we address the challenge of
estimating the species tree under GDL. We show that species trees are
identifiable under a standard stochastic model for GDL, and that the
polynomial-time algorithm ASTRAL-multi, a recent development in the ASTRAL
suite of methods, is statistically consistent under this GDL model. We also
provide a simulation study evaluating ASTRAL-multi for species tree estimation
under GDL. All scripts and datasets used in this study are available on the
Illinois Data Bank: https://doi.org/10.13012/B2IDB-2626814_V1.
Sufficient condition for root reconstruction by parsimony on binary trees with general weights
Electronic Communications in Probability, 26:1-13, 2021. With Jason Wang.
We consider the problem of inferring an ancestral state from
observations at the leaves of a tree, assuming the state evolves along
the tree according to a two-state symmetric Markov process. We establish
a general branching rate condition under which maximum parsimony, a
common reconstruction method requiring only the knowledge of the tree,
succeeds better than random guessing uniformly in the depth of the tree.
We thereby generalize previous results of (Zhang et al., 2010) and
(Gascuel and Steel, 2010). Our results apply to both deterministic and
i.i.d. edge weights.
Impossibility of consistent distance estimation from sequence lengths under
the TKF91 model
Bulletin of Mathematical Biology, 82(9):123, 2020. With Wai-Tong Louis Fan and Brandon Legried.
We consider the problem of distance estimation under the TKF91 model of
sequence evolution by insertions, deletions and substitutions on a phylogeny.
In an asymptotic regime where the expected sequence lengths tend to infinity,
we show that no consistent distance estimation is possible from sequence
lengths alone. More formally, we establish that the distributions of pairs of
sequence lengths at different distances cannot be distinguished with
probability going to one.
Asymptotic seed bias in respondent-driven sampling
Electronic Journal of Statistics, 14(1):1577-1610, 2020. With Yuling Yan, Bret Hanlon and Karl Rohe.
Respondent-driven sampling (RDS) collects a sample of individuals in a networked
population by incentivizing the sampled individuals to refer their contacts into
the sample. This iterative process is initialized from some seed node(s). Sometimes,
this selection creates a large amount of seed bias. Other times, the seed bias is
small. This paper gains a deeper understanding of this bias by characterizing its
effect on the limiting distribution of various RDS estimators. Using classical tools
and results from multi-type branching processes (Kesten and Stigum, 1966), we show
that the seed bias is negligible for the Generalized Least Squares (GLS) estimator
and non-negligible for both the inverse probability weighted and Volz-Heckathorn (VH)
estimators. In particular, we show that (i) above a critical threshold, VH converge
to a non-trivial mixture distribution, where the mixture component depends on the
seed node, and the mixture distribution is possibly multi-modal. Moreover, (ii) GLS
converges to a Gaussian distribution independent of the seed node, under a certain
condition on the Markov process. Numerical experiments with both simulated data and
empirical social networks suggest that these results appear to hold beyond the Markov
conditions of the theorems.
Statistically consistent and computationally efficient inference of
ancestral DNA sequences in the TKF91 model under dense taxon sampling.
Bulletin of Mathematical Biology, 82(2):21, 2020. With L. Fan.
In evolutionary biology, the speciation history of living
organisms is represented graphically by a phylogeny, that is, a
rooted tree whose leaves correspond to current species and branchings
indicate past speciation events. Phylogenies are commonly estimated
from molecular sequences, such as DNA sequences, collected from the
species of interest. At a high level, the idea behind this inference
is simple: the further apart in the Tree of Life are two species, the
greater is the number of mutations to have accumulated in their genomes
since their most recent common ancestor. In order to obtain accurate
estimates in phylogenetic analyses, it is standard practice to employ
statistical approaches based on stochastic models of sequence evolution
on a tree. For tractability, such models necessarily make simplifying
assumptions about the evolutionary mechanisms involved. In particular,
commonly omitted are insertions and deletions of nucleotides -- also known
as indels.
Properly accounting for indels in statistical phylogenetic analyses remains
a major challenge in computational evolutionary biology. Here we consider
the problem of reconstructing ancestral sequences on a known phylogeny
in a model of sequence evolution incorporating nucleotide substitutions,
insertions and deletions, specifically the classical TKF91 process. We
focus on the case of dense phylogenies of bounded height, which we refer
to as the taxon-rich setting, where statistical consistency is achievable.
We give the first polynomial-time ancestral reconstruction algorithm with
provable guarantees under constant rates of mutation. Our algorithm
succeeds when the phylogeny satisfies the "big bang" condition, a
necessary and sufficient condition for statistical consistency in this
context.
Long-branch attraction in species tree estimation: inconsistency of
partitioned likelihood and topology-based summary methods
Systematic Biology, Volume 68, Issue 2, March 2019, Pages 281-297. With Michael Nute and Tandy Warnow.
With advances in sequencing technologies, there are now massive amounts of
genomic data from across all life, leading to the possibility that a robust
Tree of Life can be constructed. However, ``gene tree heterogeneity", which is
when different genomic regions can evolve differently, is a common phenomenon
in multi-locus datasets, and reduces the accuracy of standard methods for
species tree estimation that do not take this heterogeneity into account. New
methods have been developed for species tree estimation that specifically
address gene tree heterogeneity, and that have been proven to converge to the
true species tree when the number of loci and number of sites per locus both
increase (i.e., the methods are said to be ``statistically consistent"). Yet,
little is known about the biologically realistic condition where the number of
sites per locus is bounded. We show that when the sequence length of each locus
is bounded (by any arbitrarily chosen value), the most common approaches to
species tree estimation that take heterogeneity into account (i.e., traditional
fully partitioned concatenated maximum likelihood and newer approaches, called
summary methods, that estimate the species tree by combining gene trees) are
not statistically consistent, even when the heterogeneity is extremely
constrained. The main challenge is the presence of conditions such as long
branch attraction that create biased tree estimation when the number of sites
is restricted. Hence, our study uncovers a fundamental challenge to species
tree estimation using both traditional and new methods.
Generalized least squares can overcome the critical threshold in respondent-driven sampling
Proceedings of the National Academy of Sciences, 115(41):10299-10304, 2018. With Karl Rohe.
In order to sample marginalized and/or hard-to-reach populations,
respondent-driven sampling (RDS) and similar techniques reach their
participants via peer referral. Under a Markov model for RDS, previous
research has shown that if the typical participant refers too many
contacts, then the variance of common estimators does not decay like
O(1/n), where n is the sample size. This implies that confidence
intervals will be far wider than under a typical sampling design. Here
we show that generalized least squares (GLS) can effectively reduce the
variance of RDS estimates. In particular, a theoretical analysis
indicates that the variance of the GLS estimator is O(1/n). We then
derive two classes of feasible GLS estimators. The first class is based
upon a Degree Corrected Stochastic Blockmodel for the underlying social
network. The second class is based upon a rank-two model. It might be of
independent interest that in both model classes, the theoretical results
show that it is possible to estimate the spectral properties of the
population network from the sampled observations. Simulations on
empirical social networks show that the feasible GLS (fGLS) estimators
can have drastically smaller error and rarely increase the error. A
diagnostic plot helps to identify where fGLS will aid estimation. The
fGLS estimators continue to outperform standard estimators even when
they are built from a misspecified model and when there is preferential
recruitment.
Species tree estimation using ASTRAL: how many genes are enough?
IEEE/ACM Trans. Comput. Biology Bioinform., 15(5):1738--1747, 2018. With S. Shekhar and S. Mirarab.
Conference abstract in Proceedings of RECOMB 2017, 393-395.
Species tree reconstruction from genomic data is increasingly performed using methods that account
for sources of gene tree discordance such as incomplete lineage sorting. One popular method for
reconstructing species trees from unrooted gene tree topologies is ASTRAL. In this paper, we derive
theoretical sample complexity results for the number of genes required by ASTRAL to guarantee
reconstruction of the correct species tree with high probability. We also validate those theoretical
bounds in a simulation study. Our results indicate that ASTRAL requires $O(f^{-2} \log n)$ gene trees
to reconstruct the species tree correctly with high probability where n is the number of species and
f is the length of the shortest branch in the species tree. Our simulations, which are the first to
test ASTRAL explicitly under the anomaly zone, show trends consistent with the theoretical bounds and
also provide some practical insights on the conditions where ASTRAL works well.
Geometry of the sample frequency spectrum and the perils of demographic inference
Genetics, 210(2):665-682, 2018. With Zvi Rosen, Anand Bhaskar, Yun S. Song.
(Featured in
ISSUE HIGHLIGHTS
in Genetics, October 1, 2018.)
The sample frequency spectrum (SFS), which describes the distribution of
mutant alleles in a sample of DNA sequences, is a widely used summary
statistic in population genetics. The expected SFS has a strong
dependence on the historical population demography and this property is
exploited by popular statistical methods to infer complex demographic
histories from DNA sequence data. Most, if not all, of these inference
methods exhibit pathological behavior, however. Specifically, they often
display runaway behavior in optimization, where the inferred population
sizes and epoch durations can degenerate to 0 or diverge to infinity,
and show undesirable sensitivity of the inferred demography to
perturbations in the data. The goal of this paper is to provide
theoretical insights into why such problems arise. To this end, we
characterize the geometry of the expected SFS for piecewise-constant
demographic histories and use our results to show that the
aforementioned pathological behavior of popular inference methods is
intrinsic to the geometry of the expected SFS. We provide explicit
descriptions and visualizations for a toy model with sample size 4, and
generalize our intuition to arbitrary sample sizes n using tools from
convex and algebraic geometry. We also develop a universal
characterization result which shows that the expected SFS of a sample of
size n under an arbitrary population history can be recapitulated by a
piecewise-constant demography with only k(n) epochs, where k(n) is
between n/2 and 2n-1. The set of expected SFS for piecewise-constant
demographies with fewer than k(n) epochs is open and non-convex, which
causes the above phenomena for inference from data.
Necessary and sufficient conditions for consistent root reconstruction in Markov models
on trees
Electronic Journal of Probability, Volume 23, paper no. 47, 24 pp., 2018. With L. Fan.
We establish necessary and sufficient conditions for consistent root
reconstruction in continuous-time Markov models with countable state
space on bounded-height trees. Here a root state estimator is said to
be consistent if the probability that it returns to the true root state
converges to 1 as the number of leaves tends to infinity. We also derive
quantitative bounds on the error of reconstruction. Our results answer a
question of Gascuel and Steel and have implications for ancestral sequence
reconstruction in a classical evolutionary model of nucleotide insertion and
deletion.
On the Variance of Internode Distance Under the Multispecies Coalescent
Proceedings of RECOMB-CG 2018, 196-206.
We consider the problem of estimating species trees from unrooted gene tree topologies
in the presence of incomplete lineage sorting, a common phenomenon that creates gene
tree heterogeneity in multilocus datasets. One popular class of reconstruction methods
in this setting is based on internode distances, i.e. the average graph distance between
pairs of species across gene trees. While statistical consistency in the limit of large
numbers of loci has been established in some cases, little is known about the sample
complexity of such methods. Here we make progress on this question by deriving a lower
bound on the worst-case variance of internode distance which depends linearly on the
corresponding graph distance in the species tree. We also discuss some algorithmic
implications.
Circular Networks from Distorted Metrics
Proceedings of RECOMB 2018, 167-176. With J. Wang.
(
Best Paper Award, RECOMB 2018)
Trees have long been used as a graphical representation of species relationships.
However complex evolutionary events, such as genetic reassortments or hybrid
speciations which occur commonly in viruses, bacteria and plants, do not fit
into this elementary framework. Alternatively, various network representations
have been developed. Circular networks are a natural generalization of leaf-labeled
trees interpreted as split systems, that is, collections of bipartitions over leaf
labels corresponding to current species. Although such networks do not explicitly
model specific evolutionary events of interest, their straightforward visualization
and fast reconstruction have made them a popular exploratory tool to detect network-like
evolution in genetic datasets.
Standard reconstruction methods for circular networks, such as Neighbor-Net, rely
on an associated metric on the species set. Such a metric is first estimated from DNA
sequences, which leads to a key difficulty: distantly related sequences produce
statistically unreliable estimates. This is problematic for Neighbor-Net as it is based
on the popular tree reconstruction method Neighbor-Joining, whose sensitivity to distance
estimation errors is well established theoretically. In the tree case, more robust
reconstruction methods have been developed using the notion of a distorted metric,
which captures the dependence of the error in the distance through a radius of accuracy.
Here we design the first circular network reconstruction method based on distorted
metrics. Our method is computationally efficient. Moreover, the analysis of its radius
of accuracy highlights the important role played by the maximum incompatibility, a
measure of the extent to which the network differs from a tree.
Distance-based species tree estimation
under the coalescent: information-theoretic trade-off between number of loci and sequence length
Ann. Appl. Probab., 27(5)-2926-2955, 2017. With E. Mossel.
Conference abstract in Proceedings of RANDOM 2015, 931-942.
We consider the reconstruction of a phylogeny
from multiple genes under the multispecies coalescent.
We establish a connection with the sparse signal detection
problem, where one seeks to distinguish between
a distribution and a mixture of the distribution
and a sparse signal. Using this connection,
we derive an information-theoretic trade-off
between the number of genes, $m$, needed for an accurate
reconstruction and the sequence length, $k$, of the
genes. Specifically, we show that to detect
a branch of length $f$, one needs $m = \Theta(1/[f^{2} \sqrt{k}])$ genes.
Phase transition in the sample complexity of likelihood-based
phylogeny inference
Probability Theory and Related Fields, 169(1), 3-62, 2017. With A. Sly.
Reconstructing evolutionary trees from molecular sequence data
is a fundamental problem in computational biology. Stochastic
models of sequence evolution are closely related to spin systems
that have been extensively studied in statistical physics and
that connection has led to important insights on the theoretical
properties of phylogenetic reconstruction algorithms as well as the
development of new inference methods. Here, we study maximum
likelihood, a classical statistical technique which is perhaps
the most widely used in phylogenetic practice because of its
superior empirical accuracy.
At the theoretical level, except for its consistency, that is,
the guarantee of eventual correct reconstruction as the size of
the input data grows, much remains to be understood about the
statistical properties of maximum likelihood in this context.
In particular, the best bounds on the sample complexity or
sequence-length requirement of maximum likelihood, that is,
the amount of data required for correct reconstruction, are
exponential in the number, n, of tips---far from known lower
bounds based on information-theoretic arguments. Here we close
the gap by proving a new upper bound on the sequence-length
requirement of maximum likelihood that matches up to constants
the known lower bound for some standard models of evolution.
More specifically, for the r-state symmetric model of sequence
evolution on a binary phylogeny with bounded edge lengths, we
show that the sequence-length requirement behaves logarithmically
in n when the expected amount of mutation per edge is below what
is known as the Kesten-Stigum threshold. In general, the
sequence-length requirement is polynomial in n. Our results imply
moreover that the maximum likelihood estimator can be computed
efficiently on randomly generated data provided sequences are as above.
Phase transition on the convergence rate of parameter estimation under an Ornstein-Uhlenbeck diffusion on a tree
Journal of Mathematical Biology, 74(1):355-385, 2017. With C. Ane and L. Ho.
Diffusion processes on trees are commonly used in evolutionary biology to model the joint distribution
of continuous traits, such as body mass, across species. Estimating the parameters of such processes from
tip values presents challenges because of the intrinsic correlation between the observations produced by
the shared evolutionary history, thus violating the standard independence assumption of large-sample theory.
For instance (Ho and Ané, Ann Stat 41:957-981, 2013) recently proved that the mean (also known in this
context as selection optimum) of an Ornstein-Uhlenbeck process on a tree cannot be estimated consistently
from an increasing number of tip observations if the tree height is bounded. Here, using a fruitful connection
to the so-called reconstruction problem in probability theory, we study the convergence rate of parameter
estimation in the unbounded height case. For the mean of the process, we provide a necessary and sufficient
condition for the consistency of the maximum likelihood estimator (MLE) and establish a phase transition
on its convergence rate in terms of the growth of the tree. In particular we show that a loss of
$\sqrt{n}$-consistency (i.e., the variance of the MLE becomes $\Omega(n^{-1})$, where n is the number of tips)
occurs when the tree growth is larger than a threshold related to the phase transition of the reconstruction problem.
For the covariance parameters, we give a novel, efficient estimation method which achieves $\sqrt{n}$-consistency
under natural assumptions on the tree. Our theoretical results provide practical suggestions for the design of
comparative data collection.
Species tree estimation using ASTRAL: how many genes are enough?
Proceedings of RECOMB 2017, 393-395. With S. Shekhar and S. Mirarab.
Species tree reconstruction from genomic data is increasingly performed using methods that account
for sources of gene tree discordance such as incomplete lineage sorting. One popular method for
reconstructing species trees from unrooted gene tree topologies is ASTRAL. In this paper, we derive
theoretical sample complexity results for the number of genes required by ASTRAL to guarantee
reconstruction of the correct species tree with high probability. We also validate those theoretical
bounds in a simulation study. Our results indicate that ASTRAL requires $O(f^{-2} \log n)$ gene trees
to reconstruct the species tree correctly with high probability where n is the number of species and
f is the length of the shortest branch in the species tree. Our simulations, which are the first to
test ASTRAL explicitly under the anomaly zone, show trends consistent with the theoretical bounds and
also provide some practical insights on the conditions where ASTRAL works well.
Species trees from gene trees despite a high rate of lateral genetic
transfer: A tight bound
Proceedings of ACM-SIAM SODA 2016, 1621-1630. With C. Daskalakis.
Reconstructing the tree of life from molecular sequences
is a fundamental problem in computational biology. Modern
data sets often contain a large number of genes which can
complicate the reconstruction problem due to the fact that
different genes may undergo different evolutionary histories.
This is the case in particular in the presence of lateral
genetic transfer (LGT), whereby a gene is inherited from a
distant species rather than an immediate ancestor. Such an
event produces a gene tree which is distinct from
(but related to) the species phylogeny.
In previous work, a stochastic model of LGT was introduced
and it was shown that the species phylogeny can be reconstructed
from gene trees despite surprisingly high rates of LGT. Both
lower and upper bounds on this rate were obtained, but a large
gap remained. Here we close this gap, up to a constant. Specifically,
we show that the species phylogeny can be reconstructed perfectly
even when each edge of the tree has a constant probability of being
the location of an LGT event. Our new reconstruction algorithm builds
the tree recursively from the leaves. We also provide a matching bound
in the negative direction (up to a constant).
On the robustness to gene tree estimation error (or lack thereof) of
coalescent-based species tree methods
Systematic Biology, 64(4):663--676, 2015. With T. Warnow.
The estimation of species trees using multiple loci has become increasingly
common. Because different loci can have different phylogenetic histories
(reflected in different gene tree topologies) for multiple biological causes,
new approaches to species tree estimation have been developed that take gene
tree heterogeneity into account. Among these multiple causes, incomplete
lineage sorting (ILS), modeled by the multi-species coalescent, is potentially
the most common cause of gene tree heterogeneity, and much of the focus of
the recent literature has been on how to estimate species trees in the presence
of ILS. Despite progress in developing statistically consistent techniques for
estimating species trees when gene trees can differ due to ILS, there is
substantial controversy in the systematics community as to whether to use the
new coalescent-based methods or the traditional concatenation methods.
One of the key issues that has been raised is understanding the impact of
gene tree estimation error on coalescent-based methods that operate by
combining gene trees. Here we explore the mathematical guarantees of
coalescent-based methods when analyzing estimated rather than true gene trees.
Our results provide some insight into the differences between promise of
coalescent-based methods in theory and their performance in practice.
Data requirement for phylogenetic inference from multiple loci: A
new distance method
IEEE/ACM Trans. Comput. Biology Bioinform., 12(2):422-432, 2015. With Gautam Dasarathy and
Robert Nowak.
Conference abstract in Proceedings of ISIT 2014, 2037-2041.
We consider the problem of estimating the evolutionary history of a set of
species (phylogeny or species tree) from several genes. It is known that
the evolutionary history of individual genes (gene trees) might be topologically
distinct from each other and from the underlying species tree, possibly
confounding phylogenetic analysis. A further complication in practice is
that one has to estimate gene trees from molecular sequences of finite length.
We provide the first full data-requirement analysis of a species tree
reconstruction method that takes into account estimation errors at the
gene level. Under that criterion, we also devise a novel reconstruction
algorithm that provably improves over all previous methods in a regime of interest.
Likelihood-based tree reconstruction on a concatenation of aligned
sequence data sets can be statistically inconsistent
Theoretical Population Biology, 100:56-62, 2015. With M. Steel.
(Honorable Mention,
2018
Marcus W. Feldman Prize in Theoretical Population Biology.)
The reconstruction of a species tree from genomic data faces a double hurdle.
First, the (gene) tree describing the evolution of each gene may differ from
the species tree, for instance, due to incomplete lineage sorting. Second,
the aligned genetic sequences at the leaves of each gene tree provide merely
an imperfect estimate of the topology of the gene tree. In this note, we
demonstrate formally that a basic statistical problem arises if one tries
to avoid accounting for these two processes and analyses the genetic data
directly via a concatenation approach. More precisely, we show that, under
the multi-species coalescent with a standard site substitution model, maximum
likelihood estimation on sequence data that has been concatenated across
genes and performed under the incorrect assumption that all sites have evolved
independently and identically on a fixed tree is a statistically inconsistent
estimator of the species tree. Our results provide a formal justification of
simulation results described of Kubatko and Degnan (2007) and others, and
complements recent theoretical results by DeGorgio and Degnan (2010) and
Chifman and Kubtako (2014).
Distance-based species tree estimation
under the coalescent: information-theoretic trade-off between number of loci and sequence length
Proceedings of RANDOM 2015, 931-942. With E. Mossel.
We consider the reconstruction of a phylogeny
from multiple genes under the multispecies coalescent.
We establish a connection with the sparse signal detection
problem, where one seeks to distinguish between
a distribution and a mixture of the distribution
and a sparse signal. Using this connection,
we derive an information-theoretic trade-off
between the number of genes, $m$, needed for an accurate
reconstruction and the sequence length, $k$, of the
genes. Specifically, we show that to detect
a branch of length $f$, one needs $m = \Theta(1/[f^{2} \sqrt{k}])$ genes.
New sample complexity bounds for phylogenetic inference from multiple
loci
Proceedings of ISIT 2014, 2037-2041. With G. Dasarathy and
R. Nowak.
Journal version in IEEE/ACM Trans. Comput. Biology Bioinform., 12(2):422-432, 2015.
We consider the problem of estimating the evolutionary
history of a set of species (phylogeny or species tree)
from several genes. It has been known however that the
evolutionary history of individual genes (gene trees)
might be topologically distinct from each other and from
the underlying species tree, possibly confounding
phylogenetic analysis. A further complication in practice
is that one has to estimate gene trees from molecular
sequences of finite length. We provide the first full
data-requirement analysis of a species tree reconstruction
method that takes into account estimation errors at the
gene level. Under that criterion, we also devise a novel
algorithm that provably improves over all previous methods
in a regime of interest.
Recovering the tree-like trend of evolution despite extensive lateral genetic transfer: A probabilistic analysis
Journal of Computational Biology, 20(2):93-112, 2013. With S. Snir.
Conference abstract in Proceedings of RECOMB 2012, 224-238.
Lateral gene transfer (LGT) is a common mechanism of non-vertical
evolution where genetic material is transferred between two more or
less distantly related organisms. It is particularly common in
bacteria where it contributes to adaptive evolution with important
medical implications. In evolutionary studies, LGT has been shown to
create widespread discordance between gene trees as genomes become
mosaics of gene histories. In particular, the Tree of Life has been
questioned as an appropriate representation of bacterial evolutionary
history. Nevertheless a common hypothesis is that prokaryotic
evolution is primarily tree-like, but that the underlying trend is
obscured by LGT. Extensive empirical work has sought to extract a
common tree-like signal from conflicting gene trees.
Here we give a probabilistic perspective on the problem of recovering
the tree-like trend despite LGT. Under a model of randomly
distributed LGT, we show that the species phylogeny can be
reconstructed even in the presence of surprisingly many (almost
linear number of) LGT events per gene tree. Our results, which are
optimal up to logarithmic factors, are based on the analysis of a
robust, computationally efficient reconstruction method and provides
insight into the design of such methods. Finally we show that our
results have implications for the discovery of highways of gene
sharing.
Identifiability and inference of non-parametric rates-across-sites models on large-scale phylogenies
Journal of Mathematical Biology, 67(4):767-797, 2013. With E. Mossel.
Mutation rate variation across loci is well known to cause difficulties, notably identifiability issues, in the reconstruction of evolutionary trees from molecular sequences. Here we introduce a new approach for estimating general rates-across-sites models. Our results imply, in particular, that large phylogenies are typically identifiable under rate variation. We also derive sequence-length requirements for high-probability reconstruction.
Our main contribution is a novel algorithm that clusters sites according to their mutation rate. Following this site clustering step, standard reconstruction techniques can be used to recover the phylogeny. Our results rely on a basic insight: that, for large trees, certain site statistics experience concentration-of-measure phenomena.
Robust Estimation of Latent Tree Graphical Models: Inferring Hidden States with Inexact Parameters
IEEE Transactions on Information Theory, 59(7):4357-4373, 2013. With E. Mossel and A. Sly.
Latent tree graphical models are widely used in computational biology,
signal and image processing, and network tomography. Here we design a
new efficient, estimation procedure for latent tree models, including
Gaussian and discrete, reversible models, that significantly improves
on previous sample requirement bounds. Our techniques are based on a
new hidden state estimator which is robust to inaccuracies in estimated
parameters. More precisely, we prove that latent tree models can be
estimated with high probability in the so-called Kesten-Stigum regime
with $O(log^2 n)$ samples where $n$ is the number of nodes.
Alignment-Free Phylogenetic Reconstruction: Sample Complexity via a Branching Process Analysis
Annals of Applied Probability, 23(2):693-721, 2013. With C. Daskalakis.
Conference abstract in Proceedings of RECOMB 2010, 123-137.
We present an efficient phylogenetic reconstruction algorithm,
allowing insertions and deletions, which provably achieves a
sequence-length requirement (or sample complexity) growing
polynomially in the number of taxa. Our algorithm is distance-based,
that is, it relies on pairwise sequence comparisons. More importantly,
our approach largely bypasses the difficult problem of multiple
sequence alignment.
An analytical comparison of coalescent-based multilocus methods: The three-taxon case
Proceedings of PSB 2013, 297-306.
Incomplete lineage sorting (ILS) is a common source of gene tree incongruence
in multilocus analyses. A large number of methods have been developed to infer
species trees in the presence of ILS. Here we provide a mathematical analysis
of several coalescent-based methods. Our analysis is performed on a three-taxon
species tree and assumes that the gene trees are correctly reconstructed along
with their branch lengths.
Phylogenetic Mixtures: Concentration of Measure in the Large-Tree Limit
Annals of Applied Probability, 22(6):2429-2459, 2012. With E. Mossel.
The reconstruction of phylogenies from DNA or protein sequences is a major task of computational evolutionary biology. Common phenomena, notably variations in mutation rates across genomes and incongruences between gene lineage histories, often make it necessary to model molecular data as originating from a mixtureof phylogenies. Such mixed models play an increasingly important role in practice.
Using concentration of measure techniques, we show that mixtures of large trees are typically identifiable. We also derive sequence-length requirements for high-probability reconstruction.
Global Alignment of Molecular Sequences via Ancestral State Reconstruction
Stochastic Processes and their Applications, 122(12):3852-3874, 2012.
With A. Andoni, C. Daskalakis, and A. Hassidim.
Conference abstract in Proceedings of ICS 2010, 358-369.
We consider the trace reconstruction problem on a tree (TRPT):
a binary sequence is broadcast through a tree channel where
we allow substitutions, deletions, and insertions; we seek
to reconstruct the original sequence from the sequences
received at the leaves. The TRPT is motivated by the
multiple sequence alignment problem in computational biology.
We give a simple recursive procedure giving strong reconstruction
guarantees at low mutation rates.
We begin with a random sequence $x_1, \ldots, x_k$ which is located at the root.
If vertex $v$ has the sequence $y_1, \ldots y_{k_{v}}$, then each one of its $d$
children will have a sequence which is generated from $y_1, \ldots y_{k_{v}}$ by
flipping three biased coins for each bit. The first coin has probability $p_s$
for Heads, and determines whether this bit will be substituted or not.
The second coin has probability $p_d$, and determines whether this bit
will be deleted, and the third coin has probability $p_i$ and determines
whether a new random bit will be inserted. The input to the procedure is
the sequences of the $n$ leaves of the tree, as well as the tree structure
(but not the sequences of the inner vertices) and the goal is to reconstruct
an approximation to the sequence of the root (the DNA of the ancestral father).
For every $\epsilon > 0$, we present a deterministic algorithm which outputs an
approximation of $x_1, \ldots, x_k$ with probability $1 - \epsilon$, if
$p_i + p_d < O(1/k^{2/3} \log n)$ and $(1 - 2p_s)^2 > O(d^{-1} \log d)$,
where the probability is on the coin flips which determine the mutation process.
To our knowledge, this is the first rigorous trace reconstruction result on a tree in the presence of indels.
On Fixed-Price Marketing for Goods with Positive Network Externalities
Proceedings of WINE 2012, 532-538. With V. Mirrokni and M. Sundararajan.
In this paper we discuss marketing strategies for goods that have positive
network externalities, i.e., when a buyer's value for an item is positively
influenced by others owning the item.
We investigate revenue-optimal strategies of a specific form where the seller
gives the item for free to a set of users, and then sets a fixed price
for the rest. We present a $1\over 2$-approximation for this problem under
assumptions about the form of the externality. To do so, we apply ideas
from the influence maximization literature
and also use a recent result on non-negative submodular
maximization as a blax-box.
Recovering the tree-like trend of evolution despite extensive lateral genetic transfer: A probabilistic analysis
Proceedings of RECOMB 2012, 224-238. With S. Snir.
Journal version in Journal of Computational Biology, 20(2):93-112, 2013.
Lateral gene transfer (LGT) is a common mechanism of non-vertical
evolution where genetic material is transferred between two more or
less distantly related organisms. It is particularly common in
bacteria where it contributes to adaptive evolution with important
medical implications. In evolutionary studies, LGT has been shown to
create widespread discordance between gene trees as genomes become
mosaics of gene histories. In particular, the Tree of Life has been
questioned as an appropriate representation of bacterial evolutionary
history. Nevertheless a common hypothesis is that prokaryotic
evolution is primarily tree-like, but that the underlying trend is
obscured by LGT. Extensive empirical work has sought to extract a
common tree-like signal from conflicting gene trees.
Here we give a probabilistic perspective on the problem of recovering
the tree-like trend despite LGT. Under a model of randomly
distributed LGT, we show that the species phylogeny can be
reconstructed even in the presence of surprisingly many (almost
linear number of) LGT events per gene tree. Our results, which are
optimal up to logarithmic factors, are based on the analysis of a
robust, computationally efficient reconstruction method and provides
insight into the design of such methods. Finally we show that our
results have implications for the discovery of highways of gene
sharing.
Phylogenies without Branch Bounds: Contracting the Short, Pruning the Deep
SIAM J. Discrete Math., 25(2):872-893, 2011.
With C. Daskalakis, E. Mossel.
Conference abstract in Proceedings of RECOMB 2009, 451-465.
We introduce a new phylogenetic reconstruction algorithm which,
unlike most previous rigorous inference techniques, does not rely
on assumptions regarding the branch lengths or the depth of the tree.
The algorithm returns a forest which is guaranteed to contain all edges
that are: 1) sufficiently long and 2) sufficiently close to the leaves.
How much of the true tree is recovered depends on the sequence length provided.
The algorithm is distance-based and runs in polynomial time.
On the inference of large phylogenies with long branches: How long is too long?
Bulletin of Mathematical Biology, 73(7):1627-1644, 2011. With E. Mossel and A. Sly.
Recent work has highlighted deep connections between sequence-length requirements for high-probability phylogeny reconstruction and the related problem of the estimation of ancestral sequences. In [Daskalakis et al.'09], building on the work of [Mossel'04], a tight sequence-length requirement was obtained for the CFN model. In particular the required sequence length for high-probability reconstruction was shown to undergo a sharp transition (from $O(\log n)$ to $\hbox{poly}(n)$, where $n$ is the number of leaves) at the "critical" branch length $\critmlq$ (if it exists) of the ancestral reconstruction problem.
Here we consider the GTR model. For this model, recent results of [Roch'09] show that the tree can be accurately reconstructed with sequences of length $O(\log(n))$ when the branch lengths are below $\critksq$, known as the Kesten-Stigum (KS) bound. Although for the CFN model $\critmlq = \critksq$, it is known that for the more general GTR models one has $\critmlq \geq \critksq$ with a strict inequality in many cases. Here, we show that this phenomenon also holds for phylogenetic reconstruction by exhibiting a family of symmetric models $Q$ and a phylogenetic reconstruction algorithm which recovers the tree from $O(\log n)$-length sequences for some branch lengths in the range $(\critksq,\critmlq)$. Second we prove that phylogenetic reconstruction under GTR models requires a polynomial sequence-length for branch lengths above $\critmlq$.
Reconstruction on Trees: Exponential Moment Bounds for Linear Estimators
Electronic Communications in Probability, 16:251-261, 2011.
With Y. Peres.
Consider a Markov chain $(\xi_v)_{v \in V} \in [k]^V$ on the infinite
$b$-ary tree $T = (V,E)$ with irreducible edge transition
matrix $M$, where $b \geq 2$, $k \geq 2$ and $[k] = \{1,\ldots,k\}$.
We denote by $L_n$ the level-$n$ vertices of $T$.
Assume $M$ has a real second-largest (in absolute value) eigenvalue $\lambda$
with corresponding real eigenvector $\nu \neq 0$.
Letting $\sigma_v = \nu_{\xi_v}$, we consider the following root-state
estimator,
which was introduced by
Mossel and Peres (2003) in the context of the ``recontruction problem''
on trees:
\begin{equation*}
S_n = (b\lambda)^{-n} \sum_{x\in L_n} \sigma_x.
\end{equation*}
As noted by Mossel and Peres, when $b\lambda^2 > 1$ (the so-called Kesten-Stigum reconstruction phase)
the quantity $S_n$ has uniformly bounded variance. Here,
we give bounds on the moment-generating
functions of $S_n$ and $S_n^2$ when $b\lambda^2 > 1$. Our results have implications for the
inference of evolutionary trees.
Evolutionary Trees and the Ising Model on the Bethe Lattice: A Proof of Steel's Conjecture
Probability Theory and Related Fields, 149(1-2):149-189, 2011.
With C. Daskalakis, E. Mossel.
Conference abstract in Proceedings of ACM STOC 2006, 159-168.
A major task of evolutionary biology is the
reconstruction of phylogenetic trees from molecular data. The
evolutionary model is given by a Markov chain on
a tree. Given samples from the leaves of the Markov chain,
the goal is to reconstruct the
leaf-labelled tree.
It is well known that in order to
reconstruct a tree on $n$ leaves,
sample sequences
of length $\Omega(\log
n)$ are needed. It was conjectured by M.~Steel that for the CFN/Ising
evolutionary model, if the mutation probability on all edges of
the tree is less than $p^{\ast} = (\sqrt{2}-1)/2^{3/2}$,
then the
tree can be recovered from sequences of length $O(\log n)$. The value
$p^{\ast}$ is given by the transition point for the extremality of the
free Gibbs measure for the Ising model on the binary tree. Steel's
conjecture was proven by the second author in the special case where the tree
is ``balanced.'' The second author also proved that if all edges
have mutation probability larger than $p^{\ast}$ then the length
needed is $n^{\Omega(1)}$.
Here we show that Steel's conjecture holds true for general trees by giving a
reconstruction algorithm that recovers the tree from $O(\log n)$-length sequences
when the mutation probabilities are discretized and less than $p^\ast$.
Our proof and results demonstrate that extremality of the free
Gibbs measure on the infinite binary tree, which has been studied
before in probability, statistical physics and computer science,
determines how distinguishable are Gibbs measures on finite binary
trees.
Network Delay Inference from Additive Metrics
Random Structures and Algorithms, 37(2):176-203, 2010.
With S. Bhamidi and R. Rajagopal.
We demonstrate the use of computational phylogenetic techniques
to solve a central problem in inferential
network monitoring. More precisely, we design a novel algorithm
for multicast-based delay inference, i.e.
the problem of reconstructing the topology and delay characteristics
of a network from end-to-end delay
measurements on network paths. Our inference algorithm is based on
additive metric techniques widely used in phylogenetics. It runs in
polynomial time and requires a sample of size only $\poly(\log n)$.
Incomplete Lineage Sorting: Consistent Phylogeny Estimation from Multiple Loci
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7(1):166-171 , 2010.
With E. Mossel.
We introduce a simple algorithm for reconstructing phylogenies
from multiple gene trees in the presence of incomplete lineage
sorting, that is, when the topology of the gene trees may differ
from that of the species tree. We show that our technique is
statistically consistent under standard stochastic assumptions,
that is, it returns the correct tree given sufficiently many
unlinked loci. We also show that it can tolerate moderate estimation errors.
Submodularity of Influence in Social Networks: From Local to Global
SIAM J. Comput., 39(6):2176-2188, 2010.
With E. Mossel.
Conference abstract in Proceedings of ACM STOC 2007, 128-134.
Social networks are often represented as directed graphs, where the nodes are individuals and the edges indicate a form of social relationship. A simple way to model the diffusion of ideas, innovative behavior, or word-of-mouth effects on such a graph is to consider an increasing process of infected (or active) nodes: each node becomes infected once an activation function of the set of its infected neighbors crosses a certain threshold value. Such a model was introduced by Kempe, Kleinberg, and Tardos (KKT) in [Maximizing the spread of influence through a social network, in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003, pp. 137146] and [Influential nodes in a diffusion model for social networks, in Proceedings of the 32nd International Colloquium on Automata, Languages and Programming (ICALP), 2005], where the authors also impose several natural assumptions: the threshold values are random and the activation functions are monotone and submodular. The monotonicity condition indicates that a node is more likely to become active if more of its neighbors are active, while the submodularity condition indicates that the marginal effect of each neighbor is decreasing when the set of active neighbors increases. For an initial set of active nodes $S$, let $\sigma(S)$ denote the expected number of active nodes at termination. Here, we prove a conjecture of KKT: we show that the function $\sigma(S)$ is submodular under the assumptions above. We prove the same result for the expected value of any monotone, submodular function of the set of active nodes at termination. Roughly, our results demonstrate that local submodularity is preserved globally under this diffusion process. This is of natural computational interest, as many optimization problems have good approximation algorithms for submodular functions.
Toward Extracting All Phylogenetic Information from Matrices of Evolutionary Distances
Science, 327(5971):1376 - 1379, 2010. (Posted by permission of the AAAS for personal use, not for redistribution.)
The matrix of evolutionary distances is a model-based statistic, derived from molecular sequences, summarizing the pairwise phylogenetic relationships between a collection of species. Phylogenetic tree reconstruction methods relying on this matrix are relatively fast and thus widely used in molecular systematics. However, because of their intrinsic reliance on summary statistics, distance-matrix methods are assumed to be less accurate than likelihood-based approaches. In this paper, pairwise sequence comparisons are shown to be more powerful than previously hypothesized. A statistical analysis of certain distance-based techniques indicates that their data requirement for large evolutionary trees essentially matches the conjectured performance of maximum likelihood methods---challenging the idea that summary statistics lead to suboptimal analyses. On the basis of a connection between ancestral state reconstruction and distance averaging, the critical role played by the covariances of the distance matrix is identified.
Alignment-Free Phylogenetic Reconstruction
Proceedings of RECOMB 2010, 123-137.
With C. Daskalakis.
Journal version in Annals of Applied Probability, 23(2):693-721, 2013.
We introduce the first polynomial-time phylogenetic
reconstruction algorithm under a model of sequence evolution
allowing insertions
and deletions---or indels. Given appropriate
assumptions, our algorithm requires sequence lengths growing
polynomially in the number of leaf taxa.
Our techniques are distance-based
and largely bypass the problem of multiple alignment.
Global Alignment of Molecular Sequences via Ancestral State Reconstruction
Proceedings of ICS 2010, 358-369.
With A. Andoni, C. Daskalakis, and A. Hassidim.
Journal version in Stochastic Processes and their Applications, 122(12):3852-3874, 2012.
We consider the trace reconstruction problem on a tree (TRPT):
a binary sequence is broadcast through a tree channel where
we allow substitutions, deletions, and insertions; we seek
to reconstruct the original sequence from the sequences
received at the leaves. The TRPT is motivated by the
multiple sequence alignment problem in computational biology.
We give a simple recursive procedure giving strong reconstruction
guarantees at low mutation rates.
We begin with a random sequence $x_1, \ldots, x_k$ which is located at the root.
If vertex $v$ has the sequence $y_1, \ldots y_{k_{v}}$, then each one of its $d$
children will have a sequence which is generated from $y_1, \ldots y_{k_{v}}$ by
flipping three biased coins for each bit. The first coin has probability $p_s$
for Heads, and determines whether this bit will be substituted or not.
The second coin has probability $p_d$, and determines whether this bit
will be deleted, and the third coin has probability $p_i$ and determines
whether a new random bit will be inserted. The input to the procedure is
the sequences of the $n$ leaves of the tree, as well as the tree structure
(but not the sequences of the inner vertices) and the goal is to reconstruct
an approximation to the sequence of the root (the DNA of the ancestral father).
For every $\epsilon > 0$, we present a deterministic algorithm which outputs an
approximation of $x_1, \ldots, x_k$ with probability $1 - \epsilon$, if
$p_i + p_d < O(1/k^{2/3} \log n)$ and $(1 - 2p_s)^2 > O(d^{-1} \log d)$,
where the probability is on the coin flips which determine the mutation process.
To our knowledge, this is the first rigorous trace reconstruction result on a tree in the presence of indels.
Shrinkage Effect in Ancestral Maximum Likelihood
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 6(1):126-133, 2009.
With E. Mossel, M. Steel.
Ancestral maximum likelihood (AML) is a method that simultaneously
reconstructs a phylogenetic tree and ancestral sequences from extant
data (sequences at the leaves). The tree and ancestral sequences
maximize the probability of observing the given data under a Markov
model of sequence evolution, in which branch lengths are also optimized
but constrained to take the same value on any edge across all sequence
sites. AML differs from the more usual form of maximum likelihood (ML)
in phylogenetics because ML averages over all possible ancestral
sequences. ML has long been known to be statistically consistent --
that is, it converges on the correct tree with probability approaching
1 as the sequence length grows. However, the statistical consistency of
AML has not been formally determined, despite informal remarks in a
literature that dates back 20 years. In this short note we prove a general
result that implies that AML is statistically inconsistent. In particular
we show that AML can `shrink' short edges in a tree, resulting in a tree
that has no internal resolution as the sequence length grows. Our
results apply to any number of taxa.
Phylogenies without Branch Bounds: Contracting the Short, Pruning the Deep
Proceedings of RECOMB 2009, 451-465.
With C. Daskalakis, E. Mossel.
Journal version in SIAM J. Discrete Math., 25(2):872-893, 2011.
We introduce a new phylogenetic reconstruction algorithm which,
unlike most previous rigorous inference techniques, does not rely
on assumptions regarding the branch lengths or the depth of the tree.
The algorithm returns a forest which is guaranteed to contain all edges
that are: 1) sufficiently long and 2) sufficiently close to the leaves.
How much of the true tree is recovered depends on the sequence length provided.
The algorithm is distance-based and runs in polynomial time.
Sequence-Length Requirement of Distance-Based Phylogeny Reconstruction: Breaking the Polynomial Barrier
Proceedings of IEEE FOCS 2008, 729-738.
We introduce a new distance-based phylogeny reconstruction technique
which provably achieves, at sufficiently short branch lengths,
a polylogarithmic sequence-length
requirement - improving significantly over previous polynomial
bounds for distance-based methods.
The technique is based on an averaging procedure that implicitly
reconstructs ancestral sequences.
In the same token, we extend previous results on phase transitions in phylogeny reconstruction to general
time-reversible models. More precisely, we show that in the so-called Kesten-Stigum zone (roughly,
a region of the parameter space where ancestral sequences are well approximated by ``linear combinations''
of the observed sequences) sequences of length $\poly(\log n)$ suffice for reconstruction
when branch lengths are discretized.
Here $n$ is the number of extant species.
Our results challenge, to some extent, the conventional wisdom that estimates of evolutionary distances alone
carry significantly
less information about phylogenies than full sequence datasets.
On Learning Thresholds of Parities and Unions of Rectangles in Random Walk Models
Random Structures and Algorithms, 31(4):406-417, 2007.
In a recent breakthrough, [Bshouty et al., 2003] obtained
the first passive-learning algorithm for DNFs under
the uniform distribution, using
random walks. They also showed that DNFs are learnable
in the weaker Noise Sensitivity model. We extend those results
in several directions. We first show that thresholds of
parities, a natural class encompassing DNFs, cannot be learned efficiently
in the Noise Sensitivity model
using only statistical queries. In contrast, we show that
a cyclic version of the Random Walk model allows to learn efficiently
polynomially weighted thresholds of parities.
We also extend the algorithm of Bshouty et al. to the case
of Unions of Rectangles, a natural generalization of DNFs to $\{0,\ldots,b-1\}^n$.
Slow Emergence of Cooperation for Win-Stay Lose-Shift on Trees
Machine Learning 67(1-2):7-22, 2007. (Special Issue on Learning and Computational Game Theory)
With E. Mossel.
We consider a group of agents on a graph who repeatedly play the prisoner's
dilemma game against their neighbors. The players adapt their actions to the
past behavior of their opponents by applying the win-stay lose-shift strategy.
On a finite connected graph, it is easy to see that the system learns
to cooperate
by converging to the all-cooperate state in a finite time. We analyze the rate
of convergence in terms of the size and structure of the graph. [Dyer
et al., 2002]
showed that the system converges rapidly on the cycle, but that it takes a time
exponential in the size of the graph to converge to cooperation on the complete
graph. We show that the emergence of cooperation is exponentially slow in some
expander graphs. More surprisingly, we show that it is also exponentially slow
in bounded-degree trees, where many other dynamics are known to converge
rapidly.
Upstream Reciprocity and the Evolution of Gratitude
If someone is nice to you, you feel good and may be inclined
to be nice to somebody else. This every day experience is borne out by experimental
games: the recipients of an act of kindness are more inclined to help in turn,
even if the person who benefits from their generosity is somebody else. This
behavior, which has been called `upstream reciprocity', appears to be a misdirected
act of gratitude: you help somebody because somebody else has helped you. Does
this make any sense from an evolutionary or game theoretic perspective? In this
paper, we show that upstream reciprocity alone does not lead to the evolution of
cooperation. But upstream reciprocity can evolve and increase the level of
cooperation if it is linked to either direct reciprocity or network reciprocity.
We calculate the random chains of altruistic acts that are induced by upstream
reciprocity. Our analysis shows that gratitude and other positive emotions,
which increase the willingness to help others, can evolve in the competitive world
of natural selection.
On the Submodularity of Influence in Social Networks
Proceedings of ACM STOC 2007, 128-134.
With E. Mossel.
Journal version in SIAM J. Comput. 39(6):2176-2188, 2010.
We prove and extend a conjecture of Kempe,
Kleinberg, and Tardos (KKT) on the spread of influence in social networks.
A social network can be represented by a directed graph
where the nodes are individuals and the edges indicate a form of social relationship.
A simple way to model the diffusion of ideas, innovative behavior, or ``word-of-mouth'' effects
on such a graph is to consider an increasing process of ``infected'' (or active) nodes:
each node becomes infected once an activation function of the set of its
infected neighbors crosses a certain threshold value.
Such a model was introduced
by KKT in~\cite{KeKlTa:03,KeKlTa:05} where the authors
also impose several natural assumptions: the threshold
values are (uniformly) random to account for our lack of knowledge
of the true values; and the activation functions
are monotone and submodular, i.e.~have ``diminishing returns.''
The monotonicity condition
indicates that a node is more likely to become active if more of its neighbors
are active, while the submodularity condition,
indicates that the marginal effect
of each neighbor is decreasing when the set of active
neighbors increases.
For an initial set of active nodes $S$, let
$\sigma(S)$
denote the expected number of active nodes at termination.
Here we prove a conjecture of KKT: we show that the function $\sigma(S)$ is
submodular under the assumptions above.
We prove the same result for the expected value
of any monotone, submodular function of the set of active nodes at termination.
In other words,
our results demonstrate that ``local'' submodularity is preserved
``globally'' under diffusion processes. This is of natural computational interest, as
many optimization problems have good approximation algorithms for
submodular functions.
In particular, our results coupled with an argument in~\cite{KeKlTa:03} imply that a greedy
algorithm gives an $(1-1/e-\eps)$-approximation algorithm for maximizing
$\sigma(S)$ among all sets $S$ of a given size. This result has important
practical implications for many social network analysis problems, notably
viral marketing.
First to Market is not Everything:
an Analysis of Preferential Attachment with Fitness
Proceedings of ACM STOC 2007, 135-144.
With C. Borgs, J. Chayes and C. Daskalakis.
The design of algorithms on complex
networks, such as routing, ranking or recommendation algorithms,
requires a detailed understanding of the growth
characteristics of the networks of interest, such as the Internet,
the web graph, social networks or online communities. To this end,
preferential attachment, in which the popularity (or relevance) of
a node is determined by its degree, is a well-known and appealing
random graph model, whose predictions are in accordance with
experiments on the web graph and several social networks. However,
its central assumption, that the popularity of the nodes depends
only on their degree, is not a realistic one, since every node has
potentially some intrinsic quality which can differentiate its
attractiveness from other nodes with similar degrees.
In this paper, we provide a rigorous analysis of preferential
attachment with fitness, suggested by Bianconi and Barabasi and
studied by Motwani and Xu, in which the degree of a
vertex is scaled by its quality to determine its attractiveness.
Including quality considerations in the classical preferential
attachment model provides a much more realistic description of
many complex networks, such as the web graph, and allows to
observe a much richer behavior in the growth dynamics of these
networks. Specifically, depending on the shape of the distribution
from which the qualities of the vertices are drawn, we observe
three distinct phases, namely a first-mover-advantage phase, a
fit-get-richer phase and an innovation-pays-off
phase. We precisely characterize the properties of the quality
distribution that result in each of these phases and we compute
the exact growth dynamics for each phase. The dynamics provide
rich information about the quality of the vertices, which can be
very useful in many practical contexts, including ranking
algorithms for the web, recommendation algorithms, as well as the
study of social networks. Furthermore, the mathematical techniques
we introduce to establish these dynamics could be applicable to a
wide variety of problems.
Learning nonsingular phylogenies and hidden Markov models
Annals of Applied Probability, 16(2):583-614, 2006.
With E. Mossel.
Conference abstract in Proceedings of ACM STOC 2005, 366-375.
In this paper, we study the problem of
learning phylogenies and hidden Markov models.
We call the Markov model nonsingular
if all transtion matrices have determinants
bounded away from $0$ (and $1$).
We highlight the role of the nonsingularity condition for the learning problem.
Learning hidden Markov models without the nonsingularity condition
is at least as hard as learning parity with noise.
On the other hand, we give a polynomial-time algorithm for learning
nonsingular phylogenies and hidden Markov models.
A smoothing heuristic for a bilevel pricing problem
European Journal of Operational Research, 174(3):1396-1413, 2006.
With J.P. Dussault, P.Marcotte, G. Savard.
We provide a global heuristic for a class of bilevel programs with
bilinear objectives and linear contraints. The algorithm relies on
an interior point approach that can be seen as a combination of the
smoothing
and implicit programming techniques well-known to the MPEC litterature.
But contrary to what is usually done there, here the focus is on global
optimality. We tested the algorithm on large-scale instances of a network
pricing problem, one of the main applications of this model. The results
show that, on hard instances, the heuristic is a good alternative to
exact solvers based on mixed 0-1 programming formulations.
A Short Proof that Phylogenetic Tree Reconstruction by Maximum Likelihood is Hard
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 3(1):92-94, 2006.
Maximum likelihood is one of the most widely used techniques to infer
evolutionary histories. Although it is thought to
be intractable, a proof of its hardness has been lacking. Here, we give
a short proof that computing the maximum
likelihood tree is NP-hard by exploiting a connection between likelihood
and parsimony observed by Tuffley and Steel.
The Kesten-Stigum Reconstruction Bound Is Tight for Roughly Symmetric Binary
Channels
Proceedings of IEEE FOCS 2006, 518-530.
With C. Borgs, J. Chayes, and E. Mossel.
We establish the exact threshold for the reconstruction problem for a
binary asymmetric channel on the b-ary tree,
provided that the asymmetry is sufficiently small. This is the first
exact reconstruction threshold obtained in roughly
a decade. We discuss the implications of our result for Glauber dynamics,
phylogenetic reconstruction, and
so-called ``replica symmetry breaking'' in spin glasses and random
satisfiability problems.
Optimal Phylogenetic Reconstruction
Proceedings of ACM STOC 2006, 159-168.
With C. Daskalakis, E. Mossel.
Journal version in Probability Theory and Related Fields, 149(1-2):149-189, 2011.
It is well known that in order to reconstruct a tree on $n$ leaves,
sequences of length $\Omega(\log n)$ are needed. It was conjectured
by M. Steel that for the CFN evolutionary model, if the mutation
probability on all edges of the tree is less than $p^{\ast} = (\sqrt{2}-1)/2^{3/2}$
than the tree can be recovered from sequences of length $O(\log n)$. This was proven
by the second author in the special case where the tree is ``balanced''.
The second author also proved that if all edges have mutation probability larger
than $p^{\ast}$ then the length needed is $n^{\Omega(1)}$. This ``phase-transition'' in
the number of samples needed is closely related to the phase transition for the
reconstruction problem (or extremality of free measure) studied extensively
in statistical physics and probability.
Here we complete the proof of Steel's conjecture and give a reconstruction algorithm
using optimal (up to a multiplicative constant) sequence length. Our results further
extend to obtain optimal reconstruction algorithm for the Jukes-Cantor model with
short edges. All reconstruction algorithms run in time polynomial in the sequence
length.
The algorithm and the proofs are based on a novel combination of combinatorial,
metric and probabilistic arguments.
Bounding Fastest Mixing
Electronic Communications in Probability, 10:282-296, 2005.
In a series of recent works,
Boyd, Diaconis, and their co-authors
have introduced a semidefinite programming approach
for computing
the fastest mixing Markov chain on a graph of allowed transitions,
given a target stationary distribution. In this paper, we show that
standard mixing-time analysis techniques---variational
characterizations, conductance, canonical paths---can
be used to give simple, nontrivial lower and upper bounds
on the fastest mixing time. To test the applicability of this idea,
we consider several detailed examples including
the Glauber dynamics of the Ising model---and get sharp bounds.
Design and Analysis of an Approximation Algorithm for Stackelberg Network Pricing
Networks, 46(1):57-67, 2005.
With P. Marcotte, G. Savard.
We consider the problem of maximizing the revenue
raised from tolls set on the arcs of a transportation network, under
the constraint that users are assigned to toll-compatible shortest
paths. We first provide a polynomial algorithm with a worst-case
guarantee of $O(\log m_T)$, where $m_T$ denotes the number of toll
arcs. Next we show that the approximation is tight with respect to the
linear programming relaxation by constructing examples for which the
integrality gap is reached.
Learning nonsingular phylogenies and hidden Markov models
Proceedings of ACM STOC 2005, 366-375.
With E. Mossel.
Journal version in Annals of Applied Probability, 16(2):583-614, 2006.
In this paper, we study the problem of
learning phylogenies and hidden Markov models.
We call the Markov model nonsingular
if all transtion matrices have determinants
bounded away from $0$ (and $1$).
We highlight the role of the nonsingularity condition for the learning problem.
Learning hidden Markov models without the nonsingularity condition
is at least as hard as learning parity with noise.
On the other hand, we give a polynomial-time algorithm for learning
nonsingular phylogenies and hidden Markov models.
Transient Growth in Taylor-Couette Flow
Physics of Fluids, 14(10), 2002.
With H. Hristova, P. Schmid, L. Tuckerman.
Conference abstract in Theoretical and Computational Fluid Dynamics 16:43-48, 2002.
Transient growth due to non-normality is investigated
for the Taylor-Couette problem with counter-rotating
cylinders as a function of aspect ratio $\eta$ and Reynolds
number $Re$. For all $Re \leq 500$, transient growth is
enhanced by curvature, i.e. is greater for $\eta < 1$
than for $\eta = 1$, the plane Couette limit. For fixed
$Re < 130$ it is found that the greatest transient growth
is achieved for $\eta$ between the Taylor-Couette linear
stability boundary, if it exists, and one, while for
$Re > 130$ the greatest transient growth is achieved
for $\eta$ on the linear stability boundary. Transient
growth is shown to be approximately 20% higher near the
linear stability boundary at $Re = 310$, $\eta = 0.986$
than at $Re = 310$, $\eta = 1$, near the threshold observed
for transition in plane Couette. The energy in the optimal
inputs is primarily meridional; that in the optimal outputs
is primarily azimuthal. Pseudospectra are calculated for
two contrasting cases. For large curvature, $\eta = 0.5,$
the pseudospectra adhere more closely to the spectrum than
in a narrow gap case, $\eta = 0.99$.
Non-colliding Random Walks, Tandem Queues and Discrete Orthogonal Polynomial Ensembles
Electronic Journal of Probability, 7:1-24, 2002.
With W. Koenig, Neil O'Connell.
We show that the function $h(x)=\prod_{i<j}(x_j-x_i)$ is harmonic for any
random walk in $\R^k$ with exchangeable increments, provided the required moments exist.
For the subclass of random walks which can only exit the Weyl chamber
$W=\{x\colon x_1<x_2<\cdots<x_k\}$ onto a point where $h$ vanishes,
we define the corresponding Doob $h$-transform. For certain special cases,
we show that the marginal distribution of the conditioned process at a fixed time
is given by a familiar discrete orthogonal polynomial ensemble. These include the
Krawtchouk and Charlier ensembles, where the underlying walks are binomial and
Poisson, respectively. We refer to the corresponding conditioned processes in
these cases as the Krawtchouk and Charlier processes. In [O'Connell and Yor (2001b)],
a representation was obtained for the Charlier process by considering a sequence
of $M/M/1$ queues in tandem. We present the analogue of this representation theorem
for the Krawtchouk process, by considering a sequence of discrete-time $M/M/1$ queues
in tandem. We also present related results for random
walks on the circle, and relate a system of non-colliding walks in this case
to the discrete analogue of the circular unitary ensemble (CUE).
Transient growth in exactly counter-rotating Couette-Taylor flow
Theoretical and Computational Fluid Dynamics 16:43-48, 2002.
With H. Hristova, P. Schmid, L. Tuckerman.
Journal version in Physics of Fluids, 14(10), 2002.
Transient growth due to non-normality is investigated for the Taylor-Couette problem with
counter-rotating cylinders as a function of aspect ratio $\eta$ and Reynolds number $Re$.
For all $Re \leq 500$, transient growth is enhanced by curvature, i.e. is greater for
$\eta < 1$ than for $\eta = 1$, the plane Couette limit. For fixed $Re < 130$ it is
found that the greatest transient growth is achieved for $\eta$ between the Taylor-Couette
linear stability boundary, if it exists, and one, while for $Re > 130$ the greatest
transient growth is achieved for $\eta$ on the linear stability boundary. Transient growth
is shown to be approximately 20% higher near the linear stability boundary at $Re = 310$,
$\eta = 0.986$ than at $Re = 310$, $\eta = 1$, near the threshold observed for transition
in plane Couette. The energy in the optimal inputs is primarily meridional; that in the
optimal outputs is primarily azimuthal. Pseudospectra are calculated for two contrasting
cases. For large curvature, $\eta = 0.5,$ the pseudospectra adhere more closely to the
spectrum than in a narrow gap case, $\eta = 0.99$.