Photo: Sylvain Crouzet
I am a
Vilas Distinguished Achievement Professor
in the
Department of Mathematics
at the
University of Wisconsin-Madison,
where I belong to the
Probability
Group and
Applied Mathematics
Group.
I have an affiliate appointment in the
Department of Statistics and
I am also affiliated with the Theory of Computing Group
in the
Department of Computer Sciences.
I can often be found at the
Wisconsin Institute for Discovery.
I am on the executive committee of the
Institute for Foundations of Data Science (IFDS),
a collaboration with the University of
Washington,
the University of California, Santa Cruz,
and the University of Chicago,
funded through the NSF TRIPODS Phase II program.
I work at the intersection of applied probability,
statistics and
theoretical computer science, with an emphasis on biological
applications.
More details on my research interests and
publications can be found
here.
My work is currently supported by NSF grants
DMS-2023239 (TRIPODS Phase II)
and
DMS-2308495,
as well as a Van Vleck Research Professor Award
and a Vilas Distinguished Achievement Professorship.
My CV
is
here.
External Profiles
arXiv,
Google Scholar,
dblp,
PubMed,
ORCID,
Math Genealogy,
Math Alliance
News
October 2024: In Spring 2025, I will teach a pilot for a
new
linear algebra course
targeting students interested in AI, data science, etc. The course,
entitled "Linear Algebra and Optimization" (a section of MATH 491), will cover
introductory linear algebra concepts (similarly to MATH 320, 340)
as well as aspects of differential calculus in several variables and basic optimization theory
(partly covered in MATH 234) - combining in a unified coverage two major pillars of modern
data-driven and computational fields.
July 2024: I gave an invited plenary talk at
SIAM DM24.
March 2024: I gave an IMS Mediallion Lecture at
SSP 2024.
Feb 2024: I was named Vilas Distinguished Achievement Professor.
Jan 2024: I am on the program committee of the Great Lakes Bioinformatics
Conference (GLBIO 2024) which will take place May 13-16, 2024 at
the University of Pittsburgh. You can register
here.
Dec 2023: My book Modern Discrete Probability: An Essential Toolkit is now
available to pre-order on Amazon. It will appear in January 2024.
Dec 2023: Our proposal, with Tandy Warnow and
Siavash Mirarab,
for an IMSI workshop
on Contemporary Challenges in Large-Scale Sequence Alignments and Phylogenies: Bridging Theory and Practice
has been approved. It will take place in August 2025. More details here.
Dec 2023: My Ph.D. student
Max Hill successfully defended his thesis. Congratulations!
Oct 2023: I received a Van Vleck Research Professor Award.
Sep 2023: I am starting a three-year term on
the Academic Planning Council
of the College of Letters and Science.
Aug 2023: I have a new NSF grant.
May 2023: My Ph.D. student
Yu Sun successfully defended his thesis. Congratulations!
Mar 2023: Our new course MATH 444: Graphs and Networks in Data Science, developed jointly
with Hanbaek Lyu, has been approved
and will be offered in Fall 2023.
Jan 2023: My Ph.D. student
Shuqi Yu successfully defended her thesis. Congratulations!
Jul 2022: Our proposal for a semester-long program
on Meeting the Genomics Data Challenge: Theory, Methods, and Applications of Quantitative Phylogenomics
at the Institute for Computational and Experimental Research in Mathematics (ICERM)
has been approved. The program (led by Elizabeth Allman and Laura Kubatko)
will take place Fall 2024. More details to come.
Apr 2022: I was named Fellow of the Institute of Mathematical Statistics (IMS).
For previous news, see here.
Teaching
Fall 2023: MATH 833 - Modern Discrete Probability [Topics in Probability]
Spring 2023: MATH 632 - Introduction to stochastic processes
Lecture notes and tutorials
Topics course on
high-dimensional probability and statistics
Advanced undergraduate course on the mathematics of data
Graduate course on
modern discrete probability
Topics course on
stochastic processes
in evolutionary genetics
First year of
graduate probability theory
Brief survey of
mathematical phylogenetics
Tutorial on
sequence-length requirements in phylogenetics
Summer school slides on
probability on graphs with applications to data science
Tutorial slides on
mathematical phylogenomics
Selected Publications (full list here)
Species tree estimation under joint modeling of coalescence and duplication:
sample complexity of quartet methods
Annals of Applied Probability, 32(6): 4681-4705, 2022. With Max Hill and Brandon Legried.
We consider species tree estimation under a standard stochastic model of gene
tree evolution that incorporates incomplete lineage sorting (as modeled by a
coalescent process) and gene duplication and loss (as modeled by a branching
process). Through a probabilistic analysis of the model, we derive sample
complexity bounds for widely used quartet-based inference methods that
highlight the effect of the duplication and loss rates in both subcritical and
supercritical regimes.
Polynomial-Time Statistical Estimation of Species Trees Under Gene Duplication and Loss
Journal of Computational Biology, 28(5):452-468. With Brandon Legried, Erin Molloy, and Tandy Warnow.
Conference version in Proceedings of RECOMB 2020, 120-135.
Phylogenomics—the estimation of species trees from multi-locus datasets—is a
common step in many biological studies. However, this estimation is challenged
by the fact that genes can evolve under processes, including incomplete lineage
sorting (ILS) and gene duplication and loss (GDL), that make their trees
different from the species tree. In this paper, we address the challenge of
estimating the species tree under GDL. We show that species trees are
identifiable under a standard stochastic model for GDL, and that the
polynomial-time algorithm ASTRAL-multi, a recent development in the ASTRAL
suite of methods, is statistically consistent under this GDL model. We also
provide a simulation study evaluating ASTRAL-multi for species tree estimation
under GDL. All scripts and datasets used in this study are available on the
Illinois Data Bank: https://doi.org/10.13012/B2IDB-2626814_V1.
Generalized least squares can overcome the critical threshold in respondent-driven sampling
Proceedings of the National Academy of Sciences, 115(41):10299-10304, 2018. With Karl Rohe.
In order to sample marginalized and/or hard-to-reach populations,
respondent-driven sampling (RDS) and similar techniques reach their
participants via peer referral. Under a Markov model for RDS, previous
research has shown that if the typical participant refers too many
contacts, then the variance of common estimators does not decay like
O(1/n), where n is the sample size. This implies that confidence
intervals will be far wider than under a typical sampling design. Here
we show that generalized least squares (GLS) can effectively reduce the
variance of RDS estimates. In particular, a theoretical analysis
indicates that the variance of the GLS estimator is O(1/n). We then
derive two classes of feasible GLS estimators. The first class is based
upon a Degree Corrected Stochastic Blockmodel for the underlying social
network. The second class is based upon a rank-two model. It might be of
independent interest that in both model classes, the theoretical results
show that it is possible to estimate the spectral properties of the
population network from the sampled observations. Simulations on
empirical social networks show that the feasible GLS (fGLS) estimators
can have drastically smaller error and rarely increase the error. A
diagnostic plot helps to identify where fGLS will aid estimation. The
fGLS estimators continue to outperform standard estimators even when
they are built from a misspecified model and when there is preferential
recruitment.
Circular Networks from Distorted Metrics
Proceedings of RECOMB 2018, 167-176. With J. Wang.
(
Best Paper Award, RECOMB 2018)
Trees have long been used as a graphical representation of species relationships.
However complex evolutionary events, such as genetic reassortments or hybrid
speciations which occur commonly in viruses, bacteria and plants, do not fit
into this elementary framework. Alternatively, various network representations
have been developed. Circular networks are a natural generalization of leaf-labeled
trees interpreted as split systems, that is, collections of bipartitions over leaf
labels corresponding to current species. Although such networks do not explicitly
model specific evolutionary events of interest, their straightforward visualization
and fast reconstruction have made them a popular exploratory tool to detect network-like
evolution in genetic datasets.
Standard reconstruction methods for circular networks, such as Neighbor-Net, rely
on an associated metric on the species set. Such a metric is first estimated from DNA
sequences, which leads to a key difficulty: distantly related sequences produce
statistically unreliable estimates. This is problematic for Neighbor-Net as it is based
on the popular tree reconstruction method Neighbor-Joining, whose sensitivity to distance
estimation errors is well established theoretically. In the tree case, more robust
reconstruction methods have been developed using the notion of a distorted metric,
which captures the dependence of the error in the distance through a radius of accuracy.
Here we design the first circular network reconstruction method based on distorted
metrics. Our method is computationally efficient. Moreover, the analysis of its radius
of accuracy highlights the important role played by the maximum incompatibility, a
measure of the extent to which the network differs from a tree.
Distance-based species tree estimation
under the coalescent: information-theoretic trade-off between number of loci and sequence length
Ann. Appl. Probab., 27(5)-2926-2955, 2017. With E. Mossel.
Conference abstract in Proceedings of RANDOM 2015, 931-942.
We consider the reconstruction of a phylogeny
from multiple genes under the multispecies coalescent.
We establish a connection with the sparse signal detection
problem, where one seeks to distinguish between
a distribution and a mixture of the distribution
and a sparse signal. Using this connection,
we derive an information-theoretic trade-off
between the number of genes, $m$, needed for an accurate
reconstruction and the sequence length, $k$, of the
genes. Specifically, we show that to detect
a branch of length $f$, one needs $m = \Theta(1/[f^{2} \sqrt{k}])$ genes.
Phase transition in the sample complexity of likelihood-based
phylogeny inference
Probability Theory and Related Fields, 169(1), 3-62, 2017. With A. Sly.
Reconstructing evolutionary trees from molecular sequence data
is a fundamental problem in computational biology. Stochastic
models of sequence evolution are closely related to spin systems
that have been extensively studied in statistical physics and
that connection has led to important insights on the theoretical
properties of phylogenetic reconstruction algorithms as well as the
development of new inference methods. Here, we study maximum
likelihood, a classical statistical technique which is perhaps
the most widely used in phylogenetic practice because of its
superior empirical accuracy.
At the theoretical level, except for its consistency, that is,
the guarantee of eventual correct reconstruction as the size of
the input data grows, much remains to be understood about the
statistical properties of maximum likelihood in this context.
In particular, the best bounds on the sample complexity or
sequence-length requirement of maximum likelihood, that is,
the amount of data required for correct reconstruction, are
exponential in the number, n, of tips---far from known lower
bounds based on information-theoretic arguments. Here we close
the gap by proving a new upper bound on the sequence-length
requirement of maximum likelihood that matches up to constants
the known lower bound for some standard models of evolution.
More specifically, for the r-state symmetric model of sequence
evolution on a binary phylogeny with bounded edge lengths, we
show that the sequence-length requirement behaves logarithmically
in n when the expected amount of mutation per edge is below what
is known as the Kesten-Stigum threshold. In general, the
sequence-length requirement is polynomial in n. Our results imply
moreover that the maximum likelihood estimator can be computed
efficiently on randomly generated data provided sequences are as above.
On the robustness to gene tree estimation error (or lack thereof) of
coalescent-based species tree methods
Systematic Biology, 64(4):663--676, 2015. With T. Warnow.
The estimation of species trees using multiple loci has become increasingly
common. Because different loci can have different phylogenetic histories
(reflected in different gene tree topologies) for multiple biological causes,
new approaches to species tree estimation have been developed that take gene
tree heterogeneity into account. Among these multiple causes, incomplete
lineage sorting (ILS), modeled by the multi-species coalescent, is potentially
the most common cause of gene tree heterogeneity, and much of the focus of
the recent literature has been on how to estimate species trees in the presence
of ILS. Despite progress in developing statistically consistent techniques for
estimating species trees when gene trees can differ due to ILS, there is
substantial controversy in the systematics community as to whether to use the
new coalescent-based methods or the traditional concatenation methods.
One of the key issues that has been raised is understanding the impact of
gene tree estimation error on coalescent-based methods that operate by
combining gene trees. Here we explore the mathematical guarantees of
coalescent-based methods when analyzing estimated rather than true gene trees.
Our results provide some insight into the differences between promise of
coalescent-based methods in theory and their performance in practice.
Likelihood-based tree reconstruction on a concatenation of aligned
sequence data sets can be statistically inconsistent
Theoretical Population Biology, 100:56-62, 2015. With M. Steel.
(Honorable Mention,
2018
Marcus W. Feldman Prize in Theoretical Population Biology.)
The reconstruction of a species tree from genomic data faces a double hurdle.
First, the (gene) tree describing the evolution of each gene may differ from
the species tree, for instance, due to incomplete lineage sorting. Second,
the aligned genetic sequences at the leaves of each gene tree provide merely
an imperfect estimate of the topology of the gene tree. In this note, we
demonstrate formally that a basic statistical problem arises if one tries
to avoid accounting for these two processes and analyses the genetic data
directly via a concatenation approach. More precisely, we show that, under
the multi-species coalescent with a standard site substitution model, maximum
likelihood estimation on sequence data that has been concatenated across
genes and performed under the incorrect assumption that all sites have evolved
independently and identically on a fixed tree is a statistically inconsistent
estimator of the species tree. Our results provide a formal justification of
simulation results described of Kubatko and Degnan (2007) and others, and
complements recent theoretical results by DeGorgio and Degnan (2010) and
Chifman and Kubtako (2014).
Alignment-Free Phylogenetic Reconstruction: Sample Complexity via a Branching Process Analysis
Annals of Applied Probability, 23(2):693-721, 2013. With C. Daskalakis.
Conference abstract in Proceedings of RECOMB 2010, 123-137.
We present an efficient phylogenetic reconstruction algorithm,
allowing insertions and deletions, which provably achieves a
sequence-length requirement (or sample complexity) growing
polynomially in the number of taxa. Our algorithm is distance-based,
that is, it relies on pairwise sequence comparisons. More importantly,
our approach largely bypasses the difficult problem of multiple
sequence alignment.
Submodularity of Influence in Social Networks: From Local to Global
SIAM J. Comput., 39(6):2176-2188, 2010.
With E. Mossel.
Conference abstract in Proceedings of ACM STOC 2007, 128-134.
Social networks are often represented as directed graphs, where the nodes are individuals and the edges indicate a form of social relationship. A simple way to model the diffusion of ideas, innovative behavior, or word-of-mouth effects on such a graph is to consider an increasing process of infected (or active) nodes: each node becomes infected once an activation function of the set of its infected neighbors crosses a certain threshold value. Such a model was introduced by Kempe, Kleinberg, and Tardos (KKT) in [Maximizing the spread of influence through a social network, in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003, pp. 137146] and [Influential nodes in a diffusion model for social networks, in Proceedings of the 32nd International Colloquium on Automata, Languages and Programming (ICALP), 2005], where the authors also impose several natural assumptions: the threshold values are random and the activation functions are monotone and submodular. The monotonicity condition indicates that a node is more likely to become active if more of its neighbors are active, while the submodularity condition indicates that the marginal effect of each neighbor is decreasing when the set of active neighbors increases. For an initial set of active nodes $S$, let $\sigma(S)$ denote the expected number of active nodes at termination. Here, we prove a conjecture of KKT: we show that the function $\sigma(S)$ is submodular under the assumptions above. We prove the same result for the expected value of any monotone, submodular function of the set of active nodes at termination. Roughly, our results demonstrate that local submodularity is preserved globally under this diffusion process. This is of natural computational interest, as many optimization problems have good approximation algorithms for submodular functions.
Toward Extracting All Phylogenetic Information from Matrices of Evolutionary Distances
Science, 327(5971):1376 - 1379, 2010. (Posted by permission of the AAAS for personal use, not for redistribution.)
The matrix of evolutionary distances is a model-based statistic, derived from molecular sequences, summarizing the pairwise phylogenetic relationships between a collection of species. Phylogenetic tree reconstruction methods relying on this matrix are relatively fast and thus widely used in molecular systematics. However, because of their intrinsic reliance on summary statistics, distance-matrix methods are assumed to be less accurate than likelihood-based approaches. In this paper, pairwise sequence comparisons are shown to be more powerful than previously hypothesized. A statistical analysis of certain distance-based techniques indicates that their data requirement for large evolutionary trees essentially matches the conjectured performance of maximum likelihood methods---challenging the idea that summary statistics lead to suboptimal analyses. On the basis of a connection between ancestral state reconstruction and distance averaging, the critical role played by the covariances of the distance matrix is identified.
A Short Proof that Phylogenetic Tree Reconstruction by Maximum Likelihood is Hard
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 3(1):92-94, 2006.
Maximum likelihood is one of the most widely used techniques to infer
evolutionary histories. Although it is thought to
be intractable, a proof of its hardness has been lacking. Here, we give
a short proof that computing the maximum
likelihood tree is NP-hard by exploiting a connection between likelihood
and parsimony observed by Tuffley and Steel.
A full list of publications is available here.
Contact Information
Office: Van Vleck 823
Phone: 608-263-3053
Fax: 608-263-8891
lastname[at]math[dot]wisc[dot]edu
Department of Mathematics
University of Wisconsin-Madison
480 Lincoln Drive
Madison, WI 53706