Algorithms and Inference for Mixture Models with Application to Protein Sequence Analysis

Youyi Fong, Department of Biostatistics, University of Washington

Mixture model-based clustering is a commonly used statistical tool. By combining bottom-up hierarchical clustering with partitional clustering, I propose new algorithms that are the only viable solutions to some challenging clustering problems, which tend to involve high-dimensional observations. Insights into why the new algorithms perform so well can be drawn from the stochastic local search literature. The clustering problem that motivates my study of these algorithms is the modeling of a protein family as a mixture of profile hidden Markov models, with the aim of identifying functional subgroups and thereby improving genome annotation. The Bayesian Information Criterion, although asymptotically consistent, turns out to over-penalize when used to select the number of mixture components in datasets of practical size. In contrast, Bayes factors with substantive priors, but not with the default prior, have satisfactory finite-sample performance. Differences between the default and substantive priors shed light on the role of priors in estimating the mixture order. This is joint work with Drs. Jon Wakefield and Ken Rice at the University of Washington.
