Statistics Seminar, Session 11

Date Tuesday, July 5, 2022
14:55 - 16:35
Venue Hybrid format
Speaker Tomoki Tokuda (Earthquake Research Institute)
Title Multiple clustering based on nonparametric mixture models for Gaussian and Wishart distributions
Abstract

For high-dimensional data, clustering objects is not straightforward because not all features are relevant to a particular cluster solution. Some features may be irrelevant (noisy) for one cluster solution but relevant for another. In general, in the high-dimensional case, one may assume multiple cluster solutions, each depending on a specific subset of features. In this situation, a conventional clustering method, which returns a single cluster solution, cannot reveal the underlying cluster structure characterized by multiple cluster solutions. Despite the wide availability of high-dimensional data, effective methods for finding such cluster structures remain underdeveloped.

In this talk, we discuss two novel clustering methods that are useful for revealing the underlying multiple cluster structure. The first method (Tokuda et al., PLOS ONE, 2017) is based on Gaussian mixture models in which features are automatically partitioned into subsets, each of which yields its own cluster solution. This feature partition works as feature selection for a particular cluster solution, screening out irrelevant features. The method simultaneously optimizes the feature partition and the multiple clustering, inferring the number of feature subsets and the number of clusters via the Dirichlet process. Further, to make the method applicable to high-dimensional data, a co-clustering structure is introduced for each subset of features. Moreover, different distribution families, such as Gaussian, Poisson, and multinomial distributions, are modeled simultaneously in each cluster block, which widens the range of applications to real data.
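
As a rough illustration of how a Dirichlet process lets the number of feature subsets and the number of clusters stay open-ended, the sketch below draws a multiple-clustering structure from a Chinese restaurant process prior: features are first partitioned into subsets ("views"), and objects are then partitioned independently within each view. This only samples from a prior; it is not the inference procedure of the cited paper, and the names crp_partition, sample_multiple_clustering, alpha_view, and alpha_cluster are hypothetical.

```python
import random


def crp_partition(n_items, alpha, rng):
    """Draw a random partition of n_items from a Chinese restaurant process.

    Each item joins existing group k with weight counts[k], or opens a
    new group with weight alpha, so the number of groups is not fixed
    in advance.
    """
    labels, counts = [], []
    for _ in range(n_items):
        weights = counts + [alpha]  # last entry = "open a new group"
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        labels.append(k)
    return labels, counts


def sample_multiple_clustering(n_objects, n_features,
                               alpha_view=1.0, alpha_cluster=1.0, seed=0):
    """Sample a feature partition plus one object clustering per view.

    alpha_view and alpha_cluster are illustrative concentration
    parameters (assumptions, not values from the paper).
    """
    rng = random.Random(seed)
    # Step 1: partition the features into views.
    feature_view, view_sizes = crp_partition(n_features, alpha_view, rng)
    # Step 2: within each view, partition the objects independently.
    object_clusters = [
        crp_partition(n_objects, alpha_cluster, rng)[0] for _ in view_sizes
    ]
    return feature_view, object_clusters


feature_view, object_clusters = sample_multiple_clustering(
    n_objects=30, n_features=12, seed=1
)
print("view of each feature:", feature_view)
for v, clusters in enumerate(object_clusters):
    print(f"view {v}: {max(clusters) + 1} object clusters")
```

Each view carries its own clustering of the same objects, which is the sense in which a feature subset is paired with a cluster solution.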

The second method (Tokuda et al., Neural Netw, 2021) is based on Wishart mixture models and applies to correlation matrices of connectivity data without vectorization. What is distinctive about this method is that the multiple cluster solutions are based on particular networks of nodes, optimized in a data-driven manner. Hence, it can identify the underlying pairings between a cluster solution and a node sub-network. The key assumption of the method is independence among sub-networks, which is effectively addressed by whitening the correlation matrices.
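
To make the whitening step concrete, here is a minimal sketch of one common form of whitening for a collection of correlation matrices: each matrix is transformed by the inverse square root of the sample mean matrix, so that the transformed matrices average to the identity. Whether the cited paper uses exactly this transform is an assumption, and the names inverse_sqrt and whiten_matrices, as well as the simulated data, are hypothetical.

```python
import numpy as np


def inverse_sqrt(mat):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T


def whiten_matrices(mats):
    """Transform each matrix S to W S W^T with W = mean(S)^(-1/2).

    After the transform the matrices average to the identity, i.e.
    nodes are uncorrelated on average. This is an illustrative choice
    of whitening, not necessarily the one used in the paper.
    """
    mean = np.mean(mats, axis=0)
    w = inverse_sqrt(mean)
    return np.array([w @ m @ w.T for m in mats])


# Simulated stand-in for connectivity data: 10 subjects, 4 nodes,
# 50 time points each (purely illustrative).
rng = np.random.default_rng(0)
corr_mats = np.array([
    np.corrcoef(rng.normal(size=(50, 4)), rowvar=False) for _ in range(10)
])
whitened = whiten_matrices(corr_mats)
print(np.round(whitened.mean(axis=0), 6))
```

The matrices stay in matrix form throughout, in line with the abstract's point that the method works on correlation matrices without vectorization.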

Finally, we apply these methods to brain imaging data (MRI) in neuroscience. The first application identifies subtypes of depressive disorder (Tokuda et al., SciRep, 2018). The second reveals a relationship between brain networks and psychiatric disorders (Tokuda et al., Front. Psychiatry, 2021). We demonstrate the usefulness and power of the proposed methods in these applications.