Research Seminar Series in Statistics and Mathematics

Location: Wirtschaftsuniversität Wien, D4.4.008, on 8 March 2019. Starts at 09:00, ends at 10:30.

Organizer: Institute for Statistics and Mathematics

Antonietta Mira (Data Science Lab, Director | ICS, USI, Switzerland and Università dell’Insubria, Italy) about "Bayesian dimensionality reduction via identifications of data intrinsic dimensions"

The Institute for Statistics and Mathematics (Department of Finance, Accounting and Statistics) cordially invites everyone interested to attend the talks in our Research Seminar Series, where internationally renowned scholars from leading universities present and discuss their (working) papers.
No registration required.

The list of talks for the summer term 2019 is available via the following link: www.wu.ac.at/en/statmath/resseminar

Abstract:
Even when defined in a high-dimensional space, data points usually lie on one or more hypersurfaces, or manifolds, with much smaller intrinsic dimension (ID). The recent TWO-NN method (Facco et al., 2017, Scientific Reports) estimates the ID when all points lie on a single sub-manifold.
TWO-NN assumes only that the density of points is approximately constant in a small neighborhood around each point. Under this hypothesis, the ratio of the distance from a point to its second nearest neighbor to the distance to its first nearest neighbor follows a Pareto distribution that depends parametrically only on the ID. We first extend the TWO-NN model to the case in which the data lie on several sub-manifolds, each with its own ID. While the idea behind the model extension is simple (the Pareto is replaced by a finite mixture of K Pareto distributions), a non-trivial Bayesian algorithm is required for estimating the model and assigning each point to its own manifold. Applying this method, which we dub Hidalgo (Heterogeneous Intrinsic Dimension ALGOrithm), we uncover a surprising ID variability in several real-world datasets. In fact, we are able to show how this methodology helps to discover latent clusters hidden in data of different nature, ranging from protein folding trajectories to financial indices computed on balance sheets. Hidalgo obtains remarkable results, but its main limitation is that the number of sub-manifolds, i.e. of components in the mixture, must be fixed a priori. To overcome this issue we employ a flexible Bayesian nonparametric approach and model the data as an infinite mixture of Pareto distributions using a Dirichlet Process Mixture Model. This framework makes it possible to evaluate the uncertainty about the number of mixture components and about the assignment of data points to sub-manifolds. Since the posterior distribution has no closed form, we perform inference with the slice sampler algorithm. In preliminary analyses on simulated and well-known datasets (e.g. Fisher's Iris dataset), the fully Bayesian nonparametric version of TWO-NN provides promising results, making it possible to recover a rich data structure starting from the intrinsic dimension, a purely geometric data feature, and requiring only the definition of a distance measure.
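
The single-manifold idea in the abstract can be illustrated with a minimal sketch. It assumes Euclidean distance and a Pareto distribution with unit scale and shape equal to the ID d for the ratio mu = r2 / r1, in which case the maximum-likelihood estimate is d = N / sum(log mu_i). The function name `two_nn_id` and the toy data are hypothetical choices for illustration; this is not the speakers' implementation, and it covers neither the Hidalgo mixture nor the nonparametric extension.

```python
import numpy as np

def two_nn_id(points):
    """Sketch of a TWO-NN style intrinsic-dimension estimate (illustrative only)."""
    x = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances; O(N^2) memory, acceptable for a small example.
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    # After partitioning, columns 0 and 1 hold the two smallest distances per row.
    nearest = np.partition(dist, 1, axis=1)[:, :2]
    r1, r2 = nearest[:, 0], nearest[:, 1]
    keep = r1 > 0  # drop exact duplicates, where the ratio is undefined
    mu = r2[keep] / r1[keep]
    # Assumed model: mu ~ Pareto(scale=1, shape=d), so the MLE is N / sum(log mu).
    return keep.sum() / np.log(mu).sum()

# Toy data (an assumption for this sketch): a 2-D uniform sheet embedded
# linearly in 5 dimensions, so the estimate should be close to 2.
rng = np.random.default_rng(0)
plane = rng.uniform(size=(500, 2)) @ rng.normal(size=(2, 5))
print(two_nn_id(plane))
```

The estimate uses only nearest-neighbor distances, which matches the abstract's point that the method requires nothing beyond a distance measure.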


