# Abstracts Research Seminar Summer Term 2012

## Justinas Pelenis: Bayesian Semiparametric Regression

This paper considers Bayesian estimation of restricted conditional moment models, with linear regression as a particular example. The standard practice in the Bayesian literature for linear regression and other semiparametric models is to use flexible families of distributions for the errors and to assume that the errors are independent of the covariates. However, a model with flexible covariate-dependent error distributions should be preferred for the following reasons. First, assuming that the error distribution is independent of the predictors might lead to inconsistent estimation of the parameters of interest when errors and covariates are dependent. Second, the prediction intervals obtained from a model with predictor-dependent error distributions are likely to be superior to the ones obtained assuming a constant error distribution. Third, modeling the conditional error density might allow one to obtain a more efficient estimator of the regression coefficients under heteroscedasticity. To address these issues, we develop a Bayesian semiparametric regression model with flexible predictor-dependent error densities and with the mean restricted by a conditional moment condition. Sufficient conditions to achieve posterior consistency of the regression parameters and conditional error densities are provided. In experiments, the proposed method compares favorably with classical and alternative Bayesian estimation methods for the estimation of the regression coefficients.

## Matt Taddy: Design of Text Mining Experiments

Design of experiments, which has long been a major area of interest in engineering and statistics, is becoming increasingly relevant for research in economics and marketing due to new sources of data from the internet and social media. Previous work by the author has looked at sequential design problems in the context of robust search optimization, wherein we were tasked with augmenting a local pattern search algorithm with search locations that had a high probability of improvement. In particular, parallel execution required optimal sets of multiple new search points rather than a single 'best' new location. This talk will revisit these ideas and look to problems in text-sentiment analysis as a new application area. Here, the goal is to predict variables that motivated language use (e.g., the author's political beliefs in a news article, or consumer satisfaction in a blog post). Typically, huge amounts of text are available, but obtaining sentiment-scored text samples for model training is very expensive. Hence we discuss methods for choosing optimal sub-samples from the available conversation, based upon topic-factor language decompositions and multinomial text regression models. The technology will be illustrated by scoring various sentiment indices for text data from the streaming Twitter feed.

## Alexander McNeil: Multivariate Stress Testing for Solvency

We show how the probabilistic concepts of half-space trimming and depth may be used to define convex scenario sets for stress testing the risk factors that affect the solvency of an insurance company over a prescribed time period. By choosing the scenario in the scenario set that minimises net asset value at the end of the time period, we propose the idea of the least solvent likely event (LSLE) as a solution to the forward stress testing problem.

We establish theoretical properties of the LSLE when financial risk factors can be assumed to have a linear effect on the net assets of an insurer. In particular, we show that the LSLE may be interpreted as a scenario causing a loss equivalent to the Value-at-Risk (VaR), at some confidence level, provided the VaR at that confidence level is a subadditive risk measure on linear combinations of the risk factors. In this case, we also show that the LSLE has an interpretation as a per-unit allocation of capital to the underlying risk factors when the overall capital is determined according to the VaR. These insights allow us to define alternative scenario sets that relate in similar ways to coherent measures, such as expected shortfall. We give a useful dual representation for such risk measures.

We also introduce the most likely ruin event (MLRE) as a solution to the problem of reverse stress testing.
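
The link between the LSLE and VaR in the linear case can be illustrated with a minimal sketch. It assumes, purely for illustration, that the risk factors are multivariate Gaussian, in which case the half-space depth set at level `alpha` is an ellipsoid and the worst-case scenario has a closed form. The function name `lsle_gaussian`, the loss weights `w`, and the numbers in the usage lines are hypothetical and not taken from the talk.

```python
from math import sqrt
from statistics import NormalDist


def lsle_gaussian(mu, sigma, w, alpha):
    """Worst-case scenario for the linear loss w'x over a Gaussian depth set.

    Assumption (not from the abstract): X ~ N(mu, sigma), so the scenario set
    is the ellipsoid {x : (x - mu)' sigma^{-1} (x - mu) <= z_alpha^2}.
    """
    z = NormalDist().inv_cdf(alpha)  # standard normal alpha-quantile
    # sigma @ w, written out with plain lists
    sigma_w = [sum(s_ij * w_j for s_ij, w_j in zip(row, w)) for row in sigma]
    # standard deviation of the linear loss w'X
    sd = sqrt(sum(w_i * sw_i for w_i, sw_i in zip(w, sigma_w)))
    # maximiser of w'x on the ellipsoid: mu + z * sigma w / sqrt(w' sigma w)
    scenario = [m + z * sw / sd for m, sw in zip(mu, sigma_w)]
    loss = sum(w_i * x_i for w_i, x_i in zip(w, scenario))
    return scenario, loss


# Hypothetical two-factor example.
mu = [0.0, 0.0]
sigma = [[0.04, 0.01], [0.01, 0.09]]
w = [1.0, 2.0]
scenario, loss = lsle_gaussian(mu, sigma, w, alpha=0.99)
print(scenario, loss)
```

Under this Gaussian assumption the loss in the returned scenario equals `w'mu + z_alpha * sqrt(w' sigma w)`, which is the VaR of the linear loss at level `alpha`, consistent with the interpretation stated in the abstract.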

## Martyn Plummer: Sampling Methods for Generalized Linear Models

JAGS is a clone of the popular WinBUGS software for the analysis of Bayesian graphical models using Markov Chain Monte Carlo. These programs allow the user to build complex models from simple components using the BUGS language, in which variables in the model are represented as nodes in a directed acyclic graph.

JAGS chooses sampling methods automatically so that the user can concentrate on modelling issues. My goal is for JAGS to recognize commonly recurring design motifs in large graphical models and then choose efficient sampling methods for them. One important motif is the generalized linear model (GLM). In this talk, I will review some methods for sampling GLMs that have been proposed in the literature and are implemented in the current version of JAGS. I will describe how efficient sampling in a graphical model requires a balance between two opposing techniques: data augmentation (adding nodes) and marginalization (removing nodes). I will also try to explain why, from a Bayesian point of view, there is no such thing as a mixed model.

## Claudia Czado: The World of Vines

Pair copula constructions (PCC) allow the construction of very flexible multivariate distributions (see Aas et al. 2009). These are characterized by a sequence of linked trees called a vine structure, bivariate copula families and marginal distributions (see Kurowicka and Cooke (2006) and Kurowicka and Joe (2011)). Two often studied subclasses of regular vine models are C- and D-vines. The multivariate Gauss and t-distributions are special cases. This class is very useful for modeling multivariate data in economics and finance, since it can capture different non-symmetric and different tail dependencies for different pairs of variables. I will introduce PCC models, show their flexibility and discuss different estimation and model selection methods. Ideas and methods will be illustrated by applications to financial time series. Finally, I will indicate areas for further extensions and applications.
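
As a small worked illustration of a pair copula construction, the three-dimensional case can be written out explicitly. The notation is an assumption introduced here, not taken from the talk: $f_j$ and $F_j$ are the marginal densities and distribution functions, $c_{12}$ and $c_{23}$ are bivariate copula densities for the first tree, and $c_{13|2}$ is the copula density of the pair $(X_1, X_3)$ given $X_2$, taken not to depend on the value of $x_2$ (the usual simplifying assumption).

$$
f(x_1, x_2, x_3) = f_1(x_1)\, f_2(x_2)\, f_3(x_3)\;
c_{12}\bigl(F_1(x_1), F_2(x_2)\bigr)\;
c_{23}\bigl(F_2(x_2), F_3(x_3)\bigr)\;
c_{13|2}\bigl(F_{1|2}(x_1 \mid x_2),\, F_{3|2}(x_3 \mid x_2)\bigr)
$$

Each pair copula can be chosen from a different family, which is how such a model can accommodate different asymmetries and tail dependencies for different pairs of variables.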

## Gregor Heinrich: A Generic Approach to Topic Models

Topic models are probabilistic representations of grouped discrete data. Applied to text, the basic topic model represents documents as mixtures of topics -- probability distributions over the vocabulary. In many cases, there exists a semantic relationship between terms that have high probability within the same topic -- a phenomenon that is rooted in the word co-occurrence patterns in the text and that can be used for information retrieval and knowledge discovery in databases. Consequently, a large body of work extends the basic topic model, mostly modelling structures in the data beyond term co-occurrence or analysing different data modalities jointly to discover their inter-relations.

This talk presents an analysis of topic models as a generic model class, motivated by the conjecture that important properties may hold across individual models. It is shown that this generic approach indeed leads to practical simplifications in the derivation of model properties, inference algorithms and finally model design.

As an exemplary application, topic models for expert finding are developed that use semantic tags in addition to document text and authorship information to improve retrieval results and topic coherence. Finally, the talk gives an outlook on how the presented results may be used in the programming environment of the R language.
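
The basic topic model described at the start of this abstract, in which documents are mixtures of topics and topics are distributions over the vocabulary, can be sketched as a generative process. The sketch below assumes symmetric Dirichlet priors in the style of latent Dirichlet allocation; the function names, the hyperparameters `alpha` and `beta`, and the toy vocabulary are hypothetical choices made for illustration and are not taken from the talk.

```python
import random


def sample_dirichlet(concentration, rng):
    """Draw from a Dirichlet distribution by normalising gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(n_docs, doc_len, n_topics, vocab, alpha=0.5, beta=0.5, seed=0):
    """Generate documents as mixtures of topics over a fixed vocabulary."""
    rng = random.Random(seed)
    # Each topic is a probability distribution over the vocabulary.
    topics = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # Each document has its own mixture proportions over topics.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        doc = []
        for _ in range(doc_len):
            k = rng.choices(range(n_topics), weights=theta)[0]  # pick a topic
            doc.append(rng.choices(vocab, weights=topics[k])[0])  # pick a word
        corpus.append(doc)
    return topics, corpus


vocab = ["copula", "vine", "prior", "posterior", "sampler", "graph"]
topics, corpus = generate_corpus(n_docs=3, doc_len=8, n_topics=2, vocab=vocab)
for doc in corpus:
    print(" ".join(doc))
```

Inference runs this process in reverse: given only the documents, it recovers plausible topic distributions and per-document mixture proportions.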

## Marco Gori: Learning From Constraints

In this talk, I propose a functional framework to understand the emergence of intelligence in agents exposed to examples and knowledge granules. The theory is based on the abstract notion of constraint, which provides a representation of knowledge granules gained from the interaction with the environment. I give a picture of the “agent body” in terms of representation theorems by extending the classic framework of kernel machines in such a way as to incorporate logic formalisms, like first-order logic. This is made possible by the unification of continuous and discrete computational mechanisms in the same functional framework, so that any stimulus, like supervised examples and logic predicates, is translated into a constraint. The learning, which is based on constrained variational calculus, is guided either by a parsimonious match of the constraints or by unsupervised mechanisms expressed in terms of the minimization of the entropy.

I show some experiments with different kinds of symbolic and sub-symbolic constraints, and then I give insights on the adoption of the proposed framework in computer vision. It is shown that in most interesting tasks, learning from constraints naturally leads to “deep architectures”, which emerge when following the developmental principle of focusing attention on “easy constraints” at each stage. Interestingly, this suggests that stage-based learning, as discussed in developmental psychology, might not be primarily the outcome of biology, but could instead be the consequence of optimization principles and complexity issues that hold regardless of the “body”.

## Nalan Baştürk: Bayesian Testing for Multimodality Using Mixture Distributions

(based on joint work with Lennart Hoogerheide, Peter de Knijf and Herman K. van Dijk)

In several applications the data come from a non-standard, multimodal distribution. In such cases, standard exploratory data analysis can be misleading since possible multimodality is not taken into account. This seminar will outline a Bayesian test for detecting the number of distinct modes in the data. A mixture of shifted Poisson distributions is proposed to estimate the probability of multimodality. The method is applied to two datasets. First, we apply the proposed method to DNA tandem repeat data from three groups of individuals of Asian, African and European origin. Second, we analyze possible multimodality in the economic growth performance of countries, measured by Gross Domestic Product per capita. The results are compared with those of standard frequentist tests.
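
One plausible reading of how a mixture of shifted Poisson distributions yields a probability of multimodality is sketched below: count the modes of the mixture probability mass function for each posterior draw of the parameters, then report the share of draws with more than one mode. This is an illustrative sketch, not the authors' procedure. The function names, the parameter layout `(weights, lams, shifts)`, and the two hand-written "draws" are hypothetical; in practice the draws would come from a posterior sampler, which is not shown.

```python
from math import exp, lgamma, log


def shifted_poisson_pmf(y, lam, shift):
    """P(Y = y) when Y - shift is Poisson with mean lam."""
    k = y - shift
    if k < 0:
        return 0.0
    return exp(k * log(lam) - lam - lgamma(k + 1))


def mixture_pmf(y, weights, lams, shifts):
    return sum(w * shifted_poisson_pmf(y, lam, s)
               for w, lam, s in zip(weights, lams, shifts))


def count_modes(weights, lams, shifts, y_max):
    """Count local maxima of the mixture pmf on 0..y_max.

    A plateau of equal values counts as one mode. y_max should be large
    enough to cover the right tail, since a peak at y_max itself is not counted.
    """
    pmf = [mixture_pmf(y, weights, lams, shifts) for y in range(y_max + 1)]
    modes = 0
    rising = True  # lets a peak at y = 0 be counted
    for prev, cur in zip(pmf, pmf[1:]):
        if cur > prev:
            rising = True
        elif cur < prev:
            if rising:
                modes += 1
            rising = False
    return modes


def prob_multimodal(draws, y_max):
    """Share of parameter draws whose mixture pmf has at least two modes."""
    hits = sum(count_modes(w, lam, s, y_max) >= 2 for w, lam, s in draws)
    return hits / len(draws)


# Two hypothetical parameter draws: one well-separated mixture, one single component.
draws = [
    ([0.5, 0.5], [3.0, 3.0], [0, 30]),
    ([1.0], [4.0], [0]),
]
print(prob_multimodal(draws, y_max=80))
```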