Annual Meeting in the statistics network

January 22-23, 2009

Skjalm Hvide Hotel

  Thursday 22/1

9:00 - 9:30 Coffee, tea and rolls
9:30 - 9:45 Introduction
9:45 - 10:25 Dynamical stochastic models
10:25 - 10:55 Break
10:55 - 12:15 Dynamical stochastic models
 
12:15 - 13:30 Lunch
 
13:30 - 14:50 Bioinformatics
14:50 - 15:20 Coffee, tea and cake
15:20 - 16:00 Functional data and image analysis
16:00 - 17:00 Discussions and planning in the research groups:
Bioinformatics group & Functional data and image analysis group
17:00 - 18:00 Discussions and planning in the research groups:
Dynamical stochastic models group & Survival analysis group
 
18:30 - Dinner
 
 
  Friday 23/1

9:00 - 10:20 Survival analysis
10:20 - 10:50 Coffee and tea
10:50 - 12:10 Statistical computing
 
12:10 - 13:30 Lunch
 
13:30 - 14:30 Discussions and planning in the research groups:
Statistical computing group and other groups if needed
14:30 - 15:00 Plenary closing discussion
15:00 - 15:30 Coffee, tea and cake
 
 

Talks and Abstracts

  Thursday 22/1

9.45-12.15 Dynamical stochastic models

9.45-10.15 Helle Sørensen - Growth and energy intake for pigs

Data on growth and energy intake for pigs are analyzed simultaneously with a diffusion model derived from simple biological assumptions on maintenance and allometry. The dataset is unusual for animal nutrition science because the pigs are followed from birth until maturity and because of the simultaneous measurements of weight and energy intake. In the talk I will discuss the model and issues concerning estimation strategies, model validation, correlation, and the separation between diffusion noise and measurement noise. This is joint work with Anders Strathe, KU-LIFE.
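The talk's diffusion model is not specified in the abstract; as a generic illustration of how such a model can be simulated, the sketch below applies an Euler-Maruyama discretization to an invented mean-reverting growth equation (all functions and parameter values are hypothetical, not the authors' model):

```python
import numpy as np

def euler_maruyama(b, sigma, x0, T, n, rng):
    """Simulate dX_t = b(X_t) dt + sigma(X_t) dW_t on [0, T] with n steps."""
    dt = T / n
    x = np.empty(n + 1)
    x[0] = x0
    for i in range(n):
        dw = rng.normal(0.0, np.sqrt(dt))
        x[i + 1] = x[i] + b(x[i]) * dt + sigma(x[i]) * dw
    return x

rng = np.random.default_rng(0)
# Invented example: weight drifts toward an adult level of 150 (arbitrary units)
path = euler_maruyama(lambda x: 0.05 * (150.0 - x),  # drift: mean reversion
                      lambda x: 0.5,                 # constant diffusion coefficient
                      x0=1.5, T=100.0, n=1000, rng=rng)
```

In an actual analysis the drift and diffusion coefficients would come from the biological assumptions on maintenance and allometry mentioned above, and measurement noise would be layered on top of the simulated latent path.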

10.55-11.25 Søren Johansen - Statistical analysis of global surface temperature and sea level using analysis of non-stationary time series

Global averages of surface temperature and sea level have risen through the past century. We analyse the relationship between these non-stationary time series using a data set from Myhre et al. The statistical technique is the analysis of non-stationary time series by means of vector autoregressive processes. We find one stationary relation between the non-stationary variables:

\hat{β}'x_t = T_t - 0.0065 h_t - 0.768 wmgg_t - 1.478 aerosol_t
                   (t = -3.52)   (t = 4.68)     (t = 3.51)

where T_t is temperature, h_t sea level, wmgg_t well-mixed greenhouse gases, and aerosol_t sulfate aerosols; the measurements are annual data from 1881 to 1997.

The lecture is based upon joint work with Torben Schmith and Peter Thejll, DMI.

Reference:

Myhre, G., Myhre, A. and Stordal, F. (2001). Historical evolution of radiative forcing of climate. Atmospheric Environment 35, 2361-2373.
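The stationary relation above can be illustrated on synthetic data. The analysis in the talk uses a vector autoregressive framework; the sketch below instead runs a simple static regression on two invented series sharing a common stochastic trend, which conveys the basic cointegration idea (a stationary linear combination of non-stationary series); all parameter values are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
w = np.cumsum(rng.normal(size=n))             # common stochastic trend (random walk)
x = w + rng.normal(scale=0.2, size=n)         # non-stationary regressor
y = 0.5 * w + rng.normal(scale=0.2, size=n)   # non-stationary response, true beta = 0.5

# Static cointegrating regression: y_t = alpha + beta * x_t + u_t
X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta

# Both series trend, but the fitted linear combination (the residual) is stationary
print(beta[1], resid.std(), y.std())
```

OLS is superconsistent here because the trend dominates the noise; the real analysis must additionally handle lags and test the rank of the cointegrating space, which the vector autoregressive machinery does.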

11.35-12.05 Michael Sørensen - Efficient estimation for ergodic stochastic differential equation models sampled at high frequency

Simple and easily checked conditions are given that ensure rate optimality and efficiency of estimators for ergodic stochastic differential equation models in a high frequency asymptotic scenario, where the time between observations goes to zero while the observation horizon goes to infinity. For diffusion models rate optimality is important because parameters in the diffusion coefficient can be estimated at a higher rate than parameters in the drift. The criteria presented in the talk provide, in combination with considerations of computing time, much needed clarity in the profusion of estimators that have been proposed for parametric diffusion models. The focus is on approximate martingale estimating functions for discrete time observations. This covers most of the previously proposed estimators, and the few that are not covered are likely to be less efficient, because non-martingale estimating functions, in general, do not approximate the score function as well as martingales.

Optimal martingale estimating functions in the sense of Godambe and Heyde have turned out to provide simple estimators for many stochastic differential equation models. These estimating functions are approximations to the score functions, which are rarely explicitly known, and have often turned out to provide estimators with a surprisingly high efficiency. This can now be explained: the estimators are, under weak conditions, rate optimal and efficient in the high frequency asymptotics considered in the talk.
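As a sketch of the general setup (in generic notation that may differ from the talk's), a martingale estimating function for discrete observations X_{t_0}, ..., X_{t_n} of a diffusion takes the form

```latex
G_n(\theta) \;=\; \sum_{i=1}^{n} a\big(X_{t_{i-1}};\theta\big)
  \Big( f\big(X_{t_i};\theta\big)
        - \mathrm{E}_\theta\big[\, f(X_{t_i};\theta) \mid X_{t_{i-1}} \,\big] \Big),
```

and the estimator is obtained by solving G_n(θ) = 0. The conditional centering makes G_n a martingale, and the Godambe-Heyde optimal choice of the weight function a minimizes the asymptotic variance within this class.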

13.30-14.50 Bioinformatics

13.30-14.00 Thomas Hamelryck - Probabilistic models and machine learning in structural bioinformatics

Structural bioinformatics is concerned with the molecular structure of biomacromolecules on a genomic scale, using computational methods. Classic problems in structural bioinformatics include the prediction of protein and RNA structure from sequence, the design of artificial proteins or enzymes, and the automated analysis and comparison of biomacromolecules in atomic detail. These problems are of enormous importance in science, medicine and biotechnology. Recently, it has become increasingly clear that the enormous complexity of these problems can no longer be handled by ad hoc methods, but requires the development of sound statistical methods. Often, problems in structural bioinformatics lead to fascinating statistical challenges, due to the peculiar nature of the data (angles, orientations, directions, large data sets, high dimensionality, sequential data, ...).

The structural bioinformatics group at KU's bioinformatics center resolutely adopts a probabilistic approach, which is based on the formulation of statistical models of macromolecular structure. The use of graphical models (Bayesian networks and factor graphs) allows us to formulate tractable models with a clear physical interpretation. The statistical sub-disciplines of directional statistics and shape theory allow us to deal with observations on unusual manifolds, such as the hypersphere or the hypertorus, which are typically associated with macromolecular data. In the talk, I will focus on some recent success stories of our approach, and along the way point to some open research opportunities for the adventurous probabilist.

References:

Hamelryck, T., Kent, J., Krogh, A. (2006) Sampling realistic protein conformations using local structural bias. PLoS Comput. Biol., 2(9): e131

Boomsma, W., Mardia, KV., Taylor, CC., Ferkinghoff-Borg, J., Krogh, A. and Hamelryck, T. (2008) A generative, probabilistic model of local protein structure. Proc. Natl. Acad. Sci. USA, 105, 8932-8937.
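As a toy illustration of the kind of directional data mentioned above (angles on a torus, as for protein dihedral angles), the sketch below samples angle pairs from independent von Mises distributions and computes circular means. All parameter values are invented, and a realistic model, such as the one in the references, would couple the two angles:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy (phi, psi) dihedral-angle pairs as points on the torus:
# two independent von Mises components with invented locations and concentration.
phi = rng.vonmises(mu=-1.0, kappa=4.0, size=2000)
psi = rng.vonmises(mu=2.4, kappa=4.0, size=2000)

def circular_mean(a):
    """Angle of the mean resultant vector; the arithmetic mean is wrong for angles."""
    return np.arctan2(np.sin(a).mean(), np.cos(a).mean())

print(circular_mean(phi), circular_mean(psi))
```

The need for the resultant-vector mean (rather than the arithmetic mean, which fails near the wrap-around at ±π) is exactly the kind of issue that directional statistics handles systematically.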

14.10-14.40 Anders Tolver Jensen - Inference for Amplified Count Models with Applications to Drug Discovery

In this talk we consider statistical models for multivariate counts obtained by sampling from an amplified distribution that might itself have been through some selection process before amplification. The motivation is a sample of DNA-sequence counts after amplification by the Polymerase Chain Reaction (PCR). The model turns out to have a mathematically tractable structure in particular when considering statistical inference.

We further present an application of the model to data from a screening experiment that addresses the ability of individual drugs in a million compound chemical library to bind to a given target molecule. A good mathematical model for the joint bivariate count in the target population and in a target-specific control population is crucial to exploit the data in a reasonable way. In particular this may allow us to suggest improvements in the design of future experiments and optimize parameters of chemical processes with the purpose of increasing the power of the multiple selection procedure.

The lecture presents joint work with Ib Michael Skovgaard and is the result of a collaboration with the Danish biotech company Vipergen.

15.20-16.00 Functional data and image analysis

15.20-15.50 Bo Markussen - Operator based analysis of functional data

Basic functional data analysis in a fully probabilistic model is done using operator calculus. The probabilistic model allows for maximum likelihood inference for the smoothing parameter. The operator approach renders the use of a functional basis unnecessary, clarifies the importance of the boundary conditions, and exposes the intrinsic structure of the curve fit. The developed methodology is applied to histograms of voxel intensities for CT scans of eels, and we propose a hypothesis test for the local minima of the second-order derivative.
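The operator-based method of the talk is not reproduced here; the sketch below illustrates the related, generic idea of roughness-penalized curve fitting, where a discrete second-difference penalty reduces the whole fit to a single linear solve. The data and the smoothing parameter are invented, and the talk's probabilistic model would instead choose the smoothing parameter by maximum likelihood:

```python
import numpy as np

def penalized_smooth(y, lam):
    """Solve min_f ||y - f||^2 + lam * ||D2 f||^2, where D2 takes second
    differences (a discrete roughness penalty), via one linear system."""
    m = len(y)
    D2 = np.diff(np.eye(m), n=2, axis=0)           # (m-2) x m second-difference matrix
    return np.linalg.solve(np.eye(m) + lam * D2.T @ D2, y)

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 200)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)
fit = penalized_smooth(y, lam=50.0)
```

No basis expansion is needed: the penalty matrix acts directly on the sampled curve, loosely mirroring the basis-free flavour of the operator approach.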

  Friday 23/1

9.00-10.20 Survival analysis

9.00-9.30 Niels Keiding - Estimation of time to pregnancy from current duration data
joint work with Oluf K.H. Hansen and Lars H. Nielsen

Time to pregnancy is the duration from when a couple starts trying to become pregnant until they succeed, and it is considered one of the most direct methods to measure natural fecundity in humans. Statistical tools for designing and analysing time-to-pregnancy studies belong to the general area of survival analysis, but several special features require special attention. We focus here on starting from a cross-sectional sample of couples currently trying to become pregnant, using the current duration as the basis for estimation. This corresponds to using the backward recurrence time as the basis for inference, and here the preferable statistical model turns out to be the accelerated failure time model.

The inference is quite sensitive to observations near zero; indeed, Woodroofe and Sun pointed out that the nonparametric maximum likelihood estimator is inconsistent at zero. Switching to parametric models, we are faced with balancing stability against flexibility: the former is represented by a simple mixed exponential model such as the Pareto distribution, the latter by a much larger class of distributions suggested by Yamaguchi in a sociological context.

We are collaborating with a group of epidemiologists, statisticians and demographers in France where we have helped design and analyze a large telephone survey among French women based on the current duration approach. The talk will briefly mention several practical features that one has to take into account but we will focus on the statistical problems and present various simulation studies highlighting the sensitivity problems.

References:

Keiding, N., Kvist, K., Hartvig, H., Tvede, M. & Juul, S. (2002). Estimating time to pregnancy from current durations in a cross-sectional sample. Biostatistics 3, 565-578.

Scheike, T. & Keiding, N. (2006). Design and analysis of time to pregnancy. Stat. Meth. Med. Res. 15, 127-140.

Slama, R., Ducot, B., Carstensen, L., Lorente, C., de La Rochebrochard, E., Leridon, H., Keiding, N. & Bouyer, J. (2006). Feasibility of the current duration approach to study human fecundity. Epidemiology 17, 440-449.
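A small simulation can illustrate the current-duration design described above, under a purely hypothetical Gamma distribution for time to pregnancy. In a cross-section, long durations are over-represented (length bias), and the observed current duration is a uniform fraction of the sampled duration, so its mean is E[T^2] / (2 E[T]) rather than E[T]:

```python
import numpy as np

rng = np.random.default_rng(4)

n = 200_000
t = rng.gamma(shape=2.0, scale=1.0, size=n)   # hypothetical time-to-pregnancy T, E[T] = 2
t_lb = rng.choice(t, size=n, p=t / t.sum())   # length-biased draw: long T over-represented
y = rng.uniform(0.0, t_lb)                    # observed current duration Y = U * T

# Renewal theory: E[Y] = E[T^2] / (2 E[T]); for Gamma(2, 1) this is 6 / 4 = 1.5
print(y.mean())  # approximately 1.5, not E[T] = 2
```

The same mechanism explains the sensitivity near zero: the density of Y is proportional to the survival function of T, so the behaviour of the estimate at zero is tied to the shortest durations in the sample.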

9.40-10.10 Ulla Brasch Mogensen - Survival paths through the forest
Joint work with Thomas A. Gerds, Ole Winther

A challenging role for biostatisticians is to establish computer algorithms that use information gathered on similar patients to provide evidence-based medical guidance. Tools from machine learning are popular for analyzing complex and high-dimensional data. The aim of our project is to develop the random forest method for survival analysis, in particular for situations with competing risks. The random forest method is an ensemble method which combines the results of many classification or regression trees to achieve accurate predictions. The idea presented here is to use time-dependent pseudo-values to build classification trees. This new method is compared to conventional regression modelling and to existing versions of the survival random forest method. Some first results are shown from applications to real and simulated data.
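The talk's time-dependent pseudo-values for competing risks are not reconstructed here; as a minimal sketch of the underlying jackknife pseudo-observation idea, theta_i = n*theta_hat - (n-1)*theta_hat^(-i), the snippet below applies it to the sample mean, where the pseudo-values recover the raw observations exactly:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.exponential(scale=2.0, size=50)   # invented data

n = x.size
theta_full = x.mean()                     # estimate on the full sample
theta_loo = (x.sum() - x) / (n - 1)       # all leave-one-out estimates, vectorized
pseudo = n * theta_full - (n - 1) * theta_loo

# For the sample mean, algebra gives pseudo_i = x_i exactly
print(np.allclose(pseudo, x))  # True
```

In the survival setting the same construction is applied to an estimator such as a cumulative incidence at a set of time points, turning censored data into complete per-subject responses that a classification or regression tree can consume.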

10.50-12.10 Statistical computing

10.50-11.20 Klaus Holst - An R package for structural equation modelling

The class of linear structural equation models (SEM) covers a broad range of models and provides a natural framework for analyzing high-dimensional data as obtained among others from imaging studies.

R offers only limited support for handling SEMs, and as a result most researchers turn to one of several proprietary software solutions. However, implementation of new methods and exploration of these through large-scale simulations are generally not feasible in that setup.

In this talk I will present a new SEM library and demonstrate how to specify models and estimate parameters using MLE under normality assumptions. Topics such as estimation in presence of missing data and recent developments in model diagnostics will also be discussed.



11.30-12.00 Peter Dalgaard - Ideas about likelihood-based data analysis in R

R has for a long time provided access to good quality optimizers. In combination with its powerful programming language, it is often quite straightforward simply to write up a likelihood and find maximum likelihood estimators by direct maximization, provided the problem is "sufficiently well-behaved". This can be more expedient than trying to fit a problem into an existing modelling framework such as glm().
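The talk concerns R's optimizers and mle(); as an analogous sketch in Python of "writing up a likelihood and maximizing it directly" (synthetic normal data, invented parameter values, log-parametrized scale to keep the optimizer unconstrained):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
data = rng.normal(loc=3.0, scale=2.0, size=1000)

def nll(params):
    """Negative log-likelihood of an i.i.d. normal sample (constants dropped)."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)   # log-parametrization keeps sigma > 0
    return 0.5 * np.sum(((data - mu) / sigma) ** 2) + data.size * np.log(sigma)

fit = minimize(nll, x0=[0.0, 0.0], method="Nelder-Mead")
mu_hat, sigma_hat = fit.x[0], np.exp(fit.x[1])
```

This is the "sufficiently well-behaved" case: a smooth, unimodal negative log-likelihood that a general-purpose optimizer handles directly, with no need to squeeze the model into a framework like glm().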

However, there is a need for support functionality at several levels. The mle() function in R and various associated functions were written to provide a basic toolkit for summarizing an ML fit, extracting quantities, and describing the behaviour of the likelihood near the optimum (profiling).

It is desirable to extend the toolkit in various ways. For instance, the mle() function has no checks that its mll argument is actually a negative log-likelihood, and it is easy to make mistakes. Also, it would be nice to have a way to construct models from simpler "building blocks".

This suggests that we need an extended likelihood-model class, objects from which the mll function can be extracted when needed, but which share some aspects with the more conventional model classes, and can be defined using simpler primitives than low-level programming. Some natural operations may be defined on models (e.g., combination of two experiments, mixture models, or censoring), and it is necessary to discuss the requirements this puts on the object structure.
 
 

Research group meetings

  Thursday 22/1

16.00-17.00 Bioinformatics
16.00-17.00 Functional data and image analysis
17.00-18.00 Dynamical stochastic models
17.00-18.00 Survival analysis

  Friday 23/1

13.30-14.30 Statistical computing