Annual Meeting of the statistics network

August 19-20, 2013

Comwell hotel, Holte

How to get to the hotel by public transportation: Go to "Holte station", e.g. by S-train, then take bus 334 to "Kongevejen/Vasevej". The Comwell hotel is located 200 meters further down Kongevejen.
Program:
Monday 19/8
9:15 - 9:45 Coffee with bread, juice and fruit
9:45 - 11:45 Survival analysis
Torben Martinussen: Structural nested additive hazards models to adjust for time-varying confounding in survival analysis
Per Kragh Andersen: Decomposing number of life years lost according to causes of death
Thomas Scheike: Describing dependence in multivariate competing risks data
12:00 - 13:30 Lunch
13:30 - 15:15 Functional data and image analysis
Martha Muller: Functional Data Analysis in Nutri-Metabolomics
Anders Tolver Jensen: Registration subject to biomechanical constraints
Stefan Sommer: Deformation Analysis and Non-linear Statistics for Computational Anatomy
15:15 - 15:45 Break
15:45 - 17:00 Computational statistics
Klaus Holst: Computational Statistics in Neurobiology
Niels Richard Hansen: A survey of sparse models and computations
17:00 - 17:15 Break
17:15 - 18:15 Invited talk
Torsten Hothorn
Conditional Transformation Models
19:00 - Dinner

Tuesday 20/8

9:00 - 10:15 Bioinformatics
Tomas Bertelsen: Random Survival Forests applied to cancer in a clinical setting
Claus Ekstrøm: Integrative analysis of metabolomics and transcriptomics data
10:15 - 10:45 Break
10:45 - 12:30 Stochastic dynamic models
Alexandre Iolov: Optimal Control of Spike Trains in an Ornstein-Uhlenbeck Model
Carsten Wiuf: Stochastic and deterministic "Chemical Reaction Models" and model reduction
Michael Sørensen: Likelihood inference for stochastic differential equations
12:30 - 13:30 Lunch
13:30 - 14:30 Invited talk
Svend Kreiner
The PISA controversy. A discussion of statistical issues



Abstracts

Monday 19/8

9:45-11:45 Survival analysis

Torben Martinussen
Structural nested additive hazards models to adjust for time-varying confounding in survival analysis

Time-varying confounding forms a pervasive problem in observational studies that attempt to assess the total effect of a time-varying exposure on a survival outcome. It is tempting to rely on routine survival models with time-varying covariates, but unfortunately the resulting estimates of the exposure effect are prone to bias when, as often, time-varying confounders are themselves affected by earlier exposures. This is because standard regression adjustment for such confounders eliminates indirect effects of early exposures that are mediated via those covariates, and in addition may induce a so-called collider-stratification bias. Martinussen et al. (2011) demonstrated how a valid adjustment for time-varying confounding is attainable when effects are parameterized on the additive hazard scale. Because they focused on the special case of two exposures, the first of which is dichotomous and randomly assigned, we here extend their results to general time-varying exposures. The extension explicates a close link with inference under structural nested cumulative failure time models (Young et al., 2010; Picciotto et al., 2011), but yields a different class of estimators which accommodate continuous-time settings and are deemed to be more efficient. Relative to G-estimation for structural nested accelerated failure time models, the attraction of the proposed approach is that it naturally accommodates independent censoring without requiring an artificial recensoring procedure to maintain unbiased estimating equations.
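
As a rough reference, in standard notation (assumed here, not quoted from the abstract), the additive hazard scale refers to models of Aalen type, in which covariates act additively on the hazard,

\lambda(t \mid X) = \beta_0(t) + \beta_1(t) X_1(t) + \dots + \beta_p(t) X_p(t),

so that exposure effects enter as time-varying additive increments to the hazard rather than as multiplicative factors as in the Cox model.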


Per Kragh Andersen
Decomposing number of life years lost according to causes of death

We study the competing risks model and show that the cause-j cumulative incidence function integrated from 0 to tau has a natural interpretation as the expected number of life years lost due to cause j before time tau. This is analogous to the tau-restricted mean lifetime, which is the survival function integrated from 0 to tau. It is discussed how the number of years lost may be related to subject-specific explanatory variables in a regression model based on pseudo-observations, and the method is exemplified using a standard data set on survival with malignant melanoma. Finally, the connection to related measures used in demography is discussed.
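
In standard notation (assumed here, not quoted from the abstract), with S the survival function and F_j the cause-j cumulative incidence function, the quantity described above is

L_j(\tau) = \int_0^\tau F_j(t) \, dt .

Since \tau - \int_0^\tau S(t) \, dt = \int_0^\tau \sum_j F_j(t) \, dt = \sum_j L_j(\tau), the total number of life years lost before time \tau decomposes exactly into these cause-specific contributions.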


Thomas Scheike
Describing dependence in multivariate competing risks data

There has been considerable interest in studying the heritability of a specific disease. This is typically derived from family or twin studies, where the basic idea is to compare the correlation for different pairs that share different amounts of genes. We here consider data from the Danish twin registry and discuss how to define heritability for breast cancer. The key point is that this should be done taking censoring as well as the competing risks due to e.g. death into account. I will describe how to assess the dependence between twins on the probability scale and on the hazard scale and show that various models can be used to achieve sensible estimates of the dependence within monozygotic and dizygotic twin pairs that may vary over time. These dependence measures can subsequently be decomposed into a genetic and environmental component using random effects models. I will present several models that in essence describe the association in terms of the concordance rate, i.e., the probability that both twins experience the event, in the competing risks setting. We also discuss how to deal with the left truncation present in the Nordic twin registries, due to sampling only of twin pairs where both twins are alive at the initiation of the registries.
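
As a minimal sketch in assumed notation, the dependence on the probability scale can be summarised by a concordance function such as

C(t) = P(T_1 \le t, \epsilon_1 = 1, T_2 \le t, \epsilon_2 = 1),

the probability that both twins in a pair have experienced the event of interest (cause 1) before time t, estimated separately for monozygotic and dizygotic pairs and compared with what independence of the two twins would imply.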


13:30-15:15 Functional data and image analysis

Martha Muller
Functional Data Analysis in Nutri-Metabolomics

Metabolomics is the study of chemical processes involving small molecules that are the intermediates and products of metabolism. The aim of metabolomics in nutrition is to understand how changes in the nutrient content of diet influence changes in metabolic regulation.

Nuclear magnetic resonance (NMR) technology generates a vast output of spectral data from a single biofluid sample. This high-dimensional data poses a great challenge in metabolomics. Standard chemometric methods typically rely on principal component analysis (PCA) and/or partial least squares (PLS) of the data points.

As an alternative approach, we model the NMR spectra as functions. The key idea is to utilise the information in first and second derivatives of these functions. We explore the variability among diet groups graphically as well as analytically, using methods from functional data analysis (Ramsay & Silverman, 2005).


Anders Tolver Jensen
Registration subject to biomechanical constraints

Statistical analysis of functions is usually complicated by the fact that the raw data are misaligned in their time argument. A popular solution is to 'separate time and amplitude variation' by estimating a time transformation for each function. Further statistical analysis then deals with the amplitude variation of the registered (= time-transformed) functions.

We claim that in some situations, where phase and amplitude variation may not be clearly separated, too much registration may complicate the interpretability of the data. Instead, registration should be carried out subject to the physical constraints of the data-generating system.

In this talk we present our work resulting from a workshop on 'Statistics of Time Warpings and Phase Variations' in Ohio, November 2012.


Stefan Sommer
Deformation Analysis and Non-linear Statistics for Computational Anatomy

The field of computational anatomy aims at building models and statistical tools for analysis and quantification of the shape of organs and biological tissue. At the heart of computational anatomy lies the analysis of deformation that brings organs from different subjects into correspondence. In the talk, I will first discuss sparse modeling of deformation. The kernel bundle framework builds on previous geometric approaches to deformation modeling by extending the representation to multiple scales. In addition, the introduction of higher order momenta in the framework allows modeling of locally affine deformation. We will see how the increased description capacity allows registration with very few parameters while keeping the capacity of the deformation model and how the kernels can be applied for registering MR scans of patients suffering from Alzheimer's disease. The subsequent statistical analysis is challenged by the non-linear structure of the deformation space. I will discuss the horizontal component analysis (HCA) dimensionality reduction procedure that extends PCA to Riemannian manifolds. We will see how HCA can approximate data sampled from multimodal distributions, and how it can be used for visualizing non-linear variation in brain tissue.


15:45-17:00 Computational statistics

Klaus Holst
Computational Statistics in Neurobiology

Neurobiology studies typically involve massive data collection across different types of imaging modalities, such as Magnetic Resonance Imaging (MRI), functional Magnetic Resonance Imaging (fMRI), Positron Emission Tomography (PET) and Diffusion Tensor Imaging (DTI), and often many different markers within each imaging modality may be examined. Furthermore, large amounts of neuropsychiatric data (questionnaire data) and various types of genomics data are often also collected. The large amount of data requires special attention to the choice of computational methods and computing environment. Different examples of such studies will be given, and methods for integrating analyses within a statistical programming environment such as R using external data representations will be demonstrated.


Niels Richard Hansen
A survey of sparse models and computations

Sparse model assumptions and sparse data structures are important for high-dimensional and big data modeling for at least three reasons: interpretability, predictive accuracy and computational efficiency. We discuss how sparsity is, and has been, important to some of the projects supported by this Program of Excellence. These include Gaussian graphical models, causal modeling, multivariate point processes and multinomial classification.


17:15-18:15 Torsten Hothorn
Conditional Transformation Models

The ultimate goal of regression analysis is to obtain information about the conditional distribution of a response given a set of explanatory variables. This goal is, however, seldom achieved because most established regression models only estimate the conditional mean as a function of the explanatory variables and assume that higher moments are not affected by the regressors. The underlying reason for such a restriction is the assumption of additivity of signal and noise. We propose to relax this common assumption in the framework of transformation models. The novel class of semi-parametric regression models proposed herein allows transformation functions to depend on explanatory variables. These transformation functions are estimated by regularised optimisation of scoring rules for probabilistic forecasts, e.g. the continuous ranked probability score. The corresponding estimated conditional distribution functions are consistent. Conditional transformation models are potentially useful for describing possible heteroscedasticity, comparing spatially varying distributions, identifying extreme events, deriving prediction intervals and selecting variables beyond mean regression effects. An empirical investigation based on a heteroscedastic varying coefficient simulation model demonstrates that semi-parametric estimation of conditional distribution functions can be more beneficial than kernel-based non-parametric approaches or parametric generalised additive models for location, scale and shape.
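
As background, in assumed notation, the continuous ranked probability score for a predictive distribution function F and an observed response y is

\mathrm{CRPS}(F, y) = \int_{-\infty}^{\infty} \bigl( F(z) - \mathbf{1}\{y \le z\} \bigr)^2 \, dz,

a proper scoring rule whose expected value is minimised when F equals the true conditional distribution function.

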
Tuesday 20/8

9:00-10:15 Bioinformatics

Tomas Bertelsen
Random Survival Forests applied to cancer in a clinical setting


Claus Ekstrøm
Integrative analysis of metabolomics and transcriptomics data

The abundance of high-dimensional measurements in the form of gene expression and mass spectrometry calls for models to elucidate the underlying biological system. We propose a statistical method that is applicable to a dataset consisting of Liquid Chromatography-Mass Spectrometry (LC-MS) and gene expression (DNA microarray) measurements from the same samples, to identify genes controlling the production of metabolites.

Due to the high dimensionality of both LC-MS and DNA microarray data, dimension reduction and variable selection are key elements of the analysis. The approach starts by identifying the basis functions ("building blocks") that constitute the output from a mass spectrometry experiment. Subsequently, the weights of these basis functions are related to the observations from the corresponding gene expression data in order to identify which genes are associated with specific patterns seen in the metabolite data. The modeling framework is extremely flexible as well as computationally fast and can accommodate treatment effects and other variables related to the experimental design. We demonstrate that within the proposed framework, genes regulating the production of specific metabolites can be identified correctly unless the variation in the noise is more than twice that of the signal.
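
A minimal sketch of this two-stage structure, in notation assumed here rather than taken from the abstract: the LC-MS signal m_s(x) for sample s is first approximated by a set of basis functions B_k with sample-specific weights,

m_s(x) \approx \sum_k w_{sk} B_k(x),

and the weights w_{sk} are then regressed on the gene expression measurements from the same sample, so that genes whose expression predicts a given weight are flagged as associated with the corresponding metabolite pattern.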


10:45-12:30 Stochastic dynamic models

Alexandre Iolov
Optimal Control of Spike Trains in an Ornstein-Uhlenbeck Model.


The membrane potential of neuron cells is often modelled as a simple stochastic differential equation (SDE), with a threshold-hitting used to signify the physical voltage spike through which real neurons convey information. One of the advantages of this model is that there is a direct relation between external physical stimulation and model parameters. In this talk, we address the elicitation of a pre-determined spike train from a single neuron via external stimulation. In our model, this reduces to the control of first-passage-times for an SDE. We explore two control contexts — one in which the detailed voltage trace is observable and one in which only the actual hitting-time is observable. The two contexts necessitate different techniques. We discuss both techniques and then show their performance on several representative parameter regimes.

(This is joint work with Susanne Ditlevsen and Andre Longtin.)
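
For orientation, a standard form of the model class referred to above, in assumed notation: the membrane potential V_t follows an Ornstein-Uhlenbeck SDE with a controllable input \alpha(t),

dV_t = \Bigl( -\frac{V_t}{\tau} + \mu + \alpha(t) \Bigr) dt + \sigma \, dW_t,

a spike is recorded the first time V_t reaches a threshold v_{th}, after which the potential is reset; the control problem is then to choose \alpha(t) so that the resulting first-passage times reproduce the target spike train.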


Carsten Wiuf
Stochastic and deterministic "Chemical Reaction Models" and model reduction.


Systems biology aims to understand complex biochemical systems and the mechanisms that are responsible for specific behaviors, such as multi-stationarity or oscillation. Typical mathematical models of biological systems produce polynomial systems of equations. However, it is non-trivial to build a good model as our biochemical knowledge is far from complete and reliable. A main concern is therefore robustness of properties of models of the same biochemical system. In this talk I will discuss some problems we encounter in modeling biochemical systems and some basic questions the biologist is interested in. The results are based on deterministic modeling but if time permits I will make some remarks about stochastic models. The talk is based on work in progress.
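
As a small illustration in assumed notation: under mass-action kinetics, a reaction network such as A + B -> C, C -> A + B with rate constants k_1 and k_2 gives the polynomial ODE system

\dot{x}_A = -k_1 x_A x_B + k_2 x_C, \qquad \dot{x}_B = -k_1 x_A x_B + k_2 x_C, \qquad \dot{x}_C = k_1 x_A x_B - k_2 x_C,

and questions such as multi-stationarity amount to analysing the positive solutions of the corresponding polynomial equations.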


Michael Sørensen
Likelihood inference for stochastic differential equations.

Two approaches to inference for discretely sampled stochastic differential equations will be presented. Martingale estimating functions are a simple way of approximating likelihood inference that provides estimators which are easy to calculate. These estimators are generally consistent, and if the estimating function is chosen optimally, they are efficient in a high frequency asymptotic scenario, where the sampling frequency goes to infinity. At low sampling frequencies, efficient estimators can be obtained by more accurate approximations to likelihood inference based on simulation methods, including both the stochastic EM-algorithm and Bayesian approaches like the Gibbs sampler. These methods are much more computer intensive. Simulation of diffusion bridges plays a central role. Therefore this problem has been investigated actively over the last 10 years. A simple method for diffusion bridge simulation will be presented.
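
To indicate the form such estimating functions take, in notation assumed here: for discrete observations X_{t_0}, \dots, X_{t_n} of a diffusion with parameter \theta, a martingale estimating function is typically

G_n(\theta) = \sum_{i=1}^{n} a(\Delta_i, X_{t_{i-1}}; \theta) \bigl[ f(X_{t_i}; \theta) - E_\theta\bigl( f(X_{t_i}; \theta) \mid X_{t_{i-1}} \bigr) \bigr],

whose terms are martingale differences under the true model; the estimator solves G_n(\theta) = 0, and the weights a can be chosen to maximise efficiency.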


13:30-14:30 Svend Kreiner
The PISA controversy. A discussion of statistical issues.

Assessment of the validity and reliability of educational tests by analysis of the fit of Rasch or other types of models from either item response theory or confirmatory factor analysis is usually regarded as an exercise for psychometricians. When educational tests are used in large-scale comparative studies where there is no interest whatsoever in the test results of specific children, the question of validity converts to a mainstream statistical problem of model fit and the degree to which conclusions drawn from the study are robust towards the model errors.

The purpose of this talk is to discuss the statistical issues. PISA (and, for that matter, many other educational surveys) compares countries by analysis of so-called plausible values. Plausible values are random data drawn from the conditional distribution of a latent variable given the observed responses to the items of the educational test. To generate such values PISA has to have a scaling model describing the conditional distribution of item responses given the latent variable and a population model describing the distribution of the latent variable in the different countries.
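
In assumed notation, a plausible value for a student with item responses x in country c is a random draw from the posterior

p(\theta \mid x, c) \propto p(x \mid \theta) \, g_c(\theta),

where p(x \mid \theta) is the scaling model and g_c the country-specific population model for the latent proficiency \theta.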

I will briefly describe the scaling model used by PISA, a so-called Rasch model, and what they and I have done to test whether the scaling model fits the data. Since it turns out that the Rasch model does not fit the data, a point that PISA now admits, the important question is whether the model errors matter. To address this issue, we have analyzed the robustness of the country comparisons in several different ways: 1) by analysis of country ranking by different types of subsets of items, 2) by comparison of country ranks by 1000 random subsets of items, and 3) by comparison of the ranking by PISA's scaling model with the ranking by two different scaling models where some of the flaws of the Rasch model have been corrected. Finally, because one argument against the importance of model misfit is always that statistical models never fit in large-sample studies, we have assessed the degree of misfit by an evaluation of the sample size required for the misfit to be significant.
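
For reference, in standard (assumed) notation, the dichotomous Rasch model specifies the probability that student v answers item i correctly as

P(X_{vi} = 1 \mid \theta_v) = \frac{\exp(\theta_v - \beta_i)}{1 + \exp(\theta_v - \beta_i)},

where \theta_v is the latent proficiency and \beta_i the item difficulty; in particular, all items are assumed to discriminate equally and to function in the same way across countries, assumptions whose fit is exactly what is at issue here.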

The results of these analyses will be presented. To us, these results do not support PISA's claim that the country ranks are robust towards the errors of the Rasch model, but that is, of course, something to discuss. Another topic for discussion is whether there are better and more appropriate ways to assess robustness. It would be very welcome if the discussion after my talk could come up with some proposals to pursue.