Abstracts

Tales from the everyday life of a pharmaceutical statistician.

Speaker: Claus Dethlefsen, Novo Nordisk.

Abstract: Novo Nordisk is a Danish pharmaceutical company with a large Data Science department, including biostatistics. A pharmaceutical statistician is typically involved in the design, conduct and reporting of randomised clinical trials.

It is of interest to conduct trials that are small enough to be feasible and fast to run, yet provide sufficient evidence for decision making. Historical data from earlier clinical trials may be used to lower the sample size of a new trial. This can be done by defining a prognostic score based on the historical data. The score can then be used as a covariate in the new trial.
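To make the prognostic-score idea concrete, here is a minimal sketch on simulated data. Everything in it is an assumption made for illustration (the linear outcome model, the sample sizes, the effect size, and the helper names `prognostic_score` and `treatment_estimate`); it is not the procedure used at Novo Nordisk. A model is fitted on historical data only, and its prediction is then added as a single covariate in the analysis of the new trial.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_intercept(X):
    """Prepend a column of ones to a covariate matrix."""
    return np.column_stack([np.ones(len(X)), X])

# Hypothetical historical data: the outcome depends on three baseline covariates.
n_hist, p = 2000, 3
beta_true = np.array([1.0, -0.5, 0.8])
X_hist = rng.normal(size=(n_hist, p))
y_hist = X_hist @ beta_true + rng.normal(size=n_hist)

# Step 1: fit the prognostic model on historical data only.
coef_hist, *_ = np.linalg.lstsq(add_intercept(X_hist), y_hist, rcond=None)

def prognostic_score(X):
    """Predicted outcome from the historical model (the prognostic score)."""
    return add_intercept(X) @ coef_hist

# Step 2: simulate a new randomised trial with an assumed treatment effect.
n_new = 200
true_effect = 0.5
X_new = rng.normal(size=(n_new, p))
treat = rng.integers(0, 2, size=n_new)
y_new = true_effect * treat + X_new @ beta_true + rng.normal(size=n_new)

def treatment_estimate(design, y):
    """OLS estimate and standard error for the treatment column (index 1)."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    sigma2 = resid @ resid / (len(y) - design.shape[1])
    cov = sigma2 * np.linalg.inv(design.T @ design)
    return coef[1], np.sqrt(cov[1, 1])

ones = np.ones(n_new)
# Unadjusted analysis: treatment indicator only.
unadjusted = treatment_estimate(np.column_stack([ones, treat]), y_new)
# Adjusted analysis: treatment indicator plus the prognostic score.
adjusted = treatment_estimate(
    np.column_stack([ones, treat, prognostic_score(X_new)]), y_new
)

print("unadjusted (estimate, SE):", unadjusted)
print("adjusted   (estimate, SE):", adjusted)
```

In this simulated setting the adjusted analysis has a smaller standard error for the treatment effect, which is the mechanism by which a prognostic covariate can allow a smaller trial.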

When communicating trial objectives and results, it is important to understand, discuss and make clear how intercurrent events may affect the outcome. It is equally important to state in clear language how data collected after the occurrence of intercurrent events are handled and how sensitivity analyses are intended to investigate the robustness of these assumptions.

Traditionally, SAS has been the industry standard when reporting trial results for new drug candidates. Like other pharma companies, Novo Nordisk has decided to investigate if R can be used. This led to the first submission of a drug candidate where all output programs were programmed using R.

Slides

Provable Boolean interaction recovery from tree ensemble obtained via random forests.

Speaker: Merle Behr, University of Regensburg.

Abstract: Random Forests (RFs) are at the cutting edge of supervised machine learning in terms of prediction performance, especially in genomics. Iterative RFs (iRFs) use a tree ensemble from iteratively modified RFs to obtain predictive and stable nonlinear or Boolean interactions of features.

They have shown great promise for Boolean biological interaction discovery that is central to advancing functional genomics and precision medicine.

However, theoretical studies into how tree-based methods discover Boolean feature interactions are missing.
Inspired by the thresholding behavior in many biological processes, we first introduce a discontinuous nonlinear regression model, called the "Locally Spiky Sparse" (LSS) model.

Specifically, the LSS model assumes that the regression function is a linear combination of piecewise constant Boolean interaction terms. Given an RF tree ensemble, we define a quantity called Depth-Weighted Prevalence (DWP) for a set of signed features S. Intuitively speaking, DWP(S) measures how frequently features in S appear together in an RF tree ensemble. We prove that, with high probability, DWP(S) attains a universal upper bound that does not involve any model coefficients, if and only if S corresponds to a union of Boolean interactions under the LSS model. Consequently, we show that a theoretically tractable version of the iRF procedure, called LSSFind, yields consistent interaction discovery under the LSS model as the sample size goes to infinity. Finally, simulation results show that LSSFind recovers the interactions under the LSS model, even when some assumptions are violated.
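As an illustration of what "a linear combination of piecewise constant Boolean interaction terms" can look like, the following sketch simulates data from an LSS-style regression function. The parametrisation (thresholds, signs, coefficients, and the function name `lss_regression_function`) is an assumption chosen for illustration and may differ from the exact definition used in the talk; DWP and LSSFind are not implemented here.

```python
import numpy as np

def lss_regression_function(X, interactions, coefs, intercept=0.0):
    """Evaluate a linear combination of Boolean interaction terms.

    Each interaction is a list of (feature_index, threshold, sign) triples.
    sign = +1 requires x > threshold; sign = -1 requires x <= threshold.
    An interaction is "on" only if all of its signed features agree.
    """
    out = np.full(X.shape[0], intercept, dtype=float)
    for beta, terms in zip(coefs, interactions):
        active = np.ones(X.shape[0], dtype=bool)
        for j, thr, sign in terms:
            active &= (X[:, j] > thr) if sign > 0 else (X[:, j] <= thr)
        out += beta * active
    return out

# Two hypothetical interactions on disjoint feature sets.
interactions = [
    [(0, 0.5, +1), (1, 0.5, +1)],   # x0 > 0.5 AND x1 > 0.5
    [(2, 0.3, -1), (3, 0.7, +1)],   # x2 <= 0.3 AND x3 > 0.7
]
coefs = [2.0, -1.0]

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 10))           # features 4..9 are pure noise
f = lss_regression_function(X, interactions, coefs)
y = f + rng.normal(scale=0.5, size=500)   # noisy responses

# Small hand-checkable example with four features.
X_check = np.array([[0.9, 0.9, 0.1, 0.9],
                    [0.1, 0.9, 0.5, 0.9]])
f_check = lss_regression_function(X_check, interactions, coefs)
```

The regression function is discontinuous and takes only finitely many values, which mirrors the thresholding behaviour that motivates the model.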

Slides

Phase-type representations for exponential distributions

Speaker: Clara Brimnes Gardner, Technical University of Denmark - DTU

Abstract: A phase-type distribution describes the time until absorption in a finite-state Markov jump process. The sojourn times in such a process are exponentially distributed, and therefore a phase-type distributed random variable is a random sum of dependent, exponentially distributed random variables. This talk considers the necessary and sufficient conditions for a phase-type distribution to simplify into an exponential distribution.

The conditions turn out to be connected to different concepts characterizing the over-parameterization of phase-type distributions such as the algebraic degree of a phase-type distribution and PH-simplicity, and we explore this connection.
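A small numerical sketch of one simple sufficient condition (not the full characterisation discussed in the talk): if every transient state has the same absorption rate λ, i.e. every row of the sub-intensity matrix T sums to -λ, then the vector of ones is an eigenvector of T and the survival function α exp(Tt) 1 equals exp(-λt) for any initial distribution α. The matrices below and the helper names `expm` and `ph_survival` are assumptions made for illustration.

```python
import numpy as np

def expm(A, terms=20):
    """Matrix exponential via scaling and squaring with a truncated Taylor series."""
    A = np.asarray(A, dtype=float)
    norm = np.linalg.norm(A, ord=np.inf)
    # Scale A so that its norm is at most 0.5 before applying the series.
    k = max(0, int(np.ceil(np.log2(norm))) + 1) if norm > 0 else 0
    A_scaled = A / (2 ** k)
    result = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for i in range(1, terms + 1):
        term = term @ A_scaled / i
        result = result + term
    # Undo the scaling by repeated squaring.
    for _ in range(k):
        result = result @ result
    return result

def ph_survival(alpha, T, t):
    """P(X > t) = alpha @ expm(T t) @ 1 for a phase-type distribution PH(alpha, T)."""
    return float(alpha @ expm(T * t) @ np.ones(T.shape[0]))

lam = 2.0
alpha = np.array([0.3, 0.7])

# Both rows sum to -2: each state is absorbed at rate 2, so X ~ Exp(2).
T_exp = np.array([[-5.0, 3.0],
                  [1.0, -3.0]])

# Rows sum to -2 and -3: absorption rates differ between the states.
T_other = np.array([[-5.0, 3.0],
                    [1.0, -4.0]])

for t in (0.1, 0.5, 1.0):
    print(t, ph_survival(alpha, T_exp, t), np.exp(-lam * t),
          ph_survival(alpha, T_other, t))
```

The first representation uses two states and several parameters to describe a distribution that needs only one, which is an instance of the over-parameterization mentioned above.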

Slides

Designing a Data Science Education

Speaker: Therese Graversen, IT University of Copenhagen

Abstract: The first bachelor programme in Data Science in Denmark opened at the IT University (ITU) in 2017 and has become highly popular: at the time of writing we have 93 students in their first year.

I took over the role as the head of programme at the beginning of 2020 just in time for graduating the first cohort, and I have since had the pleasure of evaluating the programme and implementing major revisions. In this talk I will share some thoughts on designing good data science education programmes. I will also touch upon the pertinent question of "what is data science, actually?", particularly from a statistician's point of view.

The Emperor's New Machine Learning Clothes – the fairy tale of a marketed clinical prediction model

Speaker: Simon Tilma Vistisen, Department of Clinical Medicine

Abstract: Clarifying one of the bigger scientific scandals in anesthesia.

Low blood pressure during surgery is associated with poor outcomes (organ damage and mortality), and a large medical company has allegedly developed a machine learning algorithm to predict imminent low blood pressure, so that clinicians can proactively treat patients and prevent it. Today, the algorithm is the company’s flagship technology for patient monitoring. Unfortunately, the algorithm is worthless because of a basic and simple error introduced at the very beginning of algorithm development, which causes a severe data leakage problem. However, this needs to be convincingly described in the scientific literature, with data. Yes, this is a quite unbelievable case of a scientific-industrial scandal.

Slides

Pseudo-observations in a multistate setting

Speaker: Morten Overgaard, Aarhus University

Abstract: Regression analyses of how state occupation probabilities depend on baseline covariates in multistate settings can be performed using the pseudo-observation method, which involves calculating jack-knife pseudo-observations based on some estimator of the state occupation probability and using these as outcomes in the regression analysis.

We will consider such a pseudo-observation approach based on the Aalen-Johansen-derived estimator of the state occupation probabilities, covering some theory of why the approach would work, an algorithm for calculation, and some results of a simulation study.
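The jack-knife construction can be sketched in the simplest multistate model, the two-state alive/dead model, where the Aalen-Johansen estimator reduces to the Kaplan-Meier estimator. For an estimator θ̂ based on n subjects, the i-th pseudo-observation is n·θ̂ − (n−1)·θ̂(−i), where θ̂(−i) leaves out subject i. The code below is a naive leave-one-out illustration with assumed function names, not the calculation algorithm presented in the talk.

```python
import numpy as np

def kaplan_meier_at(times, events, t):
    """Kaplan-Meier estimate of P(T > t) from right-censored data."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    surv = 1.0
    # Multiply survival factors over the distinct event times up to t.
    for u in np.unique(times[events & (times <= t)]):
        at_risk = np.sum(times >= u)
        deaths = np.sum((times == u) & events)
        surv *= 1.0 - deaths / at_risk
    return surv

def pseudo_observations(times, events, t):
    """Jack-knife pseudo-observations for P(T > t): n*theta - (n-1)*theta_(-i)."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    n = len(times)
    full = kaplan_meier_at(times, events, t)
    keep = np.ones(n, dtype=bool)
    pseudo = np.empty(n)
    for i in range(n):
        keep[i] = False                       # leave subject i out
        loo = kaplan_meier_at(times[keep], events[keep], t)
        keep[i] = True
        pseudo[i] = n * full - (n - 1) * loo
    return pseudo

# Without censoring, the pseudo-observations reduce to the indicators 1(T_i > t).
times_complete = [1, 2, 3, 4, 5]
events_complete = [True] * 5
po_complete = pseudo_observations(times_complete, events_complete, t=2.5)

# With censoring (event indicator False), the values are no longer 0/1.
times_cens = [1, 2, 3, 4, 5]
events_cens = [True, False, True, False, True]
po_cens = pseudo_observations(times_cens, events_cens, t=3.5)
```

The resulting values would then be used as outcomes in a regression on baseline covariates, as described in the abstract.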

Slides

Reviewing data embeddings to structured algebras

Speaker: Jacob Hjelmborg, University of Southern Denmark

Abstract: We focus on a recent proposal of a data analysis framework that involves the theory of operator algebras. We will be reviewing embeddings of data into Hilbert C*-modules that have the reproducing kernel property. The classic idea of embedding data into some higher-dimensional space to find solutions to problems, e.g., separation of features, is key and has a modern aspect through extended kernel-based methods.

Does this have a relation to health science questions? We intend to relate to the aims of the Nordic twin study of cancer.

Slides

How to generate realistic, non-personal synthetic health data?

Speaker: Martin Bøgsted, Center for Clinical Data Science, Aalborg University and Aalborg University Hospital

Abstract: Under GDPR, a person’s health data are subject to strict governance and access rules. To share such data with a wider audience for educational, research, and development purposes, it has been suggested to generate synthetic data, i.e., anonymous data (which by definition are outside the scope of GDPR) that preserve the properties of the real dataset. Synthetic data are typically obtained by sampling from a noisy model of the data. There is a clear privacy/utility tradeoff for such data: if we add much noise, the privacy is high but the utility is low, and if we add no noise, the privacy is low but the utility is high. Thus, there is a clear need to establish ways of measuring the privacy of synthetic data generation. Differential privacy is considered the state-of-the-art privacy property. However, this measure does not directly measure the individual’s privacy risk. As an alternative, Bayesian privacy has been suggested, which measures the individual’s posterior risk given the synthetic data and the generating mechanism. However, current implementations scale poorly with regard to both construction and computation time. In this talk, the notions of differential and Bayesian privacy will be defined and there will be a discussion of the open problems in large-scale adoption of Bayesian privacy measures. This work is a part of the project “Synthetic Health Data: Ethical Development and Deployment via Deep Learning approaches (SE3D)” supported by the Novo Nordisk Foundation’s Data Science Collaborative Research Programme.
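A toy sketch of "sampling from a noisy model" and of the privacy/utility tradeoff, using the standard Laplace mechanism on a histogram of a single categorical variable. This is an illustration only and not the deep learning approach of the SE3D project; the data, the categories, and the function name `dp_synthetic_categorical` are hypothetical. It assumes the add-or-remove-one-record notion of neighbouring datasets, under which the histogram has L1 sensitivity 1, so Laplace noise with scale 1/ε makes the noisy counts ε-differentially private; sampling from them is post-processing.

```python
import numpy as np

def dp_synthetic_categorical(data, categories, epsilon, n_synth, rng):
    """Sample synthetic categorical records from a Laplace-noised histogram."""
    counts = np.array([np.sum(data == c) for c in categories], dtype=float)
    # Laplace mechanism: noise scale = sensitivity / epsilon, with sensitivity 1.
    noisy = counts + rng.laplace(loc=0.0, scale=1.0 / epsilon, size=len(categories))
    # Post-processing: clip negative counts and normalise to probabilities.
    noisy = np.clip(noisy, 0.0, None)
    if noisy.sum() == 0:
        probs = np.full(len(categories), 1.0 / len(categories))
    else:
        probs = noisy / noisy.sum()
    synthetic = rng.choice(categories, size=n_synth, p=probs)
    return synthetic, probs

rng = np.random.default_rng(1)
categories = np.array(["A", "B", "C"])

# Hypothetical "real" data with true category frequencies 0.6 / 0.3 / 0.1.
real = rng.choice(categories, size=1000, p=[0.6, 0.3, 0.1])

# Small epsilon: more noise, stronger privacy guarantee, lower utility.
synth_private, probs_private = dp_synthetic_categorical(real, categories, 0.1, 1000, rng)
# Large epsilon: less noise, weaker privacy guarantee, higher utility.
synth_loose, probs_loose = dp_synthetic_categorical(real, categories, 10.0, 1000, rng)

print("sampling probabilities, epsilon = 0.1:", probs_private)
print("sampling probabilities, epsilon = 10 :", probs_loose)
```

Note that ε bounds the worst-case influence of any one record on the output distribution; as the abstract points out, it does not directly express an individual's posterior risk, which is what Bayesian privacy measures aim at.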