Imputation of Missing Data through Bayesian Approach

Pratibha Jalui
Cytel Statistical Software & Services Pvt. Ltd, Pune
Email: [email protected]

Reetabrata Bhattacharyya
Tata Consultancy Services Limited, Mumbai
Email: [email protected]

Pratibha (Cytel) & Reetabrata (TCS) Imputation of Missing Data through Bayesian ConSPIC, 8th - 10th Oct, 2015 1 / 24

Overview

1 Introduction and Background

2 Mechanisms

3 Motivation

4 Objective

5 Data and Methods

6 Results

7 Conclusion and Discussion

Why talk about Missing Data?

Randomized clinical trials are the primary tool for evaluating new medical interventions.

More than $7 billion is spent every year on evaluating drugs, devices, and biologics, and a substantial percentage of the outcomes of interest is often missing.

Missingness reduces the benefit provided by randomization and introduces potential biases in the comparison of the treatment groups.

As many as 65% of articles in PubMed journals do not report how missing data were handled.

Health authorities encourage better approaches to handling missing data.

"The only really good solution to the missing data problem is not to have any" - Paul Allison

How do we define Missing Data?

Missing Data

Data that were planned to be recorded but are not available.

Broadly, there are two types of missing data:

Monotone missing data

All data for a subject are missing after a certain time-point.

Serious problem in interpreting the results of a trial.

Non-monotone or intermediate missing data

A subject misses a visit but contributes data at later visits.
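The two definitions above can be checked mechanically for a single subject. The following is a minimal sketch, assuming each subject's visits are stored in time order with `None` marking a missing value; the function name and the example scores are hypothetical.

```python
def missing_pattern(visits):
    """Classify one subject's visit sequence; None marks a missing value."""
    missing = [v is None for v in visits]
    if not any(missing):
        return "complete"
    first_missing = missing.index(True)
    # Monotone: once a value is missing, every later value is missing too.
    if all(missing[first_missing:]):
        return "monotone"
    # Otherwise the subject came back and contributed data at a later visit.
    return "non-monotone"


print(missing_pattern([5, 4, 4, None, None]))  # dropout after the third visit
print(missing_pattern([5, None, 4, 3, 3]))     # one skipped visit
print(missing_pattern([5, 4, 4, 3, 3]))        # nothing missing
```

A data set has a monotone pattern only if every subject with missing values falls in the first category.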

Types of Missingness

1. Missing Completely at Random (MCAR)

Missingness is independent of both observed and unobserved data.

Example:
• A patient moves to another city for non-health reasons. Patients who drop out of a study for this reason could be considered a random and representative sample from the total study population.

2. Missing at Random (MAR)

Missingness depends only on observed data; given the observed data, it does not depend on unobserved data.

Example:
• Dropout due to a previous lack of efficacy could be MAR, because it is in some sense predictable from the observed data in the model.
• Men may be more likely than women to decline to answer some questions.

3. Missing Not At Random (MNAR)

Missingness depends on unobserved data, even after accounting for the observed data.

Difficult to model.

Example:
• After a series of visits with good outcomes, a patient may drop out due to lack of efficacy. In this situation the analysis model based on the observed data, including relevant covariates, is likely to continue to predict a good outcome, but it is usually unreasonable to expect the patient to continue to derive benefit from treatment.
• Individuals with very high incomes are more likely to decline to answer questions about their own income.
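The difference between the three mechanisms can be seen in a small simulation. This is an illustrative sketch only: the data-generating model (a baseline score that is always observed, and an outcome equal to 0.8 times baseline plus noise), the missingness probabilities and all names are assumptions made for the example, not part of the study.

```python
import math
import random
from statistics import mean


def simulate(mechanism, n=20000, seed=1):
    """Return (complete-case bias, mean residual among observed outcomes)."""
    rng = random.Random(seed)
    all_outcomes, seen_outcomes, seen_residuals = [], [], []
    for _ in range(n):
        baseline = rng.gauss(0.0, 1.0)        # always observed
        noise = rng.gauss(0.0, 0.6)
        outcome = 0.8 * baseline + noise      # may go missing
        if mechanism == "MCAR":
            p_missing = 0.3                              # same for everyone
        elif mechanism == "MAR":
            p_missing = 1 / (1 + math.exp(-baseline))    # driven by observed data
        elif mechanism == "MNAR":
            p_missing = 1 / (1 + math.exp(-outcome))     # driven by the outcome itself
        else:
            raise ValueError(mechanism)
        all_outcomes.append(outcome)
        if rng.random() >= p_missing:
            seen_outcomes.append(outcome)
            seen_residuals.append(noise)      # outcome minus its baseline prediction
    bias = mean(seen_outcomes) - mean(all_outcomes)
    return bias, mean(seen_residuals)


for mechanism in ("MCAR", "MAR", "MNAR"):
    bias, residual = simulate(mechanism)
    print(f"{mechanism}: complete-case bias {bias:+.3f}, mean residual {residual:+.3f}")
```

Under MCAR the observed cases are representative. Under MAR the raw complete-case mean shifts, but the residual after conditioning on baseline stays near zero, so a model that uses baseline can recover the truth. Under MNAR the residual itself is shifted, which is why this mechanism is difficult to model from the observed data alone.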

The Effect of Missing Values on Analysis and Interpretation

The following problems may affect the interpretation of trial results when some data are missing.

Power and Variability
• The power of a trial increases if the sample size is increased or if the variability of the outcomes is reduced. Missing data reduce the number of evaluable patients and can therefore lower power.

Bias
• The risk of bias in the estimate of the treatment effect from the observed data depends on the relationship between missingness, treatment and outcome.
• The type of bias that can critically affect interpretation depends on whether the objective of the study is to show a difference or to demonstrate non-inferiority/equivalence.
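The effect of lost sample size on power can be made concrete with a normal-approximation calculation. This is a rough sketch: the assumed true difference (1.0), the common standard deviation (2.5) and the two-sided 10% significance level are hypothetical inputs chosen for illustration, while the group sizes are the randomized (43, 57) and Week 16 (28, 44) counts reported in Table 1.

```python
from statistics import NormalDist


def power_two_sample(delta, sd, n1, n2, alpha=0.10):
    """Approximate power of a two-sided z-test for a difference in two means."""
    z = NormalDist()
    se = sd * (1 / n1 + 1 / n2) ** 0.5       # standard error of the difference
    z_crit = z.inv_cdf(1 - alpha / 2)        # two-sided critical value
    shift = abs(delta) / se                  # standardized true difference
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Hypothetical effect size and SD; group sizes before and after dropout.
print(round(power_two_sample(1.0, 2.5, 43, 57), 3))  # all randomized patients
print(round(power_two_sample(1.0, 2.5, 28, 44), 3))  # Week 16 completers only
print(round(power_two_sample(1.0, 2.0, 28, 44), 3))  # completers, lower variability
```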

Goals of Statistical Analysis with Missing Data

Goals of Statistical Analysis:

Minimize bias

Maximize use of available information

Obtain appropriate estimates of uncertainty

Key points to keep in mind:

Research question (i.e. the hypothesis under investigation)

Information in the observed data

Reason(s) for missing data

As statisticians/programmers we need to:

Consult with investigators to design the study so as to minimize missing data/information, postulate plausible missingness mechanisms, perform a valid analysis and interpret the results.

What do the Regulatory Bodies (FDA/EMEA) recommend?

Avoid Missing Data wherever possible

The protocol should address the potential impact and treatment of anticipated missing data

Design strategies to minimize treatment and analysis dropouts

Continue to collect information on key outcomes from participants who discontinue; record it and use it in the analysis

Set a minimum rate of completeness for the primary outcome(s), based on similar past trials

Specify statistical methods and assumptions for handling missing data in protocols in a way that clinicians can understand

Focused efforts on training staff

Avoid single imputation methods like LOCF and BOCF as the primary approach to the treatment of missing data unless the underlying assumptions are scientifically justified.

Parametric models and random effects models are to be used with caution, with all assumptions clearly stated and accompanied by goodness-of-fit procedures.

Weighted generalized estimating equations methods should be more widely used as an alternative to parametric modeling.

When substantial missing data are anticipated, auxiliary information should be collected.

Sensitivity analyses are mandated as part of the primary reporting of findings from clinical trials.

Treatments for Missing Data: Traditional Approach

Listwise Deletion
• Omit cases with missing data and run the analyses on what remains.

Simple Imputation Method - Last Observation Carried Forward (LOCF)
• A subject's missing responses are set equal to their last observed response; the method was developed under a Missing Completely At Random (MCAR) framework.
• Usually used in longitudinal (repeated measures) studies of continuous outcomes.

Simple Imputation Method - Baseline Observation Carried Forward (BOCF)
• Similar to LOCF, but here a patient's missing responses are assumed equal to their baseline observed response.

Empirically developed models
• Unconditional and conditional mean imputation
• Best or worst case imputation
• Regression methods and hot-deck imputation
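The two carried-forward rules described above can be sketched in a few lines. This is a minimal illustration, assuming one subject's scores are listed in visit order with the baseline first and `None` marking a missing visit; the function names and example scores are hypothetical.

```python
def locf(visits):
    """Last Observation Carried Forward: fill each gap with the latest observed value."""
    filled, last = [], None
    for value in visits:
        if value is not None:
            last = value
        filled.append(last)   # stays None if nothing has been observed yet
    return filled


def bocf(visits):
    """Baseline Observation Carried Forward: fill each gap with the baseline value."""
    baseline = visits[0]
    return [baseline if value is None else value for value in visits]


scores = [5, 4, None, 3, None, None]   # baseline, then five post-baseline visits
print(locf(scores))  # [5, 4, 4, 3, 3, 3]
print(bocf(scores))  # [5, 4, 5, 3, 5, 5]
```

Both rules replace each missing value with a single fixed number, so the filled-in data carry no extra uncertainty; this is the property that multiple imputation is designed to avoid.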

Treatments for Missing Data: Modern Approach

Full Information Maximum Likelihood (FIML) model

• A pragmatic missing data estimation approach for structural equation modeling.
• Produces unbiased parameter estimates and standard errors under MAR and MCAR.
• Unlike the maximum likelihood method, FIML uses all available information in all observations.

Mixed-Effect Model Repeated Measure (MMRM) model

• Applies a Restricted Maximum Likelihood (REML) solution to longitudinal (repeated measures) analyses under the MAR assumption.
• Missing data are not explicitly imputed, so there is no effect on the other scores from the same patient.

Objective

1 To examine the multiple imputation (MI) approach, specifically the Bayesian Markov Chain Monte Carlo (MCMC) random sampling method, for the analysis of incomplete data.

2 To compare the analysis of the original data and the last observation carried forward (LOCF) and baseline observation carried forward (BOCF) imputation approaches against MI through the Bayesian MCMC random sampling method.

Data : Analytical Background

Testing of treatment (Hypothesis of Interest)

To evaluate the efficacy of Treatment A at Week 16 for change in Vitreous Haze (VH) score.

Statistical Analysis Plan

The change from baseline to Week 16 in VH score is compared between treatment groups using an Analysis of Covariance (ANCOVA) model.

The model includes the fixed categorical effects of treatment group, visit and treatment-by-visit interaction, as well as the fixed continuous covariate of baseline VH.

The model provides adjusted least squares (LS) mean estimates at Week 16 for both treatment groups, the difference between the means, and the corresponding standard error (SE), confidence interval (CI) and p-value.
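One way to write the model described above is sketched below. The notation is an assumption introduced here for illustration (the slides give no formula): $\Delta Y_{ij}$ is the change from baseline in VH score for patient $i$ at visit $j$, $g(i)$ is the treatment group of patient $i$, and $X_i$ is the baseline VH score.

```latex
\Delta Y_{ij} = \mu + \alpha_{g(i)} + \tau_j + (\alpha\tau)_{g(i)\,j} + \beta X_i + \varepsilon_{ij}
```

Under this parameterization the Week 16 treatment difference reported as "LS mean difference vs. Placebo" corresponds to

```latex
\left(\alpha_{A} + (\alpha\tau)_{A,16}\right) - \left(\alpha_{P} + (\alpha\tau)_{P,16}\right)
```

because both LS means are evaluated at the same baseline value, so the term $\beta X_i$ cancels.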

Data: Simulation

A hypothetical clinical trial efficacy data set is simulated as input for the MCMC method of missing data imputation.

100 patients are considered, with an amount of missing data similar to that observed in our real data set.

The missing data pattern is created randomly.

This is an exhaustive simulation study just to demonstrate the application of the Bayesian method for imputing missing values.

The data set is simulated to obtain a more complete comparison of the three methods (BOCF, LOCF and MI).

Methods: Analytical Background

Bayesian Approach

In Bayesian inference, information about unknown parameters is expressed in the form of a posterior probability distribution.

Markov Chain Monte Carlo (MCMC)

A Markov chain is a sequence of random variables in which the distribution of each element depends on the value of the previous one.

Through MCMC, we can simulate the entire joint posterior distribution of the unknown quantities and obtain simulation-based estimates of posterior parameters of interest.

It is a collection of methods for simulating random draws from nonstandard distributions via Markov chains.

By repeatedly simulating steps of the chain, it simulates draws from the distribution of interest.
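The idea of drawing from a distribution by stepping a Markov chain can be illustrated with a random-walk Metropolis sampler, one simple member of the MCMC family. This is only a sketch of the general principle, not the data augmentation algorithm used for the imputation later; the target (a standard normal, chosen so the answer is easy to check), the step size and all names are assumptions for the example.

```python
import math
import random
from statistics import mean, variance


def metropolis(log_density, start, n_draws, step=1.0, seed=0):
    """Random-walk Metropolis: each draw depends only on the previous one."""
    rng = random.Random(seed)
    current, current_lp = start, log_density(start)
    draws = []
    for _ in range(n_draws):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_density(proposal)
        # Accept with probability min(1, density ratio); otherwise stay put.
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        draws.append(current)
    return draws


def log_standard_normal(x):
    """Log-density of N(0, 1) up to an additive constant."""
    return -0.5 * x * x


chain = metropolis(log_standard_normal, start=5.0, n_draws=20000)
kept = chain[1000:]                       # discard burn-in iterations
print(round(mean(kept), 2), round(variance(kept), 2))
```

The chain starts far from the bulk of the target (at 5.0) and, after the burn-in iterations are discarded, its draws have roughly the mean and variance of the target distribution.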

Method: Data Augmentation (DA) Algorithm

Goal
• To have the iterates converge to the stationary distribution.
• To simulate an approximately independent draw of the missing values.

Assumption
• The data are assumed to come from a multivariate normal distribution.

Data augmentation is applied to Bayesian inference with missing data by repeating the following steps:

Step - 1
The imputation I-step:

• Given an estimated mean vector and covariance matrix, the I-step simulates the missing values for each observation independently.
• The I-step draws values for $Y_{i(mis)}$ from the conditional distribution of $Y_{i(mis)}$ given $Y_{i(obs)}$, where
  $Y_{i(mis)}$: the variables with missing values for observation $i$;
  $Y_{i(obs)}$: the variables with observed values for observation $i$.

Step - 2

The posterior P-step:

• The P-step simulates the posterior population mean vector and covariance matrix from the complete-sample estimates, using a non-informative prior.
• These new estimates are then used in the next I-step.
• The two steps are iterated long enough for the iterates to converge to their stationary distribution, and are then used to simulate an approximately independent draw of the missing values.

Summary

With current parameter estimate $\theta^{(t)}$ at the $t$-th iteration:

• The I-step draws $Y_{mis}^{(t+1)}$ from $P(Y_{mis} \mid Y_{obs}, \theta^{(t)})$.
• The P-step draws $\theta^{(t+1)}$ from $P(\theta \mid Y_{obs}, Y_{mis}^{(t+1)})$.

This creates a Markov chain $(Y_{mis}^{(1)}, \theta^{(1)}), (Y_{mis}^{(2)}, \theta^{(2)}), \ldots$, which converges in distribution to $P(Y_{mis}, \theta \mid Y_{obs})$.
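The alternation of I-step and P-step can be sketched for the simplest possible case. The example below is a deliberate simplification and not the SAS implementation: it assumes a single normally distributed variable (rather than the multivariate normal model above) with the non-informative prior $p(\mu, \sigma^2) \propto 1/\sigma^2$, so that $\theta = (\mu, \sigma^2)$. The function name, data and chain length are hypothetical.

```python
import random
from statistics import mean


def data_augmentation(y, n_iter=2000, burn_in=200, seed=0):
    """Univariate-normal data augmentation; None in y marks a missing value."""
    rng = random.Random(seed)
    observed = [v for v in y if v is not None]
    n, n_mis = len(y), len(y) - len(observed)
    # Starting values for theta = (mu, sigma2) from the observed cases.
    mu = mean(observed)
    sigma2 = sum((v - mu) ** 2 for v in observed) / (len(observed) - 1)
    mu_draws, imputations = [], []
    for t in range(n_iter):
        # I-step: draw the missing values given the current parameters.
        y_mis = [rng.gauss(mu, sigma2 ** 0.5) for _ in range(n_mis)]
        completed = observed + y_mis
        # P-step: draw new parameters from their complete-data posterior.
        ybar = mean(completed)
        ss = sum((v - ybar) ** 2 for v in completed)
        sigma2 = ss / rng.gammavariate((n - 1) / 2, 2.0)   # ss / chi-square(n - 1)
        mu = rng.gauss(ybar, (sigma2 / n) ** 0.5)
        if t >= burn_in:
            mu_draws.append(mu)
            imputations.append(y_mis)
    return mu_draws, imputations


# Hypothetical scores for eight patients, two of them missing.
scores = [4.0, 5.0, 3.0, None, 6.0, 5.0, None, 4.0]
mu_draws, imputations = data_augmentation(scores)
print(round(mean(mu_draws), 2))   # posterior mean of mu
print(imputations[-1])            # one simulated draw of the two missing values
```

In multiple imputation, a small number of the stored draws of the missing values, taken far enough apart in the chain to be approximately independent, would serve as the imputed data sets.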

Method: Application in SAS

Multiple Imputation step 1

The MCMC method is used in conjunction with the IMPUTE=MONOTONE option to create an imputed data set with a monotone missing pattern.

Variables include treatment group and VH scores at baseline and post-baseline analysis visits.

This method implies that VH scores are analysed as continuous variables and treatment group is a dummy variable.

SAS Code

    /* Step 1: fill only the intermittent gaps so that each imputed data set is monotone */
    proc mi data=dset1 out=MIstep1 seed=27160 nimpute=1000 noprint;
        mcmc impute=monotone chain=multiple;
        var armn baseline week2 week4 week6 week8 week10
            week12 week14 week16;
    run;

Multiple Imputation step 2

• Missing data are imputed with a regression method, using the monotone data set from step 1.
• Variables include treatment group, stratification variables and VH scores at baseline and post-baseline analysis visits.
• This method implies that VH scores are analysed as continuous variables.
• The output data set from step 1 (after rounding) is used as the input data set for step 2.
• Only 1 imputation is made in step 2 for each imputation from step 1.

SAS Code

    /* Step 2: regression imputation of the remaining (monotone) missing values */
    proc mi data=MIstep1r out=MIstep2 seed=54320 nimpute=1 noprint;
        by _Imputation_;   /* impute once within each step-1 imputation */
        var armn stratum baseline week2 week4 week6 week8 week10
            week12 week14 week16;
        class armn stratum;
        monotone reg;
    run;
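After the two imputation steps, the analysis model is fitted to each completed data set and the results are combined into a single estimate. The slides do not show this step; the sketch below assumes the standard combining rules of Rubin (1987), with hypothetical function names and made-up inputs used purely to show the arithmetic.

```python
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Combine per-imputation estimates and standard errors with Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)                      # pooled point estimate
    within = mean(se ** 2 for se in std_errors)   # average within-imputation variance
    between = variance(estimates)                 # between-imputation variance
    total = within + (1 + 1 / m) * between        # total variance
    return pooled, total ** 0.5


# Made-up treatment differences and SEs from three imputed data sets.
estimate, se = pool_rubin([0.6, 0.7, 0.8], [0.5, 0.5, 0.5])
print(round(estimate, 3), round(se, 3))
```

The between-imputation term is what allows the pooled standard error to reflect the uncertainty about the missing values, which single imputation methods such as LOCF and BOCF ignore.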

Result: Tabular representation of efficacy endpoint

Table 1: Change from baseline in VH score to Week 16, MITT population

Vitreous Haze (Miami 9-step scale)      Placebo (N=43)         Treatment A (N=57)

Baseline
  Number                                43                     57
  Mean (SD)                             4.47 (1.96)            4.68 (2.49)
  Median                                5.00                   5.00
  Min : Max                             1.0 : 8.0              1.0 : 8.0

Week 16
  Number                                28                     44
  Mean (SD)                             4.18 (2.47)            4.64 (2.30)
  Median                                3.50                   5.00
  Min : Max                             1.0 : 8.0              1.0 : 8.0

Change from Baseline
  Number                                28                     44
  Mean (SD)                             -0.11 (3.58)           -0.25 (3.36)
  Median                                0.50                   -1.00
  Min : Max                             -7.0 : 6.0             -7.0 : 6.0

Analysis: Original Data
  LS Means (SE)                         -0.42 (0.414)          0.06 (0.331)
  90% CI                                (-1.103 to 0.262)      (-0.485 to 0.604)
  LS Mean difference (SE) vs. Placebo                          0.48 (0.530)
  90% CI                                                       (-0.393 to 1.354)
  p-value                                                      0.3653

Analysis: BOCF
  LS Means (SE)                         -0.18 (0.340)          -0.10 (0.295)
  90% CI                                (-0.739 to 0.380)      (-0.590 to 0.383)
  LS Mean difference (SE) vs. Placebo                          0.08 (0.450)
  90% CI                                                       (-0.665 to 0.817)
  p-value                                                      0.8659

Analysis: LOCF
  LS Means (SE)                         0.15 (0.329)           0.15 (0.329)
  90% CI                                (-0.395 to 0.689)      (-0.197 to 0.745)
  LS Mean difference (SE) vs. Placebo                          0.13 (0.436)
  90% CI                                                       (-0.591 to 0.845)
  p-value                                                      0.7704

Analysis: Imputation (Bayesian)
  LS Means (SE)                         -0.78 (0.380)          -0.11 (0.317)
  90% CI                                (-1.527 to -0.335)     (-0.729 to 0.515)
  LS Mean difference (SE) vs. Placebo                          0.67 (0.495)
  90% CI                                                       (-0.298 to 1.645)
  p-value                                                      0.1739

Conclusion and Discussion

From Table 1, the LS mean change in VH score from baseline to Week 16 is higher in the Treatment A group than in the placebo group, and the difference comes closest to statistical significance under Bayesian imputation, although it does not reach it (p = 0.1739).

Compared with the original data (p = 0.3653), the p-value improves under Bayesian imputation (0.1739), whereas it is larger under LOCF (0.7704) and BOCF (0.8659).

The Bayesian approach lends itself naturally to different choices of prior distributions encoding assumptions about the missing data process.

It offers the possibility of including informative prior information about the missing data process, but the models can become computationally challenging.

The procedure can be used in the data preparation steps before calling the analysis model, to simplify the clinical efficacy data analysis process.

References

Allison PD (2000). Multiple Imputation for Missing Data: A Cautionary Tale. Sociological Methods and Research, 28: 301-309.

Barnard J, Rubin DB (1999). Small-Sample Degrees of Freedom with Multiple Imputation. Biometrika, 86: 948-955.

National Research Council. The Prevention and Treatment of Missing Data in Clinical Trials. The Panel on Handling Missing Data in Clinical Trials.

Rubin DB (1976). Inference and Missing Data. Biometrika, 63: 581-592.

Rubin DB (1987). Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons.

Rubin DB (1996). Multiple Imputation After 18+ Years. Journal of the American Statistical Association, 91: 473-489.

Yuan Y (2011). Multiple Imputation Using SAS Software. Journal of Statistical Software, 45(6): 1-25.
