This is a walk-through of using the code fcn_median_estimation.R to deductively estimate the median outcome of a treatment arm of an ignorable assignment design (see paper), and to obtain an estimated standard error.

To run this example, you will need to download and unzip the compressed file code+example.zip From there, open the R file Instruction_code.R and follow/run the steps below.

Data Preparation

We use data_example_252pts.csv as the data example for demonstration, and we are interested in estimating the median outcome of a specific treatment, under the ignorable assignment design. The CSV file contains 252 rows, i.e. 252 observations. This data set has several columns of baseline varialbe \(X\), a column of outcome of interest \(Y\), a column of missingness indicator \(R\) (\(R = 1\) if and only if \(Y\) is observed), and a column of estimated propensity score \(e\) (\(e(X)=P\{R=1|X\}\)).

# load data and source code
library(bootstrap) # for jackknife standard error
## Warning: package 'bootstrap' was built under R version 3.1.1
dt <- read.csv("data_example_252pts.csv")
source("fcn_median_estimation.R")
names(dt)
##  [1] "age_18_24"          "age_25_34"          "age_35_44"         
##  [4] "age_45_54"          "age_55"             "female"            
##  [7] "race_afraican_amer" "race_white"         "race_asian_amer"   
## [10] "race_other"         "educ_highsch_below" "educ_college"      
## [13] "educ_graduate"      "insu_prvt_emply"    "insu_prvt_self"    
## [16] "insu_public"        "insu_other"         "drug_insu"         
## [19] "seve_mild_inter"    "seve_mild_pers"     "seve_mode_pers"    
## [22] "seve_seve_pers"     "num_comorb"         "sf36_phy_scr"      
## [25] "sf36_men_scr"       "r_vec"              "y_vec"             
## [28] "e_vec"

In this dataset, column 1 - 25 are baseline variables; column 26 r_vec is missingness indicator; column 27 y_vec is outcome of interest; column 28 e_vec is the estimated propensity score. Note that the estimated propensity score needs to be specified by the user (either using logistic regression, or some fancier techniques), before calling the main function.

Specify arguments to be passed into the main function

The main function for median estimation is deduct_est_median(). We first specify the necessary components to be passed into this function.

# Specify baseline covariate data frame
x_df <- subset(dt, select = -c(r_vec, y_vec, e_vec))

# Estimate working model of expected Y given X
# Here we use linear regression, using all the patients with Y observed
terms <- paste(colnames(x_df), collapse = "+")
formula <- as.formula(paste("y_vec~", terms))  # just an easier way to throw 
                                               # all the X into the regression model
print(formula)
## y_vec ~ age_18_24 + age_25_34 + age_35_44 + age_45_54 + age_55 + 
##     female + race_afraican_amer + race_white + race_asian_amer + 
##     race_other + educ_highsch_below + educ_college + educ_graduate + 
##     insu_prvt_emply + insu_prvt_self + insu_public + insu_other + 
##     drug_insu + seve_mild_inter + seve_mild_pers + seve_mode_pers + 
##     seve_seve_pers + num_comorb + sf36_phy_scr + sf36_men_scr
fit_glm <- lm(formula, data = subset(dt, r_vec == 1))

# Specify the initial working model coefficients
beta_init <- coefficients(fit_glm)

print(beta_init)
##        (Intercept)          age_18_24          age_25_34 
##            1.22401           -1.81198           -3.14069 
##          age_35_44          age_45_54             age_55 
##           -2.17348           -1.74510                 NA 
##             female race_afraican_amer         race_white 
##           -0.32059           -0.31954           -0.56521 
##    race_asian_amer         race_other educ_highsch_below 
##           -0.20046                 NA           -3.27940 
##       educ_college      educ_graduate    insu_prvt_emply 
##           -2.42679                 NA           -0.30452 
##     insu_prvt_self        insu_public         insu_other 
##            3.55544                 NA                 NA 
##          drug_insu    seve_mild_inter     seve_mild_pers 
##           -0.59548           -0.90503           -0.05848 
##     seve_mode_pers     seve_seve_pers         num_comorb 
##           -0.38097                 NA            0.16039 
##       sf36_phy_scr       sf36_men_scr 
##            0.01305            0.02061
# Specify the form of distribution of Y given X
# Here we assume Y|X follows a normal distribution with mean X*beta and standard error 1.
py_vec_fcn = function(t, x, beta){
  pnorm(t, mean = x%*%beta[-1] + beta[1], sd = 1) # beta[1] is intercept
}

py_vec_fcn needs to be a function that takes in arguments: 1) scalar \(t\); 2) \(n \times m\) matrix \(X\); 3) regression coefficient \(\beta\). And it returns a vector of \(P\{Y \leq t | X, \beta\}\), which is of length \(n\). The probability distribution function of \(Y\) given \(X\) is determined by the user. Here we assume \(Y|X \sim N(X\beta,1)\).

Estimation of the median

Note that, in doing the estimation, the function will first automatically delete the possible NA’s in beta_init, which results from linearly dependent columns in x_df in this example - and also delete corresponding redundant columns in x_df. By default, the function will free-up the intercept in beta_init.

est <- deduct_est_median(x_df, dt$r_vec, dt$y_vec,
                         dt$e_vec, py_vec_fcn,
                         beta_init)
## Warning: There are NA's in beta_init. Deleted them with corresponding columns in x_df.
##   Assuming linear term x*beta in y_vec_fcn.
## The coefficient to be free-up is (Intercept)
## 
## Searching for delta to make sum(EIF) zero...
##     delta: -0.391892; sum(EIF): 0.207498; iter: 13
##     Deductive median estimate: -1.842718
print(est)
## [1] -1.843

To free-up other coefficient in beta_init, the user can specify free_idx argument with the desired variable name whose coefficient is intended to be freed-up.

deduct_est_median(x_df, dt$r_vec, dt$y_vec,
                dt$e_vec, py_vec_fcn,
                beta_init, free_idx = "sf36_phy_scr") # sf36_phy_scr is continuous
## Warning: There are NA's in beta_init. Deleted them with corresponding columns in x_df.
##   Assuming linear term x*beta in y_vec_fcn.
## The coefficient to be free-up is sf36_phy_scr
## 
## Searching for delta to make sum(EIF) zero...
##     delta: -0.008014; sum(EIF): 10.786362; iter: 10
##     Deductive median estimate: -1.843514
deduct_est_median(x_df, dt$r_vec, dt$y_vec,
                dt$e_vec, py_vec_fcn,
                beta_init, free_idx = "female") # female is binary
## Warning: There are NA's in beta_init. Deleted them with corresponding columns in x_df.
##   Assuming linear term x*beta in y_vec_fcn.
## The coefficient to be free-up is female
## 
## Searching for delta to make sum(EIF) zero...
##     delta: -0.605123; sum(EIF): -2.091792; iter: 28
##     Deductive median estimate: -1.842679

We can ask the function to give jackknife standard error of the estimate, by specifying jack_se = TRUE in the arguments. This might take a while.

est_w_se <- deduct_est_median(x_df, dt$r_vec, dt$y_vec,
                              dt$e_vec, py_vec_fcn,
                              beta_init, jack_se = TRUE)
## Warning: There are NA's in beta_init. Deleted them with corresponding columns in x_df.
##   Assuming linear term x*beta in y_vec_fcn.
## The coefficient to be free-up is (Intercept)
## 
## Searching for delta to make sum(EIF) zero...
##     delta: -0.391892; sum(EIF): 0.207498; iter: 13
##     Deductive median estimate: -1.842718
## 
## Calculating Jackknife standard error of the deductive estimator...
##     This may take a while...
##     Jackknife standard error: 1.513699
print(est_w_se)
## $estimate
## [1] -1.843
## 
## $std_err
## [1] 1.514

One can also use data_example_1260pts.csv, which consists 1260 observations. This will take longer time.