This is a walk-through of using the code fcn_median_estimation.R
to deductively estimate the median outcome of a treatment arm under an ignorable assignment design (see paper), and to obtain an estimated standard error.
To run this example, download and unzip the compressed file code+example.zip. From there, open the R file Instruction_code.R and follow/run the steps below.
We use data_example_252pts.csv
as the data example for demonstration, and we are interested in estimating the median outcome of a specific treatment under the ignorable assignment design. The CSV file contains 252 rows, i.e. 252 observations. This data set has several columns of baseline variables \(X\), a column for the outcome of interest \(Y\), a column for the missingness indicator \(R\) (\(R = 1\) if and only if \(Y\) is observed), and a column for the estimated propensity score \(e\) (\(e(X)=P\{R=1|X\}\)).
# load data and source code
library(bootstrap) # for jackknife standard error
## Warning: package 'bootstrap' was built under R version 3.1.1
dt <- read.csv("data_example_252pts.csv")
source("fcn_median_estimation.R")
names(dt)
## [1] "age_18_24" "age_25_34" "age_35_44"
## [4] "age_45_54" "age_55" "female"
## [7] "race_afraican_amer" "race_white" "race_asian_amer"
## [10] "race_other" "educ_highsch_below" "educ_college"
## [13] "educ_graduate" "insu_prvt_emply" "insu_prvt_self"
## [16] "insu_public" "insu_other" "drug_insu"
## [19] "seve_mild_inter" "seve_mild_pers" "seve_mode_pers"
## [22] "seve_seve_pers" "num_comorb" "sf36_phy_scr"
## [25] "sf36_men_scr" "r_vec" "y_vec"
## [28] "e_vec"
In this dataset, columns 1-25 are baseline variables; column 26, r_vec, is the missingness indicator; column 27, y_vec, is the outcome of interest; and column 28, e_vec, is the estimated propensity score. Note that the estimated propensity score must be specified by the user (e.g., via logistic regression or a more flexible method) before calling the main function.
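Although the example CSV already contains e_vec, the sketch below shows one hypothetical way to obtain it, assuming a logistic regression of r_vec on all baseline covariates (the names ps_fit and e_hat are illustrative only).
# Sketch only: estimate the propensity score P(R = 1 | X) with a logistic
# regression of the missingness indicator on all baseline covariates.
# The example CSV already ships with e_vec, so this step is optional.
ps_terms <- setdiff(names(dt), c("r_vec", "y_vec", "e_vec"))
ps_formula <- as.formula(paste("r_vec ~", paste(ps_terms, collapse = " + ")))
ps_fit <- glm(ps_formula, data = dt, family = binomial)
e_hat <- predict(ps_fit, type = "response") # fitted P(R = 1 | X) for each patient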
The main function for median estimation is deduct_est_median(). We first specify the necessary components to be passed into this function.
# Specify baseline covariate data frame
x_df <- subset(dt, select = -c(r_vec, y_vec, e_vec))
# Estimate working model of expected Y given X
# Here we use linear regression, using all the patients with Y observed
terms <- paste(colnames(x_df), collapse = "+")
formula <- as.formula(paste("y_vec~", terms)) # just an easier way to throw
# all the X into the regression model
print(formula)
## y_vec ~ age_18_24 + age_25_34 + age_35_44 + age_45_54 + age_55 +
## female + race_afraican_amer + race_white + race_asian_amer +
## race_other + educ_highsch_below + educ_college + educ_graduate +
## insu_prvt_emply + insu_prvt_self + insu_public + insu_other +
## drug_insu + seve_mild_inter + seve_mild_pers + seve_mode_pers +
## seve_seve_pers + num_comorb + sf36_phy_scr + sf36_men_scr
fit_glm <- lm(formula, data = subset(dt, r_vec == 1))
# Specify the initial working model coefficients
beta_init <- coefficients(fit_glm)
print(beta_init)
## (Intercept) age_18_24 age_25_34
## 1.22401 -1.81198 -3.14069
## age_35_44 age_45_54 age_55
## -2.17348 -1.74510 NA
## female race_afraican_amer race_white
## -0.32059 -0.31954 -0.56521
## race_asian_amer race_other educ_highsch_below
## -0.20046 NA -3.27940
## educ_college educ_graduate insu_prvt_emply
## -2.42679 NA -0.30452
## insu_prvt_self insu_public insu_other
## 3.55544 NA NA
## drug_insu seve_mild_inter seve_mild_pers
## -0.59548 -0.90503 -0.05848
## seve_mode_pers seve_seve_pers num_comorb
## -0.38097 NA 0.16039
## sf36_phy_scr sf36_men_scr
## 0.01305 0.02061
# Specify the form of the distribution of Y given X
# Here we assume Y|X follows a normal distribution with mean X*beta and standard deviation 1.
py_vec_fcn <- function(t, x, beta){
  pnorm(t, mean = x %*% beta[-1] + beta[1], sd = 1) # beta[1] is the intercept
}
py_vec_fcn needs to be a function that takes three arguments: 1) a scalar \(t\); 2) an \(n \times m\) matrix \(X\); 3) a regression coefficient vector \(\beta\). It returns a vector of \(P\{Y \leq t | X, \beta\}\) of length \(n\). The conditional distribution of \(Y\) given \(X\) is specified by the user; here we assume \(Y|X \sim N(X\beta,1)\).
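As one illustrative variation (not required by the function), the fixed sd = 1 could be replaced by the residual standard deviation of the working linear model; py_vec_fcn_alt below is a hypothetical name and could be passed to deduct_est_median() in place of py_vec_fcn.
# Illustrative variation: plug in the residual standard deviation from the
# working linear model fit_glm instead of fixing sd = 1
sigma_hat <- summary(fit_glm)$sigma
py_vec_fcn_alt <- function(t, x, beta){
  pnorm(t, mean = x %*% beta[-1] + beta[1], sd = sigma_hat)
}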
Note that, in doing the estimation, the function will first automatically delete any NA's in beta_init (which in this example result from linearly dependent columns in x_df) and also delete the corresponding redundant columns in x_df. By default, the function frees up the intercept in beta_init.
est <- deduct_est_median(x_df, dt$r_vec, dt$y_vec,
                         dt$e_vec, py_vec_fcn,
                         beta_init)
## Warning: There are NA's in beta_init. Deleted them with corresponding columns in x_df.
## Assuming linear term x*beta in y_vec_fcn.
## The coefficient to be free-up is (Intercept)
##
## Searching for delta to make sum(EIF) zero...
## delta: -0.391892; sum(EIF): 0.207498; iter: 13
## Deductive median estimate: -1.842718
print(est)
## [1] -1.843
To free up a different coefficient in beta_init, the user can specify the free_idx argument with the name of the variable whose coefficient is to be freed up.
deduct_est_median(x_df, dt$r_vec, dt$y_vec,
                  dt$e_vec, py_vec_fcn,
                  beta_init, free_idx = "sf36_phy_scr") # sf36_phy_scr is continuous
## Warning: There are NA's in beta_init. Deleted them with corresponding columns in x_df.
## Assuming linear term x*beta in y_vec_fcn.
## The coefficient to be free-up is sf36_phy_scr
##
## Searching for delta to make sum(EIF) zero...
## delta: -0.008014; sum(EIF): 10.786362; iter: 10
## Deductive median estimate: -1.843514
deduct_est_median(x_df, dt$r_vec, dt$y_vec,
                  dt$e_vec, py_vec_fcn,
                  beta_init, free_idx = "female") # female is binary
## Warning: There are NA's in beta_init. Deleted them with corresponding columns in x_df.
## Assuming linear term x*beta in y_vec_fcn.
## The coefficient to be free-up is female
##
## Searching for delta to make sum(EIF) zero...
## delta: -0.605123; sum(EIF): -2.091792; iter: 28
## Deductive median estimate: -1.842679
We can ask the function to return the jackknife standard error of the estimate by specifying jack_se = TRUE in the arguments. This might take a while.
est_w_se <- deduct_est_median(x_df, dt$r_vec, dt$y_vec,
                              dt$e_vec, py_vec_fcn,
                              beta_init, jack_se = TRUE)
## Warning: There are NA's in beta_init. Deleted them with corresponding columns in x_df.
## Assuming linear term x*beta in y_vec_fcn.
## The coefficient to be free-up is (Intercept)
##
## Searching for delta to make sum(EIF) zero...
## delta: -0.391892; sum(EIF): 0.207498; iter: 13
## Deductive median estimate: -1.842718
##
## Calculating Jackknife standard error of the deductive estimator...
## This may take a while...
## Jackknife standard error: 1.513699
print(est_w_se)
## $estimate
## [1] -1.843
##
## $std_err
## [1] 1.514
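With the point estimate and its jackknife standard error, a normal-approximation 95% confidence interval can be formed in the usual way (a sketch, assuming a Wald-type interval is adequate here).
# Sketch: Wald-type 95% confidence interval based on the jackknife standard error
ci <- est_w_se$estimate + c(-1, 1) * qnorm(0.975) * est_w_se$std_err
print(ci)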
One can also use data_example_1260pts.csv, which consists of 1260 observations. This will take longer to run.
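Assuming the larger file has the same column layout as data_example_252pts.csv, the analysis can be repeated as sketched below.
# Sketch: rerun the analysis on the larger example, assuming identical columns
dt2 <- read.csv("data_example_1260pts.csv")
x_df2 <- subset(dt2, select = -c(r_vec, y_vec, e_vec))
fit_glm2 <- lm(formula, data = subset(dt2, r_vec == 1)) # reuse the same formula
beta_init2 <- coefficients(fit_glm2)
est2 <- deduct_est_median(x_df2, dt2$r_vec, dt2$y_vec,
                          dt2$e_vec, py_vec_fcn, beta_init2)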