** HW 1 Solutions ** set more off ** change your working directory to where you have the afcr.raw dataset saved ** cd "/Users/haley/Dropbox/Desktop/LDA09/data" clear infile id time sqafcr group ptreat age using "afcr.raw" save "afcr.dta", replace ************************************************************* ** Problem 1 ************************************************ xtset id time * a) Describe the structure of the data set xtdes bysort time: sum sqafcr codebook group ptreat age xtsum age xtsum ptreat xttab group * b)visualize the longitudinal nature of the data by plotting indvidual trajectories *generate a smooth lowess curve lowess sqafcr time, gen(smth) nograph *make a spaghetti plot using all the data sort id time twoway line sqafcr time, pstyle(p15) connect(ascending) || line smth time, pstyle(p1) clwidth(thick) sort ||, ytitle("Square root of AFCR") *make a spaghetti plot using 10% of the subjects (15 subjects) ** reshape to wide format to get one row per subject reshape wide sqafcr smth, i(id) j(time) ** set seed to 345 (guarantee you can reproduce the same plot again) set seed 345 ** generate a random variable from uniform(0,1) distribution. gen random=uniform() ** reshape data back to long format. reshape long ** find the 10% quantile of the uniform(0,1) variable. centile random, centile(10) ** plot trajectories only for subjects with value of 'random' < 10th percentile. twoway line sqafcr time if random<.0498491, pstyle(p15) connect(ascending) || line smth time, pstyle(p1) clwidth(thick) sort ||, ytitle("Square root of AFCR") *ZAP plot ** use the sqafcr data (instead of residuals - which is another, valid approach) ** calculate median sqafcr for each subject. egen medsqafcr=median(sqafcr), by(id) ** Reshape data to wide format reshape wide egen maxmedsqafcr=max(medsqafcr) egen minmedsqafcr=min(medsqafcr) egen medmedsqafcr=median(medsqafcr) egen medsqafcr25=pctile(medsqafcr), p(25) egen medsqafcr75=pctile(medsqafcr), p(75) ** Reshape data to long format reshape long gen maxsqafcr=sqafcr if medsqafcr==maxmedsqafcr gen minsqafcr=sqafcr if medsqafcr==minmedsqafcr gen msqafcr=sqafcr if medsqafcr==medmedsqafcr gen sqafcr25=sqafcr if medsqafcr==medsqafcr25 gen sqafcr75=sqafcr if medsqafcr==medsqafcr75 ** Make a ZAP spaghetti plot using raw response data sort id time scatter sqafcr time, s(oh)|| line maxsqafcr minsqafcr msqafcr sqafcr25 sqafcr75 time, clwidth(medthick medthick medthick medthick medthick) connect(ascending)|| line smth time, pstyle(p1) clwidth(thick) sort ||, legend(lab(1 "sqafcr")) ytitle(Square root of AFCR) *PROBLEM : note that there are multiple individuals with the same quantiles of median sqafcr *and even using the connect(ascending) option of line is not helping us to plot the distinct *trajectories for each individual * identify the number of individuals at each of our quantiles of interest tab id maxsqafcr tab id minsqafcr tab id msqafcr tab id sqafcr25 tab id sqafcr75 * our biggest problem is with the median of the median sqafcr where we have three individuals * the trajectories for these individuals are not being properly plotted - they are being 'confused' * i.e., we don't have three distinct trajectories being plotted * the ids for these three individuals are 36, 117 and 131 - we'll plot msqafcr for each of these ids separately ** second try making a ZAP spaghetti plot using raw response data, fix the problem with msqafcr sort id time scatter sqafcr time, s(oh)|| line maxsqafcr minsqafcr sqafcr25 sqafcr75 time, clwidth(medthick medthick medthick medthick) connect(ascending)|| line msqafcr time if id==36, lcol(lavender) || line msqafcr time if id==117 , lcol(lavender) || line msqafcr time if id==131 , lcol(lavender) ||line smth time, pstyle(p1) clwidth(thick) sort ||, legend(lab(1 "sqafcr")) ytitle(Square root of AFCR) ** third try making a ZAP spaghetti plot using raw response data, fix the problem with sqafcr75 ** we see three trajectories - there should only be two sort id time scatter sqafcr time, s(oh)|| line maxsqafcr minsqafcr sqafcr25 time, clwidth(medthick medthick medthick medthick) connect(ascending)|| line msqafcr time if id==36, lcol(lavender) || line msqafcr time if id==117 , lcol(lavender) || line msqafcr time if id==131 , lcol(lavender) || line sqafcr75 time if id==84, lcol(magenta) || line sqafcr75 time if id==122, lcol(magenta) || line smth time, pstyle(p1) clwidth(thick) sort ||, legend(lab(1 "sqafcr")) ytitle(Square root of AFCR) * legend isn't that pretty, but now we are seeing the correct trajectories!! * c) explore the longitudinal structure of the data with respect to the primary scientific aim of the study *ignore age and prior treatment effects for now *grouped by treatment, offset a little bit to make error bars more visible xtgraph sqafcr, group(group) av(median) bar(iqr) offset(0.2) * d) Explore the correlation structure of the response variable **remove the effects of covariates, including time, group, ptreat, and age xi: reg sqafcr i.time group ptreat age predict afcrres1, resid autocor afcrres1 time id **calculate ACF confidence interval keep id time afcrres1 reshape wide afcrres1, i(id) j(time) pwcorr afcrres10-afcrres118, obs *number of independent pairs for lag 1 display 93+95+103+104+92+89 *number of independent pairs for lag 2 display 102+95+105+96+100 *number of independent pairs for lag 3 display 100+97+96+98 *number of independent pairs for lag 4 display 103+91+99 *number of independent pairs for lag 5 display 97+92 *number of independent pairs for lag 6 display 99 insheet using acf, names clear gen lag=_n gen int num = 576 in 1 replace num = 498 in 2 replace num = 391 in 3 replace num= 293 in 4 replace num= 189 in 5 replace num= 99 in 6 gen se = 1/sqrt(num) **plot acf with 95% CI gen lb = acf - 2*se gen ub = acf + 2*se replace ub = 1 if ub>1 gen ulag = lag*3 scatter acf lb ub ulag, ylabel(0 (0.5) 1) xlabel(3 (3) 18) sort ulag save "acf_temp.dta", replace ** Plot Variogram (more useful in the continuous time situation ** where it is hard to compute ACF directly) ** need to load the afcr data again ** use "afcr.dta", clear ** need longitudinal envionment for the variogram ** xtset id time ** neet to run variogram.ado to define this command ** ** need to run xtdiff.ado, ksmapprox.ado before running variogram.ado ** quietly xi: reg sqafcr i.time group ptreat age predict afcrres1, resid *plot the variogram variogram afcrres1 ** Calculate the ACF from variogram ** ** save current data set before infile the data for variogram from the ** "variogram" command above save "afcr_temp.dta", replace ** read in the variogram data generated by "variogram.ado" ** ** vario in the following command is the file name in which the variogram ** values are stored ** ** this file is in the working directory of STATA ** insheet using vario, names clear ** Make our own variogram plot: ** v is the sample variogram ** ** vsmth is the variogram which is the smooth average of sample variogram ** vary is the total variance, the horizontal line ** ** ulag is the time-difference ** twoway scatter v vsmth vary ulag if v<10, msymbol(p i i) connect(i l l) ** calculation of the ACF from variogram ** This is the way to estimate the ACF when you have continuous time data ** and you can't calculate the ACF directly from the data unless you group the data gen varioacf = 1 - vsmth/vary twoway line varioacf ulag, ylabel(-1(.2)1) yline(-1 0 1) ** we make the ylabel from -1 to 1 in order to allow negative correlation * sort ulag save "vgram_temp.dta", replace ** merge the variogram data with the acf data ** so that we can plot the variogram - based acf on the same plot as the ** acf that was calculated directly merge ulag using "acf_temp.dta" ** plot the estimates of the acf using the two different techniques ** no negative correlation, so just plot the postive range from 0 to 1 of the y-axis twoway line varioacf ulag || scatter acf lb ub ulag, ylabel(0(.2)1) yline(0 0 1) xlabel(3 (3) 18) ************************************************************* ** Problem 2 ************************************************ * a) Formulate a general mean model that includes, at minimum, an effect * for treatment group, for age, prior treatment, for time, and all two-way * interactions between time and the other three covariates use "afcr.dta", clear xtset id time gen aget=age*time gen group2 = group-1 gen gp2t = group2*time gen ptrtt = ptreat*time regress sqafcr time group2 ptreat age aget gp2t ptrtt predict afcrres, resid autocor afcrres time id * b) Using residuals from the model in (a), select a correlation model under which to analyze the data. xtgee sqafcr time group2 ptreat age aget gp2t ptrtt, corr(exch) xtgee sqafcr time group2 ptreat age aget gp2t ptrtt, corr(ar1) * c) Using GEE and your selected correlation structure from (b), test whether treatment * has an effect on AFCR, either at baseline and/or on the rate of change of AFCR over time ** Use the robust variance estimation by specifiying "robust" xtgee sqafcr time group2 ptreat age aget gp2t ptrtt, corr(exch) nmp robust xtcorr, compact estat wcorrelation test group2 gp2t test gp2t *d) Repeat (c) for the prior treatment variable instead of the treatment variable in the study. test ptreat ptrtt test ptrtt *e) Based on your results in (c) and (d), fit a reduced model xtgee sqafcr time group2 ptreat age aget gp2t, corr(exch) nmp robust *f) Refit your model from (e) xtgee sqafcr time group2 ptreat age aget gp2t, corr(exch) nmp xtgee sqafcr time group2 ptreat age aget gp2t, corr(ind) nmp xtgee sqafcr time group2 ptreat age aget gp2t, corr(ind) nmp robust * saturated model * xtgee sqafcr time group2 ptreat age aget gp2t, corr(uns)