Homework 3

  1. Read Chapter 8 of Tanenbaum
  2. [UPDATED 12/11] Download the Twitter dataset [93MB], which was collected from the Twitter public timeline between 2007-10-15 and 2007-12-04. The zip archive contains two files (a sketch for reading them into R and counting word usage over time appears after this list):
    • biostat778_meta.txt: contains the date of the message ("date"), a 40-character message identifier ("description"), a user name ("uid"), the message type ("type"), and the location of the user ("location"). All fields are separated by "|". A typical entry in this file is
      date:2007-10-15 04:00:46|description:71eed6f3ff09e4a5695b1de890230f267c383c32|uid:bouie|type:twitter|location:Manila, Philippines
      
    • biostat778_descriptions.txt: contains a 40-character message identifier (same as in the "description" field in the biostat778_meta.txt file) and the message text, separated by "|". A typical entry is
      93b0d242e70e528f9fdeac03c51dac2d52bae2aa|TSDivaDani: have to re-record Real Time because the sound kept popping and was inaudible. I'm sure picky since becoming a podcaster.
      
  3. Please work in two groups of 3:
    1. Haley, Nick, Yen-Yi
    2. Jessica, Peter, Hao
  4. Possible ideas to investigate (I'll try to post more as I think of them):
    • What are the trends in usage of a particular word/phrase (or regular expression) over time (a) in the twitterverse as a whole and (b) by user?
    • On a given day, what are the main ideas being twitter-ed? Can the ideas be separated into "central" ideas and "residual" messages?
    • Most of the messages on Twitter might be uninteresting. How can we identify messages that are interesting, unusual, or out of the mainstream?
  5. Presentations (~30 minutes) will be made in class on Tuesday December 18, 2007.
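  Below is a minimal sketch of how the two files might be read into R and combined, and how usage of a word could be counted per day (relevant to the first idea above). It assumes the zip archive has been extracted into the working directory and that every line follows the formats shown in item 2; the parsing may need adjusting once you inspect the raw files. The example word "podcast" is arbitrary.
    ## Read the meta file: "field:value" pairs separated by "|"
    meta.raw <- readLines("biostat778_meta.txt")
    meta.fields <- strsplit(meta.raw, "|", fixed = TRUE)
    strip.label <- function(x) sub("^[^:]*:", "", x)   # drop the "field:" prefix
    meta <- data.frame(date        = strip.label(sapply(meta.fields, "[", 1)),
                       description = strip.label(sapply(meta.fields, "[", 2)),
                       uid         = strip.label(sapply(meta.fields, "[", 3)),
                       type        = strip.label(sapply(meta.fields, "[", 4)),
                       location    = strip.label(sapply(meta.fields, "[", 5)),
                       stringsAsFactors = FALSE)

    ## Read the descriptions file: identifier and message text separated by the
    ## first "|" (the message itself may contain "|", so don't split on all of them)
    desc.raw <- readLines("biostat778_descriptions.txt")
    sep <- regexpr("|", desc.raw, fixed = TRUE)
    desc <- data.frame(description = substr(desc.raw, 1, sep - 1),
                       text        = substr(desc.raw, sep + 1, nchar(desc.raw)),
                       stringsAsFactors = FALSE)

    ## Join on the message identifier and count, per day, how many messages
    ## match a word (or any regular expression) of interest
    tw <- merge(meta, desc, by = "description")
    tw$day <- as.Date(substr(tw$date, 1, 10))
    hits <- grepl("podcast", tw$text, ignore.case = TRUE)
    trend <- tapply(hits, tw$day, sum)
    plot(as.Date(names(trend)), trend, type = "l",
         xlab = "Date", ylab = "Messages mentioning 'podcast'")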

Homework 2

  1. Read Chapter 6 of Tanenbaum [Optional: Answer questions 11, 16, 18, 19, 20, 22, 31]
  2. Complete the problems in this handout. For problem 2, you can obtain the data by running the following in R:
    d <- read.csv("http://www.biostat.jhsph.edu/MCAPS/estimates-subset.csv")
    est <- subset(d, outcome == "heart failure", c(beta, var))
    
    The data frame should have two columns, "beta" and "var". You can use the hierarchical model to pool the betas to get an overall log-relative risk (a simple pooling sketch appears below).

    For more background, these estimates come from the paper Dominici F, et al. (2006), JAMA 295(10): 1127-1134. Compare the overall log-relative risks that you get with the ones in the paper. You will have to multiply your estimates by 1000 to make a fair comparison.
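  As one point of comparison (not a substitute for the hierarchical model asked for in the handout), here is a sketch of simple inverse-variance pooling together with a DerSimonian-Laird style random-effects estimate, which adds a method-of-moments estimate of the between-city variance:
    ## Pool city-specific log-relative-risk estimates (beta) with variances (var)
    pool <- function(beta, var) {
        w <- 1 / var
        mu.fixed <- sum(w * beta) / sum(w)        # inverse-variance (fixed-effect) estimate
        Q <- sum(w * (beta - mu.fixed)^2)         # heterogeneity statistic
        k <- length(beta)
        tau2 <- max(0, (Q - (k - 1)) / (sum(w) - sum(w^2) / sum(w)))
        w.star <- 1 / (var + tau2)                # random-effects weights
        mu <- sum(w.star * beta) / sum(w.star)
        c(estimate = mu, se = sqrt(1 / sum(w.star)), tau2 = tau2)
    }

    d <- read.csv("http://www.biostat.jhsph.edu/MCAPS/estimates-subset.csv")
    est <- subset(d, outcome == "heart failure", c(beta, var))
    1000 * pool(est$beta, est$var)["estimate"]    # scale by 1000 to match the paper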

Homework 1

  1. Read Chapter 1 of Tanenbaum [Optional: Answer questions 1, 2, 5, 7, 11, 13, 15]
  2. Read Chapter 2 of Tanenbaum [Optional: Answer questions 7, 9, 10, 11, 18, 23, 28, 31]
  3. Complete the problems in this handout

Homework 0.5

  1. Read Appendix A and Appendix B from Tanenbaum [PDF]
  2. Write an R function named Unique which takes an arbitrary vector as input and returns a vector of all the unique elements of that vector. DO NOT use the following functions: unique, duplicated, %in%, or match.
    • The unique elements do not need to be returned in any particular order.
    • The output vector does not need to have names, even if the input vector has names.
    • Please adhere to the coding standards for the class.
    • Place your R code in a file called Unique.R. If you use C code, place that code in a file called Unique.c.
  3. You can test your Unique function on these test vectors, available as an R workspace (you can load it into R using the load function). The test vectors are named 'a', 'b', 'c', ..., 'k'. [NOTE: The object 'c' has now been renamed to 'c1'] A small test-harness sketch appears at the end of this list.
  4. If you use Emacs as your text editor and ESS for editing R code, you can add the following to your .emacs file to get 8-space tab indents:
    ;; ESS
    (add-hook 'ess-mode-hook
              (lambda ()
                (ess-set-style 'BSD)
                (setq tab-width 8)
                (setq indent-tabs-mode t)))
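  For item 3, here is a minimal test-harness sketch. It assumes the workspace file has been saved as "testvectors.RData" (the actual file name may differ) and that your function lives in Unique.R; the built-in unique and duplicated functions are used here only to check the output, which is fine since the restriction applies to the Unique function itself:
    source("Unique.R")            # defines Unique()
    load("testvectors.RData")     # creates the test vectors a, b, c1, d, ..., k

    ## A result passes if it has the same elements as the input (ignoring order
    ## and names) and contains no duplicates
    check <- function(x) {
        out <- Unique(x)
        setequal(out, unique(x)) && !any(duplicated(out))
    }

    sapply(list(a = a, b = b, c1 = c1), check)   # extend the list with d through k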