Homework 3

  1. Read Chapter 8 of Tanenbaum
  2. [UPDATED 12/11] Download the Twitter dataset [93MB], which was collected from the Twitter public timeline between 2007-10-15 and 2007-12-04. The zip archive contains two files (a sketch for reading them into R and counting word usage over time appears after this list):
    • biostat778_meta.txt: contains the date of the message ("date"), a 40-character message identifier ("description"), a user name ("uid"), the message type ("type"), and the location of the user ("location"). All fields are separated by "|". A typical entry in this file is
      date:2007-10-15 04:00:46|description:71eed6f3ff09e4a5695b1de890230f267c383c32|uid:bouie|type:twitter|location:Manila, Philippines
      
    • biostat778_descriptions.txt: contains a 40-character message identifier (same as in the "description" field in the biostat778_meta.txt file) and the message text, separated by "|". A typical entry is
      93b0d242e70e528f9fdeac03c51dac2d52bae2aa|TSDivaDani: have to re-record Real Time because the sound kept popping and was inaudible. I'm sure picky since becoming a podcaster.
      
  3. Please work in two groups of 3:
    1. Haley, Nick, Yen-Yi
    2. Jessica, Peter, Hao
  4. Possible ideas to investigate (I'll try to post more as I think of them):
    • What are the trends in usage of a particular word/phrase (or regular expression) over time (a) in the twitterverse as a whole and (b) by user?
    • On a given day, what are the main ideas being twitter-ed? Can the ideas be separated into "central" ideas and "residual" messages?
    • Most of the messages on Twitter might be uninteresting. How can we identify messages that are interesting, unusual, or out of the mainstream?
  5. Presentations (~30 minutes) will be made in class on Tuesday December 18, 2007.
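  Below is a minimal sketch of how the two files might be read into R and combined, and how usage of a word could be counted per day (relevant to the first idea above). It assumes the zip archive has been extracted into the working directory and that every line follows the formats shown in item 2; the parsing may need adjusting once you inspect the raw files. The example word "podcast" is arbitrary.
    ## Read the meta file: "field:value" pairs separated by "|"
    meta.raw <- readLines("biostat778_meta.txt")
    meta.fields <- strsplit(meta.raw, "|", fixed = TRUE)
    strip.label <- function(x) sub("^[^:]*:", "", x)   # drop the "field:" prefix
    meta <- data.frame(date        = strip.label(sapply(meta.fields, "[", 1)),
                       description = strip.label(sapply(meta.fields, "[", 2)),
                       uid         = strip.label(sapply(meta.fields, "[", 3)),
                       type        = strip.label(sapply(meta.fields, "[", 4)),
                       location    = strip.label(sapply(meta.fields, "[", 5)),
                       stringsAsFactors = FALSE)

    ## Read the descriptions file: identifier and message text separated by the
    ## first "|" (the message itself may contain "|", so don't split on all of them)
    desc.raw <- readLines("biostat778_descriptions.txt")
    sep <- regexpr("|", desc.raw, fixed = TRUE)
    desc <- data.frame(description = substr(desc.raw, 1, sep - 1),
                       text        = substr(desc.raw, sep + 1, nchar(desc.raw)),
                       stringsAsFactors = FALSE)

    ## Join on the message identifier and count, per day, how many messages
    ## match a word (or any regular expression) of interest
    tw <- merge(meta, desc, by = "description")
    tw$day <- as.Date(substr(tw$date, 1, 10))
    hits <- grepl("podcast", tw$text, ignore.case = TRUE)
    trend <- tapply(hits, tw$day, sum)
    plot(as.Date(names(trend)), trend, type = "l",
         xlab = "Date", ylab = "Messages mentioning 'podcast'")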

Homework 2

  1. Read Chapter 6 of Tanenbaum [Optional: Answer questions 11, 16, 18, 19, 20, 22, 31]
  2. Complete the problems in this handout. For problem 2, you can obtain the data by running the following in R:
    d <- read.csv("http://www.biostat.jhsph.edu/MCAPS/estimates-subset.csv")
    est <- subset(d, outcome == "heart failure", c(beta, var))
    
    The data frame should have two columns, "beta" and "var". You can use the hierarchical model to pool the betas to get an overall log-relative risk (a simple pooling sketch appears below).

    For more background, these estimates come from the paper Dominici F, et al. (2006), JAMA 295(10): 1127-1134. Compare the overall log-relative risks that you get with the ones in the paper. You will have to multiply your estimates by 1000 to make a fair comparison.
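  As one point of comparison (not a substitute for the hierarchical model asked for in the handout), here is a sketch of simple inverse-variance pooling together with a DerSimonian-Laird style random-effects estimate, which adds a method-of-moments estimate of the between-city variance:
    ## Pool city-specific log-relative-risk estimates (beta) with variances (var)
    pool <- function(beta, var) {
        w <- 1 / var
        mu.fixed <- sum(w * beta) / sum(w)        # inverse-variance (fixed-effect) estimate
        Q <- sum(w * (beta - mu.fixed)^2)         # heterogeneity statistic
        k <- length(beta)
        tau2 <- max(0, (Q - (k - 1)) / (sum(w) - sum(w^2) / sum(w)))
        w.star <- 1 / (var + tau2)                # random-effects weights
        mu <- sum(w.star * beta) / sum(w.star)
        c(estimate = mu, se = sqrt(1 / sum(w.star)), tau2 = tau2)
    }

    d <- read.csv("http://www.biostat.jhsph.edu/MCAPS/estimates-subset.csv")
    est <- subset(d, outcome == "heart failure", c(beta, var))
    1000 * pool(est$beta, est$var)["estimate"]    # scale by 1000 to match the paper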

Homework 1

  1. Read Chapter 1 of Tanenbaum [Optional: Answer questions 1, 2, 5, 7, 11, 13, 15]
  2. Read Chapter 2 of Tanenbaum [Optional: Answer questions 7, 9, 10, 11, 18, 23, 28, 31]
  3. Complete the problems in this handout

Homework 0.5

  1. Read Appendix A and Appendix B from Tanenbaum [PDF]
  2. Write an R function named Unique which takes an arbitrary vector as input and returns a vector of all the unique elements of that vector. DO NOT use the following functions: unique, duplicated, %in%, or match.
    • The unique elements do not need to be returned in any particular order.
    • The output vector does not need to have names, even if the input vector has names.
    • Please adhere to the coding standards for the class.
    • Place your R code in a file called Unique.R. If you use C code, place that code in a file called Unique.c.
  3. You can test your Unique function on these test vectors, available as an R workspace (you can load it into R using the load function). The test vectors are named 'a', 'b', 'c', ..., 'k'. [NOTE: The object 'c' has now been renamed to 'c1'] A small test-harness sketch appears at the end of this list.
  4. If you use Emacs as your text editor and ESS for editing R code, you can add the following to your .emacs file to get 8-space tab indents:
    ;; ESS
    (add-hook 'ess-mode-hook
              (lambda ()
                (ess-set-style 'BSD)
                (setq tab-width 8)
                (setq indent-tabs-mode t)))
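  For item 3, here is a minimal test-harness sketch. It assumes the workspace file has been saved as "testvectors.RData" (the actual file name may differ) and that your function lives in Unique.R; the built-in unique and duplicated functions are used here only to check the output, which is fine since the restriction applies to the Unique function itself:
    source("Unique.R")            # defines Unique()
    load("testvectors.RData")     # creates the test vectors a, b, c1, d, ..., k

    ## A result passes if it has the same elements as the input (ignoring order
    ## and names) and contains no duplicates
    check <- function(x) {
        out <- Unique(x)
        setequal(out, unique(x)) && !any(duplicated(out))
    }

    sapply(list(a = a, b = b, c1 = c1), check)   # extend the list with d through k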