This is my title - take me out if neccessary

The series of === are just making a line.

The three “back ticks” () must be followed by curly brackets “{”, and then “r” to tell the computer that you are using R code. This line is then closed off by another curly bracket “}”.

Anything before three more back ticks “”“ are then considered R code (a script).

I'm reading in the bike lanes here.

# readin is just a 'label' for this code chunk code chunk is just a 'chunk'
# of code, where this code usually does just one thing, aka a module
# comments are still # here you can do all your reading in there
data.dir <- "~/Dropbox/WinterR_2014/Lectures/data"
### let's say we loaded some packages
library(stringr)
library(plyr)
fname <- file.path(data.dir, "Bike_Lanes.csv")
## file.path takes a directory and makes a full name with a full file path
bike = read.csv(fname, as.is = TRUE)
getwd()
## [1] "/Users/andrew.jaffe/Dropbox/WinterR_2014"

You can write your introduction here.

Introduction

Bike lanes are in Baltimore. People like them. Why are they so long?

Exploratory Analysis

Let's look at some plots of bike length. Let's say we wanted to look at what affects bike length.

Plots of bike length

Note we made the subsection by using three "hashes” (pound signs): ###.

no.missyear <- bike[bike$dateInstalled != 0, ]
plot(no.missyear$dateInstalled, no.missyear$length)

plot of chunk unnamed-chunk-1

no.missyear$dateInstalled = factor(no.missyear$dateInstalled)
boxplot(no.missyear$length ~ no.missyear$dateInstalled, main = "Boxplots of Bike Lenght by Year", 
    xlab = "Year", ylab = "Bike Length")

plot of chunk unnamed-chunk-1

What does it look like if we took the log (base 10) of the bike length:

no.missyear$log.length <- log10(no.missyear$length)
### see here that if you specify the data argument, you don't need to do the $
boxplot(log.length ~ dateInstalled, data = no.missyear, main = "Boxplots of Bike Lenght by Year", 
    xlab = "Year", ylab = "Bike Length")

plot of chunk unnamed-chunk-2

I want my boxplots colored, so I set the col argument.

boxplot(log.length ~ dateInstalled, data = no.missyear, main = "Boxplots of Bike Lenght by Year", 
    xlab = "Year", ylab = "Bike Length", col = "red")

plot of chunk unnamed-chunk-3

As we can see, 2006 had a much higher bike length. What about for the type of bike path?

### type is a character, but when R sees a 'character' in a 'formula', then it
### automatically converts it to factor a formula is something that has a y ~
### x, which says I want to plot y against x or if it were a model you would
### do y ~ x, which meant regress against y
boxplot(log.length ~ type, data = no.missyear, main = "Boxplots of Bike Lenght by Year", 
    xlab = "Year", ylab = "Bike Length")

plot of chunk unnamed-chunk-4

What if we want to extract means by each type?

Let's show a few ways:

### tapply takes in vector 1, then does a function by vector 2, and then you
### tell what that function is
tapply(no.missyear$log.length, no.missyear$type, mean)
##       BIKE LANE      CONTRAFLOW SHARED BUS BIKE         SHARROW 
##           2.331           2.087           2.363           2.256 
##        SIDEPATH    SIGNED ROUTE 
##           2.782           2.264

## aggregate
aggregate(x = no.missyear$log.length, by = list(no.missyear$type), FUN = mean)
##           Group.1     x
## 1       BIKE LANE 2.331
## 2      CONTRAFLOW 2.087
## 3 SHARED BUS BIKE 2.363
## 4         SHARROW 2.256
## 5        SIDEPATH 2.782
## 6    SIGNED ROUTE 2.264
### now let's specify the data argument and use a 'formula' - much easier to
### read and more 'intuitive'
aggregate(log.length ~ type, data = no.missyear, FUN = mean)
##              type log.length
## 1       BIKE LANE      2.331
## 2      CONTRAFLOW      2.087
## 3 SHARED BUS BIKE      2.363
## 4         SHARROW      2.256
## 5        SIDEPATH      2.782
## 6    SIGNED ROUTE      2.264

## ddply is from the plyr package takes in a data frame, (the first d refers
## to data.frame) splits it up by some variables (let's say type) then we'll
## use summarise to summarize whatever we want then returns a data.frame (the
## second d) - hence why it's ddply if we wanted to do it on a 'list' thne
## return data.frame, it'd be ldply
ddply(no.missyear, .(type), summarise, mean = mean(log.length))
##              type  mean
## 1       BIKE LANE 2.331
## 2      CONTRAFLOW 2.087
## 3 SHARED BUS BIKE 2.363
## 4         SHARROW 2.256
## 5        SIDEPATH 2.782
## 6    SIGNED ROUTE 2.264

ddply (and other functions in the plyr package) is cool because you can do multiple functions really easy.

Let's show a what if we wanted to go over type and dateInstalled:

### For going over 2 variables, we need to do it over a 'list' of vectors
tapply(no.missyear$log.length, list(no.missyear$type, no.missyear$dateInstalled), 
    mean)
##                  2006  2007  2008  2009  2010  2011  2012  2013
## BIKE LANE       3.046 2.351 2.366 2.381 2.307 2.242 2.362 2.408
## CONTRAFLOW         NA    NA    NA    NA 2.087    NA    NA    NA
## SHARED BUS BIKE    NA    NA    NA 2.351 2.404    NA    NA    NA
## SHARROW            NA 2.301 2.221 2.692 2.247    NA 2.236    NA
## SIDEPATH           NA    NA 2.625    NA 2.774 3.267    NA    NA
## SIGNED ROUTE       NA 2.288    NA    NA 2.239 2.210    NA    NA

tapply(no.missyear$log.length, list(no.missyear$type, no.missyear$dateInstalled), 
    mean, na.rm = TRUE)
##                  2006  2007  2008  2009  2010  2011  2012  2013
## BIKE LANE       3.046 2.351 2.366 2.381 2.307 2.242 2.362 2.408
## CONTRAFLOW         NA    NA    NA    NA 2.087    NA    NA    NA
## SHARED BUS BIKE    NA    NA    NA 2.351 2.404    NA    NA    NA
## SHARROW            NA 2.301 2.221 2.692 2.247    NA 2.236    NA
## SIDEPATH           NA    NA 2.625    NA 2.774 3.267    NA    NA
## SIGNED ROUTE       NA 2.288    NA    NA 2.239 2.210    NA    NA

## aggregate - looks better
aggregate(log.length ~ type + dateInstalled, data = no.missyear, FUN = mean)
##               type dateInstalled log.length
## 1        BIKE LANE          2006      3.046
## 2        BIKE LANE          2007      2.351
## 3          SHARROW          2007      2.301
## 4     SIGNED ROUTE          2007      2.288
## 5        BIKE LANE          2008      2.366
## 6          SHARROW          2008      2.221
## 7         SIDEPATH          2008      2.625
## 8        BIKE LANE          2009      2.381
## 9  SHARED BUS BIKE          2009      2.351
## 10         SHARROW          2009      2.692
## 11       BIKE LANE          2010      2.307
## 12      CONTRAFLOW          2010      2.087
## 13 SHARED BUS BIKE          2010      2.404
## 14         SHARROW          2010      2.247
## 15        SIDEPATH          2010      2.774
## 16    SIGNED ROUTE          2010      2.239
## 17       BIKE LANE          2011      2.242
## 18        SIDEPATH          2011      3.267
## 19    SIGNED ROUTE          2011      2.210
## 20       BIKE LANE          2012      2.362
## 21         SHARROW          2012      2.236
## 22       BIKE LANE          2013      2.408

## ddply is from the plyr package
ddply(no.missyear, .(type, dateInstalled), summarise, mean = mean(log.length), 
    median = median(log.length), Mode = mode(log.length), Std.Dev = sd(log.length))
##               type dateInstalled  mean median    Mode Std.Dev
## 1        BIKE LANE          2006 3.046  3.046 numeric 0.47974
## 2        BIKE LANE          2007 2.351  2.444 numeric 0.40662
## 3        BIKE LANE          2008 2.366  2.355 numeric 0.38916
## 4        BIKE LANE          2009 2.381  2.311 numeric 0.49447
## 5        BIKE LANE          2010 2.307  2.328 numeric 0.32076
## 6        BIKE LANE          2011 2.242  2.235 numeric 0.33398
## 7        BIKE LANE          2012 2.362  2.324 numeric 0.28528
## 8        BIKE LANE          2013 2.408  2.505 numeric 0.24041
## 9       CONTRAFLOW          2010 2.087  2.142 numeric 0.25655
## 10 SHARED BUS BIKE          2009 2.351  2.464 numeric 0.30610
## 11 SHARED BUS BIKE          2010 2.404  2.587 numeric 0.27380
## 12         SHARROW          2007 2.301  2.364 numeric 0.42193
## 13         SHARROW          2008 2.221  2.238 numeric 0.32664
## 14         SHARROW          2009 2.692  2.708 numeric 0.06945
## 15         SHARROW          2010 2.247  2.298 numeric 0.35905
## 16         SHARROW          2012 2.236  2.339 numeric 0.42924
## 17        SIDEPATH          2008 2.625  2.787 numeric 0.29583
## 18        SIDEPATH          2010 2.774  2.774 numeric 0.33480
## 19        SIDEPATH          2011 3.267  3.267 numeric      NA
## 20    SIGNED ROUTE          2007 2.288  2.332 numeric 0.41825
## 21    SIGNED ROUTE          2010 2.239  2.256 numeric 0.39201
## 22    SIGNED ROUTE          2011 2.210  2.208 numeric 0.20880

OK let's do an linear model

### type is a character, but when R sees a 'character' in a 'formula', then it
### automatically converts it to factor a formula is something that has a y ~
### x, which says I want to plot y against x or if it were a model you would
### do y ~ x, which meant regress against y
mod.type = lm(log.length ~ type, data = no.missyear)
mod.yr = lm(log.length ~ factor(dateInstalled), data = no.missyear)
mod.yrtype = lm(log.length ~ type + factor(dateInstalled), data = no.missyear)
summary(mod.type)
## 
## Call:
## lm(formula = log.length ~ type, data = no.missyear)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5150 -0.1906  0.0292  0.2322  1.3102 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.3306     0.0149  156.70  < 2e-16 ***
## typeCONTRAFLOW       -0.2434     0.1029   -2.37  0.01813 *  
## typeSHARED BUS BIKE   0.0324     0.0606    0.53  0.59319    
## typeSHARROW          -0.0742     0.0213   -3.48  0.00051 ***
## typeSIDEPATH          0.4512     0.1506    3.00  0.00277 ** 
## typeSIGNED ROUTE     -0.0669     0.0273   -2.45  0.01430 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.367 on 1499 degrees of freedom
## Multiple R-squared:  0.0196, Adjusted R-squared:  0.0163 
## F-statistic: 5.98 on 5 and 1499 DF,  p-value: 1.74e-05

That's rather UGLY, so let's use a package called xtable and then make this model into an xtable object and then print it out nicely.

### DON'T DO THIS.  YOU SHOULD ALWAYS DO library() statements in the FIRST
### code chunk.  this is just to show you the logic of a report/analysis.
require(xtable)
## Loading required package: xtable
# smod <- summary(mod.yr)
xtab <- xtable(mod.yr)

Well xtable can make html tables, so let's print this. We must tell R that the results is actually an html output, so we say the results should be embedded in the html “asis” (aka just print out whatever R spits out).

print.xtable(xtab, type = "html")
Estimate Std. Error t value Pr(> |t|)
(Intercept) 3.0463 0.2600 11.71 0.0000
factor(dateInstalled)2007 -0.7332 0.2608 -2.81 0.0050
factor(dateInstalled)2008 -0.7808 0.2613 -2.99 0.0029
factor(dateInstalled)2009 -0.6394 0.2631 -2.43 0.0152
factor(dateInstalled)2010 -0.7791 0.2605 -2.99 0.0028
factor(dateInstalled)2011 -0.8022 0.2626 -3.05 0.0023
factor(dateInstalled)2012 -0.7152 0.2625 -2.72 0.0065
factor(dateInstalled)2013 -0.6380 0.2849 -2.24 0.0253

OK, that's pretty good, but let's say we have all three models. Another package called stargazer can put models together easily and pritn them out. So xtable is really good when you are trying to print out a table (in html, otherwise make the table and use write.csv to get it in Excel and then format) really quickly and in a report. But it doesn't work so well with many models together. So let's use stargazer. Again, you need to use install.packages("stargazer") if you don't have function.

require(stargazer)
## Loading required package: stargazer
## Warning: there is no package called 'stargazer'

OK, so what's the difference here? First off, we said results are “markup”, so that it will not try to reformat the output. Also, I didn't want those # for comments, so I just made comment an empty string “”.

stargazer(mod.yr, mod.type, mod.yrtype, type = "text")
Error: could not find function "stargazer"

Data Extraction

Let's say I want to get data INTO my text. Like there are N number of bike lanes with a date installed that isn't zero. There are 1505 bike lanes with a date installed after 2006. So you use one backtick and then you say “r” to tell that it's R code. And then you run R code that gets evaulated and then returns the value. Let's say you want to compute a bunch of things:

### let's get number of bike lanes installed by year
n.lanes = ddply(no.missyear, .(dateInstalled), nrow)
names(n.lanes) <- c("date", "nlanes")
n2009 <- n.lanes$nlanes[n.lanes$date == 2009]
n2010 <- n.lanes$nlanes[n.lanes$date == 2010]
getwd()
## [1] "/Users/andrew.jaffe/Dropbox/WinterR_2014"

Now I can just say there are 86 lanes in 2009 and 625 in 2010.

fname <- file.path(data.dir, "Charm_City_Circulator_Ridership.csv")
## file.path takes a directory and makes a full name with a full file path
charm = read.csv(fname, as.is = TRUE)

library(chron)
## Error: there is no package called 'chron'
days = levels(weekdays(1, abbreviate = FALSE))
## Error: no applicable method for 'weekdays' applied to an object of class
## "c('double', 'numeric')"
charm$day <- factor(charm$day, levels = days)
## Error: object 'days' not found
charm$date <- as.Date(charm$date, format = "%m/%d/%Y")
cn <- colnames(charm)
daily <- charm[, c("day", "date", "daily")]
charm$daily <- NULL
require(reshape2)
## Loading required package: reshape2
long.charm <- melt(charm, id.vars = c("day", "date"))
long.charm$type <- "Boardings"
long.charm$type[grepl("Alightings", long.charm$variable)] <- "Alightings"
long.charm$type[grepl("Average", long.charm$variable)] <- "Average"

long.charm$line <- "orange"
long.charm$line[grepl("purple", long.charm$variable)] <- "purple"
long.charm$line[grepl("green", long.charm$variable)] <- "green"
long.charm$line[grepl("banner", long.charm$variable)] <- "banner"
long.charm$variable <- NULL

long.charm$line <- factor(long.charm$line, levels = c("orange", "purple", "green", 
    "banner"))

head(long.charm)
##         day       date value      type   line
## 1    Monday 2010-01-11   877 Boardings orange
## 2   Tuesday 2010-01-12   777 Boardings orange
## 3 Wednesday 2010-01-13  1203 Boardings orange
## 4  Thursday 2010-01-14  1194 Boardings orange
## 5    Friday 2010-01-15  1645 Boardings orange
## 6  Saturday 2010-01-16  1457 Boardings orange

### NOW R has a column of day, the date, a 'value', the type of value and the
### circulator line that corresponds to it value is now either the Alightings,
### Boardings, or Average from the charm dataset

Let's do some plotting now!

require(ggplot2)
## Loading required package: ggplot2
### let's make a 'ggplot' the format is ggplot(dataframe, aes(x=COLNAME,
### y=COLNAME)) where COLNAME are colnames of the dataframe you can also set
### color to a different factor other options in AES (fill, alpha level -which
### is the 'transparency' of points)
g <- ggplot(long.charm, aes(x = date, y = value, color = line))
### let's change the colors to what we want- doing this manually, not letting
### it choose for me
g <- g + scale_color_manual(values = c("orange", "purple", "green", "blue"))
### plotting points
g + geom_point()
## Warning: Removed 5328 rows containing missing values (geom_point).

plot of chunk plots

### Let's make Lines!
g + geom_line()
## Warning: Removed 5043 rows containing missing values (geom_path).

plot of chunk plots

### let's make a new plot of poitns
gpoint <- g + geom_point()
### let's plot the value by the type of value - boardings/average, etc
gpoint + facet_wrap(~type)
## Warning: Removed 1814 rows containing missing values (geom_point).
## Warning: Removed 1700 rows containing missing values (geom_point).
## Warning: Removed 1814 rows containing missing values (geom_point).

plot of chunk plots

OK let's turn off some warnings - making warning=FALSE as an option.

## let's compare vertically
gpoint + facet_wrap(~type, ncol = 1)

plot of chunk unnamed-chunk-14


gfacet = g + facet_wrap(~type, ncol = 1)
## let's smooth this - get a rough estimate of what's going on
gfacet + geom_smooth(se = FALSE)
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

plot of chunk unnamed-chunk-14

OK, I've seen enough code, let's turn that off, using echo=FALSE.

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

plot of chunk unnamed-chunk-15