This is my title - take me out if necessary
========================================================

The series of === under the title marks it as a top-level header. The three "back ticks" (`) must be followed by curly brackets "{", and then "r" to tell the computer that you are using R code. This line is then closed off by another curly bracket "}". Anything before three more back ticks "```" is then considered R code (a script).

I'm reading in the bike lanes here.

```{r readin}
# readin is just a "label" for this code chunk
## code chunk is just a "chunk" of code, where this code usually
## does just one thing, aka a module
### comments are still # here
### you can do all your reading in there
data.dir <- "~/Dropbox/WinterR_2014/Lectures/data"
### let's say we loaded some packages
library(stringr)
library(plyr)
fname <- file.path(data.dir, "Bike_Lanes.csv")
## file.path takes a directory and makes a full name with a full file path
bike = read.csv(fname, as.is=TRUE)
getwd()
```

You can write your introduction here.

## Introduction

Bike lanes are in Baltimore. People like them. Why are they so long?

## Exploratory Analysis

Let's look at some plots of bike length. Let's say we wanted to look at what affects bike length.

### Plots of bike length

Note we made the subsection by using three "hashes" (pound signs): ###.
```{r}
no.missyear <- bike[ bike$dateInstalled != 0,]
plot(no.missyear$dateInstalled, no.missyear$length)
no.missyear$dateInstalled = factor(no.missyear$dateInstalled)
boxplot(no.missyear$length ~ no.missyear$dateInstalled, main="Boxplots of Bike Length by Year", xlab="Year", ylab="Bike Length")
```

What does it look like if we take the log (base 10) of the bike length?

```{r}
no.missyear$log.length <- log10(no.missyear$length)
### see here that if you specify the data argument, you don't need to do the $
boxplot(log.length ~ dateInstalled, data=no.missyear, main="Boxplots of Bike Length by Year", xlab="Year", ylab="log10(Bike Length)")
```

I want my boxplots colored, so I set the `col` argument.

```{r}
boxplot(log.length ~ dateInstalled, data=no.missyear, main="Boxplots of Bike Length by Year", xlab="Year", ylab="log10(Bike Length)", col="red")
```

As we can see, 2006 had a much higher bike length. What about for the type of bike path?

```{r}
### type is a character, but when R sees a "character" in a "formula", then it automatically converts it to factor
### a formula is something that has a y ~ x, which says I want to plot y against x
### or if it were a model you would do y ~ x, which means regress y on x
boxplot(log.length ~ type, data=no.missyear, main="Boxplots of Bike Length by Type", xlab="Type", ylab="log10(Bike Length)")
```

What if we want to extract means by each type?
Let's show a few ways:

```{r}
### tapply takes in vector 1, then does a function by vector 2, and then you tell what
### that function is
tapply(no.missyear$log.length, no.missyear$type, mean)

## aggregate
aggregate(x=no.missyear$log.length, by=list(no.missyear$type), FUN=mean)
### now let's specify the data argument and use a "formula" - much easier to read and
## more "intuitive"
aggregate(log.length ~ type, data=no.missyear, FUN=mean)

## ddply is from the plyr package
## takes in a data frame, (the first d refers to data.frame)
## splits it up by some variables (let's say type)
## then we'll use summarise to summarize whatever we want
## then returns a data.frame (the second d) - hence why it's ddply
## if we wanted to do it on a "list" then return data.frame, it'd be ldply
ddply(no.missyear, .(type), summarise,
      mean=mean(log.length)
      )
```

`ddply` (and other functions in the `plyr` package) is cool because you can apply multiple functions really easily.

Let's show what happens if we want to go over `type` and `dateInstalled`:

```{r}
### For going over 2 variables, we need to do it over a "list" of vectors
tapply(no.missyear$log.length,
       list(no.missyear$type, no.missyear$dateInstalled),
       mean)

tapply(no.missyear$log.length,
       list(no.missyear$type, no.missyear$dateInstalled),
       mean, na.rm=TRUE)

## aggregate - looks better
aggregate(log.length ~ type + dateInstalled, data=no.missyear, FUN=mean)

## ddply is from the plyr package
## note: mode() returns the storage mode (e.g. "numeric"), not the most common value
ddply(no.missyear, .(type, dateInstalled), summarise,
      mean=mean(log.length),
      median=median(log.length),
      Mode=mode(log.length),
      Std.Dev=sd(log.length)
      )
```

OK, let's fit a linear model.

```{r}
### type is a character, but when R sees a "character" in a "formula", then it automatically converts it to factor
### a formula is something that has a y ~ x, which says I want to plot y against x
### or if it were a model you would do y ~ x, which means regress y on x
mod.type = lm(log.length ~ type, data=no.missyear)
mod.yr = lm(log.length ~ factor(dateInstalled), data=no.missyear)
mod.yrtype = lm(log.length ~ type + factor(dateInstalled), data=no.missyear)
summary(mod.type)
```

That's rather UGLY, so let's use a package called `xtable`, make this model into an `xtable` object, and then print it out nicely.

```{r}
### DON'T DO THIS. YOU SHOULD ALWAYS DO library() statements in the FIRST code chunk.
### this is just to show you the logic of a report/analysis.
require(xtable)
# smod <- summary(mod.yr)
xtab <- xtable(mod.yr)
```

Well, `xtable` can make html tables, so let's print this. We must tell R that the result is already html output, so we say the results should be embedded in the html "asis" (aka just print out whatever R spits out).

```{r, results='asis'}
print(xtab, type="html")
```

OK, that's pretty good, but let's say we have all three models. Another package called `stargazer` can put models together easily and print them out. So `xtable` is really good when you are trying to print out a table (in html, otherwise make the table and use `write.csv` to get it in Excel and then format) really quickly and in a report. But it doesn't work so well with *many* models together. So let's use stargazer. Again, you need to use `install.packages("stargazer")` if you don't have the package.

```{r}
require(stargazer)
```

OK, so what's the difference here? First off, we said results are "markup", so that it will not try to reformat the output. Also, I didn't want those # for comments, so I just made comment an empty string "".

```{r, results='markup', comment=""}
stargazer(mod.yr, mod.type, mod.yrtype, type="text")
```

## Data Extraction

Let's say I want to get data INTO my text. Like there are N number of bike lanes with a date installed that isn't zero. There are `r nrow(no.missyear)` bike lanes with a date installed that isn't zero. So you use one backtick ` and then you say "r" to tell that it's R code. And then you run R code that gets evaluated and then returns the value.
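
Before putting an expression inline, it can help to run it in an ordinary chunk and check that it returns a single value. Here is a minimal sketch of that check. The data frame `toy.lanes` and the names `toy.nomiss` and `n.toy` are made up for illustration; they are not part of the bike lane data.

```{r}
### toy.lanes is a made-up data frame, NOT the real bike lane data
toy.lanes <- data.frame(dateInstalled = c(0, 2007, 2009, 2009),
                        length = c(120, 85, 240, 60))
### keep only the rows with a nonzero install date, as was done for no.missyear
toy.nomiss <- toy.lanes[ toy.lanes$dateInstalled != 0, ]
### this single number is what an inline expression wrapping nrow(toy.nomiss)
### would drop into the sentence
n.toy <- nrow(toy.nomiss)
n.toy
```

Three of the four toy rows have a nonzero date, so the sentence would read "There are 3 bike lanes". If the expression returned a whole vector or data frame instead, the inline text would probably not read the way you want.
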
Let's say you want to compute a bunch of things:

```{r computes}
### let's get number of bike lanes installed by year
n.lanes = ddply(no.missyear, .(dateInstalled), nrow)
names(n.lanes) <- c("date", "nlanes")
n2009 <- n.lanes$nlanes[ n.lanes$date == 2009]
n2010 <- n.lanes$nlanes[ n.lanes$date == 2010]
getwd()
```

Now I can just say there are `r n2009` lanes in 2009 and `r n2010` in 2010.

```{r}
fname <- file.path(data.dir, "Charm_City_Circulator_Ridership.csv")
## file.path takes a directory and makes a full name with a full file path
charm = read.csv(fname, as.is=TRUE)

library(chron)
days = levels(weekdays(1, abbreviate=FALSE))
charm$day <- factor(charm$day, levels=days)
charm$date <- as.Date(charm$date, format="%m/%d/%Y")
cn <- colnames(charm)
daily <- charm[, c("day", "date", "daily")]
```

```{r}
charm$daily <- NULL
require(reshape2)
long.charm <- melt(charm, id.vars = c("day", "date"))
long.charm$type <- "Boardings"
long.charm$type[ grepl("Alightings", long.charm$variable)] <- "Alightings"
long.charm$type[ grepl("Average", long.charm$variable)] <- "Average"

long.charm$line <- "orange"
long.charm$line[ grepl("purple", long.charm$variable)] <- "purple"
long.charm$line[ grepl("green", long.charm$variable)] <- "green"
long.charm$line[ grepl("banner", long.charm$variable)] <- "banner"
long.charm$variable <- NULL

long.charm$line <- factor(long.charm$line, levels=c("orange", "purple", "green", "banner"))

head(long.charm)

### NOW R has a column of day, the date, a "value", the type of value and the
### circulator line that corresponds to it
### value is now either the Alightings, Boardings, or Average from the charm dataset
```

Let's do some plotting now!
```{r plots}
require(ggplot2)
### let's make a "ggplot"
### the format is ggplot(dataframe, aes(x=COLNAME, y=COLNAME))
### where COLNAME are colnames of the dataframe
### you can also set color to a different factor
### other options in AES (fill, alpha level - which is the "transparency" of points)
g <- ggplot(long.charm, aes(x=date, y=value, color=line))
### let's change the colors to what we want - doing this manually, not letting it choose
### for me
g <- g + scale_color_manual(values=c("orange", "purple", "green", "blue"))
### plotting points
g + geom_point()
### Let's make Lines!
g + geom_line()
### let's make a new plot of points
gpoint <- g + geom_point()
### let's plot the value by the type of value - boardings/average, etc
gpoint + facet_wrap(~ type)
```

OK, let's turn off some warnings by setting `warning=FALSE` as a chunk option.

```{r, warning=FALSE}
## let's compare vertically
gpoint + facet_wrap(~ type, ncol=1)

gfacet = g + facet_wrap(~ type, ncol=1)
## let's smooth this - get a rough estimate of what's going on
gfacet + geom_smooth(se=FALSE)
```

OK, I've seen enough code, let's turn that off, using `echo=FALSE`.

```{r, echo=FALSE, warning=FALSE, fig.width=10, fig.height=5}
#### COMBINE! - let's make the line width bigger (lwd)
### also making the "alpha level" (transparency) low for the points so we can see the lines
g + geom_point(alpha=0.2) + geom_smooth(se=FALSE, lwd=1.5) + facet_wrap( ~ type)
```
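
The reshape-then-plot steps used for the circulator data can be tried on a tiny data set to see what each step does. This is only a sketch: `toy.wide`, `toy.long`, and `toy.plot` and their column names are made up for illustration and are not the real ridership data.

```{r}
library(reshape2)
library(ggplot2)
### toy.wide is a made-up "wide" data set: one column per line, one row per date
toy.wide <- data.frame(date = as.Date(c("2014-01-01", "2014-01-02")),
                       orangeBoardings = c(10, 12),
                       purpleBoardings = c(5, 7))
### melt stacks the non-id columns into a "variable" column and a "value" column
toy.long <- melt(toy.wide, id.vars = "date")
### pull the line name out of the old column name, as was done for long.charm
toy.long$line <- "orange"
toy.long$line[ grepl("purple", toy.long$variable)] <- "purple"
### build the plot in layers: data and aesthetics, then colors, points, and facets
toy.plot <- ggplot(toy.long, aes(x=date, y=value, color=line)) +
  scale_color_manual(values=c("orange", "purple")) +
  geom_point() +
  facet_wrap(~ line)
toy.plot
```

Two dates times two lines gives four rows in `toy.long`, which is an easy way to check that the reshape did what you expected before running it on the full data.
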