# Statistical Learning: Algorithmic and Nonparametric Approaches

On this page you will find the class outline, homework assignments, datasets, recommended books, and general course information.

Class outline:
| Lecture | Title | Description | Notes | Code |
| --- | --- | --- | --- | --- |
| NA | Review | Stuff you should know: basics of probability, the central limit theorem, and inference. | PDF | NA |
| 1 | Introduction to Regression and Prediction | We will describe linear regression in the context of a prediction problem. | PDF | R |
| 2 | Overview of Supervised Learning | Regression for predicting bivariate data, K-nearest neighbors (KNN), bin smoothers, and an introduction to the bias/variance trade-off. | PDF | R |
| 3-4 | Linear Methods for Regression | Subset selection and ridge regression. We will use the singular value decomposition (SVD) and principal component analysis (PCA) to understand these methods. | PDF | R |
| 5 | Linear Methods for Classification | Linear regression, linear discriminant analysis (LDA), and logistic regression. | PDF | R |
| 6 | Kernel Methods | Kernel smoothers, including loess. We will briefly describe two-dimensional smoothers. We will also define degrees of freedom in the context of smoothing and learn about density estimators. | PDF | R |
| 7 | Model Assessment and Selection | We revisit the bias/variance trade-off. We describe how Monte Carlo simulations can be used to assess bias and variance. We then introduce cross-validation, AIC, and BIC. | PDF | R |
| 8 | The Bootstrap | We give a short introduction to the bootstrap and demonstrate its utility in smoothing problems. | PDF | R |
| 9-10 | Splines, Wavelets, and Friends | We give intuitive and mathematical descriptions of splines and wavelets. We use the SVD to understand these better and see connections with signal-processing methods. | PDF | R |
| 11-12 | Additive Models, GAM, and Neural Networks | We move back to cases with many covariates. We introduce projection pursuit and additive models, as well as generalized additive models. We briefly describe neural networks and explain the connection to projection pursuit. | PDF | NA |
| 13-14 | CART, Boosting, and Additive Trees | We introduce classification and regression trees (CART), as well as more modern versions such as random forests. | PDF | archive for CART, archive for others |
| 15 | Model Averaging | Bayesian statistics, boosting, and bagging. | PDF | NA |
| 16 | Clustering Algorithms | Notes and code taken from my microarray class. | PDF | R |
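The bias/variance trade-off introduced in Lecture 2 can be seen in a few lines of simulation. The course uses R, but here is an equivalent sketch in Python/NumPy; the `knn_predict` helper, the sine curve, and the noise level are all illustrative choices, not part of the course materials:

```python
import numpy as np

def knn_predict(x_train, y_train, x_new, k):
    """Predict at each x_new by averaging the y-values of the k nearest training points."""
    preds = []
    for x0 in x_new:
        idx = np.argsort(np.abs(x_train - x0))[:k]
        preds.append(y_train[idx].mean())
    return np.array(preds)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 2 * np.pi, 100))
f = np.sin(x)                        # true regression function
y = f + rng.normal(0, 0.3, x.size)   # noisy observations

# Small k tracks the noise (low bias, high variance); with k = 1 the fit
# interpolates the data exactly. Large k over-smooths (high bias, low variance).
for k in (1, 5, 50):
    fit = knn_predict(x, y, x, k)
    print(k, round(np.mean((fit - f) ** 2), 3))
```

Rerunning with different random seeds shows the variance side of the trade-off: the k = 1 fit changes wildly from sample to sample, while the k = 50 fit barely moves.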

Homework:

• Homework 1 [Due 4/10]: Look through the top journal in your field for a paper in which a regression analysis was performed, many covariates were available, and p-values were reported.

If your field is mathematical (statistics, biostatistics, engineering, etc.) then look through the top journal of your favorite public health application. If you don't have one, use the American Journal of Epidemiology (there should be plenty of regression analyses in this journal).

• Discuss how the model was motivated: deductively, empirically, both, or neither?
• Give your thoughts on their model choice. Could they have done something differently? Are the results described model-driven?
• Where does the p in p-value come from? That is, where does the randomness come from: a random sample, randomization, or nature? If nature, write a paragraph explaining how.
• Homework 2 [Due 4/17]
• Use this training data to predict the outcomes for this data. You should give the 500 predictions and an estimate of the number of mistakes you've made. Please send a text file with only the predictions (separated by spaces). Include a description of what you did. Whoever predicts best wins first prize. Whoever best estimates the number of mistakes they make comes in second. Prizes will be handed out.
• Derive the discrimination function for LDA (third equation on page 75) and show it is linear.
• Show that LDA and regression are equivalent when the outcomes are binary.
• Homework 3 [Due 4/24]
• Download the Strontium Data [text file] and fit polynomials of degree 1, 2, 3, 4, 6, and 12, a spline (you pick the knots), and smoothing splines. Make plots of the data and the fitted curves.
• Write a paragraph describing your project.
• Homework 4 [Due 5/1]
• From this data, get your best estimate of y (yhat) and confidence bands for each of the given x-values. First prize goes to the smallest RSS; second prize goes to the true f(x) entirely inside the confidence bands with the smallest area between bands.
• Turn in project first draft.
• Project [Last day of class]
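As a starting point for the Homework 3 polynomial fits, here is a sketch in Python rather than R. The Strontium data file is not reproduced on this page, so synthetic data of a similar flavor (a smooth curve plus noise, an assumption made only for this illustration) stand in:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 60)
y = np.sin(4 * x) + rng.normal(0, 0.1, x.size)  # stand-in for the Strontium data

rss = []
for degree in (1, 2, 3, 4, 6, 12):
    coefs = np.polyfit(x, y, degree)     # least-squares polynomial fit
    resid = y - np.polyval(coefs, x)     # residuals at the observed x-values
    rss.append(float(np.sum(resid ** 2)))
    print(degree, round(rss[-1], 4))

# A higher-degree polynomial always fits the training data at least as well
# (the lower-degree fits are nested inside it), which is exactly why the
# homework asks for plots rather than a single goodness-of-fit number.
```

For the spline and smoothing-spline parts of the assignment you would use a dedicated routine (for example R's `smooth.spline`, which the course code links use) rather than `polyfit`.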

Data-sets:

Recommended Books

Resources
Class General Info
• Course title: Statistical Learning: Algorithmic and Nonparametric Approaches (140.649)
• Lab Hour:
• Instructor: Rafael Irizarry
• Department of Biostatistics
• Phone 410-614-5157, email: rafa@jhu.edu
• I assume you know: Linear algebra and statistical principles at a 651--654 level.
• It will be useful to learn one of the following programming languages: R (recommended), S-Plus, or MATLAB.
• Grading: 3 homeworks 60%, 2 quizzes 20%, 1 project 20%
• Course description: Teaches public health students to use modern, computationally based methods for exploring and drawing inferences from data. After a brief review of probability, the central limit theorem, and inference, the course covers resampling methods, non-parametric regression, prediction, and dimension reduction and clustering. Specifically covers: Monte Carlo simulation, the bootstrap, cross-validation, splines, locally weighted regression, CART, random forests, neural networks, support vector machines, and hierarchical clustering.
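The bootstrap listed in the course description amounts to resampling the observed data with replacement and recomputing a statistic. A minimal Python sketch, where the sample, the statistic (the median), and the number of replicates B are all arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.exponential(scale=1.0, size=100)  # any observed sample would do

B = 2000  # number of bootstrap replicates
medians = np.array([
    np.median(rng.choice(data, size=data.size, replace=True))
    for _ in range(B)
])

# The spread of the replicated statistic estimates its sampling variability.
print("bootstrap SE of the median:", round(medians.std(ddof=1), 3))
```

The same recipe gives standard errors or confidence intervals for statistics whose sampling distributions have no convenient closed form, which is why it pairs naturally with the smoothing methods covered later in the course.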

Last updated: 4/18/2006
