Data and Software for "Quantitative Analysis of Literary Styles" by Peng and Hengartner

blockcount -- the Perl script used to count occurrences of words, given a wordlist and a series of document files. You may need to modify the first line (the interpreter line) so that it points to the location of Perl on your system.
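
For readers who want to see the idea without running the Perl script, here is a minimal sketch in R of counting wordlist occurrences in a set of texts. It is not a port of blockcount: the helper name count_words, the tokenisation rule, and the toy texts are all assumptions made for illustration.

```r
# Hypothetical helper (not part of blockcount): count how often each
# word in `wordlist` occurs in the character string `text`.
count_words <- function(text, wordlist) {
  # Lower-case, then split on anything that is not a letter or apostrophe
  tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))
  tokens <- tokens[nzchar(tokens)]
  # Fixed factor levels keep zero counts and ignore words not in the list
  counts <- table(factor(tokens, levels = wordlist))
  setNames(as.integer(counts), wordlist)
}

# Toy inputs, invented for this example
wordlist <- c("the", "and", "of", "to")
docs <- c(doc1 = "The cat and the dog went to the park.",
          doc2 = "Of mice and men, and of other things.")

# One row per document, one column per word in the list
counts <- t(sapply(docs, count_words, wordlist = wordlist))
print(counts)
```

The result is a documents-by-words matrix of counts, which is the general shape of input that a discriminant analysis expects.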

author-cda.R -- some R code for doing canonical discriminant analysis. You do not strictly need to download this code: the lda function in the MASS package for R (part of the VR bundle) gives similar results.
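
As a rough illustration of the lda route, the sketch below fits a linear discriminant analysis to simulated word counts. The three "authors", the four words, and the Poisson rates are made up for the example and are not the data distributed here; the code is also not taken from author-cda.R.

```r
library(MASS)  # provides lda()

set.seed(1)

# Simulated stand-in data, NOT the real word counts:
# 3 hypothetical authors, 30 text blocks each, 4 hypothetical words.
n <- 30
words <- c("the", "and", "of", "to")
rates <- rbind(A = c(20, 10, 5, 8),
               B = c(12, 15, 9, 8),
               C = c(15, 6, 12, 14))
author <- factor(rep(rownames(rates), each = n))

# One row per text block; each count is Poisson with that author's rate
x <- matrix(rpois(length(author) * length(words),
                  lambda = rates[as.character(author), ]),
            ncol = length(words), dimnames = list(NULL, words))
dat <- data.frame(author = author, x)

# Fit the discriminant analysis with author as the grouping variable
fit <- lda(author ~ ., data = dat)

# Discriminant scores: one column per discriminant function
scores <- predict(fit)$x

# Cross-tabulate true author against the fitted classification
print(table(truth = dat$author, predicted = predict(fit)$class))
```

With three groups there are at most two discriminant functions, so scores has two columns; plotting one against the other gives the usual picture of how well the groups separate.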

Data files of word counts -- The data are available in two formats: as separate text files, one per author, or as a single R workspace file (compressed or uncompressed).

  1. Separate ASCII files: Austen | Cather | Dickens | Doyle | Kipling | London | Marlowe | Milton | Shakespeare. These files can be read using `read.table()' in R.
  2. R Workspace File: authordata.RData [140 K]
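
Either format can be read with base R. The sketch below shows both calls on small stand-in files written to a temporary location, so that it runs without the downloads; the object name demo and its contents are invented, and the layout of the real files and the names of the objects stored in authordata.RData may differ.

```r
# Made-up table standing in for one per-author file
demo <- data.frame(the = c(31L, 28L), and = c(17L, 22L))

# Format 1: a plain text file read with read.table()
txt_file <- tempfile(fileext = ".txt")
write.table(demo, txt_file)
counts <- read.table(txt_file)
print(counts)

# Format 2: an R workspace file read with load()
rdata_file <- tempfile(fileext = ".RData")
save(demo, file = rdata_file)
env <- new.env()
loaded <- load(rdata_file, envir = env)  # returns the names of the restored objects
print(loaded)
```

With the real workspace file, load("authordata.RData") followed by ls() should list whatever objects it contains. Depending on how the text files are laid out, read.table() may need arguments such as header = TRUE.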

The list of function words used is in the wordlist.txt file. You do not need to download it if you download the R workspace file.

The reference for the paper is:

Peng, R. D. and Hengartner, N. W. (2002) "Quantitative analysis of literary styles." The American Statistician, 56 (3), 175--185.