Executive summary

In this analysis we create a model that predicts how well weights are lifted, based on measurements made by on-body sensors and a labelled training data set. Our model combines predictions made by a random forest model and a boosted trees model, reaching an accuracy of 99.31% (and an estimated out of sample error of 0.69%) and a kappa value of 0.991 on a validation data set containing 20% of the training data set. We also use our prediction model to predict 20 different test cases.

Note - The R code used to conduct this analysis can be found in the repository that this report was published in: https://github.com/spujadas/coursera-pml

Introduction

One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, our goal will be to use data from accelerometers – on the belt, forearm, arm, and dumbbell – of 6 participants, to assess how well weights are lifted.

Exploratory data analysis

We first download the training data set and load it into R using read.csv() without any options.

rm(list=ls())

harTrainingFile <- "project-data/pml-training.csv"
if (!file.exists(harTrainingFile)) {
  harTrainingUrl <- 
    "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
  download.file(harTrainingUrl, destfile = harTrainingFile)
}

harRaw <- read.csv(harTrainingFile)

We then perform some basic exploratory data analysis.

Note – For the sake of readability the R output of our exploratory data analysis is not included here, but the interested reader can download the source of this report to reproduce our analysis.

head(harRaw)
dim(harRaw)  # 19622 obs. of 160 variables

str(harRaw)
# contains many numeric variables imported as factors, many NAs, metadata
# (e.g. timestamps), and error strings (#DIV/0)

levels(harRaw$amplitude_yaw_belt)
# factor w/ 4 levels: "", "#DIV/0", "0.00", "0.0000"

summary(harRaw)

table(colSums(is.na(harRaw)))
# 60 columns have no NAs, 100 columns have 19216 NAs

This data set contains 19622 observations of 160 variables.

The outcome variable that we want to predict is named classe, and according to the notes accompanying the data set (see References) its possible values are A through E, where A indicates that the activity (namely 10 repetitions of the unilateral dumbbell biceps curl) was performed exactly according to the specification, and B through E correspond to common mistakes made when lifting weights.
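For instance, the distribution of the outcome variable across the observations can be inspected with a one-liner (output omitted, as for the rest of the exploratory analysis):

table(harRaw$classe)  # number of observations per class A to E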

In addition to actual physical measurements (e.g. arm sensor orientation, dumbbell sensor orientation), the observations include metadata, such as the observation number, the name of the user, and the date when the activity was performed. This metadata does not represent measurements of physical activity and should therefore not be used as predictors when building our model.

Looking at the data, we see that many numeric variables have been imported as factors because of error strings (#DIV/0) and empty strings, that 100 of the 160 columns are almost entirely made up of missing values, and that several columns contain observation metadata rather than sensor measurements. We address these issues in the next section.

Data preparation

We re-read the data set in R, this time treating values equal to NA, the empty string, or #DIV/0 as not available.

harTrainingFile <- "project-data/pml-training.csv"
if (!file.exists(harTrainingFile)) {
  harTrainingUrl <- 
    "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
  download.file(harTrainingUrl, destfile = harTrainingFile)
}

harRaw <- read.csv(harTrainingFile, na.strings = c("NA", "", "#DIV/0"))

We then discard all variables (i.e. columns) that have missing values. As discussed above, we also remove variables that represent observation metadata.

# remove columns containing NAs
columnsWithoutNAs <- which(colSums(is.na(harRaw)) == 0)
har <- harRaw[, columnsWithoutNAs]

# remove observation metadata (observation number, user, timestamp, window):
har <- har[, !names(har) %in% c("X", "user_name", "raw_timestamp_part_1",
                                "raw_timestamp_part_2", "cvtd_timestamp",
                                "new_window", "num_window")]

As the final part of the data preparation phase, we split our data set into a training set (containing 60% of the data), a testing set (20%), and a validation set (20%).

library(caret)

# set seed for reproducibility
set.seed(9876)

# training + testing (80%) and validation (20%) sets
buildIndices <- createDataPartition(y = har$classe, p = .8, list = F)
validationHAR <- har[-buildIndices,]
buildDataHAR <- har[buildIndices,]

# training (60%, i.e. 0.75*80%) and testing (20%) sets
trainIndices <- createDataPartition(y = buildDataHAR$classe, p = .75, list = F)
trainHAR <- buildDataHAR[trainIndices,]
testHAR <- buildDataHAR[-trainIndices,]

Model selection

We train four classification-oriented models on the training set, using random forests (method rf in R's caret package), boosted trees (gbm), a linear support vector machine (svmLinear), and linear discriminant analysis (lda).

We train the models using 5-fold cross-validation. In other words, for each model and each candidate combination of tuning parameters (for instance, the tuning parameter for the random forest model is mtry, the number of randomly selected predictors), the training data is split into five folds; each fold is held out in turn and serves as a cross-validation set on which the accuracy is calculated after the model has been trained on the remaining four folds. The combination of tuning parameters with the best average accuracy is then used to fit the final model on the full training set.

# control parameters for training
trainCtrl <- trainControl(method = "cv", number = 5)

# we set the same random number seed before training each model to ensure that
# the same resampling sets are used.

# random forest
set.seed(65464)
modRf <- train(classe ~ ., method = "rf", data = trainHAR,
               trControl = trainCtrl)

# boosted trees
set.seed(65464)
modGbm <- train(classe ~ ., method="gbm", data = trainHAR,
                trControl = trainCtrl)

# support vector machine
set.seed(65464)
modSvm <- train(classe ~ ., method="svmLinear", data = trainHAR,
                trControl = trainCtrl)

# linear discriminant analysis
set.seed(65464)
modLda <- train(classe ~ ., method="lda", data = trainHAR,
                trControl = trainCtrl)
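
For illustration, once training has completed, the tuning parameter values selected by cross-validation can be inspected as follows (output omitted):

# tuning parameters selected by cross-validation
modRf$bestTune   # chosen value of mtry for the random forest
modGbm$bestTune  # chosen n.trees, interaction.depth, shrinkage, n.minobsinnode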

Before we use these models to predict classes on the test set, we compare their resampling distributions (noting that there are 5 resamples for each model as we used 5-fold cross-validation) to get a general idea of the difference in performance between the models, and to estimate the out of sample error.

# collect resamples
results <- resamples(list(RF=modRf, GBM=modGbm, SVM=modSvm, LDA=modLda))

# summary of the resampling distributions
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: RF, GBM, SVM, LDA 
## Number of resamples: 5 
## 
## Accuracy 
##       Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## RF  0.9881  0.9885 0.9885 0.9887  0.9890 0.9894    0
## GBM 0.9545  0.9559 0.9571 0.9577  0.9601 0.9609    0
## SVM 0.7715  0.7759 0.7767 0.7784  0.7835 0.7841    0
## LDA 0.6938  0.6991 0.7003 0.7010  0.7016 0.7101    0
## 
## Kappa 
##       Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## RF  0.9850  0.9855 0.9855 0.9857  0.9860 0.9866    0
## GBM 0.9425  0.9441 0.9458 0.9465  0.9495 0.9506    0
## SVM 0.7099  0.7148 0.7159 0.7182  0.7252 0.7252    0
## LDA 0.6131  0.6192 0.6206 0.6217  0.6227 0.6328    0
# estimated out of sample errors
estOseRf <- 1-mean(results$values[, "RF~Accuracy"])
estOseGbm <- 1-mean(results$values[, "GBM~Accuracy"])
estOseSvm <- 1-mean(results$values[, "SVM~Accuracy"])
estOseLda <- 1-mean(results$values[, "LDA~Accuracy"])

# dot plots of results
dotplot(results)

Based on the mean accuracies on the resamples, we can estimate that the out of sample errors will be close to 1.13% for random forests, 4.23% for boosted trees, 22.16% for the support vector machine, and 29.9% for linear discriminant analysis.

We now use the models to predict the class of the activities in the testing data set, and calculate the accuracy (and hence the out of sample error, i.e. 1 minus the accuracy) of the models on this set.

# random forest
predRfTest <- predict(modRf, newdata=testHAR)
cmRfTest <- confusionMatrix(testHAR$classe, predRfTest)
accRfTest <- cmRfTest$overall[['Accuracy']]
oseRfTest <- 1-accRfTest
kappaRfTest  <- cmRfTest$overall[['Kappa']]

# boosted trees
predGbmTest <- predict(modGbm, newdata=testHAR)
cmGbmTest <- confusionMatrix(testHAR$classe, predGbmTest)
accGbmTest <- cmGbmTest$overall[['Accuracy']]
oseGbmTest <- 1-accGbmTest
kappaGbmTest  <- cmGbmTest$overall[['Kappa']]

# support vector machine
predSvmTest <- predict(modSvm, newdata=testHAR)
cmSvmTest <- confusionMatrix(testHAR$classe, predSvmTest)
accSvmTest <- cmSvmTest$overall[['Accuracy']]
oseSvmTest <- 1-accSvmTest
kappaSvmTest  <- cmSvmTest$overall[['Kappa']]

# linear discriminant analysis
predLdaTest <- predict(modLda, newdata=testHAR)
cmLdaTest <- confusionMatrix(testHAR$classe, predLdaTest)
accLdaTest <- cmLdaTest$overall[['Accuracy']]
oseLdaTest <- 1-accLdaTest
kappaLdaTest  <- cmLdaTest$overall[['Kappa']]

After predicting the classes on the testing set, we find that the accuracy of the random forest model is excellent, and that the accuracy of the boosted trees model is also very good. We also note that our estimates of the out of sample errors, made on the resamples, were quite close to the actual out of sample errors on the testing set.
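
For easier comparison, the testing-set metrics computed above can be gathered into a single table (a small convenience snippet that simply reuses the variables defined above; output omitted):

# summarise accuracy, out of sample error and kappa of each model on the testing set
data.frame(model    = c("RF", "GBM", "SVM", "LDA"),
           accuracy = c(accRfTest, accGbmTest, accSvmTest, accLdaTest),
           ose      = c(oseRfTest, oseGbmTest, oseSvmTest, oseLdaTest),
           kappa    = c(kappaRfTest, kappaGbmTest, kappaSvmTest, kappaLdaTest))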

We then fit a model that combines the two best predictors, namely the random forest model and the boosted trees model: their predictions on the testing set are used as inputs to a new random forest.

# fit a model that combines predictors
predRfGbmTest <- data.frame(predRf = predRfTest, predGbm = predGbmTest,
                               classe=testHAR$classe)
modCombRfGbm <- train(classe ~ ., method = "rf", data = predRfGbmTest,
                 trControl = trainCtrl)

# predict on testing set
predCombPredTest <- predict(modCombRfGbm, predRfGbmTest)

We use this final model to predict the classes on the validation set, and display the resulting confusion matrix.

# first predict using the random forest and boosted trees models
predRfValidation <- predict(modRf, newdata=validationHAR)
predGbmValidation <- predict(modGbm, newdata=validationHAR)

# feed these predictions into our final model
predCombRfGbmValidation <- predict(modCombRfGbm, 
                              newdata=data.frame(predRf = predRfValidation, 
                                                 predGbm = predGbmValidation))

# confusion matrix
cmCombRfGbmValidation <- confusionMatrix(validationHAR$classe,
                                         predCombRfGbmValidation)
cmCombRfGbmValidation$table
##           Reference
## Prediction    A    B    C    D    E
##          A 1116    0    0    0    0
##          B    5  754    0    0    0
##          C    0    5  677    2    0
##          D    0    0   14  629    0
##          E    0    0    0    1  720
# accuracy, out of sample error, kappa on validation set
accCombRfGbmValidation <- cmCombRfGbmValidation$overall[['Accuracy']]
oseCombRfGbmValidation <- 1-accCombRfGbmValidation
kappaCombRfGbmValidation  <- cmCombRfGbmValidation$overall[['Kappa']]

The final model has an accuracy on the validation set of 99.3% and a kappa value of 0.991, which is slightly better than the already excellent accuracy of the random forest model.

We expect that the out of sample error for our final model will be close to 0.69%, the out of sample error on the validation set.
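
As a rough check on this estimate, caret's confusionMatrix() also reports a 95% confidence interval for the accuracy, which can be inspected on the validation set as follows (output omitted):

# 95% confidence interval for the accuracy on the validation set
cmCombRfGbmValidation$overall[c("AccuracyLower", "AccuracyUpper")]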

Predictions

To make the final predictions, we first load the test cases, marking non-available values as we did for the training data file. We then use the previously fitted random forest and boosted trees models to make a first set of predictions, which we then pass to the model that combines the two predictors.

harTestingFile <- "project-data/pml-testing.csv"
if (!file.exists(harTestingFile)) {
  harTestingUrl <- 
    "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
  download.file(harTestingUrl, destfile = harTestingFile)
}

testCases <- read.csv(harTestingFile, na.strings = c("NA", "", "#DIV/0"))
predTestCasesRf <- predict(modRf, newdata=testCases)
predTestCasesGbm <- predict(modGbm, newdata=testCases)

# prediction
predTestCasesCombModFit <- predict(modCombRfGbm, 
                                   newdata = data.frame(
                                     predRf = predTestCasesRf, 
                                     predGbm = predTestCasesGbm))

The resulting predicted classes for the test cases are: B, A, B, A, A, E, D, B, A, A, B, C, B, A, E, E, A, B, B, B.

It may be noted that the original (i.e. pre-combination) predictors yield the same classes, as the underlying models were very accurate to begin with.
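
This can be checked directly by comparing the predicted classes (a quick verification snippet using the objects defined above):

# check that the individual models agree with the combined model on the test cases
all(predTestCasesRf == predTestCasesCombModFit)   # TRUE
all(predTestCasesGbm == predTestCasesCombModFit)  # TRUE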

References

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013. Website: http://groupware.les.inf.puc-rio.br/har