# Machine Learning With Caret In R

Constructing models in caret involves two steps. First, we have to decide how training should occur. For example, should it use k-fold cross-validation, where the data is divided into k equal parts and k rounds of training and testing occur, with the model trained on k-1 portions of the data and tested on the remaining held-out portion? There are many different options, but for now we will just use cross-validation with 5 folds.
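As a minimal sketch (assuming the caret package is installed), a 5-fold cross-validation scheme is specified with caret's trainControl() function:

```r
# Sketch: configure 5-fold cross-validation for later use with train()
library(caret)

ctrl <- trainControl(method = "cv",  # k-fold cross-validation
                     number = 5)     # k = 5 folds
```

The resulting object is later handed to train(), which carries out the resampling automatically.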

But this is only a taste of the power of the caret package, created by Max Kuhn who now works for RStudio. Behind the scenes caret takes these lines of code and automatically resamples the models and conducts parameter tuning (more on this later). This enables you to build and compare models with very little overhead.

As mentioned above, one of the most powerful aspects of the caret package is the consistent modeling syntax. By simply changing the method argument, you can easily cycle between, for example, running a linear model, a gradient boosting machine model and a LASSO model. In total, there are 233 different models available in caret. This blog post will focus on regression-type models (those with a continuous outcome), but classification models are also easily applied in caret using the same basic syntax.

A common approach in machine learning, assuming you have enough data, is to split your data into a training dataset for model building and testing dataset for final model testing. The model building, with the help of resampling, would be conducted only on the training dataset. Once a final model has been chosen you would do an additional test of performance by predicting the response in the testing set.
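One way to make this split in caret is createDataPartition(), which samples rows while roughly preserving the outcome's distribution. A sketch using the built-in mtcars data (an illustrative choice, not a dataset from this post):

```r
library(caret)

set.seed(123)  # make the split reproducible
# Indices for ~80% of rows, stratified on the outcome mpg
in_train <- createDataPartition(mtcars$mpg, p = 0.8, list = FALSE)

training <- mtcars[in_train, ]   # model building and resampling happen here
testing  <- mtcars[-in_train, ]  # held out for the final performance check
```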

By default caret will bootstrap your data 25 times. But this can be computationally expensive if you have a large dataset (25 + 1 model runs in this case). Although caret will perform the resampling in parallel on multiple cores if possible, with a large dataset you might consider a 10-fold cross validation instead. Essentially this means splitting your data into 10 approximately equal chunks. You then develop a model based on 9 chunks and predict the 10th. Then choose a different 9 and predict and so on.

caret can take care of the dirty work of setting up the resampling with the help of the trainControl() function and the trControl argument. Careful here, the argument is trControl not trainControl. The trainControl() function allows you to set up several aspects of your model including the resampling method.
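A sketch of how the two pieces fit together, again using mtcars for illustration:

```r
library(caret)

# trainControl() builds the resampling specification...
ctrl <- trainControl(method = "cv", number = 10)

# ...which is passed to train() via the trControl argument
fit <- train(mpg ~ wt, data = mtcars,
             method = "lm",
             trControl = ctrl)
```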

Even if you only model using linear regression caret can improve your workflow by simplifying data splitting, automating your resampling and providing a vehicle for comparing models. If you need to compare different model types and particularly if you run models with tuning parameters caret will save you an incredible amount of time by automating resampling on different settings of your tuning parameters and allowing you to use a consistent syntax across hundreds of different model types. It will take some time to get up to speed with caret but this time will pay serious dividends in future modeling workflow efficiency.

Using the wine dataset our task is to build a model to recognize the origin of the wine. The original owners of this dataset are Forina, M. et al., PARVUS, Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno, 16147 Genoa, Italy. This wine dataset is hosted as open data on the UCI Machine Learning Repository.

The R package caret has a powerful train() function that allows you to fit over 230 different models using one syntax, including various tree-based models, neural nets, deep learning and much more.

Caret was originally billed as the one-stop solution for machine learning, but it is useful for general statistical modeling as well. A more modern option now available is the tidymodels package. We focus on caret here because there are currently more resources available.

Assuming this type of mathematical relationship, machine learning provides a set of methods for identifying that relationship. Said differently, machine learning provides a set of computational methods that accept data observations as inputs and subsequently estimate that mathematical function; machine learning methods learn the relationship by being trained with an input dataset.

To be clear, there is quite a bit of math involved in machine learning, but most of that math is taken care of for you. What I mean, is that for the most part, R libraries and functions perform the mathematical calculations for you. You just need to know which functions to use, and when to use them.

Caret solves this problem. To simplify the process, caret provides tools for almost every part of the model building process, and moreover, provides a common interface to these different machine learning methods.

For example, caret provides a simple, common interface to almost every machine learning algorithm in R. When using caret, different learning methods like linear regression, neural networks, and support vector machines, all share a common syntax (the syntax is basically identical, except for a few minor changes).

To say that more simply, caret provides you with an easy-to-use toolkit for building many different model types and executing critical parts of the ML workflow. This simple interface enables rapid, iterative modeling. In turn, this iterative workflow will allow you to develop good models faster, with less effort, and with less frustration.

Now, with this knowledge about caret's formula syntax, let's reexamine the above code. Because we want to predict mpg on the basis of wt, we use the formula mpg ~ wt. Again, this line of code is the "formula" that tells train() our target variable and our input variable. If we translate this line of code into English, we're effectively telling train(), "build a model that predicts mpg (miles per gallon) on the basis of wt (car weight)."

Finally, we see the method = parameter. This parameter indicates what machine learning method we want to use to predict y. In this case, we're building a linear regression model, so we are using the argument "lm".
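Putting the formula, data, and method arguments together, the call being described looks roughly like this (a sketch; mtcars ships with base R):

```r
library(caret)

model_lm <- train(mpg ~ wt,        # formula: predict mpg from wt
                  data = mtcars,   # data frame containing both variables
                  method = "lm")   # ordinary linear regression

summary(model_lm$finalModel)       # coefficients of the underlying lm fit
```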

Again, it's beyond the scope of this post to discuss all of the different model types. However, as you learn more about machine learning, and want to try out more advanced machine learning techniques, this is how you can implement them. You simply change the learning method by changing the argument of the method = parameter.

This is a good place to reiterate one of caret's primary advantages: switching between model types is extremely easy when we use caret's train() function. Again, if you want to use linear regression to model your data, you just type in "lm" for the argument to method =; if you want to change the learning method to k-nearest neighbor, you just replace "lm" with "knn".

Caret's syntax allows you to very easily change the learning method. In turn, this allows you to "try out" and evaluate many different learning methods rapidly and iteratively. You can just re-run your code with different values for the method parameter, and compare the results for each method.
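A sketch of this iterative comparison, fitting the same formula with two methods and collecting the resampling results side by side (seeds are set so both models see the same resamples):

```r
library(caret)

set.seed(1)
fit_lm  <- train(mpg ~ wt, data = mtcars, method = "lm")
set.seed(1)
fit_knn <- train(mpg ~ wt, data = mtcars, method = "knn")

# resamples() lines up the resampling results for comparison
summary(resamples(list(lm = fit_lm, knn = fit_knn)))
```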

Keep in mind though, if you're new to machine learning, there's still lots more to learn. Machine learning is intricate and fascinatingly complex. Moreover, caret has a variety of additional tools for model building. We've just scratched the surface here.

If you have questions about machine learning, or topics you're struggling with, sign up for the email list. Once you're on the email list, reply directly to any of the emails and send in your questions.

Expanding upon the last section, we will continue exploring machine learning in R. Specifically, we will use the caret (Classification and Regression Training) package. Many packages provide access to machine learning methods, and caret offers a standardized means to use a variety of algorithms from different packages. This link provides a list of all models that can be used through caret. In this module, we will specifically focus on k-nearest neighbor (k-NN), decision trees (DT), random forests (RF), and support vector machines (SVM); however, after learning to apply these methods you will be able to apply many more methods using similar syntax. We will explore caret using a variety of examples. The link at the bottom of the page provides the example data and R Markdown file used to generate this module.
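The four algorithms named above map onto caret method strings ("knn", "rpart" for decision trees, "rf" for random forests, and "svmRadial" for an SVM with a radial kernel; each may pull in its backing package). A sketch using the built-in iris data as a stand-in for the module's example data:

```r
library(caret)

# The same train() call, varying only the method string
methods <- c("knn", "rpart", "rf", "svmRadial")

fits <- lapply(methods, function(m) {
  set.seed(1)
  train(Species ~ ., data = iris, method = m)
})
names(fits) <- methods
```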

Take some time to review the results and assessment. Note that this is a different problem than those presented above; however, the syntax is very similar. This is one of the benefits of caret: it provides a standardized way to experiment with different algorithms and machine learning problems within R.

As noted in the machine learning background lectures, algorithms can be negatively impacted by imbalance in the training data. Fortunately, caret has built-in techniques for dealing with this issue, such as down-sampling the majority class and up-sampling the minority class.
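For example, down-sampling can be requested directly through trainControl()'s sampling argument, so the resampling and the rebalancing are handled together (a sketch):

```r
library(caret)

# Down-sample the majority class within each resampling iteration;
# "up", "smote", and "rose" are other built-in options
ctrl <- trainControl(method = "cv", number = 5,
                     sampling = "down")
```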

It is also possible to use caret to produce continuous predictions, similar to linear regression and geographically weighted regression. In this last example, I will repeat a portion of the analysis from the regression module and compare the results to those obtained with machine learning. As you might remember, the goal is to predict the percentage of people over 25 that have at least a bachelor's degree by county using multiple other variables. This data violated several assumptions of linear regression, so machine learning might be more appropriate.

Next, I create models and predictions using the four machine learning algorithms. Note that I have changed the tuning metric to RMSE, as Kappa is not appropriate for a continuous prediction. I then predict to the withheld data and obtain RMSE values.
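A sketch of this kind of continuous-prediction workflow, using mtcars as a stand-in for the county data (which is not included here) and random forests as one of the four algorithms:

```r
library(caret)

set.seed(99)
in_train <- createDataPartition(mtcars$mpg, p = 0.75, list = FALSE)
training <- mtcars[in_train, ]
testing  <- mtcars[-in_train, ]

# metric = "RMSE" selects tuning parameters by root mean squared error
fit_rf <- train(mpg ~ ., data = training,
                method = "rf", metric = "RMSE",
                trControl = trainControl(method = "cv", number = 5))

# Predict the withheld data and summarize RMSE and R-squared
preds <- predict(fit_rf, newdata = testing)
postResample(pred = preds, obs = testing$mpg)
```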