On this page I list the R packages that have been useful to me. If you would like to get better at programming in R, masterdataanalysis has put together a comprehensive list of available courses.

KDnuggets has done an interesting analysis of one CRAN mirror, which shows the packages most frequently downloaded by users of RStudio. Have a look at that as well. The graph is shown below, but I haven’t tried all of these packages; many are probably for specialty fields. Yhat also has a useful guide to many popular data science R packages.

[Figure: top 20 R packages for machine learning by downloads]

The list of R packages I find useful:

  1. caret (set of functions to streamline the process of creating predictive models)
  2. ggplot2 (data visualization)
  3. forecast (for easy forecasting of time series)
  4. plyr (data aggregation)
  5. dplyr (data manipulation)
  6. stringr (string manipulation)
  7. lubridate (time and date manipulation)
  8. e1071 (machine learning algorithms)
  9. reshape2 (data restructuring)
  10. randomForest (random forest predictive models)
  11. ROCR (visualizing the performance of scoring classifiers)

New Packages that look interesting:

  • AzureML V0.1.1

Cloud computing is, or will be, important to every practicing data scientist. Microsoft’s Azure ML is a particularly rich machine learning environment for R (and Python) programmers. If you are not yet an Azure user, this new package goes a long way toward overcoming the inertia involved in getting started. It provides functions to push R code from your local environment up to the Azure cloud and to publish functions and models as web services. The package vignette walks you step by step from getting a trial account and the necessary credentials to publishing your first simple examples.

The random forests algorithm is the “go to” ensemble method for many data scientists, as it consistently performs well on diverse data sets. This new variation, rotation forest, which performs Principal Component Analysis on random subsets of the feature space, shows great promise. See the paper by Rodriguez et al. for an explanation of how the PCA amounts to rotating the feature space, and for a comparison of the rotation forest algorithm with standard random forests and the AdaBoost algorithm.
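To make the "rotation" idea concrete, here is a minimal sketch in base R of how one rotation matrix might be built. It is an illustration under assumptions, not the package's code: build_rotation() is a hypothetical helper, the number of feature subsets and the 75% row subsample are illustrative choices, and the step that fits a tree on the rotated data is left out.

```r
# Build one rotation matrix in the style of rotation forest (base R only).
# build_rotation() is a hypothetical helper, not a function from any package.
build_rotation <- function(X, n_subsets = 2, sample_frac = 0.75) {
  p <- ncol(X)
  # Randomly partition the feature indices into disjoint subsets.
  groups <- split(sample(p), rep_len(seq_len(n_subsets), p))
  R <- matrix(0, p, p, dimnames = list(colnames(X), colnames(X)))
  for (idx in groups) {
    # PCA on a random subsample of rows, restricted to this feature subset.
    rows <- sample(nrow(X), size = floor(sample_frac * nrow(X)))
    pca <- prcomp(X[rows, idx, drop = FALSE], center = TRUE, scale. = FALSE)
    # Place the loadings in the block that belongs to these features.
    R[idx, idx] <- pca$rotation
  }
  R
}

set.seed(1)
X <- as.matrix(iris[, 1:4])
R <- build_rotation(X)
X_rot <- X %*% R  # rotated features; each tree would be trained on data like this

# Each block holds orthonormal loadings, so R is orthogonal: t(R) %*% R = I.
orth_error <- max(abs(crossprod(R) - diag(ncol(X))))
```

Because all principal components are kept, no information is thrown away. The data are only re-expressed along new axes, and drawing a fresh random partition for every tree is what gives the ensemble its diversity.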

Given a matrix that is a superposition of a low-rank component and a sparse component, rpca uses a robust PCA method to recover these components. Netflix data scientists publicized this algorithm, which is based on the paper Robust Principal Component Analysis by Candès et al., earlier this year when they reported spectacular success using robust PCA in an outlier detection problem.
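The low-rank plus sparse decomposition can be sketched in a few lines of base R. What follows is my own rough version of the principal component pursuit iteration described by Candès et al., not the package's implementation: robust_pca_sketch(), soft_threshold() and svd_threshold() are hypothetical names, and the defaults for lambda and mu are the values I understand the paper to suggest.

```r
# Soft-thresholding: shrink every entry toward zero by tau.
soft_threshold <- function(X, tau) sign(X) * pmax(abs(X) - tau, 0)

# Singular value thresholding: soft-threshold the singular values of X.
svd_threshold <- function(X, tau) {
  s <- svd(X)
  d <- pmax(s$d - tau, 0)
  s$u %*% (d * t(s$v))  # same as u %*% diag(d) %*% t(v)
}

# Hypothetical helper: split M into low-rank L and sparse S with M = L + S.
robust_pca_sketch <- function(M, max_iter = 5000, tol = 1e-7) {
  lambda <- 1 / sqrt(max(dim(M)))          # weight on the sparse term
  mu <- prod(dim(M)) / (4 * sum(abs(M)))   # penalty parameter
  S <- Y <- matrix(0, nrow(M), ncol(M))
  for (i in seq_len(max_iter)) {
    L <- svd_threshold(M - S + Y / mu, 1 / mu)        # update low-rank part
    S <- soft_threshold(M - L + Y / mu, lambda / mu)  # update sparse part
    resid <- M - L - S
    Y <- Y + mu * resid                               # update multipliers
    if (norm(resid, "F") <= tol * norm(M, "F")) break
  }
  list(L = L, S = S, iterations = i)
}

# Synthetic check: a rank-2 matrix with 5% of its entries grossly corrupted.
set.seed(42)
n <- 50
L_true <- matrix(rnorm(n * 2), n, 2) %*% matrix(rnorm(2 * n), 2, n)
S_true <- matrix(0, n, n)
hit <- sample(n * n, 125)
S_true[hit] <- sample(c(-10, 10), 125, replace = TRUE)
M <- L_true + S_true

fit <- robust_pca_sketch(M)
rel_err <- norm(fit$L - L_true, "F") / norm(L_true, "F")
```

In an outlier detection setting, the large entries of fit$S are the candidates to inspect, while fit$L plays the role of the cleaned-up signal.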

[Figure: Netflix outliers]

The support vector machine is also a mainstay machine learning algorithm. SwarmSVM, which is based on a clustering approach described in a paper by Gu and Han, provides three ensemble methods for training support vector machines. The vignette that accompanies the package provides a practical introduction to the method.