There are lots of great learning resources and suggested learning plans available online.

Below is a data science curriculum you can follow to get the basics down. If you know of a great learning resource that should be mentioned, drop a comment.


 

12-WEEK DATA SCIENCE CURRICULUM

Suggested schedule:

Weeks 1 & 2: Unit 2 – Probability & Statistics

Weeks 3 & 4: Unit 3 – Exploratory Data Analysis, Intro to R and Data Visualization

Week 5: Unit 4 – Data Wrangling and Unit 5.1 – Introduction to Analytics

Weeks 6-10: Unit 5 – Analyzing Data Sets

Weeks 11-12: Unit 6 – Capstone Project

1 INTRODUCTION (0.5+ HOURS)

  • Start by watching this entertaining talk from the awesome Hilary Mason, covering everything from the history to the future of Data Science.
  • Submit 3 potential Capstone project ideas and discuss with your mentor (30 Minutes)
  • Think about what kind of data sets excite you. What would you like to gain deeper insights into? Do you care about sentiment analysis on unstructured text, like news items, social media posts or support tickets? Or about sports analytics à la Moneyball? Healthcare, the environment, or simply analyzing and predicting engagement and retention rates for different customer segments? Submit 3 Capstone project ideas that you feel excited about working on. You can explore datasets from Quandl, US Government Open Data, the UCI Machine Learning Repository or anywhere else you like. If you are super-charged, you can also browse Kaggle competitions.

Obviously, you haven’t learned enough to understand the complexity of these projects yet, but thinking of a few project ideas upfront can be a great motivator. It can also help you assimilate the curriculum material in that context and relate to it much better.

2 PROBABILITY & STATISTICS (13.5+ HOURS)

  • This refresher on probability explains independent and dependent events, compound events and mutually exclusive events.
  • This section covers measures of central tendency and dispersion: mean, median, mode, variance, and standard deviation.
  • This section covers random variables and probability distributions (both discrete and continuous). You can leave out the last 2 videos on Poisson processes – they are a bit advanced and not super relevant for the rest of the course.
  • Regression is a method for fitting a line to a set of points. This section covers linear regression and the R-squared measure of fit.
  • This section contains many advanced concepts if you are in the mood for a deep dive. The videos on the normal distribution, confidence intervals, and hypothesis testing are the most useful to watch. (A small base-R sketch of the core ideas from this unit follows this list.)
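To make these ideas concrete, here is a minimal base-R sketch using the built-in mtcars data set (my choice of example, not part of the linked videos):

    # Central tendency and dispersion on a numeric vector
    x <- mtcars$mpg
    mean(x); median(x); var(x); sd(x)

    # Base R's mode() returns the storage type, not the statistical mode,
    # so here is a tiny helper for the most frequent value:
    stat_mode <- function(v) as.numeric(names(which.max(table(v))))
    stat_mode(round(x))

    # Simple linear regression: fit a line and read off R-squared
    fit <- lm(mpg ~ wt, data = mtcars)
    summary(fit)$r.squared   # proportion of variance explained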

3 EXPLORE & VISUALIZE DATA WITH R (34 HOURS)

3.1 EXPLORATORY DATA ANALYSIS (1 HOUR)

Exploratory Data Analysis (EDA) is an approach for summarizing and visualizing the important characteristics of a data set. Done before any formal methods, it helps you form an intuition about the data set.

  • Data Analysis with R is a great course co-developed by Facebook & Udacity. All of Udacity’s courseware can be accessed for free by clicking the blue button called “Access Course Material”. You also have the option of paying for a coach + a verified certificate if you would like.
  • This handbook compares different analysis approaches (EDA, classical, Bayesian, etc.) and provides several methods to check assumptions about incoming data, an important part of the analysis process. It is also a great reference for various graphical & quantitative techniques, and EDA case studies.

3.2 GET STARTED WITH R (7 HOURS)

For this course, we will use the R statistical software as the primary tool, though Python and its libraries are equally popular.

  • DataCamp has created a fun and interactive web tutorial on top of the open-source swirl project. Warm up with in-browser R exercises without worrying about installation yet.
  • Now that you have experienced the fun of R programming, it is time to set up your local R environment. This step-by-step tutorial covers installation instructions for R and RStudio, shortcuts, common commands, syntax quirks, and even basic analysis and visualization. (A few first commands to try appear below.)
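Once everything is installed, here are a few first commands to try at the console; a minimal sketch, and the linked tutorial goes much further:

    x <- c(4, 8, 15, 16, 23, 42)    # create a numeric vector
    length(x)                       # number of elements
    summary(x)                      # five-number summary plus the mean

    df <- data.frame(name = c("a", "b", "c"), score = c(90, 85, 77))
    str(df)                         # inspect a data frame's structure
    df$score                        # access a column with $
    plot(df$score, type = "b")      # a basic base-R plot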

3.3 DIVE INTO EDA (18 HOURS)

  • Start with exploring one variable within a data set.
  • Submit links to all project files including visualization images, code, writeups etc. These will be shared with your mentor for feedback and also be included in your completion certificate.
  • In this lesson, we will learn techniques for exploring the relationship between any two variables in a data set. We’ll create scatter plots, calculate correlations, and investigate conditional means.
  • Submit links to all project files including visualization images, code, writeups etc. These will be shared with your mentor for feedback and also be included in your completion certificate.
  • In this lesson, you will learn powerful methods and visualizations for examining relationships among multiple variables. You’ll learn how to reshape data frames and how to use aesthetics like color and shape to uncover more information. (A short ggplot2 sketch of the one-, two-, and multi-variable steps follows this list.)
  • Submit links to all project files including visualization images, code, writeups etc. These will be shared with your mentor for feedback and also be included in your completion certificate.
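To ground these three lessons, here is a hedged ggplot2 sketch using the diamonds data set that ships with the package (my example data, not the course's own):

    library(ggplot2)

    # One variable: the distribution of price
    ggplot(diamonds, aes(x = price)) + geom_histogram(binwidth = 500)

    # Two variables: scatter plot, correlation, and conditional means
    ggplot(diamonds, aes(x = carat, y = price)) + geom_point(alpha = 0.1)
    cor(diamonds$carat, diamonds$price)
    aggregate(price ~ cut, data = diamonds, FUN = mean)

    # Multiple variables: a color aesthetic adds a third dimension
    ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
      geom_point(alpha = 0.1)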

3.4 ELECTIVE: DATA VISUALIZATION (8 HOURS)

  • If the last few sections on data visualization captured your imagination, dive deeper with this MOOC created by the Knight Center for Journalism in the Americas. The MOOC itself is not always available online, but the videos are. They are enjoyable and make a nice break from the more technically challenging courses in this path. However, while the material may be easy to understand, data visualization is a deeper topic than it seems. These examples should help illuminate what makes a good visualization and give ideas for more creative ways to display information. You will also learn general principles of graphic design and visual perception.
  • A very comprehensive survey of powerful visualization techniques by Heer, Bostock & Ogievetsky of Stanford University.
  • Data Visualization Project (2 Hours)
  • Choose one of the data sets you picked for your Capstone project in Section 1 and turn it into an interesting visualization. Submit the code and image files. If you haven’t picked your Capstone data sets yet, we strongly encourage you to do so now.

4 DATA WRANGLING (4.5+ HOURS)

One of the most time-consuming steps in any data analysis is cleaning the data and getting it into a format that allows analysis. Data wrangling, also known as data munging, is the process of converting raw data into a format that is more convenient to analyze, often with the help of semi-automated tools.

  • In this section, you will learn all about tools in R that make data wrangling a snap.

4.1 DPLYR (1.5+ HOURS)

  • dplyr is a package that helps you do faster data exploration, data manipulation, and feature engineering. Its syntax is intuitive and its functions are well named, so dplyr code is easy to read even if you didn’t write it. It is developed by Hadley Wickham, author of plyr, ggplot2, devtools, stringr, and many other popular R packages.
  • This tutorial is the second in the series from Kevin @ Dataschool.io. It covers the most useful new features in dplyr versions 0.3 and 0.4, as well as some advanced functionality from previous versions that wasn’t covered in Part 1.
  • Data Wrangling Project (1 Hour)
  • For this section, submit the course project for Coursera’s Getting and Cleaning Data course from Johns Hopkins University. Since the course is not accessible all the time, we have reproduced the project statement below. It is a good exercise to practice your newly acquired data wrangling skills! (A minimal skeleton of the script follows the project statement.)

Description: One of the most exciting areas in all of data science right now is wearable computing. Companies like Fitbit, Nike, and Jawbone Up are racing to develop the most advanced algorithms to attract new users. In this project, you will analyze this dataset collected from the accelerometers from the Samsung Galaxy S smartphone. A full description of the data is available here.

You should create one R script called run_analysis.R that does the following.

  1. Merges the training and the test sets to create one data set.
  2. Extracts only the measurements on the mean and standard deviation for each measurement.
  3. Uses descriptive activity names to name the activities in the data set.
  4. Appropriately labels the data set with descriptive variable names.
  5. From the data set in step 4, creates a second, independent tidy data set with the average of each variable for each activity and each subject.

Submit:

  1. The tidy data set as described above
  2. A link to a Github repository with your script for performing the analysis, and
  3. A code book, called CodeBook.md, that describes the variables, the data, and any transformations or work that you performed to clean up the data. You should also include a README.md in the repo with your scripts; it should explain how all of the scripts work and how they are connected.
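To show the shape of a solution, here is a minimal skeleton of run_analysis.R. The file paths assume the standard "UCI HAR Dataset" folder layout (verify against your download), it leans on the dplyr ideas above, and steps 3-5 are deliberately left as outlines:

    library(dplyr)

    # 1. Merge the training and the test sets to create one data set
    features <- read.table("UCI HAR Dataset/features.txt", stringsAsFactors = FALSE)
    x_all <- rbind(read.table("UCI HAR Dataset/train/X_train.txt"),
                   read.table("UCI HAR Dataset/test/X_test.txt"))
    names(x_all) <- features$V2

    # 2. Extract only the mean and standard deviation measurements
    x_sub <- x_all[, grepl("mean\\(\\)|std\\(\\)", names(x_all))]

    # 3 & 4. Attach subject and activity columns, then map activity codes
    # to the descriptive names in activity_labels.txt (left as an exercise)

    # 5. With activity and subject columns in place, dplyr finishes the job:
    # tidy <- x_sub %>% group_by(activity, subject) %>% summarise_all(mean)
    # write.table(tidy, "tidy_data.txt", row.names = FALSE)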

4.2 ELECTIVE: WEB-SCRAPING AND APIS

Though there are a good number of open data sets available, a lot more data is available through APIs and as HTML web pages. If you are interested in web-scraping and using APIs, follow this optional 4-part elective course by Rolf Fredheim at the University of Cambridge.

  • This section covers coercing scraper output into data frames, downloading files (along with a cursory look at the state of IP law), basic text manipulation in R, and a first look at working with APIs (share counts on Facebook).
  • This section covers using APIs in R, focusing on the Google Maps API. We then attempt to transfer this approach to query the Yandex Maps API. Finally, the practice section includes examples of working with the YouTube V2 API, a few ‘social’ APIs such as LinkedIn and Twitter, as well as APIs further off the beaten track (cricket scores, anyone?).
  • rvest is a new package that makes it easy to scrape (or harvest) data from HTML web pages, inspired by libraries like Beautiful Soup. (A short scraping sketch follows this list.)
  • Project: Submit a cleaned-up data set from your favorite website or API (2 Hours)
  • You might find Apigee useful for exploring APIs without having to write any code.
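As a taste of rvest, here is a hedged sketch; the URL and CSS selectors are placeholders for whatever site you choose (check its terms of service and robots.txt first):

    library(rvest)

    page      <- read_html("https://example.com/articles")        # placeholder URL
    headlines <- page %>% html_nodes("h2.title") %>% html_text()  # placeholder selector
    links     <- page %>% html_nodes("h2.title a") %>% html_attr("href")

    head(data.frame(headline = headlines, link = links))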

5 ANALYZING DATA SETS (43+ HOURS)

Now that you have some background in statistics and knowledge of tools, it’s time to get some practice with analyzing data sets. MIT’s course “The Analytics Edge” on edX is arguably the best course on the subject. The material in this course particularly exemplifies the day-to-day work of a data analyst. We will cover only the first few weeks in this introductory course and leave the more advanced techniques for a follow-on course.

  • Complete Week 1 of MIT’s Analytics Edge on edX. Since you already have some background in probability, statistics and R by now, you can breeze through this section quickly.
  • Warm up with Linear Regression in Week 2 of the edX Analytics Edge course. The background you have from Unit 2 on Probability & Statistics will come in handy here. (A small regression warm-up sketch follows this list.)
  • Complete Week 3 of the edX Analytics Edge course.
  • Trees (8 – 10 Hours): Complete Week 4 of the edX Analytics Edge course.
  • Complete Week 5 of the edX Analytics Edge course.
  • Complete Week 6 of the edX Analytics Edge course. This will be the last section included in this introductory Data Science course, but you are welcome to complete the advanced optimization concepts and the rest of the edX course if you are hungry for more!
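In the spirit of the Week 2 material, here is a small linear-regression warm-up on the built-in faithful data set (my stand-in for the course’s own data files), including the train/test split habit the course encourages:

    # Fit and inspect a simple model
    model <- lm(eruptions ~ waiting, data = faithful)
    summary(model)                             # coefficients and R-squared

    # Hold out a test set and measure out-of-sample error (RMSE)
    set.seed(1)
    idx   <- sample(nrow(faithful), floor(0.7 * nrow(faithful)))
    fit   <- lm(eruptions ~ waiting, data = faithful[idx, ])
    preds <- predict(fit, newdata = faithful[-idx, ])
    sqrt(mean((faithful$eruptions[-idx] - preds)^2))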

5.1 OPTIONAL READING

An Introduction to Statistical Learning (with Applications in R) makes modern statistical and machine learning methods accessible to beginners. The authors give precise, practical explanations of what methods are available and when to use them, including explicit R code. The Elements of Statistical Learning is an advanced and comprehensive text on the subject. Both books have PDF versions online that can be downloaded for free.

6 CAPSTONE PROJECT (10+ HOURS)

If you made it this far, a huge congratulations – it took a lot of perseverance and hard work! Bring it all together with the capstone project and flaunt it on your resume!

The final project is to analyze a set of data, create graphs, and draw conclusions. Pick any data set that interests you, and then use R to analyze it. You can choose a dataset from Quandl, US Government Open Data, the UCI Machine Learning Repository or anywhere else you like. If you are super-charged, you can also try your hand at a Kaggle competition. The only requirement is that the data set includes at least several thousand observations.

Once you choose a dataset that interests you, look through the variables and choose one or more variables that you want to predict based on three or more other variables. Download the dataset in CSV, JSON, TXT, or XML format. Answer your research question(s) by going through the complete research process (a code skeleton of these steps follows the list below):

1. Formulate research question(s)

2. Design and implement a study (or in our case, acquire study data)

3. Examine the data for problems

4. Carry out preliminary and then formal analyses

5. Interpret and present results

  • Complete & Submit Capstone Project (10 – 15 Hours)
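If it helps to see the five steps as code, here is a hedged end-to-end skeleton; the file name, outcome, and predictor names are placeholders for your chosen data set:

    data <- read.csv("your_dataset.csv")   # step 2: acquire the data

    str(data)                              # step 3: examine the data for problems
    summary(data)
    colSums(is.na(data))                   # missing values per column

    # Step 4: preliminary, then formal analysis (replace the placeholders)
    # plot(data$predictor1, data$outcome)
    # fit <- lm(outcome ~ predictor1 + predictor2 + predictor3, data = data)
    # summary(fit)

    # Step 5: interpret and present: save plots and write up your conclusions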

7 REFERENCES & ADDITIONAL RESOURCES

With all the skills learned in this course, you are a big step closer to a Data Science career. If you would like to consider it seriously, read Analyzing the Analyzers – a short e-book that gives a great overview of data science roles, skill sets, and career paths.

Websites to follow:

  • DataTau – A Hacker News for Data Scientists!
  • KDnuggets – News, tools and jobs for the Data Science community
  • FiveThirtyEight – Nate Silver’s analyses on Politics, Economics, Science, Life, Sports.
  • Data Science 101 – Resources for learning to be a Data Scientist
  • R-Bloggers – R news and tutorials contributed by 500+ R bloggers


8 CAREER RESOURCES

  1. Preparing for interviews: Find some great questions for practice here and here
  2. Interviews with data scientists
  3. Building data products
  4. Design Thinking for Data Scientists
  5. Advice for strengthening your portfolio

