Challenges Set 6

Instructions

Use the starwars dataset from the dplyr package (already loaded with tidyverse) to complete the challenges below. I highly recommend that you first get to know the starwars dataset by trying the “Get to know your data” functions covered at the beginning of the Week 2 Starter file. Good luck and have fun completing them!

Challenge 1:

Create a recipe named recipe_c1 by following the below steps:

  • Specify mass as the outcome variable and height and species as predictors in the formula.

  • Apply a log transformation to mass using base 10.

  • Normalize the height column so that it has mean 0 and standard deviation 1.

  • Impute missing values inside the species column.

  • Transform the species variable by grouping any levels with fewer than 2% of observations into a category labeled “other”.

  • Encode the new species column as dummy variables.

  • Use prep and bake to examine the resulting preprocessed dataset.
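
As a reminder of the general pattern (define a recipe, chain steps, then prep and bake), here is a minimal sketch on a hypothetical toy data frame rather than starwars. The column names y, x and g are made up for illustration, and mode imputation via step_impute_mode is an assumption, since several imputation steps exist.

```r
library(recipes)

# Hypothetical toy data (NOT the starwars dataset); one missing category in g
toy <- data.frame(
  y = c(10, 100, 1000, 10, 100, 1000),
  x = c(1, 2, 3, 4, 5, 6),
  g = factor(c("a", "a", "a", "b", "b", NA))
)

toy_recipe <- recipe(y ~ x + g, data = toy) |>
  step_log(y, base = 10) |>          # log10-transform the outcome
  step_normalize(x) |>               # center to mean 0, scale to sd 1
  step_impute_mode(g) |>             # assumption: fill NA with the most common level
  step_other(g, threshold = 0.02) |> # pool levels below 2% into "other"
  step_dummy(g)                      # encode g as dummy columns

# prep() estimates the steps; bake(new_data = NULL) returns the processed training data
baked <- toy_recipe |> prep() |> bake(new_data = NULL)
baked
```

In this toy example no level falls below the threshold, so step_other leaves g unchanged. Steps run in the order they are written, which is why reordering them can change the result.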

Important

How many species are left? How many were available originally? Moreover, will you get the same results if you change the order of the preprocessing steps related to the species column?

Challenge 2:

Build a linear regression model:

  • Use parsnip to create a linear regression model and name it “linear_mod”.

  • Set the model engine to “lm”.

  • Specify the mode as regression.

  • Display the model object.
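
A parsnip model specification is built by chaining the model type, the engine and the mode. The sketch below shows that shape; the object name example_spec is hypothetical, so use the name the challenge asks for.

```r
library(parsnip)

# Model type -> engine -> mode; nothing is fitted at this point
example_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

# Printing the object displays the specification
example_spec
```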

Challenge 3:

Create a workflow to complete the following:

  • Use the recipe created in Challenge 1.

  • Add the linear regression model from Challenge 2.

  • Fit the workflow to the starwars dataset.

  • Display the results of the fitted workflow in a tidy format.

Caution

Remove the prep and bake steps from the recipe in Challenge 1 and recreate the object by running the code without them. The workflow ensures that all preprocessing steps are applied before the model is fitted to the preprocessed starwars dataset, so those steps are not needed when using workflows.
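
To illustrate how a workflow bundles an unprepped recipe with a model, here is a minimal sketch on hypothetical toy data. The names toy, toy_rec, toy_spec and toy_fit are made up for illustration, and the example assumes the broom package is installed so that tidy() can summarize the underlying lm fit.

```r
library(recipes)
library(parsnip)
library(workflows)
library(broom)

# Hypothetical toy data (NOT the starwars dataset)
toy <- data.frame(
  y = c(1.1, 1.9, 3.2, 3.9, 5.1, 6.0),
  x = c(1, 2, 3, 4, 5, 6)
)

# Note: no prep() or bake() here; the workflow handles both when fitting
toy_rec <- recipe(y ~ x, data = toy) |>
  step_normalize(x)

toy_spec <- linear_reg() |>
  set_engine("lm")

toy_fit <- workflow() |>
  add_recipe(toy_rec) |>
  add_model(toy_spec) |>
  fit(data = toy)

# One row per model term, with estimate, std.error, statistic and p.value
coefs <- tidy(toy_fit)
coefs
```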

Challenge 4:

Create a recipe named recipe_c4 by following the below steps:

  • Specify species as the outcome variable, with height, mass, skin_color, eye_color and gender as predictors.

  • Impute missing values in all numerical columns with the average.

  • Impute missing values in all nominal columns with the mode.

  • Create a new variable “height_m” equal to height / 100.

  • Filter the dataset to include only characters with mass less than 500 (hint: step_filter).

  • Normalize height_m to force average to 0 and sd to 1.

  • Convert all categorical variables into dummy variables.

  • Use prep and bake to check your transformed data.
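
The selector helpers and row/column steps used above can be sketched on hypothetical toy data as follows. The column names cls, len_cm, wt and col are made up for illustration and do not come from starwars.

```r
library(recipes)

# Hypothetical toy data (NOT the starwars dataset) with a few missing values
toy <- data.frame(
  cls    = factor(c("p", "q", "p", "q", "p", "q")),
  len_cm = c(150, NA, 170, 180, 190, 200),
  wt     = c(50, 60, NA, 80, 90, 900),
  col    = factor(c("red", "blue", NA, "red", "blue", "red"))
)

toy_rec <- recipe(cls ~ len_cm + wt + col, data = toy) |>
  step_impute_mean(all_numeric_predictors()) |>  # numeric NAs -> column mean
  step_impute_mode(all_nominal_predictors()) |>  # nominal NAs -> most common level
  step_mutate(len_m = len_cm / 100) |>           # derive a new column
  step_filter(wt < 500) |>                       # drop rows that fail the condition
  step_normalize(len_m) |>                       # mean 0, sd 1
  step_dummy(all_nominal_predictors())           # dummy-encode nominal predictors

baked <- toy_rec |> prep() |> bake(new_data = NULL)
baked
```

Here the outcome cls is left untouched because the selectors only match predictors, and the row with wt = 900 is removed before len_m is normalized.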

Important

What do you notice after completing the above steps? Are all the transformations necessary or useful? How many columns are available in your preprocessed dataset? How many were available before?

Challenge 5:

Build a decision tree model:

  • Use parsnip to create a decision tree model and name it “decision_tree_model”.

  • Set the model engine to “rpart”.

  • Specify the mode as classification.

  • Display the model object.
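
The specification follows the same chain as for any other parsnip model; only the model function, engine and mode change. A minimal sketch, with the hypothetical object name example_tree_spec:

```r
library(parsnip)

# decision_tree() supports more than one mode, so the mode is set explicitly
example_tree_spec <- decision_tree() |>
  set_engine("rpart") |>
  set_mode("classification")

example_tree_spec
```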

Challenge 6:

Create a workflow to complete the following:

  • Use the recipe created in Challenge 4.

  • Add the decision tree model from Challenge 5.

  • Fit the workflow to the starwars dataset.

  • Display the results of the fitted workflow in a tidy format.

Caution

Remove the prep and bake steps from the recipe in Challenge 4 and recreate the object by running the code without them. The workflow ensures that all preprocessing steps are applied before the model is fitted to the preprocessed starwars dataset, so those steps are not needed when using workflows.

Challenge 7:

Create a correlation matrix object named “starwars_corr_matrix”. Use all the numerical columns to show the correlation between the variables, and write the code to visualize the correlation matrix with a chart.
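
One possible approach, sketched on hypothetical toy data with base R only: keep the numeric columns, compute pairwise correlations, and draw the matrix as a heatmap. The base heatmap() call is just one option and an assumption here; packages covered in class may offer nicer correlation plots.

```r
# Hypothetical toy data (NOT the starwars dataset)
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 4, 6, 8, 10),
  c = c(5, 3, 4, 1, 2),
  label = letters[1:5]
)

# Keep only the numeric columns
toy_numeric <- toy[sapply(toy, is.numeric)]

# Pairwise-complete observations so that missing values do not produce NA correlations
toy_corr <- cor(toy_numeric, use = "pairwise.complete.obs")
toy_corr

# Draw the matrix without clustering or rescaling
heatmap(toy_corr, Rowv = NA, symm = TRUE, scale = "none")
```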

Caution

What do you notice? Try to interpret the correlation matrix plot.

Challenge 8:

Create a data splitting object named “starwars_split”. Make sure that you use 007 as your seed and that your split allocates 90% of the data to training and 10% to testing. Finally, create a “starwars_train” and a “starwars_test” set.
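
The usual rsample pattern is to set a seed, create the split object, and then extract the two sets from it. A minimal sketch on a hypothetical 100-row data frame, with an arbitrary seed of 123 chosen only for this illustration:

```r
library(rsample)

# Hypothetical toy data (NOT the starwars dataset)
toy <- data.frame(id = 1:100, value = (1:100) / 10)

set.seed(123)  # arbitrary seed for this sketch; makes the random split reproducible
toy_split <- initial_split(toy, prop = 0.9)

toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

nrow(toy_train)
nrow(toy_test)
```

The seed must be set immediately before the split is created, otherwise the rows assigned to each set can differ between runs.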

Caution

How many observations do you have in your train set? How many do you have in your test set?

🛑 Don’t Click Submit Just Yet 🚧

Please read carefully the below information:

  • Once you have completed all the coding challenges and you’re confident in your work, copy and paste your responses from the chunk into the form fields below each challenge.

  • You are responsible for correctly copying and pasting only the code required to solve each challenge. We will grade only what you have submitted!

  • We will only grade 1 submission per student so do not click Submit until you are confident in your responses.

  • By submitting this form you are certifying that you have followed the academic integrity guidelines available in the syllabus. The code and answers submitted are the results of your work and your work only!

  • Make sure you have completed all the challenges and included all the required personal information (e.g., full name, email, zid) in the respective form’s fields. If you don’t know/want to complete a challenge just leave the field below it empty.

  • Now you are ready to click the above “Submit” button. Congrats, you have completed this set of challenges!