Machine Learning Course

With all the hype around it, machine learning seems like the 'next big thing' that may revolutionise our world. With increased computational power and more and more data being collected every day, machine learning algorithms that originated in the 1970s and 1980s can finally be put to use. In this course, we will look at how we can use large datasets to create a simple machine learning model that helps us predict things.

Installation Instructions

Installing pip: please scroll down to the instructions for your operating system.

Run "pip install pandas scikit-learn" in your terminal (PowerShell on Windows, Terminal on macOS and if you are using another operating system you probably know what you are doing).

Download this file (student_performance.csv) and place it in your working directory.

Preliminary Data Processing

To import and process our data, we will be using pandas, a very comprehensive library that can do many different things; we will only use it for its dataframes. For our purposes, a dataframe is essentially a two-dimensional array. Let's import it first.

import pandas as pd
                    

Now, let's import the dataset and save it as a dataframe variable. The dataset we will be using is a fictional one of student profiles and their grades; we are using it because it is complete and makes a good starting point for learning about machine learning.

dataframe = pd.read_csv(r'student_performance.csv')
print(dataframe)
                    

We can retrieve individual columns by indexing them:

print(dataframe['gender'])
                    

As a quick exercise, try to find the number of students who are female.
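
If you get stuck, here is one possible way to do it (comparing the column with 'female' gives a column of True/False values, and True counts as 1 when summed):

# Count the entries in the gender column that equal 'female'.
print((dataframe['gender'] == 'female').sum())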

For our machine learning model, we cannot use qualitative data. The computer simply is not smart enough to understand words such as 'male' and 'female'. Your course instructor should now explain how we can turn these types of data into quantitative data.
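
As a small illustration of the idea (this sketch uses the gender and race/ethnicity columns of our dataset and shows only one possible way to encode them):

# A binary category can be mapped straight to numbers.
gender_encoded = dataframe['gender'].map({'male': 0, 'female': 1})

# A category with several unordered values ('group A' to 'group E') is usually
# "one-hot" encoded: one 0/1 column per possible value.
ethnicity_encoded = pd.get_dummies(dataframe['race/ethnicity'])

print(gender_encoded.head())
print(ethnicity_encoded.head())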

Now, let's create a new dataframe in order to store the processed data:

columns = [
  'gender',
  'A',
  'B',
  'C',
  'D',
  'E',
  'education',
  'lunch',
  'preparation',
  'score'
]

df = pd.DataFrame(0, index = list(range(1000)), columns = columns)
                    

As an exercise, attempt to process the columns yourself. For example, you can set the new gender column like this:

for index, gender in enumerate(dataframe['gender']):
  # Set the new gender column to 1 for female and 0 for male.
  if gender == 'female':
    df.loc[index, 'gender'] = 1
  else:
    df.loc[index, 'gender'] = 0
                    

Let's check our new dataframe "df" to see if we have done it correctly:

print(df)
                    

Now, attempt to process the other columns yourself! You can find two examples of solutions at the bottom of this page.

Once you have finished processing your data, we can save it as a CSV file like this:

df.to_csv('df.csv')
                    
Creating Machine Learning Models

Because we are now working in a new file, we need to import the pandas library again and load the data we saved as "df.csv":

import pandas as pd

df = pd.read_csv(r'df.csv')
                    

As mentioned previously, machine learning relies on training a model on a dataset. A dataset is split into features (our inputs) and results (what we want to predict). We will save the features in a dataframe called X and the results in one called y.

Let's just check what df is like again:

print(df)
                    

Okay, so we have a new column called 'Unnamed: 0'. This appeared because when we saved our pandas dataframe, the row indices were written to the file as well. We do not actually need these indices, so we can drop this column when splitting our dataframe into X and y. We can use drop() to drop the column; specifying axis = 1 means that we want to drop columns, not rows:

X = df.drop(['Unnamed: 0', 'score'], axis = 1)
y = df['score']
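
As a side note, you can avoid the extra index column entirely; here is a small sketch of two alternatives, both using standard pandas keyword arguments:

# Option 1: treat the first column of the file as the index when reading.
df = pd.read_csv(r'df.csv', index_col=0)

# Option 2: do not write the index out in the first place.
df.to_csv('df.csv', index=False)

With either option, you would only need to drop the 'score' column above.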
                    

We will be using the scikit-learn library for everything machine learning related. First, we need to import the functions we will be using from this library:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
                    

Now let's split X and y into training and testing datasets. Your course instructor should explain the difference between these two.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
                    

Now, let's create our linear regression model and train it on our training data.

model = LinearRegression()
model.fit(X_train, y_train)
                    

In order to predict the test score from features, we can use model.predict():

y_predict = model.predict(X_test)
                    

There are many different ways we can see how good our predictions are. Try to find the average difference between the predicted and actual values. Scroll to the bottom of the page for a solution.
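
If you want to sanity-check your answer, scikit-learn also provides ready-made metrics; for example, the mean absolute error, which is an alternative to computing the average difference by hand:

from sklearn.metrics import mean_absolute_error

# Average absolute difference between the actual and predicted total scores.
print(mean_absolute_error(y_test, y_predict))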

We can also predict a single result. Because model.predict() only takes two-dimensional data structures as input, we need to use double square brackets when we are predicting the result for a single entry:

print(model.predict([[0,0,0,1,0,0,4,0,1]]))
                    

So our model is pretty good, consistently predicting within 30 marks of the actual score. To make sure that we do not have to retrain our model every time we run our program, let's save our model.

import pickle
filename = 'model'
with open(filename, 'wb') as handle:
  pickle.dump(model, handle)                        
                    

This saves our model in a file called "model" in the current directory. We can import the model back like this:

with open(filename, 'rb') as handle:
  model_from_file = pickle.load(handle)

print(model_from_file.predict([[0,0,0,1,0,0,4,0,1]])) # This should be the same as your previous result.
                    

Now you can load your model from any of your other files. Scroll to the bottom of the page for Exercise 3.

Exercise 1

Here are two solutions for the data processing exercise. The first processes the dataframe row by row; the second uses vectorised pandas operations.

education_values = {
  'high school': 1,
  'some high school': 1,
  'associate\'s degree': 2,
  'some college': 3,
  'bachelor\'s degree': 3,
  'master\'s degree': 4
}
for i in range(1000):
  df.loc[i, 'gender'] = int(dataframe['gender'][i] == 'female')
  # The last character of e.g. 'group B' picks the matching one-hot column.
  df.loc[i, dataframe['race/ethnicity'][i][-1]] = 1
  df.loc[i, 'education'] = education_values[dataframe['parental level of education'][i]]
  df.loc[i, 'lunch'] = int(dataframe['lunch'][i] == 'standard')
  df.loc[i, 'preparation'] = int(dataframe['test preparation course'][i] == 'completed')
  df.loc[i, 'score'] = dataframe['math score'][i] + dataframe['reading score'][i] + dataframe['writing score'][i]
df.to_csv('df.csv')
                    
education_values = {
  'high school': 1,
  'some high school': 1,
  'associate\'s degree': 2,
  'some college': 3,
  'bachelor\'s degree': 3,
  'master\'s degree': 4
}
ethnicity_values = {
  'group A': 0,
  'group B': 1,
  'group C': 2,
  'group D': 3,
  'group E': 4
}
df['gender'] = dataframe['gender'].replace({'male': 0, 'female': 1})
ethnicity_columns = pd.get_dummies(dataframe['race/ethnicity'].replace(ethnicity_values))
for i in range(5):
  df['ABCDE'[i]] = ethnicity_columns[i]
df['education'] = dataframe['parental level of education'].replace(education_values)
df['lunch'] = dataframe['lunch'].replace({'free/reduced': 0, 'standard': 1})
df['preparation'] = dataframe['test preparation course'].replace({'none': 0, 'completed': 1})
df['score'] = dataframe['math score'] + dataframe['reading score'] + dataframe['writing score']
df.to_csv('df.csv')
                    
Exercise 2

Here is the solution for finding the average difference:

print(sum(abs(y_predict - y_test)) / len(y_predict))
                    
Exercise 3

Try to build a console application that lets people enter their information and then predicts their total score.
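
One possible approach, as a minimal sketch (it assumes the model was saved to a file called 'model' as shown above, and that the features are entered in the same order as the columns of X: gender, A, B, C, D, E, education, lunch, preparation):

import pickle

# Load the trained model saved earlier.
with open('model', 'rb') as handle:
  model = pickle.load(handle)

# Ask the user for each feature, in the same order as the columns of X.
gender = int(input('Gender (female = 1, male = 0): '))
group = input('Race/ethnicity group (A-E): ').strip().upper()
ethnicity = [int(group == letter) for letter in 'ABCDE']
education = int(input('Parental education level (1-4): '))
lunch = int(input('Lunch (standard = 1, free/reduced = 0): '))
preparation = int(input('Test preparation course (completed = 1, none = 0): '))

features = [[gender] + ethnicity + [education, lunch, preparation]]
print('Predicted total score:', model.predict(features)[0])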