Polynomial Regression — Machine Learning Works (2024)

Written By Anirudh Pai

What is Polynomial Regression?

My last tutorial discussed multiple linear regression, an algorithm that can find a linear relationship between several independent variables and one dependent variable.

But what if we want to be able to identify more complex correlations within data?

One algorithm that we could use is called polynomial regression, which can identify polynomial correlations with several independent variables up to a certain degree n.

In this article, we’re first going to discuss the intuition behind polynomial regression and then move on to its implementation in Python via libraries like Scikit-Learn and Numpy.

How Does Polynomial Regression Work?

Polynomial Transformation

Before we dive into the equation of polynomial regression, let’s first discuss how this algorithm transforms the dataset we provide up to a user-specified degree n.

To understand this, let’s take a look at this sample dataset:

[Image: a sample dataset with an independent variable column x and a dependent variable column y]

In the first column x, we have values representing the independent variables, while in the second column y, we have values representing the dependent variables.

If we were creating a linear regression algorithm, the data would be inputted into the algorithm as-is, and a linear relationship would be analyzed.

To find a polynomial correlation, however, our algorithm will create new columns to scale x up to degree n. Don’t worry if that’s confusing, because we can visualize polynomial transformation to better understand it.

[Image: the dataset after polynomial transformation, with new columns containing x raised to powers 2 through n]

As we can see from the array above, new columns are created in which the values of x are raised to successively higher powers until the column of degree n is reached.

If we were to transform the dataset to degree 4, for example, we would have 3 new columns: x^2, x^3, and x^4.
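To see what this looks like in code, here is a minimal sketch of the transformation for a single feature column, using made-up values and plain Numpy (the implementation section later uses Scikit-Learn's PolynomialFeatures class to do this for us):

import numpy as np

x = np.array([[1.0], [2.0], [3.0]])  # one feature column with hypothetical values
degree = 4

# Build the columns x^1, x^2, x^3, x^4 side by side
x_poly = np.hstack([x ** p for p in range(1, degree + 1)])
print(x_poly)
# [[ 1.  1.  1.  1.]
#  [ 2.  4.  8. 16.]
#  [ 3.  9. 27. 81.]]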

This process is repeated for each independent variable originally provided in the dataset. However, there’s a slight twist: not only will there be columns for each variable raised to powers up to degree n, but there will also be a column for every product of features whose total degree is less than or equal to n.

For example, if a dataset with two independent variables, x_1 and x_2, were to get transformed to the third degree, the combination x_1^2 * x_2 would be included since its total degree is equal to 3; on the other hand, the combination x_1^2 * x_2^2 would not be included since its total degree is 4, which is greater than 3.
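If we want to see exactly which columns get generated, we can ask Scikit-Learn's PolynomialFeatures class (which we will meet again in the implementation section) to list them. This is a small illustration with two made-up feature columns; the get_feature_names_out() method assumes a recent version of Scikit-Learn:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two hypothetical features, x_1 and x_2
X = np.array([[2.0, 3.0],
              [4.0, 5.0]])

poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(X)

# Lists the generated terms: x0, x1, x0^2, x0 x1, x1^2, x0^3, x0^2 x1, x0 x1^2, x1^3
print(poly.get_feature_names_out())
print(X_poly.shape)  # (2, 9): every term with total degree <= 3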

Now that we’ve covered the basics of the polynomial transformation of datasets, let’s talk about the intuition behind the equation of polynomial regression.

Model Representation

Much like the linear regression algorithms discussed in previous articles, a polynomial regressor tries to find an equation that it believes best represents the given data.

Unsurprisingly, the equation of a polynomial regression algorithm can be modeled by an (almost) regular polynomial equation. Let’s dive into it and break down each part of the equation.

[Image: the general equation of polynomial regression]

Let’s talk about each variable in the equation:

  • y represents the dependent variable (output value)

  • b_0 represents the y-intercept (the constant term) of the polynomial function

  • b_1 through b_((d+c) C d) represent the parameter values that our model will tune

  • d represents the degree of the polynomial being tuned

  • c represents the number of independent variables in the dataset before polynomial transformation

  • x_1 - x_c are the independent variables in the dataset

  • p_i is the i’th product of a pair of features with a total degree less than or equal to d

  • (d+c) C d, the binomial coefficient "d+c choose d", is the number of such product terms with a total degree less than or equal to d

As we can see, the equation incorporates the polynomial transformation results that we discussed in the previous section. The parameter values (b_0 - b_n) will be tuned by our polynomial regression algorithm such that we have a complete equation of a curve of best fit. This curve will be one that best represents the data being given.
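To make this concrete, here is what the equation looks like in a simple case (an illustration, not a reproduction of the figure above): with one independent variable x and degree d = 2, the model reduces to

y = b_0 + b_1*x + b_2*x^2

With more variables or a higher degree, the same pattern simply gains the extra power and product terms described in the previous section.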

If there is more than one independent variable, the fitted model becomes a surface in a higher-dimensional space rather than a simple curve.

If we only have one independent variable, however, we will have a simple two-dimensional graph like the one below.

[Image: a polynomial regression curve fitted to data with one independent variable]

Now that we know what our polynomial regression equation will look like, let’s discuss how our algorithm will create such an equation.

The Cost Function

In order to finalize a polynomial equation of the form discussed in the previous section, our model will need to be able to determine how well an equation represents the data given.

In order to do this, a polynomial regressor will use what is called the Mean Squared Error (MSE) cost function, a mathematical formula that returns a numerical value representing the error of our model.
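In its most common form, for m training examples the cost is

MSE = (1/m) * Σ_(i=1..m) (y_i − y_hat_i)^2

where y_i is the actual value of the i’th example and y_hat_i is the value our model predicts for it.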

This cost function is a little complex, so I wrote an article dedicated to explaining it. Please make sure to check it out right away, as MSE is a large part of polynomial regression.

By using MSE, our model will be able to determine which parameter values create a better representation of the data than others.

But how do machine learning algorithms converge upon optimal parameter values in the first place?

Gradient Descent

This is where gradient descent, another complex mathematical process, comes into play. Due to gradient descent’s complexity, I have written another article dedicated to explaining the math behind it.

I highly suggest that you read the article before continuing, as gradient descent, although a little complicated, is a very important part of polynomial regression.

After our regressor completes the gradient descent process, it will have reached optimal parameter values that best minimize the MSE cost function discussed in the previous section.
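To give a feel for what that process looks like, here is a minimal sketch of gradient descent tuning the parameters of a degree-2 polynomial model. Everything in it is an assumption made for illustration (a single feature, made-up data points, a hand-picked learning rate and iteration count); it is not the code our Scikit-Learn regressor runs later, since LinearRegression solves the least-squares problem directly.

import numpy as np

# Hypothetical training data: one feature with a roughly quadratic relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 6.3, 12.2, 20.4, 30.1])

# Polynomial features up to degree 2, plus a column of 1's for the intercept b_0
X = np.column_stack([np.ones_like(x), x, x ** 2])
b = np.zeros(X.shape[1])  # parameters b_0, b_1, b_2, all starting at 0
alpha = 0.003             # learning rate (chosen by hand for this example)
m = len(y)

for _ in range(100_000):
    predictions = X @ b
    error = predictions - y
    gradient = (2 / m) * (X.T @ error)  # gradient of MSE with respect to b
    b -= alpha * gradient

print(b)  # approaches the parameter values that minimize MSE

Each iteration nudges every parameter slightly downhill on the MSE surface; with a suitable learning rate, the values converge toward the curve of best fit described above.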

Predicting

Let’s say that our model was trained on a dataset with two variables to the second degree. That would mean that its regression equation would be in the form:

[Image: the regression equation for two variables at degree 2, containing the terms x_1, x_2, x_1^2, x_1*x_2, and x_2^2 with parameters b_0 through b_5]

The parameter values b_0 through b_5 would be calculated by the regressor with gradient descent, but for the sake of this example, let’s assign random values.

We now have the following equation:

[Image: the same equation with example numerical values substituted for the parameters b_0 through b_5]

To predict new values, our regressor simply needs to plug in the values of the first and second independent variable into x_1 and x_2, respectively.

For example, if the value of the first independent variable was 2 and the value of the second was 4, the following values would be plugged in:

  • x_1: 2

  • x_2: 4

[Image: the equation with x_1 = 2 and x_2 = 4 plugged in]

Now we just simplify the above equation so that we get the value of y. This value will be the predicted value of the regression model.

[Image: the simplified equation, which evaluates to y = 90.4]

Great! So our regressor will output 90.4 as the predicted value.

Now that we know how our polynomial regression model works, let’s implement it in Python.

Implementation of Polynomial Regression in Python

Library Installation

Now that we’ve talked about the intuition behind polynomial regression, it’s time to implement the model in code.

Note: The dataset used in this article was downloaded from superdatascience.com. For convenience, all the code and data for this section of the article can be found here.

Before we do this, however, we must install four important libraries: Scikit-Learn, Pandas, Numpy, and Matplotlib.

  • Scikit-Learn is a machine learning library that provides machine learning algorithms to perform regression, classification, clustering, and more.

  • Pandas is a Python library that helps in data manipulation and analysis, and it offers data structures that are needed in machine learning.

  • Numpy is another library that makes it easy to work with arrays. It provides several unique functions that will help in data preprocessing.

  • Matplotlib is a graphing library that will help us visualize our regressor’s curve on a graph with the data scatterplot.

Fortunately, these libraries can be quickly installed by using Pip, Python’s default package-management system. All we have to do is enter the following lines into the terminal:

pip3 install numpy
pip3 install pandas
pip3 install scikit-learn
pip3 install matplotlib

After this is complete, we can begin coding our algorithm in Python!

Step 1: Importing the Data

As always, we must begin by importing Numpy, Pandas, and Matplotlib, the main libraries we will be using for this regression model.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Now, we must import the dataset by using the read_csv() function from the Pandas library. This function will take in the .csv file and convert it to a Pandas dataframe.

dataset = pd.read_csv('Position_Salaries.csv')

Now we need to split our dataframe up into Numpy arrays: we need one array containing the independent variable(s) and another containing the dependent variable. To do this, we must take a look at our dataset.

[Image: the first rows of the Position_Salaries.csv dataset, with Position, Level, and Salary columns]

We can see that there are three columns: position, level, and salary. Our task with this data is to predict an employee’s salary given their position.

If we pay close attention to the first two columns, we’ll see that there is a one-to-one correspondence between level and position: every position value corresponds to a unique level value. Thus, we can omit the position column and just input level into our regression model. The dependent variable is the salary, since the values within this column are what our regressor needs to predict.
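If we want to confirm that correspondence in code, a quick check (an extra step, not part of the original workflow, and assuming the column headers are Position and Level as shown above) is to count how many distinct levels each position maps to:

# Each position should map to exactly one level if the correspondence is one-to-one
print(dataset[['Position', 'Level']].head())
print(dataset.groupby('Position')['Level'].nunique())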

Now, we must split up our dataset into an independent variable Numpy array called x and a dependent variable Numpy array called y. Note that we slice the level column as 1:2 rather than 1 so that x stays a two-dimensional array, which is the shape Scikit-Learn expects for input features.

x = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values

Step 2: Data Preprocessing

As with any other machine learning model, a polynomial regressor requires input data to be preprocessed, or “cleaned”.

As always, we must now split these two arrays into training and testing data subsets so that we can accurately test our regression model after training it. Since our dataset is quite small, we will use a test_size of 0.2 when creating our subsets.

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

Now, we can apply feature scaling to our input and output data in order to prepare it for training. Feature scaling centers the data around 0 and puts all features on a comparable scale, which accelerates the convergence of gradient-descent-based training. (The LinearRegression class we use later solves the least-squares problem directly rather than by gradient descent, so we will store the scaled versions in separate variables and keep the unscaled arrays for the remaining steps.)

To scale our data, we can use Scikit-Learn’s StandardScaler class; more specifically, we can use the .fit_transform method on our training data and the .transform method on our test data.

First, we will apply standard scaling on our input training and test sets as shown below. We first create an instance of the StandardScaler class called sc_x, and then we apply the necessary methods to transform our data.

from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train_scaled = sc_x.fit_transform(x_train)
x_test_scaled = sc_x.transform(x_test)

Now, all we have to do is implement the same steps for our dependent variable datasets. We call this instance of the StandardScaler class sc_y. Because StandardScaler expects two-dimensional input, we reshape the one-dimensional y arrays before transforming them.

sc_y = StandardScaler()
y_train_scaled = sc_y.fit_transform(y_train.reshape(-1, 1)).ravel()
y_test_scaled = sc_y.transform(y_test.reshape(-1, 1)).ravel()

Finally, we must polynomially transform our dataset by using the PolynomialFeatures class provided by Scikit-Learn. Since we don’t know the optimal degree to transform our dataset to, we can choose 3 as a reasonable starting value.

To implement this, we must first instantiate the PolynomialFeatures class and then use the .fit_transform and .transform methods to transform the input datasets. This is shown below.

from sklearn.preprocessing import PolynomialFeatures
poly_transform = PolynomialFeatures(degree=3, include_bias=False)
x_poly_train = poly_transform.fit_transform(x_train)
x_poly_test = poly_transform.transform(x_test)

The include_bias parameter determines whether PolynomialFeatures will add a column of 1’s to the front of the dataset to represent the y-intercept parameter value for our regression equation.

Since the LinearRegression class we will use to create a polynomial model will add this column of 1’s for us, we set include_bias to False to avoid duplicate columns.
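If you’re curious which columns were generated, recent versions of Scikit-Learn can list them for us (an optional check, not required for the rest of the walkthrough):

# Lists the generated terms; with one input feature and degree 3 this prints ['x0', 'x0^2', 'x0^3']
print(poly_transform.get_feature_names_out())
print(x_poly_train.shape)  # (8, 3): 8 training rows, 3 polynomial columns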

Now that we’ve finished data preprocessing, we can finally move on to the training and testing of our actual polynomial regression model.

Step 3: Creating and Training the Regressor

Since we have already polynomially transformed our dataset, we can just apply the LinearRegression class from Scikit-Learn to create a polynomial model. To do this, we must first instantiate the class and then apply the .fit() method passing in our training data as arguments.

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_poly_train, y_train)

We have successfully trained our polynomial regression model! We can now use this model to make predictions based on input values of our choosing. First, however, let’s visualize the model by graphing its predictions on both the training and test datasets on a scatterplot of the actual data points.
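Before we move on, we can also connect this back to the equation from earlier by inspecting the parameter values the regressor settled on; the intercept plays the role of b_0 and the coefficients correspond to the remaining b values (the exact numbers depend on the data and are omitted here):

# b_0 (the intercept) and the coefficients for the x, x^2 and x^3 columns
print(regressor.intercept_)
print(regressor.coef_)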

Step 4: Visualizing the Regressor’s Curve

To visualize our regressor’s curve, we can use the Matplotlib library we imported at the beginning of this article. First, however, we must create a dataset with smaller independent variable increments for the sole purpose of graphing a smooth curve. If we take a look at our current dataset below:

[Image: the dataset, whose Level column increases in increments of 1]

We see that the independent variables that we are using, contained in the Level column, have increments of 1 between them. Unfortunately, if we use these independent variables to predict with our model, we won’t be able to create a smooth curve. So, we must create two datasets—one for the training data and one for the test data—that contain independent variable values with a smaller increment.

To do this, we can use the arange() function from the Numpy library as shown below. We are essentially creating new datasets that contain values between the minimum and maximum of the original datasets with an increment of 0.01. We add 0.01 to the maximum values in x_train and x_test because the arange() function does not include the maximum bound itself.

x_grid_train = np.arange(x_train.min(), x_train.max() + 0.01, 0.01)
x_grid_test = np.arange(x_test.min(), x_test.max() + 0.01, 0.01)

In order for x_grid_train and x_grid_test to serve as proper inputs for our regressor, we must reshape them to proper two-dimensional arrays as shown below.

x_grid_train = x_grid_train.reshape(len(x_grid_train), 1)
x_grid_test = x_grid_test.reshape(len(x_grid_test), 1)

Great! We can finally begin to visualize our model by using Matplotlib. We’ll break down this process by walking through the graphing of our training data. We must first create a scatterplot containing the x and y-values of our training dataset. We’ll make the data points red and give them a label so that we can create a key for our graph.

plt.scatter(x_train, y_train, color='red', label='Training Data Points')

Now we must graph the curve that represents our model’s predictions of the training dataset. The x-values of the curve will be those in x_grid_train and the y-values will be the model’s output given a polynomially transformed version of x_grid_train. We’ll make the curve blue and give it a different label for the key.

plt.plot(x_grid_train, regressor.predict(poly_transform.transform(x_grid_train)), color='blue', label='Model Curve')

Now, the main aspects of our graph are complete: we just need to add labels for our graph and create a legend.

plt.title('Job Level vs Salary (Training Dataset)')
plt.xlabel('Job Level (1 - 10)')
plt.ylabel('Salary')
plt.legend()
plt.show()

Before we take a look at the visualization, let’s create another graph for the test data. Fortunately, the steps are exactly the same as those for creating the training data graph.

plt.scatter(x_test, y_test, color='red', label='Test Data Points')
plt.plot(x_grid_test, regressor.predict(poly_transform.transform(x_grid_test)), color='blue', label='Model Curve')
plt.title('Job Level vs Salary (Test Dataset)')
plt.xlabel('Job Level (1 - 10)')
plt.ylabel('Salary')
plt.legend()
plt.show()

Now we can view our graphs!

[Image: the regressor’s curve plotted over the training data scatterplot]

[Image: the regressor’s curve plotted over the test data scatterplot]

As we can see, our model’s curve matches up quite closely with the points in both the training and test datasets. This means that our choice to polynomially transform our dataset to the third degree was a good one.
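Beyond eyeballing the curves, we can also put a number on the fit. For example (an extra check, not part of the original walkthrough), the regressor’s .score() method reports the R² value on the test set; with only two test points this figure should be taken with a grain of salt:

# R^2 on the polynomially transformed test set; 1.0 would be a perfect fit
print(regressor.score(x_poly_test, y_test))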

Step 5: Making a Single Prediction

There’s no point in creating a machine learning model if you don’t use it to predict values that aren’t in the training or test sets. After all, the main purpose of machine learning algorithms is to be beneficial in real-world applications.

To make a prediction with our regressor, we can call the same .predict() method as we did when visualizing the model’s curve. However, instead of inputting a Numpy array like last time, all we need to do is input a double nested list containing the values of our independent variables in the same order as that in the training dataset. The reason we input a double nested list is because Scikit-Learn regressors expect a two-dimensional data structure as input.

When training the regressor, we provided just the Level column as input. In addition, we polynomially transformed the input by using PolynomialFeatures. Thus, we just input a polynomially transformed double nested list into the .predict() function.

prediction = regressor.predict(poly_transform.transform([[11]]))
print(prediction)

By inputting 11 as shown above, we are using our polynomial regressor to predict the salary of an employee with level 11 experience. If we run the above code, we get a prediction of roughly $1,520,293. This seems reasonable, as a level 10 employee had a salary of $1,000,000 in our training dataset.
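If we wanted predictions for several new levels at once (another illustrative example, not from the original walkthrough), we could pass multiple rows in the same two-dimensional structure:

# One row per employee level we want a salary prediction for
new_levels = [[11], [12], [13]]
predictions = regressor.predict(poly_transform.transform(new_levels))
print(predictions)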

Conclusion

In this article, we learned how polynomial regressors work and how they can be implemented through the use of Python libraries such as Scikit-Learn.

I hope that you enjoyed this article; feel free to leave any comments on the article so that I can provide even better content in the future. Stay tuned for my upcoming articles on decision tree regression.

Anirudh Pai
