Written By Anirudh Pai

### What is Polynomial Regression?

My last tutorial discussed multiple linear regression, an algorithm that can find a *linear* relationship between several independent variables and one dependent variable.

*But what if we want to be able to identify more complex correlations within data?*

One algorithm that we could use is called **polynomial regression**, which can identify polynomial correlations with several independent variables up to a certain degree *n*.

In this article, we’re first going to discuss the intuition behind polynomial regression and then move on to its implementation in Python via libraries like Scikit-Learn and Numpy.

### How Does Polynomial Regression Work?

**Polynomial Transformation**

Before we dive into the equation of polynomial regression, let’s first discuss how this regression algorithm expands the dataset we provide up to a user-specified degree *n*.

To understand this, let’s take a look at this sample dataset:

In the first column, *x*, we have the values of the independent variable, while the second column, *y*, holds the values of the dependent variable.

If we were creating a linear regression algorithm, the data would be inputted into the algorithm as-is, and a linear relationship would be analyzed.

To find a polynomial correlation, however, our algorithm will create new columns that raise *x* to successive powers up to degree *n*. Don’t worry if that’s confusing, because we can visualize polynomial transformation to better understand it.

As we can see from the array above, new columns of x are created wherein the values of x are exponentiated to a certain power until the column of degree *n* is reached.

If we were to transform the dataset to degree 4, for example, we would have 3 new columns: x^2, x^3, and x^4.
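As a minimal sketch of this expansion in plain Python (the helper name `expand_to_degree` and the sample values are hypothetical, chosen only for illustration), each value of *x* gets one column per power from 1 up to the chosen degree:

```python
def expand_to_degree(values, degree):
    """Return one row per value: [x, x^2, ..., x^degree]."""
    return [[x ** power for power in range(1, degree + 1)] for x in values]


# Hypothetical sample column of x values.
x = [1, 2, 3]
rows = expand_to_degree(x, degree=4)
for row in rows:
    print(row)
```

Each printed row holds the original value followed by the three new columns x^2, x^3, and x^4.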

This process is repeated for each independent variable originally provided in the dataset. However, there’s a slight twist: in addition to a column for each power of each variable up to degree *n*, there will be a column for every product of powers of different features whose total degree is less than or equal to *n*.

For example, if a dataset with two independent variables, *x_1* and *x_2*, were to get transformed to the third degree, the combination *x_1*^2 · *x_2* would be included since the total degree sum is equal to 3; on the other hand, the combination *x_1*^2 · *x_2*^2 would not be included since the total degree sum is 4, which is greater than 3.
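To make the rule concrete, here is a small sketch that enumerates which terms survive for a given number of features and degree. The helper names `term_exponents` and `describe` are hypothetical, and the ordering of the output is just one convenient choice:

```python
from itertools import product


def term_exponents(n_features, degree):
    """List exponent tuples whose total degree is between 1 and `degree`."""
    combos = product(range(degree + 1), repeat=n_features)
    return sorted(
        (e for e in combos if 1 <= sum(e) <= degree),
        key=lambda e: (sum(e), tuple(-p for p in e)),
    )


def describe(exponents):
    """Render an exponent tuple such as (2, 1) as 'x_1^2 * x_2'."""
    parts = []
    for index, power in enumerate(exponents, start=1):
        if power == 1:
            parts.append(f"x_{index}")
        elif power > 1:
            parts.append(f"x_{index}^{power}")
    return " * ".join(parts)


terms = [describe(e) for e in term_exponents(n_features=2, degree=3)]
print(terms)
```

For two features and degree 3, the list contains `x_1^2 * x_2` but not `x_1^2 * x_2^2`, matching the example above.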

Now that we’ve covered the basics of the polynomial transformation of datasets, let’s talk about the intuition behind the equation of polynomial regression.

**Model Representation**

Much like the linear regression algorithms discussed in previous articles, a polynomial regressor tries to create an equation which it believes creates the best representation of the data given.

Unsurprisingly, the equation of a polynomial regression algorithm can be modeled by an (almost) regular polynomial equation. Let’s dive into it and break down each part of the equation.

Let’s talk about each variable in the equation:

- *y* represents the dependent variable (output value)
- *b_0* represents the y-intercept of the polynomial function
- *b_1* - *b_dc* - *b_(d+c_C_d)* represent parameter values that our model will tune
- *d* represents the degree of the polynomial being tuned
- *c* represents the number of independent variables in the dataset before polynomial transformation
- *x_1* - *x_c* are the independent variables in the dataset
- *p* is the product of a pair of features with a total degree less than or equal to *d*
- *i* is the index of the i’th product of a pair of features with a total degree less than or equal to *d*
- *d+c_C_d* is the number of unique pairs of features with a total degree less than or equal to *d*

As we can see, the equation incorporates the polynomial transformation results that we discussed in the previous section. The parameter values (*b_0* - *b_n*) will be tuned by our polynomial regression algorithm such that we have a complete equation of a **curve of best fit**. This curve will be one that best represents the data being given.

If there is more than one independent variable, we will end up with a graph similar to the one below.

If we only have one independent variable, however, we will have a simple graph in two dimensions.

Now that we know what our polynomial regression equation will look like, let’s discuss how our algorithm will create such an equation.

**The Cost Function**

In order to finalize a polynomial equation of the form discussed in the previous section, our model will need to be able to determine how well an equation represents the data given.

In order to do this, a polynomial regressor will implement what is called **The Mean Squared Error (MSE) Cost Function**, a mathematical formula that returns a numerical value representing the error of our model.

This cost function is a little complex, so I wrote an article dedicated to explaining it. Please make sure to check it out right away, as MSE is a large part of polynomial regression.

By using MSE, our model will be able to determine which parameter values create a better representation of the data than others.
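As a rough sketch of what the cost function computes, the snippet below uses the common convention of averaging the squared errors over all data points (some texts divide by 2n instead). The function name and the sample values are hypothetical:

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical targets and two candidate sets of predictions.
y_true = [1.0, 4.0, 9.0]
close_fit = [1.5, 4.0, 8.5]
poor_fit = [3.0, 3.0, 3.0]

mse_close = mean_squared_error(y_true, close_fit)
mse_poor = mean_squared_error(y_true, poor_fit)
print(mse_close, mse_poor)
```

The candidate whose predictions sit closer to the targets gets the smaller MSE, which is how the model ranks one set of parameter values against another.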

*But how do machine learning algorithms converge upon optimal parameter values in the first place?*

**Gradient Descent**

This is where gradient descent, another complex mathematical process, comes into play. Due to gradient descent’s complexity, I have written another article dedicated to explaining the math behind it.

I highly suggest that you read the article before continuing, as gradient descent, although a little complicated, is a very important part of polynomial regression.

After our regressor completes the gradient descent process, it will have reached optimal parameter values that best minimize the MSE cost function discussed in the previous section.
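For intuition, here is a minimal sketch of batch gradient descent on the MSE of a single-variable polynomial. The function names, learning rate, iteration count, and data are all assumptions made for illustration, not values from this article:

```python
def predict(coefficients, x):
    """Evaluate b_0 + b_1*x + b_2*x^2 + ... for a single x."""
    return sum(b * x ** power for power, b in enumerate(coefficients))


def gradient_descent(xs, ys, degree, learning_rate=0.1, iterations=2000):
    """Minimize MSE over polynomial coefficients with batch gradient descent."""
    coefficients = [0.0] * (degree + 1)
    n = len(xs)
    for _ in range(iterations):
        errors = [predict(coefficients, x) - y for x, y in zip(xs, ys)]
        # Partial derivative of MSE with respect to each coefficient b_k.
        gradients = [
            (2 / n) * sum(e * x ** k for e, x in zip(errors, xs))
            for k in range(degree + 1)
        ]
        coefficients = [
            b - learning_rate * g for b, g in zip(coefficients, gradients)
        ]
    return coefficients


# Hypothetical data generated from y = 1 + 2x + 3x^2.
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]

fitted = gradient_descent(xs, ys, degree=2)
print([round(b, 4) for b in fitted])
```

On this noise-free data the fitted coefficients approach the ones used to generate it. Note that Scikit-Learn’s *LinearRegression*, used later in this article, solves the least-squares problem directly rather than iterating like this, so the sketch only illustrates the idea of descending the MSE surface.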

**Predicting**

Let’s say that our model was trained on a dataset with two variables to the second degree. That would mean that its regression equation would be in the form:
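Assuming the terms are ordered the way Scikit-Learn’s *PolynomialFeatures* orders them (the ordering is an assumption; the parameter names *b_0* through *b_5* are the ones used in this section), that form can be written as:

```latex
y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1^2 + b_4 x_1 x_2 + b_5 x_2^2
```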

The parameter values *b_0* through *b_5* would be calculated by the regressor with gradient descent, but for the sake of this example, let’s assign random values.

We now have the following equation:

To predict new values, our regressor simply needs to plug in the values of the first and second independent variable into *x_1* and *x_2,* respectively.

For example, if the value of the first independent variable was 2 and the value of the second was 4, the following values would be plugged in:

- *x_1*: 2
- *x_2*: 4

Now we just simplify the above equation so that we get the value of *y*. This value will be the predicted value of the regression model.

Great! So our regressor will output 90.4 as the predicted value.
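The plug-in step can also be sketched in code. The coefficients below are hypothetical placeholders chosen for illustration, not the values from the example above, so the result differs from 90.4. The term order is also an assumption:

```python
def predict_second_degree(b, x_1, x_2):
    """Evaluate a two-variable, second-degree polynomial.

    Assumed term order: 1, x_1, x_2, x_1^2, x_1*x_2, x_2^2.
    """
    features = [1, x_1, x_2, x_1 ** 2, x_1 * x_2, x_2 ** 2]
    return sum(coef * value for coef, value in zip(b, features))


# Hypothetical coefficients b_0..b_5, NOT the values used in the example above.
b = [1.0, 2.0, 3.0, 0.5, 0.25, 0.1]
y = predict_second_degree(b, x_1=2, x_2=4)
print(round(y, 2))
```

Whatever the coefficients are, prediction is just this one evaluation of the fitted polynomial at the new input values.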

Now that we know how our polynomial regression model works, let’s implement it in Python.

### Implementation of Polynomial Regression in Python

**Library Installation**

Now that we’ve talked about the intuition behind polynomial regression, it’s time to implement the model in code.

*Note: The dataset used in this article was downloaded from superdatascience.com. For convenience, all the code and data for this section of the article can be found **here**.*

Before we do this, however, we must install four important libraries: Scikit-Learn, Pandas, Numpy, and Matplotlib.

- **Scikit-Learn** is a machine learning library that provides machine learning algorithms to perform regression, classification, clustering, and more.
- **Pandas** is a Python library that helps in data manipulation and analysis, and it offers data structures that are needed in machine learning.
- **Numpy** is another library that makes it easy to work with arrays. It provides several unique functions that will help in data preprocessing.
- **Matplotlib** is a graphing library that will help us visualize our regressor’s curve on a graph with the data scatterplot.

Fortunately, these libraries can be quickly installed by using **Pip**, Python’s default package-management system. All we have to do is enter the following commands into the terminal (note that the package name for Scikit-Learn is *scikit-learn*):

    pip3 install numpy
    pip3 install pandas
    pip3 install scikit-learn
    pip3 install matplotlib

After this is complete, we can begin coding our algorithm in Python!

**Step 1: Importing the Data**

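As always, we must begin by importing Numpy and Pandas, the two main libraries we will be using for this regression model, along with Matplotlib’s *pyplot* module, which we will need for graphing later on.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt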

Now, we must import the dataset by using the *read_csv()* function from the Pandas library. This function will take in the .csv file and convert it to a Pandas dataframe.

    dataset = pd.read_csv('Position_Salaries.csv')

Now we need to split our dataframe up into Numpy arrays: we need one array containing the independent variable(s) and another containing the dependent variable. To do this, we must take a look at our dataset.

We can see that there are three columns: *position*, *level*, and *salary*. Our task with this data is to predict an employee’s salary given their position.

If we pay close attention to the first two columns, we’ll see that there is a one-to-one correspondence between *level* and *position*. In other words, every level value corresponds to a unique position value. Thus, we can omit the *position* column and just input *level* into our regression model. The dependent variable is *salary*, since the values within this column are what our regressor needs to predict.

Now, we must split up our dataset into an independent variable Numpy array called *x* and a dependent variable Numpy array called *y*.

    x = dataset.iloc[:, 1:2].values  # level column, kept two-dimensional
    y = dataset.iloc[:, 2].values    # salary column

**Step 2: Data Preprocessing**

As with any other machine learning model, a polynomial regressor requires input data to be preprocessed, or “cleaned”.

As always, we must now split these two arrays into training and testing data subsets so that we can accurately test our regression model after training it. Since our dataset is quite small, we will use a *test_size* of 0.2 when creating our subsets.

    from sklearn.model_selection import train_test_split

    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2, random_state=0)

Now, we must apply feature scaling on our input and output datasets in order to optimize the training of our polynomial regressor. Feature scaling will center our data closer to 0, which will accelerate the convergence of the gradient descent algorithm.

To scale our data, we can use Scikit-Learn’s *StandardScaler* class; more specifically, we can use the *.fit_transform* method on our training datasets and the *.transform* method on our test datasets.

First, we will apply standard scaling on our input training and test sets as shown below. We first create an instance of the *StandardScaler* class called *sc_x*, and then we apply the necessary methods to transform our data.

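    from sklearn.preprocessing import StandardScaler

    sc_x = StandardScaler()
    # The scaler returns new arrays, so the results must be assigned.
    # They are kept as separate copies; later steps use the unscaled arrays.
    x_train_scaled = sc_x.fit_transform(x_train)  # fit on the training data, then scale it
    x_test_scaled = sc_x.transform(x_test)        # reuse the training statistics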

Now, all we have to do is implement the same steps for our dependent variable datasets. We call this instance of the *StandardScaler* class *sc_y*.

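    sc_y = StandardScaler()
    # StandardScaler expects 2-D input, so reshape the 1-D target arrays to one column.
    y_train_scaled = sc_y.fit_transform(y_train.reshape(-1, 1))
    y_test_scaled = sc_y.transform(y_test.reshape(-1, 1))

Note that the scaled arrays are stored under new names; the remaining steps keep using the unscaled arrays, so the model’s predictions stay in the original salary units.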
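To see what fitting and transforming mean here, the sketch below standardizes one feature by hand with z = (x - mean) / std, which is the formula *StandardScaler* applies. The helper names and the sample values are hypothetical:

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn the mean and (population) standard deviation of the training data."""
    return mean(values), pstdev(values)


def transform(values, mu, sigma):
    """Apply z = (x - mean) / std using statistics learned elsewhere."""
    return [(v - mu) / sigma for v in values]


# Hypothetical training and test values for a single feature.
train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [6.0]

mu, sigma = fit_scaler(train)             # "fit" step: training data only
train_scaled = transform(train, mu, sigma)
test_scaled = transform(test, mu, sigma)  # test data reuses the training statistics
print(train_scaled, test_scaled)
```

The test values are scaled with the mean and standard deviation learned from the training values, which is why the test set only gets *.transform* and never *.fit_transform*.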

Finally, we must polynomially transform our dataset by using the *PolynomialFeatures* class provided by Scikit-Learn. Since we don’t know the optimal degree to transform our dataset to, we can just choose 3 as a good arbitrary value.

To implement this, we must first instantiate the *PolynomialFeatures* class and then use the *.fit_transform* and *.transform* methods to transform the input datasets. This is shown below.

    from sklearn.preprocessing import PolynomialFeatures

    poly_transform = PolynomialFeatures(degree=3, include_bias=False)
    x_poly_train = poly_transform.fit_transform(x_train)
    x_poly_test = poly_transform.transform(x_test)

The *include_bias* parameter determines whether *PolynomialFeatures* will add a column of 1’s to the front of the dataset to represent the *y-intercept* parameter value for our regression equation.

Since the *LinearRegression* class we will use to create a polynomial model fits the intercept itself by default, we set *include_bias* to *False* to avoid a redundant column.
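A quick way to see the effect of *include_bias* is to transform a single value both ways. This is a small sketch that assumes Scikit-Learn is installed; the sample value is hypothetical:

```python
from sklearn.preprocessing import PolynomialFeatures

sample = [[2.0]]  # one row, one feature (hypothetical value)

with_bias = PolynomialFeatures(degree=3, include_bias=True).fit_transform(sample)
without_bias = PolynomialFeatures(degree=3, include_bias=False).fit_transform(sample)

print(with_bias)     # columns: 1, x, x^2, x^3
print(without_bias)  # columns: x, x^2, x^3
```

The only difference between the two outputs is the leading column of 1’s.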

Now that we’ve finished data preprocessing, we can finally move on to the training and testing of our actual polynomial regression model.

**Step 3: Creating and Training the Regressor**

Since we have already polynomially transformed our dataset, we can just apply the *LinearRegression* class from Scikit-Learn to create a polynomial model. To do this, we must first instantiate the class and then apply the *.fit()* method passing in our training data as arguments.

    from sklearn.linear_model import LinearRegression

    regressor = LinearRegression()
    regressor.fit(x_poly_train, y_train)

We have successfully trained our polynomial regression model! We can now use this model to make predictions based on input values of our choosing. First, however, let’s visualize the model by graphing its predictions on both the training and test datasets on a scatterplot of the actual data points.

**Step 4: Visualizing the Regressor’s Curve**

To visualize our regressor’s curve, we can use the Matplotlib library we installed at the beginning of this article. First, however, we must create a dataset with smaller independent variable increments for the sole purpose of graphing a smooth curve. If we take a look at our current dataset below:

We see that the independent variable values that we are using, contained in the *level* column, have increments of 1 between them. Unfortunately, if we use only these values to predict with our model, we won’t be able to create a smooth curve. So, we must create two datasets (one for the training data and one for the test data) that contain independent variable values with a smaller increment.

To do this, we can use the *arange()* function from the Numpy library as shown below. We are essentially creating new datasets that contain values between the minimum and maximum of the original datasets with an increment of 0.01. We add 0.01 to the maximum values in *x_train* and *x_test* because the *arange()* function does not include the maximum bound itself.

    # Use the arrays' own .min()/.max() so that arange receives plain scalars
    x_grid_train = np.arange(x_train.min(), x_train.max() + .01, step=0.01)
    x_grid_test = np.arange(x_test.min(), x_test.max() + .01, step=0.01)

In order for *x_grid_train* and *x_grid_test* to serve as proper inputs for our regressor, we must reshape them to proper two-dimensional arrays as shown below.

    x_grid_train = x_grid_train.reshape(len(x_grid_train), 1)
    x_grid_test = x_grid_test.reshape(len(x_grid_test), 1)

Great! We can finally begin to visualize our model by using Matplotlib. We’ll break down this process by walking through the graphing of our training data. We must first create a scatterplot containing the x and y-values of our training dataset. We’ll make the data points red and give them a label so that we can create a key for our graph.

    plt.scatter(x_train, y_train, color='red', label='Training Data Points')

Now we must graph the curve that represents our model’s predictions of the training dataset. The x-values of the curve will be those in *x_grid_train *and the y-values will be the model’s output given a polynomially transformed version of* x_grid_train*. We’ll make the curve blue and give it a different label for the key.

    plt.plot(x_grid_train, regressor.predict(poly_transform.transform(x_grid_train)), color='blue', label='Model Curve')

Now, the main aspects of our graph are complete: we just need to add labels for our graph and create a legend.

    plt.title('Job Level vs Salary (Training Dataset)')
    plt.xlabel('Job Level (1 - 10)')
    plt.ylabel('Salary')
    plt.legend()
    plt.show()

Before we take a look at the visualization, let’s create another graph for the test data. Fortunately, the steps are exactly the same as those for creating the training data graph.

    plt.scatter(x_test, y_test, color='red', label='Test Data Points')
    plt.plot(x_grid_test, regressor.predict(poly_transform.transform(x_grid_test)), color='blue', label='Model Curve')
    plt.title('Job Level vs Salary (Test Dataset)')
    plt.xlabel('Job Level (1 - 10)')
    plt.ylabel('Salary')
    plt.legend()
    plt.show()

Now we can view our graphs!

As we can see, our model’s curve matches up quite closely with the points in both the training and test datasets. This suggests that our choice to polynomially transform our dataset to the *third* degree was a good one.

**Step 5: Making a Single Prediction**

There’s no point in creating a machine learning model if you don’t use it to predict values that aren’t in the training or test sets. After all, the main purpose of machine learning algorithms is to be beneficial in real-world applications.

To make a prediction with our regressor, we can call the same *.predict()* method as we did when visualizing the model’s curve. However, instead of inputting a Numpy array like last time, all we need to do is input a double nested list containing the values of our independent variables in the same order as in the training dataset. We input a double nested list because Scikit-Learn regressors expect a two-dimensional data structure as input.

When training the regressor, we provided just the *level* column as input. In addition, we polynomially transformed the input by using *PolynomialFeatures*. Thus, we just input a polynomially transformed double nested list into the *.predict()* method.

    prediction = regressor.predict(poly_transform.transform([[11]]))
    print(prediction)

By inputting 11 as shown above, we are using our polynomial regressor to predict the salary of an employee at level 11. If we run the above code, we get a predicted value of $1,520,293. This seems reasonable, as a level 10 employee had a salary of $1,000,000 in our training dataset.

### Conclusion

In this article, we learned how polynomial regressors work and how they can be implemented through the use of Python libraries such as Scikit-Learn.

*I hope that you enjoyed this article; feel free to leave any comments in the article so that I can provide even better content in the future. Stay tuned for my upcoming articles on decision tree regression*.
