Simple Linear Regression in Python

Learn simple linear regression in Python using pandas, scikit-learn, and matplotlib. Train a model, make predictions, evaluate the fit with error metrics, plot the regression line, and save predictions to a CSV file.

Reviewed by Deepak Prasad

Simple linear regression predicts one target value from one input feature using a straight line. This tutorial uses pandas to load a CSV, scikit-learn to train LinearRegression, matplotlib to plot the fit, and pandas again to save predictions. For ML context, see introduction to Python for machine learning and supervised learning algorithms.

By the end you will: load data → split train/test → fit a model → predict → evaluate with MAE, MSE, RMSE, and R² → plot the regression line → save predictions to CSV. The sample file hours_scores.csv in this article’s folder maps Hours (study time) to Score.

Tested on: Python 3.13.3; scikit-learn 1.9.0; pandas 3.0.3; kernel 6.14.0-37-generic.


Simple linear regression quick reference

Step                               Python tool
Load CSV data                      pandas.read_csv()
Select input and target columns    DataFrame column selection
Split train and test data          train_test_split()
Create model                       LinearRegression()
Train model                        model.fit()
Predict values                     model.predict()
Check slope and intercept          model.coef_, model.intercept_
Evaluate model                     MAE, MSE, RMSE, R²
Plot regression line               matplotlib
Save predictions                   DataFrame.to_csv()

What is simple linear regression?

Simple linear regression models the relationship between one independent variable (input) and one dependent variable (target) with a straight line. Use it to predict numeric outcomes or to see how two numeric variables move together.

The line is described by one formula, used consistently throughout this article:

y = mx + b

  • y — predicted value
  • x — input feature
  • m — slope (how much y changes when x increases by 1)
  • b — intercept (predicted y when x is 0)

Both variables should be numeric. This is not a full statistics course—just the line you fit in code.
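
The formula can be written as a one-line function. This is a minimal sketch: predict_line is a hypothetical helper name, and the slope and intercept are the approximate values this article reports for its sample data, used here only for illustration.

```python
def predict_line(x, m, b):
    """Return the y value on the line y = m*x + b for input x."""
    return m * x + b


# Illustrative values only: slope ~5.1 and intercept ~67.3 are the
# approximate figures this article reports for its sample data.
m, b = 5.1, 67.3

at_zero = predict_line(0, m, b)  # when x is 0, the prediction is the intercept
at_one = predict_line(1, m, b)   # raising x by 1 adds the slope

print("x = 0:", at_zero)
print("x = 1:", at_one)
print("difference:", at_one - at_zero)
```

The difference between the two predictions is the slope, which is what "how much y changes when x increases by 1" means in practice.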


Simple vs multiple linear regression

Type                          Meaning
Simple linear regression      One input feature predicts one target
Multiple linear regression    Two or more input features predict one target

This article focuses on simple linear regression. Multiple regression uses the same sklearn workflow with more columns in X.
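
To illustrate that the workflow carries over, here is a minimal sketch with two input columns. The data is synthetic and noise-free (target = 2*x1 + 3*x2 + 1), chosen only so the fitted coefficients are easy to recognize; it is not the article's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only: two feature columns, no noise.
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]], dtype=float)
y = 2 * X[:, 0] + 3 * X[:, 1] + 1

model = LinearRegression()
model.fit(X, y)

# coef_ holds one coefficient per column of X.
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
```

The only change from the single-feature case is the number of columns in X; fit() and predict() are called the same way.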


Install required Python libraries

bash
pip install pandas matplotlib scikit-learn

The PyPI package name is scikit-learn. The import name in Python is sklearn. Do not pip install sklearn—that deprecated package is not what you want.


Prepare the CSV dataset

Use a CSV with two numeric columns—for example Hours and Score. For general CSV read/write patterns in Python, see read and write CSV:

csv
Hours,Score
1,76
2,78
4,88
...

Hours is the input feature; Score is the target. For this walkthrough, drop missing values and keep both columns numeric. Save the file as hours_scores.csv (a copy ships with this article).
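
One possible way to do that cleanup with pandas is sketched below. The messy CSV text is hypothetical and inlined so the example runs on its own; with a real file you would pass the filename to pd.read_csv() instead.

```python
import io

import pandas as pd

# Hypothetical messy CSV text: one missing Score and one non-numeric Hours.
raw = io.StringIO("Hours,Score\n1,76\n2,\nabc,80\n4,88\n")
df = pd.read_csv(raw)

# Convert both columns to numbers; values that cannot be parsed become NaN.
df["Hours"] = pd.to_numeric(df["Hours"], errors="coerce")
df["Score"] = pd.to_numeric(df["Score"], errors="coerce")

# Drop every row that has a missing value in either column.
clean = df.dropna(subset=["Hours", "Score"]).reset_index(drop=True)
print(clean)
```

Dropping rows is the simplest choice for a walkthrough; whether it suits a real dataset depends on how much data is missing and why.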

More representative data usually improves reliability; a tiny demo set can still show the workflow even when metrics are modest.


Load the dataset with pandas

python
import pandas as pd

dataset = pd.read_csv("hours_scores.csv")
print(dataset.head())
print(dataset.columns)

X = dataset[["Hours"]].values
y = dataset["Score"].values

X must be two-dimensional for scikit-learn—even with one feature, select dataset[["Hours"]] (double brackets), not dataset["Hours"] alone.
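
The difference is easiest to see by printing the array shapes. This small sketch uses a hypothetical three-row frame with the same column names as the sample file.

```python
import pandas as pd

# Small hypothetical frame using the article's column names.
df = pd.DataFrame({"Hours": [1, 2, 4], "Score": [76, 78, 88]})

X_2d = df[["Hours"]].values  # double brackets -> DataFrame -> 2D array
x_1d = df["Hours"].values    # single brackets -> Series -> 1D array

print(X_2d.shape)  # (3, 1)
print(x_1d.shape)  # (3,)

# A 1D array can be turned into a single column with reshape.
X_fixed = x_1d.reshape(-1, 1)
print(X_fixed.shape)  # (3, 1)
```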


Split data into training and test sets

python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

Training data fits the model; test data checks prediction on rows the model did not see during training. A common split is 70/30 or 80/20. Set random_state for reproducible splits.
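
The effect of random_state can be checked directly. The sketch below uses ten synthetic rows (not the sample CSV) and splits them twice with the same seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten synthetic rows, for illustration only.
X = np.arange(10).reshape(-1, 1)
y = np.arange(10) * 2

# Two splits with the same random_state select the same rows.
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X, y, test_size=0.3, random_state=0)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.3, random_state=0)

print("Train rows:", len(X_tr1), "Test rows:", len(X_te1))
print("Same test rows both times:", np.array_equal(X_te1, X_te2))
```

Without a fixed random_state, each run may pick different test rows, so metrics can change from run to run.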


Train simple linear regression model with sklearn

python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression fits an ordinary least squares line—it minimizes the sum of squared differences between actual and predicted training values.
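
For a single feature, the least squares slope and intercept have a well-known closed form: the slope is the sum of (x - mean(x)) * (y - mean(y)) divided by the sum of (x - mean(x))², and the intercept is mean(y) - slope * mean(x). The sketch below computes both by hand on synthetic data and compares them with scikit-learn's result; it illustrates the idea and is not a description of how the library computes the fit internally.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Closed-form least squares estimates for one feature.
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# The same fit through scikit-learn, for comparison.
model = LinearRegression().fit(x.reshape(-1, 1), y)

print("Slope:    ", slope, "vs", model.coef_[0])
print("Intercept:", intercept, "vs", model.intercept_)
```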


Check slope and intercept

python
print("Slope (m):", model.coef_[0])
print("Intercept (b):", model.intercept_)

With random_state=0 on the sample CSV, you should see a slope near 5.1 and intercept near 67.3—meaning predicted score rises about 5 points per extra study hour in this fit. A positive slope means higher hours tend to link to higher scores in the example.

These map to y = mx + b in your fitted line.
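
To make that mapping concrete, the following sketch fits a model on synthetic, noise-free data (y = 3x + 2, not the sample CSV) and checks that applying the formula by hand gives the same number as model.predict().

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data for illustration: y = 3x + 2.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 3 * X.ravel() + 2

model = LinearRegression().fit(X, y)

m = model.coef_[0]    # slope
b = model.intercept_  # intercept

x_new = 5.0
by_hand = m * x_new + b                 # y = mx + b
by_model = model.predict([[x_new]])[0]  # same calculation via scikit-learn

print(by_hand, by_model)
```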


Make predictions using the model

Test set predictions:

python
y_pred = model.predict(X_test)

One new value (student who studied 5 hours):

python
predicted_score = model.predict([[5]])
print(predicted_score[0])

You should see about 92.8 on the sample data—an estimate, not a guaranteed exact score.


Save predictions to a CSV file

python
import pandas as pd

results = pd.DataFrame({
    "Hours": X_test.ravel(),
    "Actual_Score": y_test,
    "Predicted_Score": y_pred,
})
results.to_csv("predictions.csv", index=False)

index=False leaves the DataFrame's row index out of the file, so predictions.csv contains only the Hours, Actual_Score, and Predicted_Score columns.


Evaluate the linear regression model

python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R²:", r2)

On the sample split (random_state=0), expect roughly MAE 5.1, RMSE 6.5, and R² 0.50—moderate fit on a small dataset.

Metric      Meaning
MAE         Average absolute prediction error
MSE         Average squared prediction error
RMSE        Error in the same unit as the target
R² score    Share of target variance the model explains

Lower MAE and RMSE are better. An R² closer to 1 usually means a stronger fit, but read it alongside MAE and RMSE rather than on its own.
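
The definitions in the table can be spelled out with NumPy. This is a minimal sketch using hypothetical actual and predicted values (y_true and y_hat are illustrative names, not results from the sample data), compared against the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted values, for illustration only.
y_true = np.array([80.0, 90.0, 70.0, 85.0])
y_hat = np.array([78.0, 93.0, 74.0, 85.0])

errors = y_true - y_hat

mae = np.mean(np.abs(errors))  # average absolute error
mse = np.mean(errors ** 2)     # average squared error
rmse = np.sqrt(mse)            # back in the target's unit

# R² compares the model's squared error with that of always
# predicting the mean of the actual values.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MAE: ", mae, "vs", mean_absolute_error(y_true, y_hat))
print("MSE: ", mse, "vs", mean_squared_error(y_true, y_hat))
print("RMSE:", rmse)
print("R²:  ", r2, "vs", r2_score(y_true, y_hat))
```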


Plot actual data and regression line

train_test_split shuffles rows, so X_test may be out of order—plotting X_test against predictions can zigzag the line. Sort by x first. For scatter and line plots beyond this regression example, see Python matplotlib:

python
import matplotlib.pyplot as plt
import numpy as np

order = np.argsort(X_test.ravel())
x_line = X_test.ravel()[order]
y_line = y_pred[order]

plt.scatter(X_test, y_test, color="red", label="Actual")
plt.plot(x_line, y_line, color="blue", label="Predicted line")
plt.title("Hours vs Score")
plt.xlabel("Hours")
plt.ylabel("Score")
plt.legend()
plt.savefig("regression_plot.png", bbox_inches="tight")
plt.show()

Red points are actual test scores; the blue line is the model’s predictions across sorted hours.


Complete example script

python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

dataset = pd.read_csv("hours_scores.csv")
X = dataset[["Hours"]].values
y = dataset["Score"].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Slope:", model.coef_[0], "Intercept:", model.intercept_)
print("Predict 5 hours:", model.predict([[5]])[0])
print("MAE:", mean_absolute_error(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R²:", r2_score(y_test, y_pred))

pd.DataFrame({
    "Hours": X_test.ravel(),
    "Actual_Score": y_test,
    "Predicted_Score": y_pred,
}).to_csv("predictions.csv", index=False)

order = np.argsort(X_test.ravel())
plt.scatter(X_test, y_test, color="red")
plt.plot(X_test.ravel()[order], y_pred[order], color="blue")
plt.xlabel("Hours")
plt.ylabel("Score")
plt.title("Simple linear regression")
plt.savefig("regression_plot.png", bbox_inches="tight")

Run from the folder that contains hours_scores.csv.


Common mistakes to avoid

  • pip install sklearn instead of pip install scikit-learn
  • Passing 1D X instead of shape (n_samples, 1)
  • Evaluating only on training data and calling it “accuracy”
  • Swapping X (features) and y (target)
  • Treating predictions as exact truth
  • Using linear regression when the relationship is clearly curved
  • Plotting the regression line with unsorted test X values
  • Skipping metrics (MAE, RMSE, R²) after plotting

Summary

Simple linear regression uses one input feature to predict one numeric target along a line y = mx + b. In Python, load CSV data with pandas, split with train_test_split, train LinearRegression().fit(), predict with predict(), evaluate with MAE, RMSE, and R², plot with matplotlib (sort x for a clean line), and save outputs with to_csv() when needed.


Frequently Asked Questions

1. What is simple linear regression in Python?

A model that predicts one numeric target from one numeric input feature by fitting a straight line—usually with sklearn.linear_model.LinearRegression after loading data with pandas.

2. Do I install sklearn or scikit-learn with pip?

Install scikit-learn with pip install scikit-learn; import it in code as sklearn.

3. Why must X be two-dimensional for LinearRegression?

scikit-learn expects X with shape (n_samples, n_features)—even one feature uses X = df[["Hours"]].values, not a 1D array.

4. Which metrics evaluate simple linear regression?

Common choices are MAE, MSE, RMSE, and R² from sklearn.metrics—lower MAE/RMSE is better; R² closer to 1 means more explained variance.

5. How do I save regression predictions to a CSV file?

Build a pandas DataFrame with actual and predicted columns and call to_csv("predictions.csv", index=False).
Bashir Alam

Data Analyst and Machine Learning Engineer

Computer Science graduate from the University of Central Asia, currently employed as a full-time Machine Learning Engineer at uExel. His expertise lies in OCR, text extraction, data preprocessing, and …