Simple linear regression predicts one target value from one input feature using a straight line. This tutorial uses pandas to load a CSV, scikit-learn to train LinearRegression, matplotlib to plot the fit, and pandas again to save predictions. For ML context, see introduction to Python for machine learning and supervised learning algorithms.
By the end you will: load data → split train/test → fit a model → predict → evaluate with MAE, MSE, RMSE, and R² → plot the regression line → save predictions to CSV. The sample file hours_scores.csv in this article’s folder maps Hours (study time) to Score.
Tested on: Python 3.13.3; scikit-learn 1.9.0; pandas 3.0.3; kernel 6.14.0-37-generic.
Simple linear regression quick reference
| Step | Python tool |
|---|---|
| Load CSV data | pandas.read_csv() |
| Select input and target columns | DataFrame column selection |
| Split train and test data | train_test_split() |
| Create model | LinearRegression() |
| Train model | model.fit() |
| Predict values | model.predict() |
| Check slope and intercept | model.coef_, model.intercept_ |
| Evaluate model | MAE, MSE, RMSE, R² |
| Plot regression line | matplotlib |
| Save predictions | DataFrame.to_csv() |
What is simple linear regression?
Simple linear regression models the relationship between one independent variable (input) and one dependent variable (target) with a straight line. Use it to predict numeric outcomes or to see how two numeric variables move together.
One formula—use it consistently:
y = mx + b
- y — predicted value
- x — input feature
- m — slope (how much y changes when x increases by 1)
- b — intercept (predicted y when x is 0)
Both variables should be numeric. This is not a full statistics course—just the line you fit in code.
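As a quick worked example of the formula, the sketch below plugs numbers into y = mx + b. The slope and intercept are illustrative values only, roughly what this tutorial reports for its sample fit; predict_score is a hypothetical helper name, not part of any library.

```python
# Illustrative values only: roughly the slope and intercept this
# tutorial reports for its sample fit.
m = 5.1   # slope: change in predicted score per extra study hour
b = 67.3  # intercept: predicted score at 0 hours


def predict_score(hours: float) -> float:
    """Return the point on the line y = m*x + b for the given hours."""
    return m * hours + b


print(round(predict_score(5), 1))  # 92.8
```

Increasing hours by 1 raises the result by exactly m, which is what the slope means.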
Simple vs multiple linear regression
| Type | Meaning |
|---|---|
| Simple linear regression | One input feature predicts one target |
| Multiple linear regression | Two or more input features predict one target |
This article focuses on simple linear regression. Multiple regression uses the same sklearn workflow with more columns in X.
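To make the "more columns in X" point concrete, here is a minimal sketch with two features. The data is made up, and the target is built from a known formula so the fitted coefficients are easy to check; it is not the article's sample dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: two input features per row (two columns in X).
X = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 5.0],
])
# Target built exactly as 2*x1 + 3*x2 + 1 so the result is predictable.
y = 2 * X[:, 0] + 3 * X[:, 1] + 1

model = LinearRegression().fit(X, y)

print(model.coef_)       # one coefficient per column in X
print(model.intercept_)  # a single intercept
```

The workflow is the same as with one feature; only the shape of X and the length of coef_ change.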
Install required Python libraries
pip install pandas matplotlib scikit-learn

The PyPI package name is scikit-learn. The import name in Python is sklearn. Do not run pip install sklearn: that deprecated package is not what you want.
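One way to confirm the install worked is to look up each import name from Python. This is a small sketch using only the standard library; the required_modules tuple and status dictionary are names chosen for illustration.

```python
import importlib.util

# Import names to look for; note "sklearn", not "scikit-learn".
required_modules = ("pandas", "matplotlib", "sklearn")

# find_spec returns None when a top-level module cannot be located.
status = {
    name: importlib.util.find_spec(name) is not None
    for name in required_modules
}

for name, installed in status.items():
    print(f"{name}: {'found' if installed else 'missing'}")
```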
Prepare the CSV dataset
For general CSV read/write patterns in Python, see read and write CSV. Use a CSV with two numeric columns, for example Hours and Score:
Hours,Score
1,76
2,78
4,88
...

Hours is the input feature; Score is the target. For this walkthrough, drop missing values and keep both columns numeric. Save the file as hours_scores.csv (a copy ships with this article).
More representative data usually improves reliability; a tiny demo set can still show the workflow even when metrics are modest.
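The cleaning step mentioned above (drop missing values, keep both columns numeric) could look like the sketch below. It is one possible approach, not necessarily how the sample file was prepared, and the inline CSV text is made up to include one missing value and one non-numeric value.

```python
import io

import pandas as pd

# Made-up CSV text with a missing Score and a non-numeric Hours value.
raw_csv = io.StringIO("Hours,Score\n1,76\n2,\nthree,80\n4,88\n")
raw = pd.read_csv(raw_csv)

# Convert both columns to numbers; values that cannot be parsed become NaN.
for column in ["Hours", "Score"]:
    raw[column] = pd.to_numeric(raw[column], errors="coerce")

# Drop any row where either column is missing.
clean = raw.dropna(subset=["Hours", "Score"]).reset_index(drop=True)

print(clean)
```

Only the rows with valid numbers in both columns remain, so the model never sees NaN values.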
Load the dataset with pandas
import pandas as pd
dataset = pd.read_csv("hours_scores.csv")
print(dataset.head())
print(dataset.columns)
X = dataset[["Hours"]].values
y = dataset["Score"].values

X must be two-dimensional for scikit-learn. Even with one feature, select dataset[["Hours"]] (double brackets), not dataset["Hours"] alone.
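The difference between single and double brackets is easiest to see in the array shapes. This sketch builds a small DataFrame inline from the first three sample rows shown earlier rather than reading the file; X_2d, x_1d, and X_fixed are illustrative names.

```python
import pandas as pd

# Inline stand-in for the CSV, using the first three sample rows.
dataset = pd.DataFrame({"Hours": [1, 2, 4], "Score": [76, 78, 88]})

X_2d = dataset[["Hours"]].values  # double brackets: DataFrame, then 2D array
x_1d = dataset["Hours"].values    # single brackets: Series, then 1D array

print(X_2d.shape)  # (3, 1)
print(x_1d.shape)  # (3,)

# If you already have a 1D array, reshape it into one column.
X_fixed = x_1d.reshape(-1, 1)
print(X_fixed.shape)  # (3, 1)
```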
Split data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

Training data fits the model; test data checks predictions on rows the model did not see during training. A common split is 70/30 or 80/20. Set random_state for reproducible splits.
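To see what test_size and random_state do, the sketch below splits ten made-up rows twice with the same settings. The arrays are placeholders, not the sample CSV.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up rows: one feature column and one target value each.
X = np.arange(1, 11).reshape(-1, 1)
y = np.arange(10, 110, 10)

split_a = train_test_split(X, y, test_size=0.3, random_state=0)
split_b = train_test_split(X, y, test_size=0.3, random_state=0)

X_train, X_test, y_train, y_test = split_a
print(len(X_train), len(X_test))  # 7 3

# The same random_state gives the same rows in each part.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)  # True
```

Without random_state, each run would shuffle differently and the metrics would change from run to run.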
Train simple linear regression model with sklearn
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression fits an ordinary least squares line: it minimizes the sum of squared differences between actual and predicted training values.
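For one feature, the least squares line has a textbook closed form, sketched here in plain Python. This illustrates the idea only; scikit-learn's internal solver is not implemented this way. The points are made up and lie exactly on y = 2x + 1, and sum_squared_error is a hypothetical helper.

```python
# Made-up points that lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
denominator = sum((x - mean_x) ** 2 for x in xs)
slope = numerator / denominator
intercept = mean_y - slope * mean_x


def sum_squared_error(m: float, b: float) -> float:
    """Sum of squared differences between actual y and the line m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


print(slope, intercept)  # 2.0 1.0
# Any other line has a larger squared error than the fitted one.
print(sum_squared_error(slope, intercept) < sum_squared_error(slope + 0.5, intercept))
```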
Check slope and intercept
print("Slope (m):", model.coef_[0])
print("Intercept (b):", model.intercept_)

With random_state=0 on the sample CSV, you should see a slope near 5.1 and an intercept near 67.3, meaning the predicted score rises about 5 points per extra study hour in this fit. A positive slope means more hours tend to go with higher scores in this example.
These map to y = mx + b in your fitted line.
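You can check that mapping by rebuilding the line by hand and comparing it with predict(). The data below is made up for the check and is not the sample CSV.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up hours and scores, only for this check.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([52.0, 55.0, 61.0, 64.0, 70.0])

model = LinearRegression().fit(X, y)

# y = m*x + b, using the fitted slope and intercept.
manual = model.coef_[0] * X.ravel() + model.intercept_

print(np.allclose(manual, model.predict(X)))  # True
```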
Make predictions using the model
Test set predictions:
y_pred = model.predict(X_test)

One new value (student who studied 5 hours):
predicted_score = model.predict([[5]])
print(predicted_score[0])

You should see about 92.8 on the sample data. This is an estimate, not a guaranteed exact score.
Save predictions to a CSV file
import pandas as pd
results = pd.DataFrame({
    "Hours": X_test.ravel(),
    "Actual_Score": y_test,
    "Predicted_Score": y_pred,
})
results.to_csv("predictions.csv", index=False)

index=False avoids an extra row-number column in the output file. Saving predictions to CSV is a step many tutorials skip.
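To see what index=False changes, compare the header line with and without it. This sketch writes to a string instead of a file (to_csv returns the text when no path is given); the two rows are illustrative values.

```python
import pandas as pd

# Illustrative rows, not real model output.
results = pd.DataFrame({
    "Hours": [2.0, 5.0],
    "Predicted_Score": [77.5, 92.8],
})

without_index = results.to_csv(index=False)
with_index = results.to_csv()  # index=True is the default

print(without_index.splitlines()[0])  # Hours,Predicted_Score
print(with_index.splitlines()[0])     # ,Hours,Predicted_Score
```

The leading comma in the second header is the unnamed row-number column.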
Evaluate the linear regression model
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R²:", r2)

On the sample split (random_state=0), expect roughly MAE 5.1, RMSE 6.5, and R² 0.50, a moderate fit on a small dataset.
| Metric | Meaning |
|---|---|
| MAE | Average absolute prediction error |
| MSE | Average squared prediction error |
| RMSE | Square root of MSE, in the same unit as the target |
| R² score | Share of target variance the model explains |
Lower MAE and RMSE are better. R² closer to 1 usually means a stronger fit, but use it with MAE/RMSE—not alone.
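The four metrics are short formulas, which this plain-Python sketch works through on three made-up values. It is meant to show what the numbers mean, not to replace the sklearn.metrics functions.

```python
import math

# Made-up actual and predicted scores.
actual = [80.0, 90.0, 100.0]
predicted = [82.0, 88.0, 101.0]

errors = [a - p for a, p in zip(actual, predicted)]
n = len(errors)

mae = sum(abs(e) for e in errors) / n  # average absolute error
mse = sum(e ** 2 for e in errors) / n  # average squared error
rmse = math.sqrt(mse)                  # back in score units

# R² = 1 - (residual sum of squares / total sum of squares)
mean_actual = sum(actual) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print(round(mae, 3), mse, round(rmse, 3), round(r2, 3))
```

Here the errors are -2, 2, and -1, so MAE is 5/3, MSE is 3, and R² is 1 - 9/200.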
Plot actual data and regression line
For scatter and line plots beyond this regression example, see Python matplotlib. train_test_split shuffles rows, so X_test may be out of order, and plotting X_test against predictions can make the line zigzag. Sort by x first:
import matplotlib.pyplot as plt
import numpy as np
order = np.argsort(X_test.ravel())
x_line = X_test.ravel()[order]
y_line = y_pred[order]
plt.scatter(X_test, y_test, color="red", label="Actual")
plt.plot(x_line, y_line, color="blue", label="Predicted line")
plt.title("Hours vs Score")
plt.xlabel("Hours")
plt.ylabel("Score")
plt.legend()
plt.savefig("regression_plot.png", bbox_inches="tight")
plt.show()

Red points are actual test scores; the blue line shows the model’s predictions across sorted hours.
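The sorting step is easier to follow on a tiny array. In this sketch the values are made up; np.argsort returns the positions that would sort the hours, and the same positions reorder the predictions so each pair stays together.

```python
import numpy as np

# Made-up shuffled test rows and their predictions.
X_test = np.array([[4.0], [1.0], [3.0]])
y_pred = np.array([88.0, 72.0, 83.0])

order = np.argsort(X_test.ravel())  # positions that sort hours ascending
x_line = X_test.ravel()[order]
y_line = y_pred[order]

print(order)   # [1 2 0]
print(x_line)  # [1. 3. 4.]
print(y_line)  # [72. 83. 88.]
```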
Complete example script
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
dataset = pd.read_csv("hours_scores.csv")
X = dataset[["Hours"]].values
y = dataset["Score"].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Slope:", model.coef_[0], "Intercept:", model.intercept_)
print("Predict 5 hours:", model.predict([[5]])[0])
print("MAE:", mean_absolute_error(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R²:", r2_score(y_test, y_pred))
pd.DataFrame({
    "Hours": X_test.ravel(),
    "Actual_Score": y_test,
    "Predicted_Score": y_pred,
}).to_csv("predictions.csv", index=False)
order = np.argsort(X_test.ravel())
plt.scatter(X_test, y_test, color="red")
plt.plot(X_test.ravel()[order], y_pred[order], color="blue")
plt.xlabel("Hours")
plt.ylabel("Score")
plt.title("Simple linear regression")
plt.savefig("regression_plot.png", bbox_inches="tight")

Run the script from the folder that contains hours_scores.csv.
Common mistakes to avoid
- pip install sklearn instead of pip install scikit-learn
- Passing 1D X instead of shape (n_samples, 1)
- Evaluating only on training data and calling it “accuracy”
- Swapping X (features) and y (target)
- Treating predictions as exact truth
- Using linear regression when the relationship is clearly curved
- Plotting the regression line with unsorted test X values
- Skipping metrics (MAE, RMSE, R²) after plotting
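The 1D X mistake from the list above is easy to reproduce. This sketch uses made-up numbers, catches the ValueError that scikit-learn raises for a 1D feature array, and then applies the reshape fix.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up values; hours_1d has shape (4,), which fit() rejects.
hours_1d = np.array([1.0, 2.0, 3.0, 4.0])
scores = np.array([70.0, 75.0, 82.0, 86.0])

try:
    LinearRegression().fit(hours_1d, scores)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print("1D X rejected:", error_message is not None)

# Fix: reshape to (n_samples, 1) before fitting.
model = LinearRegression().fit(hours_1d.reshape(-1, 1), scores)
print(model.coef_.shape)  # (1,)
```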
Summary
Simple linear regression uses one input feature to predict one numeric target along a line y = mx + b. In Python, load CSV data with pandas, split with train_test_split, train LinearRegression().fit(), predict with predict(), evaluate with MAE, RMSE, and R², plot with matplotlib (sort x for a clean line), and save outputs with to_csv() when needed.

