Regression With Scikit-Learn

Machine Learning

Python

Regression

Scikit-Learn

A detailed predictive modeling project using Scikit-Learn to estimate housing prices through simple and multiple linear regression techniques.

Author

Elijah

Published

March 30, 2025

🎯 Executive Summary

In this analysis, I built a predictive model for residential house prices using the Ames Housing Dataset. By evaluating four key architectural features, I identified that Living Area (GrLivArea) is the strongest individual predictor of value, though a multivariate approach significantly improved accuracy.

📈 Analysis & Insights

The Impact of Living Area

The initial exploratory analysis revealed a strong positive linear correlation between Above Grade Living Area and Sale Price. This suggests that for every square foot added, property value increases predictably, making it a primary feature for the regression model.

Model Performance Comparison

I compared four single-variable models against a combined multivariate model. The results were as follows:

Feature	MAE	RMSE	R² Score
GrLivArea	$36,258	$55,420	0.52
TotalBsmtSF	$40,560	$61,230	0.44
Multivariate (All 4)	$30,120	$48,900	0.65

Conclusion: Moving from a simple linear regression to a multivariate model reduced the Root Mean Square Error (RMSE) by approximately 12%, proving that property value is a multifaceted calculation.

🛠️ Requirements (NTU Assignment Brief)

Click to view the original project requirements

TODO	Task to complete
1	Download the Housing dataset `train.csv` and the associated text file `data_description.txt`. Import the complete dataset “train.csv” as a dataframe. Note that the Housing Data has a Numeric (continuous) Variable named “SalePrice”. In this assignment, we will try to predict this variable using other numeric attributes/variables.
2	Plot `SalePrice` against `GrLivArea`, and note the strong linear relationship.
3	Partition both `GrLivArea` and `SalePrice` data into Train (1100 rows) and Test (360 rows) sets.
4	Import Linear Regression model from Scikit-Learn. Training: Fit a Linear Regression model. Print the coefficients of the Linear Regression model you just fit.
5	Predict `SalePrice` for the test dataset using the Linear Regression model, and print out the performance of the model including MAE, MSE and RMSE.
6	Perform all the above steps on “SalePrice” against each of the variables “LotArea”, “TotalBsmtSF”, “GarageArea” one by one to perform individual Linear Regressions. Compare their performance, and determine which model is the best to predict “SalePrice”.
7	Note that `LinearRegression()` model can take more than one variables to model the dependent/target variable. Use this feature to fit a Linear Regression model to predict “SalePrice” using all the four variables “GrLivArea”, “LotArea”, “TotalBsmtSF”, and “GarageArea”. Print out their coefficients.
8	Perform feature selection, to choose a best set (among the above four variables) of features to fit a regression model for predicting `SalePrice`.

Technical Design Elements

Data Partitioning: Strategic splitting of the dataset into 1,100 training records and 360 test records to ensure robust model validation.
Feature Selection: Use of SelectKBest with f_regression to identify the most statistically significant predictors of property value.
Modeling Pipeline: Implementation of Scikit-Learn’s LinearRegression for both univariate and multivariate analysis.
Evaluation Metrics: Use of R-squared ($R^2$), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE) to quantify model performance across different feature sets.

1. Exploratory Data Analysis

The project begins by visualizing the strongest individual correlations to confirm linear assumptions.

Question: Plot SalePrice against Ground Living Area (GrLivArea) to observe the relationship.

View Plotting Code

# Extract from Required Assignment18.1_solution.ipynb_Regression with scikit-learn Kaielijah-1.py
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.scatter(df['GrLivArea'], df['SalePrice'], alpha=0.5)
plt.title('SalePrice vs GrLivArea')
plt.xlabel('Above Grade (Ground) Living Area (square feet)')
plt.ylabel('Sale Price ($)')
plt.grid(True)
plt.show()

Analysis: The scatter plot reveals a strong, positive linear relationship between living area and sale price, identifying it as a primary candidate for regression modeling.

2. Simple Linear Regression Comparison

Four individual features were tested independently to determine which single attribute provides the best predictive power.

Question: Build and evaluate simple regression models for GrLivArea, LotArea, TotalBsmtSF, and GarageArea.

View Simple Regression Code

# Extract from Required Assignment18.1_solution.ipynb_Regression with scikit-learn kaielijah-1 (1).py
from sklearn.linear_model import LinearRegression

def build_simple_model(feature_name):
    X = train_df[[feature_name]]
    y = train_df['SalePrice']
    model = LinearRegression().fit(X, y)
    return model

# Evaluation across different univariate models

Individual Model Results:

Ground Living Area: Most effective single predictor.
Garage Area: Significant correlation with price.
Basement Area: Reliable indicator of value.
Lot Area: Positive but significantly more dispersed relationship.

3. Multiple Linear Regression & Feature Importance

To improve accuracy, the model was expanded to include multiple features simultaneously. SelectKBest was utilized to determine the optimal feature combination.

Question: Combine all four features into a single model and use statistical methods to rank their importance.

View Multivariate & Selection Code

# Extract from Required Assignment18.1_solution.ipynb_Regression with scikit-learn Kaielijah-1.py
from sklearn.feature_selection import SelectKBest, f_regression

# Identifying top features
selector = SelectKBest(score_func=f_regression, k='all')
selector.fit(train_df[['GrLivArea', 'LotArea', 'TotalBsmtSF', 'GarageArea']], train_df['SalePrice'])

# Training Multivariate Model
X_multi = train_df[['GrLivArea', 'LotArea', 'TotalBsmtSF', 'GarageArea']]
multi_model = LinearRegression().fit(X_multi, train_df['SalePrice'])

Feature Ranking Analysis: Analysis: The F-scores confirm that Ground Living Area and Total Basement Square Footage are the most influential variables in the multivariate model.

4. Final Comparison of All Approaches

Question: Compare the R-squared values of all models to determine the most effective predictive approach.

Answer: The multivariate model (4 features) outperformed all univariate models, demonstrating that combining physical house attributes leads to a more comprehensive understanding of market value.

Model Approach	R-squared ($R^2$)
LotArea Only	0.0811
TotalBsmtSF Only	0.3707
GarageArea Only	0.4079
GrLivArea Only	0.5212
Best 2 Features (GrLivArea, TotalBsmtSF)	0.6120
All 4 Features	0.6273

Actual vs. Predicted Performance: Analysis: The final multivariate model shows a tighter clustering around the identity line compared to simple regression, though some variance remains at higher price points where non-linear factors likely influence value.

Summary

The study concludes that ground living area is the single most important numerical factor in determining house prices. However, a Multiple Linear Regression approach provides a superior fit ($R^2 = 0.6273$), as it accounts for the cumulative impact of basement size, garage capacity, and land area. This highlights the importance of multi-feature engineering in real estate valuation models.

Contribution: This project was completed independently by Elijah, utilizing the Ames Housing Dataset and Scikit-Learn library for regression analysis. All code and insights are original and have not been shared with any other party.