Credit Card Fraud Detection

Categories: Python, Data Analysis, Machine Learning, Finance
A comprehensive Python analysis implementing Random Forest and class-balancing techniques to secure financial transactions.
Author: Elijah

Published: April 1, 2025

Project Summary

This project addresses the critical challenge of detecting fraudulent credit card transactions within a highly imbalanced dataset (0.17% fraud rate). Utilizing a dataset of European cardholder transactions, the analysis focuses on identifying patterns in PCA-transformed features while handling the extreme rarity of fraud cases. The solution implements a robust Random Forest classification pipeline, featuring custom feature engineering and cost-sensitive learning to minimize financial losses from undetected fraud.

Technical Design Elements

The project architecture is built on four technical pillars:

  • Data Integrity: Automated auditing for duplicates and memory optimization.

  • Advanced Feature Engineering: Addressing feature skewness via Log transformations and creating temporal features (Time-of-day bins) to capture cyclical fraud patterns.

  • Resampling Strategy: Evaluation of SMOTE, Random Over-sampling, and Hybrid methods to provide the model with sufficient minority class signal.

  • Cost-Sensitive Optimization: Shifting from standard accuracy metrics to a cost-based evaluation where False Negatives (missed fraud) are weighted significantly higher than False Positives.
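To see why plain accuracy is a poor yardstick for this problem, consider a classifier that labels every transaction as legitimate. The sketch below uses the transaction and fraud counts reported for the dataset in this post; the function name `majority_baseline` is illustrative and does not come from the project code.

```python
def majority_baseline(n_total: int, n_fraud: int) -> dict:
    """Metrics for a model that predicts 'non-fraud' for every transaction."""
    true_negatives = n_total - n_fraud   # every legitimate case counts as correct
    false_negatives = n_fraud            # every fraud case is missed
    return {
        "accuracy": true_negatives / n_total,
        "recall": 0.0,                   # no fraud is ever caught
        "missed_fraud": false_negatives,
    }


# Counts as reported for the dataset in this post.
result = majority_baseline(n_total=284_807, n_fraud=492)
print(f"accuracy = {result['accuracy']:.2%}, recall = {result['recall']:.0%}")
```

The do-nothing model scores roughly 99.8% accuracy while catching no fraud at all, which is why the evaluation weights missed fraud explicitly.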

1. Dataset Inspection

The initial audit confirmed the dataset's structure and found 1,081 duplicate rows (about 0.4% of the 284,807 transactions), which were addressed to ensure model validity.

# Extract from MH6804_Required Group_Project_Report Team 4.py
import pandas as pd


def inspect_dataset(file_path):
    """Load the CSV and print its size, memory footprint, and duplicate count."""
    df = pd.read_csv(file_path)
    print("DATASET OVERVIEW")
    print(f"Dataset Dimensions: {df.shape[0]} rows, {df.shape[1]} columns")
    print(f"Memory Usage: {df.memory_usage().sum() / 1024 ** 2:.2f} MB")
    print(f"Duplicate Rows: {df.duplicated().sum()}")
    return df


df = inspect_dataset('creditcard.csv')
Metric              Value
Total Transactions  284,807
Fraudulent Cases    492 (0.17%)
Duplicate Rows      1,081
Data Size           67.36 MB
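The post does not say exactly how the duplicates were handled or how memory was optimized. One common approach, shown here only as a sketch under that assumption, is to drop exact duplicate rows and then downcast 64-bit floats to 32-bit. The helper name `audit_and_compact` and the toy rows are hypothetical.

```python
import pandas as pd


def audit_and_compact(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate rows, then downcast float64 columns to float32."""
    before_mb = df.memory_usage(deep=True).sum() / 1024 ** 2
    # Deduplicate first: downcasting could make nearly equal rows look identical.
    cleaned = df.drop_duplicates().reset_index(drop=True)
    float_cols = cleaned.select_dtypes(include="float64").columns
    cleaned = cleaned.astype({col: "float32" for col in float_cols})
    after_mb = cleaned.memory_usage(deep=True).sum() / 1024 ** 2
    print(f"Rows: {len(df)} -> {len(cleaned)}")
    print(f"Memory: {before_mb:.4f} MB -> {after_mb:.4f} MB")
    return cleaned


# Toy rows (made up for illustration); the first two are exact duplicates.
demo = pd.DataFrame({
    "Time": [0.0, 0.0, 1.0, 2.0],
    "V1": [-1.36, -1.36, 1.19, -0.97],
    "Amount": [149.62, 149.62, 2.69, 123.5],
    "Class": [0, 0, 0, 1],
})
compact = audit_and_compact(demo)
```

Downcasting trades precision for memory, so it is worth checking that model metrics do not shift after the conversion.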

2. Feature Engineering & Exploration

Since most features are PCA-anonymized, the engineering focuses on the ‘Amount’ and ‘Time’ variables. Correlation analysis reveals which components carry the strongest signals for fraud.

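import numpy as np
import pandas as pd

# Handling skewed Amount: log1p is safe for zero-value transactions
df['log_amount'] = np.log1p(df['Amount'])

# Temporal engineering: Time is in seconds, so convert it to an hour of day in [0, 24)
df['hour'] = (df['Time'] / 3600) % 24
# right=False makes the bins [0, 6), [6, 12), ... so hour 0 is labelled 'Night' rather than NaN
df['part_of_day'] = pd.cut(df['hour'], bins=[0, 6, 12, 18, 24], right=False,
                           labels=['Night', 'Morning', 'Afternoon', 'Evening'])

# Analyzing correlations (numeric columns only; part_of_day is categorical)
corr_matrix = df.corr(numeric_only=True)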
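The same transformations can be tried on a handful of rows to see what each step produces. The sketch below uses made-up values and assumes a `Class` column holding the fraud label; it bins with `right=False` so that an hour of exactly 0 lands in 'Night', and it ranks the numeric features by absolute correlation with the label.

```python
import numpy as np
import pandas as pd

# Toy transactions (values are illustrative, not taken from the dataset).
toy = pd.DataFrame({
    "Time": [0, 3600 * 7, 3600 * 13, 3600 * 20, 3600 * 30],  # seconds elapsed
    "Amount": [0.0, 9.99, 120.0, 2500.0, 45.5],
    "Class": [0, 0, 0, 1, 0],
})

toy["log_amount"] = np.log1p(toy["Amount"])   # log(1 + x) handles Amount == 0
toy["hour"] = (toy["Time"] / 3600) % 24       # 30 hours elapsed wraps to hour 6
toy["part_of_day"] = pd.cut(
    toy["hour"],
    bins=[0, 6, 12, 18, 24],
    labels=["Night", "Morning", "Afternoon", "Evening"],
    right=False,                              # left-closed bins: [0, 6), [6, 12), ...
)

# Rank numeric features by the strength of their linear association with fraud.
strength = (
    toy.corr(numeric_only=True)["Class"]
    .drop("Class")
    .abs()
    .sort_values(ascending=False)
)
print(toy[["hour", "part_of_day", "log_amount"]])
print(strength)
```

Because the hour is derived from seconds elapsed rather than a wall-clock timestamp, the part-of-day labels are best read as relative offsets unless the recording start time is known.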

[Figure: Predictive Power by Feature (feature importance)]

[Figure: Cost Curve Optimization (cost curve)]

[Figure: Final Test Set Performance (confusion matrix)]

3. Handling Class Imbalance

To prevent the model from defaulting to “Non-Fraud” for every case, SMOTE (Synthetic Minority Over-sampling Technique) was applied to the training set to balance its class distribution.

from imblearn.over_sampling import SMOTE

# Resampling to create a balanced training environment
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

[Figure: Effect of Sampling on Data Distribution (sampling comparison)]

Analysis: This comparison highlights how SMOTE populates the minority class feature space, allowing the classifier to learn the decision boundary for fraud rather than being overwhelmed by the majority class.
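To make the idea of populating the minority feature space concrete, here is a simplified NumPy illustration of the interpolation step behind SMOTE. It is not the imbalanced-learn implementation used in the project: the function `smote_like_samples`, its parameters, and the toy points are all hypothetical.

```python
import numpy as np


def smote_like_samples(X_minority, n_new, k=3, seed=42):
    """Create synthetic points on segments between minority samples and their neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    # Pairwise Euclidean distances within the minority class.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)                  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]    # k nearest neighbours per point
    base_idx = rng.integers(0, n, size=n_new)        # pick random minority samples
    picked = rng.integers(0, k, size=n_new)          # pick one neighbour for each
    nb_idx = neighbours[base_idx, picked]
    gap = rng.random((n_new, 1))                     # interpolation factor in [0, 1)
    return X_minority[base_idx] + gap * (X_minority[nb_idx] - X_minority[base_idx])


# Four toy "fraud" points in a 2-D feature space (requires k < number of points).
fraud = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
synthetic = smote_like_samples(fraud, n_new=6)
print(synthetic)
```

Each synthetic point lies on a line segment between two real minority samples, so the new data fills in the region the fraud cases already occupy instead of duplicating them exactly.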

4. Final Model Evaluation

The Random Forest model was selected as the best performer. Beyond standard metrics, the project uses a feature importance analysis to maintain model transparency and a cost curve to find the most economical operating threshold.

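import numpy as np
from sklearn.metrics import confusion_matrix

# Feature importance extraction (rf_model is the fitted Random Forest)
importances = rf_model.feature_importances_

# Cost optimization: $100 penalty for FN, $5 penalty for FP
# thresholds: candidate probability cut-offs; y_proba: predicted fraud probabilities
costs = []
for t in thresholds:
    y_pred = (y_proba >= t).astype(int)
    # labels=[0, 1] keeps the matrix 2x2 even when every prediction falls in one class
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    costs.append((fn * 100) + (fp * 5))

min_cost_threshold = thresholds[np.argmin(costs)]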

[Figure: Predictive Power by Feature (feature importance)]

Analysis: The importance plot confirms that V17 and V14 are the most critical variables for the Random Forest model, validating the insights previously seen in the correlation heatmap.
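A ranking like this is typically produced by pairing `feature_importances_` with the column names and sorting. The sketch below does so on synthetic data where only one feature carries signal; the column names, sample size, and hyperparameters are assumptions for illustration, not the project's settings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the label depends only on column f2.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(600, 4)), columns=["f0", "f1", "f2", "f3"])
y = (X["f2"] > 1.0).astype(int)

rf_model = RandomForestClassifier(n_estimators=50, random_state=0)
rf_model.fit(X, y)

# Pair each impurity-based importance with its feature name, largest first.
ranked = pd.Series(rf_model.feature_importances_, index=X.columns)
ranked = ranked.sort_values(ascending=False)
print(ranked)
```

Impurity-based importances sum to 1 and can favor features with many distinct values, so agreement with an independent check such as the correlation analysis is a useful sanity test.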

[Figure: Cost Curve Optimization (cost curve)]

Analysis: By plotting the financial impact across various probability thresholds, we identify the “Sweet Spot” that balances the high cost of missing a fraud case against the operational cost of investigating false alarms.
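The threshold search can be reproduced end to end on a few hand-made predictions. This sketch uses the $100 and $5 penalties quoted in the post; the labels, probabilities, threshold grid, and the helper `cost_at_threshold` are made up for illustration.

```python
import numpy as np

FN_COST, FP_COST = 100, 5   # penalties quoted in the post


def cost_at_threshold(y_true, y_proba, threshold):
    """Total cost of flagging every transaction with probability >= threshold."""
    y_pred = y_proba >= threshold
    fn = np.sum((y_true == 1) & ~y_pred)   # fraud that was not flagged
    fp = np.sum((y_true == 0) & y_pred)    # legitimate transactions flagged
    return int(fn * FN_COST + fp * FP_COST)


# Toy labels and predicted fraud probabilities (illustrative only).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 1])
y_proba = np.array([0.05, 0.12, 0.22, 0.35, 0.42, 0.55, 0.45, 0.72, 0.15, 0.93])

thresholds = np.linspace(0.1, 0.9, 9)
costs = [cost_at_threshold(y_true, y_proba, t) for t in thresholds]
best = thresholds[int(np.argmin(costs))]
print(f"lowest cost ${min(costs)} at threshold {best:.1f}")
```

Because a missed fraud costs twenty times as much as a false alarm here, the cheapest threshold sits below 0.5: the search accepts a couple of false positives to avoid missing a fraud case.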

[Figure: Final Test Set Performance (confusion matrix)]

Analysis: The matrix demonstrates the final model’s high precision (0.89) and recall (0.82) on the unseen test set, successfully capturing the majority of fraudulent transactions with minimal false positives.
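For readers less familiar with these metrics, both follow directly from the confusion matrix counts. The counts below are hypothetical, chosen only so that the ratios round to the figures quoted above; they are not the project's actual confusion matrix.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp)   # share of flagged transactions that are fraud
    recall = tp / (tp + fn)      # share of fraud cases that were flagged
    return precision, recall


# Hypothetical counts for illustration only.
precision, recall = precision_recall(tp=82, fp=10, fn=18)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

True negatives appear in neither formula, which is why these two metrics stay informative even when legitimate transactions vastly outnumber fraud.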

Contribution: This project was jointly developed by Daniel Lim, Mark Joseph Fabre, and Jes Bee Lian.