Your First Machine Learning Model: Linear Regression from Scratch in Python
Machine learning often seems like a black box filled with complex libraries and abstract concepts. But at its core, every sophisticated model builds upon fundamental principles that you can understand and implement yourself. In this tutorial, we'll implement linear regression from scratch in pure Python—no scikit-learn, no TensorFlow, just NumPy for basic array operations and mathematics.
Why Start with Linear Regression?
Linear regression is the "hello world" of machine learning for good reasons:
- Conceptual Simplicity: It demonstrates the core supervised learning paradigm—learning from labeled data to make predictions
- Mathematical Transparency: Every step can be explained with basic calculus and linear algebra
- Historical Significance: It's been used for centuries (as least squares) before computers existed
- Foundation for Complexity: Neural networks are essentially stacked, non-linear versions of regression
The Problem We're Solving
Imagine you're a real estate analyst trying to predict house prices based on their size (square footage). You have historical data showing the relationship:
Size (sq ft) | Price ($)
-------------|----------
1000 | 250,000
1500 | 375,000
2000 | 500,000
We want to find the line that best fits this data: price = w × size + b, where w is the weight (slope) and b is the bias (y-intercept).
Mathematical Foundations
Linear regression aims to minimize the Mean Squared Error (MSE):
1MSE = (1/n) × Σ(y_i - (w×x_i + b))²
2
Where:
n is the number of data points
x_i is the i-th input feature
.
y_i is the i-th target value
w and b are the parameters we're learning
We find optimal w and b using gradient descent—iteratively adjusting parameters in the direction that reduces error.
Implementation: Step by Step
First, let's set up our environment:
1import numpy as np
2import matplotlib.pyplot as plt
3
4# Set random seed for reproducibility
5np.random.seed(42)
6
1. Generate Synthetic Data
We'll create realistic-looking data with some noise:
1def generate_linear_data(n_samples=100, true_w=2.5, true_b=10.0, noise_scale=5.0):
2 """
3 Generate synthetic linear data with Gaussian noise.
4
5 Parameters:
6 - n_samples: Number of data points
7 - true_w: True slope (weight)
8 - true_b: True intercept (bias)
9 - noise_scale: Standard deviation of Gaussian noise
10
11 Returns:
12 - X: Input features (n_samples, 1)
13 - y: Target values with noise
14 """
15 X = np.random.randn(n_samples, 1) * 10 # Features centered around 0, scaled
16 y = true_w * X + true_b + np.random.randn(n_samples, 1) * noise_scale
17 return X, y
18
19# Generate our dataset
20X, y = generate_linear_data(n_samples=50, true_w=250, true_b=50000, noise_scale=30000)
21print(f"Dataset shape: X={X.shape}, y={y.shape}")
22print(f"First 5 samples:\nX: {X[:5].flatten()}\ny: {y[:5].flatten()}")
23
2. Initialize Parameters
We start with random parameters that we'll improve:
1def initialize_parameters():
2 """Initialize weight and bias with small random values."""
3 w = np.random.randn(1, 1) * 0.01 # Small random weight
4 b = np.random.randn(1, 1) * 0.01 # Small random bias
5 return w, b
6
7w, b = initialize_parameters()
8print(f"Initial parameters: w={w[0,0]:.4f}, b={b[0,0]:.4f}")
9
3. Define the Model and Loss Function
The forward pass computes predictions, and we calculate MSE:
1def forward_pass(X, w, b):
2 """Compute predictions: y_pred = X * w + b"""
3 return X.dot(w) + b
4
5def compute_loss(y_pred, y_true):
6 """Calculate Mean Squared Error."""
7 n = len(y_true)
8 errors = y_pred - y_true
9 mse = np.sum(errors ** 2) / n
10 return mse
11
12# Test our initial predictions
13y_pred_initial = forward_pass(X, w, b)
14initial_loss = compute_loss(y_pred_initial, y)
15print(f"Initial loss (MSE): {initial_loss:.2f}")
16
4. Implement Gradient Descent
This is where the learning happens. We compute gradients and update parameters:
1def compute_gradients(X, y_pred, y_true):
2 """
3 Compute gradients of MSE with respect to w and b.
4
5 ∂MSE/∂w = (2/n) * Σ (X_i * (y_pred_i - y_true_i))
6 ∂MSE/∂b = (2/n) * Σ (y_pred_i - y_true_i)
7 """
8 n = len(y_true)
9 error = y_pred - y_true
10
11 # Gradient for weight
12 dw = (2/n) * X.T.dot(error)
13
14 # Gradient for bias
15 db = (2/n) * np.sum(error)
16
17 return dw, db
18
19def update_parameters(w, b, dw, db, learning_rate=0.01):
20 """Update parameters using gradient descent."""
21 w = w - learning_rate * dw
22 b = b - learning_rate * db
23 return w, b
24
5. Training Loop
Now we put everything together:
1def train_linear_regression(X, y, learning_rate=0.01, epochs=1000):
2 """Complete training loop for linear regression."""
3 # Initialize parameters
4 w, b = initialize_parameters()
5
6 # Store loss history for plotting
7 losses = []
8
9 for epoch in range(epochs):
10 # Forward pass
11 y_pred = forward_pass(X, w, b)
12
13 # Compute loss
14 loss = compute_loss(y_pred, y)
15 losses.append(loss)
16
17 # Compute gradients
18 dw, db = compute_gradients(X, y_pred, y)
19
20 # Update parameters
21 w, b = update_parameters(w, b, dw, db, learning_rate)
22
23 # Print progress every 100 epochs
24 if epoch % 100 == 0:
25 print(f"Epoch {epoch}: Loss = {loss:.2f}, w = {w[0,0]:.4f}, b = {b[0,0]:.4f}")
26
27 return w, b, losses
28
29# Train our model
30w_trained, b_trained, losses = train_linear_regression(X, y, learning_rate=0.001, epochs=2000)
31print(f"\nTrained parameters: w = {w_trained[0,0]:.4f}, b = {b_trained[0,0]:.4f}")
32print(f"Final loss: {losses[-1]:.2f}")
33
6. Visualization and Evaluation
Let's see how well our model learned:
1def plot_results(X, y, w, b, losses):
2 """Visualize data, fitted line, and training progress."""
3 fig, axes = plt.subplots(1, 2, figsize=(12, 4))
4
5 # Left: Data and regression line
6 axes[0].scatter(X, y, alpha=0.5, label='Data')
7
8 # Generate line from learned parameters
9 X_line = np.array([[X.min()], [X.max()]])
10 y_line = forward_pass(X_line, w, b)
11 axes[0].plot(X_line, y_line, 'r-', linewidth=2, label=f'Learned: y = {w[0,0]:.1f}x + {b[0,0]:.1f}')
12
13 axes[0].set_xlabel('Feature (X)')
14 axes[0].set_ylabel('Target (y)')
15 axes[0].set_title('Linear Regression Fit')
16 axes[0].legend()
17 axes[0].grid(True, alpha=0.3)
18
19 # Right: Loss curve
20 axes[1].plot(losses)
21 axes[1].set_xlabel('Epoch')
22 axes[1].set_ylabel('Mean Squared Error')
23 axes[1].set_title('Training Loss Over Time')
24 axes[1].grid(True, alpha=0.3)
25 axes[1].set_yscale('log') # Log scale for better visibility
26
27 plt.tight_layout()
28 plt.show()
29
30plot_results(X, y, w_trained, b_trained, losses)
31
Key Concepts Demonstrated
Through this implementation, you've learned:
- Loss Function: How we quantify model error (MSE)
- Gradient Descent: How we iteratively improve parameters
- Learning Rate: The step size for parameter updates (too large diverges, too small converges slowly)
- Forward Pass: Computing predictions from inputs and parameters
- Backward Pass: Computing gradients to guide updates
Common Pitfalls and Solutions
1. Learning Rate Too High
1# Symptoms: Loss diverges (gets larger, not smaller)
2# Solution: Reduce learning rate from 0.01 to 0.001 or lower
3
2. Feature Scaling Matters
1# If features have very different scales, gradient descent struggles
2# Solution: Normalize features before training
3X_normalized = (X - X.mean()) / X.std()
4
3. Local Minima vs Global Minima
For convex problems like linear regression, gradient descent always finds the global minimum. For non-convex problems (like neural networks), this isn't guaranteed.
From Here to Real Machine Learning
This simple linear regression contains the DNA of all supervised learning:
- Neural Networks: Stack multiple layers of such computations with non-linear activations
- Regularization: Add penalty terms to loss to prevent overfitting
- Optimization: More sophisticated algorithms (Adam, RMSProp) build on gradient descent
- Vectorization: Real implementations use efficient matrix operations
Practical Exercise
Try modifying the code to:
- Add L2 regularization (ridge regression)
- Implement stochastic gradient descent (update after each sample)
- Handle multiple features (multivariate linear regression)
- Compare with scikit-learn's implementation
Conclusion
Building linear regression from scratch demystifies machine learning. You now understand what happens inside library functions like LinearRegression.fit()—it's gradient descent minimizing MSE. This foundation prepares you for more complex models where the principles remain the same, but the implementations scale.
Remember: All deep learning is just regression with more layers, non-linearities, and clever optimization. Start simple, understand thoroughly, then build complexity.
Next Steps: Experiment with the code, break it, fix it, and extend it. The best way to learn is by doing.