Your First Machine Learning Model: Linear Regression from Scratch in Python

Machine learning often seems like a black box filled with complex libraries and abstract concepts. But at its core, every sophisticated model builds upon fundamental principles that you can understand and implement yourself. In this tutorial, we'll implement linear regression from scratch in pure Python—no scikit-learn, no TensorFlow, just NumPy for basic array operations and mathematics.

Why Start with Linear Regression?

Linear regression is the "hello world" of machine learning for good reasons:

Conceptual Simplicity: It demonstrates the core supervised learning paradigm—learning from labeled data to make predictions
Mathematical Transparency: Every step can be explained with basic calculus and linear algebra
Historical Significance: It's been used for centuries (as least squares) before computers existed
Foundation for Complexity: Neural networks are essentially stacked, non-linear versions of regression

The Problem We're Solving

Imagine you're a real estate analyst trying to predict house prices based on their size (square footage). You have historical data showing the relationship:

Size (sq ft) | Price ($)
-------------|----------
1000         | 250,000
1500         | 375,000
2000         | 500,000

We want to find the line that best fits this data: price = w × size + b, where w is the weight (slope) and b is the bias (y-intercept).

Mathematical Foundations

Linear regression aims to minimize the Mean Squared Error (MSE):

math

1MSE = (1/n) × Σ(y_i - (w×x_i + b))²
2

Where:

n is the number of data points
x_i is the i-th input feature . y_i is the i-th target value
w and b are the parameters we're learning

We find optimal w and b using gradient descent—iteratively adjusting parameters in the direction that reduces error.

Implementation: Step by Step

First, let's set up our environment:

python

1import numpy as np
2import matplotlib.pyplot as plt
3
4# Set random seed for reproducibility
5np.random.seed(42)
6

1. Generate Synthetic Data

We'll create realistic-looking data with some noise:

python

1def generate_linear_data(n_samples=100, true_w=2.5, true_b=10.0, noise_scale=5.0):
2    """
3    Generate synthetic linear data with Gaussian noise.
4    
5    Parameters:
6    - n_samples: Number of data points
7    - true_w: True slope (weight)
8    - true_b: True intercept (bias)
9    - noise_scale: Standard deviation of Gaussian noise
10    
11    Returns:
12    - X: Input features (n_samples, 1)
13    - y: Target values with noise
14    """
15    X = np.random.randn(n_samples, 1) * 10  # Features centered around 0, scaled
16    y = true_w * X + true_b + np.random.randn(n_samples, 1) * noise_scale
17    return X, y
18
19# Generate our dataset
20X, y = generate_linear_data(n_samples=50, true_w=250, true_b=50000, noise_scale=30000)
21print(f"Dataset shape: X={X.shape}, y={y.shape}")
22print(f"First 5 samples:\nX: {X[:5].flatten()}\ny: {y[:5].flatten()}")
23

2. Initialize Parameters

We start with random parameters that we'll improve:

python

1def initialize_parameters():
2    """Initialize weight and bias with small random values."""
3    w = np.random.randn(1, 1) * 0.01  # Small random weight
4    b = np.random.randn(1, 1) * 0.01  # Small random bias
5    return w, b
6
7w, b = initialize_parameters()
8print(f"Initial parameters: w={w[0,0]:.4f}, b={b[0,0]:.4f}")
9

3. Define the Model and Loss Function

The forward pass computes predictions, and we calculate MSE:

python

1def forward_pass(X, w, b):
2    """Compute predictions: y_pred = X * w + b"""
3    return X.dot(w) + b
4
5def compute_loss(y_pred, y_true):
6    """Calculate Mean Squared Error."""
7    n = len(y_true)
8    errors = y_pred - y_true
9    mse = np.sum(errors ** 2) / n
10    return mse
11
12# Test our initial predictions
13y_pred_initial = forward_pass(X, w, b)
14initial_loss = compute_loss(y_pred_initial, y)
15print(f"Initial loss (MSE): {initial_loss:.2f}")
16

4. Implement Gradient Descent

This is where the learning happens. We compute gradients and update parameters:

python

1def compute_gradients(X, y_pred, y_true):
2    """
3    Compute gradients of MSE with respect to w and b.
4    
5    ∂MSE/∂w = (2/n) * Σ (X_i * (y_pred_i - y_true_i))
6    ∂MSE/∂b = (2/n) * Σ (y_pred_i - y_true_i)
7    """
8    n = len(y_true)
9    error = y_pred - y_true
10    
11    # Gradient for weight
12    dw = (2/n) * X.T.dot(error)
13    
14    # Gradient for bias  
15    db = (2/n) * np.sum(error)
16    
17    return dw, db
18
19def update_parameters(w, b, dw, db, learning_rate=0.01):
20    """Update parameters using gradient descent."""
21    w = w - learning_rate * dw
22    b = b - learning_rate * db
23    return w, b
24

5. Training Loop

Now we put everything together:

python

1def train_linear_regression(X, y, learning_rate=0.01, epochs=1000):
2    """Complete training loop for linear regression."""
3    # Initialize parameters
4    w, b = initialize_parameters()
5    
6    # Store loss history for plotting
7    losses = []
8    
9    for epoch in range(epochs):
10        # Forward pass
11        y_pred = forward_pass(X, w, b)
12        
13        # Compute loss
14        loss = compute_loss(y_pred, y)
15        losses.append(loss)
16        
17        # Compute gradients
18        dw, db = compute_gradients(X, y_pred, y)
19        
20        # Update parameters
21        w, b = update_parameters(w, b, dw, db, learning_rate)
22        
23        # Print progress every 100 epochs
24        if epoch % 100 == 0:
25            print(f"Epoch {epoch}: Loss = {loss:.2f}, w = {w[0,0]:.4f}, b = {b[0,0]:.4f}")
26    
27    return w, b, losses
28
29# Train our model
30w_trained, b_trained, losses = train_linear_regression(X, y, learning_rate=0.001, epochs=2000)
31print(f"\nTrained parameters: w = {w_trained[0,0]:.4f}, b = {b_trained[0,0]:.4f}")
32print(f"Final loss: {losses[-1]:.2f}")
33

6. Visualization and Evaluation

Let's see how well our model learned:

python

1def plot_results(X, y, w, b, losses):
2    """Visualize data, fitted line, and training progress."""
3    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
4    
5    # Left: Data and regression line
6    axes[0].scatter(X, y, alpha=0.5, label='Data')
7    
8    # Generate line from learned parameters
9    X_line = np.array([[X.min()], [X.max()]])
10    y_line = forward_pass(X_line, w, b)
11    axes[0].plot(X_line, y_line, 'r-', linewidth=2, label=f'Learned: y = {w[0,0]:.1f}x + {b[0,0]:.1f}')
12    
13    axes[0].set_xlabel('Feature (X)')
14    axes[0].set_ylabel('Target (y)')
15    axes[0].set_title('Linear Regression Fit')
16    axes[0].legend()
17    axes[0].grid(True, alpha=0.3)
18    
19    # Right: Loss curve
20    axes[1].plot(losses)
21    axes[1].set_xlabel('Epoch')
22    axes[1].set_ylabel('Mean Squared Error')
23    axes[1].set_title('Training Loss Over Time')
24    axes[1].grid(True, alpha=0.3)
25    axes[1].set_yscale('log')  # Log scale for better visibility
26    
27    plt.tight_layout()
28    plt.show()
29
30plot_results(X, y, w_trained, b_trained, losses)
31

Key Concepts Demonstrated

Through this implementation, you've learned:

Loss Function: How we quantify model error (MSE)
Gradient Descent: How we iteratively improve parameters
Learning Rate: The step size for parameter updates (too large diverges, too small converges slowly)
Forward Pass: Computing predictions from inputs and parameters
Backward Pass: Computing gradients to guide updates

Common Pitfalls and Solutions

1. Learning Rate Too High

python

1# Symptoms: Loss diverges (gets larger, not smaller)
2# Solution: Reduce learning rate from 0.01 to 0.001 or lower
3

2. Feature Scaling Matters

python

1# If features have very different scales, gradient descent struggles
2# Solution: Normalize features before training
3X_normalized = (X - X.mean()) / X.std()
4

3. Local Minima vs Global Minima

For convex problems like linear regression, gradient descent always finds the global minimum. For non-convex problems (like neural networks), this isn't guaranteed.

From Here to Real Machine Learning

This simple linear regression contains the DNA of all supervised learning:

Neural Networks: Stack multiple layers of such computations with non-linear activations
Regularization: Add penalty terms to loss to prevent overfitting
Optimization: More sophisticated algorithms (Adam, RMSProp) build on gradient descent
Vectorization: Real implementations use efficient matrix operations

Practical Exercise

Try modifying the code to:

Add L2 regularization (ridge regression)
Implement stochastic gradient descent (update after each sample)
Handle multiple features (multivariate linear regression)
Compare with scikit-learn's implementation

Conclusion

Building linear regression from scratch demystifies machine learning. You now understand what happens inside library functions like LinearRegression.fit()—it's gradient descent minimizing MSE. This foundation prepares you for more complex models where the principles remain the same, but the implementations scale.

Remember: All deep learning is just regression with more layers, non-linearities, and clever optimization. Start simple, understand thoroughly, then build complexity.

Next Steps: Experiment with the code, break it, fix it, and extend it. The best way to learn is by doing.

Your First Machine Learning Model: Linear Regression from Scratch in Python

Your First Machine Learning Model: Linear Regression from Scratch in Python

Why Start with Linear Regression?

The Problem We're Solving

Mathematical Foundations

Implementation: Step by Step

1. Generate Synthetic Data

2. Initialize Parameters

3. Define the Model and Loss Function

4. Implement Gradient Descent

5. Training Loop

6. Visualization and Evaluation

Key Concepts Demonstrated

Common Pitfalls and Solutions

1. Learning Rate Too High

2. Feature Scaling Matters

3. Local Minima vs Global Minima

From Here to Real Machine Learning

Practical Exercise

Conclusion

AI Editorial Team

Related Articles

Feature Engineering for Tabular Data: Techniques That Actually Work in Production

Beyond Accuracy: Advanced Model Evaluation Metrics for Imbalanced Real-World Datasets

Choosing Your First Cloud ML Service: AWS SageMaker vs Azure ML vs Google Vertex AI