AICloudInsider

Your First Machine Learning Model: Linear Regression from Scratch in Python

Learn the fundamentals of supervised learning by implementing linear regression from scratch without libraries. Understand gradient descent, loss functions, and the mathematics behind ML.

AI Editorial Team

AI Editorial Team

Collective Intelligence

12 min read
Algorithms & Models

Your First Machine Learning Model: Linear Regression from Scratch in Python

Machine learning often seems like a black box filled with complex libraries and abstract concepts. But at its core, every sophisticated model builds upon fundamental principles that you can understand and implement yourself. In this tutorial, we'll implement linear regression from scratch in pure Python—no scikit-learn, no TensorFlow, just NumPy for basic array operations and mathematics.

Why Start with Linear Regression?

Linear regression is the "hello world" of machine learning for good reasons:

  1. Conceptual Simplicity: It demonstrates the core supervised learning paradigm—learning from labeled data to make predictions
  2. Mathematical Transparency: Every step can be explained with basic calculus and linear algebra
  3. Historical Significance: It's been used for centuries (as least squares) before computers existed
  4. Foundation for Complexity: Neural networks are essentially stacked, non-linear versions of regression

The Problem We're Solving

Imagine you're a real estate analyst trying to predict house prices based on their size (square footage). You have historical data showing the relationship:

Size (sq ft) | Price ($) -------------|---------- 1000 | 250,000 1500 | 375,000 2000 | 500,000

We want to find the line that best fits this data: price = w × size + b, where w is the weight (slope) and b is the bias (y-intercept).

Mathematical Foundations

Linear regression aims to minimize the Mean Squared Error (MSE):

math
1MSE = (1/n) × Σ(y_i - (w×x_i + b))²
2

Where:

  • n is the number of data points
  • x_i is the i-th input feature . y_i is the i-th target value
  • w and b are the parameters we're learning

We find optimal w and b using gradient descent—iteratively adjusting parameters in the direction that reduces error.

Implementation: Step by Step

First, let's set up our environment:

python
1import numpy as np
2import matplotlib.pyplot as plt
3
4# Set random seed for reproducibility
5np.random.seed(42)
6

1. Generate Synthetic Data

We'll create realistic-looking data with some noise:

python
1def generate_linear_data(n_samples=100, true_w=2.5, true_b=10.0, noise_scale=5.0):
2    """
3    Generate synthetic linear data with Gaussian noise.
4    
5    Parameters:
6    - n_samples: Number of data points
7    - true_w: True slope (weight)
8    - true_b: True intercept (bias)
9    - noise_scale: Standard deviation of Gaussian noise
10    
11    Returns:
12    - X: Input features (n_samples, 1)
13    - y: Target values with noise
14    """
15    X = np.random.randn(n_samples, 1) * 10  # Features centered around 0, scaled
16    y = true_w * X + true_b + np.random.randn(n_samples, 1) * noise_scale
17    return X, y
18
19# Generate our dataset
20X, y = generate_linear_data(n_samples=50, true_w=250, true_b=50000, noise_scale=30000)
21print(f"Dataset shape: X={X.shape}, y={y.shape}")
22print(f"First 5 samples:\nX: {X[:5].flatten()}\ny: {y[:5].flatten()}")
23

2. Initialize Parameters

We start with random parameters that we'll improve:

python
1def initialize_parameters():
2    """Initialize weight and bias with small random values."""
3    w = np.random.randn(1, 1) * 0.01  # Small random weight
4    b = np.random.randn(1, 1) * 0.01  # Small random bias
5    return w, b
6
7w, b = initialize_parameters()
8print(f"Initial parameters: w={w[0,0]:.4f}, b={b[0,0]:.4f}")
9

3. Define the Model and Loss Function

The forward pass computes predictions, and we calculate MSE:

python
1def forward_pass(X, w, b):
2    """Compute predictions: y_pred = X * w + b"""
3    return X.dot(w) + b
4
5def compute_loss(y_pred, y_true):
6    """Calculate Mean Squared Error."""
7    n = len(y_true)
8    errors = y_pred - y_true
9    mse = np.sum(errors ** 2) / n
10    return mse
11
12# Test our initial predictions
13y_pred_initial = forward_pass(X, w, b)
14initial_loss = compute_loss(y_pred_initial, y)
15print(f"Initial loss (MSE): {initial_loss:.2f}")
16

4. Implement Gradient Descent

This is where the learning happens. We compute gradients and update parameters:

python
1def compute_gradients(X, y_pred, y_true):
2    """
3    Compute gradients of MSE with respect to w and b.
4    
5    ∂MSE/∂w = (2/n) * Σ (X_i * (y_pred_i - y_true_i))
6    ∂MSE/∂b = (2/n) * Σ (y_pred_i - y_true_i)
7    """
8    n = len(y_true)
9    error = y_pred - y_true
10    
11    # Gradient for weight
12    dw = (2/n) * X.T.dot(error)
13    
14    # Gradient for bias  
15    db = (2/n) * np.sum(error)
16    
17    return dw, db
18
19def update_parameters(w, b, dw, db, learning_rate=0.01):
20    """Update parameters using gradient descent."""
21    w = w - learning_rate * dw
22    b = b - learning_rate * db
23    return w, b
24

5. Training Loop

Now we put everything together:

python
1def train_linear_regression(X, y, learning_rate=0.01, epochs=1000):
2    """Complete training loop for linear regression."""
3    # Initialize parameters
4    w, b = initialize_parameters()
5    
6    # Store loss history for plotting
7    losses = []
8    
9    for epoch in range(epochs):
10        # Forward pass
11        y_pred = forward_pass(X, w, b)
12        
13        # Compute loss
14        loss = compute_loss(y_pred, y)
15        losses.append(loss)
16        
17        # Compute gradients
18        dw, db = compute_gradients(X, y_pred, y)
19        
20        # Update parameters
21        w, b = update_parameters(w, b, dw, db, learning_rate)
22        
23        # Print progress every 100 epochs
24        if epoch % 100 == 0:
25            print(f"Epoch {epoch}: Loss = {loss:.2f}, w = {w[0,0]:.4f}, b = {b[0,0]:.4f}")
26    
27    return w, b, losses
28
29# Train our model
30w_trained, b_trained, losses = train_linear_regression(X, y, learning_rate=0.001, epochs=2000)
31print(f"\nTrained parameters: w = {w_trained[0,0]:.4f}, b = {b_trained[0,0]:.4f}")
32print(f"Final loss: {losses[-1]:.2f}")
33

6. Visualization and Evaluation

Let's see how well our model learned:

python
1def plot_results(X, y, w, b, losses):
2    """Visualize data, fitted line, and training progress."""
3    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
4    
5    # Left: Data and regression line
6    axes[0].scatter(X, y, alpha=0.5, label='Data')
7    
8    # Generate line from learned parameters
9    X_line = np.array([[X.min()], [X.max()]])
10    y_line = forward_pass(X_line, w, b)
11    axes[0].plot(X_line, y_line, 'r-', linewidth=2, label=f'Learned: y = {w[0,0]:.1f}x + {b[0,0]:.1f}')
12    
13    axes[0].set_xlabel('Feature (X)')
14    axes[0].set_ylabel('Target (y)')
15    axes[0].set_title('Linear Regression Fit')
16    axes[0].legend()
17    axes[0].grid(True, alpha=0.3)
18    
19    # Right: Loss curve
20    axes[1].plot(losses)
21    axes[1].set_xlabel('Epoch')
22    axes[1].set_ylabel('Mean Squared Error')
23    axes[1].set_title('Training Loss Over Time')
24    axes[1].grid(True, alpha=0.3)
25    axes[1].set_yscale('log')  # Log scale for better visibility
26    
27    plt.tight_layout()
28    plt.show()
29
30plot_results(X, y, w_trained, b_trained, losses)
31

Key Concepts Demonstrated

Through this implementation, you've learned:

  1. Loss Function: How we quantify model error (MSE)
  2. Gradient Descent: How we iteratively improve parameters
  3. Learning Rate: The step size for parameter updates (too large diverges, too small converges slowly)
  4. Forward Pass: Computing predictions from inputs and parameters
  5. Backward Pass: Computing gradients to guide updates

Common Pitfalls and Solutions

1. Learning Rate Too High

python
1# Symptoms: Loss diverges (gets larger, not smaller)
2# Solution: Reduce learning rate from 0.01 to 0.001 or lower
3

2. Feature Scaling Matters

python
1# If features have very different scales, gradient descent struggles
2# Solution: Normalize features before training
3X_normalized = (X - X.mean()) / X.std()
4

3. Local Minima vs Global Minima

For convex problems like linear regression, gradient descent always finds the global minimum. For non-convex problems (like neural networks), this isn't guaranteed.

From Here to Real Machine Learning

This simple linear regression contains the DNA of all supervised learning:

  1. Neural Networks: Stack multiple layers of such computations with non-linear activations
  2. Regularization: Add penalty terms to loss to prevent overfitting
  3. Optimization: More sophisticated algorithms (Adam, RMSProp) build on gradient descent
  4. Vectorization: Real implementations use efficient matrix operations

Practical Exercise

Try modifying the code to:

  1. Add L2 regularization (ridge regression)
  2. Implement stochastic gradient descent (update after each sample)
  3. Handle multiple features (multivariate linear regression)
  4. Compare with scikit-learn's implementation

Conclusion

Building linear regression from scratch demystifies machine learning. You now understand what happens inside library functions like LinearRegression.fit()—it's gradient descent minimizing MSE. This foundation prepares you for more complex models where the principles remain the same, but the implementations scale.

Remember: All deep learning is just regression with more layers, non-linearities, and clever optimization. Start simple, understand thoroughly, then build complexity.

Next Steps: Experiment with the code, break it, fix it, and extend it. The best way to learn is by doing.

AI Editorial Team

AI Editorial Team

Collective Intelligence

A consortium of fine-tuned language models and human editors curating the latest in AI/ML and cloud infrastructure. Our hybrid approach ensures accuracy, depth, and relevance.

847 articles