data-engineeringBuilding Your First RAG System: Practical Guide with LangChain and Pinecone
Step-by-step implementation of Retrieval-Augmented Generation with real code, vector database setup, and production considerations.
Move beyond theory to practical feature engineering techniques that improve model performance in real-world applications. Learn which methods deliver ROI and which are time-wasters.
Sarah Chen
ML Engineer & Cloud AI Specialist
Feature engineering is where data science meets art. While algorithms get most attention, well-engineered features often deliver more performance gains than model selection or hyperparameter tuning. This article focuses on practical techniques that actually work in production—not academic exercises that fail when deployed.
In production ML systems:
We'll focus on techniques with proven ROI across finance, healthcare, e-commerce, and manufacturing applications.
Complex features that humans can't understand:
Features must work consistently:
Production constraints matter:
Bad approach: Always impute with mean/median Good approach: Strategy depends on data and problem:
1import pandas as pd
2import numpy as np
3from sklearn.impute import SimpleImputer, KNNImputer
4from sklearn.experimental import enable_iterative_imputer
5from sklearn.impute import IterativeImputer
6
7def practical_missing_value_handling(df, target_col=None):
8 """
9 Production-ready missing value handling with multiple strategies.
10
11 Rules:
12 1. If <5% missing and MCAR (missing completely at random): drop
13 2. If numerical and MAR (missing at random): iterative imputation
14 3. If categorical: add 'missing' category
15 4. If >30% missing: consider excluding feature entirely
16 """
17 results = {}
18
19 for col in df.columns:
20 missing_pct = df[col].isnull().mean()
21
22 if missing_pct < 0.05:
23 # Small amount of missing data - simple imputation
24 if df[col].dtype in ['int64', 'float64']:
25 imputer = SimpleImputer(strategy='median')
26 results[col] = imputer.fit_transform(df[[col]])
27 else:
28 # For categorical, fill with most frequent or 'missing'
29 df[col] = df[col].fillna('missing')
30
31 elif missing_pct < 0.30:
32 # Moderate missingness - use more sophisticated imputation
33 if df[col].dtype in ['int64', 'float64']:
34 # Iterative imputation models missingness patterns
35 imputer = IterativeImputer(max_iter=10, random_state=42)
36 results[col] = imputer.fit_transform(df[[col]])
37 else:
38 # Categorical with pattern - use KNN based on other features
39 df[col] = df[col].fillna(df[col].mode()[0] if not df[col].mode().empty else 'missing')
40
41 else:
42 # High missingness - consider excluding or flagging
43 print(f"Warning: {col} has {missing_pct:.1%} missing values")
44 # Create missing indicator alongside imputed values
45 df[f"{col}_missing"] = df[col].isnull().astype(int)
46 if df[col].dtype in ['int64', 'float64']:
47 df[col] = df[col].fillna(df[col].median())
48 else:
49 df[col] = df[col].fillna('missing')
50
51 return df
52One-hot encoding explodes dimensionality. Better approaches:
1from sklearn.preprocessing import OrdinalEncoder, TargetEncoder
2from category_encoders import CatBoostEncoder, WOEEncoder
3import warnings
4warnings.filterwarnings('ignore')
5
6def encode_categorical_features(X_train, X_test, y_train, categorical_cols):
7 """
8 Select optimal categorical encoding based on data characteristics.
9
10 Decision tree:
11 1. Low cardinality (<10 categories): Target encoding or one-hot
12 2. High cardinality: CatBoost encoding or frequency encoding
13 3. Tree-based models: Label encoding often works fine
14 4. Linear models: Require one-hot or target encoding
15 """
16 encoded_data = {}
17
18 for col in categorical_cols:
19 n_categories = X_train[col].nunique()
20
21 if n_categories < 10:
22 # Low cardinality - target encoding (smoothing prevents overfit)
23 encoder = TargetEncoder(smooth=20)
24 encoded_data[f"{col}_encoded"] = encoder.fit_transform(
25 X_train[col], y_train
26 ).values
27 # Apply same transformation to test
28 test_encoded = encoder.transform(X_test[col]).values
29
30 elif n_categories < 50:
31 # Medium cardinality - CatBoost encoding (leave-one-out)
32 encoder = CatBoostEncoder()
33 encoded_data[f"{col}_encoded"] = encoder.fit_transform(
34 X_train[col], y_train
35 ).values
36 test_encoded = encoder.transform(X_test[col]).values
37
38 else:
39 # High cardinality - frequency encoding with smoothing
40 # Calculate frequency with additive smoothing
41 freq = X_train[col].value_counts()
42 total = len(X_train)
43 # Laplace smoothing: (count + 1) / (total + n_categories)
44 smoothed_freq = (freq + 1) / (total + n_categories)
45
46 # Map to train and test
47 encoded_data[f"{col}_freq"] = X_train[col].map(smoothed_freq).fillna(1/(total + n_categories)).values
48 test_encoded = X_test[col].map(smoothed_freq).fillna(1/(total + n_categories)).values
49
50 # Also create cardinality feature (often predictive!)
51 encoded_data[f"{col}_cardinality"] = n_categories
52
53 return encoded_data
54Most guides teach log, square, sqrt transforms. What actually helps:
1def meaningful_numerical_transformations(df, numerical_cols):
2 """
3 Create numerical features with actual predictive power.
4 """
5 new_features = {}
6
7 for col in numerical_cols:
8 # 1. Robust scaling (less sensitive to outliers than z-score)
9 median = df[col].median()
10 iqr = df[col].quantile(0.75) - df[col].quantile(0.25)
11 new_features[f"{col}_robust_scaled"] = (df[col] - median) / iqr if iqr > 0 else 0
12
13 # 2. Winsorization (cap outliers at percentiles)
14 p1, p99 = df[col].quantile(0.01), df[col].quantile(0.99)
15 new_features[f"{col}_winsorized"] = df[col].clip(p1, p99)
16
17 # 3. Binning with business logic (not equal-width!)
18 # Use business-defined thresholds or data-driven quantiles
19 if 'age' in col.lower():
20 bins = [0, 18, 25, 35, 50, 65, 100]
21 labels = ['child', 'young_adult', 'adult', 'midlife', 'senior', 'elderly']
22 else:
23 # Data-driven: use deciles
24 bins = [df[col].quantile(q/10) for q in range(11)]
25 labels = [f'decile_{i}' for i in range(10)]
26
27 # 4. Interaction features with other important variables
28 # Only create if domain knowledge suggests interaction
29 if 'income' in col.lower() and 'education' in df.columns:
30 new_features[f"{col}_per_edu"] = df[col] / (df['education_years'] + 1)
31
32 # 5. Time-based features for temporal data
33 if 'date' in col.lower() or 'time' in col.lower():
34 # Assuming it's datetime type
35 new_features[f"{col}_dayofweek"] = df[col].dt.dayofweek
36 new_features[f"{col}_hour"] = df[col].dt.hour
37 new_features[f"{col}_is_weekend"] = df[col].dt.dayofweek >= 5
38
39 return pd.DataFrame(new_features)
40Even non-time-series problems often have temporal components:
1def create_time_aware_features(df, date_col, group_cols=None):
2 """
3 Create time-based features for tabular data.
4
5 Examples:
6 - Days since last event for each customer
7 - Rolling averages of past behavior
8 - Time since first occurrence
9 """
10 df = df.copy()
11 df[date_col] = pd.to_datetime(df[date_col])
12
13 # Sort by date for rolling calculations
14 df = df.sort_values(date_col)
15
16 new_features = {}
17
18 # 1. Time since first occurrence (by group)
19 if group_cols:
20 for group_col in group_cols:
21 df[f"days_since_first_{group_col}"] = df.groupby(group_col)[date_col].transform(
22 lambda x: (x - x.min()).dt.days
23 )
24
25 # 2. Days since last event
26 df['days_since_last_event'] = df.groupby(group_cols)[date_col].transform(
27 lambda x: (x.max() - x).dt.days
28 )
29
30 # 3. Rolling statistics (e.g., 30-day average)
31 for window in [7, 30, 90]: # days
32 # This requires careful handling to avoid data leakage
33 # Use expanding window up to each point
34 df[f'expanding_mean_{window}d'] = df.groupby(group_cols)['value_column'].transform(
35 lambda x: x.expanding(min_periods=1).mean()
36 )
37
38 # 4. Seasonality features
39 df['month'] = df[date_col].dt.month
40 df['quarter'] = df[date_col].dt.quarter
41 df['day_of_year'] = df[date_col].dt.dayofyear
42
43 # Sine/cosine encoding for cyclical features
44 df['month_sin'] = np.sin(2 * np.pi * df['month']/12)
45 df['month_cos'] = np.cos(2 * np.pi * df['month']/12)
46
47 return df
481from sklearn.feature_selection import RFE, SelectFromModel
2from sklearn.ensemble import RandomForestClassifier
3import shap
4
5def production_feature_selection(X, y, model, max_features=50):
6 """
7 Feature selection for production considering:
8 1. Predictive importance
9 2. Stability across time
10 3. Computational cost
11 4. Business interpretability
12 """
13 # Step 1: Remove low-variance features (little information)
14 from sklearn.feature_selection import VarianceThreshold
15 selector = VarianceThreshold(threshold=0.01)
16 X_variance_filtered = selector.fit_transform(X)
17
18 # Step 2: Remove highly correlated features (>0.95)
19 corr_matrix = pd.DataFrame(X_variance_filtered).corr().abs()
20 upper_tri = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
21 to_drop = [column for column in upper_tri.columns if any(upper_tri[column] > 0.95)]
22
23 # Step 3: Recursive Feature Elimination with cross-validation
24 rfe = RFE(
25 estimator=model,
26 n_features_to_select=max_features,
27 step=0.1 # Remove 10% of features each iteration
28 )
29 rfe.fit(X_variance_filtered, y)
30
31 # Step 4: SHAP values for interpretability
32 explainer = shap.TreeExplainer(model)
33 shap_values = explainer.shap_values(X_variance_filtered)
34
35 # Calculate mean absolute SHAP values
36 shap_importance = np.abs(shap_values).mean(axis=0)
37
38 # Combine metrics
39 feature_metrics = pd.DataFrame({
40 'feature': X.columns,
41 'rfe_ranking': rfe.ranking_,
42 'shap_importance': shap_importance,
43 'variance': np.var(X, axis=0)
44 })
45
46 # Filter: Keep features that are in top by RFE AND have reasonable SHAP importance
47 top_rfe = feature_metrics.nsmallest(max_features, 'rfe_ranking')
48 top_shap = feature_metrics.nlargest(max_features, 'shap_importance')
49
50 # Intersection: Features good by both metrics
51 selected_features = set(top_rfe['feature']).intersection(set(top_shap['feature']))
52
53 return list(selected_features)
54Features degrade. Monitor:
1class FeaturePerformanceMonitor:
2 def __init__(self, feature_names):
3 self.feature_names = feature_names
4 self.performance_history = []
5
6 def track_feature_drift(self, X_train, X_prod):
7 """
8 Monitor feature distribution drift.
9 """
10 drift_metrics = {}
11
12 for feature in self.feature_names:
13 # KS test for distribution similarity
14 from scipy.stats import ks_2samp
15 stat, p_value = ks_2samp(X_train[feature], X_prod[feature])
16
17 # Population Stability Index (PSI) - common in finance
18 psi = self.calculate_psi(X_train[feature], X_prod[feature])
19
20 drift_metrics[feature] = {
21 'ks_statistic': stat,
22 'ks_p_value': p_value,
23 'psi': psi,
24 'mean_diff_pct': abs(X_train[feature].mean() - X_prod[feature].mean()) / X_train[feature].mean() * 100
25 }
26
27 return drift_metrics
28
29 def calculate_psi(self, expected, actual, buckets=10):
30 """
31 Calculate Population Stability Index.
32 PSI < 0.1: No significant drift
33 PSI 0.1-0.25: Some drift
34 PSI > 0.25: Significant drift
35 """
36 # Create buckets based on expected distribution
37 breakpoints = np.percentile(expected, np.linspace(0, 100, buckets + 1))
38
39 expected_dist = np.histogram(expected, breakpoints)[0] / len(expected)
40 actual_dist = np.histogram(actual, breakpoints)[0] / len(actual)
41
42 # Replace zeros to avoid division by zero
43 expected_dist = np.where(expected_dist == 0, 0.0001, expected_dist)
44 actual_dist = np.where(actual_dist == 0, 0.0001, actual_dist)
45
46 psi = np.sum((actual_dist - expected_dist) * np.log(actual_dist / expected_dist))
47
48 return psi
49Before deploying any feature:
Feature engineering is the most impactful part of the ML pipeline when done right. Focus on:
The best features aren't the most mathematically sophisticated—they're the ones that reliably improve business outcomes while being maintainable and explainable.
Remember: In production, a simple feature that works consistently is worth 10 clever features that fail unpredictably.
data-engineeringStep-by-step implementation of Retrieval-Augmented Generation with real code, vector database setup, and production considerations.
foundation-modelsLearn the fundamentals of supervised learning by implementing linear regression from scratch without libraries. Understand gradient descent, loss functions, and the mathematics behind ML.
applied-aiAccuracy is misleading when classes are imbalanced. Learn advanced evaluation techniques that reveal true model performance for fraud detection, disease diagnosis, and rare event prediction.