Goal: Understand how a machine learning system moves from raw data to a live, monitored production model—and where things can break.
Why the ML Lifecycle Matters
Many beginners think machine learning ends after training a model.
In reality, training is just the middle of the process.
Most real-world ML failures happen after deployment, not during modeling.
MLOps exists to manage the full lifecycle.
1. Data
What happens:
- Collect raw data (logs, images, text, transactions, etc.)
- Clean, label, and validate data
- Split into train / validation / test sets (see the split sketch after this list)
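To make the split concrete, here is a minimal sketch using pandas and scikit-learn (both assumed to be installed). The file name transactions.csv and the "label" column are placeholders for whatever your raw data looks like; the fixed random_state keeps the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("transactions.csv")            # hypothetical raw data file
X, y = df.drop(columns=["label"]), df["label"]  # "label" is a placeholder column

# Hold out a test set first, then split the remainder into train and validation.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)  # roughly 60% train / 20% validation / 20% test overall
```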
Common failure points:
- Missing or incorrect labels
- Biased or unrepresentative data
- Data leakage (future or test-set information sneaking into training; see the sketch below)
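Leakage often comes from preprocessing rather than from the raw data itself. A common safe pattern, sketched below with scikit-learn and the splits from the previous sketch, is to fit any transformation on the training set only and then apply it unchanged to validation and test.

```python
from sklearn.preprocessing import StandardScaler

# Leaky: fitting the scaler on all rows lets validation/test statistics
# influence the transformation applied to the training data.
# Safe: fit on the training split only, then reuse the fitted scaler.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
```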
2. Training
What happens:
- Select algorithms
- Train models on historical data
- Tune hyperparameters
- Evaluate performance (accuracy, precision, recall, etc.); a training-and-evaluation sketch follows this list
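The sketch below shows one way these steps fit together, reusing the scaled splits from the data section and assuming scikit-learn; the random forest and its parameter grid are illustrative choices, not recommendations.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

# Hyperparameter search using cross-validation on the training split only.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    scoring="f1",   # assumes a binary classification task
    cv=3,
)
search.fit(X_train_scaled, y_train)
model = search.best_estimator_

# Evaluate on the validation split: precision, recall, f1 per class.
print(classification_report(y_val, model.predict(X_val_scaled)))
```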
Outputs:
- Model artifact (file)
- Metrics
- Training logs (a sketch of persisting all three follows this list)
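None of these outputs needs heavy tooling to capture. A bare-bones sketch, using joblib, json, and the standard logging module, with arbitrary file names and the model and search objects from the training sketch:

```python
import json
import logging

import joblib

logging.basicConfig(filename="training.log", level=logging.INFO)

# Model artifact (file)
joblib.dump(model, "model.joblib")

# Metrics
metrics = {"best_params": search.best_params_, "cv_f1": search.best_score_}
with open("metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)

# Training logs
logging.info("Trained model with params=%s, cv_f1=%.4f",
             search.best_params_, search.best_score_)
```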
Common failure points:
- Overfitting to training data (a quick check follows this list)
- Training on outdated data
- Metrics that don't reflect real-world usage
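A quick way to spot the first failure mode is to compare training and validation scores; a large gap is the classic sign of overfitting. This reuses the model and splits from the earlier sketches.

```python
train_score = model.score(X_train_scaled, y_train)
val_score = model.score(X_val_scaled, y_val)
print(f"train={train_score:.3f}  val={val_score:.3f}  gap={train_score - val_score:.3f}")
# A training score far above the validation score suggests the model has
# memorized the training data rather than learned a generalizable pattern.
```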
3. Model (Artifact Management)
What happens:
- Save trained models
- Version models
- Track metadata (data version, parameters, metrics); a versioning sketch follows this list
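Dedicated tools exist for this (model registries, experiment trackers), but the core idea can be sketched with plain Python: give every saved artifact a versioned name and write its metadata next to it. All paths and fields below are illustrative, and model and search come from the training sketch.

```python
import hashlib
import json
import os
import time

import joblib

os.makedirs("models", exist_ok=True)
version = time.strftime("%Y%m%d-%H%M%S")           # e.g. 20250101-120000
artifact_path = f"models/model-{version}.joblib"
joblib.dump(model, artifact_path)

# Record enough metadata to answer "which model made this prediction?"
with open("transactions.csv", "rb") as f:
    data_hash = hashlib.sha256(f.read()).hexdigest()

metadata = {
    "version": version,
    "artifact": artifact_path,
    "data_hash": data_hash,                        # ties the model to the exact data
    "params": search.best_params_,
    "metrics": {"cv_f1": search.best_score_},
}
with open(f"models/model-{version}.json", "w") as f:
    json.dump(metadata, f, indent=2)
```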
Why this matters:
Without versioning, you can’t answer:
“Which model caused this bad prediction?”
Common failure points:
- No version control
- No experiment tracking
- Can't reproduce results




