Goal of the Day
Learn how Linux and shell basics power real-world MLOps pipelines—from organizing ML projects to automating training and deployment.
MLOps engineers live in the terminal. Models don’t fail because of math—they fail because of bad structure, broken scripts, or missing environment variables.
1️⃣ Folder Structures for ML Projects
A clean project structure is non-negotiable in MLOps. It enables:
- Reproducibility
- Collaboration
- CI/CD automation
- Easier debugging & monitoring
📁 Standard ML Project Structure
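A typical layout (mirroring the directories created in the hands-on exercise at the end of this lesson) looks like this:

```text
ml-project/
├── data/
│   ├── raw/
│   └── processed/
├── src/
│   └── train.py
├── scripts/
│   └── run_pipeline.sh
├── configs/
├── models/
├── logs/
├── README.md
└── requirements.txt
```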
Why This Matters in MLOps
- data/ → versioned and tracked
- src/ → production code (not notebooks)
- scripts/ → automation entry points
- configs/ → environment-agnostic settings
- models/ → saved artifacts for deployment
This structure maps directly to CI/CD pipelines and ML platforms.
2️⃣ Bash Commands Used in MLOps Pipelines
MLOps pipelines rely heavily on shell commands—especially in Dockerfiles, CI tools, cron jobs, and cloud VMs.
🔧 Core Linux Commands You Must Know
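The exact set varies by team, but a starter set you will reach for constantly looks something like this (the paths here are illustrative):

```bash
ls -lh data/processed/        # list files with human-readable sizes
mkdir -p models/v1            # create nested directories in one shot
cp model.joblib models/v1/    # copy artifacts
mv run.log logs/archive/      # move or rename files
cat configs/prod.yaml         # print a file to the terminal
grep -i "error" logs/*.log    # search logs for errors
tail -n 50 logs/*.log         # show the last lines of each log
chmod +x scripts/*.sh         # make scripts executable
ps aux | grep python          # find running training jobs
df -h                         # check free disk space
```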
Example: Run a Training Script (src/train.py)
```python
import os
import logging
from datetime import datetime

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import joblib

# -----------------------------
# Configuration (via env vars)
# -----------------------------
MODEL_NAME = os.getenv("MODEL_NAME", "demo_model")
MODEL_VERSION = os.getenv("MODEL_VERSION", "v1")
DATA_PATH = os.getenv("DATA_PATH", "data/processed/train.csv")
MODEL_DIR = os.getenv("MODEL_DIR", "models")
LOG_DIR = os.getenv("LOG_DIR", "logs")

os.makedirs(MODEL_DIR, exist_ok=True)
os.makedirs(LOG_DIR, exist_ok=True)

# -----------------------------
# Logging setup
# -----------------------------
log_file = f"{LOG_DIR}/train_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    handlers=[
        logging.FileHandler(log_file),
        logging.StreamHandler(),
    ],
)

logging.info("Starting training pipeline")
logging.info(f"Model: {MODEL_NAME}, Version: {MODEL_VERSION}")
logging.info(f"Loading data from {DATA_PATH}")

# -----------------------------
# Load data
# -----------------------------
try:
    data = pd.read_csv(DATA_PATH)
except FileNotFoundError:
    logging.error("Training data not found. Exiting.")
    raise

X = data.drop("label", axis=1)
y = data["label"]

# -----------------------------
# Train / test split
# -----------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# -----------------------------
# Model training
# -----------------------------
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
logging.info("Model training completed")

# -----------------------------
# Evaluation
# -----------------------------
preds = model.predict(X_test)
accuracy = accuracy_score(y_test, preds)
logging.info(f"Validation accuracy: {accuracy:.4f}")

# -----------------------------
# Save model artifact
# -----------------------------
model_path = f"{MODEL_DIR}/{MODEL_NAME}_{MODEL_VERSION}.joblib"
joblib.dump(model, model_path)
logging.info(f"Model saved to {model_path}")
logging.info("Training pipeline finished successfully")
```
🧪 Example: Run Full Pipeline

```bash
bash scripts/run_pipeline.sh
```
✅ Sample run_pipeline.sh

```bash
#!/bin/bash
# -----------------------------
# Safe bash settings
# -----------------------------
set -euo pipefail

# -----------------------------
# Project paths
# -----------------------------
PROJECT_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
SRC_DIR="$PROJECT_ROOT/src"
DATA_DIR="$PROJECT_ROOT/data/processed"
LOG_DIR="$PROJECT_ROOT/logs"

mkdir -p "$LOG_DIR"

# -----------------------------
# Environment variables
# -----------------------------
export ENV="dev"
export MODEL_NAME="demo_model"
export MODEL_VERSION="v1"
export DATA_PATH="$DATA_DIR/train.csv"
export MODEL_DIR="$PROJECT_ROOT/models"
export LOG_DIR="$LOG_DIR"

# -----------------------------
# Logging
# -----------------------------
PIPELINE_LOG="$LOG_DIR/pipeline_$(date +%Y%m%d_%H%M%S).log"
exec > >(tee -a "$PIPELINE_LOG") 2>&1

echo "🚀 Starting MLOps pipeline"
echo "Environment: $ENV"
echo "Model: $MODEL_NAME ($MODEL_VERSION)"
echo "Project root: $PROJECT_ROOT"

# -----------------------------
# Data validation (basic)
# -----------------------------
if [ ! -f "$DATA_PATH" ]; then
  echo "❌ Training data not found at $DATA_PATH"
  exit 1
fi
echo "✅ Training data found"

# -----------------------------
# Training step
# -----------------------------
echo "🏋️ Running model training..."
python "$SRC_DIR/train.py"
echo "✅ Training completed"

# -----------------------------
# Pipeline finished
# -----------------------------
echo "🎉 Pipeline finished successfully"
echo "Logs saved to $PIPELINE_LOG"
```
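Two lines in this script do a lot of heavy lifting. set -euo pipefail makes the script exit on the first failed command, treat any unset variable as an error, and fail a pipeline if any command in it fails. And exec > >(tee -a "$PIPELINE_LOG") 2>&1 sends everything the pipeline prints (stdout and stderr) to both the terminal and the log file. Both are standard practice in production shell scripts.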
🧪 Example: Check Logs

The training script writes a fresh timestamped log file on each run, so point tail at the file you want to follow:

```bash
tail -f logs/train_20260118_093215.log
```

📄 Sample: logs/train_20260118_093215.log
```text
2026-01-18 09:32:15,104 - INFO - Starting training pipeline
2026-01-18 09:32:15,105 - INFO - Model: demo_model, Version: v1
2026-01-18 09:32:15,105 - INFO - Loading data from data/processed/train.csv
2026-01-18 09:32:15,321 - INFO - Training dataset loaded successfully
2026-01-18 09:32:15,322 - INFO - Total records: 12,000
2026-01-18 09:32:15,323 - INFO - Features: 15 | Label column: label
2026-01-18 09:32:15,330 - INFO - Splitting data (80% train / 20% validation)
2026-01-18 09:32:15,342 - INFO - Train samples: 9,600
2026-01-18 09:32:15,342 - INFO - Validation samples: 2,400
2026-01-18 09:32:15,351 - INFO - Initializing LogisticRegression model
2026-01-18 09:32:15,359 - INFO - Training model...
2026-01-18 09:32:16,912 - INFO - Model training completed successfully
2026-01-18 09:32:16,918 - INFO - Running validation
2026-01-18 09:32:16,934 - INFO - Validation accuracy: 0.8725
2026-01-18 09:32:16,940 - INFO - Saving model artifact
2026-01-18 09:32:16,946 - INFO - Model saved to models/demo_model_v1.joblib
2026-01-18 09:32:16,947 - INFO - Training pipeline finished successfully
```
How an MLOps Engineer Reads This Log
| Log Section | Why It Matters |
|---|---|
| Pipeline start | Confirms job execution |
| Model name & version | Traceability & rollback |
| Dataset size | Detects data drift |
| Train/val split | Reproducibility |
| Training duration | Performance & cost |
| Accuracy metric | Model health |
| Artifact path | Deployment readiness |
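In practice you rarely read a log top to bottom; you filter for the lines you care about. For example, with the log naming scheme used above:

```bash
# pull the headline metric out of every training run
grep "Validation accuracy" logs/train_*.log

# follow the most recent run as it writes
tail -f "$(ls -t logs/train_*.log | head -n 1)"
```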
3️⃣ Environment Variables (Critical for MLOps)
Environment variables let you separate code from configuration.
They are used for:

- Secrets (API keys)
- Model paths
- Environment flags (dev / prod)
- Cloud credentials
🔑 Set Environment Variables

```bash
export MODEL_NAME="fraud_detector"
export ENV="production"
```

Check:

```bash
echo $MODEL_NAME
```
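Just as os.getenv(...) in train.py falls back to a default, shell scripts commonly do the same with parameter expansion. A small illustrative sketch:

```bash
# use the exported value if present, otherwise fall back to a default
MODEL_NAME="${MODEL_NAME:-demo_model}"
echo "Training $MODEL_NAME"
```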
Why MLOps Uses Env Vars
- Prevents hardcoding secrets
- Makes Docker & CI/CD portable
- Enables safe multi-environment deployments
Example in Docker / CI
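A minimal sketch of how the same variables might be injected in a Dockerfile (names and values here are illustrative):

```dockerfile
FROM python:3.11-slim

# bake non-secret defaults into the image
ENV MODEL_NAME="fraud_detector" \
    ENV="production"

WORKDIR /app
COPY . .
RUN pip install -r requirements.txt

CMD ["python", "src/train.py"]
```

In CI, the same values typically come from the workflow definition or its secret store rather than the image. GitHub Actions, shown here as one example:

```yaml
# .github/workflows/train.yml (illustrative)
name: train
on: [push]
jobs:
  train:
    runs-on: ubuntu-latest
    env:
      MODEL_NAME: fraud_detector
      ENV: production
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: python src/train.py
```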
4️⃣ Practice Exercise (Hands-On)
📝 Task: Create an ML Project Directory
Run the following commands:
```bash
mkdir -p ml-project/{data/{raw,processed},src,models,configs,scripts,logs}
cd ml-project
touch README.md requirements.txt
```
Verify:
```bash
tree
```
(If tree isn’t installed, use ls -R)
Bonus Challenge (Optional)
- Create train.py inside src/
- Write a bash script run_pipeline.sh that runs training
- Add an environment variable for model version
What You Learned Today
✔ Linux project structuring for ML
✔ Essential bash commands for pipelines
✔ Environment variables for secure deployments
✔ Hands-on ML project setup
Other related links:
Junior MLOps Engineer - Day 2 Training: ML Lifecycle Deep Dive
https://www.wisemoneyai.com/2026/01/junior-mlops-engineer-day-2-training-ml.html
30-Day Full Course (1 Hour per Day) - Day 1 included
https://www.wisemoneyai.com/2026/01/junior-mlops-engineer-30-day-full.html