Building ML in the Dark: A Survival Guide for the Solo Practitioner

Photo by Boitumelo on Unsplash

No GPU cluster. No data team. No ML platform. Here’s what actually ships.

Most ML content is written for teams that have things. A labelled dataset. An MLOps platform. A data engineer who answers Slack messages. A GPU budget that someone has already approved.

You probably don’t have those things. You’re embedded in a product or analytics team, you were handed a vague mandate to “do something with ML,” and you have a laptop, a free-tier cloud account, and colleagues who think pandas is the animal. This post is for you. Not a roadmap — a survival guide. What to hack around, what to refuse, and how to get something real into production before your stakeholders lose interest.

TL;DR

  • Bad data is not your biggest problem. Unclear problem definition is. Fix that first, or nothing else matters.
  • You don’t need a GPU for most things that actually ship at company scale. Learn what does and doesn’t need one.
  • Build the evaluation harness before the model. Without it, you can’t tell if anything is working.
  • Know which requests to push back on entirely. Some “ML problems” should stay heuristics.
  • The smallest deployable model that solves the problem is almost always the right model.

Your Actual Constraints (And Which Ones Are Real)

Before anything else, audit your constraints honestly. Some are hard walls. Most aren’t.

Compute is usually softer than it feels. For tabular data problems at company scale (< 10M rows, < 1,000 features), gradient-boosted trees on a single CPU core outperform most deep learning approaches and train in minutes. For embedding-based tasks, the free tier of any major cloud provider gets you surprisingly far. For LLM-based features, API calls are your compute, and the per-call economics of gpt-4o-mini or claude-haiku are manageable at MVP volumes.

The things that genuinely require a GPU: training or fine-tuning transformer-scale models from scratch. If that’s actually the job, you need either a cloud budget (Google Colab Pro+, a spot instance, or a serverless GPU function on Modal) or a problem rescoped so it doesn’t require one. For almost everything else, “we don’t have a GPU” is a proxy for a different constraint you haven’t named yet.

Data is almost always actually a problem — but not usually in the way people think. The issue is rarely “not enough rows.” It’s label quality, label consistency, and the gap between what was logged and what you need. More on this shortly.

Engineering support is the constraint that actually kills most solo ML projects. Not because you can’t build the model alone, but because getting it called in production, monitored, and redeployed when it breaks requires someone on the other side to care. Scope your project to what you can maintain alone, or make a specific ask of one engineer before you start — not after you have a working model.

Start With the Evaluation Harness, Not the Model

This is the discipline that separates practitioners who ship from practitioners who perpetually have “a model working in the notebook.”

Before writing a single line of training code, build the thing that tells you whether a model is working:

import pandas as pd
from sklearn.metrics import classification_report, roc_auc_score
from typing import Callable

def evaluate(
    predict_fn: Callable,
    test_df: pd.DataFrame,
    label_col: str = "label",
    threshold: float = 0.5,
) -> dict:
    """
    Minimal evaluation harness. Pass any callable as predict_fn.
    Works for heuristics, sklearn models, and API-based LLM classifiers alike.
    """
    y_true = test_df[label_col].values
    y_scores = predict_fn(test_df.drop(columns=[label_col]))
    y_pred = (y_scores >= threshold).astype(int)

    report = classification_report(y_true, y_pred, output_dict=True)
    auc = roc_auc_score(y_true, y_scores)

    return {
        "auc": round(auc, 4),
        "precision": round(report["1"]["precision"], 4),
        "recall": round(report["1"]["recall"], 4),
        "f1": round(report["1"]["f1-score"], 4),
        "n_test": len(y_true),
        "positive_rate": round(y_true.mean(), 4),
    }

# Your first "model" should be a heuristic baseline
def heuristic_predict(df: pd.DataFrame) -> pd.Series:
    """Example: flag anything above threshold in an existing signal column."""
    return (df["some_existing_signal"] > 50).astype(float)

# Now you have a number to beat
baseline_results = evaluate(heuristic_predict, test_df)
print(baseline_results)

Write this harness first because it forces two critical conversations: what does “working” mean, and what does the baseline look like? If you can’t define a test set and a success metric before training, you don’t have a problem definition — you have a research project. Research projects don’t get deployed.

The harness also gives you something to hand to a sceptical stakeholder before you’ve trained anything: “Here’s what a simple rule achieves. Here’s what we’d need to see to justify the model complexity.”

The Data Problem You Actually Have

You’ve been handed a dataset. It has labels. Here’s what’s probably wrong with it.

Label leakage from time. The label was set after the event you’re trying to predict. The model learns to recognise the aftermath, not the signal. Check: can you reconstruct your feature set as it existed at prediction time? If event data is joined without strict temporal cutoffs, you have a problem.
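One way to enforce that cutoff is a point-in-time filter before the feature join. A minimal sketch, assuming hypothetical `events` and `prediction_times` frames with the column names shown:

```python
import pandas as pd

def point_in_time_features(events: pd.DataFrame, prediction_times: pd.DataFrame,
                           key: str = "user_id", event_time: str = "event_ts",
                           cutoff: str = "predict_ts") -> pd.DataFrame:
    """Join events to prediction rows, keeping only events strictly
    before each row's prediction time, so no future data leaks in."""
    merged = prediction_times.merge(events, on=key, how="left")
    return merged[merged[event_time] < merged[cutoff]]

events = pd.DataFrame({"user_id": [1, 1],
                       "event_ts": pd.to_datetime(["2024-01-05", "2024-02-01"])})
preds = pd.DataFrame({"user_id": [1],
                      "predict_ts": pd.to_datetime(["2024-01-10"])})
safe = point_in_time_features(events, preds)
# Only the 2024-01-05 event survives; the 2024-02-01 event is in the future
```

If a feature can’t be reconstructed this way, it can’t be used at prediction time, however predictive it looks offline.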

Label disagreement between sources. Two systems that should agree on the label don’t, and someone just unioned them. Spot-check 50 positives and 50 negatives manually. If you disagree with 15% of the labels, your ceiling is around 85% accuracy regardless of model complexity.
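A minimal sketch of that spot-check, assuming a binary `label` column, with sample sizes capped for small classes:

```python
import pandas as pd

def review_sample(df: pd.DataFrame, label_col: str = "label",
                  n: int = 50, seed: int = 42) -> pd.DataFrame:
    """Draw up to n positives and n negatives for manual label review."""
    parts = [
        grp.sample(n=min(n, len(grp)), random_state=seed)
        for _, grp in df.groupby(label_col)
    ]
    # Shuffle so reviewers aren't biased by seeing all of one class first
    return pd.concat(parts).sample(frac=1, random_state=seed)

df = pd.DataFrame({"label": [0] * 200 + [1] * 30, "text": ["..."] * 230})
sample = review_sample(df)
```

Review the sample blind to the stored label, then compare; the disagreement rate is a hard ceiling on any model trained against those labels.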

Class imbalance that isn’t handled. A 99:1 imbalance doesn’t mean you need a complex technique — it means your evaluation metric needs to be AUC-ROC or F1, not accuracy, and your baseline “predict everything negative” is already at 99% accuracy and completely useless.
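To make that concrete, a sketch with a synthetic 99:1 split showing why accuracy is the wrong scoreboard:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([1] * 10 + [0] * 990)  # 1% positive class
y_all_negative = np.zeros_like(y_true)   # the "predict everything negative" baseline

acc = accuracy_score(y_true, y_all_negative)            # 0.99: looks great
f1 = f1_score(y_true, y_all_negative, zero_division=0)  # 0.0: catches nothing
```

Any metric you report should make this baseline look as useless as it is.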

A quick label audit that catches most of these:

def audit_labels(df: pd.DataFrame, label_col: str, date_col: str) -> None:
    """Quick data quality checks before touching a model."""
    print(f"Label distribution:\n{df[label_col].value_counts(normalize=True).round(3)}\n")

    # Check for temporal consistency
    if date_col in df.columns:
        df["month"] = pd.to_datetime(df[date_col]).dt.to_period("M")
        monthly_rate = df.groupby("month")[label_col].mean()
        print("Label rate over time (should be stable or trend smoothly):")
        print(monthly_rate.to_string())

        # Spike in label rate is often a labelling artefact, not a real signal
        rate_std = monthly_rate.std()
        if rate_std > 0.05:
            print(f"\n⚠️ High label rate variance ({rate_std:.3f}). Check for labelling changes.")

    # Duplicates
    dup_rate = df.duplicated(subset=[c for c in df.columns if c != label_col]).mean()
    print(f"\nDuplicate feature row rate: {dup_rate:.3%}")
    if dup_rate > 0.01:
        print("⚠️ Duplicates may inflate eval metrics. Deduplicate before splitting.")

audit_labels(df, label_col="churned", date_col="event_date")

Spending two hours on data auditing before training saves you three weeks of debugging a model that was never going to work.

The Model Selection Rule for Constrained Environments

The rule is simple, and the ML literature backs it up: use the simplest model that clears your eval bar, then stop.

For tabular data, the hierarchy in practice:

  1. Heuristic or threshold rule — if this already achieves 70% of the value, ship it and move on. The maintenance cost of a rule is near zero.
  2. Logistic regression — interpretable, fast, deployable as a formula. No infrastructure needed to serve.
  3. Gradient-boosted trees (XGBoost, LightGBM) — handle mixed feature types, missing values, and non-linearities with minimal tuning. Trains in seconds on reasonable dataset sizes.
  4. Everything else requires justification beyond “it’s more powerful.”
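The “deployable as a formula” claim for logistic regression is literal: the fitted model reduces to a handful of numbers you can re-implement anywhere, with no sklearn at serving time. A sketch on synthetic data:

```python
import math
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: one feature, separable around x ≈ 1.5
X = np.array([[0.0], [1.0], [2.0], [3.0]] * 25)
y = np.array([0, 0, 1, 1] * 25)
lr = LogisticRegression().fit(X, y)

# Export: the whole model is just these numbers
coefs, intercept = lr.coef_[0].tolist(), float(lr.intercept_[0])

def score(features: list[float]) -> float:
    """Pure-Python replica of the positive-class probability."""
    z = intercept + sum(c * f for c, f in zip(coefs, features))
    return 1.0 / (1.0 + math.exp(-z))

# Matches sklearn's predict_proba to floating-point precision
assert abs(score([2.0]) - lr.predict_proba([[2.0]])[0, 1]) < 1e-9
```

That `score` function can live in a SQL expression, a spreadsheet, or whatever runtime your colleagues already maintain.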

import lightgbm as lgb
from sklearn.model_selection import train_test_split

# Minimal LightGBM setup for a tabular classification problem
# Handles missing values natively; no scaling required
def train_gbm(X: pd.DataFrame, y: pd.Series) -> lgb.LGBMClassifier:
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )

    model = lgb.LGBMClassifier(
        n_estimators=500,
        learning_rate=0.05,
        num_leaves=31,
        class_weight="balanced",  # handles imbalance automatically
        random_state=42,
        verbose=-1,
    )

    model.fit(
        X_train, y_train,
        eval_set=[(X_val, y_val)],
        callbacks=[lgb.early_stopping(50, verbose=False)],
    )

    return model

model = train_gbm(X_train_features, y_train)
results = evaluate(lambda df: model.predict_proba(df)[:, 1], test_df)

For LLM-based tasks (classification, extraction, summarization) without a fine-tuning budget: start with a well-structured prompt to a small model (claude-haiku, gpt-4o-mini) before assuming you need a fine-tuned model. In many classification scenarios, a carefully written zero-shot prompt with 10 labelled examples in context outperforms a fine-tuned smaller model—and costs a fraction of the setup time. Validate first; fine-tune only if the gap is worth it.
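A sketch of that few-shot setup. The prompt builder below is plain Python and testable offline; the model name and SDK call you’d attach to it are assumptions you swap in for your provider:

```python
def build_classification_prompt(text: str, labels: list[str],
                                examples: list[tuple[str, str]]) -> str:
    """Assemble a few-shot classification prompt. `examples` is a list of
    (text, label) pairs; around 10 hand-labelled cases go a long way."""
    lines = [f"Classify the text into one of: {', '.join(labels)}.",
             "Answer with the label only.", ""]
    for ex_text, ex_label in examples:
        lines += [f"Text: {ex_text}", f"Label: {ex_label}", ""]
    lines += [f"Text: {text}", "Label:"]
    return "\n".join(lines)

prompt = build_classification_prompt(
    "Refund still not processed after 3 weeks",
    labels=["complaint", "query", "praise"],
    examples=[("Love the new dashboard!", "praise"),
              ("How do I reset my password?", "query")],
)
# Send `prompt` to gpt-4o-mini / claude-haiku via your provider's SDK,
# then run the response through the same evaluate() harness as any other model.
```

Keeping prompt construction in a pure function means your labelled examples are versioned in code, and the harness can score the LLM against the heuristic baseline on equal terms.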

Deployment Without an MLOps Platform

You don’t have MLflow. You don’t have SageMaker. You have Python and probably AWS or GCP. Here’s a minimal path to a callable model:

Option 1: Pickle + FastAPI — for small models (logistic regression, GBMs). Serialise the model, wrap it in a two-endpoint FastAPI service, containerise with Docker, and deploy to a small cloud instance or a serverless container service (Cloud Run, Fargate). Entire setup: half a day.

# serve.py — minimal model serving
import pickle
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

class PredictRequest(BaseModel):
    features: dict  # {feature_name: value}

@app.post("/predict")
def predict(request: PredictRequest):
    df = pd.DataFrame([request.features])
    score = float(model.predict_proba(df)[0, 1])
    return {"score": score, "label": int(score >= 0.5)}

Option 2: Modal — for anything that needs a GPU or heavier dependencies. Define your function, decorate it, push. No infrastructure management.

import modal

app = modal.App("ml-inference")

@app.function(
    image=modal.Image.debian_slim().pip_install("scikit-learn", "pandas"),
    cpu=1,
)
def predict_batch(records: list[dict]) -> list[float]:
    import pickle
    import pandas as pd

    with open("model.pkl", "rb") as f:
        model = pickle.load(f)
    df = pd.DataFrame(records)
    return model.predict_proba(df)[:, 1].tolist()

Option 3: Batch scoring to a table — for use cases where real-time latency isn’t required. Run your model nightly, write scores to a database table, and let downstream systems read from it. This is underrated: it separates inference from serving entirely, is trivially monitorable, and doesn’t require anyone to integrate an API.
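A minimal sketch of the batch pattern, using sqlite as a stand-in for whatever database your downstream systems already read (table and column names here are illustrative):

```python
import sqlite3
from datetime import date

def write_scores(db_path: str, scores: list[tuple[str, float]]) -> None:
    """Nightly batch: upsert (entity_id, score, scored_on) rows that
    downstream systems read directly. No API for anyone to integrate."""
    with sqlite3.connect(db_path) as conn:
        conn.execute("""CREATE TABLE IF NOT EXISTS model_scores (
            entity_id TEXT PRIMARY KEY, score REAL, scored_on TEXT)""")
        conn.executemany(
            "INSERT OR REPLACE INTO model_scores VALUES (?, ?, ?)",
            [(eid, s, date.today().isoformat()) for eid, s in scores],
        )

write_scores("scores.db", [("user_1", 0.91), ("user_2", 0.07)])
with sqlite3.connect("scores.db") as conn:
    rows = conn.execute(
        "SELECT entity_id, score FROM model_scores ORDER BY entity_id"
    ).fetchall()
```

Monitoring here is a SQL query: row counts, score distribution, and the `scored_on` date tell you at a glance whether last night’s run happened and whether it drifted.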

What to Refuse

This is the section most ML blog posts skip. Some requests should not be ML projects, and saying so clearly — with a concrete alternative — is one of the most valuable things a solo practitioner can do.

Refuse: “Can you build a model to predict X?” when X has no labelled historical data and won’t for months. Labels are the constraint. No labels means no supervised model means no timeline. Offer: a rule-based system or a lightweight annotation workflow to collect labels over the next 4–6 weeks before the model work starts.

Refuse: “Can you improve the model’s performance?” when no one has defined what performance means. Before touching the model, lock down: what metric, measured on what test set, compared to what baseline. Without this, you’ll be asked indefinitely to “improve it more.”

Refuse: “Can we use AI for this?” as a full project spec. AI for what, specifically? What decision does it change? Who acts on the output? These aren’t pedantic questions — they determine whether the project is deployable at all. A model that outputs a score no one knows how to act on is not an ML project; it’s a dashboard decoration.

The framing that works: “I want to build this, and I need us to agree on X before I start.” Not a refusal — a precondition.

Gotchas Nobody Tells You

The model works; the feature pipeline doesn’t. In production, the model is usually the least broken part. The thing that fails is the join that produces your features—schema drift, null handling that wasn’t tested, and a timestamp that’s off by a timezone. Build your feature pipeline to be independently testable and add assertions that fire before inference runs.
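A sketch of those pre-inference assertions: cheap checks that fail loudly before a silently broken frame reaches the model (the column names and thresholds are illustrative):

```python
import pandas as pd

EXPECTED_COLS = {"tenure_days": "int64", "spend_30d": "float64"}
MAX_NULL_RATE = 0.05  # tolerate some missingness, not a broken join

def validate_features(df: pd.DataFrame) -> pd.DataFrame:
    """Fire before inference: schema, dtype, and null-rate checks."""
    missing = set(EXPECTED_COLS) - set(df.columns)
    assert not missing, f"Missing feature columns: {missing}"
    for col, dtype in EXPECTED_COLS.items():
        assert str(df[col].dtype) == dtype, \
            f"{col}: expected {dtype}, got {df[col].dtype}"
        null_rate = df[col].isna().mean()
        assert null_rate <= MAX_NULL_RATE, \
            f"{col}: null rate {null_rate:.1%}, upstream join likely broke"
    return df

ok = validate_features(pd.DataFrame({"tenure_days": [10, 400],
                                     "spend_30d": [12.5, 0.0]}))
```

An assertion that fires in the pipeline is a five-minute fix; the same breakage surfacing as quietly degraded scores is a three-week investigation.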

Stakeholders will stop caring before you finish. The window of organisational enthusiasm for an ML project is shorter than the time it takes to do it properly. Optimise ruthlessly for a first-deployed version, even if it’s not the right version. A model in production at 70% quality that people are using gives you leverage to iterate. A model at 90% quality in a notebook gives you nothing.

“We just need to clean the data first” is often a trap. Data cleaning that has no end state defined will consume the entire project. Define a “good enough” data quality threshold for your first model, ship it, and improve data quality as a parallel workstream informed by where the model fails—not as a prerequisite.

Conclusion

The constraints you’re working under — no compute, bad data, no engineering support — are real, but they’re not the reason most constrained ML projects fail. Most fail because the problem was never well-defined, the evaluation metric was never agreed on, or the deployed scope was never bounded to what one person could actually maintain.

The practitioners who consistently ship under these conditions do three things differently: they define success before they write code, they choose the simplest model that clears the bar rather than the most impressive one, and they treat “should we even build this as an ML model?” as a question worth asking every time.

The best ML engineering in constrained environments resembles good software engineering in constrained environments: scope is small, validation occurs early, deployment is quick, and improvement is driven by real signals rather than anticipated perfection.


Building ML in the Dark: A Survival Guide for the Solo Practitioner was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
