Zero to Neuron: The Ultimate Guide to ML Evaluation Metrics

We’ve built the models. We’ve cleaned the data. We’ve tuned the hyperparameters until our GPUs cried for mercy. But here is the uncomfortable question: Is the model actually any good?

You can’t improve what you can’t measure.

In this post, we are moving away from "it looks right" to "here is the math." We will break down the essential evaluation metrics for Regression, Classification, and Clustering, explain when to use them, and show you exactly how to implement them in Python.

Our Goal: To understand the math, the code, and the visual intuition behind every major metric.

Let's get coding.

Part 1: Regression Metrics (Predicting Values)

Regression is when your model predicts a continuous number (like house prices, temperature, or stock value).

1. Mean Absolute Error (MAE)

The Concept:

MAE calculates the absolute difference between the actual and predicted values. It treats all errors equally. If you are off by 5, it counts as 5. It doesn't care if the error is positive or negative.

The Formula:

$$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$

Why it's useful:

It is the most intuitive metric. If your MAE is 500, your predictions are off by 500 units on average. Because the errors are not squared, MAE is more robust to outliers than MSE or RMSE: one wildly wrong prediction shifts the score in proportion to its size instead of dominating it.

Best Use Case:

Use MAE when you want an error metric that is easy to explain to non-technical stakeholders and when outliers in your data are "safe" to ignore or shouldn't be penalized heavily.

The Code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error

# Dummy data
actual = np.array([3, -0.5, 2, 7])
predicted = np.array([2.5, 0.0, 2, 8])

# Calculate MAE
mae = mean_absolute_error(actual, predicted)
print(f"MAE: {mae}")

# Visualization
plt.figure(figsize=(8, 5))
plt.plot(range(len(actual)), actual, 'o', label='Actual Data')
plt.plot(range(len(predicted)), predicted, 'x', label='Predicted Data')
for i in range(len(actual)):
    plt.vlines(i, actual[i], predicted[i], colors='red', linestyles='dotted', label='Error (MAE)' if i==0 else "")
plt.title('Visualizing MAE (Red lines = Absolute Error)')
plt.legend()
plt.show()
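
As a sanity check, the formula can also be applied directly with NumPy. This is a minimal sketch that reuses the same dummy values as the code above; the names errors and mae_manual are only illustrative.

```python
import numpy as np

actual = np.array([3, -0.5, 2, 7])
predicted = np.array([2.5, 0.0, 2, 8])

# Absolute error for each point: |y_i - y_hat_i|
errors = np.abs(actual - predicted)

# MAE is the mean of those absolute errors
mae_manual = errors.mean()

print(f"Absolute errors: {errors}")
print(f"MAE (manual): {mae_manual}")
```

The result should match what mean_absolute_error returns for the same arrays.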

2. Mean Squared Error (MSE)

The Concept:

MSE squares the difference between actual and predicted values before averaging them. Because it squares the errors, large errors become huge.

The Formula:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Why it's useful:

It heavily penalizes large errors. Being off by 10 is 100 times worse than being off by 1 (since 10^2 = 100).

Best Use Case:

Use MSE when being "very wrong" is unacceptable. For example, in medical dosages or self-driving car trajectories, a large error could be catastrophic, so you want the model to focus on eliminating those large errors.

The Code:

from sklearn.metrics import mean_squared_error
mse = mean_squared_error(actual, predicted)
print(f"MSE: {mse}")
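
To see how squaring changes the picture, here is a small sketch comparing MAE and MSE on two hypothetical sets of errors that differ by a single large miss. The error values are made up for illustration, and the helper names mae and mse are not from any library.

```python
import numpy as np

def mae(errors):
    """Mean of the absolute errors."""
    return np.mean(np.abs(errors))

def mse(errors):
    """Mean of the squared errors."""
    return np.mean(np.square(errors))

# Hypothetical prediction errors: identical except for one large miss
errors_clean = np.array([1.0, 1.0, 1.0, 1.0])
errors_outlier = np.array([1.0, 1.0, 1.0, 10.0])

print(f"MAE clean: {mae(errors_clean)}, with outlier: {mae(errors_outlier)}")
print(f"MSE clean: {mse(errors_clean)}, with outlier: {mse(errors_outlier)}")
```

On these numbers the single large miss moves MAE from 1.0 to 3.25, while MSE jumps from 1.0 to 25.75.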

3. Root Mean Squared Error (RMSE)

The Concept:

RMSE is simply the square root of MSE.

The Formula:

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

Why it's useful:

MSE is hard to interpret (what does "squared dollars" mean?). RMSE brings the unit back to the original scale. If you are predicting house prices in dollars, RMSE gives you the error in dollars.

Best Use Case:

A common default for regression. Use it when you want the heavier penalty on large errors (like MSE) but in the same units as the target.

The Code:

rmse = np.sqrt(mean_squared_error(actual, predicted))
print(f"RMSE: {rmse}")

4. R-Squared (R^2)

The Concept:

R^2 (Coefficient of Determination) tells you how well your model explains the variance in the data. It compares your model to a baseline model that just predicts the average of the target for everyone.

The Formula:

$$R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$$

Why it's useful:

It gives you a score that is at most 1 and usually falls between 0 and 1.

1.0: Perfect model.
0.0: Your model is no better than always predicting the average.
Negative: Your model is worse than always predicting the average.

Best Use Case:

Use this to check "goodness of fit"—how well the independent variables explain the dependent variable.

The Code:

from sklearn.metrics import r2_score
r2 = r2_score(actual, predicted)
print(f"R2 Score: {r2}")
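
The three reference points for R^2 can be reproduced with a tiny example. This is a sketch on made-up targets: a perfect prediction, the "always predict the average" baseline, and a deliberately bad prediction (the targets in reverse order).

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets for illustration
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Baseline: predict the mean of the targets for every sample
mean_baseline = np.full_like(y, y.mean())

# A deliberately bad model: the targets in reverse order
reversed_preds = y[::-1].copy()

r2_perfect = r2_score(y, y)
r2_baseline = r2_score(y, mean_baseline)
r2_reversed = r2_score(y, reversed_preds)

print(f"Perfect:  {r2_perfect}")
print(f"Baseline: {r2_baseline}")
print(f"Reversed: {r2_reversed}")
```

Here the reversed predictions have four times the squared error of the mean baseline, so R^2 comes out at 1 - 4 = -3.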

5. Adjusted R-Squared

The Concept:

Standard R^2 has a flaw: on the training data it never decreases when you add more features, even if those features are garbage. Adjusted R^2 fixes this by penalizing you for adding useless features.

The Formula:

$$R^2_{adj} = 1 - (1-R^2) \frac{n-1}{n-p-1}$$

(Where n is sample size and p is number of features)

Why it's useful:

It discourages overfitting caused by adding too many variables.

Best Use Case:

Always use Adjusted R^2 instead of normal R^2 when comparing models with a different number of features (Multiple Linear Regression).

The Code:

n = len(actual) # Number of observations
p = 1 # Number of features

adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"Adjusted R2: {adj_r2}")
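
To make the penalty concrete, the formula can be wrapped in a small helper and evaluated for the same R^2 with different feature counts. This is a sketch; adjusted_r2 is a hypothetical helper (not a scikit-learn function) and the numbers are invented for illustration.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n samples and p features."""
    if n - p - 1 <= 0:
        raise ValueError("Need n > p + 1 for Adjusted R^2 to be defined")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same raw R^2 and sample size, different number of features
few_features = adjusted_r2(r2=0.80, n=50, p=5)
many_features = adjusted_r2(r2=0.80, n=50, p=20)

print(f"p=5:  {few_features:.4f}")
print(f"p=20: {many_features:.4f}")
```

With an identical raw R^2 of 0.80, the model with 20 features ends up with a noticeably lower adjusted score than the one with 5.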

Part 2: Classification Metrics (Predicting Categories)

Classification is when you predict discrete labels (Spam/Ham, Cat/Dog, Yes/No).

Setup Code:

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import (
    classification_report, 
    confusion_matrix, 
    accuracy_score, 
    precision_score, 
    recall_score, 
    f1_score
)

# 1. Dummy Data
y_true = [0, 1, 0, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 0, 1, 1]

print(classification_report(y_true, y_pred))

1. Accuracy

The Concept:

The ratio of correct predictions to total predictions.

The Formula:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

Why it's useful:

It is the simplest metric to understand.

Best Use Case:

Only use Accuracy when your classes are balanced (e.g., 50% cats, 50% dogs). If you have 99% non-fraud and 1% fraud, a model that predicts "non-fraud" for everything has 99% accuracy but is useless.

acc = accuracy_score(y_true, y_pred)
print(f"Accuracy:  {acc:.2f}")
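
The fraud scenario above is easy to reproduce. This is a sketch on synthetic labels (99 non-fraud, 1 fraud) with a "lazy" model that always predicts non-fraud; recall, covered later in this post, is included to show what accuracy hides.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic, heavily imbalanced labels: 99 non-fraud (0) and 1 fraud (1)
y_true_fraud = np.array([0] * 99 + [1])

# A "lazy" model that predicts non-fraud for everything
y_pred_lazy = np.zeros(100, dtype=int)

lazy_acc = accuracy_score(y_true_fraud, y_pred_lazy)
lazy_rec = recall_score(y_true_fraud, y_pred_lazy)

print(f"Accuracy: {lazy_acc:.2f}")
print(f"Recall:   {lazy_rec:.2f}")
```

The accuracy is 0.99 even though the model catches none of the fraud cases (recall 0.0).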

2. Precision

The Concept:

"Of all the times the model shouted 'Wolf!', how many times was there actually a wolf?"

It measures the quality of the positive predictions.

The Formula:

$$Precision = \frac{TP}{TP + FP}$$

Why it's useful:

It tells you how trustworthy a "Positive" prediction is.

Best Use Case:

Use Precision when the cost of a False Positive is high.

Example: Spam detection. You don't want to classify an important email from your boss as Spam (False Positive). You'd rather miss a few spam emails.

prec = precision_score(y_true, y_pred)
print(f"Precision: {prec:.2f}")

3. Recall (Sensitivity)

The Concept:

"Of all the wolves that were actually in the forest, how many did we find?"

It measures the ability of the model to find all the positive cases.

The Formula:

$$Recall = \frac{TP}{TP + FN}$$

Why it's useful:

It ensures we don't miss actual positive cases.

Best Use Case:

Use Recall when the cost of a False Negative is high.

Example: Cancer detection. You must find every case of cancer. It is okay to flag a healthy person for a checkup (False Positive), but it is fatal to tell a sick person they are healthy (False Negative).

rec = recall_score(y_true, y_pred)
print(f"Recall: {rec:.2f}")
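
The formulas for Accuracy, Precision, and Recall can be checked by hand from the raw counts. A minimal sketch, assuming the same dummy labels as the setup code; for binary labels, confusion_matrix(...).ravel() returns the counts in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 0, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 0, 1, 1]

# Unpack the four counts from the 2x2 confusion matrix
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy_manual = (tp + tn) / (tp + tn + fp + fn)
precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Accuracy:  {accuracy_manual:.2f}")
print(f"Precision: {precision_manual:.2f}")
print(f"Recall:    {recall_manual:.2f}")
```

These values should agree with accuracy_score, precision_score, and recall_score on the same labels.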

4. F1 Score

The Concept:

The harmonic mean of Precision and Recall. It combines them into a single number.

The Formula:

$$F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$$

Why it's useful:

It balances the trade-off. You can't game the F1 score by predicting everything as positive: that gives perfect Recall but poor Precision, and the harmonic mean is pulled toward the lower of the two.

Best Use Case:

The go-to metric for imbalanced datasets or when you need a balance between Precision and Recall.

The Code:

from sklearn.metrics import f1_score

f1 = f1_score(y_true, y_pred)
print(f"F1 Score:  {f1:.2f}")
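
Why a harmonic mean and not a plain average? The sketch below compares the two for a hypothetical model with perfect Recall but poor Precision. The values 0.10 and 1.0 are invented, and f1_from is an illustrative helper, not a library function.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical "predict everything as positive" model
precision, recall = 0.10, 1.0

arithmetic_mean = (precision + recall) / 2
f1_value = f1_from(precision, recall)

print(f"Arithmetic mean: {arithmetic_mean:.2f}")
print(f"F1 score:        {f1_value:.2f}")
```

The plain average looks respectable at 0.55, while F1 stays close to the weaker number at roughly 0.18.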

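5. Confusion Matrix

The Concept:

A table that counts actual versus predicted labels. scikit-learn puts actual classes on rows and predicted classes on columns, so for binary labels the layout is [[TN, FP], [FN, TP]].

The Code: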

# Visualize Confusion Matrix
cm = confusion_matrix(y_true, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

6. AUC-ROC (Area Under Curve - Receiver Operating Characteristic)

The Concept:

This measures how well your model separates the Positive class from the Negative class.

ROC Curve: A plot of True Positive Rate (Recall) vs False Positive Rate at different thresholds.
AUC: The area under that line.

Why it's useful:

It tells you how good your model is at distinguishing between classes, regardless of the threshold you pick later.

Best Use Case:

Use AUC to compare different models. If Model A has AUC 0.9 and Model B has AUC 0.8, Model A is generally better at separating classes.

The Code:

from sklearn.metrics import roc_curve, roc_auc_score

# Predicted probabilities for the positive class (not hard labels)
# Note: y_true is redefined here for this smaller example
y_probs = [0.1, 0.4, 0.35, 0.8]
y_true = [0, 0, 1, 1]

fpr, tpr, thresholds = roc_curve(y_true, y_probs)
auc = roc_auc_score(y_true, y_probs)

plt.plot(fpr, tpr, label=f"AUC = {auc:.2f}")
plt.plot([0,1], [0,1], 'r--') # Random guess line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
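
One way to build intuition for AUC: it equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below counts those pairs by hand for the same four points and compares the result with roc_auc_score. Ties, if any, are counted as half a win; the variable names are illustrative.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_probs = [0.1, 0.4, 0.35, 0.8]

# Split the scores by true class
pos = [p for p, t in zip(y_probs, y_true) if t == 1]
neg = [p for p, t in zip(y_probs, y_true) if t == 0]

# Count positive/negative pairs where the positive is ranked higher
wins = sum(
    1.0 if p > n else 0.5 if p == n else 0.0
    for p, n in product(pos, neg)
)
auc_pairwise = wins / (len(pos) * len(neg))

print(f"Pairwise AUC: {auc_pairwise}")
print(f"sklearn AUC:  {roc_auc_score(y_true, y_probs)}")
```

Three of the four positive/negative pairs are ranked correctly here, so both numbers come out at 0.75.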

Part 3: Clustering Metrics (Grouping Data)

Clustering is unsupervised. There are no "Right" answers to check against.

1. Silhouette Score

The Concept:

It answers: "How similar is a data point to its own cluster compared to other clusters?"

It ranges from -1 to +1.

The Formula:

$$S = \frac{b - a}{\max(a, b)}$$

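a: Average distance from the point to the other points in its own cluster (Cohesion).
b: Average distance from the point to the points in the nearest other cluster (Separation).

The Silhouette Score of a clustering is the mean of S over all points.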

Why it's useful:

+1: Highly dense clusters, well separated.
0: Overlapping clusters.
-1: Points assigned to wrong clusters.

Best Use Case:

Use this to compare candidate values of K in K-means: the K with the highest score is usually a good choice.
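
A rough sketch of that K-selection loop is shown here. It assumes synthetic blobs generated with make_blobs and a candidate range of K from 2 to 6; both are illustrative choices, and real data will rarely give such a clean answer.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 true centers (illustrative only)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Fit K-means for each candidate K and record the silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(f"K={k}: silhouette={score:.3f}")
print(f"Best K by silhouette: {best_k}")
```

On well-separated blobs like these, one would expect the peak to sit at or near the true number of centers, but it is worth plotting the clusters before trusting the number.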

2. Davies-Bouldin Index

The Concept:

It measures the average similarity between each cluster and its most similar one. It looks at the ratio of "within-cluster scatter" to "between-cluster separation."

It ranges from 0 to infinity.

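The Formula:

$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \left( \frac{S_i + S_j}{M_{i,j}} \right)$$

S_i: Average distance from the points in cluster i to its centroid (Scatter).
M_{i,j}: Distance between the centroids of clusters i and j (Separation).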

Why it's useful:

Unlike Silhouette, Lower is Better. A lower score means clusters are compact and far apart.

Best Use Case:

Use this alongside the Silhouette score to validate clustering performance.
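
The formula is easier to follow on a toy example. The sketch below uses four hand-picked points in two clusters, computes the scatter and centroid separation manually, and compares the result with davies_bouldin_score. With only two clusters the max and the average in the formula collapse to a single ratio. The data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

# Toy data: two tight clusters, four units apart on the x-axis
X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
labels_toy = np.array([0, 0, 1, 1])

# Centroid of each cluster
centroids = np.array([X_toy[labels_toy == c].mean(axis=0) for c in (0, 1)])

# S_i: average distance from each cluster's points to its centroid
scatter = np.array([
    np.linalg.norm(X_toy[labels_toy == c] - centroids[c], axis=1).mean()
    for c in (0, 1)
])

# M_01: distance between the two centroids
separation = np.linalg.norm(centroids[0] - centroids[1])

db_manual = (scatter[0] + scatter[1]) / separation

print(f"Manual DB:  {db_manual}")
print(f"sklearn DB: {davies_bouldin_score(X_toy, labels_toy)}")
```

Each cluster has a scatter of 1 and the centroids are 4 apart, so the index is (1 + 1) / 4 = 0.5.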

The Code & Visualization:

from sklearn.metrics import silhouette_score, davies_bouldin_score
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate Data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Cluster (random_state fixed so the labels are reproducible)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(f"Silhouette: {silhouette_score(X, labels)}")
print(f"Davies-Bouldin: {davies_bouldin_score(X, labels)}")

# Visualize
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title("Clustering Visualization")
plt.show()

Conclusion

There is no "One Metric to Rule Them All." The choice depends entirely on your business problem.

Predicting house prices? RMSE.
Detecting spam? Precision.
Detecting cancer? Recall.
Grouping customers? Silhouette Score.

Now that you have the code, try running these on your own datasets in Colab!

Happy coding, meow! (Don't forget to subscribe, or my cat will... well, you know. 😼)
