03: Oil and Cross-species Adulteration

This notebook covers regression tasks within the REIMS framework: predicting oil concentration levels and identifying adulteration in cross-species samples. Unlike classification, which assigns each sample to a discrete class, a regression model must learn how the intensity of specific biomarkers responds, linearly or non-linearly, to changes in the concentration of a substance.

[1]:

import os, sys, warnings
warnings.filterwarnings("ignore")
import pandas as pd, numpy as np
import plotly.io as pio
import plotly.express as px
import plotly.graph_objects as go
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support, precision_recall_curve, average_precision_score, mean_absolute_error
from fishy import TrainingConfig, run_unified_training, display_final_summary, create_data_module
pio.renderers.default = "notebook_connected"
try:
    import torch
    import torch.nn as nn
    from fishy.analysis.xai import GradCAM, ModelWrapper
    from lime.lime_tabular import LimeTabularExplainer
    from fishy._core.utils import get_device
    HAS_XAI = True
except ImportError:
    HAS_XAI = False
    print("XAI dependencies (lime, torch) not fully available.")

[2]:

# Run regression training for the oil dataset
config = TrainingConfig(model="rf", dataset="oil", regression=True, wandb_log=False)
results = run_unified_training(config)
display_final_summary(results)

INFO     Initialized RunContext: rf on oil

╭─────────────────────────────────────╮
│ Training Complete - Results Summary │
│  Metric              Train     Val  │
│  Accuracy           1.0000  0.0873  │
│  Balanced Accuracy  1.0000  0.0873  │
│  MAE                0.0000  2.4603  │
│  MSE                0.0000  8.6190  │
│  Precision          1.0000  0.0658  │
│  Recall             1.0000  0.0873  │
│  F1 Score           1.0000  0.0738  │
╰─────────────────────────────────────╯

Elapsed training time: 17.9592 seconds

1. Regression Calibration (Predicted vs. Actual)

The predicted vs. actual plot is the most direct check of regression performance. A perfect model would place every point on the dashed identity line (predicted = actual), which sits at 45 degrees when both axes share the same scale. Points above the line are over-estimates and points below it are under-estimates, so a consistent offset to one side shows where the model is systematically biased.

[3]:

if "predictions" in results:
    p = results["predictions"]
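    # Convert to float arrays so element-wise arithmetic (y_true - y_pred) works even if these are plain lists
    y_true, y_pred = np.asarray(p["labels"], dtype=float), np.asarray(p["preds"], dtype=float)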

    fig_cal = px.scatter(x=y_true, y=y_pred, labels={'x': 'Actual Concentration', 'y': 'Predicted Concentration'},
                         title="Regression Calibration: Predicted vs. Actual",
                         opacity=0.6, template="plotly_white")
    fig_cal.add_shape(type="line", x0=min(y_true), y0=min(y_true), x1=max(y_true), y1=max(y_true),
                      line=dict(color="Red", dash="dash"))
    fig_cal.show()
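
The visual check above can be backed by two numbers: the slope and intercept of a least-squares line through the (actual, predicted) points, and R² measured against the identity line. The following is a minimal sketch on made-up values, not output from this run; the arrays are hypothetical stand-ins for results["predictions"].

```python
import numpy as np

# Hypothetical example data: actual concentrations and model predictions.
y_true = np.array([0.0, 10.0, 20.0, 30.0, 40.0, 50.0])
y_pred = np.array([5.0, 12.0, 21.0, 28.0, 37.0, 44.0])

# Least-squares line: y_pred ~ slope * y_true + intercept.
slope, intercept = np.polyfit(y_true, y_pred, deg=1)

# R^2 of the predictions themselves, i.e. measured against the identity line y = x.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r2:.3f}")
```

A slope below 1 combined with a positive intercept means predictions are compressed toward the middle of the range: low concentrations are over-estimated and high ones under-estimated.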

2. MAE by Concentration (Limit of Detection)

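In chemical adulteration analysis, it is critical to know the concentration below which the model stops being reliable. This bar chart shows the Mean Absolute Error (MAE) at each concentration level. If the error at a low concentration (e.g., 1%) is large relative to the concentration itself, the model cannot reliably quantify at that level, which places its practical Limit of Detection above it.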

[4]:

if "predictions" in results:
    errors = np.abs(y_true - y_pred)
    err_df = pd.DataFrame({"Actual": y_true, "MAE": errors})
    # Group by actual concentration and calculate mean error
    mae_by_conc = err_df.groupby("Actual").mean().reset_index()

    px.bar(mae_by_conc, x="Actual", y="MAE",
           title="Prediction Error (MAE) by Concentration Level",
           labels={'Actual': 'Actual Concentration (%)', 'MAE': 'Mean Absolute Error'},
           template="plotly_white", color="MAE", color_continuous_scale="Reds").show()
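
One way to turn the bar chart into a single number is to pick an acceptable error and report the lowest concentration from which every level stays within it. The sketch below uses invented data and an assumed tolerance of 1 percentage point; it is a working threshold for this kind of plot, not a formal analytical-chemistry limit of detection.

```python
import numpy as np
import pandas as pd

# Hypothetical per-sample results: actual concentration (%) and absolute error.
err_df = pd.DataFrame({
    "Actual": [1, 1, 5, 5, 10, 10, 20, 20],
    "AbsError": [4.0, 6.0, 3.0, 2.0, 1.0, 0.5, 0.5, 1.0],
})
mae_by_conc = err_df.groupby("Actual")["AbsError"].mean().sort_index()

tolerance = 1.0  # assumed acceptable MAE, in percentage points

# A level qualifies only if it and every higher level are within tolerance.
within = (mae_by_conc <= tolerance).to_numpy()
ok_from_here = np.logical_and.accumulate(within[::-1])[::-1]
levels = mae_by_conc.index.to_numpy()
practical_lod = levels[ok_from_here][0] if ok_from_here.any() else None

print(mae_by_conc.to_dict(), "practical LOD:", practical_lod)
```

Requiring all higher levels to pass as well avoids reporting a low level that happens to look good while a level above it fails.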

3. Residual Distribution Analysis

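Residuals (Error = Actual - Predicted) should ideally be centered on zero and roughly symmetric, and approximately normal if the errors are random noise. A skewed histogram indicates systematic over- or under-estimation, and multiple peaks suggest that the error depends on the concentration level, for example because the model fails to capture chemical features specific to certain levels.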

[5]:

if "predictions" in results:
    residuals = y_true - y_pred
    px.histogram(x=residuals, nbins=30, labels={'x': 'Residual Error'},
                 title="Residual Distribution (Error Analysis)",
                 template="plotly_white", color_discrete_sequence=['#636EFA']).show()
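
A histogram pools all levels together, so it can look centered even when the error changes sign across the range. The sketch below, on invented values, computes the overall bias, the sample skewness, and the mean residual per concentration level; the variable names are illustrative and not part of fishy.

```python
import numpy as np

# Hypothetical example data with two samples per concentration level.
y_true = np.array([0.0, 0.0, 10.0, 10.0, 20.0, 20.0, 30.0, 30.0])
y_pred = np.array([3.0, 5.0, 11.0, 12.0, 19.0, 20.0, 26.0, 24.0])
residuals = y_true - y_pred

# Overall bias: mean residual across all samples.
bias = residuals.mean()

# Sample skewness: third central moment over the cubed standard deviation.
centered = residuals - bias
skewness = np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

# Mean residual within each concentration level.
bias_by_level = {
    float(level): float(residuals[y_true == level].mean())
    for level in np.unique(y_true)
}

print(f"bias={bias:.2f}, skewness={skewness:.2f}")
print(bias_by_level)
```

In this example the overall bias is zero, yet the per-level means run from negative at the low end to positive at the high end, which is the pattern a multi-peaked or skewed histogram hints at.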

4. Biomarker Drift over Concentration

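In a regression task, informative m/z peaks should show a linear (or at least monotonic) response to concentration. This plot tracks the intensity of the three features with the highest absolute Pearson correlation to the label, plotted against the sorted class index returned by get_numpy_data(labels_as_indices=True). A clear monotonic trend is evidence that the spectra carry a genuine concentration gradient for a model to learn; it does not by itself show what the trained model has learned.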

[6]:

if "data_module" in results:
    dm = results["data_module"]
    X_raw, y_raw = dm.get_numpy_data(labels_as_indices=True)
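    # Pearson correlation of each feature with the label to find linear biomarkers.
    # Constant features yield NaN, which argsort would rank last (i.e. as "top"), so map NaN to 0.
    corrs = np.nan_to_num([np.corrcoef(X_raw[:, i], y_raw)[0, 1] for i in range(X_raw.shape[1])])
    top_bio_idx = np.argsort(np.abs(corrs))[-3:]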

    feat_names = [f for f in dm.get_filtered_dataframe().columns if f not in ["Class Name", "m/z"]]

    drift_df = pd.DataFrame({"Concentration": y_raw})
    for idx in top_bio_idx:
        drift_df[f"m/z {feat_names[idx]}"] = X_raw[:, idx]

    melted_drift = drift_df.melt(id_vars="Concentration", var_name="Biomarker", value_name="Intensity")
    px.scatter(melted_drift, x="Concentration", y="Intensity", color="Biomarker",
               title="Biomarker Intensity Drift vs. Concentration", template="plotly_white").show()
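
Pearson correlation, used above to pick the biomarkers, rewards linear responses only. For a response that is monotonic but curved, a rank-based (Spearman) correlation is a common alternative. The sketch below computes it as the Pearson correlation of average ranks on synthetic data; the spearman helper and the arrays are illustrative, not part of fishy.

```python
import numpy as np
import pandas as pd

# Synthetic example: concentration labels and two features.
concentration = np.array([0, 0, 1, 1, 2, 2, 3, 3], dtype=float)
X = np.column_stack([
    concentration ** 3,                                # monotonic but non-linear response
    np.array([5, 1, 4, 2, 1, 5, 2, 4], dtype=float),   # no trend with concentration
])

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx = pd.Series(x).rank().to_numpy()
    ry = pd.Series(y).rank().to_numpy()
    return float(np.corrcoef(rx, ry)[0, 1])

pearson = [float(np.corrcoef(X[:, j], concentration)[0, 1]) for j in range(X.shape[1])]
rho = [spearman(X[:, j], concentration) for j in range(X.shape[1])]

print("Pearson:", np.round(pearson, 3), "Spearman:", np.round(rho, 3))
```

The cubic feature gets a perfect rank correlation but a lower Pearson value, so ranking features by Spearman would keep curved responses that a purely linear screen can under-rate.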

5. Performance & Interpretability

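When the task is run as ordinal classification rather than regression, a confusion matrix summarizes performance; the first cell below is skipped when config.regression is True, as in this run. For interpretability, LIME shows which spectral peaks push the prediction toward higher or lower concentration levels.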

[7]:

if not config.regression and "predictions" in results:
    p = results["predictions"]
    cn = results["class_names"]
    cm = confusion_matrix(p["labels"], p["preds"])
    px.imshow(cm, x=cn, y=cn, text_auto=True, title="Confusion Matrix", color_continuous_scale="Blues").show()
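
A regression run can still be viewed through a confusion matrix by snapping each continuous prediction to the nearest known concentration level. The sketch below does this with NumPy on invented values; the levels array is a hypothetical set of concentrations, and this is not how fishy computes its own classification metrics.

```python
import numpy as np

# Hypothetical known concentration levels and example predictions.
levels = np.array([0.0, 10.0, 20.0, 30.0])
y_true = np.array([0.0, 0.0, 10.0, 10.0, 20.0, 20.0, 30.0, 30.0])
y_pred = np.array([1.2, 6.0, 9.0, 16.0, 21.5, 19.0, 24.0, 33.0])

# Index of the nearest level for each continuous prediction.
pred_idx = np.abs(y_pred[:, None] - levels[None, :]).argmin(axis=1)

# Index of each true value (assumes every y_true is exactly one of the levels).
true_idx = np.searchsorted(levels, y_true)

# Rows are actual levels, columns are snapped predictions.
cm = np.zeros((len(levels), len(levels)), dtype=int)
np.add.at(cm, (true_idx, pred_idx), 1)

print(cm)
print("exact-level accuracy:", np.trace(cm) / cm.sum())
```

Because the levels are ordered, the distance of an entry from the diagonal is meaningful: entries one cell off are near misses, while entries far from the diagonal are large concentration errors.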

[8]:

if HAS_XAI and "model" in results and "data_module" in results:
    try:
        m = results["model"]
        dm = results["data_module"]
        X_x, y_x = dm.get_numpy_data(labels_as_indices=True)
        feature_names = [str(c) for c in dm.get_filtered_dataframe().columns if c not in ["Class Name", "m/z"]]
        # LIME defaults to mode="classification", which requires class probabilities.
        # A regressor's predict() returns one value per sample, so select the mode explicitly.
        mode = "regression" if config.regression else "classification"
        explainer = LimeTabularExplainer(X_x, mode=mode, feature_names=feature_names,
                                         class_names=results["class_names"], discretize_continuous=True)
        # For regression, we explain the single output value
        predict_fn = m.predict if config.regression else ModelWrapper(m, str(get_device())).predict_proba
        exp = explainer.explain_instance(X_x[0], predict_fn, num_features=10)
        el = exp.as_list()
        px.bar(x=[weight for _, weight in el], y=[name for name, _ in el], orientation="h",
               title="LIME Explanation (Sample 0)").show()
    except Exception as e:
        print(f"XAI Visualization failed: {e}")