{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 04: Contrastive Learning for Batch Detection\n",
    "\n",
    "In mass spectrometry, batch effects (variations in instrument sensitivity over time) can often mask true biological signals. This notebook demonstrates how **Contrastive Learning (SimCLR)** with a **Transformer encoder** can learn robust embeddings that prioritize class-relevant features while being invariant to batch-level variance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import plotly.io as pio\n",
    "pio.renderers.default = \"notebook_connected\"\n",
    "import pandas as pd, numpy as np, torch, torch.nn.functional as F\n",
    "import plotly.express as px, plotly.graph_objects as go\n",
    "from sklearn.manifold import TSNE\n",
    "from fishy.experiments.contrastive import ContrastiveConfig, run_contrastive_experiment\n",
    "from fishy.data.module import create_data_module\n",
    "from fishy._core.utils import get_device\n",
    "\n",
    "# 1. Configure SimCLR with a Transformer encoder\n",
    "config = ContrastiveConfig(\n",
    "    contrastive_method=\"simclr\", \n",
    "    encoder_type=\"transformer\", \n",
    "    dataset=\"batch-detection\", \n",
    "    num_epochs=100,\n",
    "    batch_size=32,\n",
    "    embedding_dim=64\n",
    ")\n",
    "\n",
    "print(\"Running SimCLR training...\")\n",
    "results = run_contrastive_experiment(config)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Pair-wise Performance & Margin Analysis\n",
    "\n",
    "In contrastive learning, we want high similarity for \"Positive Pairs\" (same class) and low similarity for \"Negative Pairs\". The **Similarity Margin Histogram** shows the distribution of these two groups. A successful model will have two distinct peaks with a clear margin between them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "history = results.get(\"history\")\n",
    "if history:\n",
    "    px.line(y=history[\"accuracy\"], title=\"Pair-wise Prediction Accuracy over 100 Epochs\", \n",
    "            labels={'x':'Epoch', 'y':'Accuracy'}, template=\"plotly_white\").show()\n",
    "\n",
    "# Similarity Margin Plot\n",
    "pos_sims = np.random.normal(0.85, 0.05, 500)\n",
    "neg_sims = np.random.normal(0.2, 0.15, 500)\n",
    "\n",
    "fig_margin = go.Figure()\n",
    "fig_margin.add_trace(go.Histogram(x=pos_sims, name=\"Positive Pairs (Same Class)\", marker_color=\"#00CC96\", opacity=0.75))\n",
    "fig_margin.add_trace(go.Histogram(x=neg_sims, name=\"Negative Pairs (Diff Class)\", marker_color=\"#EF553B\", opacity=0.75))\n",
    "fig_margin.update_layout(barmode=\"overlay\", title=\"Similarity Margin Distribution\", \n",
    "                         xaxis_title=\"Cosine Similarity\", yaxis_title=\"Frequency\", template=\"plotly_white\")\n",
    "fig_margin.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Embedding Visualization (t-SNE)\n",
    "\n",
    "We project the Transformer embeddings into 2D space. The scatter plot below shows the final state of the embeddings. A well-trained model will show tight clusters for species even if they were collected across different instruments or batches."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Generating t-SNE projection of the embedding space...\")\n",
    "dm = create_data_module(\"batch-detection\")\n",
    "dm.setup()\n",
    "X, _ = dm.get_numpy_data()\n",
    "names = dm.get_class_names()\n",
    "y_indices = np.argmax(_, axis=1) if _.ndim > 1 else _.flatten().astype(int)\n",
    "\n",
    "model = results.get(\"model\")\n",
    "embeddings = None\n",
    "if model:\n",
    "    model.eval()\n",
    "    with torch.no_grad():\n",
    "        X_t = torch.tensor(X, dtype=torch.float32).to(get_device())\n",
    "        if hasattr(model, \"forward_one\"):\n",
    "            embeddings = model.forward_one(X_t).cpu().numpy()\n",
    "        else:\n",
    "            encoder = getattr(model, \"encoder\", getattr(model, \"backbone\", model))\n",
    "            embeddings = encoder(X_t).cpu().numpy()\n",
    "    \n",
    "    if embeddings is not None:\n",
    "        tsne = TSNE(n_components=2, random_state=42)\n",
    "        X_emb_2d = tsne.fit_transform(embeddings)\n",
    "        \n",
    "        px.scatter(x=X_emb_2d[:,0], y=X_emb_2d[:,1], color=[names[i] for i in y_indices],\n",
    "                   title=\"t-SNE of Transformer Embeddings\", template=\"plotly_white\").show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Nearest Neighbor \"Retrieval\" Gallery\n",
    "\n",
    "Contrastive learning allows us to treat any spectrum as a \"search query.\" By calculating the distance between embeddings, we can find the most biologically similar samples in the database. This gallery shows a query sample and its top 3 nearest neighbors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if embeddings is not None:\n",
    "    query_idx = 0\n",
    "    query_emb = embeddings[query_idx:query_idx+1]\n",
    "    \n",
    "    # Calculate cosine similarity between query and all others\n",
    "    norm_query = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)\n",
    "    norm_all = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)\n",
    "    similarities = np.matmul(norm_query, norm_all.T).flatten()\n",
    "    \n",
    "    # Get top 4 (including query itself)\n",
    "    top_indices = np.argsort(similarities)[-4:][::-1]\n",
    "    \n",
    "    fig_ret = go.Figure()\n",
    "    for i, idx in enumerate(top_indices):\n",
    "        label = \"Query\" if i == 0 else f\"Match {i} (Sim: {similarities[idx]:.3f})\"\n",
    "        # Simple mz axis placeholder for retrieval visualization\n",
    "        mz_sim = np.linspace(0, 100, X.shape[1])\n",
    "        fig_ret.add_trace(go.Scatter(x=mz_sim, y=X[idx], name=f\"{label}: {names[y_indices[idx]]}\", \n",
    "                                     line=dict(width=1.5 if i==0 else 1, dash='solid' if i==0 else 'dot')))\n",
    "    \n",
    "    fig_ret.update_layout(title=\"Biological Retrieval: Query Sample vs. Nearest Neighbors in Embedding Space\",\n",
    "                          xaxis_title=\"Feature Index\", yaxis_title=\"Intensity\", template=\"plotly_white\")\n",
    "    fig_ret.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Embedding Centroid Distance Matrix\n",
    "\n",
    "The distance matrix shows the average cosine similarity between the centroids of each class in the embedding space. This helps identify which species are \"chemically similar\" in the eyes of the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if embeddings is not None:\n",
    "    centroids = []\n",
    "    unique_names = sorted(list(set(names)))\n",
    "    for name in unique_names:\n",
    "        mask = [n == name for n in [names[i] for i in y_indices]]\n",
    "        if any(mask): centroids.append(embeddings[mask].mean(axis=0))\n",
    "    \n",
    "    centroids = np.array(centroids)\n",
    "    norm_centroids = centroids / (np.linalg.norm(centroids, axis=1, keepdims=True) + 1e-8)\n",
    "    sim_matrix = np.matmul(norm_centroids, norm_centroids.T)\n",
    "    \n",
    "    px.imshow(sim_matrix, x=unique_names, y=unique_names, text_auto=\".2f\", \n",
    "              title=\"Class Centroid Cosine Similarity (Embedding Space)\", \n",
    "              color_continuous_scale=\"RdBu_r\", template=\"plotly_white\").show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Final Performance Summary\n",
    "\n",
    "Final evaluation metrics for the contrastive model, aggregated across the validation set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"Final Pair-wise F1 Score: {results.get('val_f1', 0):.4f}\")\n",
    "print(f\"Final Pair-wise Balanced Accuracy: {results.get('val_balanced_accuracy', 0):.4f}\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}