{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 01: Datasets and Preprocessing\n", "\n", "This notebook provides a comprehensive visual exploration of the REIMS datasets. We examine class distributions, spectral signatures, and dimensionality reduction projections to understand the underlying structure of the mass spectrometry data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import plotly.io as pio\n", "pio.renderers.default = \"notebook_connected\"\n", "\n", "import pandas as pd\n", "import numpy as np\n", "import plotly.express as px\n", "import plotly.graph_objects as go\n", "from sklearn.decomposition import PCA\n", "from sklearn.manifold import TSNE\n", "\n", "from fishy.data.module import create_data_module\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Class Distribution\n", "\n", "The class distribution plot shows the number of samples available for each category in the dataset. A balanced dataset is generally preferred for training robust machine learning models, as it prevents the model from developing a bias toward the majority class." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dm = create_data_module(\"species\")\n", "dm.setup()\n", "df = dm.get_filtered_dataframe()\n", "dist = df[\"Class Name\"].value_counts()\n", "px.bar(x=dist.index, y=dist.values, labels={'x': 'Class', 'y': 'Count'}, \n", " title=\"Species Class Distribution\", template=\"plotly_white\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. TIC Normalization Quality (Before vs. After)\n", "\n", "Total Ion Count (TIC) variability is a common technical artifact in mass spectrometry. By comparing the distribution of total intensities before and after normalization, we can verify that the systematic variations in the raw data have been successfully mitigated, ensuring that the remaining signals primarily reflect biological variance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_norm, y_all = dm.get_numpy_data(labels_as_indices=True)\n", "label_col = \"Class Name\"\n", "feature_cols = [c for c in df.columns if c not in [label_col, \"m/z\"]]\n", "X_raw = df[feature_cols].values\n", "\n", "tic_df = pd.DataFrame({\n", " \"TIC\": np.concatenate([X_raw.sum(axis=1), X_norm.sum(axis=1)]),\n", " \"State\": [\"Raw\"] * len(X_raw) + [\"Normalized\"] * len(X_norm),\n", " \"Class\": list(df[label_col]) * 2\n", "})\n", "\n", "px.violin(tic_df, x=\"State\", y=\"TIC\", color=\"State\", box=True, points=\"all\",\n", " title=\"Total Ion Count (TIC) Distribution: Normalization Impact\", \n", " template=\"plotly_white\", color_discrete_map={\"Raw\": \"#EF553B\", \"Normalized\": \"#00CC96\"})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Interactive Spectrum Viewer\n", "\n", "This line plot overlays multiple individual samples from the dataset. It allows us to observe the raw variance and noise present in the REIMS spectra. By looking at individual samples, we can see if certain peaks are consistently present or if there are significant outliers." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sample_df = df.sample(min(10, len(df)))\n", "melted = sample_df[[label_col]+feature_cols].melt(id_vars=label_col, var_name=\"m/z\", value_name=\"Intensity\")\n", "px.line(melted, x=\"m/z\", y=\"Intensity\", color=label_col, \n", " title=\"Interactive Spectrum Overlay (Random Samples)\", template=\"plotly_white\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Mean Spectral Signatures\n", "\n", "The mean spectral signature represents the average intensity across all samples in a class, with the shaded area indicating one standard deviation. This visualization is critical for identifying \"diagnostic peaks\"\u2014specific m/z values where classes show distinct, non-overlapping behavior." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "names = dm.get_class_names()\n", "mz_axis = np.array([float(c) for c in feature_cols])\n", "\n", "avg_fig = go.Figure()\n", "for idx, name in enumerate(names):\n", " mask = (y_all == idx)\n", " if np.any(mask):\n", " m, s = X_norm[mask].mean(axis=0), X_norm[mask].std(axis=0)\n", " avg_fig.add_trace(go.Scatter(x=mz_axis, y=m, name=name, mode=\"lines\"))\n", " avg_fig.add_trace(go.Scatter(x=np.concatenate([mz_axis, mz_axis[::-1]]), \n", " y=np.concatenate([m+s, (m-s)[::-1]]), \n", " fill=\"toself\", fillcolor=\"rgba(0,100,80,0.1)\", \n", " line=dict(color=\"rgba(255,255,255,0)\"), \n", " hoverinfo=\"skip\", showlegend=False))\n", "\n", "avg_fig.update_layout(template=\"plotly_white\", xaxis_title=\"m/z\", yaxis_title=\"Intensity\", title=\"Mean Signatures with Std Dev\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Difference Spectra (Subtraction)\n", "\n", "Difference spectra highlight the exact m/z values where two classes diverge. By subtracting the mean of Class B from Class A, we can pinpoint specific metabolic or lipidomic peaks that are upregulated (positive) or downregulated (negative) in one species relative to another." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if len(names) >= 2:\n", " m1 = X_norm[y_all == 0].mean(axis=0)\n", " m2 = X_norm[y_all == 1].mean(axis=0)\n", " diff = m1 - m2\n", " \n", " fig_diff = go.Figure()\n", " fig_diff.add_trace(go.Scatter(x=mz_axis, y=diff, name=f\"{names[0]} - {names[1]}\", line=dict(color='red', width=1.5)))\n", " fig_diff.add_hline(y=0, line_dash=\"dash\", line_color=\"black\", opacity=0.5)\n", " fig_diff.update_layout(title=f\"Difference Spectrum: {names[0]} vs {names[1]}\", \n", " xaxis_title=\"m/z\", yaxis_title=\"Intensity Difference\", template=\"plotly_white\")\n", " fig_diff.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. PCA and Information Retention\n", "\n", "Principal Component Analysis (PCA) reduces the 2000+ features into a few orthogonal components. The scatter plot shows how well the classes separate in this reduced space. The **Information Retention** plot shows the cumulative explained variance; a steep curve indicates that most of the dataset's information is captured in just a few dimensions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pca_obj = PCA(n_components=10)\n", "X_pca = pca_obj.fit_transform(X_norm)\n", "fig_pca = px.scatter(x=X_pca[:,0], y=X_pca[:,1], color=[names[i] for i in y_all], \n", " title=\"PCA Projection (PC1 vs PC2)\", labels={'x':'PC1', 'y':'PC2'}, template=\"plotly_white\")\n", "fig_pca.show()\n", "\n", "px.area(x=range(1, 11), y=np.cumsum(pca_obj.explained_variance_ratio_), \n", " labels={'x': 'Components', 'y': 'Cumulative Variance'}, \n", " title=\"PCA: Information Retention (Explained Variance)\", template=\"plotly_white\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. t-SNE Visualization\n", "\n", "t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear dimensionality reduction technique that is excellent at preserving local structures. Unlike PCA, it can capture complex, non-linear relationships, often resulting in tighter, more distinct class clusters." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tsne = TSNE(n_components=2, perplexity=min(30, len(X_norm)-1), random_state=42)\n", "X_tsne = tsne.fit_transform(X_norm)\n", "px.scatter(x=X_tsne[:,0], y=X_tsne[:,1], color=[names[i] for i in y_all], \n", " title=\"t-SNE Projection\", template=\"plotly_white\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 8. Global Intensity Distribution\n", "\n", "The violin plots show the distribution of the average intensity for each sample, grouped by class. This helps identify if certain classes have systematically higher or lower total ion counts, which could be a source of bias if not properly normalized." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "avg_int = pd.DataFrame({\"Avg Intensity\": X_norm.mean(axis=1), \"Class\": [names[i] for i in y_all]})\n", "px.violin(avg_int, x=\"Class\", y=\"Avg Intensity\", color=\"Class\", \n", " box=True, points=\"all\", title=\"Global Intensity Distribution per Class\", template=\"plotly_white\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.10.0" } }, "nbformat": 4, "nbformat_minor": 5 }