Dataset Configuration
=====================

This guide explains how to integrate new spectral datasets into the framework.

Data Format
-----------
The framework primarily expects spectral data in **Excel (.xlsx)** or **CSV** format.
The standard structure used in this project is:

*   **Row 1**: Header row.
*   **Column 1**: Identifier or target (e.g., a species name or a sample identifier).
*   **Columns 2+**: Features (typically intensity values for specific m/z values).
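
As a minimal sketch of this layout, the snippet below parses a small in-memory CSV with Python's standard ``csv`` module and separates the first column from the feature columns. The sample rows, the m/z column names, and the ``split_table`` helper are hypothetical and only illustrate the structure; they are not part of the framework.

.. code-block:: python

   import csv
   import io

   # Hypothetical sample in the layout described above (not real data).
   SAMPLE_CSV = """m/z,100.1,200.2,300.3
   Hoki_01,0.12,0.40,0.05
   Snapper_01,0.30,0.10,0.22
   QC_01,0.20,0.20,0.20
   """


   def split_table(text):
       """Return (feature_names, labels, features) parsed from CSV text."""
       # Strip per-line whitespace so indented sample text parses cleanly.
       lines = [line.strip() for line in text.strip().splitlines()]
       rows = list(csv.reader(io.StringIO("\n".join(lines))))
       header, body = rows[0], rows[1:]             # Row 1: header row
       feature_names = header[1:]                   # Columns 2+: feature names
       labels = [row[0] for row in body]            # Column 1: identifier or target
       features = [[float(v) for v in row[1:]] for row in body]
       return feature_names, labels, features


   feature_names, labels, features = split_table(SAMPLE_CSV)
   print(feature_names)
   print(labels)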

Registering a Dataset
---------------------
To add a dataset, edit ``fishy/configs/datasets.yaml``. Each entry supports several keys:

.. code-block:: yaml

   my-new-dataset:
     filter_rules:
       exclude_mz: ["QC"]  # Drops rows containing "QC" in the m/z column
       include_mz_pattern: "Hoki" # Only keeps rows matching this pattern
     label_encoding:
       type: "sklearn"  # Supported: sklearn, one_hot, map, regex_float

Encoding Types
--------------
*   ``sklearn``: Applies scikit-learn's ``LabelEncoder`` to the first column, mapping each class name to an integer.
*   ``one_hot``: Converts each class name into a binary indicator vector.
*   ``map``: Uses a manually defined dictionary that maps substrings to label vectors.
*   ``regex_float``: Extracts a numeric value from a string using regex (useful for regression).
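
The four encodings can be illustrated with a short, dependency-free sketch. This is not the framework's implementation: the function names, the example mapping, and the default regular expression are all hypothetical, and ``encode_ordinal`` only mimics the sorted-class integer codes that ``LabelEncoder`` produces.

.. code-block:: python

   import re


   def encode_ordinal(names):
       """Integer codes over the sorted unique names (as LabelEncoder assigns them)."""
       classes = sorted(set(names))
       index = {name: i for i, name in enumerate(classes)}
       return [index[name] for name in names]


   def encode_one_hot(names):
       """One binary vector per sample, one position per sorted unique name."""
       classes = sorted(set(names))
       return [[1 if name == c else 0 for c in classes] for name in names]


   def encode_map(names, mapping):
       """Label vector of the first mapping key found as a substring, else None."""
       def lookup(name):
           for key, vector in mapping.items():
               if key in name:
                   return vector
           return None
       return [lookup(name) for name in names]


   def encode_regex_float(names, pattern=r"(\d+(?:\.\d+)?)"):
       """First numeric match converted to float, or None when nothing matches."""
       values = []
       for name in names:
           match = re.search(pattern, name)
           values.append(float(match.group(1)) if match else None)
       return values


   species = ["Hoki", "Snapper", "Hoki"]
   ordinal = encode_ordinal(species)
   one_hot = encode_one_hot(species)
   mapped = encode_map(
       ["Hoki_batch1", "Snapper_batch2"],
       {"Hoki": [1, 0], "Snapper": [0, 1]},  # hypothetical mapping
   )
   amounts = encode_regex_float(["oil_12.5pct", "oil_40pct"])
   print(ordinal, one_hot, mapped, amounts)

The first three encodings suit classification targets, while ``regex_float`` yields a continuous target for regression.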

Filtering Logic
---------------
By default, the ``DataProcessor`` filters out "QC" (Quality Control) samples. Custom rules under ``filter_rules`` can exclude specific batches or experimental conditions without modifying the raw data files.
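
The sketch below shows one way such rules could behave. It assumes that rules are matched against the values in the first column, that exclusion is a plain substring test, and that the include rule is a regular expression; the actual ``DataProcessor`` semantics may differ, and ``apply_filter_rules`` is a hypothetical helper.

.. code-block:: python

   import re


   def apply_filter_rules(row_ids, exclude=("QC",), include_pattern=None):
       """Return the row identifiers that survive the filter rules."""
       kept = []
       for row_id in row_ids:
           # Drop rows containing any excluded substring (QC by default).
           if any(token in row_id for token in exclude):
               continue
           # When an include pattern is set, keep only rows that match it.
           if include_pattern and not re.search(include_pattern, row_id):
               continue
           kept.append(row_id)
       return kept


   row_ids = ["Hoki_01", "QC_01", "Snapper_01", "Hoki_QC_02"]
   default_kept = apply_filter_rules(row_ids)
   hoki_only = apply_filter_rules(row_ids, include_pattern="Hoki")
   print(default_kept)
   print(hoki_only)

Because the rules operate on an in-memory copy of the identifiers, the source files stay untouched.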