Dataset Configuration

This guide explains how to integrate new spectral datasets into the framework.

Data Format

The framework primarily expects spectral data in Excel (.xlsx) or CSV format. The standard structure used in this project is:

  • Row 1: Header row.

  • Column 1: Identifier or Target (e.g., Species name, or specific identifier).

  • Columns 2+: Features (typically intensity values for specific m/z values).

Registering a Dataset

To add a dataset, edit fishy/configs/datasets.yaml. Each entry supports several keys:

my-new-dataset:
  filter_rules:
    exclude_mz: ["QC"]  # Drops rows containing "QC" in the m/z column
    include_mz_pattern: "Hoki" # Only keeps rows matching this pattern
  label_encoding:
    type: "sklearn"  # Supported: sklearn, one_hot, map, regex_float

Encoding Types

  • sklearn: Uses a standard LabelEncoder on the first column.

  • one_hot: Converts categorical names into binary vectors.

  • map: A manual dictionary mapping substrings to label vectors.

  • regex_float: Extracts a numeric value from a string using regex (useful for regression).

Filtering Logic

The DataProcessor automatically filters out “QC” (Quality Control) samples by default. You can add custom rules to ignore specific batches or experimental conditions without modifying the raw data files.