fishy.data.module

Data module for managing loading, filtering, and preprocessing of datasets.

Classes

class fishy.data.module.DataModule(dataset_name: str, file_path: str | Path | None = None, batch_size: int = 64, is_pre_train: bool = False, augmentation_config: AugmentationConfig | None = None, random_projection: bool = False, quantize: bool = False, turbo_quant: bool = False, polar: bool = False, normalize: bool = False, snv: bool = False, minmax: bool = False, log_transform: bool = False, savgol: bool = False, run_id: int = 42)

Bases: object

High-level interface for data management.

__init__(dataset_name: str, file_path: str | Path | None = None, batch_size: int = 64, is_pre_train: bool = False, augmentation_config: AugmentationConfig | None = None, random_projection: bool = False, quantize: bool = False, turbo_quant: bool = False, polar: bool = False, normalize: bool = False, snv: bool = False, minmax: bool = False, log_transform: bool = False, savgol: bool = False, run_id: int = 42) → None

get_class_names() → List[str]

get_dataset()

get_filtered_dataframe() → DataFrame

get_groups() → ndarray | None

get_input_dim() → int

get_num_classes() → int

get_numpy_data(labels_as_indices: bool = False) → Tuple[ndarray, ndarray]

get_train_dataframe() → DataFrame

get_train_dataloader() → DataLoader

setup() → None

class fishy.data.module.DataProcessor(dataset_name: str, batch_size: int = 64)

Bases: object

Handles the low-level processing of raw data into features and labels.

__init__(dataset_name: str, batch_size: int = 64) → None

encode_labels(data: DataFrame) → Tuple[ndarray, ndarray]

extract_groups(data: DataFrame) → ndarray

filter_data(data: DataFrame, is_pre_train: bool = False) → DataFrame

load_data(file_path: str | Path = None) → DataFrame

Functions

fishy.data.module.create_data_module(dataset_name: str, file_path: str | Path | None = None, batch_size: int = 64, is_pre_train: bool = False, augmentation_enabled: bool = False, random_projection: bool = False, quantize: bool = False, turbo_quant: bool = False, polar: bool = False, normalize: bool = False, snv: bool = False, minmax: bool = False, log_transform: bool = False, savgol: bool = False, run_id: int = 42, **kwargs) → DataModule

fishy.data.module.make_pairwise_test_split(X: ndarray, y: ndarray, run_id: int, *extra_arrays, test_size: float = 0.5) → Tuple

Creates a reproducible train/test split for pairwise batch-detection evaluation.

Defaults to test_size=0.5 without stratification, so that roughly half the samples land in the test set. With 3 samples per class, this puts about 1-2 samples per class in the test set, which produces real positive pairs. Stratification is intentionally avoided: with 3 samples per class, a stratified 1/3 split puts exactly 1 sample per class in the test set, leaving zero positive pairs and making the pairwise balanced accuracy trivially 1.0.
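
The pair-counting argument above can be checked with a small standalone sketch. It does not call fishy; the helper names (count_positive_pairs, stratified_third_test_labels, unstratified_half_test_labels) are hypothetical, and it assumes a "positive pair" means two test samples sharing a class label. The shuffle is a plain seeded stand-in for whatever splitter the library actually uses.

```python
import random
from collections import Counter
from typing import List, Sequence


def count_positive_pairs(labels: Sequence[int]) -> int:
    """Number of unordered same-class pairs among the given labels."""
    return sum(n * (n - 1) // 2 for n in Counter(labels).values())


def stratified_third_test_labels(n_classes: int) -> List[int]:
    """Test labels when each class of 3 contributes exactly 1 sample."""
    return list(range(n_classes))


def unstratified_half_test_labels(n_classes: int, seed: int) -> List[int]:
    """Test labels from a seeded shuffle that sends half of all samples to test."""
    labels = [c for c in range(n_classes) for _ in range(3)]
    random.Random(seed).shuffle(labels)
    return labels[: len(labels) // 2]


stratified = stratified_third_test_labels(10)
unstratified = unstratified_half_test_labels(10, seed=42)

print(count_positive_pairs(stratified))    # 0: every class appears once
print(count_positive_pairs(unstratified))  # at least 1, see the note below
```

With 10 classes of 3 samples, the unstratified test set holds 15 samples drawn from only 10 classes, so at least one class must appear twice regardless of the seed. That guarantee is what the stratified 1/3 split lacks.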

Pass additional arrays (e.g. y_onehot) as *extra_arrays to split them with the same indices. Returns (X_train, X_test, y_train, y_test, *extra_trains, *extra_tests).
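
The return ordering described above can be illustrated with a standalone sketch. This is not the library's implementation: split_with_shared_indices is a hypothetical stand-in that works on plain lists, uses Python's random module seeded with run_id, and only mirrors the documented output order.

```python
import random
from typing import List, Sequence, Tuple


def split_with_shared_indices(
    X: Sequence, y: Sequence, run_id: int, *extra_arrays: Sequence,
    test_size: float = 0.5,
) -> Tuple[List, ...]:
    """Hypothetical stand-in: one seeded permutation indexes every array."""
    indices = list(range(len(X)))
    random.Random(run_id).shuffle(indices)
    n_test = int(round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def take(arr: Sequence, idx: List[int]) -> List:
        return [arr[i] for i in idx]

    # Documented order: X and y splits first, then all extra train
    # splits, then all extra test splits.
    result = [take(X, train_idx), take(X, test_idx),
              take(y, train_idx), take(y, test_idx)]
    result += [take(a, train_idx) for a in extra_arrays]
    result += [take(a, test_idx) for a in extra_arrays]
    return tuple(result)


X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]
y = [0, 0, 0, 1, 1, 1]
y_onehot = [[1, 0]] * 3 + [[0, 1]] * 3

X_tr, X_te, y_tr, y_te, oh_tr, oh_te = split_with_shared_indices(
    X, y, 42, y_onehot
)
```

Because every array is indexed by the same permutation, each row of oh_te still corresponds to the label at the same position in y_te.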

fishy.data.module.preprocess_data_pipeline(data_processor: DataProcessor, file_path: str | Path, is_pre_train: bool = False, augmentation_cfg: AugmentationConfig | None = None, random_projection: bool = False, quantize: bool = False, turbo_quant: bool = False, polar: bool = False, normalize: bool = False, snv: bool = False, minmax: bool = False, log_transform: bool = False, savgol: bool = False, run_id: int = 42) → Tuple[DataLoader, DataFrame, DataFrame]