fishy.data.module

Data module for managing loading, filtering, and preprocessing of datasets.

Classes

class fishy.data.module.DataModule(dataset_name: str, file_path: str | Path | None = None, batch_size: int = 64, is_pre_train: bool = False, augmentation_config: AugmentationConfig | None = None, random_projection: bool = False, quantize: bool = False, turbo_quant: bool = False, polar: bool = False, normalize: bool = False, snv: bool = False, minmax: bool = False, log_transform: bool = False, savgol: bool = False, run_id: int = 42)

Bases: object

High-level interface for data management.

__init__(dataset_name: str, file_path: str | Path | None = None, batch_size: int = 64, is_pre_train: bool = False, augmentation_config: AugmentationConfig | None = None, random_projection: bool = False, quantize: bool = False, turbo_quant: bool = False, polar: bool = False, normalize: bool = False, snv: bool = False, minmax: bool = False, log_transform: bool = False, savgol: bool = False, run_id: int = 42) -> None
get_class_names() -> List[str]
get_dataset()
get_filtered_dataframe() -> DataFrame
get_groups() -> ndarray | None
get_input_dim() -> int
get_num_classes() -> int
get_numpy_data(labels_as_indices: bool = False) -> Tuple[ndarray, ndarray]
get_train_dataframe() -> DataFrame
get_train_dataloader() -> DataLoader
setup() -> None

class fishy.data.module.DataProcessor(dataset_name: str, batch_size: int = 64)

Bases: object

Handles the low-level processing of raw data into features and labels.

__init__(dataset_name: str, batch_size: int = 64) -> None
encode_labels(data: DataFrame) -> Tuple[ndarray, ndarray]
extract_groups(data: DataFrame) -> ndarray
filter_data(data: DataFrame, is_pre_train: bool = False) -> DataFrame
load_data(file_path: str | Path = None) -> DataFrame

Functions

fishy.data.module.create_data_module(dataset_name: str, file_path: str | Path | None = None, batch_size: int = 64, is_pre_train: bool = False, augmentation_enabled: bool = False, random_projection: bool = False, quantize: bool = False, turbo_quant: bool = False, polar: bool = False, normalize: bool = False, snv: bool = False, minmax: bool = False, log_transform: bool = False, savgol: bool = False, run_id: int = 42, **kwargs) -> DataModule

fishy.data.module.make_pairwise_test_split(X: ndarray, y: ndarray, run_id: int, *extra_arrays, test_size: float = 0.5) -> Tuple

Creates a reproducible train/test split for pairwise batch-detection evaluation.

Uses test_size=0.5 without stratification, so roughly half the samples land in the test set. With 3 samples per class, this typically leaves 1-2 samples per class in the test set, which produces real positive pairs. Stratification is intentionally avoided: with 3 samples per class, a stratified 1/3 split puts exactly 1 sample per class in the test set, so there are zero positive pairs and the pairwise balanced accuracy is trivially 1.0.
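
The pair-count argument can be checked with a few lines of standard-library Python. This is an illustrative sketch only, not the module's code: count_positive_pairs is a hypothetical helper, and the label lists are made-up examples that mirror the three-samples-per-class case.

```python
from itertools import combinations


def count_positive_pairs(labels):
    """Count unordered pairs of samples that share a class label."""
    return sum(1 for a, b in combinations(labels, 2) if a == b)


# Stratified 1/3 split of 3-per-class data: exactly one sample per class
# ends up in the test set, so no two test samples share a label.
stratified_test = ["a", "b", "c", "d"]

# One possible unstratified 50% split of 4 classes x 3 samples (6 of 12
# samples in test): some classes contribute two samples, others one.
unstratified_test = ["a", "a", "b", "c", "c", "d"]

print(count_positive_pairs(stratified_test))    # 0
print(count_positive_pairs(unstratified_test))  # 2
```

With no positive pairs in the test set, a pairwise metric has nothing to tell same-class pairs apart from, which is the degenerate case the unstratified split is meant to avoid.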

Pass additional arrays (e.g. y_onehot) as *extra_arrays to split them with the same indices. Returns (X_train, X_test, y_train, y_test, *extra_trains, *extra_tests).
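
The documented contract (a shared index split and the stated return ordering) can be sketched with NumPy alone. This is a stand-in under stated assumptions, not the fishy implementation: pairwise_split_sketch is a hypothetical name, and seeding a NumPy generator with run_id and rounding the test-set size are guesses about details the docstring does not specify.

```python
import numpy as np


def pairwise_split_sketch(X, y, run_id, *extra_arrays, test_size=0.5):
    """Unstratified, seeded split that mimics the documented return order."""
    n = len(X)
    rng = np.random.default_rng(run_id)  # assumption: run_id is used as the seed
    perm = rng.permutation(n)            # shuffle indices without stratification
    n_test = int(round(n * test_size))   # assumption: test size is rounded
    test_idx, train_idx = perm[:n_test], perm[n_test:]

    # Every extra array is indexed with the same train/test indices.
    extra_trains = [a[train_idx] for a in extra_arrays]
    extra_tests = [a[test_idx] for a in extra_arrays]
    return (X[train_idx], X[test_idx], y[train_idx], y[test_idx],
            *extra_trains, *extra_tests)


# Toy data: 4 classes with 3 samples each, plus one-hot labels as an extra array.
X = np.arange(24, dtype=float).reshape(12, 2)
y = np.repeat(np.arange(4), 3)
y_onehot = np.eye(4)[y]

X_tr, X_te, y_tr, y_te, oh_tr, oh_te = pairwise_split_sketch(X, y, 42, y_onehot)
print(X_tr.shape, X_te.shape)  # (6, 2) (6, 2)
```

Because the extra arrays are indexed with the same permutation, the one-hot rows in oh_te stay aligned with the integer labels in y_te, and calling the function again with the same run_id reproduces the same split.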

fishy.data.module.preprocess_data_pipeline(data_processor: DataProcessor, file_path: str | Path, is_pre_train: bool = False, augmentation_cfg: AugmentationConfig | None = None, random_projection: bool = False, quantize: bool = False, turbo_quant: bool = False, polar: bool = False, normalize: bool = False, snv: bool = False, minmax: bool = False, log_transform: bool = False, savgol: bool = False, run_id: int = 42) -> Tuple[DataLoader, DataFrame, DataFrame]