fishy.data.module
Data module for managing loading, filtering, and preprocessing of datasets.
Classes
- class fishy.data.module.DataModule(dataset_name: str, file_path: str | Path | None = None, batch_size: int = 64, is_pre_train: bool = False, augmentation_config: AugmentationConfig | None = None, random_projection: bool = False, quantize: bool = False, turbo_quant: bool = False, polar: bool = False, normalize: bool = False, snv: bool = False, minmax: bool = False, log_transform: bool = False, savgol: bool = False, run_id: int = 42)
Bases: object
High-level interface for data management.
- __init__(dataset_name: str, file_path: str | Path | None = None, batch_size: int = 64, is_pre_train: bool = False, augmentation_config: AugmentationConfig | None = None, random_projection: bool = False, quantize: bool = False, turbo_quant: bool = False, polar: bool = False, normalize: bool = False, snv: bool = False, minmax: bool = False, log_transform: bool = False, savgol: bool = False, run_id: int = 42) -> None
- class fishy.data.module.DataProcessor(dataset_name: str, batch_size: int = 64)
Bases: object
Handles the low-level processing of raw data into features and labels.
Functions
- fishy.data.module.create_data_module(dataset_name: str, file_path: str | Path | None = None, batch_size: int = 64, is_pre_train: bool = False, augmentation_enabled: bool = False, random_projection: bool = False, quantize: bool = False, turbo_quant: bool = False, polar: bool = False, normalize: bool = False, snv: bool = False, minmax: bool = False, log_transform: bool = False, savgol: bool = False, run_id: int = 42, **kwargs) -> DataModule
- fishy.data.module.make_pairwise_test_split(X: ndarray, y: ndarray, run_id: int, *extra_arrays, test_size: float = 0.5) -> Tuple
Creates a reproducible train/test split for pairwise batch-detection evaluation.
Uses test_size=0.5 (not stratified) so that roughly half the samples land in the test set. With 3 samples per class this gives ~1-2 per class in test, producing real positive pairs. Stratification is intentionally avoided: with 3 samples per class a stratified 1/3 split gives exactly 1 per class in test (zero positive pairs), making the pairwise balanced accuracy trivially 1.0.
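The positive-pair argument can be checked with a little arithmetic. The following is a minimal sketch: `count_positive_pairs` is a hypothetical helper written for this illustration and is not part of `fishy.data.module`, and the label lists are made-up toy data for 4 classes with 3 samples each.

```python
from collections import Counter
from math import comb


def count_positive_pairs(labels):
    """Number of unordered sample pairs that share a class label."""
    return sum(comb(n, 2) for n in Counter(labels).values())


# Stratified 1/3 split of 4 classes x 3 samples: one test sample per class.
stratified_test = [0, 1, 2, 3]
# One possible unstratified half split: classes 0 and 2 each appear twice.
unstratified_test = [0, 0, 1, 2, 2, 3]

print(count_positive_pairs(stratified_test))    # 0
print(count_positive_pairs(unstratified_test))  # 2
```

With one sample per class there is no same-class pair to evaluate, whereas the unstratified test set contains two.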
Pass additional arrays (e.g. y_onehot) as *extra_arrays to split them with the same indices. Returns (X_train, X_test, y_train, y_test, *extra_trains, *extra_tests).
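The call pattern and return order described above can be illustrated as follows. This is a sketch under stated assumptions, not the library's implementation: `pairwise_split_sketch` is a hypothetical stand-in that shuffles indices with a NumPy generator seeded by `run_id`, and the real function may split differently (for example via scikit-learn). Only the signature and the return order are taken from the documentation.

```python
import numpy as np


def pairwise_split_sketch(X, y, run_id, *extra_arrays, test_size=0.5):
    """Hypothetical stand-in for make_pairwise_test_split (not the library code)."""
    n_samples = len(X)
    # Seed the shuffle with run_id so the same run always yields the same split.
    rng = np.random.default_rng(run_id)
    order = rng.permutation(n_samples)
    n_test = int(np.ceil(test_size * n_samples))
    test_idx, train_idx = order[:n_test], order[n_test:]
    # Extra arrays are indexed with the same train/test indices as X and y.
    extra_trains = [arr[train_idx] for arr in extra_arrays]
    extra_tests = [arr[test_idx] for arr in extra_arrays]
    return (X[train_idx], X[test_idx], y[train_idx], y[test_idx],
            *extra_trains, *extra_tests)


# Toy data (assumed for illustration): 4 classes with 3 samples each.
y = np.repeat(np.arange(4), 3)
X = np.arange(len(y) * 2, dtype=float).reshape(len(y), 2)
y_onehot = np.eye(4)[y]

X_train, X_test, y_train, y_test, oh_train, oh_test = pairwise_split_sketch(
    X, y, 42, y_onehot
)
```

Because the extra arrays share the indices used for `X` and `y`, `oh_test` stays row-aligned with `y_test`. Note that all extra training arrays come before all extra test arrays in the returned tuple.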
- fishy.data.module.preprocess_data_pipeline(data_processor: DataProcessor, file_path: str | Path, is_pre_train: bool = False, augmentation_cfg: AugmentationConfig | None = None, random_projection: bool = False, quantize: bool = False, turbo_quant: bool = False, polar: bool = False, normalize: bool = False, snv: bool = False, minmax: bool = False, log_transform: bool = False, savgol: bool = False, run_id: int = 42) -> Tuple[DataLoader, DataFrame, DataFrame]