Skip to content

Transformation Catalog

This is the complete current transformation list exposed through the builder API.

Validation

  • validate_schema(required_cols, expected_dtypes=None, strict=True)
  • Enforces required column presence and optional dtype contracts.

Categorical + Encoding

  • group_rare_categories(categorical_cols, min_freq=0.01, other_label='__OTHER__')
  • Replaces infrequent labels with a shared bucket.
  • target_encode(categorical_cols, smoothing=10.0, suffix='_TE', drop_original=False)
  • Smoothed target encoding for categorical features.
  • one_hot_encode(non_ordinal_categorical_cols)
  • One-hot encoding with train/test column alignment.
  • woe_categorical_imputer(categorical_cols)
  • Weight-of-Evidence encoding based on target distribution.
  • cat_norm_cat_features(categorical_variables, numerical_variables, categorical_alias)
  • Category-relative normalization for numeric variables.

Missing Values

  • fill_na(fill_na_dict)
  • Dictionary-based fill for missing values.
  • mean_imputation_na_list(fill_na_mean, strategy)
  • sklearn SimpleImputer wrapper for selected columns.

Time + Date

  • encode_date_col(date_col, time_col=None)
  • Adds calendar/time components.
  • encode_cyclical_time(period_map, drop_original=False)
  • Adds sin/cos features for periodic numeric fields.
  • add_prophet_features(date_col, time_col=None, y_col='close')
  • Adds Prophet-derived time features.

Numeric Shaping

  • clip_feature_outliers(numeric_cols=None, lower_quantile=0.01, upper_quantile=0.99)
  • Quantile clipping.
  • winsorize_features(numeric_cols=None, lower_tail=0.01, upper_tail=0.99)
  • Winsorization by tail percentiles.
  • transform_skewed_features(numeric_cols=None, method='yeo-johnson', standardize=False)
  • Yeo-Johnson or log1p transform.
  • scaler(scaler_type)
  • Numeric scaling (standard, min_max, robust, max_absolute).
  • reduce_mem_load()
  • Numeric dtype downcasting helper.

Feature Construction

  • add_interaction_features(numeric_cols=None, include_self_interactions=False)
  • Pairwise interaction products.
  • vectorize_text(text_col, max_features=200, prefix='TFIDF', drop_original=True)
  • TF-IDF feature generation for one text column.

Feature Selection

  • drop_correlated_features(drop_thresh)
  • Correlation-based drop (engineering layer).
  • decision_tree_feat_select(importance_thresh)
  • Tree-importance threshold selector (engineering layer).
  • random_forest_feat_select(importance_thresh)
  • RF-importance threshold selector (engineering layer).
  • select_by_correlation(drop_thresh)
  • Dedicated feature-selection layer variant.
  • select_by_decision_tree(importance_thresh)
  • Dedicated feature-selection layer variant.
  • select_by_random_forest(importance_thresh)
  • Dedicated feature-selection layer variant.

Monitoring + Explainability

  • monitor_drift(numeric_cols=None)
  • Stores basic baseline/current drift stats for numeric columns.
  • explain_with_permutation_importance(model, scoring=None, n_repeats=5)
  • Computes permutation importances and stores report.

Resampling

  • random_oversampling(sampling_strategy)
  • smote_oversampling(sampling_strategy, k_neighbors=5)
  • adasyn_oversampling(sampling_strategy, k_neighbors=5)
  • borderline_smote_oversampling(sampling_strategy, k_neighbors=5, kind='borderline-1')
  • random_undersampling(sampling_strategy)
  • cluster_centroids_undersampling(sampling_strategy)
  • tomek_links_undersampling()
  • enn_undersampling(sampling_strategy)
  • near_miss_undersampling(sampling_strategy, version=1)