Getting Started with evoFE

What is evoFE?

evoFE (Evolutionary Feature Engineering) uses a genetic algorithm to automatically discover useful feature transformations for tabular data. Instead of manually crafting interaction terms, ratios, or binning strategies, you let evolution explore the space of possible transformations and keep the ones that improve predictive performance.

The result is an evo_recipe — a reusable transformation pipeline that can be applied to new data at prediction time.

How it works

Initialisation — A population of individuals is created. Each individual is a “recipe” containing a set of feature transformations (genes).
Evaluation — Every individual is scored via cross-validated or split model performance (LightGBM or XGBoost).
Selection — The top 50 % survive to breed.
Breeding — Survivors are combined (crossover) and randomly altered (mutation) to produce the next generation.
Repeat — The cycle continues until the fitness plateaus or the generation budget is exhausted.
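The five steps above can be sketched in a few lines of base R. This is a toy illustration only, not evoFE's implementation: individuals are simple on/off masks over hypothetical candidate features, `toy_fitness()` is a made-up stand-in for the cross-validated model score, carrying survivors over unchanged is an assumption, and the plateau-based stopping rule is left out.

```r
# Toy genetic algorithm mirroring the five steps above.
# Everything here is illustrative: evoFE's real individuals are recipes of
# transformations, and its fitness comes from a LightGBM/XGBoost model.
set.seed(1)

n_genes  <- 10   # hypothetical number of candidate features
pop_size <- 8
useful   <- c(TRUE, TRUE, TRUE, rep(FALSE, n_genes - 3))  # hidden "good" genes

# Stand-in fitness: reward useful genes, lightly penalise useless ones
toy_fitness <- function(ind) sum(ind & useful) - 0.1 * sum(ind & !useful)

# 1. Initialisation: random population of on/off gene masks
pop <- replicate(pop_size, runif(n_genes) < 0.3, simplify = FALSE)

best_per_gen <- numeric(0)
for (gen in 1:20) {
  # 2. Evaluation: score every individual
  fit <- vapply(pop, toy_fitness, numeric(1))
  best_per_gen <- c(best_per_gen, max(fit))

  # 3. Selection: the top 50 % survive
  survivors <- pop[order(fit, decreasing = TRUE)[seq_len(pop_size / 2)]]

  # 4. Breeding: uniform crossover of two parents plus a small mutation rate
  children <- lapply(seq_len(pop_size - length(survivors)), function(i) {
    parents <- sample(survivors, 2)
    child   <- ifelse(runif(n_genes) < 0.5, parents[[1]], parents[[2]])
    flip    <- runif(n_genes) < 0.05
    xor(child, flip)
  })

  # 5. Repeat: survivors are carried over unchanged (an assumption here)
  pop <- c(survivors, children)
}

best_per_gen
```

Because the survivors are kept as they are, the best fitness in this sketch can never decrease from one generation to the next, which is the same pattern visible in the "Current Best Fitness" lines of the logs in the examples that follow.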

Installation

# Install the released version from CRAN
install.packages("evoFE")

# Or install the development version directly from GitHub
# devtools::install_github("tanopereira/evoFE")

Quick Start — Binary Classification

Let’s classify whether a car has an automatic or manual transmission using the mtcars dataset.

library(evoFE)

data(mtcars)
df <- mtcars
df$am <- as.integer(df$am)   # target: 0 = automatic, 1 = manual

set.seed(42)
res <- evolve_features(
  data      = df,
  target_col = "am",
  task       = "classification",
  evaluator  = "xgboost",
  generations = 5,
  pop_size    = 8,
  cv_folds    = 3,
  early_stopping_rounds = 3,
  verbose     = TRUE
)
#> Starting Evolutionary Feature Engineering...
#>   Task: classification
#>   Evaluator: xgboost
#>   Generations: 5, Population Size: 8, CV Folds: 3
#>   Original Numeric columns: mpg, cyl, disp, hp, drat, wt, qsec, vs, gear, carb
#>   Original Categorical columns:
#> 
#> --- Generation 0 (Baseline) ---
#>   Individual 1: [Original features only]
#>   Tested Individual 1 -> Fitness: 0.7147
#> 
#> [Gen 1] Initialized Population:
#>   Individual 1: [Original features only]
#>   Individual 2: [subtract(hp, qsec), deadwood(vs, wt, carb, hp, gear)]
#>   Individual 3: [normalized_difference(drat, carb), reciprocal(vs)]
#>   Individual 4: [log_ratio(cyl, gear), log_binning_cat7(cyl)]
#>   Individual 5: [log(wt), normalized_difference(gear, drat)]
#>   Individual 6: [pca1(qsec, hp, wt, carb, gear, drat), pca2(qsec, hp, wt, carb, gear, drat), pca3(qsec, hp, wt, carb, gear, drat)]
#>   Individual 7: [log_binning6(qsec), truncated_svd1(drat, vs, hp, disp, wt, gear), truncated_svd2(drat, vs, hp, disp, wt, gear), truncated_svd3(drat, vs, hp, disp, wt, gear)]
#>   Individual 8: [pca1(hp, cyl, carb, qsec, disp, drat), pca2(hp, cyl, carb, qsec, disp, drat), pca3(hp, cyl, carb, qsec, disp, drat)]
#> 
#> --- Generation 1 / 5 (Current Best Fitness: 0.7147) ---
#>   Tested Individual 2 -> Fitness: 0.7147
#>   Tested Individual 3 -> Fitness: 0.7147
#>   Tested Individual 4 -> Fitness: 0.7147
#>   Tested Individual 5 -> Fitness: 0.7031
#>   Tested Individual 6 (New Best!) -> Fitness: 0.7154
#>   Tested Individual 7 -> Fitness: 0.7103
#>   Tested Individual 8 (New Best!) -> Fitness: 0.7157
#>   Gen 1 Best Fitness: 0.7157
#>   Gen 1 Best Recipe: [pca1(hp, cyl, carb, qsec, disp, drat), pca2(hp, cyl, carb, qsec, disp, drat), pca3(hp, cyl, carb, qsec, disp, drat)]
#> 
#> --- Generation 2 / 5 (Current Best Fitness: 0.7157) ---
#>   Tested Individual 2 -> Fitness: 0.7154
#>   Tested Individual 3 -> Fitness: 0.7154
#>   Tested Individual 4 -> Fitness: 0.7147
#>   Tested Individual 5 -> Fitness: 0.7147
#>   Tested Individual 6 -> Fitness: 0.7154
#>   Tested Individual 7 -> Fitness: 0.7154
#>   Tested Individual 8 -> Fitness: 0.7141
#>   Gen 2 Best Fitness: 0.7157
#>   Gen 2 Best Recipe: [pca1(hp, cyl, carb, qsec, disp, drat), pca2(hp, cyl, carb, qsec, disp, drat), pca3(hp, cyl, carb, qsec, disp, drat)]
#> 
#> --- Generation 3 / 5 (Current Best Fitness: 0.7157) ---
#>   Tested Individual 2 -> Fitness: 0.7147
#>   Tested Individual 3 -> Fitness: 0.7154
#>   Tested Individual 4 -> Fitness: 0.7154
#>   Tested Individual 5 -> Fitness: 0.7125
#>   Tested Individual 6 -> Fitness: 0.7154
#>   Tested Individual 7 -> Fitness: 0.7141
#>   Tested Individual 8 -> Fitness: 0.7147 (cached)
#>   Tested Individual 9 -> Fitness: 0.7154
#>   Tested Individual 10 -> Fitness: 0.7154
#>   Tested Individual 11 -> Fitness: 0.7157
#>   Tested Individual 12 -> Fitness: 0.7154
#>   Gen 3 Best Fitness: 0.7157
#>   Gen 3 Best Recipe: [pca1(hp, cyl, carb, qsec, disp, drat), pca2(hp, cyl, carb, qsec, disp, drat), pca3(hp, cyl, carb, qsec, disp, drat)]
#> 
#> --- Generation 4 / 5 (Current Best Fitness: 0.7157) ---
#>   Tested Individual 2 -> Fitness: 0.7154
#>   Tested Individual 3 -> Fitness: 0.7154
#>   Tested Individual 4 (New Best!) -> Fitness: 0.7160
#>   Tested Individual 5 -> Fitness: 0.7157
#>   Tested Individual 6 -> Fitness: 0.7160
#>   Tested Individual 7 -> Fitness: 0.7154
#>   Tested Individual 8 -> Fitness: 0.7101
#>   Tested Individual 9 -> Fitness: 0.7154
#>   Tested Individual 10 -> Fitness: 0.7154
#>   Tested Individual 11 (New Best!) -> Fitness: 0.7173
#>   Tested Individual 12 -> Fitness: 0.7154
#>   Tested Individual 13 -> Fitness: 0.7154
#>   Tested Individual 14 -> Fitness: 0.7157
#>   Tested Individual 15 -> Fitness: 0.7154
#>   Tested Individual 16 -> Fitness: 0.7154
#>   Tested Individual 17 -> Fitness: 0.7154
#>   Tested Individual 18 -> Fitness: 0.7154
#>   Gen 4 Best Fitness: 0.7173
#>   Gen 4 Best Recipe: [subtract(hp, qsec), deadwood(vs, wt, carb, hp, gear), pca1(qsec, hp, wt, carb, gear, drat), pca2(qsec, hp, wt, carb, gear, drat), pca3(qsec, hp, wt, carb, gear, drat), groupby_min(Deadwood(vs_wt_car_hp_gea), PCA2(qse_hp_wt_car_gea_dra)), groupby_max(Deadwood(vs_wt_car_hp_gea), drat), mst_score(mpg, gear, wt, qsec, disp, PCA1(qse_hp_wt_car_gea_dra), vs, hp, cyl, ((hp-qsec)))]
#> 
#> --- Generation 5 / 5 (Current Best Fitness: 0.7173) ---
#>   Tested Individual 2 -> Fitness: 0.7160
#>   Tested Individual 3 -> Fitness: 0.7115
#>   Tested Individual 4 -> Fitness: 0.7128
#>   Tested Individual 5 -> Fitness: 0.7160 (cached)
#>   Tested Individual 6 -> Fitness: 0.7154
#>   Tested Individual 7 -> Fitness: 0.7157
#>   Tested Individual 8 -> Fitness: 0.7154
#>   Tested Individual 9 -> Fitness: 0.7154
#>   Tested Individual 10 -> Fitness: 0.7161
#>   Tested Individual 11 -> Fitness: 0.7160
#>   Tested Individual 12 -> Fitness: 0.7154
#>   Gen 5 Best Fitness: 0.7173
#>   Gen 5 Best Recipe: [subtract(hp, qsec), deadwood(vs, wt, carb, hp, gear), pca1(qsec, hp, wt, carb, gear, drat), pca2(qsec, hp, wt, carb, gear, drat), pca3(qsec, hp, wt, carb, gear, drat), groupby_min(Deadwood(vs_wt_car_hp_gea), PCA2(qse_hp_wt_car_gea_dra)), groupby_max(Deadwood(vs_wt_car_hp_gea), drat), mst_score(mpg, gear, wt, qsec, disp, PCA1(qse_hp_wt_car_gea_dra), vs, hp, cyl, ((hp-qsec)))]
#> 
#> Evolution Complete. Best Fitness: 0.7173
#> Best recipe: [subtract(hp, qsec), deadwood(vs, wt, carb, hp, gear), pca1(qsec, hp, wt, carb, gear, drat), pca2(qsec, hp, wt, carb, gear, drat), pca3(qsec, hp, wt, carb, gear, drat), groupby_min(Deadwood(vs_wt_car_hp_gea), PCA2(qse_hp_wt_car_gea_dra)), groupby_max(Deadwood(vs_wt_car_hp_gea), drat), mst_score(mpg, gear, wt, qsec, disp, PCA1(qse_hp_wt_car_gea_dra), vs, hp, cyl, ((hp-qsec)))]
#> Generated columns: ((hp-qsec)), Deadwood(vs_wt_car_hp_gea), PCA1(qse_hp_wt_car_gea_dra), PCA2(qse_hp_wt_car_gea_dra), PCA3(qse_hp_wt_car_gea_dra), min_PCA2(qse_hp_wt_car_gea_dra)_by_Deadwood(vs_wt_car_hp_gea), max_drat_by_Deadwood(vs_wt_car_hp_gea), MSTScore(mpg_gea_wt_qse_dis_PCA_vs_hp_cyl_((h)
#> Training final model on full dataset...

The returned evo_recipe object contains the best individual (feature recipe), the fitted model, and the evolution history (the final-generation population; see Understanding the Output).

# Print high-level overview of the recipe
print(res)
#> An evoFE Recipe
#>   Evaluator:    xgboost
#>   Task:         classification
#>   Metric:       default
#>   Best Fitness: 0.7173
#>   Evolved Features: 8
#>   Winning Recipe:
#>     [1] subtract(hp, qsec) -> ((hp-qsec))
#>     [2] deadwood(vs, wt, carb, hp, gear) -> Deadwood(vs_wt_car_hp_gea)
#>     [3] pca1(qsec, hp, wt, carb, gear, drat) -> PCA1(qse_hp_wt_car_gea_dra)
#>     [4] pca2(qsec, hp, wt, carb, gear, drat) -> PCA2(qse_hp_wt_car_gea_dra)
#>     [5] pca3(qsec, hp, wt, carb, gear, drat) -> PCA3(qse_hp_wt_car_gea_dra)
#>     [6] groupby_min(Deadwood(vs_wt_car_hp_gea), PCA2(qse_hp_wt_car_gea_dra)) -> min_PCA2(qse_hp_wt_car_gea_dra)_by_Deadwood(vs_wt_car_hp_gea)
#>     [7] groupby_max(Deadwood(vs_wt_car_hp_gea), drat) -> max_drat_by_Deadwood(vs_wt_car_hp_gea)
#>     [8] mst_score(mpg, gear, wt, qsec, disp, PCA1(qse_hp_wt_car_gea_dra), vs, hp, cyl, ((hp-qsec))) -> MSTScore(mpg_gea_wt_qse_dis_PCA_vs_hp_cyl_((h)

# View a detailed structured summary
summary(res)
#> === Evolutionary Feature Engineering Summary ===
#> ML Evaluator:          xgboost
#> Task type:             classification
#> Optimization Metric:   default
#> Best CV/Split Fitness: 0.717339
#> Number of Evolved Features: 8
#> 
#> Feature Transformation Details:
#>  Transformer
#>     subtract
#>     deadwood
#>          pca
#>          pca
#>          pca
#>  groupby_min
#>  groupby_max
#>    mst_score
#>                                                                            Inputs
#>                                                                          hp, qsec
#>                                                            vs, wt, carb, hp, gear
#>                                                    qsec, hp, wt, carb, gear, drat
#>                                                    qsec, hp, wt, carb, gear, drat
#>                                                    qsec, hp, wt, carb, gear, drat
#>                           Deadwood(vs_wt_car_hp_gea), PCA2(qse_hp_wt_car_gea_dra)
#>                                                  Deadwood(vs_wt_car_hp_gea), drat
#>  mpg, gear, wt, qsec, disp, PCA1(qse_hp_wt_car_gea_dra), vs, hp, cyl, ((hp-qsec))
#>                                                         Output
#>                                                    ((hp-qsec))
#>                                     Deadwood(vs_wt_car_hp_gea)
#>                                    PCA1(qse_hp_wt_car_gea_dra)
#>                                    PCA2(qse_hp_wt_car_gea_dra)
#>                                    PCA3(qse_hp_wt_car_gea_dra)
#>  min_PCA2(qse_hp_wt_car_gea_dra)_by_Deadwood(vs_wt_car_hp_gea)
#>                         max_drat_by_Deadwood(vs_wt_car_hp_gea)
#>                 MSTScore(mpg_gea_wt_qse_dis_PCA_vs_hp_cyl_((h)

Applying the recipe to new data

predict() applies the evolved transformations to new data and returns the engineered feature matrix:

engineered <- predict(res, df[1:5, ])
head(engineered)
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs  gear  carb ((hp-qsec))
#>    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>       <num>
#> 1:  21.0     6   160   110  3.90 2.620 16.46     0     4     4       93.54
#> 2:  21.0     6   160   110  3.90 2.875 17.02     0     4     4       92.98
#> 3:  22.8     4   108    93  3.85 2.320 18.61     1     4     1       74.39
#> 4:  21.4     6   258   110  3.08 3.215 19.44     1     3     1       90.56
#> 5:  18.7     8   360   175  3.15 3.440 17.02     0     3     2      157.98
#>    Deadwood(vs_wt_car_hp_gea) PCA1(qse_hp_wt_car_gea_dra)
#>                        <fctr>                       <num>
#> 1:                          0                   0.3321325
#> 2:                          0                   0.3148183
#> 3:                          0                   1.7350342
#> 4:                          0                   0.4553966
#> 5:                          0                  -0.8244106
#>    PCA2(qse_hp_wt_car_gea_dra) PCA3(qse_hp_wt_car_gea_dra)
#>                          <num>                       <num>
#> 1:                   1.1946026                  -0.2309977
#> 2:                   0.9920645                   0.1044791
#> 3:                  -0.1232276                  -0.3980120
#> 4:                  -1.9228372                  -0.3054806
#> 5:                  -0.8976149                  -0.8669144
#>    min_PCA2(qse_hp_wt_car_gea_dra)_by_Deadwood(vs_wt_car_hp_gea)
#>                                                            <num>
#> 1:                                                     -2.439953
#> 2:                                                     -2.439953
#> 3:                                                     -2.439953
#> 4:                                                     -2.439953
#> 5:                                                     -2.439953
#>    max_drat_by_Deadwood(vs_wt_car_hp_gea)
#>                                     <num>
#> 1:                                   4.93
#> 2:                                   4.93
#> 3:                                   4.93
#> 4:                                   4.93
#> 5:                                   4.93
#>    MSTScore(mpg_gea_wt_qse_dis_PCA_vs_hp_cyl_((h)
#>                                             <num>
#> 1:                                       18.87167
#> 2:                                       32.78996
#> 3:                                       48.32824
#> 4:                                       74.43728
#> 5:                                       54.92558

predict_model() goes one step further — it applies the transformations and runs the trained model to produce predictions:

preds <- predict_model(res, df[1:5, ])
preds
#> [1] 0.95863724 0.95863724 0.90907621 0.03203617 0.03203617
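The printed values look like probabilities of class 1 (manual), since they are high for the first three cars and low for the last two. As a small sketch under that assumption, they can be turned into hard labels with a 0.5 threshold and compared with the known `am` values. The vector below is copied from the output above so the snippet runs without evoFE, and the 0.5 cut-off is a choice made here, not something the package prescribes.

```r
# Probabilities copied from the printed output above (rows 1-5 of mtcars)
preds <- c(0.95863724, 0.95863724, 0.90907621, 0.03203617, 0.03203617)

# Threshold at 0.5: 1 = manual, 0 = automatic (the coding used for `am`)
pred_class <- as.integer(preds >= 0.5)

# Compare against the true labels of the same five cars
data.frame(
  car       = rownames(mtcars)[1:5],
  actual    = as.integer(mtcars$am[1:5]),
  predicted = pred_class
)
```

All five labels match, but these rows were part of the training data, so this is a sanity check rather than an estimate of out-of-sample accuracy.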

Regression

Predict petal length from the iris dataset:

data(iris)

set.seed(123)
res_reg <- evolve_features(
  data       = iris[, c("Sepal.Length", "Sepal.Width", "Petal.Width", "Petal.Length")],
  target_col = "Petal.Length",
  task       = "regression",
  evaluator  = "xgboost",
  generations = 5,
  pop_size    = 8,
  cv_folds    = 3,
  early_stopping_rounds = 3,
  verbose     = TRUE
)
#> Starting Evolutionary Feature Engineering...
#>   Task: regression
#>   Evaluator: xgboost
#>   Generations: 5, Population Size: 8, CV Folds: 3
#>   Original Numeric columns: Sepal.Length, Sepal.Width, Petal.Width
#>   Original Categorical columns:
#> 
#> --- Generation 0 (Baseline) ---
#>   Individual 1: [Original features only]
#>   Tested Individual 1 -> Fitness: -0.3333
#> 
#> [Gen 1] Initialized Population:
#>   Individual 1: [Original features only]
#>   Individual 2: [log_binning_cat10(Sepal.Width), normalized_difference(Petal.Width, Petal.Width)]
#>   Individual 3: [log(Petal.Width), log_binning7(Sepal.Length)]
#>   Individual 4: [quantile_binning6(Petal.Width), reciprocal(Petal.Width)]
#>   Individual 5: [multiply(Sepal.Length, Petal.Width), normalized_difference(Sepal.Length, Petal.Width)]
#>   Individual 6: [pca1(Sepal.Width, Sepal.Length), pca2(Sepal.Width, Sepal.Length), pca3(Sepal.Width, Sepal.Length)]
#>   Individual 7: [log(Sepal.Width), sqrt(Petal.Width)]
#>   Individual 8: [reciprocal(Sepal.Width), truncated_svd1(Sepal.Width, Petal.Width), truncated_svd2(Sepal.Width, Petal.Width), truncated_svd3(Sepal.Width, Petal.Width)]
#> 
#> --- Generation 1 / 5 (Current Best Fitness: -0.3333) ---
#>   Tested Individual 2 -> Fitness: -0.3333
#>   Tested Individual 3 -> Fitness: -0.3333
#>   Tested Individual 4 -> Fitness: -0.3333
#>   Tested Individual 5 -> Fitness: -0.3818
#>   Tested Individual 6 -> Fitness: -0.3494
#>   Tested Individual 7 -> Fitness: -0.3333
#>   Tested Individual 8 (New Best!) -> Fitness: -0.3215
#>   Gen 1 Best Fitness: -0.3215
#>   Gen 1 Best Recipe: [reciprocal(Sepal.Width), truncated_svd1(Sepal.Width, Petal.Width), truncated_svd2(Sepal.Width, Petal.Width), truncated_svd3(Sepal.Width, Petal.Width)]
#> 
#> --- Generation 2 / 5 (Current Best Fitness: -0.3215) ---
#>   Tested Individual 2 -> Fitness: -0.3333
#>   Tested Individual 3 -> Fitness: -0.3333
#>   Tested Individual 4 -> Fitness: -0.3333
#>   Tested Individual 5 -> Fitness: -0.3215
#>   Tested Individual 6 -> Fitness: -0.3215
#>   Tested Individual 7 -> Fitness: -0.3215
#>   Tested Individual 8 -> Fitness: -0.3233
#>   Gen 2 Best Fitness: -0.3215
#>   Gen 2 Best Recipe: [reciprocal(Sepal.Width), truncated_svd1(Sepal.Width, Petal.Width), truncated_svd2(Sepal.Width, Petal.Width), truncated_svd3(Sepal.Width, Petal.Width)]
#> 
#> --- Generation 3 / 5 (Current Best Fitness: -0.3215) ---
#>   Tested Individual 2 -> Fitness: -0.3215
#>   Tested Individual 3 -> Fitness: -0.3215 (cached)
#>   Tested Individual 4 (New Best!) -> Fitness: -0.2951
#>   Tested Individual 5 -> Fitness: -0.3215
#>   Tested Individual 6 -> Fitness: -0.3233
#>   Tested Individual 7 -> Fitness: -0.3215
#>   Tested Individual 8 -> Fitness: -0.3215 (cached)
#>   Tested Individual 9 -> Fitness: -0.3215
#>   Tested Individual 10 -> Fitness: -0.3215
#>   Tested Individual 11 -> Fitness: -0.3215
#>   Tested Individual 12 -> Fitness: -0.3304
#>   Gen 3 Best Fitness: -0.2951
#>   Gen 3 Best Recipe: [log(Petal.Width), log_binning7(Sepal.Length), reciprocal(Sepal.Width), truncated_svd1(Sepal.Width, Petal.Width, Sepal.Length), truncated_svd2(Sepal.Width, Petal.Width), truncated_svd3(Sepal.Width, Petal.Width)]
#> 
#> --- Generation 4 / 5 (Current Best Fitness: -0.2951) ---
#>   Tested Individual 2 -> Fitness: -0.2984
#>   Tested Individual 3 -> Fitness: -0.3215
#>   Tested Individual 4 -> Fitness: -0.3215
#>   Tested Individual 5 -> Fitness: -0.3215 (cached)
#>   Tested Individual 6 -> Fitness: -0.3274
#>   Tested Individual 7 -> Fitness: -0.3215
#>   Tested Individual 8 -> Fitness: -0.3084
#>   Gen 4 Best Fitness: -0.2951
#>   Gen 4 Best Recipe: [log(Petal.Width), log_binning7(Sepal.Length), reciprocal(Sepal.Width), truncated_svd1(Sepal.Width, Petal.Width, Sepal.Length), truncated_svd2(Sepal.Width, Petal.Width), truncated_svd3(Sepal.Width, Petal.Width)]
#> 
#> --- Generation 5 / 5 (Current Best Fitness: -0.2951) ---
#>   Tested Individual 2 (New Best!) -> Fitness: -0.2949
#>   Tested Individual 3 -> Fitness: -0.3357
#>   Tested Individual 4 -> Fitness: -0.3042
#>   Tested Individual 5 -> Fitness: -0.3078
#>   Tested Individual 6 -> Fitness: -0.3362
#>   Tested Individual 7 -> Fitness: -0.3301
#>   Tested Individual 8 -> Fitness: -0.3046
#>   Tested Individual 11 -> Fitness: -0.2951
#>   Tested Individual 12 -> Fitness: -0.3084
#>   Gen 5 Best Fitness: -0.2949
#>   Gen 5 Best Recipe: [reciprocal(Sepal.Width), truncated_svd2(Sepal.Width, Petal.Width), truncated_svd3(Sepal.Width, Petal.Width), truncated_svd1(Sepal.Width, Petal.Width, Sepal.Length)]
#> 
#> Evolution Complete. Best Fitness: -0.2949
#> Best recipe: [reciprocal(Sepal.Width), truncated_svd2(Sepal.Width, Petal.Width), truncated_svd3(Sepal.Width, Petal.Width), truncated_svd1(Sepal.Width, Petal.Width, Sepal.Length)]
#> Generated columns: rec(Sepal.Width), SVD2(Sep_Pet), SVD3(Sep_Pet), SVD1(Sep_Pet_Sep)
#> Training final model on full dataset...

cat("Best recipe:", individual_to_recipe_string(res_reg$best_individual), "\n")
#> Best recipe: [reciprocal(Sepal.Width), truncated_svd2(Sepal.Width, Petal.Width), truncated_svd3(Sepal.Width, Petal.Width), truncated_svd1(Sepal.Width, Petal.Width, Sepal.Length)]
cat("Fitness (neg RMSE):", res_reg$best_individual$fitness, "\n")
#> Fitness (neg RMSE): -0.2948619

preds_reg <- predict_model(res_reg, iris[1:10, ])
# Compare predictions to actuals
data.frame(
  actual    = iris$Petal.Length[1:10],
  predicted = round(preds_reg, 2)
)
#>    actual predicted
#> 1     1.4      1.43
#> 2     1.4      1.46
#> 3     1.3      1.46
#> 4     1.5      1.49
#> 5     1.4      1.43
#> 6     1.7      1.53
#> 7     1.4      1.42
#> 8     1.5      1.49
#> 9     1.4      1.37
#> 10    1.5      1.51
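To put a number on the table above, the error on these ten rows can be computed by hand. The sketch below copies the rounded values from the printed output, so it runs without evoFE; the result is only approximate because of that rounding.

```r
# Values copied from the comparison table above
actual    <- c(1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5)
predicted <- c(1.43, 1.46, 1.46, 1.49, 1.43, 1.53, 1.42, 1.49, 1.37, 1.51)

# Root mean squared error and mean absolute error on these ten rows
rmse <- sqrt(mean((actual - predicted)^2))
mae  <- mean(abs(actual - predicted))
round(c(rmse = rmse, mae = mae), 3)
```

This in-sample error (roughly 0.08) is much smaller than the cross-validated RMSE of about 0.29 reported during evolution. The two are not comparable: the final model was trained on these rows, and all ten belong to a single species with very similar petal lengths.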

Multiclass Classification

Classify iris species (3 classes). Note task = "multiclass":

iris_mc <- iris
iris_mc$Species <- as.character(iris_mc$Species)

set.seed(99)
res_mc <- evolve_features(
  data       = iris_mc,
  target_col = "Species",
  task       = "multiclass",
  evaluator  = "xgboost",
  generations = 5,
  pop_size    = 8,
  cv_folds    = 3,
  early_stopping_rounds = 3,
  verbose     = TRUE
)
#> Starting Evolutionary Feature Engineering...
#>   Task: multiclass
#>   Evaluator: xgboost
#>   Generations: 5, Population Size: 8, CV Folds: 3
#>   Original Numeric columns: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
#>   Original Categorical columns:
#> 
#> --- Generation 0 (Baseline) ---
#>   Individual 1: [Original features only]
#>   Tested Individual 1 -> Fitness: 0.8382
#> 
#> [Gen 1] Initialized Population:
#>   Individual 1: [Original features only]
#>   Individual 2: [normalized_difference(Sepal.Length, Petal.Length), quantile_binning3(Petal.Length)]
#>   Individual 3: [mst_score(Sepal.Length, Petal.Width), log(Petal.Length)]
#>   Individual 4: [quantile_binning_cat10(Petal.Length), sqrt(Sepal.Width)]
#>   Individual 5: [divide(Petal.Width, Sepal.Width), reciprocal(Petal.Length)]
#>   Individual 6: [lumbermark_k3(Petal.Length, Sepal.Length), log_ratio(Sepal.Width, Petal.Length)]
#>   Individual 7: [sqrt(Sepal.Length), multiply(Sepal.Length, Petal.Length, Petal.Width)]
#>   Individual 8: [truncated_svd1(Petal.Width, Petal.Length), truncated_svd2(Petal.Width, Petal.Length), truncated_svd3(Petal.Width, Petal.Length)]
#> 
#> --- Generation 1 / 5 (Current Best Fitness: 0.8382) ---
#>   Tested Individual 2 -> Fitness: 0.8313
#>   Tested Individual 3 -> Fitness: 0.8382
#>   Tested Individual 4 -> Fitness: 0.8382
#>   Tested Individual 5 -> Fitness: 0.8351
#>   Tested Individual 6 (New Best!) -> Fitness: 0.8499
#>   Tested Individual 7 -> Fitness: 0.8315
#>   Tested Individual 8 -> Fitness: 0.8460
#>   Gen 1 Best Fitness: 0.8499
#>   Gen 1 Best Recipe: [lumbermark_k3(Petal.Length, Sepal.Length), log_ratio(Sepal.Width, Petal.Length)]
#> 
#> --- Generation 2 / 5 (Current Best Fitness: 0.8499) ---
#>   Tested Individual 2 -> Fitness: 0.8382 (cached)
#>   Tested Individual 3 (New Best!) -> Fitness: 0.8643
#>   Tested Individual 4 -> Fitness: 0.8420
#>   Tested Individual 5 -> Fitness: 0.8433
#>   Tested Individual 6 -> Fitness: 0.8420
#>   Tested Individual 7 -> Fitness: 0.8460
#>   Tested Individual 8 -> Fitness: 0.8425
#>   Gen 2 Best Fitness: 0.8643
#>   Gen 2 Best Recipe: [lumbermark_k3(Petal.Length, Sepal.Length), log_ratio(Sepal.Width, Petal.Length), truncated_svd1(Petal.Width, Petal.Length)]
#> 
#> --- Generation 3 / 5 (Current Best Fitness: 0.8643) ---
#>   Tested Individual 2 -> Fitness: 0.8489
#>   Tested Individual 3 -> Fitness: 0.8545
#>   Tested Individual 4 -> Fitness: 0.8460
#>   Tested Individual 5 -> Fitness: 0.8524
#>   Tested Individual 6 -> Fitness: 0.8489
#>   Tested Individual 7 (New Best!) -> Fitness: 0.8698
#>   Tested Individual 8 (New Best!) -> Fitness: 0.8714
#>   Gen 3 Best Fitness: 0.8714
#>   Gen 3 Best Recipe: [lumbermark_k3(Petal.Length, Sepal.Length), log_ratio(Sepal.Width, Petal.Length), truncated_svd1(Petal.Width, Petal.Length), truncated_svd1(SVD1(Pet_Pet), Petal.Length, Petal.Width), truncated_svd2(SVD1(Pet_Pet), Petal.Length, Petal.Width), truncated_svd3(SVD1(Pet_Pet), Petal.Length, Petal.Width)]
#> 
#> --- Generation 4 / 5 (Current Best Fitness: 0.8714) ---
#>   Tested Individual 2 -> Fitness: 0.8699
#>   Tested Individual 3 -> Fitness: 0.8698
#>   Tested Individual 4 -> Fitness: 0.8698
#>   Tested Individual 5 -> Fitness: 0.8714
#>   Tested Individual 6 -> Fitness: 0.8711
#>   Tested Individual 7 -> Fitness: 0.8667
#>   Tested Individual 8 -> Fitness: 0.8699
#>   Gen 4 Best Fitness: 0.8714
#>   Gen 4 Best Recipe: [lumbermark_k3(Petal.Length, Sepal.Length), log_ratio(Sepal.Width, Petal.Length), truncated_svd1(Petal.Width, Petal.Length), truncated_svd1(SVD1(Pet_Pet), Petal.Length, Petal.Width), truncated_svd2(SVD1(Pet_Pet), Petal.Length, Petal.Width), truncated_svd3(SVD1(Pet_Pet), Petal.Length, Petal.Width)]
#> 
#> --- Generation 5 / 5 (Current Best Fitness: 0.8714) ---
#>   Tested Individual 2 -> Fitness: 0.8629
#>   Tested Individual 3 -> Fitness: 0.8683
#>   Tested Individual 4 (New Best!) -> Fitness: 0.8720
#>   Tested Individual 5 -> Fitness: 0.8681
#>   Tested Individual 6 -> Fitness: 0.8716
#>   Tested Individual 7 -> Fitness: 0.8699 (cached)
#>   Tested Individual 8 -> Fitness: 0.8699
#>   Tested Individual 9 -> Fitness: 0.8700
#>   Tested Individual 10 -> Fitness: 0.8689
#>   Tested Individual 11 -> Fitness: 0.8714
#>   Tested Individual 12 -> Fitness: 0.8657
#>   Gen 5 Best Fitness: 0.8720
#>   Gen 5 Best Recipe: [lumbermark_k3(Petal.Length, Sepal.Length), log_ratio(Sepal.Width, Petal.Length), truncated_svd1(Petal.Width, Petal.Length), truncated_svd1(SVD1(Pet_Pet), Petal.Length, Petal.Width), truncated_svd2(SVD1(Pet_Pet), Petal.Length, Petal.Width), truncated_svd3(SVD1(Pet_Pet), Petal.Length, Petal.Width), quantile_binning_cat10(Petal.Length), sqrt(Sepal.Width), truncated_svd2(Petal.Width, Petal.Length), truncated_svd3(Petal.Width, Petal.Length), lumbermark_k3(Petal.Length, Sepal.Length, Petal.Width)]
#> 
#> Evolution Complete. Best Fitness: 0.8720
#> Best recipe: [lumbermark_k3(Petal.Length, Sepal.Length), log_ratio(Sepal.Width, Petal.Length), truncated_svd1(Petal.Width, Petal.Length), truncated_svd1(SVD1(Pet_Pet), Petal.Length, Petal.Width), truncated_svd2(SVD1(Pet_Pet), Petal.Length, Petal.Width), truncated_svd3(SVD1(Pet_Pet), Petal.Length, Petal.Width), quantile_binning_cat10(Petal.Length), sqrt(Sepal.Width), truncated_svd2(Petal.Width, Petal.Length), truncated_svd3(Petal.Width, Petal.Length), lumbermark_k3(Petal.Length, Sepal.Length, Petal.Width)]
#> Generated columns: Lumb3(Pet_Sep), logratio(Sepal.Width_Petal.Length), SVD1(Pet_Pet), SVD1(SVD_Pet_Pet), SVD2(SVD_Pet_Pet), SVD3(SVD_Pet_Pet), qbin_cat10(Petal.Length), sqrt(Sepal.Width), SVD2(Pet_Pet), SVD3(Pet_Pet), Lumb3(Pet_Sep_Pet)
#> Training final model on full dataset...

cat("Best recipe:", individual_to_recipe_string(res_mc$best_individual), "\n")
#> Best recipe: [lumbermark_k3(Petal.Length, Sepal.Length), log_ratio(Sepal.Width, Petal.Length), truncated_svd1(Petal.Width, Petal.Length), truncated_svd1(SVD1(Pet_Pet), Petal.Length, Petal.Width), truncated_svd2(SVD1(Pet_Pet), Petal.Length, Petal.Width), truncated_svd3(SVD1(Pet_Pet), Petal.Length, Petal.Width), quantile_binning_cat10(Petal.Length), sqrt(Sepal.Width), truncated_svd2(Petal.Width, Petal.Length), truncated_svd3(Petal.Width, Petal.Length), lumbermark_k3(Petal.Length, Sepal.Length, Petal.Width)]

For multiclass, predict_model() returns a probability matrix — one column per class:

probs <- predict_model(res_mc, iris_mc[c(1, 51, 101), ])
round(probs, 3)
#>      setosa versicolor virginica
#> [1,]  0.987      0.007     0.006
#> [2,]  0.293      0.394     0.313
#> [3,]  0.005      0.007     0.988
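A probability matrix like this is easy to collapse into predicted labels by taking the most probable column in each row. The sketch below rebuilds the rounded matrix from the output above so it runs on its own.

```r
# Rounded probabilities copied from the output above
probs <- matrix(
  c(0.987, 0.007, 0.006,
    0.293, 0.394, 0.313,
    0.005, 0.007, 0.988),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("setosa", "versicolor", "virginica"))
)

# Most probable class per row (ties broken by taking the first column)
pred_label <- colnames(probs)[max.col(probs, ties.method = "first")]
pred_label

# Probability assigned to the winning class, a rough confidence indicator
confidence <- apply(probs, 1, max)
confidence
```

Rows 1, 51 and 101 of iris are one flower of each species, and all three come out with the right label. The second row wins with a probability of only 0.394, though, so the label alone hides how uncertain that prediction is.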

Transformer Reference

evoFE ships with 32 built-in transformers that the genetic algorithm can select from during evolution. The tables below group them by category.

Arithmetic (numeric → numeric)

Transformer	Arity	Description
`log`	unary	Natural logarithm (safe: `log(abs(x) + 1)`)
`sqrt`	unary	Square root (safe: `sqrt(abs(x))`)
`reciprocal`	unary	`1 / (x + ε)`
`add`	multi	Element-wise sum of 2+ columns
`subtract`	binary	`x₁ − x₂`
`multiply`	multi	Element-wise product of 2+ columns
`divide`	binary	`x₁ / (x₂ + ε)`
`normalized_difference`	binary	`(x₁ − x₂) / (x₁ + x₂ + ε)`
`log_ratio`	binary	`log((x₁ + ε) / (x₂ + ε))`
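The "safe" formulas in this table are simple enough to reproduce in base R. The functions below are a sketch written from the formulas above, not evoFE's source code, and the value of `eps` is an assumption because the actual ε is not stated here.

```r
eps <- 1e-6  # assumed value; the ε used by evoFE is not given in this document

safe_log   <- function(x) log(abs(x) + 1)
safe_sqrt  <- function(x) sqrt(abs(x))
reciprocal <- function(x) 1 / (x + eps)
norm_diff  <- function(x1, x2) (x1 - x2) / (x1 + x2 + eps)
log_ratio  <- function(x1, x2) log((x1 + eps) / (x2 + eps))

x1 <- c(0, 4, -9)
x2 <- c(0, 4, 3)

safe_log(x1)         # finite at zero and for negative input
safe_sqrt(x1)        # 0 2 3
reciprocal(x2)       # large but finite at zero, instead of Inf
norm_diff(x1, x2)    # first element is 0 / eps = 0, not NaN
log_ratio(c(1, 4), c(1, 2))
```

The point of the `abs()` wrappers and the ε terms is that a transformation never produces `NaN` or `Inf` just because a column happens to contain zeros or negative values.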

Group-by aggregations (mixed → numeric)

These combine a categorical grouping column with a numeric value column.

Transformer	Description
`groupby_mean`	Mean of value within each group
`groupby_sd`	Standard deviation within each group
`groupby_max` / `groupby_min`	Max / min within each group
`groupby_ratio`	`value / group_mean`
`groupby_zscore`	`(value − group_mean) / group_sd`
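As an illustration of what these aggregations compute, here is a base R sketch using `ave()` on a tiny made-up data frame. It is not evoFE's implementation; in particular, how the package handles single-row groups (where the standard deviation is undefined) is not covered.

```r
# Tiny made-up example: one grouping column, one numeric value column
df <- data.frame(
  group = c("a", "a", "a", "b", "b"),
  value = c(1, 2, 3, 10, 20)
)

# ave() returns the per-group statistic repeated for every row of the group
g_mean <- ave(df$value, df$group, FUN = mean)
g_sd   <- ave(df$value, df$group, FUN = sd)
g_max  <- ave(df$value, df$group, FUN = max)

df$groupby_mean   <- g_mean
df$groupby_max    <- g_max
df$groupby_ratio  <- df$value / g_mean
df$groupby_zscore <- (df$value - g_mean) / g_sd
df
```

In the `predict()` output earlier, `max_drat_by_Deadwood(...)` is 4.93 for all five rows even though the largest `drat` among those five rows is 3.90. That suggests the group statistics are learned on the training data and reused at prediction time, rather than recomputed on the new rows as this sketch does.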

Encoding & binning

Transformer	Input → Output	Description
`target_encode`	cat → num	Supervised mean-target encoding with smoothing (for binary classification / regression)
`target_encode_multiclass`	cat → num	Supervised mean-target encoding for multiclass classification tasks (one component per class-indicator)
`frequency_encode`	cat → num	Proportion of each category in the data
`one_hot_encode`	cat → num	Binary one-hot encoding indicator for a specific category (or “other” for rare categories)
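`quantile_binning`	num → num	Assign quantile bin index (e.g. 1–5 with five bins; the suffix in names such as `quantile_binning6` appears to give the bin count)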
`log_binning`	num → num	Assign log-scale bin index
`quantile_binning_cat`	num → cat	Same as `quantile_binning`, output as factor
`log_binning_cat`	num → cat	Same as `log_binning`, output as factor
`datetime_extract`	cat → num	Extracted datetime components (year, month, day, hour, day of week, or weekend indicator) from date/time columns
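Two of these encodings are shown below in base R to make the descriptions concrete. This is a sketch of the general techniques, not evoFE's code; `quantile_bin()` is a hypothetical helper and the data are made up.

```r
# Frequency encoding: replace each category by its share of the rows
cat_col <- c("red", "blue", "red", "green", "red", "blue")
freq    <- table(cat_col) / length(cat_col)
freq_encoded <- as.numeric(freq[cat_col])
freq_encoded

# Quantile binning: assign each value to one of k equal-frequency bins
# (hypothetical helper; unique() guards against duplicated break points)
quantile_bin <- function(x, k) {
  breaks <- unique(quantile(x, probs = seq(0, 1, length.out = k + 1)))
  cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
}

num_col <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
bins <- quantile_bin(num_col, k = 5)
bins

# The "_cat" variants return the same bins as a factor instead of a number
factor(bins)
```

Returning the bin as a number keeps its ordering available to the model, while the factor version lets the bin act as a grouping column for the group-by transformers.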

Dimensionality reduction (numeric → numeric)

Transformer	Description
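`pca`	Principal components of 2+ columns (one gene per component; the runs above emit `pca1` to `pca3`)
`truncated_svd`	Components from a truncated SVD of 2+ columns (one gene per component; the runs above emit `truncated_svd1` to `truncated_svd3`)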
`random_projection`	Random linear combination of 2+ columns
`umap`	Low-dimensional UMAP projection
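For intuition, here is what a PCA-style feature looks like when built by hand with `stats::prcomp()`, using the same six mtcars columns as the winning recipe in the Quick Start. Whether evoFE centres and scales the inputs in the same way is an assumption, so the values need not match the package's output, and the sign of each component is arbitrary.

```r
cols <- c("qsec", "hp", "wt", "carb", "gear", "drat")

# Fit the rotation on the training data (centred and scaled: an assumption)
pca_fit <- prcomp(mtcars[, cols], center = TRUE, scale. = TRUE)

# First three components for the training rows
train_scores <- pca_fit$x[, 1:3]

# Reuse the fitted rotation on new rows instead of refitting
new_scores <- predict(pca_fit, newdata = mtcars[1:5, cols])[, 1:3]
round(new_scores, 3)
```

The important design point is the last step: the rotation is fitted once and then applied to new data, which is what makes such a feature usable at prediction time.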

Manifold & Graph Learning (numeric → categorical/numeric)

Transformer	Output	Description
`genie`	categorical	Genie robust hierarchical clustering
`lumbermark`	categorical	Lumbermark hierarchical clustering
`mst_score`	numeric	Minimum Spanning Tree-based anomaly score
`deadwood`	categorical	Deadwood anomaly detection (outlier indicators)

Hierarchical Features (Gene Chaining)

One of evoFE’s most powerful capabilities is hierarchical feature construction. After a gene has been evaluated and proven useful, subsequent generations can build on top of its output.

For example:

Gen 1: log_ratio(Sepal.Length, Petal.Width)        → tested ✓
Gen 2: divide(Petal.Width, logratio(…))            → chains from tested gene ✓

Important safety rule: a gene can only chain from outputs that have been evaluated in a previous generation. A brand-new untested gene is never used as input for another gene in the same individual. This prevents fragile dependency chains built on unproven transformations.
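The safety rule can be pictured as a filter on the columns a new gene is allowed to read. The sketch below is purely illustrative: `eligible_inputs()` and the `tested` flag are hypothetical names invented for this example and are not part of evoFE's API.

```r
# Hypothetical helper: which columns may a new gene use as input?
# Original columns are always allowed; gene outputs only once they are tested.
eligible_inputs <- function(original_cols, genes) {
  tested <- Filter(function(g) isTRUE(g$tested), genes)
  c(original_cols, vapply(tested, function(g) g$output_col, character(1)))
}

# Two made-up genes: one evaluated in an earlier generation, one brand new
genes <- list(
  list(output_col = "logratio(Sepal.Length_Petal.Width)", tested = TRUE),
  list(output_col = "sqrt(Sepal.Width)",                  tested = FALSE)
)

inputs <- eligible_inputs(c("Sepal.Length", "Petal.Width"), genes)
inputs
```

Only the output of the already-evaluated gene joins the pool of inputs; the untested `sqrt(Sepal.Width)` has to survive an evaluation first.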

Custom Transformer Registration

evoFE makes it easy to register your own custom transformations, extending the genetic algorithm’s vocabulary with domain-specific features.

Use create_transformer() to define your transformer, and register it with register_transformer() to make it available during evolution:

library(evoFE)

# 1. Define a transformer that adds 5 to a numeric variable
add_five_trans <- create_transformer(
  name = "add_five",
  type = "unary",
  input_type = "numeric",
  apply_func = function(data, gene, state = NULL) {
    data[[gene$input_cols[1]]] + 5
  },
  name_generator = function(gene) paste0("add5_", gene$input_cols[1])
)

# 2. Register it with the package registry
register_transformer("add_five", add_five_trans)

# Now, "add_five" is part of the active transformer pool and will
# be automatically selected, mutated, and chained during evolution!
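Before registering a transformer, it can help to try its two functions on their own. The sketch below repeats the `apply_func` and `name_generator` bodies from the example above and calls them with a hand-built list standing in for a gene. The only thing assumed about a real gene is the `input_cols` field that the functions themselves read.

```r
# Same function bodies as in the add_five example above
add_five_apply <- function(data, gene, state = NULL) {
  data[[gene$input_cols[1]]] + 5
}
add_five_name <- function(gene) paste0("add5_", gene$input_cols[1])

# Hand-built stand-in for a gene (a real gene carries more fields)
mock_gene <- list(input_cols = "hp")

out <- add_five_apply(mtcars, mock_gene)

# The result should be numeric, one value per row, and equal to hp + 5
stopifnot(
  is.numeric(out),
  length(out) == nrow(mtcars),
  all(out == mtcars$hp + 5)
)

add_five_name(mock_gene)
#> [1] "add5_hp"
```

A check like this catches the most common mistakes (wrong column lookup, wrong output length) before the transformer is ever picked up by the evolutionary search.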

Understanding the Output

evolve_features() returns an evo_recipe S3 object with:

Field	Description
`best_individual`	The winning recipe (list of genes, column sets, fitness)
`best_model`	The final LightGBM/XGBoost model trained on all data
`history`	Full final-generation population (for inspection)
`task`	The task type used
`evaluator`	The evaluator used
`classes`	Class labels (multiclass only)

Inspecting the recipe

ind <- res$best_individual

# Human-readable recipe string
cat(individual_to_recipe_string(ind), "\n")
#> [subtract(hp, qsec), deadwood(vs, wt, carb, hp, gear), pca1(qsec, hp, wt, carb, gear, drat), pca2(qsec, hp, wt, carb, gear, drat), pca3(qsec, hp, wt, carb, gear, drat), groupby_min(Deadwood(vs_wt_car_hp_gea), PCA2(qse_hp_wt_car_gea_dra)), groupby_max(Deadwood(vs_wt_car_hp_gea), drat), mst_score(mpg, gear, wt, qsec, disp, PCA1(qse_hp_wt_car_gea_dra), vs, hp, cyl, ((hp-qsec)))]

# Number of evolved genes
cat("Evolved genes:", length(ind$genes), "\n")
#> Evolved genes: 8

# Original columns retained
cat("Numeric cols: ", paste(ind$numeric_cols, collapse = ", "), "\n")
#> Numeric cols:  mpg, cyl, disp, hp, drat, wt, qsec, vs, gear, carb
cat("Categorical cols:", paste(ind$categorical_cols, collapse = ", "), "\n")
#> Categorical cols:

# Individual gene details
for (g in ind$genes) {
  cat(sprintf("  %s(%s) → %s\n",
    g$transformer_name,
    paste(g$input_cols, collapse = ", "),
    g$output_col))
}
#>   subtract(hp, qsec) → ((hp-qsec))
#>   deadwood(vs, wt, carb, hp, gear) → Deadwood(vs_wt_car_hp_gea)
#>   pca(qsec, hp, wt, carb, gear, drat) → PCA1(qse_hp_wt_car_gea_dra)
#>   pca(qsec, hp, wt, carb, gear, drat) → PCA2(qse_hp_wt_car_gea_dra)
#>   pca(qsec, hp, wt, carb, gear, drat) → PCA3(qse_hp_wt_car_gea_dra)
#>   groupby_min(Deadwood(vs_wt_car_hp_gea), PCA2(qse_hp_wt_car_gea_dra)) → min_PCA2(qse_hp_wt_car_gea_dra)_by_Deadwood(vs_wt_car_hp_gea)
#>   groupby_max(Deadwood(vs_wt_car_hp_gea), drat) → max_drat_by_Deadwood(vs_wt_car_hp_gea)
#>   mst_score(mpg, gear, wt, qsec, disp, PCA1(qse_hp_wt_car_gea_dra), vs, hp, cyl, ((hp-qsec))) → MSTScore(mpg_gea_wt_qse_dis_PCA_vs_hp_cyl_((h)

Evaluation Strategies

evoFE supports two evaluation strategies for scoring individuals:

Cross-Validation (cv): The default strategy. Evaluates the fitness of individuals using K-fold cross-validation, with K set by the cv_folds parameter.
Train/Validation/Holdout Split (split): Useful for faster evaluation on larger datasets. You configure it with evaluation_strategy = "split" and split_ratio (e.g., c(0.6, 0.2, 0.2)).
- The first two portions of split_ratio are used as the Train and Validation sets to score the candidate recipes during the evolutionary search.
- The third portion (if provided) is the Holdout set. To prevent data leakage/snooping and optimize computation time, the holdout set is only evaluated once at the very end of evolution on the final selected best individual.
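To make the three portions concrete, here is a base R sketch of a 60/20/20 split of row indices. It only illustrates the idea; how evoFE actually draws the split (for example, whether it stratifies by target) is not described here.

```r
set.seed(42)
split_ratio <- c(0.6, 0.2, 0.2)
n <- nrow(iris)

# Shuffle the row indices once, then cut them into three consecutive blocks
idx     <- sample(n)
n_train <- round(split_ratio[1] * n)
n_valid <- round(split_ratio[2] * n)

train_idx   <- idx[seq_len(n_train)]            # fits candidate models
valid_idx   <- idx[n_train + seq_len(n_valid)]  # scores candidate recipes
holdout_idx <- idx[(n_train + n_valid + 1):n]   # touched once, at the very end

c(train = length(train_idx),
  valid = length(valid_idx),
  holdout = length(holdout_idx))
```

With evoFE itself, the equivalent configuration would be to pass evaluation_strategy = "split" and split_ratio = c(0.6, 0.2, 0.2) to evolve_features(), as described above.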

Alternative and Custom Metrics

By default, evoFE optimizes for LogLoss on classification tasks and RMSE on regression tasks. However, you can optimize for other metrics by passing the metric parameter to evolve_features():

Binary Classification: "default" (LogLoss), "auc" (Area Under the ROC Curve), or "f1" (F1-score at a 0.5 probability threshold).
Multiclass Classification: "default" (Multiclass LogLoss) or "auc" (One-vs-Rest macro-averaged AUC).
Regression: "default" (RMSE) or "mae" (Mean Absolute Error).
Custom Metrics: You can pass any custom function of the form function(y_true, y_pred) that returns a numeric value. Note: Since the genetic algorithm always maximizes fitness, ensure your custom metric returns a value where higher is better (e.g., negate error metrics).
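As a sketch of the custom-metric form described above, here is a negated median absolute error for regression. The function name `neg_medae` is made up for this example. What `y_pred` contains for classification tasks (probabilities or labels) is not stated here, so the example sticks to regression.

```r
# Custom metric: median absolute error, negated so that higher is better
neg_medae <- function(y_true, y_pred) {
  -median(abs(y_true - y_pred))
}

y_true <- c(1.4, 1.3, 1.5, 1.7)
close  <- c(1.5, 1.3, 1.4, 1.6)   # small errors
far    <- c(2.4, 0.3, 2.5, 0.7)   # large errors

# The better predictions must receive the higher (less negative) score
neg_medae(y_true, close)
neg_medae(y_true, far)
```

A function like this would then be passed as the metric argument of evolve_features(), for example metric = neg_medae. Checking on a couple of hand-made vectors that better predictions score higher is a cheap way to avoid accidentally asking the algorithm to maximise an error.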