evoFE (Evolutionary Feature Engineering) uses a genetic algorithm to automatically discover useful feature transformations for tabular data. Instead of manually crafting interaction terms, ratios, or binning strategies, you let evolution explore the space of possible transformations and keep the ones that improve predictive performance.
The result is an evo_recipe — a reusable transformation pipeline that can be applied to new data at prediction time.
Let’s classify whether a car has an automatic or manual transmission using the mtcars dataset.
library(evoFE)
data(mtcars)
df <- mtcars
df$am <- as.integer(df$am) # target: 0 = automatic, 1 = manual
set.seed(42)
res <- evolve_features(
data = df,
target_col = "am",
task = "classification",
evaluator = "xgboost",
generations = 5,
pop_size = 8,
cv_folds = 3,
early_stopping_rounds = 3,
verbose = TRUE
)
#> Starting Evolutionary Feature Engineering...
#> Task: classification
#> Evaluator: xgboost
#> Generations: 5, Population Size: 8, CV Folds: 3
#> Original Numeric columns: mpg, cyl, disp, hp, drat, wt, qsec, vs, gear, carb
#> Original Categorical columns:
#>
#> --- Generation 0 (Baseline) ---
#> Individual 1: [Original features only]
#> Tested Individual 1 -> Fitness: 0.7147
#>
#> [Gen 1] Initialized Population:
#> Individual 1: [Original features only]
#> Individual 2: [subtract(hp, qsec), deadwood(vs, wt, carb, hp, gear)]
#> Individual 3: [normalized_difference(drat, carb), reciprocal(vs)]
#> Individual 4: [log_ratio(cyl, gear), log_binning_cat7(cyl)]
#> Individual 5: [log(wt), normalized_difference(gear, drat)]
#> Individual 6: [pca1(qsec, hp, wt, carb, gear, drat), pca2(qsec, hp, wt, carb, gear, drat), pca3(qsec, hp, wt, carb, gear, drat)]
#> Individual 7: [log_binning6(qsec), truncated_svd1(drat, vs, hp, disp, wt, gear), truncated_svd2(drat, vs, hp, disp, wt, gear), truncated_svd3(drat, vs, hp, disp, wt, gear)]
#> Individual 8: [pca1(hp, cyl, carb, qsec, disp, drat), pca2(hp, cyl, carb, qsec, disp, drat), pca3(hp, cyl, carb, qsec, disp, drat)]
#>
#> --- Generation 1 / 5 (Current Best Fitness: 0.7147) ---
#> Tested Individual 2 -> Fitness: 0.7147
#> Tested Individual 3 -> Fitness: 0.7147
#> Tested Individual 4 -> Fitness: 0.7147
#> Tested Individual 5 -> Fitness: 0.7031
#> Tested Individual 6 (New Best!) -> Fitness: 0.7154
#> Tested Individual 7 -> Fitness: 0.7103
#> Tested Individual 8 (New Best!) -> Fitness: 0.7157
#> Gen 1 Best Fitness: 0.7157
#> Gen 1 Best Recipe: [pca1(hp, cyl, carb, qsec, disp, drat), pca2(hp, cyl, carb, qsec, disp, drat), pca3(hp, cyl, carb, qsec, disp, drat)]
#>
#> --- Generation 2 / 5 (Current Best Fitness: 0.7157) ---
#> Tested Individual 2 -> Fitness: 0.7154
#> Tested Individual 3 -> Fitness: 0.7154
#> Tested Individual 4 -> Fitness: 0.7147
#> Tested Individual 5 -> Fitness: 0.7147
#> Tested Individual 6 -> Fitness: 0.7154
#> Tested Individual 7 -> Fitness: 0.7154
#> Tested Individual 8 -> Fitness: 0.7141
#> Gen 2 Best Fitness: 0.7157
#> Gen 2 Best Recipe: [pca1(hp, cyl, carb, qsec, disp, drat), pca2(hp, cyl, carb, qsec, disp, drat), pca3(hp, cyl, carb, qsec, disp, drat)]
#>
#> --- Generation 3 / 5 (Current Best Fitness: 0.7157) ---
#> Tested Individual 2 -> Fitness: 0.7147
#> Tested Individual 3 -> Fitness: 0.7154
#> Tested Individual 4 -> Fitness: 0.7154
#> Tested Individual 5 -> Fitness: 0.7125
#> Tested Individual 6 -> Fitness: 0.7154
#> Tested Individual 7 -> Fitness: 0.7141
#> Tested Individual 8 -> Fitness: 0.7147 (cached)
#> Tested Individual 9 -> Fitness: 0.7154
#> Tested Individual 10 -> Fitness: 0.7154
#> Tested Individual 11 -> Fitness: 0.7157
#> Tested Individual 12 -> Fitness: 0.7154
#> Gen 3 Best Fitness: 0.7157
#> Gen 3 Best Recipe: [pca1(hp, cyl, carb, qsec, disp, drat), pca2(hp, cyl, carb, qsec, disp, drat), pca3(hp, cyl, carb, qsec, disp, drat)]
#>
#> --- Generation 4 / 5 (Current Best Fitness: 0.7157) ---
#> Tested Individual 2 -> Fitness: 0.7154
#> Tested Individual 3 -> Fitness: 0.7154
#> Tested Individual 4 (New Best!) -> Fitness: 0.7160
#> Tested Individual 5 -> Fitness: 0.7157
#> Tested Individual 6 -> Fitness: 0.7160
#> Tested Individual 7 -> Fitness: 0.7154
#> Tested Individual 8 -> Fitness: 0.7101
#> Tested Individual 9 -> Fitness: 0.7154
#> Tested Individual 10 -> Fitness: 0.7154
#> Tested Individual 11 (New Best!) -> Fitness: 0.7173
#> Tested Individual 12 -> Fitness: 0.7154
#> Tested Individual 13 -> Fitness: 0.7154
#> Tested Individual 14 -> Fitness: 0.7157
#> Tested Individual 15 -> Fitness: 0.7154
#> Tested Individual 16 -> Fitness: 0.7154
#> Tested Individual 17 -> Fitness: 0.7154
#> Tested Individual 18 -> Fitness: 0.7154
#> Gen 4 Best Fitness: 0.7173
#> Gen 4 Best Recipe: [subtract(hp, qsec), deadwood(vs, wt, carb, hp, gear), pca1(qsec, hp, wt, carb, gear, drat), pca2(qsec, hp, wt, carb, gear, drat), pca3(qsec, hp, wt, carb, gear, drat), groupby_min(Deadwood(vs_wt_car_hp_gea), PCA2(qse_hp_wt_car_gea_dra)), groupby_max(Deadwood(vs_wt_car_hp_gea), drat), mst_score(mpg, gear, wt, qsec, disp, PCA1(qse_hp_wt_car_gea_dra), vs, hp, cyl, ((hp-qsec)))]
#>
#> --- Generation 5 / 5 (Current Best Fitness: 0.7173) ---
#> Tested Individual 2 -> Fitness: 0.7160
#> Tested Individual 3 -> Fitness: 0.7115
#> Tested Individual 4 -> Fitness: 0.7128
#> Tested Individual 5 -> Fitness: 0.7160 (cached)
#> Tested Individual 6 -> Fitness: 0.7154
#> Tested Individual 7 -> Fitness: 0.7157
#> Tested Individual 8 -> Fitness: 0.7154
#> Tested Individual 9 -> Fitness: 0.7154
#> Tested Individual 10 -> Fitness: 0.7161
#> Tested Individual 11 -> Fitness: 0.7160
#> Tested Individual 12 -> Fitness: 0.7154
#> Gen 5 Best Fitness: 0.7173
#> Gen 5 Best Recipe: [subtract(hp, qsec), deadwood(vs, wt, carb, hp, gear), pca1(qsec, hp, wt, carb, gear, drat), pca2(qsec, hp, wt, carb, gear, drat), pca3(qsec, hp, wt, carb, gear, drat), groupby_min(Deadwood(vs_wt_car_hp_gea), PCA2(qse_hp_wt_car_gea_dra)), groupby_max(Deadwood(vs_wt_car_hp_gea), drat), mst_score(mpg, gear, wt, qsec, disp, PCA1(qse_hp_wt_car_gea_dra), vs, hp, cyl, ((hp-qsec)))]
#>
#> Evolution Complete. Best Fitness: 0.7173
#> Best recipe: [subtract(hp, qsec), deadwood(vs, wt, carb, hp, gear), pca1(qsec, hp, wt, carb, gear, drat), pca2(qsec, hp, wt, carb, gear, drat), pca3(qsec, hp, wt, carb, gear, drat), groupby_min(Deadwood(vs_wt_car_hp_gea), PCA2(qse_hp_wt_car_gea_dra)), groupby_max(Deadwood(vs_wt_car_hp_gea), drat), mst_score(mpg, gear, wt, qsec, disp, PCA1(qse_hp_wt_car_gea_dra), vs, hp, cyl, ((hp-qsec)))]
#> Generated columns: ((hp-qsec)), Deadwood(vs_wt_car_hp_gea), PCA1(qse_hp_wt_car_gea_dra), PCA2(qse_hp_wt_car_gea_dra), PCA3(qse_hp_wt_car_gea_dra), min_PCA2(qse_hp_wt_car_gea_dra)_by_Deadwood(vs_wt_car_hp_gea), max_drat_by_Deadwood(vs_wt_car_hp_gea), MSTScore(mpg_gea_wt_qse_dis_PCA_vs_hp_cyl_((h)
#> Training final model on full dataset...
The returned evo_recipe object contains the best individual (feature recipe), the fitted model, and the evolution history.
# Print high-level overview of the recipe
print(res)
#> An evoFE Recipe
#> Evaluator: xgboost
#> Task: classification
#> Metric: default
#> Best Fitness: 0.7173
#> Evolved Features: 8
#> Winning Recipe:
#> [1] subtract(hp, qsec) -> ((hp-qsec))
#> [2] deadwood(vs, wt, carb, hp, gear) -> Deadwood(vs_wt_car_hp_gea)
#> [3] pca1(qsec, hp, wt, carb, gear, drat) -> PCA1(qse_hp_wt_car_gea_dra)
#> [4] pca2(qsec, hp, wt, carb, gear, drat) -> PCA2(qse_hp_wt_car_gea_dra)
#> [5] pca3(qsec, hp, wt, carb, gear, drat) -> PCA3(qse_hp_wt_car_gea_dra)
#> [6] groupby_min(Deadwood(vs_wt_car_hp_gea), PCA2(qse_hp_wt_car_gea_dra)) -> min_PCA2(qse_hp_wt_car_gea_dra)_by_Deadwood(vs_wt_car_hp_gea)
#> [7] groupby_max(Deadwood(vs_wt_car_hp_gea), drat) -> max_drat_by_Deadwood(vs_wt_car_hp_gea)
#> [8] mst_score(mpg, gear, wt, qsec, disp, PCA1(qse_hp_wt_car_gea_dra), vs, hp, cyl, ((hp-qsec))) -> MSTScore(mpg_gea_wt_qse_dis_PCA_vs_hp_cyl_((h)
# View a detailed structured summary
summary(res)
#> === Evolutionary Feature Engineering Summary ===
#> ML Evaluator: xgboost
#> Task type: classification
#> Optimization Metric: default
#> Best CV/Split Fitness: 0.717339
#> Number of Evolved Features: 8
#>
#> Feature Transformation Details:
#> Transformer
#> subtract
#> deadwood
#> pca
#> pca
#> pca
#> groupby_min
#> groupby_max
#> mst_score
#> Inputs
#> hp, qsec
#> vs, wt, carb, hp, gear
#> qsec, hp, wt, carb, gear, drat
#> qsec, hp, wt, carb, gear, drat
#> qsec, hp, wt, carb, gear, drat
#> Deadwood(vs_wt_car_hp_gea), PCA2(qse_hp_wt_car_gea_dra)
#> Deadwood(vs_wt_car_hp_gea), drat
#> mpg, gear, wt, qsec, disp, PCA1(qse_hp_wt_car_gea_dra), vs, hp, cyl, ((hp-qsec))
#> Output
#> ((hp-qsec))
#> Deadwood(vs_wt_car_hp_gea)
#> PCA1(qse_hp_wt_car_gea_dra)
#> PCA2(qse_hp_wt_car_gea_dra)
#> PCA3(qse_hp_wt_car_gea_dra)
#> min_PCA2(qse_hp_wt_car_gea_dra)_by_Deadwood(vs_wt_car_hp_gea)
#> max_drat_by_Deadwood(vs_wt_car_hp_gea)
#> MSTScore(mpg_gea_wt_qse_dis_PCA_vs_hp_cyl_((h)
predict() applies the evolved transformations to new data and returns the engineered feature matrix:
engineered <- predict(res, df[1:5, ])
head(engineered)
#> mpg cyl disp hp drat wt qsec vs gear carb ((hp-qsec))
#> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 21.0 6 160 110 3.90 2.620 16.46 0 4 4 93.54
#> 2: 21.0 6 160 110 3.90 2.875 17.02 0 4 4 92.98
#> 3: 22.8 4 108 93 3.85 2.320 18.61 1 4 1 74.39
#> 4: 21.4 6 258 110 3.08 3.215 19.44 1 3 1 90.56
#> 5: 18.7 8 360 175 3.15 3.440 17.02 0 3 2 157.98
#> Deadwood(vs_wt_car_hp_gea) PCA1(qse_hp_wt_car_gea_dra)
#> <fctr> <num>
#> 1: 0 0.3321325
#> 2: 0 0.3148183
#> 3: 0 1.7350342
#> 4: 0 0.4553966
#> 5: 0 -0.8244106
#> PCA2(qse_hp_wt_car_gea_dra) PCA3(qse_hp_wt_car_gea_dra)
#> <num> <num>
#> 1: 1.1946026 -0.2309977
#> 2: 0.9920645 0.1044791
#> 3: -0.1232276 -0.3980120
#> 4: -1.9228372 -0.3054806
#> 5: -0.8976149 -0.8669144
#> min_PCA2(qse_hp_wt_car_gea_dra)_by_Deadwood(vs_wt_car_hp_gea)
#> <num>
#> 1: -2.439953
#> 2: -2.439953
#> 3: -2.439953
#> 4: -2.439953
#> 5: -2.439953
#> max_drat_by_Deadwood(vs_wt_car_hp_gea)
#> <num>
#> 1: 4.93
#> 2: 4.93
#> 3: 4.93
#> 4: 4.93
#> 5: 4.93
#> MSTScore(mpg_gea_wt_qse_dis_PCA_vs_hp_cyl_((h)
#> <num>
#> 1: 18.87167
#> 2: 32.78996
#> 3: 48.32824
#> 4: 74.43728
#> 5: 54.92558
predict_model() goes one step further: it applies the transformations and runs the trained model to produce predictions.
Predict petal length from the iris dataset:
data(iris)
set.seed(123)
res_reg <- evolve_features(
data = iris[, c("Sepal.Length", "Sepal.Width", "Petal.Width", "Petal.Length")],
target_col = "Petal.Length",
task = "regression",
evaluator = "xgboost",
generations = 5,
pop_size = 8,
cv_folds = 3,
early_stopping_rounds = 3,
verbose = TRUE
)
#> Starting Evolutionary Feature Engineering...
#> Task: regression
#> Evaluator: xgboost
#> Generations: 5, Population Size: 8, CV Folds: 3
#> Original Numeric columns: Sepal.Length, Sepal.Width, Petal.Width
#> Original Categorical columns:
#>
#> --- Generation 0 (Baseline) ---
#> Individual 1: [Original features only]
#> Tested Individual 1 -> Fitness: -0.3333
#>
#> [Gen 1] Initialized Population:
#> Individual 1: [Original features only]
#> Individual 2: [log_binning_cat10(Sepal.Width), normalized_difference(Petal.Width, Petal.Width)]
#> Individual 3: [log(Petal.Width), log_binning7(Sepal.Length)]
#> Individual 4: [quantile_binning6(Petal.Width), reciprocal(Petal.Width)]
#> Individual 5: [multiply(Sepal.Length, Petal.Width), normalized_difference(Sepal.Length, Petal.Width)]
#> Individual 6: [pca1(Sepal.Width, Sepal.Length), pca2(Sepal.Width, Sepal.Length), pca3(Sepal.Width, Sepal.Length)]
#> Individual 7: [log(Sepal.Width), sqrt(Petal.Width)]
#> Individual 8: [reciprocal(Sepal.Width), truncated_svd1(Sepal.Width, Petal.Width), truncated_svd2(Sepal.Width, Petal.Width), truncated_svd3(Sepal.Width, Petal.Width)]
#>
#> --- Generation 1 / 5 (Current Best Fitness: -0.3333) ---
#> Tested Individual 2 -> Fitness: -0.3333
#> Tested Individual 3 -> Fitness: -0.3333
#> Tested Individual 4 -> Fitness: -0.3333
#> Tested Individual 5 -> Fitness: -0.3818
#> Tested Individual 6 -> Fitness: -0.3494
#> Tested Individual 7 -> Fitness: -0.3333
#> Tested Individual 8 (New Best!) -> Fitness: -0.3215
#> Gen 1 Best Fitness: -0.3215
#> Gen 1 Best Recipe: [reciprocal(Sepal.Width), truncated_svd1(Sepal.Width, Petal.Width), truncated_svd2(Sepal.Width, Petal.Width), truncated_svd3(Sepal.Width, Petal.Width)]
#>
#> --- Generation 2 / 5 (Current Best Fitness: -0.3215) ---
#> Tested Individual 2 -> Fitness: -0.3333
#> Tested Individual 3 -> Fitness: -0.3333
#> Tested Individual 4 -> Fitness: -0.3333
#> Tested Individual 5 -> Fitness: -0.3215
#> Tested Individual 6 -> Fitness: -0.3215
#> Tested Individual 7 -> Fitness: -0.3215
#> Tested Individual 8 -> Fitness: -0.3233
#> Gen 2 Best Fitness: -0.3215
#> Gen 2 Best Recipe: [reciprocal(Sepal.Width), truncated_svd1(Sepal.Width, Petal.Width), truncated_svd2(Sepal.Width, Petal.Width), truncated_svd3(Sepal.Width, Petal.Width)]
#>
#> --- Generation 3 / 5 (Current Best Fitness: -0.3215) ---
#> Tested Individual 2 -> Fitness: -0.3215
#> Tested Individual 3 -> Fitness: -0.3215 (cached)
#> Tested Individual 4 (New Best!) -> Fitness: -0.2951
#> Tested Individual 5 -> Fitness: -0.3215
#> Tested Individual 6 -> Fitness: -0.3233
#> Tested Individual 7 -> Fitness: -0.3215
#> Tested Individual 8 -> Fitness: -0.3215 (cached)
#> Tested Individual 9 -> Fitness: -0.3215
#> Tested Individual 10 -> Fitness: -0.3215
#> Tested Individual 11 -> Fitness: -0.3215
#> Tested Individual 12 -> Fitness: -0.3304
#> Gen 3 Best Fitness: -0.2951
#> Gen 3 Best Recipe: [log(Petal.Width), log_binning7(Sepal.Length), reciprocal(Sepal.Width), truncated_svd1(Sepal.Width, Petal.Width, Sepal.Length), truncated_svd2(Sepal.Width, Petal.Width), truncated_svd3(Sepal.Width, Petal.Width)]
#>
#> --- Generation 4 / 5 (Current Best Fitness: -0.2951) ---
#> Tested Individual 2 -> Fitness: -0.2984
#> Tested Individual 3 -> Fitness: -0.3215
#> Tested Individual 4 -> Fitness: -0.3215
#> Tested Individual 5 -> Fitness: -0.3215 (cached)
#> Tested Individual 6 -> Fitness: -0.3274
#> Tested Individual 7 -> Fitness: -0.3215
#> Tested Individual 8 -> Fitness: -0.3084
#> Gen 4 Best Fitness: -0.2951
#> Gen 4 Best Recipe: [log(Petal.Width), log_binning7(Sepal.Length), reciprocal(Sepal.Width), truncated_svd1(Sepal.Width, Petal.Width, Sepal.Length), truncated_svd2(Sepal.Width, Petal.Width), truncated_svd3(Sepal.Width, Petal.Width)]
#>
#> --- Generation 5 / 5 (Current Best Fitness: -0.2951) ---
#> Tested Individual 2 (New Best!) -> Fitness: -0.2949
#> Tested Individual 3 -> Fitness: -0.3357
#> Tested Individual 4 -> Fitness: -0.3042
#> Tested Individual 5 -> Fitness: -0.3078
#> Tested Individual 6 -> Fitness: -0.3362
#> Tested Individual 7 -> Fitness: -0.3301
#> Tested Individual 8 -> Fitness: -0.3046
#> Tested Individual 11 -> Fitness: -0.2951
#> Tested Individual 12 -> Fitness: -0.3084
#> Gen 5 Best Fitness: -0.2949
#> Gen 5 Best Recipe: [reciprocal(Sepal.Width), truncated_svd2(Sepal.Width, Petal.Width), truncated_svd3(Sepal.Width, Petal.Width), truncated_svd1(Sepal.Width, Petal.Width, Sepal.Length)]
#>
#> Evolution Complete. Best Fitness: -0.2949
#> Best recipe: [reciprocal(Sepal.Width), truncated_svd2(Sepal.Width, Petal.Width), truncated_svd3(Sepal.Width, Petal.Width), truncated_svd1(Sepal.Width, Petal.Width, Sepal.Length)]
#> Generated columns: rec(Sepal.Width), SVD2(Sep_Pet), SVD3(Sep_Pet), SVD1(Sep_Pet_Sep)
#> Training final model on full dataset...
cat("Best recipe:", individual_to_recipe_string(res_reg$best_individual), "\n")
#> Best recipe: [reciprocal(Sepal.Width), truncated_svd2(Sepal.Width, Petal.Width), truncated_svd3(Sepal.Width, Petal.Width), truncated_svd1(Sepal.Width, Petal.Width, Sepal.Length)]
cat("Fitness (neg RMSE):", res_reg$best_individual$fitness, "\n")
#> Fitness (neg RMSE): -0.2948619
preds_reg <- predict_model(res_reg, iris[1:10, ])
# Compare predictions to actuals
data.frame(
actual = iris$Petal.Length[1:10],
predicted = round(preds_reg, 2)
)
#> actual predicted
#> 1 1.4 1.43
#> 2 1.4 1.46
#> 3 1.3 1.46
#> 4 1.5 1.49
#> 5 1.4 1.43
#> 6 1.7 1.53
#> 7 1.4 1.42
#> 8 1.5 1.49
#> 9 1.4 1.37
#> 10 1.5 1.51
Classify iris species (3 classes). Note task = "multiclass":
iris_mc <- iris
iris_mc$Species <- as.character(iris_mc$Species)
set.seed(99)
res_mc <- evolve_features(
data = iris_mc,
target_col = "Species",
task = "multiclass",
evaluator = "xgboost",
generations = 5,
pop_size = 8,
cv_folds = 3,
early_stopping_rounds = 3,
verbose = TRUE
)
#> Starting Evolutionary Feature Engineering...
#> Task: multiclass
#> Evaluator: xgboost
#> Generations: 5, Population Size: 8, CV Folds: 3
#> Original Numeric columns: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
#> Original Categorical columns:
#>
#> --- Generation 0 (Baseline) ---
#> Individual 1: [Original features only]
#> Tested Individual 1 -> Fitness: 0.8382
#>
#> [Gen 1] Initialized Population:
#> Individual 1: [Original features only]
#> Individual 2: [normalized_difference(Sepal.Length, Petal.Length), quantile_binning3(Petal.Length)]
#> Individual 3: [mst_score(Sepal.Length, Petal.Width), log(Petal.Length)]
#> Individual 4: [quantile_binning_cat10(Petal.Length), sqrt(Sepal.Width)]
#> Individual 5: [divide(Petal.Width, Sepal.Width), reciprocal(Petal.Length)]
#> Individual 6: [lumbermark_k3(Petal.Length, Sepal.Length), log_ratio(Sepal.Width, Petal.Length)]
#> Individual 7: [sqrt(Sepal.Length), multiply(Sepal.Length, Petal.Length, Petal.Width)]
#> Individual 8: [truncated_svd1(Petal.Width, Petal.Length), truncated_svd2(Petal.Width, Petal.Length), truncated_svd3(Petal.Width, Petal.Length)]
#>
#> --- Generation 1 / 5 (Current Best Fitness: 0.8382) ---
#> Tested Individual 2 -> Fitness: 0.8313
#> Tested Individual 3 -> Fitness: 0.8382
#> Tested Individual 4 -> Fitness: 0.8382
#> Tested Individual 5 -> Fitness: 0.8351
#> Tested Individual 6 (New Best!) -> Fitness: 0.8499
#> Tested Individual 7 -> Fitness: 0.8315
#> Tested Individual 8 -> Fitness: 0.8460
#> Gen 1 Best Fitness: 0.8499
#> Gen 1 Best Recipe: [lumbermark_k3(Petal.Length, Sepal.Length), log_ratio(Sepal.Width, Petal.Length)]
#>
#> --- Generation 2 / 5 (Current Best Fitness: 0.8499) ---
#> Tested Individual 2 -> Fitness: 0.8382 (cached)
#> Tested Individual 3 (New Best!) -> Fitness: 0.8643
#> Tested Individual 4 -> Fitness: 0.8420
#> Tested Individual 5 -> Fitness: 0.8433
#> Tested Individual 6 -> Fitness: 0.8420
#> Tested Individual 7 -> Fitness: 0.8460
#> Tested Individual 8 -> Fitness: 0.8425
#> Gen 2 Best Fitness: 0.8643
#> Gen 2 Best Recipe: [lumbermark_k3(Petal.Length, Sepal.Length), log_ratio(Sepal.Width, Petal.Length), truncated_svd1(Petal.Width, Petal.Length)]
#>
#> --- Generation 3 / 5 (Current Best Fitness: 0.8643) ---
#> Tested Individual 2 -> Fitness: 0.8489
#> Tested Individual 3 -> Fitness: 0.8545
#> Tested Individual 4 -> Fitness: 0.8460
#> Tested Individual 5 -> Fitness: 0.8524
#> Tested Individual 6 -> Fitness: 0.8489
#> Tested Individual 7 (New Best!) -> Fitness: 0.8698
#> Tested Individual 8 (New Best!) -> Fitness: 0.8714
#> Gen 3 Best Fitness: 0.8714
#> Gen 3 Best Recipe: [lumbermark_k3(Petal.Length, Sepal.Length), log_ratio(Sepal.Width, Petal.Length), truncated_svd1(Petal.Width, Petal.Length), truncated_svd1(SVD1(Pet_Pet), Petal.Length, Petal.Width), truncated_svd2(SVD1(Pet_Pet), Petal.Length, Petal.Width), truncated_svd3(SVD1(Pet_Pet), Petal.Length, Petal.Width)]
#>
#> --- Generation 4 / 5 (Current Best Fitness: 0.8714) ---
#> Tested Individual 2 -> Fitness: 0.8699
#> Tested Individual 3 -> Fitness: 0.8698
#> Tested Individual 4 -> Fitness: 0.8698
#> Tested Individual 5 -> Fitness: 0.8714
#> Tested Individual 6 -> Fitness: 0.8711
#> Tested Individual 7 -> Fitness: 0.8667
#> Tested Individual 8 -> Fitness: 0.8699
#> Gen 4 Best Fitness: 0.8714
#> Gen 4 Best Recipe: [lumbermark_k3(Petal.Length, Sepal.Length), log_ratio(Sepal.Width, Petal.Length), truncated_svd1(Petal.Width, Petal.Length), truncated_svd1(SVD1(Pet_Pet), Petal.Length, Petal.Width), truncated_svd2(SVD1(Pet_Pet), Petal.Length, Petal.Width), truncated_svd3(SVD1(Pet_Pet), Petal.Length, Petal.Width)]
#>
#> --- Generation 5 / 5 (Current Best Fitness: 0.8714) ---
#> Tested Individual 2 -> Fitness: 0.8629
#> Tested Individual 3 -> Fitness: 0.8683
#> Tested Individual 4 (New Best!) -> Fitness: 0.8720
#> Tested Individual 5 -> Fitness: 0.8681
#> Tested Individual 6 -> Fitness: 0.8716
#> Tested Individual 7 -> Fitness: 0.8699 (cached)
#> Tested Individual 8 -> Fitness: 0.8699
#> Tested Individual 9 -> Fitness: 0.8700
#> Tested Individual 10 -> Fitness: 0.8689
#> Tested Individual 11 -> Fitness: 0.8714
#> Tested Individual 12 -> Fitness: 0.8657
#> Gen 5 Best Fitness: 0.8720
#> Gen 5 Best Recipe: [lumbermark_k3(Petal.Length, Sepal.Length), log_ratio(Sepal.Width, Petal.Length), truncated_svd1(Petal.Width, Petal.Length), truncated_svd1(SVD1(Pet_Pet), Petal.Length, Petal.Width), truncated_svd2(SVD1(Pet_Pet), Petal.Length, Petal.Width), truncated_svd3(SVD1(Pet_Pet), Petal.Length, Petal.Width), quantile_binning_cat10(Petal.Length), sqrt(Sepal.Width), truncated_svd2(Petal.Width, Petal.Length), truncated_svd3(Petal.Width, Petal.Length), lumbermark_k3(Petal.Length, Sepal.Length, Petal.Width)]
#>
#> Evolution Complete. Best Fitness: 0.8720
#> Best recipe: [lumbermark_k3(Petal.Length, Sepal.Length), log_ratio(Sepal.Width, Petal.Length), truncated_svd1(Petal.Width, Petal.Length), truncated_svd1(SVD1(Pet_Pet), Petal.Length, Petal.Width), truncated_svd2(SVD1(Pet_Pet), Petal.Length, Petal.Width), truncated_svd3(SVD1(Pet_Pet), Petal.Length, Petal.Width), quantile_binning_cat10(Petal.Length), sqrt(Sepal.Width), truncated_svd2(Petal.Width, Petal.Length), truncated_svd3(Petal.Width, Petal.Length), lumbermark_k3(Petal.Length, Sepal.Length, Petal.Width)]
#> Generated columns: Lumb3(Pet_Sep), logratio(Sepal.Width_Petal.Length), SVD1(Pet_Pet), SVD1(SVD_Pet_Pet), SVD2(SVD_Pet_Pet), SVD3(SVD_Pet_Pet), qbin_cat10(Petal.Length), sqrt(Sepal.Width), SVD2(Pet_Pet), SVD3(Pet_Pet), Lumb3(Pet_Sep_Pet)
#> Training final model on full dataset...
cat("Best recipe:", individual_to_recipe_string(res_mc$best_individual), "\n")
#> Best recipe: [lumbermark_k3(Petal.Length, Sepal.Length), log_ratio(Sepal.Width, Petal.Length), truncated_svd1(Petal.Width, Petal.Length), truncated_svd1(SVD1(Pet_Pet), Petal.Length, Petal.Width), truncated_svd2(SVD1(Pet_Pet), Petal.Length, Petal.Width), truncated_svd3(SVD1(Pet_Pet), Petal.Length, Petal.Width), quantile_binning_cat10(Petal.Length), sqrt(Sepal.Width), truncated_svd2(Petal.Width, Petal.Length), truncated_svd3(Petal.Width, Petal.Length), lumbermark_k3(Petal.Length, Sepal.Length, Petal.Width)]
For multiclass, predict_model() returns a probability matrix with one column per class.
evoFE ships with 32 built-in transformers that the genetic algorithm can select from during evolution. The table below groups them by category.
| Transformer | Arity | Description |
|---|---|---|
| log | unary | Natural logarithm (safe: log(abs(x) + 1)) |
| sqrt | unary | Square root (safe: sqrt(abs(x))) |
| reciprocal | unary | 1 / (x + ε) |
| add | multi | Element-wise sum of 2+ columns |
| subtract | binary | x₁ − x₂ |
| multiply | multi | Element-wise product of 2+ columns |
| divide | binary | x₁ / (x₂ + ε) |
| normalized_difference | binary | (x₁ − x₂) / (x₁ + x₂ + ε) |
| log_ratio | binary | log((x₁ + ε) / (x₂ + ε)) |
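As a rough illustration of the "safe" formulas in the table, the base-R sketch below reimplements a few of them by hand. It is not evoFE's source code: the helper names and the value of eps are assumptions made for this example only.

```r
# Hand-written versions of some formulas from the table above.
# Helper names and eps are assumptions, not part of the evoFE API.
eps <- 1e-8

safe_log        <- function(x) log(abs(x) + 1)   # defined for zero and negative x
safe_sqrt       <- function(x) sqrt(abs(x))      # defined for negative x
safe_reciprocal <- function(x) 1 / (x + eps)     # finite when x == 0
norm_diff       <- function(x1, x2) (x1 - x2) / (x1 + x2 + eps)
log_ratio       <- function(x1, x2) log((x1 + eps) / (x2 + eps))

x1 <- c(-4, 0, 9)
x2 <- c(2, 0, 3)

safe_log(x1)
safe_sqrt(x1)
safe_reciprocal(x2)
norm_diff(x1, x2)
log_ratio(c(1, 4), c(1, 2))
```

The point of the abs() and ε terms is that each transform returns a finite value on inputs (zeros, negatives) where the plain function would produce NaN or Inf.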
These combine a categorical grouping column with a numeric value column.
| Transformer | Description |
|---|---|
| groupby_mean | Mean of value within each group |
| groupby_sd | Standard deviation within each group |
| groupby_max / groupby_min | Max / min within each group |
| groupby_ratio | value / group_mean |
| groupby_zscore | (value − group_mean) / group_sd |
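The group-by features above can be pictured with base R's ave(), which broadcasts a per-group statistic back to every row. This is a sketch of the idea on a toy data frame, not evoFE's implementation; the data and column names are made up for the example.

```r
# Group-wise features on a toy data frame, computed with stats::ave().
toy <- data.frame(
  group = c("a", "a", "a", "b", "b", "b"),
  value = c(1, 2, 3, 10, 20, 30)
)

group_mean <- ave(toy$value, toy$group, FUN = mean)
group_sd   <- ave(toy$value, toy$group, FUN = sd)

toy$groupby_mean   <- group_mean
toy$groupby_max    <- ave(toy$value, toy$group, FUN = max)
toy$groupby_ratio  <- toy$value / group_mean
toy$groupby_zscore <- (toy$value - group_mean) / group_sd

toy
```

Note that a group with a single row has an undefined standard deviation in this sketch; how evoFE treats that case is not covered here.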
| Transformer | Input → Output | Description |
|---|---|---|
| target_encode | cat → num | Supervised mean-target encoding with smoothing (for binary classification / regression) |
| target_encode_multiclass | cat → num | Supervised mean-target encoding for multiclass classification tasks (one component per class-indicator) |
| frequency_encode | cat → num | Proportion of each category in the data |
| one_hot_encode | cat → num | Binary one-hot encoding indicator for a specific category (or “other” for rare categories) |
| quantile_binning | num → num | Assign quantile rank (1–5) |
| log_binning | num → num | Assign log-scale bin index |
| quantile_binning_cat | num → cat | Same as quantile_binning, output as factor |
| log_binning_cat | num → cat | Same as log_binning, output as factor |
| datetime_extract | cat → num | Extracted datetime components (year, month, day, hour, day of week, or weekend indicator) from date/time columns |
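Two of the simpler entries in the table, frequency encoding and quantile binning, can be sketched in a few lines of base R. The toy vectors are invented for the example, and the bin count is fixed at 5 to match the table; whether evoFE computes its bins exactly this way is an assumption.

```r
# Frequency encoding: replace each category by its share of the rows.
colour      <- c("red", "blue", "red", "green", "red", "blue")
freq        <- table(colour) / length(colour)
colour_freq <- as.numeric(freq[colour])
colour_freq

# Quantile binning into 5 bins: cut at empirical quantiles, keep the bin index.
x      <- c(3, 8, 1, 15, 7, 22, 4, 11, 9, 30)
breaks <- quantile(x, probs = seq(0, 1, length.out = 6))
x_bin  <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
x_bin

# The "_cat" variants in the table return the same bins as a factor.
x_bin_cat <- factor(x_bin)
levels(x_bin_cat)
```

Keeping the bin index numeric preserves the ordering of the bins, while the factor version lets a downstream step treat each bin as an unordered category.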
| Transformer | Description |
|---|---|
| pca | First principal component of 2+ columns |
| truncated_svd | First component from truncated SVD |
| random_projection | Random linear combination of 2+ columns |
| umap | Low-dimensional UMAP projection |
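To make the pca row concrete, here is a minimal sketch using stats::prcomp() on three mtcars columns. The choice of columns and the centering and scaling settings are assumptions for this example; evoFE's own preprocessing may differ.

```r
# First principal component of a few mtcars columns via stats::prcomp().
cols <- c("hp", "wt", "qsec")
fit  <- prcomp(mtcars[, cols], center = TRUE, scale. = TRUE)

# Scores on the data used for fitting
pc1_train <- fit$x[, 1]

# The same rotation applied to "new" rows through predict()
pc1_new <- predict(fit, newdata = mtcars[1:5, cols])[, 1]

head(pc1_train)
pc1_new
```

Storing the fitted rotation and reapplying it with predict() mirrors the idea of a recipe that is learned once and then applied to new data at prediction time.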
| Transformer | Output | Description |
|---|---|---|
| genie | categorical | Genie robust hierarchical clustering |
| lumbermark | categorical | Lumbermark hierarchical clustering |
| mst_score | numeric | Minimum Spanning Tree-based anomaly score |
| deadwood | categorical | Deadwood anomaly detection (outlier indicators) |
One of evoFE’s powerful capabilities is hierarchical feature construction. After a gene has been evaluated and proven useful, subsequent generations can build on top of its output.
For example:
Gen 1: log_ratio(Sepal.Length, Petal.Width) → tested ✓
Gen 2: divide(Petal.Width, logratio(…)) → chains from tested gene ✓
Important safety rule: a gene can only chain from outputs that have been evaluated in a previous generation. A brand-new untested gene is never used as input for another gene in the same individual. This prevents fragile dependency chains built on unproven transformations.
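The safety rule can be pictured as a filter on which columns a new gene may take as input. The sketch below is purely illustrative: the function name eligible_inputs(), the evaluated_in field, and the list layout are hypothetical and are not taken from evoFE's internals.

```r
# Hypothetical bookkeeping for the chaining rule (not evoFE internals):
# a gene's output is a valid input only if the gene was evaluated in an
# earlier generation than the current one.
eligible_inputs <- function(original_cols, genes, current_gen) {
  tested <- vapply(
    genes,
    function(g) !is.na(g$evaluated_in) && g$evaluated_in < current_gen,
    logical(1)
  )
  outputs <- vapply(genes[tested], function(g) g$output_col, character(1))
  c(original_cols, outputs)
}

genes <- list(
  list(output_col = "logratio(Sepal.Length_Petal.Width)", evaluated_in = 1L),
  list(output_col = "sqrt(Sepal.Width)", evaluated_in = NA_integer_)  # untested
)

pool <- eligible_inputs(c("Sepal.Length", "Petal.Width"), genes, current_gen = 2L)
pool
```

In this toy setup the untested sqrt gene is excluded from the pool, while the log-ratio output evaluated in generation 1 becomes available to generation 2.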
evoFE makes it easy to register your own custom
transformations, extending the genetic algorithm’s vocabulary with
domain-specific features.
Use create_transformer() to define your transformer, and
register it with register_transformer() to make it
available during evolution:
library(evoFE)
# 1. Define a transformer that adds 5 to a numeric variable
add_five_trans <- create_transformer(
name = "add_five",
type = "unary",
input_type = "numeric",
apply_func = function(data, gene, state = NULL) {
data[[gene$input_cols[1]]] + 5
},
name_generator = function(gene) paste0("add5_", gene$input_cols[1])
)
# 2. Register it with the package registry
register_transformer("add_five", add_five_trans)
# Now, "add_five" is part of the active transformer pool and will
# be automatically selected, mutated, and chained during evolution!
evolve_features() returns an evo_recipe S3 object with:
| Field | Description |
|---|---|
| best_individual | The winning recipe (list of genes, column sets, fitness) |
| best_model | The final LightGBM/XGBoost model trained on all data |
| history | Full final-generation population (for inspection) |
| task | The task type used |
| evaluator | The evaluator used |
| classes | Class labels (multiclass only) |
ind <- res$best_individual
# Human-readable recipe string
cat(individual_to_recipe_string(ind), "\n")
#> [subtract(hp, qsec), deadwood(vs, wt, carb, hp, gear), pca1(qsec, hp, wt, carb, gear, drat), pca2(qsec, hp, wt, carb, gear, drat), pca3(qsec, hp, wt, carb, gear, drat), groupby_min(Deadwood(vs_wt_car_hp_gea), PCA2(qse_hp_wt_car_gea_dra)), groupby_max(Deadwood(vs_wt_car_hp_gea), drat), mst_score(mpg, gear, wt, qsec, disp, PCA1(qse_hp_wt_car_gea_dra), vs, hp, cyl, ((hp-qsec)))]
# Number of evolved genes
cat("Evolved genes:", length(ind$genes), "\n")
#> Evolved genes: 8
# Original columns retained
cat("Numeric cols: ", paste(ind$numeric_cols, collapse = ", "), "\n")
#> Numeric cols: mpg, cyl, disp, hp, drat, wt, qsec, vs, gear, carb
cat("Categorical cols:", paste(ind$categorical_cols, collapse = ", "), "\n")
#> Categorical cols:
# Individual gene details
for (g in ind$genes) {
cat(sprintf(" %s(%s) → %s\n",
g$transformer_name,
paste(g$input_cols, collapse = ", "),
g$output_col))
}
#> subtract(hp, qsec) → ((hp-qsec))
#> deadwood(vs, wt, carb, hp, gear) → Deadwood(vs_wt_car_hp_gea)
#> pca(qsec, hp, wt, carb, gear, drat) → PCA1(qse_hp_wt_car_gea_dra)
#> pca(qsec, hp, wt, carb, gear, drat) → PCA2(qse_hp_wt_car_gea_dra)
#> pca(qsec, hp, wt, carb, gear, drat) → PCA3(qse_hp_wt_car_gea_dra)
#> groupby_min(Deadwood(vs_wt_car_hp_gea), PCA2(qse_hp_wt_car_gea_dra)) → min_PCA2(qse_hp_wt_car_gea_dra)_by_Deadwood(vs_wt_car_hp_gea)
#> groupby_max(Deadwood(vs_wt_car_hp_gea), drat) → max_drat_by_Deadwood(vs_wt_car_hp_gea)
#> mst_score(mpg, gear, wt, qsec, disp, PCA1(qse_hp_wt_car_gea_dra), vs, hp, cyl, ((hp-qsec))) → MSTScore(mpg_gea_wt_qse_dis_PCA_vs_hp_cyl_((h)
evoFE supports two evaluation strategies for scoring individuals:
- Cross-validation (cv): The default strategy. Evaluates the fitness of individuals using K-fold cross-validation (cv_folds parameter).
- Train/val/holdout split (split): Useful for faster evaluation on larger datasets. You configure it with evaluation_strategy = "split" and split_ratio (e.g., c(0.6, 0.2, 0.2)). The Train and Validation portions defined by split_ratio are used to score the candidate recipes during the evolutionary search.
By default, evoFE optimizes for LogLoss on classification tasks and RMSE on regression tasks. However, you can optimize for other metrics by passing the metric parameter to evolve_features():
- Classification: "default" (LogLoss), "auc" (Area Under the ROC Curve), or "f1" (F1-score at a 0.5 probability threshold).
- Multiclass: "default" (Multiclass LogLoss) or "auc" (One-vs-Rest macro-averaged AUC).
- Regression: "default" (RMSE) or "mae" (Mean Absolute Error).
- Custom: a function(y_true, y_pred) that returns a numeric value. Note: Since the genetic algorithm always maximizes fitness, ensure your custom metric returns a value where higher is better (e.g., negate error metrics).
For example:
# Evolve features optimizing for Area Under the ROC Curve (AUC)
recipe_auc <- evolve_features(
data = df, target_col = "am", task = "classification",
metric = "auc", generations = 5, pop_size = 8
)
# Evolve features using a custom regression metric (e.g. Mean Absolute Percentage Error, negated)
mape_metric <- function(y_true, y_pred) {
-mean(abs((y_true - y_pred) / y_true))
}
recipe_mape <- evolve_features(
data = iris[, 1:5], target_col = "Petal.Length", task = "regression",
metric = mape_metric, generations = 5, pop_size = 8
)
Parameters of evolve_features():
| Parameter | Default | Description |
|---|---|---|
| generations | 10 | Maximum number of evolutionary generations |
| pop_size | 10 | Number of individuals per generation |
| evaluation_strategy | "cv" | Evaluation method: "cv" (cross-validation) or "split" (train/val/holdout split) |
| cv_folds | 3 | Cross-validation folds for fitness evaluation (only used if strategy is "cv") |
| split_ratio | c(0.6, 0.2, 0.2) | Proportions for Train/Val/Holdout split (only used if strategy is "split") |
| split_ids | NULL | Optional user-defined vector of split assignments ("train", "val", "holdout") |
| early_stopping_rounds | 3 | Stop if no improvement for n generations |
| evaluator | "lightgbm" | Model backend: "lightgbm" or "xgboost" |
| dynamic_population | TRUE | Expand population during stagnation |
| crossover_type | "both" | "random", "union", or "both" |
| threads | 2 | Parallelism for model training |
| seed | NULL | RNG seed for reproducibility |
| metric | "default" | Optimization metric: "default", "auc", "f1", "mae", or a custom function |
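The split_ids parameter is described as a vector of "train", "val", and "holdout" assignments. One way such a vector might be built in base R, with roughly 60/20/20 proportions, is sketched below. The variable names and the rounding strategy are choices made for this example, and the sketch stops short of calling evolve_features().

```r
# Build a split-assignment vector with roughly 60/20/20 proportions.
# Labels follow the split_ids description; everything else is illustrative.
set.seed(1)
n     <- nrow(iris)
ratio <- c(train = 0.6, val = 0.2, holdout = 0.2)

counts <- floor(n * ratio)
counts["train"] <- n - sum(counts[c("val", "holdout")])  # train absorbs rounding

# One label per row, shuffled so the assignment is random
split_ids <- sample(rep(names(ratio), times = counts))
table(split_ids)
```

A fixed assignment like this keeps the same rows in each partition across runs, which can make comparisons between different settings easier to interpret.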
- Start small: generations = 5, pop_size = 8 is enough to validate the pipeline. Scale up once you confirm the setup works.
- Increase pop_size for wider exploration. Useful when you have many columns (> 20) and diverse transformer options.
- Increase generations for deeper search. Works best when combined with dynamic_population = TRUE so stagnation triggers population expansion.
- Call set.seed() before calling evolve_features() for reproducible experiments and benchmarking.
- crossover_type = "union" tends to produce larger recipes (more features). "random" keeps recipes leaner.
Calling set.seed() before evolve_features() makes runs reproducible, as the following check shows:
set.seed(42)
r1 <- evolve_features(iris[,1:5], "Petal.Length", task = "regression",
generations = 3, pop_size = 5, evaluator = "xgboost",
verbose = FALSE)
set.seed(42)
r2 <- evolve_features(iris[,1:5], "Petal.Length", task = "regression",
generations = 3, pop_size = 5, evaluator = "xgboost",
verbose = FALSE)
identical(r1$best_individual$fitness, r2$best_individual$fitness)
#> [1] TRUE
identical(
individual_to_recipe_string(r1$best_individual),
individual_to_recipe_string(r2$best_individual)
)
#> [1] TRUE
A realistic workflow with hold-out evaluation:
data(iris)
set.seed(1)
idx <- sample(nrow(iris), 0.7 * nrow(iris))
train <- iris[idx, ]
test <- iris[-idx, ]
# Evolve on training data only
set.seed(7)
recipe <- evolve_features(
data = train[, 1:4], # exclude Species
target_col = "Petal.Length",
task = "regression",
evaluator = "xgboost",
generations = 5,
pop_size = 8,
verbose = FALSE
)
# Predict on held-out test data
test_preds <- predict_model(recipe, test[, 1:4])
# Evaluate
rmse <- sqrt(mean((test$Petal.Length - test_preds)^2))
cat(sprintf("Test RMSE: %.4f\n", rmse))
#> Test RMSE: 0.3007
cat(sprintf("Recipe: %s\n", individual_to_recipe_string(recipe$best_individual)))
#> Recipe: [multiply(Sepal.Width, Sepal.Length, Sepal.Length), mst_score(Petal.Width, Sepal.Width), add(Petal.Width, Petal.Width), log(((Sepal.Width*Sepal.Length*Sepal.Length))), subtract(log(((Sepal.Width*Sepal.Length*Sepal.Length))), Sepal.Width)]
sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] evoFE_0.2.0 rmarkdown_2.31
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.10 mlrMBO_1.1.6 stringi_1.8.7
#> [4] lattice_0.22-9 quitefastmst_0.9.1 lhs_1.3.0
#> [7] digest_0.6.39 evaluate_1.0.5 grid_4.6.0
#> [10] RColorBrewer_1.1-3 BBmisc_1.13.1 fastmap_1.2.0
#> [13] mlr_2.19.3 lumbermark_0.9.0 xgboost_3.2.1.1
#> [16] jsonlite_2.0.0 Matrix_1.7-5 backports_1.5.1
#> [19] survival_3.8-6 scales_1.4.0 RhpcBLASctl_0.23-42
#> [22] codetools_0.2-20 jquerylib_0.1.4 cli_3.6.6
#> [25] rlang_1.2.0 ParamHelpers_1.14.2 lightgbm_4.6.0
#> [28] RcppAnnoy_0.0.23 uwot_0.2.4 splines_4.6.0
#> [31] cachem_1.1.0 yaml_2.3.12 otel_0.2.0
#> [34] tools_4.6.0 parallel_4.6.0 checkmate_2.3.4
#> [37] ggplot2_4.0.3 fastmatch_1.1-8 buildtools_1.0.0
#> [40] vctrs_0.7.3 R6_2.6.1 lifecycle_1.0.5
#> [43] deadwood_0.9.0-3 parallelMap_1.5.1 smoof_1.7.0
#> [46] bslib_0.11.0 gtable_0.3.6 data.table_1.18.4
#> [49] glue_1.8.1 Rcpp_1.1.1-1.1 genieclust_1.3.0
#> [52] xfun_0.58 sys_3.4.3 knitr_1.51
#> [55] farver_2.1.2 htmltools_0.5.9 maketools_1.3.2
#> [58] compiler_4.6.0 S7_0.2.2