MODEL ARCHITECTURE
How the prediction engine was built
How does the model work?
Instead of relying on a single algorithm, this solution uses a stacking ensemble: three gradient boosting models (LightGBM, CatBoost, XGBoost) each make their own prediction independently, and a fourth model (RidgeCV) learns the best way to combine those three predictions into one final answer. This "wisdom of crowds" approach often outperforms any single model on its own. Training used 5-fold cross-validation to guard against overfitting: the data was split into 5 parts, and each part in turn was held out for validation while the models trained on the other 4.
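The stacking and cross-validation scheme described above can be sketched roughly as follows. This is a minimal illustration rather than the original training code: the data is synthetic, the helper name `out_of_fold_predictions` is hypothetical, and three simple scikit-learn regressors stand in for the gradient boosting models so that the example runs with scikit-learn alone.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor


def out_of_fold_predictions(model, X, y, n_splits=5, seed=0):
    """Predict every row with a copy of `model` that never saw that row."""
    oof = np.zeros(len(y))
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, valid_idx in folds.split(X):
        fitted = clone(model).fit(X[train_idx], y[train_idx])
        oof[valid_idx] = fitted.predict(X[valid_idx])
    return oof


# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

# Stand-ins for the three base models (assumption: any regressors work here).
base_models = [
    Ridge(alpha=1.0),
    DecisionTreeRegressor(max_depth=4, random_state=0),
    KNeighborsRegressor(n_neighbors=5),
]

# One column of out-of-fold predictions per base model.
meta_features = np.column_stack(
    [out_of_fold_predictions(m, X, y) for m in base_models]
)

# The meta-learner is trained only on out-of-fold predictions.
meta_learner = RidgeCV(alphas=np.logspace(-2, 2, 10), cv=5).fit(meta_features, y)
stacked = meta_learner.predict(meta_features)
```

Each row's meta-feature comes from a model that did not train on that row, so the meta-learner cannot simply reward whichever base model memorised the training data best.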
Three models, each making independent predictions
Each bar shows how much that model contributes to the final answer. All three are gradient boosting algorithms — they build many small decision trees one after another, each one correcting the mistakes of the last.
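To make "each tree corrects the mistakes of the last" concrete, here is a toy sketch of gradient boosting with squared error, using one-split trees (stumps) on a single feature. It is a simplified illustration under those assumptions, not how LightGBM, CatBoost, or XGBoost are implemented internally; the function names and the sine-wave data are made up for the example.

```python
import numpy as np


def fit_stump(x, residual):
    """Find the single threshold split on x that best fits the residual."""
    best = None
    for threshold in np.unique(x)[:-1]:  # drop the max so both sides are non-empty
        left = x <= threshold
        left_value, right_value = residual[left].mean(), residual[~left].mean()
        fitted = np.where(left, left_value, right_value)
        error = ((residual - fitted) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


def boost(x, y, n_trees=50, learning_rate=0.1):
    """Return the training mean squared error after each boosting round."""
    prediction = np.full(len(y), y.mean())
    errors = []
    for _ in range(n_trees):
        residual = y - prediction          # what the ensemble still gets wrong
        stump = fit_stump(x, residual)     # next tree targets those mistakes
        prediction = prediction + learning_rate * predict_stump(stump, x)
        errors.append(float(((y - prediction) ** 2).mean()))
    return errors


x = np.linspace(0.0, 6.0, 120)
y = np.sin(x)
errors = boost(x, y)
```

The `learning_rate` plays the same role as `lr=0.03` in the settings listed in this section: a smaller value means each tree makes a smaller correction, so more trees are needed.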
Base Models
LightGBM
n_estimators=400, lr=0.03, depth=7, subsample=0.9
CatBoost
iterations=400, lr=0.03, depth=7, verbose=0
XGBoost
n_estimators=400, lr=0.03, depth=7, subsample=0.9
RidgeCV META-LEARNER
alphas=logspace(-2,2,10), cv=5 — combines base predictions
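The settings above could be written out in Python roughly like this. It is a sketch under assumptions: the abbreviations `lr` and `depth` are taken to mean each library's `learning_rate` and `max_depth` (`depth` in CatBoost) arguments, the regressor classes are assumed because the targets are continuous, and `build_models` is a hypothetical helper. The source does not show how the models were actually constructed or wired together.

```python
import numpy as np

# Hyperparameters as listed above, spelled with each library's full argument names.
# Assumption: "lr" -> learning_rate, "depth" -> max_depth (depth for CatBoost).
LIGHTGBM_PARAMS = {"n_estimators": 400, "learning_rate": 0.03, "max_depth": 7, "subsample": 0.9}
CATBOOST_PARAMS = {"iterations": 400, "learning_rate": 0.03, "depth": 7, "verbose": 0}
XGBOOST_PARAMS = {"n_estimators": 400, "learning_rate": 0.03, "max_depth": 7, "subsample": 0.9}

# Candidate regularisation strengths for the meta-learner: 10 values from 0.01 to 100.
RIDGE_ALPHAS = np.logspace(-2, 2, 10)


def build_models():
    """Return (base_models, meta_learner), all unfitted.

    Imports are deferred so the constants above load even when the
    boosting libraries are not installed.
    """
    from catboost import CatBoostRegressor
    from lightgbm import LGBMRegressor
    from sklearn.linear_model import RidgeCV
    from xgboost import XGBRegressor

    base_models = {
        "lightgbm": LGBMRegressor(**LIGHTGBM_PARAMS),
        "catboost": CatBoostRegressor(**CATBOOST_PARAMS),
        "xgboost": XGBRegressor(**XGBOOST_PARAMS),
    }
    meta_learner = RidgeCV(alphas=RIDGE_ALPHAS, cv=5)
    return base_models, meta_learner
```

Note that in LightGBM, `subsample` has an effect only when `subsample_freq` is also set to a positive value; the settings listed here do not include it.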
How raw inputs were transformed before training
Raw data rarely goes straight into a model. These 5 steps cleaned, enriched, and normalised the data to help the model learn better patterns.
Feature Engineering Pipeline
1
Raw Inputs
5 component fractions + 5×10 component properties = 55 features
2
Weighted Averages
WA_Property_i = Σ_c (fraction_c × property_{c,i}), giving 10 engineered features
3
Outlier Removal
IQR-based (1.5×) per target to clean training distribution
4
Quantile Transform
Quantile transform (100 quantiles, normal output distribution) applied to all features
5
5-Fold Stacking
Out-of-fold predictions feed meta-learner to prevent data leakage
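Steps 1 to 4 of the pipeline above can be sketched with NumPy and scikit-learn as follows. The data is synthetic and the array names are made up for the example. Two points are assumptions rather than facts from the source: that the 10 weighted averages are appended to the 55 raw inputs (65 columns in total), and that the outlier filter is shown here for a single target instead of all of them.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
n_blends, n_components, n_properties = 500, 5, 10

# Step 1: raw inputs (synthetic stand-ins). Each blend's fractions sum to 1.
fractions = rng.dirichlet(np.ones(n_components), size=n_blends)       # (500, 5)
properties = rng.normal(size=(n_blends, n_components, n_properties))  # (500, 5, 10)
raw = np.hstack([fractions, properties.reshape(n_blends, -1)])        # (500, 55)

# Step 2: WA_Property_i = sum over components c of fraction_c * property_{c,i}.
weighted = np.einsum("bc,bci->bi", fractions, properties)             # (500, 10)
features = np.hstack([raw, weighted])                                 # (500, 65)

# Step 3: keep rows whose target lies within 1.5 * IQR of the quartiles.
target = weighted[:, 0] + rng.standard_t(df=3, size=n_blends)         # synthetic target
q1, q3 = np.percentile(target, [25, 75])
iqr = q3 - q1
keep = (target >= q1 - 1.5 * iqr) & (target <= q3 + 1.5 * iqr)

# Step 4: map every feature to an approximately normal distribution.
transformer = QuantileTransformer(
    n_quantiles=100, output_distribution="normal", random_state=0
)
features_t = transformer.fit_transform(features[keep])
```

In a real run the transformer would be fitted on training rows only and then applied to validation and test rows with `transformer.transform`, so that no information from held-out data leaks into the features.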
Final scores after combining all three models
These are the R² scores achieved by the full stacking ensemble on the test set, one score per blend property. The closer to 1.0, the better. Nine of the 10 properties score above 0.99, which is a very strong result for a real-world chemistry prediction task.
Final R² Scores — Stacked Ensemble
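For readers unfamiliar with the metric, R² compares the model's squared error with that of always predicting the mean: R² = 1 − SS_res / SS_tot. The sketch below computes one score per property column. The helper name `r2_per_target` is hypothetical and the numbers are synthetic; they are not the scores reported in this section.

```python
import numpy as np


def r2_per_target(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, computed separately for each column."""
    ss_res = ((y_true - y_pred) ** 2).sum(axis=0)
    ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot


# Synthetic example: 300 blends, 10 properties, predictions with small noise.
rng = np.random.default_rng(0)
y_true = rng.normal(size=(300, 10))
y_pred = y_true + rng.normal(scale=0.05, size=y_true.shape)

scores = r2_per_target(y_true, y_pred)
n_above = int((scores > 0.99).sum())  # how many properties clear the 0.99 bar
```

A perfect prediction gives 1.0, and predicting the column mean for every row gives 0.0, which is why values above 0.99 indicate that almost all of the variation in a property is explained.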