Optimised Support Vector Regression for California Housing Price Prediction: The Critical Role of Feature Engineering and Hyperparameter Tuning
arXiv:2605.08660v1 Announce Type: new
Abstract: In the recent literature, Support Vector Regression (SVR) has been cited as one of the weakest performers on the California Housing benchmark dataset, with Preethi et al. (2025) specifically ranking it last among the algorithms they tested, reporting an R2 of only 0.60. This paper examines whether the previously reported performance reflects experimental configuration choices rather than an inherent algorithmic limitation. A structured experimental workflow is applied: ten domain-motivated derived features are constructed from the eight raw inputs, an exploratory ensemble feature-importance analysis identifies the most predictive candidates, and a randomised search over hyperparameter combinations with three-fold cross-validation selects the optimal SVR configuration within a leakage-safe scikit-learn Pipeline. A formal four-stage ablation study isolates the contribution of each component: scaling alone accounts for +0.744 in R2 (from -0.054 to 0.690), feature engineering adds +0.026 (to 0.716), and hyperparameter tuning contributes +0.008 (to 0.723). The resulting tuned SVR achieves a test R2 of 0.723, a 0.123-point absolute improvement over the previously reported SVR result (from 0.60 to 0.723, approximately 20% relative gain). In the ten-model comparison, the tuned SVR ranks fourth with R2 = 0.723, below XGBoost (0.832), Random Forest (0.814), and Gradient Boosting (0.783), while substantially outperforming simpler baselines. Ten-fold cross-validation yields a mean R2 of 0.703 (95% CI: [0.630, 0.775]), confirming robust generalisation. The observed improvement from R2 = 0.60 to R2 = 0.723 is attributable primarily to proper feature scaling within a unified preprocessing pipeline, with domain-motivated feature engineering and systematic hyperparameter tuning providing further incremental gains.
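The leakage-safe setup the abstract describes (scaling fitted inside the cross-validated Pipeline, with a randomised hyperparameter search over the SVR) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the search space, n_iter, and the synthetic stand-in dataset (used here so the sketch runs without downloading the California Housing benchmark) are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in with eight features, mirroring the benchmark's raw
# input count (assumption; the paper uses the California Housing data).
X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scaling lives inside the Pipeline, so each CV fold fits the scaler on
# its own training split only -- this is what makes the search leakage-safe.
pipe = Pipeline([("scale", StandardScaler()), ("svr", SVR(kernel="rbf"))])

# Hypothetical search space; the paper's actual grid is not given in the abstract.
search = RandomizedSearchCV(
    pipe,
    param_distributions={
        "svr__C": np.logspace(-1, 3, 20),
        "svr__gamma": np.logspace(-3, 1, 20),
        "svr__epsilon": np.linspace(0.01, 0.5, 10),
    },
    n_iter=20,
    cv=3,              # three-fold cross-validation, as in the abstract
    scoring="r2",
    random_state=42,
    n_jobs=-1,
)
search.fit(X_train, y_train)
test_r2 = search.score(X_test, y_test)  # R2 on the held-out split
```

Keeping the scaler inside the Pipeline (rather than scaling X once up front) is the design choice the ablation credits with the largest gain: it guarantees the test fold's statistics never inform the transform applied during model selection.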