Automated machine learning (AutoML) automates the end-to-end process of applying machine learning to real-world problems. It handles most of the steps in an ML pipeline with minimal human effort and without compromising model performance.
What is AutoML?
Automatic machine learning broadly includes the following steps:
- Data preparation and Ingestion: Real-world data may be raw or arrive in any format. In this step, the data is converted into a format that can be processed easily. This also requires deciding the data type of each column in the dataset and having a clear understanding of the task to be performed on the data (e.g., classification or regression).
- Feature Engineering: This includes the steps required to clean the dataset, such as handling NULL/missing values, selecting the most important features, removing weakly correlated features, and dealing with skewed data.
- Hyperparameter Optimization: To obtain the best results from any model, AutoML needs to tune the hyperparameter values carefully.
- Model Selection: H2O AutoML trains a large number of models in order to produce the best results. It also trains different ensembles of those models to get the best performance out of the training data.
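The hyperparameter optimization and model selection steps above can be illustrated with a small, dependency-free sketch. It is not how H2O implements its search: the k-nearest-neighbour regressor, the synthetic data, and every function name below are assumptions chosen only to keep the example short.

```python
import random
import statistics

def knn_predict(train, x, k):
    """Predict y at x as the mean target of the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return statistics.fmean([y for _, y in nearest])

def cv_rmse(data, k, folds=5):
    """Average RMSE of a k-NN regressor over `folds` cross-validation splits."""
    scores = []
    for f in range(folds):
        valid = data[f::folds]  # every folds-th point is held out
        train = [p for i, p in enumerate(data) if i % folds != f]
        sq_err = [(knn_predict(train, x, k) - y) ** 2 for x, y in valid]
        scores.append(statistics.fmean(sq_err) ** 0.5)
    return statistics.fmean(scores)

def random_search(data, candidates, n_trials, seed=1):
    """Evaluate n_trials randomly chosen values of k; return results best-first."""
    rng = random.Random(seed)
    tried = rng.sample(candidates, n_trials)
    return sorted((cv_rmse(data, k), k) for k in tried)

# Synthetic noisy data, y = 2x + noise (purely illustrative)
noise = random.Random(0)
data = [(i / 10, 2 * i / 10 + noise.gauss(0, 0.5)) for i in range(100)]

ranking = random_search(data, candidates=list(range(1, 31)), n_trials=8)
best_rmse, best_k = ranking[0]
print(f"best k = {best_k}, cross-validated RMSE = {best_rmse:.3f}")
```

The sorted `ranking` list plays the role of a tiny leaderboard: each candidate is scored by cross-validation and the lowest-error configuration is selected.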
Functionalities of H2O AutoML
H2O AutoML contains cutting-edge, distributed implementations of many machine learning algorithms. These algorithms are available from Java, Python, Spark, Scala, and R. H2O also provides a web GUI that communicates with these algorithms through JSON.
- Automates steps such as basic data processing, model training and tuning, and ensembling and stacking of various models to deliver the best-performing models, so that developers can focus on other steps such as data collection, feature engineering, and model deployment.
- H2O AutoML provides necessary data processing capabilities. These are also included in all of the H2O algorithms.
- Trains a random grid of algorithms such as GBMs, DNNs, and GLMs, using a carefully chosen hyperparameter space.
- Individual models are tuned using cross-validation.
- Two Stacked Ensembles are trained. One ensemble contains all the models (optimized for model performance), and the other contains only the best-performing model from each algorithm class/family (optimized for production use).
- Returns a sorted “Leaderboard” of all models.
- All models can be easily exported to production.
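To make the difference between the two stacked ensembles and the sorted leaderboard concrete, here is a minimal sketch of how the base models for each ensemble could be chosen. It only illustrates the selection idea and does not train a metalearner. The model IDs, families, and error values are hypothetical, and the code is not H2O's implementation.

```python
from typing import NamedTuple

class Entry(NamedTuple):
    model_id: str  # hypothetical IDs, not taken from a real run
    family: str    # algorithm class/family
    rmse: float    # cross-validated error, lower is better

def leaderboard(entries):
    """Sort models by their error, best first."""
    return sorted(entries, key=lambda e: e.rmse)

def best_of_family(entries):
    """Keep only the best-scoring model from each algorithm family."""
    best = {}
    for e in leaderboard(entries):
        best.setdefault(e.family, e)  # first hit per family is its best model
    return list(best.values())

models = [
    Entry("GBM_1", "GBM", 48000.0),
    Entry("GBM_2", "GBM", 46500.0),
    Entry("DRF_1", "DRF", 50200.0),
    Entry("GLM_1", "GLM", 69000.0),
    Entry("DeepLearning_1", "DeepLearning", 61000.0),
    Entry("DeepLearning_2", "DeepLearning", 63500.0),
]

all_models_base = [e.model_id for e in leaderboard(models)]
best_of_family_base = [e.model_id for e in best_of_family(models)]
print(all_models_base)
print(best_of_family_base)
```

The "all models" ensemble would stack every entry, while the "best of family" ensemble stacks one model per family, which keeps the ensemble smaller and cheaper to serve.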
Architecture of H2O AutoML
H2O AutoML is built on the H2O architecture, which can be divided into layers: the top layer consists of the various APIs, and the bottom layer is the H2O JVM.

H2O provides REST API clients for Python, R, Excel, Tableau, and the Flow web UI, which communicate over socket connections. The bottom layer contains the components that run inside the H2O JVM process.
An H2O cluster consists of one or more nodes. Each node is a single JVM process. Each JVM process is split into three layers: language, algorithms, and core infrastructure.
- The first layer in the bottom section is the language layer, which consists of an expression evaluation engine for R and the Scala layer.
- The second layer is the algorithm layer. It contains the algorithms that ship with H2O, such as XGBoost, GBM, Random Forest, and K-Means.
- The third layer is the core infrastructure layer that deals with resource management such as Memory and CPU management.
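Because every client in the top layer talks to the JVM through the REST API, a node can also be queried with plain HTTP. The sketch below assumes the cluster-status endpoint `/3/Cloud` and the default port 54321. The helper names are hypothetical, and the fetch function only works while an H2O instance is running.

```python
import json
from urllib.request import urlopen

def cluster_status_url(host="localhost", port=54321):
    """Build the URL of the cluster-status endpoint of an H2O node."""
    return f"http://{host}:{port}/3/Cloud"

def fetch_cluster_status(host="localhost", port=54321, timeout=5):
    """Request the cluster status as JSON (needs a running H2O instance)."""
    with urlopen(cluster_status_url(host, port), timeout=timeout) as resp:
        return json.load(resp)

# Only the URL is built here; call fetch_cluster_status() once H2O is running.
print(cluster_status_url())
```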
Implementation of House Price Prediction using H2O AutoML
Here, we will be using California Housing Prices as our Dataset for House Price Prediction.
1. Importing Necessary Libraries
First, we need to import the necessary packages: pandas, NumPy, and Matplotlib.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
2. Loading the Data
You can download the California Housing training dataset from here. Load the dataset with the pandas read_csv function.
df = pd.read_csv('/content/california_housing_train.csv')
3. Understanding the Data
Let's look at the dataset. We use the head function to list the first 5 rows of the dataset.
df.head()
Output

4. Pre-processing the Data
Now, let's check for null values in the dataset.
df.isna().sum()
Output

As we can see, there are no null values in the dataset, so we don't need to handle them.
5. Installation and Initialization of H2O
- First, install the h2o package with pip. Note that if you run H2O in a local environment, you also need to install the Java Development Kit (JDK).
- After installing the JDK and H2O, initialize H2O. If everything works, this starts an H2O instance on localhost.
- h2o.init accepts many arguments, such as nthreads, ip, port, max_mem_size, and min_mem_size.
Refer to the H2O documentation for more details.
!pip install h2o
Output

import h2o
h2o.init()
Output

We can observe that the H2O instance can also be accessed at localhost:54321. This instance provides a web GUI called Flow.
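The initialization arguments mentioned earlier can be passed to h2o.init as keyword arguments. The following is a minimal sketch: the wrapper name `start_h2o` and the default values are assumptions for illustration, not recommendations. The import happens inside the function, so the definition itself does not require H2O to be installed.

```python
def start_h2o(nthreads=-1, max_mem_size="4G", port=54321):
    """Start (or connect to) a local H2O instance with explicit resource limits.

    The default values are illustrative assumptions, not recommendations.
    """
    import h2o  # imported lazily; calling this function needs h2o and Java

    h2o.init(
        nthreads=nthreads,          # -1 uses all available CPU cores
        max_mem_size=max_mem_size,  # upper bound on JVM memory, e.g. "4G"
        port=port,                  # port the local instance listens on
    )
    return h2o.cluster()            # handle to the running cluster

# Example call (requires the h2o package and a Java runtime):
# cluster = start_h2o(nthreads=2, max_mem_size="2G")
# cluster.show_status()
```

Limiting threads and memory is mostly useful on shared machines or notebooks with a small memory budget.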
6. Converting Train Dataframe to H2O DataFrame
To convert the training DataFrame into an H2OFrame, we pass it to the h2o.H2OFrame constructor.
train_df = h2o.H2OFrame(df)
train_df.describe()
Output

7. Preparing Test Dataset
Now, download the testing dataset from here and convert the pandas DataFrame into an H2OFrame. Then we remove the label column from the list of feature columns.
test = pd.read_csv('/content/california_housing_test.csv')
test = h2o.H2OFrame(test)
# Defining feature and label columns
x = test.columns
y = 'median_house_value'
x.remove(y)
8. Setup AutoML and Train the model
Now, we import H2O AutoML and start training.
from h2o.automl import H2OAutoML
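# Configure the H2OAutoML run
aml = H2OAutoML(
    max_runtime_secs=600,    # stop training after 10 minutes
    seed=1,                  # random seed
    balance_classes=False,   # only relevant for classification tasks
    project_name='Project_1',
)
# Train the models (in a notebook, prefix this call with %time to record the duration)
aml.train(x=x, y=y, training_frame=train_df)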
Output

9. View the H2O aml leaderboard
In this step, we look for the best-performing model on the leaderboard. It will most likely be one of the two stacked ensemble models.
lb = aml.leaderboard
lb.head(rows = lb.nrows)
Output

10. Analyzing Leaderboard and Top Model
In this step, we explore the base learners of the stacked ensemble model and select the best-performing base learner. Here, we identify the top model, its metalearner, and the base learner models.
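# Leading model; the next lines assume it is a stacked ensemble
se = aml.leader
# Fetch the metalearner that combines the base models' predictions
metalearner = h2o.get_model(se.metalearner()['name'])
# Importance of each base learner within the ensemble
metalearner.varimp()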
Output
/usr/local/lib/python3.11/dist-packages/h2o/estimators/stackedensemble.py:965: H2ODeprecationWarning: The usage of stacked_ensemble.metalearner()['name'] will be deprecated. Metalearner now returns the metalearner object. If you need to get the 'name' please use stacked_ensemble.metalearner().model_id
warnings.warn(
[('GBM_3_AutoML_1_20250707_91158', 23234.78125, 1.0, 0.21439825598646953),
('GBM_4_AutoML_1_20250707_91158',
21459.20703125,
0.9235811949488442,
0.19801419745893173),
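Selecting the best base learner can also be done programmatically instead of reading the ID off the leaderboard. The sketch below assumes that stacked ensemble IDs start with "StackedEnsemble" and that the IDs are ordered best-first. The helper name and the example IDs are hypothetical.

```python
def best_base_learner(model_ids):
    """Return the first ID that is not a stacked ensemble.

    `model_ids` must be ordered best-first, as on the leaderboard.
    """
    for model_id in model_ids:
        if not model_id.startswith("StackedEnsemble"):
            return model_id
    raise ValueError("no base learner found in the leaderboard")

# Hypothetical IDs in leaderboard order (not taken from a real run)
ids = [
    "StackedEnsemble_AllModels_1_AutoML_example",
    "StackedEnsemble_BestOfFamily_1_AutoML_example",
    "GBM_3_AutoML_example",
    "XGBoost_1_AutoML_example",
]
print(best_base_learner(ids))

# With a live AutoML run, the IDs could be read from the leaderboard:
# ids = aml.leaderboard.as_data_frame()["model_id"].tolist()
# model = h2o.get_model(best_base_learner(ids))
```

This avoids hard-coding a model ID, which changes with every AutoML run.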
11. Model Evaluation
Now, we compute the error metrics of this base learner on the test set and then plot its feature importance.
model = h2o.get_model('XGBoost_grid__1_AutoML_20200714_173719_model_5')
model.model_performance(test)
Output
ModelMetricsRegressionGLM: stackedensemble
** Reported on test data. **
MSE: 2082655548.9724505
RMSE: 45636.121099108
MAE: 29539.84376082709
RMSLE: 0.2277528452878982
Mean Residual Deviance: 2082655548.9724505
R^2: 0.83718821 ...
If this raises an error, replace the model ID with the ID of a model from your own H2O instance.
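For readers who want to see how the reported metrics relate to each other, here is a small dependency-free sketch of the standard definitions of MSE, RMSE, MAE, and R^2. The toy values are made up, and the function is an illustration rather than H2O's implementation. The last line checks that the RMSE reported above is the square root of the reported MSE.

```python
import math

def regression_metrics(actual, predicted):
    """Compute MSE, RMSE, MAE and R^2 for paired lists of numbers."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_actual = sum(actual) / n
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),            # RMSE is the square root of MSE
        "MAE": mae,
        "R2": 1 - (mse * n) / ss_total,    # 1 - SS_residual / SS_total
    }

# Toy values, made up for illustration only
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 320.0, 380.0]
metrics = regression_metrics(actual, predicted)
print(metrics)

# Consistency check on the figures reported above
print(math.isclose(math.sqrt(2082655548.9724505), 45636.121099108, rel_tol=1e-9))
```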
12. Visualizing Feature Importance
model.varimp_plot(num_of_features = 9)
Output

13. Save the Base Learner Model
Finally, we can save this model with the h2o.save_model function. The saved model can then be deployed on various platforms.
model_path = h2o.save_model(model = model, path ='sample_data/', force = True)
You can refer to the source code and download it from - here.