AutoML using H2O

Automated machine learning (AutoML) automates the end-to-end process of applying machine learning to real-world problems. It handles most of the steps in an ML pipeline with minimal human effort and without compromising model performance.

What is AutoML?

Automated machine learning broadly includes the following steps:

Data Preparation and Ingestion: Real-world data can arrive raw and in any format. In this step, the data is converted into a format that can be processed easily, and the data type of each column is decided. We also need a clear understanding of the task to perform on the data (e.g., classification or regression).
Feature Engineering: This covers the steps required to clean and shape the dataset, such as handling NULL/missing values, selecting the most important features, removing weakly correlated features, and dealing with skewed data.
Hyperparameter Optimization: To obtain the best results from any model, AutoML needs to tune its hyperparameter values carefully.
Model Selection: H2O AutoML trains a large number of models in order to produce the best results. It also trains ensembles of those models to get the best performance out of the training data.
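
As a rough illustration of the last two steps, the sketch below runs a tiny random search over a single hyperparameter and keeps the configuration with the lowest validation error. It is not how H2O implements its search: the k-nearest-neighbour regressor, the synthetic data, and every function name here are assumptions made for this example.

```python
import random

def knn_predict(train, x, k):
    """Predict y for x as the mean target of the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def rmse(train, valid, k):
    """Root mean squared error of a k-NN model on the validation set."""
    errors = [(knn_predict(train, x, k) - y) ** 2 for x, y in valid]
    return (sum(errors) / len(errors)) ** 0.5

rng = random.Random(1)
# Synthetic data (assumption): y = 2x plus a little noise
data = [(x, 2 * x + rng.gauss(0, 0.5)) for x in (rng.uniform(0, 10) for _ in range(200))]
train, valid = data[:150], data[150:]

# Hyperparameter optimization: sample candidate values of k at random
candidates = sorted(set(rng.randint(1, 30) for _ in range(8)))

# Model selection: rank every candidate by validation RMSE, best first
leaderboard = sorted((rmse(train, valid, k), k) for k in candidates)
best_rmse, best_k = leaderboard[0]
print(f"best k = {best_k}, validation RMSE = {best_rmse:.3f}")
```

An AutoML tool repeats this loop across many algorithms and many hyperparameters at once, which is why a time or model budget is usually part of its configuration.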

Functionalities of H2O AutoML

H2O AutoML includes cutting-edge, distributed implementations of many machine learning algorithms. These algorithms are available in Java, Python, Spark, Scala, and R. H2O also provides a web GUI that uses JSON to communicate with these algorithms.

Automates steps such as basic data processing, model training and tuning, and ensembling and stacking of models to deliver the best-performing models, so that developers can focus on other steps such as data collection, feature engineering, and model deployment.
H2O AutoML provides the necessary data processing capabilities. These are also included in all of the H2O algorithms.
Trains a random grid of algorithms such as GBMs, DNNs, and GLMs, using a carefully chosen hyperparameter space.
Individual models are tuned using cross-validation.
Two Stacked Ensembles are trained. One ensemble contains all the models (optimized for model performance), and the other contains just the best-performing model from each algorithm class/family (optimized for production use).
Returns a sorted “Leaderboard” of all models.
All models can be easily exported to production.
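
To make the idea behind the stacked ensembles more concrete, here is a minimal sketch of a metalearner combining two base learners. It is a simplified holdout blend, not H2O's cross-validated Stacked Ensemble: the two base learners, the synthetic data, and the single blend weight are all assumptions chosen to keep the example short.

```python
import random

rng = random.Random(1)
# Synthetic data (assumption): a noisy curve
xs = [rng.uniform(0, 10) for _ in range(120)]
data = [(x, 0.5 * x * x + rng.gauss(0, 2.0)) for x in xs]
train, holdout = data[:80], data[80:]

def fit_mean(rows):
    """Base learner 1: always predicts the mean training target."""
    mean_y = sum(y for _, y in rows) / len(rows)
    return lambda x: mean_y

def fit_line(rows):
    """Base learner 2: ordinary least squares line through the training data."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    b = sum((x - mx) * (y - my) for x, y in rows) / sum((x - mx) ** 2 for x, _ in rows)
    return lambda x: my + b * (x - mx)

def mse(preds, rows):
    """Mean squared error of predictions against the targets in rows."""
    return sum((p - y) ** 2 for p, (_, y) in zip(preds, rows)) / len(rows)

base = [fit_mean(train), fit_line(train)]
p1, p2 = ([m(x) for x, _ in holdout] for m in base)

# Metalearner: the weight w that minimizes the squared error of w*p1 + (1-w)*p2
num = sum((a - b) * (y - b) for a, b, (_, y) in zip(p1, p2, holdout))
den = sum((a - b) ** 2 for a, b in zip(p1, p2))
w = num / den
blend = [w * a + (1 - w) * b for a, b in zip(p1, p2)]

print(f"mean: {mse(p1, holdout):.2f}, line: {mse(p2, holdout):.2f}, blend: {mse(blend, holdout):.2f}")
```

Because the weight is fitted and scored on the same holdout rows, the blend can never score worse than either base learner here, so the printed number is optimistic. Cross-validated stacking avoids that bias by training the metalearner on out-of-fold predictions.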

Architecture of H2O AutoML

H2O AutoML uses the H2O architecture, which can be divided into layers: the top layer consists of the different APIs, and the bottom layer is the H2O JVM.

H2O provides REST API clients for Python, R, Excel, Tableau, and the Flow web UI, all of which talk to the cluster over socket connections. The bottom layer contains the components that run inside the H2O JVM process.

An H2O cluster consists of one or more nodes. Each node is a single JVM process. Each JVM process is split into three layers: language, algorithms, and core infrastructure.

The first layer in the bottom section is the language layer. The language layer consists of an expression evaluation engine for R and the Scala layer.
The second layer is the algorithm layer. This layer contains the algorithms that ship with H2O, such as XGBoost, GBM, Random Forest, and K-Means.
The third layer is the core infrastructure layer that deals with resource management such as Memory and CPU management.
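
Since every client in the top layer talks to the JVM through the REST API, the same API can be called directly. The sketch below queries the cluster status endpoint using only the standard library. The helper names and the default host are assumptions for illustration, and the function only works once an H2O instance is already running.

```python
import json
from urllib.request import urlopen

def cloud_status_url(host="localhost", port=54321):
    """Build the URL of the cluster status endpoint."""
    return f"http://{host}:{port}/3/Cloud"

def fetch_cloud_status(host="localhost", port=54321, timeout=5):
    """Ask a running H2O node for its cluster status and return it as a dict."""
    with urlopen(cloud_status_url(host, port), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

# Usage, with an H2O instance already running:
# status = fetch_cloud_status()
# print(status["cloud_name"], status["cloud_size"])
```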

Implementation of House Price Prediction using H2O AutoML

Here, we use the California Housing Prices dataset for house price prediction.

1. Importing Necessary Libraries

First, we need to import the necessary packages: pandas, NumPy, and Matplotlib.

python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

2. Loading the Data

You can download the California Housing training dataset from here. Load the dataset with the read_csv function from pandas.

python

df = pd.read_csv('/content/california_housing_train.csv')

3. Understanding the Data

Let's look at the dataset. We use the head function to list the first 5 rows of the dataset.

python

df.head()

Output

4. Pre-processing the Data

Now, let's check for null values in the dataset.

python

df.isna().sum()

Output

As we can see, there are no null values in the dataset, so we don't need to handle them.
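
If the check had reported missing values, one common option is to fill them with the column median. The sketch below shows this on a small toy frame. The numbers are made up and are not taken from the housing dataset, and median imputation is only one of several reasonable strategies.

```python
import pandas as pd

# Toy frame with one missing value (made-up numbers, not the real dataset)
toy = pd.DataFrame({
    "total_rooms": [1000.0, 2000.0, 500.0, 1500.0],
    "total_bedrooms": [200.0, None, 100.0, 300.0],
})

print(toy.isna().sum())  # missing values per column

# Replace each missing value with the median of its column
filled = toy.fillna(toy.median(numeric_only=True))
print(filled)
```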

5. Installation and Initialization of H2O

We need to install H2O, which we can do with the pip command. Note that if you run H2O in a local environment, you also need to install the Java Development Kit (JDK).
After installing the JDK and H2O, we initialize it. If everything works, this starts an H2O instance on localhost.
h2o.init accepts many arguments, such as nthreads, ip, port, max_mem_size, and min_mem_size.

You can refer to the H2O documentation for more details.

python

!pip install h2o

Output

python

import h2o
h2o.init()

Output

Figure: Importing and Initializing Session

We can observe that the H2O instance can also be accessed at localhost:54321, where it serves a web GUI called Flow.
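
A few of the initialization arguments mentioned above can be passed as shown in the sketch below. The values are assumptions for illustration and should be adapted to your machine. The call is wrapped in a hypothetical helper, start_h2o, so that the settings can be inspected without starting a cluster.

```python
# Assumed settings for illustration; adjust them to your machine
init_kwargs = {
    "nthreads": -1,        # use all available CPU cores
    "max_mem_size": "4G",  # upper bound on the memory given to the H2O JVM
    "port": 54321,         # port of the local H2O instance
}

def start_h2o(**overrides):
    """Start (or connect to) a local H2O instance with the settings above."""
    import h2o  # imported here so the settings can be read without H2O installed
    h2o.init(**{**init_kwargs, **overrides})
    return h2o

# Usage:
# h2o = start_h2o(max_mem_size="8G")
```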

6. Converting Train Dataframe to H2O DataFrame

To convert the pandas training DataFrame into an H2OFrame, we use the following step.

python

train_df = h2o.H2OFrame(df)
train_df.describe()

Output

Figure: Converting to H2O DataFrame and Analyzing Training Data

7. Preparing Test Dataset

Now, download the testing dataset from here and convert the pandas DataFrame into an H2OFrame. Then we define the label column and remove it from the list of feature columns.

python

test = pd.read_csv('/content/california_housing_test.csv')
test = h2o.H2OFrame(test)

# Defining feature and label columns
x = test.columns
y = 'median_house_value'
x.remove(y)

Output

8. Setup AutoML and Train the model

Now, we import H2O AutoML and start training.

python

from h2o.automl import H2OAutoML

# Configure the AutoML run
aml = H2OAutoML(max_runtime_secs=600,
                seed=1,
                balance_classes=False,
                project_name='Project_1')

# Train the models (in a notebook, the %time magic can be used to record the training time)
aml.train(x=x, y=y, training_frame=train_df)
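
A run limited by max_runtime_secs trains a different number of models depending on how fast the machine is. As a hedged alternative, the sketch below limits the run by model count and ranks the leaderboard by RMSE. The helper name run_automl, the project name, and the default of 10 models are assumptions for illustration.

```python
def run_automl(train_frame, features, target, max_models=10):
    """Train a model-count-limited AutoML run and return the H2OAutoML object."""
    from h2o.automl import H2OAutoML  # requires a running H2O instance

    aml = H2OAutoML(
        max_models=max_models,  # stop after this many models (stacked ensembles are not counted)
        seed=1,
        sort_metric="RMSE",     # rank the leaderboard by RMSE
        project_name="Project_1_bounded",  # hypothetical project name
    )
    aml.train(x=features, y=target, training_frame=train_frame)
    return aml

# Usage, with the frames and column lists defined earlier:
# aml = run_automl(train_df, x, y)
```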

Output

Figure: AutoML based Training of Model

9. View the H2O AutoML Leaderboard

In this step, we look for the best-performing model using the leaderboard. It will most probably be one of the two stacked ensemble models.

python

lb = aml.leaderboard
lb.head(rows = lb.nrows)

Output

10. Analyzing Leaderboard and Top Model

In this step, we explore the base learners of the stacked ensemble and select the best-performing one. We take the top model, retrieve its metalearner, and inspect the importance the metalearner assigns to each base learner.

python

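# Top model from the leaderboard (assumed here to be a stacked ensemble)
se = aml.leader
# Newer H2O versions return the metalearner object directly from se.metalearner()
metalearner = h2o.get_model(se.metalearner()['name'])
# Importance the metalearner assigns to each base learner
metalearner.varimp()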

Output

/usr/local/lib/python3.11/dist-packages/h2o/estimators/stackedensemble.py:965: H2ODeprecationWarning: The usage of stacked_ensemble.metalearner()['name'] will be deprecated. Metalearner now returns the metalearner object. If you need to get the 'name' please use stacked_ensemble.metalearner().model_id
warnings.warn(
[('GBM_3_AutoML_1_20250707_91158', 23234.78125, 1.0, 0.21439825598646953),
('GBM_4_AutoML_1_20250707_91158',
21459.20703125,
0.9235811949488442,
0.19801419745893173),

11. Model Evaluation

Now, we calculate the error of this base learner on the test set and then plot its feature importance.

python

model = h2o.get_model('XGBoost_grid__1_AutoML_20200714_173719_model_5')
model.model_performance(test)

Output

ModelMetricsRegressionGLM: stackedensemble
** Reported on test data. **

MSE: 2082655548.9724505
RMSE: 45636.121099108
MAE: 29539.84376082709
RMSLE: 0.2277528452878982
Mean Residual Deviance: 2082655548.9724505
R^2: 0.83718821 ...
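
To show how these metrics relate to each other, the sketch below computes MSE, RMSE, MAE, and R^2 by hand. The four house values and predictions are made up for illustration. The last lines check that the reported RMSE is the square root of the reported MSE.

```python
import math

def regression_metrics(actual, predicted):
    """Compute MSE, RMSE, MAE and R^2 for two equal-length sequences."""
    n = len(actual)
    residuals = [a - p for a, p in zip(actual, predicted)]
    mse = sum(r * r for r in residuals) / n
    mae = sum(abs(r) for r in residuals) / n
    mean_actual = sum(actual) / n
    total_ss = sum((a - mean_actual) ** 2 for a in actual)
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": 1 - (mse * n) / total_ss}

# Made-up house values and predictions, for illustration only
actual = [100000.0, 200000.0, 300000.0, 400000.0]
predicted = [110000.0, 190000.0, 320000.0, 380000.0]
metrics = regression_metrics(actual, predicted)
print(metrics)

# RMSE is the square root of MSE, which the reported output confirms
reported_mse = 2082655548.9724505
print(math.sqrt(reported_mse))  # close to the reported RMSE of 45636.12
```

Because RMSE is expressed in the same unit as the target, the reported value means the predictions are off by roughly 45,600 in house-value units on a root-mean-square basis.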

If this raises an error, replace the model ID with one from the leaderboard of your own H2O instance.
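
Model IDs contain a timestamp, so a hard-coded ID only matches the run that produced it. One way to avoid this is to pick the ID from the leaderboard programmatically. The helper below is a sketch with a hypothetical name. It assumes that stacked ensemble IDs start with "StackedEnsemble" and that the list is ordered best first, as in the leaderboard.

```python
def best_base_model_id(model_ids):
    """Return the first ID that is not a stacked ensemble, or None if there is none."""
    for model_id in model_ids:
        if not model_id.startswith("StackedEnsemble"):
            return model_id
    return None

# Hypothetical leaderboard order, best model first
example_ids = [
    "StackedEnsemble_AllModels_1_AutoML_1",
    "StackedEnsemble_BestOfFamily_1_AutoML_1",
    "GBM_3_AutoML_1",
    "GBM_4_AutoML_1",
]
print(best_base_model_id(example_ids))

# With a finished AutoML run, the IDs come from the leaderboard:
# ids = aml.leaderboard.as_data_frame()["model_id"].tolist()
# model = h2o.get_model(best_base_model_id(ids))
```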

12. Visualizing Feature Importance

python

model.varimp_plot(num_of_features = 9)

Output

Figure: Feature Importance Visualization

13. Save the Base Learner Model

Finally, we can save this model with the h2o.save_model function. The saved model can then be deployed on various platforms.

python

model_path = h2o.save_model(model = model, path ='sample_data/', force = True)
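
A saved model can be loaded back with h2o.load_model and used for prediction. The sketch below wraps both steps in a hypothetical helper, save_and_reload. It assumes a running H2O instance, and a model saved this way should be loaded with the same H2O version that saved it.

```python
def save_and_reload(model, directory="sample_data/"):
    """Save a trained H2O model to disk and load it back."""
    import h2o  # requires a running H2O instance

    path = h2o.save_model(model=model, path=directory, force=True)
    return h2o.load_model(path)

# Usage, with the model and test frame from the previous steps:
# reloaded = save_and_reload(model)
# predictions = reloaded.predict(test)
```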

You can refer to the source code and download it from here.
