This tutorial is an introduction to training deep learning models on Google ML Engine (now AI Platform), a tool developed by Google to make it easier to take ML projects from ideation to production.

Note: This tutorial is an adaptation of Google's Getting Started: Training and Prediction tutorial.
Machine learning is essentially what the term suggests: the learning of machines, specifically computer systems. Machine learning enables computer systems to learn from examples in order to predict outcomes without being explicitly programmed by humans. Oftentimes large data sets are required to train a computer system accurately, which in turn demands a great deal of computing power from the neural networks. This limitation can be addressed by the application of deep learning.
Deep learning is a class of machine learning algorithms that use multiple layers to progressively extract higher-level features from raw input.
In essence, deep learning is achieved by training computer systems to learn algorithms through neural networks, which mimic the biological neurons of the human brain. A neural network consists of multiple interconnected layers; its main function is to take input data and accurately predict outputs through progressively more complex calculations.
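As a toy illustration of those layered calculations, the following pure-Python sketch (a hypothetical example, not part of this tutorial's code) passes one input through two small layers: a hidden layer with a ReLU activation, then an output layer that produces one score per class:

```python
# Toy forward pass through a two-layer neural network (illustration only).
# The weights here are fixed by hand; a real network learns them from data.

def relu(x):
    # ReLU activation: negative values become 0.
    return max(0.0, x)

def layer(inputs, weights, biases, activation):
    # Each output neuron is a weighted sum of the inputs plus a bias,
    # passed through the activation function.
    return [
        activation(sum(i * w for i, w in zip(inputs, ws)) + b)
        for ws, b in zip(weights, biases)
    ]

# One input example with 2 features.
x = [1.0, 2.0]

# Hidden layer: 2 inputs -> 3 neurons, ReLU activation.
hidden = layer(x, [[0.5, -1.0], [1.0, 1.0], [-0.5, 0.5]], [0.0, 0.0, 0.0], relu)

# Output layer: 3 hidden values -> 2 class scores, identity activation.
scores = layer(hidden, [[1.0, 0.5, -1.0], [0.0, 1.0, 1.0]], [0.0, 0.0], lambda v: v)

print(hidden)  # [0.0, 3.0, 0.5]
print(scores)  # [1.0, 3.5]
```

The "deep" networks in this tutorial apply the same idea with more layers and learned weights.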
The main focus of this tutorial is on how to train a deep learning model on Google's AI Platform, rather than on how to develop a deep learning model and the science behind it.
In this tutorial we are going to use the Iris flower data set to train a deep neural network to classify different flower species. We are going to use the DNNClassifier TensorFlow estimator to train the model and package it as an application to be deployed on AI Platform.
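The trainer code itself is outside this tutorial's scope, but the core of a DNNClassifier-based trainer looks roughly like the following sketch. It assumes TensorFlow 1.x (matching the 1.10 runtime used later) and the four numeric Iris features; the feature names and hyperparameters are illustrative, not the exact contents of the trainer package:

```python
import tensorflow as tf  # assumes TensorFlow 1.x (e.g. runtime version 1.10)

# The four numeric Iris measurements, declared as feature columns.
# These names are an assumption; the real trainer defines its own.
feature_columns = [
    tf.feature_column.numeric_column(name)
    for name in ["sepal_length", "sepal_width", "petal_length", "petal_width"]
]

# A deep neural network classifier with two hidden layers and
# three output classes (one per Iris species).
classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[10, 10],
    n_classes=3,
    model_dir="output",
)

# train_input_fn would read iris_training.csv and yield (features, labels);
# its definition is omitted here.
# classifier.train(input_fn=train_input_fn, steps=1000)
```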
Complete the following steps to set up a GCP account, activate the AI Platform API, and install and activate the Cloud SDK.
- Select or create a Google Cloud Platform project. (Go to the Manage Resources page.)

- Make sure that billing is enabled for your Google Cloud Platform project. (Learn how to enable billing.)

- Enable the AI Platform ("Cloud Machine Learning Engine") and Compute Engine APIs. (Enable the APIs.)

- Install virtualenv. `virtualenv` is a tool to create isolated Python environments. Check if you already have `virtualenv` installed by running `virtualenv --version`. If it is not installed, run:

  ```shell
  pip install --user --upgrade virtualenv
  ```

  To create an isolated development environment for this guide, create a new virtual environment in virtualenv and activate it. For example, the following commands create and activate an environment named `cmle-env`:

  ```shell
  virtualenv cmle-env --python=python2.7
  source cmle-env/bin/activate
  ```

- For the purposes of this tutorial, run the rest of the commands within your virtual environment.
The relevant data files for training, `iris_training.csv` and `iris_test.csv`, are stored in the `data` folder.
- Set the `TRAIN_DATA` and `EVAL_DATA` variables to your local file paths. For example, the following commands set the variables to local paths:

  ```shell
  TRAIN_DATA=$(pwd)/data/iris_training.csv
  EVAL_DATA=$(pwd)/data/iris_test.csv
  ```
- Install the dependencies:

  ```shell
  pip install -r ../requirements.txt
  ```
A local training job loads your Python training program and starts a training process in an environment that's similar to that of a live AI Platform cloud training job.
- Specify an output directory and set a `MODEL_DIR` variable. The following command sets `MODEL_DIR` to a value of `output`:

  ```shell
  MODEL_DIR=output
  ```
- It's a good practice to delete the contents of the output directory in case data remains from a previous training run. The following command deletes all data in the output directory:

  ```shell
  rm -rf $MODEL_DIR/*
  ```
- To run your training locally, run the following command. Note that the bare `--` separates the flags meant for `gcloud` itself from the arguments passed through to your training application:

  ```shell
  gcloud ai-platform local train \
      --module-name trainer.task \
      --package-path trainer/ \
      --job-dir $MODEL_DIR \
      -- \
      --train-files $TRAIN_DATA \
      --eval-files $EVAL_DATA \
      --train-steps 1000 \
      --eval-steps 100
  ```
- Inspect the summary logs using TensorBoard:

  ```shell
  tensorboard --logdir=$MODEL_DIR
  ```

  Once TensorBoard is running, you can access it in your browser at http://localhost:6006. Click on Accuracy to see graphical representations of how accuracy changes as your job progresses.
This section shows you how to create a new bucket. You can use an existing bucket, but if it is not part of the project you are using to run AI Platform, you must explicitly grant access to the AI Platform service accounts.
- Specify a name for your new bucket. The name must be unique across all buckets in Cloud Storage:

  ```shell
  BUCKET_NAME="your_bucket_name"
  ```

  For example, use your project name with `-mlengine` appended:

  ```shell
  PROJECT_ID=$(gcloud config list project --format "value(core.project)")
  BUCKET_NAME=${PROJECT_ID}-mlengine
  ```
- Check the bucket name that you created:

  ```shell
  echo $BUCKET_NAME
  ```
- Select a region for your bucket and set a `REGION` environment variable. For example, the following command creates `REGION` and sets it to `us-central1`:

  ```shell
  REGION=us-central1
  ```
- Create the new bucket:

  ```shell
  gsutil mb -l $REGION gs://$BUCKET_NAME
  ```

  Note: Use the same region where you plan on running AI Platform jobs. The example uses `us-central1` because that is the region used in the getting-started instructions.
- Use `gsutil` to copy the two data files to your Cloud Storage bucket:

  ```shell
  gsutil cp -r data gs://$BUCKET_NAME/data
  ```
- Set the `TRAIN_DATA` and `EVAL_DATA` variables to point to the files:

  ```shell
  TRAIN_DATA=gs://$BUCKET_NAME/data/iris_training.csv
  EVAL_DATA=gs://$BUCKET_NAME/data/iris_test.csv
  ```
- Use `gsutil` again to copy the JSON test file `test.json` to your Cloud Storage bucket:

  ```shell
  gsutil cp ../test.json gs://$BUCKET_NAME/data/test.json
  ```
- Set the `TEST_JSON` variable to point to that file:

  ```shell
  TEST_JSON=gs://$BUCKET_NAME/data/test.json
  ```
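For reference, `test.json` holds the instances to classify, one JSON object per line, keyed by feature name. The keys shown below are an assumption based on the Iris data set; the actual file ships with the repository, and its keys must match the feature columns defined by the trainer:

```json
{"sepal_length": 6.4, "sepal_width": 3.2, "petal_length": 4.5, "petal_width": 1.5}
{"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2}
```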
With a validated training job, you're now ready to run a training job in the cloud. You'll start by requesting a single-instance training job.

Use the default `BASIC` scale tier to run a single-instance training job. The initial job request can take a few minutes to start, but subsequent jobs run more quickly. This enables quick iteration as you develop and validate your training job.
- Select a name for the initial training run that distinguishes it from any subsequent training runs. For example, you can append a number to represent the iteration:

  ```shell
  JOB_NAME=iris_single_1
  ```
- Specify a directory for output generated by AI Platform by setting an `OUTPUT_PATH` variable to include when requesting training and prediction jobs. The `OUTPUT_PATH` represents the fully qualified Cloud Storage location for model checkpoints, summaries, and exports. You can use the `BUCKET_NAME` variable you defined in a previous step. It's a good practice to use the job name as the output directory. For example, the following `OUTPUT_PATH` points to a directory named `iris_single_1`:

  ```shell
  OUTPUT_PATH=gs://$BUCKET_NAME/$JOB_NAME
  ```
- Run the following command to submit a training job in the cloud that uses a single process. This time, set the `--verbosity` flag to `DEBUG` so that you can inspect the full logging output and retrieve accuracy, loss, and other metrics. The output also contains a number of warning messages that you can ignore for the purposes of this sample:

  ```shell
  gcloud ai-platform jobs submit training $JOB_NAME \
      --job-dir $OUTPUT_PATH \
      --runtime-version 1.10 \
      --module-name trainer.task \
      --package-path trainer/ \
      --region $REGION \
      -- \
      --train-files $TRAIN_DATA \
      --eval-files $EVAL_DATA \
      --train-steps 1000 \
      --eval-steps 100 \
      --verbosity DEBUG
  ```
You can monitor the progress of your training job by watching the command-line output, or in AI Platform > Jobs on the Google Cloud Platform Console.
- Choose a name for your model; it must start with a letter and contain only letters, numbers, and underscores. For example:

  ```shell
  MODEL_NAME=iris
  ```
- Create an AI Platform model:

  ```shell
  gcloud ai-platform models create $MODEL_NAME --regions=$REGION
  ```
- Select the job output to use. The following sample uses the job named `iris_single_1`:

  ```shell
  OUTPUT_PATH=gs://$BUCKET_NAME/iris_single_1
  ```
- Look up the full path of your exported trained model binaries:

  ```shell
  gsutil ls -r $OUTPUT_PATH/export
  ```
- Find a directory named `$OUTPUT_PATH/export/iris/<timestamp>`, copy this directory path (without the `:` at the end), and set the environment variable `MODEL_BINARIES` to its value. For example:

  ```shell
  MODEL_BINARIES=gs://$BUCKET_NAME/iris_single_1/export/iris/1487877383942/
  ```

  where `$BUCKET_NAME` is your Cloud Storage bucket name, and `iris_single_1` is the output directory.

- Run the following command to create a version `v1`:

  ```shell
  gcloud ai-platform versions create v1 \
      --model $MODEL_NAME \
      --origin $MODEL_BINARIES \
      --runtime-version 1.10
  ```
You can get a list of your models using the `models list` command:

```shell
gcloud ai-platform models list
```
You can now send prediction requests to your model. For example, the following command sends an online prediction request using the `test.json` file inside the `data` folder:

```shell
gcloud ai-platform predict \
    --model $MODEL_NAME \
    --version v1 \
    --json-instances data/test.json
```