Saving and Reusing Machine Learning Models
In this article, we will be learning about various ways of saving and reusing machine learning models.
Table of Contents
- Introduction
- Serialization using Pickle
- Serialization using HDF5
- Serialization using TensorFlow's SavedModel format and Protocol Buffers
- Serialization using JSON
- Comparison
- Conclusion
- References
Introduction
Creating and training an accurate machine learning model can take from just a few minutes to days or even weeks, depending on how complex it is and the amount of data used to train it. If you wanted to test the final model and apply it in a real-world setting, it would be very time-consuming and inefficient to retrain it every time you wanted to use it. Therefore, it is important to save your model so that it can be quickly loaded without you having to train it again.
What is Serialization?
Serialization is the process of converting an object into a format that can be stored for later use. The reverse of serialization, or the process of using the serialized data to reconstruct the original object, is called deserialization.
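For a tiny illustration of the idea (using Python's built-in json module on an arbitrary dictionary - not a real model), serialization turns an object into something storable, and deserialization rebuilds an equivalent object:
import json

scores = {"accuracy": 0.93, "loss": 0.21}

serialized = json.dumps(scores)    # serialization: object -> string
restored = json.loads(serialized)  # deserialization: string -> object

print(serialized)          # {"accuracy": 0.93, "loss": 0.21}
print(restored == scores)  # True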
There are various ways we can serialize machine learning models, and each of them has its advantages and disadvantages. Some are Python-specific, so they cannot be used with other programming languages, while others work across several languages. Some produce files that are human-readable, while others can only be understood by a computer. Additionally, some are easier to implement than others, since machine learning libraries only provide built-in support for certain serialization methods. We will discuss several of these methods and when each of them is suitable to use.
Serialization using pickle
pickle is a Python module that uses binary protocols to serialize and deserialize Python objects. The process of using the pickle module to serialize objects is commonly referred to as "pickling", and deserializing objects using pickle is called "unpickling".
Advantages
- pickle can serialize nearly any object in Python, including functions. Therefore, pickle can be applied to save any model - it doesn't matter whether you built it from scratch or used a preexisting library. This is a significant advantage pickle has over other serialization methods.
- pickle is very simple to use and can be applied very quickly. It takes just a few lines of code to pickle an object.
Disadvantages
- A pickle file is binary, so it is not human-readable.
- pickle is not secure. Unpickling a file can execute malicious code on your machine, so it is important to only unpickle data from a trusted source (see the sketch after this list).
- pickle can only be used with Python.
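As a concrete illustration of the security issue, here is a small sketch - the Malicious class is purely hypothetical - showing that simply loading pickled bytes can run code as a side effect:
import os
import pickle

class Malicious:
    def __reduce__(self):
        # pickle stores the callable and arguments returned here,
        # and calls them again when the data is unpickled.
        # A harmless stand-in for an attacker-controlled command:
        return (os.system, ("echo arbitrary code ran during unpickling",))

payload = pickle.dumps(Malicious())
pickle.loads(payload)  # prints the message - code runs just by loading the data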
Implementation
To pickle an object, we use the pickle.dump method. This takes two arguments - the object to be serialized, and the output file. We open this file using Python's built-in open function, where we specify the path to the file and the mode. When we use pickle.dump, the mode is wb, which specifies that we are writing to a binary file.
If you have trained a machine learning model named model, you can pickle it using the following code:
import pickle  # built-in, no need to install

with open("model.pickle", "wb") as file:
    pickle.dump(model, file)
To unpickle an object, we use the pickle.load method. This takes one argument - the file containing the pickled data. This time, the mode is rb, specifying that we are reading a binary file.
with open("model.pickle", "rb") as file:
model = pickle.load(open("model.pickle", "rb"))
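Putting the two steps together, here is a hedged end-to-end sketch; it assumes scikit-learn is installed, and the LogisticRegression model and file name are only illustrative:
import pickle

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a small model purely for demonstration.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Pickle the trained model to disk.
with open("model.pickle", "wb") as file:
    pickle.dump(model, file)

# Unpickle it later and reuse it without retraining.
with open("model.pickle", "rb") as file:
    restored = pickle.load(file)

print((restored.predict(X) == model.predict(X)).all())  # True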
Serialization using HDF5
Hierarchical Data Format version 5 (HDF5) is an open-source format designed to store large amounts of data.
Advantages
- It is compatible with several different programming languages, including Python, R, C/C++, and Java.
- It can store and modify compressed data. This is especially useful when dealing with large amounts of data.
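As a small illustration of the compression point, the sketch below uses the h5py library (an assumption - it is not required for the Keras example later in this section) to write and read a gzip-compressed dataset:
import h5py
import numpy as np

data = np.random.rand(10000, 100)

# Store the array with gzip compression enabled.
with h5py.File("data.h5", "w") as f:
    f.create_dataset("features", data=data, compression="gzip")

# Read the compressed dataset back into memory.
with h5py.File("data.h5", "r") as f:
    restored = f["features"][:]

print(np.allclose(data, restored))  # True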
Disadvantages
- Not all objects can easily be serialized using HDF5.
- Less comprehensive than the SavedModel format, discussed next.
Implementation (TensorFlow/Keras)
Keras provides a simple method to save a model using the HDF5 format.
If you have trained a model named model, you can save it using the following code:
# The '.h5' extension specifies that the format to be used is HDF5.
model.save('my_model.h5')
To reconstruct the model and reuse it, use the following code:
from tensorflow import keras
model = keras.models.load_model("my_model.h5")
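Below is a minimal end-to-end sketch, assuming TensorFlow 2 with Keras is installed; the tiny Sequential model, training data, and file name are purely illustrative. It trains a model, saves it to HDF5, reloads it, and checks that the two models produce the same predictions:
import numpy as np
from tensorflow import keras

# Build and train a tiny model purely for demonstration.
model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
X = np.random.rand(32, 4)
y = np.random.rand(32, 1)
model.fit(X, y, epochs=1, verbose=0)

# Save to HDF5 and load it back.
model.save("my_model.h5")
restored = keras.models.load_model("my_model.h5")

print(np.allclose(model.predict(X), restored.predict(X)))  # True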
Serialization using TensorFlow's SavedModel format and Protocol Buffers
Another format you can use with TensorFlow/Keras is SavedModel. Compared to the HDF5 format, it is a more comprehensive format that saves the entire model's architecture and weights. Instead of one file, this creates a new directory with a structure as shown below:
keras_metadata.pb
saved_model.pb
assets/
variables/
The variables/ directory stores the model weights. saved_model.pb stores the model architecture and training configuration. The .pb file extension indicates that the file is based on Protocol Buffers, commonly referred to as protobufs. Protocol Buffers let us define data structures in text files, from which classes can be generated in several languages such as Python and C++.
Advantages
- SavedModel is a more comprehensive save format than HDF5. It stores external losses and metrics (added using model.add_loss() or model.add_metric()), unlike the HDF5 format. It also stores custom layers and objects.
- It is compatible with several different programming languages, including Python, R, C/C++, and Java.
Disadvantages
- SavedModel takes up more storage space than the HDF5 format.
Implementation
We use the same model.save method to save a model using the SavedModel format.
# the directory is my_model/
model.save('my_model')
To reconstruct the model and reuse it, run the following code:
from tensorflow import keras
model = keras.models.load_model("my_model")
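Likewise, here is a short hedged sketch - the tiny model is purely illustrative, and it assumes a TensorFlow 2.x version where saving to a path with no extension writes the SavedModel directory shown above - that round-trips a model and lists the directory contents:
import os

import numpy as np
from tensorflow import keras

# A tiny untrained model purely for demonstration.
model = keras.Sequential([keras.Input(shape=(4,)), keras.layers.Dense(1)])
X = np.random.rand(8, 4)

# Save in SavedModel format and load it back.
model.save("my_model")  # writes the my_model/ directory
restored = keras.models.load_model("my_model")

print(sorted(os.listdir("my_model")))  # includes saved_model.pb, variables/, assets/
print(np.allclose(model.predict(X), restored.predict(X)))  # True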
Serialization using JSON
JavaScript Object Notation (JSON) is a lightweight data format. The format is very similar to that of Python dictionaries - there are key-value pairs, and each key-value pair is separated by a comma. The JSON object is surrounded by curly braces {}.
Advantages
- JSON is supported by almost all programming languages since it is built on data structures that almost all programming languages understand. In Python, a collection of key-value pairs is a dict. C++ provides a map structure, and Java provides several classes extending the Map interface that consist of key-value pairs.
- JSON syntax is human-readable. JSON is essentially just a string with a specified format.
- JSON is safe - it cannot execute code upon deserialization.
Disadvantages
- Not every library provides a way to save a model in JSON format.
- Not every object can easily be serialized using JSON. If your library does not support it, you may have to write the serialization code from scratch (a sketch of this approach follows this list).
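To make the "from scratch" option concrete, here is a minimal hedged sketch - the LinearRegression model, toy data, and file name are purely illustrative, and scikit-learn is assumed to be installed - that writes a model's learned parameters to JSON and rebuilds an equivalent model from them:
import json

import numpy as np
from sklearn.linear_model import LinearRegression

# Train a simple model purely for demonstration.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
model = LinearRegression().fit(X, y)

# Hand-rolled serialization: keep only the learned parameters.
with open("model.json", "w") as f:
    json.dump({"coef": model.coef_.tolist(), "intercept": float(model.intercept_)}, f)

# Hand-rolled deserialization: rebuild an equivalent estimator.
with open("model.json") as f:
    params = json.load(f)

restored = LinearRegression()
restored.coef_ = np.array(params["coef"])
restored.intercept_ = params["intercept"]
restored.n_features_in_ = len(params["coef"])  # expected by predict in recent scikit-learn

print(restored.predict(np.array([[4.0]])))  # approximately [8.]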
Implementation (XGBoost)
XGBoost is an optimized gradient boosting library that implements machine learning algorithms under the Gradient Boosting framework. XGBoost works with several programming languages and allows us to export models to JSON files.
If you created and trained an XGBClassifier model named model, you can export it to JSON using the save_model method:
model.save_model('model.json')
This model can then be loaded using the load_model method:
from xgboost import XGBClassifier
model = XGBClassifier()
model.load_model("model.json")
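For completeness, here is a hedged end-to-end sketch - the toy dataset, hyperparameters, and file name are illustrative, and it assumes xgboost and scikit-learn are installed - that trains a classifier, exports it to JSON, and reloads it:
import numpy as np
from sklearn.datasets import load_iris
from xgboost import XGBClassifier

# Train a small classifier purely for demonstration.
X, y = load_iris(return_X_y=True)
model = XGBClassifier(n_estimators=10)
model.fit(X, y)

# Export the trained model to JSON.
model.save_model("model.json")

# Reload it later and reuse it without retraining.
restored = XGBClassifier()
restored.load_model("model.json")

print(np.array_equal(model.predict(X), restored.predict(X)))  # True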
Comparison
We looked at various techniques for serializing models. Depending on the scenario, one method may be better than the others. Here are some things you may want to keep in mind when deciding which format to use.
- When you want to save an object quickly and you are only using Python, pickle is likely the best option. It is almost always the easiest to implement.
- When you are using a machine learning library that does not provide built-in support for any file formats (e.g. scikit-learn), pickle is the easiest to use because it can be applied to any type of Python object.
- When you are dealing with large amounts of data, HDF5 is a good option.
- If you are using TensorFlow or Keras, SavedModel is usually a better option than HDF5 since SavedModel is more comprehensive.
- When readability and cross-platform compatibility matter and no premade serialization implementation meets your requirements, JSON is likely the best option, provided you are willing to put in the time to write the extra code.
Conclusion
In this article at OpenGenus, we learned why serialization of machine learning models is important, explored several different techniques for serialization, and saw when each of them is suitable to use.
That is it for this article, and thank you for reading.
References
- Introduction to Model IO — xgboost 1.6.1 documentation, https://xgboost.readthedocs.io/en/stable/tutorials/saving_model.html
- pickle — Python object serialization — Python 3.10.5 documentation, https://docs.python.org/3/library/pickle.html
- Save and load Keras models | TensorFlow Core, https://www.tensorflow.org/guide/keras/save_and_serialize