In this article, we will be learning about various ways of saving and reusing machine learning models.
Table of Contents
- Introduction
- Serialization using Pickle
- Serialization using HDF5
- Serialization using TensorFlow's SavedModel format and Protocol Buffers
- Serialization using JSON
- Comparison
- Conclusion
- References
Introduction
Creating and training an accurate machine learning model can take from just a few minutes to days or even weeks, depending on how complex it is and the amount of data used to train it. If you wanted to test the final model and apply it in a real-world setting, it would be very time-consuming and inefficient to retrain it every time you wanted to use it. Therefore, it is important to save your model so that it can be quickly loaded without you having to train it again.
What is Serialization?
Serialization is the process of converting an object into a format that can be stored for later use. The reverse of serialization, or the process of using the serialized data to reconstruct the original object, is called deserialization.
There are various ways to serialize machine learning models, and each has its advantages and disadvantages. Some are Python-specific and cannot be used from other programming languages, while others work across several languages. Some produce output that is easy to read and interpret, while others produce binary data that only a program can parse. Some are also easier to implement than others, since machine learning libraries only provide support for certain serialization methods. We will look at several of these methods and discuss when each is suitable to use.
Serialization using pickle
pickle is a Python module that uses binary protocols to serialize and deserialize Python objects. The process of using the pickle module to serialize objects is commonly referred to as "pickling", and deserializing objects using pickle is called "unpickling".
Advantages
- pickle can serialize most Python objects, including functions and classes defined at the top level of a module. Therefore, pickle can be applied to save almost any model; it doesn't matter whether you wrote it from scratch or used a preexisting library. This is a significant advantage pickle has over other serialization methods.
- pickle is very simple to use and can be applied very quickly. It takes just a few lines of code to pickle an object.
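As a rough illustration of that flexibility, the sketch below round-trips a nested structure and a built-in function through pickle. The params dictionary is a made-up stand-in for a model's state. One caveat worth knowing: functions are pickled by reference to their importable name, so an anonymous lambda cannot be pickled.

```python
import pickle

# Hypothetical stand-in for a model's state: nested built-in containers.
params = {"weights": [0.5, -1.2, 3.0], "bias": 0.1, "classes": ("cat", "dog")}

# dumps/loads serialize to and from an in-memory bytes object.
restored = pickle.loads(pickle.dumps(params))

# Functions are stored by reference to their importable name, not by value.
restored_len = pickle.loads(pickle.dumps(len))

# A lambda has no importable name, so pickling it fails.
try:
    pickle.dumps(lambda x: x + 1)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError):
    lambda_pickled = False
```

The restored dictionary is an equal but separate copy of the original, while the restored function is the very same built-in len.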
Disadvantages
- pickle produces binary data, so a pickled file is not human-readable.
- pickle is not secure. Unpickling a file can execute arbitrary code on your machine, so it is important to only unpickle data from a trusted source.
- pickle can only be used with Python.
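The first two points can be seen in a small sketch. The Payload class is hypothetical and deliberately harmless: its __reduce__ method tells pickle to call eval on a fixed arithmetic string during unpickling. A malicious file could name any other callable instead, which is why untrusted pickles should never be loaded.

```python
import pickle

# Pickled data is raw bytes, not text you can read or edit by hand.
data = pickle.dumps({"weights": [0.5, -1.2], "bias": 0.1})
is_binary = isinstance(data, bytes)


class Payload:
    """Hypothetical, harmless stand-in for a malicious object."""

    def __reduce__(self):
        # Instructs the unpickler to call eval("6 * 7") when loading.
        return (eval, ("6 * 7",))


# Loading the bytes runs the callable chosen by whoever created them.
result = pickle.loads(pickle.dumps(Payload()))
print(is_binary, result)
```

Loading does not give back a Payload instance at all; it gives back whatever the embedded call returned.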
Implementation
To pickle an object, we use the pickle.dump function. This takes two arguments: the object to be serialized and the output file. We open this file using Python's built-in open function, where we specify the path to the file and the mode. When we use pickle.dump, the mode is wb, which specifies that we are writing to a binary file.
If you have trained a machine learning model named model, you can pickle it using the following code:
import pickle  # built in, no need to install

with open("model.pickle", "wb") as file:
    pickle.dump(model, file)
To unpickle an object, we use the pickle.load function. This takes one argument: the file containing the pickled data. This time, the mode is rb, specifying that we are reading a binary file.
with open("model.pickle", "rb") as file:
    model = pickle.load(file)
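For a version that runs end to end without a trained model, here is a minimal sketch. The model dictionary and the temporary directory are assumptions made for illustration; any picklable object and any writable path would do.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for a trained model.
model = {"weights": [0.5, -1.2, 3.0], "bias": 0.1}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "model.pickle"

    # Serialize: "wb" opens the file for writing bytes.
    with open(path, "wb") as file:
        pickle.dump(model, file)

    # Deserialize: "rb" opens the file for reading bytes.
    with open(path, "rb") as file:
        loaded_model = pickle.load(file)
```

Passing the already open file object to pickle.load lets the with statement close it reliably.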
Serialization using HDF5
Hierarchical Data Format version 5 (HDF5) is an open-source format designed to store large amounts of data.
Advantages
- It is compatible with several different programming languages, including Python, R, C/C++, and Java.
- It can store and modify compressed data. This is especially useful when dealing with large amounts of data.
Disadvantages
- Not all objects can easily be serialized using HDF5.
- It is less comprehensive than the SavedModel format, discussed next.
Implementation (TensorFlow/Keras)
Keras provides a simple method to save a model using the HDF5 format.
If you have trained a model named model, you can save it using the following code:
# The '.h5' extension specifies that the format to be used is HDF5.
model.save('my_model.h5')
To reconstruct the model and reuse it, use the following code:
from tensorflow import keras
model = keras.models.load_model("my_model.h5")
Serialization using TensorFlow's SavedModel format and Protocol Buffers
Another format you can use with TensorFlow/Keras is SavedModel. Compared to the HDF5 format, it is a more comprehensive save format that saves the entire model's architecture and weights. Instead of one file, this creates a new directory with a structure as shown below:
keras_metadata.pb
saved_model.pb
assets/
variables/
The variables/ directory stores the model weights. saved_model.pb stores the model architecture and training configuration. The .pb file extension indicates that the file is based on Protocol Buffers, commonly referred to as protobufs. Protocol Buffers let us define data structures in text files, from which classes can be generated in several languages such as Python and C++.
Advantages
- SavedModel is a more comprehensive save format than HDF5. Unlike the HDF5 format, it stores external losses and metrics (added using model.add_loss() or model.add_metric()). It also stores custom layers and objects.
- It is compatible with several different programming languages, including Python, R, C/C++, and Java.
Disadvantages
- SavedModel takes up more storage space than the HDF5 format.
Implementation
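We use the same model.save method to save a model using the SavedModel format. With the Keras version bundled in older TensorFlow 2 releases, a path without the .h5 extension selects SavedModel; newer Keras releases choose formats differently, so check the documentation for your version.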
# the directory is my_model/
model.save('my_model')
To reconstruct the model and reuse it, run the following code:
from tensorflow import keras
model = keras.models.load_model("my_model")
Serialization using JSON
JavaScript Object Notation (JSON) is a lightweight data format. The format is very similar to that of Python dictionaries: it consists of key-value pairs, and each key-value pair is separated by a comma. The JSON object is surrounded by curly braces {}.
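To make the format concrete, the sketch below converts a Python dictionary to a JSON string and back using the standard json module. The params dictionary is a made-up example.

```python
import json

# Hypothetical model parameters stored as a plain dictionary.
params = {"name": "linear", "weights": [0.5, -1.2], "bias": 0.1, "trained": True}

# Serialize to a JSON string; sort_keys makes the output order deterministic.
text = json.dumps(params, sort_keys=True)
print(text)
# {"bias": 0.1, "name": "linear", "trained": true, "weights": [0.5, -1.2]}

# Deserialize the string back into a dictionary.
restored = json.loads(text)
```

Note that Python's True becomes JSON's true, one of the few visible differences between the two notations.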
Advantages
- JSON is supported by almost all programming languages since it is built on data structures that almost all programming languages understand. In Python, a collection of key-value pairs is a dict. C++ provides a map structure, and Java provides several classes implementing the Map interface that consist of key-value pairs.
- JSON syntax is human-readable. JSON is essentially just a string with a specified format.
- JSON is safe - it cannot execute code upon deserialization.
Disadvantages
- Not every machine learning library provides a way to save a model in JSON format.
- Not every object can easily be serialized using JSON. If your library has no JSON support, you may have to write the serialization code from scratch.
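As a rough sketch of what writing that code from scratch can look like, the example below saves and restores a tiny hand-written linear model. The LinearModel class and its to_json and from_json methods are hypothetical names invented for this illustration, not part of any library.

```python
import json


class LinearModel:
    """Hypothetical from-scratch model: y = w . x + b."""

    def __init__(self, weights, bias):
        self.weights = list(weights)
        self.bias = bias

    def predict(self, features):
        return sum(w * x for w, x in zip(self.weights, features)) + self.bias

    def to_json(self):
        # Only plain data (lists, numbers) goes into the JSON document.
        return json.dumps({"weights": self.weights, "bias": self.bias})

    @classmethod
    def from_json(cls, text):
        data = json.loads(text)
        return cls(data["weights"], data["bias"])


model = LinearModel([2.0, -1.0], 0.5)
restored = LinearModel.from_json(model.to_json())
print(restored.predict([3.0, 4.0]))  # 2.5
```

Because only data is stored, loading the JSON cannot run code, but the loading side must already know how to rebuild the object from that data.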
Implementation (XGBoost)
XGBoost is a gradient boosting library. It works with several programming languages and allows us to export models to JSON files.
If you created and trained an XGBClassifier model named model, you can export it to JSON using the save_model method:
model.save_model('model.json')
This model can then be loaded using the load_model method:
from xgboost import XGBClassifier

model = XGBClassifier()
model.load_model("model.json")
Comparison
We looked at various techniques for serializing models. Depending on the scenario, one format may be better than the others. Here are some things you may want to keep in mind when deciding which format to use.
- When you want to save an object quickly and you are only using Python, pickle is likely the best option. It is almost always the easiest to implement.
- When you are using a machine learning library that does not provide built-in support for any file formats (e.g. scikit-learn), pickle is the easiest to use because it works with almost any type of Python object.
- When you are dealing with large amounts of data, HDF5 is a good option.
- If you are using TensorFlow or Keras, SavedModel is usually a better option than HDF5 since SavedModel is more comprehensive.
- When readability and cross-platform compatibility matter, no premade serialization implementation meets these requirements, and you are willing to put in the time to write extra code, JSON is likely the best option.
Conclusion
In this article at OpenGenus, we learned why serialization of machine learning models is important, several different techniques for serialization, and when each of them is suitable to use.
That is it for this article, and thank you for reading.
References
- Introduction to Model IO - xgboost 1.6.1 documentation, https://xgboost.readthedocs.io/en/stable/tutorials/saving_model.html
- pickle - Python object serialization - Python 3.10.5 documentation, https://docs.python.org/3/library/pickle.html
- Save and load Keras models | TensorFlow Core, https://www.tensorflow.org/guide/keras/save_and_serialize