What is Latency in Machine Learning (ML)?

Latency is a metric used in Machine Learning to evaluate how fast a model runs for a specific application. It is the time taken to process one unit of data when only one unit of data is processed at a time.

Latency is measured in units of time, usually seconds or milliseconds.

In terms of Image Classification:

Latency is the time taken to process one image at batch size 1.
Batch size is the number of images processed together in one pass.
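
One way to measure latency in practice is to time repeated calls on a single input. The sketch below is a minimal illustration, not a benchmark harness: `measure_latency` and `dummy_classifier` are hypothetical names, and the dummy function only stands in for a real image classifier.

```python
import statistics
import time


def measure_latency(predict, sample, warmup=3, runs=20):
    """Return the median time in seconds that predict(sample) takes."""
    # Warm-up calls are not timed; first calls often include one-off setup cost.
    for _ in range(warmup):
        predict(sample)

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)  # one input per call, i.e. batch size 1
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def dummy_classifier(image):
    """Hypothetical stand-in for a real model: reduces the pixels to a class id."""
    return int(sum(image)) % 10


image = [0.5] * 10_000  # hypothetical flattened image
latency = measure_latency(dummy_classifier, image)
print(f"median latency: {latency:.6f} s")
```

The median is used here because a single slow run (for example, one interrupted by another process) would distort an average.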

Example of Latency: GoogLeNet for Image Classification takes 0.057 seconds to classify one image on an Intel Cascade Lake system. This improves to 0.009 seconds if we use the INT8 version of GoogLeNet on the same system for the same application.

This means the user waits 0.009 seconds to get the result for one image.
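
The arithmetic behind the two figures quoted above can be checked with a few lines of Python. The variable names are illustrative, and the images-per-second values assume images are processed strictly one after another at batch size 1.

```python
baseline_latency_s = 0.057  # original model, figure quoted above
int8_latency_s = 0.009      # INT8 version on the same system

speedup = baseline_latency_s / int8_latency_s
baseline_images_per_s = 1 / baseline_latency_s
int8_images_per_s = 1 / int8_latency_s

print(f"speedup: {speedup:.2f}x")                      # speedup: 6.33x
print(f"baseline: {baseline_images_per_s:.1f} images/s")  # baseline: 17.5 images/s
print(f"INT8: {int8_images_per_s:.1f} images/s")          # INT8: 111.1 images/s
```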

Significance of Latency

Latency is important because it is directly tied to the real-time performance of a system.

Lower latency is better.

Latency is the time one has to wait to get the result. If the wait is noticeable, the user has a poor experience. Interactive systems aim to respond in real time, so it is important to reduce latency.

Improving latency is not trivial and requires deep insight into the Machine Learning model at hand and the application concerned. The achievable latency also depends on the Machine Learning framework and the system it runs on.

Latency vs Throughput

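Latency and throughput are related, but they are not interchangeable. Latency is the time taken to process one input, while throughput is the number of inputs processed per unit of time.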

Latency is the main metric for applications used directly by customers, while throughput is the main metric for server applications that pre-compute a specific task or process the inputs of multiple users together.
Latency is measured at batch size 1, while throughput is usually measured at batch size greater than 1.
At batch size 1 with no parallel execution, throughput is the reciprocal of latency. With batching or parallel execution, this simple inverse relation no longer holds.
Usually, the throughput of Machine Learning models tends to improve when batch size is greater than 1 (often a power of 2; the best value depends on the system).
The strategy to improve latency differs from the strategy to improve throughput.
Running the workload in parallel usually improves throughput more than it improves latency.
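
The points above can be illustrated with a toy cost model. This is a sketch under invented assumptions, not a measurement: `batch_time`, the fixed overhead of 0.008 s and the per-item cost of 0.001 s are all hypothetical, chosen only so that batch size 1 takes 0.009 s.

```python
def batch_time(batch_size, fixed_overhead_s=0.008, per_item_s=0.001):
    """Hypothetical cost model: seconds needed to process one batch."""
    return fixed_overhead_s + per_item_s * batch_size


for batch_size in (1, 2, 4, 8, 16, 32):
    seconds = batch_time(batch_size)
    throughput = batch_size / seconds  # items completed per second
    print(f"batch={batch_size:>2}  time per batch={seconds:.3f} s  "
          f"throughput={throughput:6.1f} items/s")
```

Under these made-up numbers, the time per batch rises from 0.009 s at batch size 1 to 0.040 s at batch size 32, while throughput rises from about 111 to about 800 items per second. Each item in a larger batch waits longer for its result, even though the system as a whole finishes more work per second.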

Use of Latency

Latency is mainly used for applications that use batch size = 1. Consider the following applications:

Face Unlock: The camera takes a picture of the person in front of it and runs it through a lightweight ML model (present on the mobile device itself). In this case, only one image is processed at a time, and it must run fast to give the result in real time.
Image Compression web services: Web services generally take in a single user input and run it on their server to generate the output. As one data unit is processed, latency is an important measurement. If the workload increases, data from different users can be combined and run with batch size > 1. An individual user may then wait longer for a result, but the total time to serve all users will be less. Balancing the two is a challenging task of its own.
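
As a rough illustration of combining inputs from different users, the sketch below groups a queue of pending requests into batches. The function `make_batches`, the file names and the maximum batch size of 4 are all hypothetical; a real service would also need a rule for how long to wait before sending a partly filled batch.

```python
def make_batches(requests, max_batch_size):
    """Split pending requests into consecutive batches of at most max_batch_size."""
    return [requests[i:i + max_batch_size]
            for i in range(0, len(requests), max_batch_size)]


pending = [f"user{i}.png" for i in range(1, 11)]  # hypothetical queue of 10 requests
for batch in make_batches(pending, max_batch_size=4):
    print(len(batch), batch)
```

With 10 pending requests and a maximum batch size of 4, this produces batches of 4, 4 and 2 requests.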

With this article at OpenGenus, you should now have a clear idea of latency in Machine Learning. Enjoy.