Person re-identification ReID

In this article, we explore the idea behind person re-identification (ReID), techniques for ReID, and real-world applications.

Table of contents:

  1. Person re-identification ReID
  2. Re-Identification With Deep Learning Works
  3. Unsupervised Re-Identification with Deep Learning
    • DeepSORT
    • Siamese Networks
    • Variational Autoencoders
  4. Applications of Person re-identification ReID

Person re-identification ReID

In short, person re-identification (ReID) is the task of using a picture of a person to identify the presence of the same person in a set of images or video. It is used, for example, to identify a person in CCTV footage.


The problem of matching persons across disjoint camera views in a multi-camera system is known as person re-identification (person Re-ID for short). Changes in a person's pose or position, different camera angles, and occlusion make this task very difficult. Owing to the rapid progress and strong performance of deep learning, person ReID based on deep learning has attracted extensive attention in recent years. It can be used for a variety of public security purposes, including intelligent camera surveillance systems.
In a typical real-world application, a single individual or a watch-list of a few known people serves as the target set, and a vast volume of video surveillance footage is searched for people on the watch-list who are likely to reappear.

The computer vision community has recently focused on person re-identification. While new techniques continually improve matching performance, video surveillance applications still pose problems for ReID systems. Recent person ReID methods fall into three groups: direct methods, metric learning methods, and transform learning methods.
One transform learning-based method, developed after a review of contemporary ReID approaches, uses a cumulative weighted brightness transfer function (CWBTF) to model the appearance of a person in a new camera. The method uses a robust segmentation technique to divide the person image into meaningful parts. Matching features derived only from the body region improves ReID performance. To raise the matching rate further, the approach uses multiple detections of each pedestrian.
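
As a rough illustration of the brightness-transfer idea, the sketch below maps brightness levels from one camera to another by matching cumulative histograms. It shows the plain cumulative brightness transfer function only, not the weighted CWBTF variant mentioned above; the function names and the toy pixel values are assumptions made for this example.

```python
from bisect import bisect_left
from itertools import accumulate

def cumulative_histogram(pixels, levels=256):
    """Normalized cumulative histogram of integer brightness values."""
    counts = [0] * levels
    for p in pixels:
        counts[p] += 1
    total = len(pixels)
    return [c / total for c in accumulate(counts)]

def brightness_transfer(pixels_a, pixels_b, levels=256):
    """Lookup table f with f[b] = H_b^{-1}(H_a(b)): brightness b in camera A
    maps to the level in camera B that has the same cumulative frequency."""
    cdf_a = cumulative_histogram(pixels_a, levels)
    cdf_b = cumulative_histogram(pixels_b, levels)
    table = []
    for b in range(levels):
        # First level in camera B whose cumulative frequency reaches cdf_a[b].
        idx = bisect_left(cdf_b, cdf_a[b])
        table.append(min(idx, levels - 1))
    return table

# Toy data (hypothetical): camera B sees the same person 20 levels brighter.
cam_a = [50, 60, 60, 70, 80, 80, 80, 90]
cam_b = [p + 20 for p in cam_a]
btf = brightness_transfer(cam_a, cam_b)
mapped = [btf[p] for p in cam_a]  # camera A pixels as camera B would see them
```

In practice the histograms would come from person crops observed in both cameras, and the mapping would be applied per color channel before features are compared.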

While the approach achieves state-of-the-art results on color images, effective matching in low-light settings requires additional modalities at night. Cross-modality color-infrared matching is becoming more popular, and several multispectral datasets have been collected in recent years. The SYSU-MM01 dataset contains unpaired color and near-infrared images. The RegDB dataset contains color and infrared images for evaluating cross-modality ReID techniques. Results on these datasets show that color-infrared matching is difficult: in the visible range, contemporary ReID methods achieve a rank-20 ReID rate (r = 20) of 90 percent, while cross-modality ReID approaches reach only 70 percent. Nonetheless, cross-modality techniques make ReID more robust at night.

Thermal cameras have attracted a lot of research interest in the field of video surveillance. While paired color and thermal images improve pedestrian detection and ReID, cross-modality person ReID is difficult because a person looks very different in color and thermal images.

Generative adversarial networks (GANs) have recently shown promising results in arbitrary image-to-image translation. We believe that a GAN framework specialized for color-to-thermal image translation can improve color-thermal ReID performance.

Re-Identification With Deep Learning Works

The first requirement is the availability of raw video data from surveillance cameras. These cameras are usually installed in different locations and under varying conditions. The raw video data frequently contains a lot of complex and noisy background clutter.

Bounding Box Generation: Person detection and tracking algorithms are used to detect people in the video footage. Bounding boxes containing the person images are extracted from the video data.

Training Data Annotation: Cross-camera labels are annotated in the training data. Because of the large cross-camera variations, annotated training data is usually required to learn a discriminative Re-ID model. When there are substantial domain shifts, the training data must usually be annotated again in each new scenario.

Model Training: In the training phase, a discriminative and robust Re-ID model is trained using the previously annotated person images or videos.
This is the most extensively studied part of building a re-identification system. Many models have been developed to address the various challenges, focusing on feature representation learning, distance metric learning, or a combination of the two.

Pedestrian Retrieval: Pedestrian retrieval is carried out during the testing phase. Given a query image of a person of interest and a gallery set, the Re-ID model learned in the previous stage extracts feature representations for both. The computed query-to-gallery similarities are sorted to produce a ranking list (ordered by the probability of an ID match).
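
The retrieval step can be sketched in a few lines. The example below assumes that feature vectors have already been extracted by some Re-ID model and ranks the gallery by cosine similarity to the query; the gallery IDs and feature values are made up for illustration, and cosine similarity is only one possible choice of similarity measure.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_gallery(query, gallery):
    """Return (gallery_id, similarity) pairs sorted from best to worst match."""
    scored = [(gid, cosine_similarity(query, feat)) for gid, feat in gallery.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical 3-dimensional features; real Re-ID embeddings are much longer.
query_feature = [0.9, 0.1, 0.3]
gallery_features = {
    "cam2_person_a": [0.8, 0.2, 0.4],
    "cam2_person_b": [0.1, 0.9, 0.2],
    "cam3_person_c": [0.5, 0.5, 0.5],
}
ranking = rank_gallery(query_feature, gallery_features)
for gallery_id, score in ranking:
    print(f"{gallery_id}: {score:.3f}")
```

The first entry of the ranking is the most likely match for the person of interest; entries further down are progressively less similar.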

Some strategies that operate in open-world settings have also been proposed in recent years.

Unsupervised Re-Identification with Deep Learning

In supervised learning, labels (annotated data) are available. In unsupervised learning, the idea of "cross-camera label estimation" is to estimate Re-ID labels as accurately as possible. The estimated labels are then used in feature learning to build robust Re-ID models. Unsupervised Re-ID has received more attention in recent years as a result of the success of deep learning. Unsupervised Re-ID performance on the Market-1501 dataset has improved dramatically over the last three years, with Rank-1 accuracy rising from 54.5 percent to 90.3 percent and mAP rising from 26.3 percent to 76.7 percent. Despite these promising results, unsupervised Re-ID is still in its infancy and requires further development.
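
One simple way to picture cross-camera label estimation is to cluster unlabeled embeddings and treat the cluster IDs as pseudo-labels. The sketch below uses a distance threshold with union-find as a stand-in; published unsupervised methods use more elaborate clustering, and the threshold and feature values here are assumptions chosen for the example.

```python
import math

def estimate_pseudo_labels(features, threshold):
    """Group embeddings whose Euclidean distance is below `threshold`
    and return one integer pseudo-label per embedding."""
    parent = list(range(len(features)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            if math.dist(features[i], features[j]) < threshold:
                parent[find(j)] = find(i)  # merge the two clusters

    # Relabel cluster roots as 0, 1, 2, ... in order of first appearance.
    label_of_root = {}
    labels = []
    for i in range(len(features)):
        root = find(i)
        labels.append(label_of_root.setdefault(root, len(label_of_root)))
    return labels

# Hypothetical embeddings of unlabeled crops from different cameras.
unlabeled = [
    [0.0, 0.1],  # camera 1, person X
    [0.1, 0.0],  # camera 2, person X
    [5.0, 5.1],  # camera 1, person Y
    [5.1, 4.9],  # camera 3, person Y
]
pseudo_labels = estimate_pseudo_labels(unlabeled, threshold=1.0)
```

The resulting pseudo-labels can then play the role of annotated identities when training the feature extractor, and the two steps are usually alternated.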

There is still a significant gap between unsupervised and supervised Re-ID. On the Market-1501 dataset, for example, the supervised ConsAtt method reaches a Rank-1 accuracy of 96.1 percent, while the best unsupervised method, SpCL, reaches 90.3 percent. Researchers have recently reported that unsupervised learning with large-scale unlabeled training data can outperform supervised learning on a variety of tasks.
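
For readers unfamiliar with the two metrics quoted above, the following sketch shows how Rank-k accuracy and mAP are typically computed from ranked gallery lists. The query data is invented, and benchmark protocols add details (such as excluding same-camera matches) that are left out here.

```python
def rank_k_hit(ranked_ids, true_id, k):
    """1 if the correct identity appears among the top-k results, else 0."""
    return int(true_id in ranked_ids[:k])

def average_precision(ranked_ids, true_id):
    """Average precision for one query over a ranked gallery list."""
    hits, precision_sum = 0, 0.0
    for position, gid in enumerate(ranked_ids, start=1):
        if gid == true_id:
            hits += 1
            precision_sum += hits / position
    return precision_sum / hits if hits else 0.0

# Each query: (true identity, identities of gallery images in ranked order).
queries = [
    ("p1", ["p1", "p2", "p1", "p3"]),
    ("p2", ["p3", "p2", "p1", "p2"]),
]
rank1 = sum(rank_k_hit(r, t, 1) for t, r in queries) / len(queries)
mean_ap = sum(average_precision(r, t) for t, r in queries) / len(queries)
```

Rank-1 only checks the top result, while mAP rewards placing all images of the correct person near the top of the list.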


DeepSORT

DeepSORT performs re-identification and tracking using a combination of a Kalman filter and a CNN. Any object detection model can be used to detect people; YOLOv5 (pre-trained on the COCO dataset) usually works well. The CNN provides an embedding for each detection, which can be used to associate it with other detections. Kalman filters predict "tracks" between frames, which are used to create or drop associations as people enter or leave the frame. According to the authors of the paper, associations are scored with two measures: the Mahalanobis distance (on the Kalman filter's motion state) and the cosine distance (between feature embeddings). In the original paper, the CNN was trained on the MARS dataset. Bounding box predictions are made using an eight-dimensional state (u, v, γ, h, u̇, v̇, γ̇, ḣ).
The states (u, v), γ, and h are the bounding box center point, aspect ratio, and height, respectively.

The states u̇, v̇, γ̇, and ḣ are the corresponding velocities, which the Kalman filter uses to make its estimates.

We employ Kalman filters because the model may miss an association in a single frame but re-identify the person in subsequent frames.
The Kalman filter's predictions let us recover such missed re-identifications.
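
The role of the Kalman filter can be illustrated with a constant-velocity model over the eight-dimensional state (center, aspect ratio, height, and their velocities). This is a minimal NumPy sketch, not the DeepSORT implementation: the noise matrices, the initial track, and the helper names are all assumptions made for the example.

```python
import numpy as np

DIM = 4  # observed box parameters: center u, center v, aspect ratio, height

def make_constant_velocity_model(dt=1.0):
    """State transition F and observation matrix H for the 8-dimensional state."""
    F = np.eye(2 * DIM)
    F[:DIM, DIM:] = dt * np.eye(DIM)  # box parameter += velocity * dt
    H = np.eye(DIM, 2 * DIM)          # only the box itself is observed
    return F, H

def predict(mean, cov, F, process_noise):
    """Propagate the track one frame ahead."""
    return F @ mean, F @ cov @ F.T + process_noise

def update(mean, cov, measurement, H, measurement_noise):
    """Correct the predicted track with an associated detection."""
    innovation = measurement - H @ mean
    S = H @ cov @ H.T + measurement_noise
    K = cov @ H.T @ np.linalg.inv(S)
    new_mean = mean + K @ innovation
    new_cov = (np.eye(len(mean)) - K @ H) @ cov
    return new_mean, new_cov

def squared_mahalanobis(mean, cov, measurement, H, measurement_noise):
    """Motion-based distance between a predicted track and a new detection."""
    innovation = measurement - H @ mean
    S = H @ cov @ H.T + measurement_noise
    return float(innovation @ np.linalg.solve(S, innovation))

F, H = make_constant_velocity_model()
Q = 0.01 * np.eye(2 * DIM)  # assumed process noise
R = 0.1 * np.eye(DIM)       # assumed measurement noise

# Hypothetical track: box at (100, 50), aspect ratio 0.5, height 80,
# moving 5 pixels per frame along u.
mean = np.array([100.0, 50.0, 0.5, 80.0, 5.0, 0.0, 0.0, 0.0])
cov = np.eye(2 * DIM)

# The detector misses the person for two frames; the filter keeps predicting.
for _ in range(2):
    mean, cov = predict(mean, cov, F, Q)

# A detection reappears close to the predicted position.
detection = np.array([110.5, 50.0, 0.5, 80.0])
gate = squared_mahalanobis(mean, cov, detection, H, R)
mean, cov = update(mean, cov, detection, H, R)
```

A small `gate` value means the detection is consistent with the predicted motion, so it can be associated with the existing track even though the person was missed in the intermediate frames.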

Siamese Networks

One-shot learning is the foundation of Siamese networks.
One-shot learning is a technique for performing classification on classes that the model has never seen before but that are related to the training data in some way.
Suppose we need to build a facial biometrics system for a 10,000-person organization. We can start by collecting photos of the 10,000 employees and then training a classification model. What if 200 more employees join the company next week?
We could retrain the model on the 10,000 + 200 people. However, we would then have to retrain the model every time a new employee is hired. Siamese networks are useful in such cases. They do not learn to classify directly; instead, they learn from the training data what distinguishes one label from another. In the face biometrics case, the labels are the employees of the company.

Because the Siamese network only learns differences between classes, it can be applied to measure differences between the faces of people it has never seen before. The triplet loss and the contrastive loss are two popular loss functions for training Siamese networks. As with DeepSORT, any object detection model can be used to extract the people in the picture, and the Siamese network can then assign similarity scores to identify people.
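
The two loss functions can be written down compactly. The sketch below shows one common form of each, computed on plain Python lists that stand in for the embeddings produced by the shared network; the margin values and the toy embeddings are assumptions.

```python
import math

def contrastive_loss(emb_a, emb_b, same_person, margin=1.0):
    """Pull matching pairs together; push non-matching pairs
    at least `margin` apart."""
    d = math.dist(emb_a, emb_b)
    if same_person:
        return d ** 2
    return max(0.0, margin - d) ** 2

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Require the anchor to be closer to the positive than to the
    negative by at least `margin`."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

# Hypothetical 2-dimensional embeddings.
anchor = [0.0, 0.0]
positive = [0.1, 0.0]  # same person as the anchor
negative = [1.0, 0.0]  # different person

easy = triplet_loss(anchor, positive, negative)  # constraint already satisfied
hard = triplet_loss(anchor, negative, positive)  # roles swapped: constraint violated
```

A loss of zero means the triplet already satisfies the margin, so only violating triplets contribute gradients during training.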

Variational Autoencoders

VAEs (variational autoencoders) are a type of generative model that can model the distribution of a dataset. After training a VAE to model a distribution, we can use it to generate new samples that fit the distribution, even though the VAE has never seen those samples.

A VAE has a simple structure, consisting of two parts: an encoder and a decoder.
The encoder accepts an image I as input and returns feature embeddings.
In autoencoders, these embeddings are known as latent variables.
The decoder's job is to take the encoder's latent variables and reconstruct the original image I.

The encoder's latent variables, however, are also useful for comparing new and previous detections. Given enough training data, the VAE learns latent variable vectors that capture a lot of information about the person of interest.

So we can train a VAE on images of people and then discard the decoder, because the encoder alone is enough to obtain the latent variables. As with the other approaches, any object detection model can be used to detect people, while the VAE encoder is used to compute the similarity for person ReID.
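
To make the encoder-only usage concrete, here is a toy sketch. `TinyEncoder` is a hypothetical stand-in for a trained VAE encoder: it is a single linear layer with random, untrained weights, so the similarity value it produces is not meaningful. The sketch only shows the data flow (crop, latent mean, similarity), plus the reparameterization and KL terms used during VAE training.

```python
import math
import random

class TinyEncoder:
    """Stand-in for a trained VAE encoder: one linear layer producing the
    mean and log-variance of the latent distribution (random weights)."""

    def __init__(self, input_dim, latent_dim, seed=0):
        rng = random.Random(seed)

        def init():
            return [[rng.gauss(0.0, 0.1) for _ in range(input_dim)]
                    for _ in range(latent_dim)]

        self.w_mu, self.w_logvar = init(), init()

    def encode(self, x):
        mu = [sum(w * xi for w, xi in zip(row, x)) for row in self.w_mu]
        logvar = [sum(w * xi for w, xi in zip(row, x)) for row in self.w_logvar]
        return mu, logvar

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps; needed during training only."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]

def kl_to_standard_normal(mu, logvar):
    """KL(q(z|x) || N(0, I)), the regularization term of the VAE objective."""
    return -0.5 * sum(1 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

encoder = TinyEncoder(input_dim=6, latent_dim=3)
crop_a = [0.2, 0.8, 0.5, 0.1, 0.9, 0.4]      # flattened person crop (made up)
crop_b = [0.25, 0.75, 0.5, 0.1, 0.85, 0.45]  # a second detection (made up)

mu_a, logvar_a = encoder.encode(crop_a)
mu_b, _ = encoder.encode(crop_b)
similarity = cosine_similarity(mu_a, mu_b)  # compare latent means at test time
kl = kl_to_standard_normal(mu_a, logvar_a)
z = reparameterize(mu_a, logvar_a, random.Random(1))
```

At test time only `encode` is needed: the latent mean serves as the embedding, and the decoder never runs.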

Applications of Person re-identification ReID

The ROSE Lab produced a web-based, AI-powered surveillance system as a project demo, based on its newly developed MMFA-AAE model.
The system works with the 175 surveillance cameras of the NTU EEE building, processing and analyzing video inputs in real time.
Its two key functions are trajectory tracking and retrieval, and real-time person matching.

COVID-19 ROSE & DSTA-Digital Hub Human Re-ID System
The ROSE Re-ID system is built on the Flask micro web framework. RTSP or HTTP video streams can be configured easily, so the system can be integrated into any surveillance network. To improve security during the COVID-19 pandemic, the system was updated and deployed in foreign worker isolation facilities.

With this article at OpenGenus, you should now have a good overview of person re-identification (ReID).
