In machine learning, a large amount of labeled data is usually needed to train accurate models. However, labeling data can be time-consuming and expensive, especially in domains like healthcare or autonomous driving. Active learning addresses this problem by letting the model query a human expert for labels on only the most uncertain or informative data points. This reduces the amount of labeled data needed, which in turn speeds up the training process.
What Is Active Learning?
Active learning is a type of machine learning in which the model is allowed to interact with an external oracle (such as a human annotator) to obtain the correct labels for the most ambiguous data points. Instead of randomly labeling data, the model actively selects the data it is uncertain about, significantly reducing the number of labeled examples needed to achieve good performance.
Key Strategies in Active Learning
1. Uncertainty Sampling: The model queries the examples for which it is least certain of the output. This strategy is effective when the model needs to differentiate between similar classes or when certain classes are underrepresented in the dataset. For example, in image recognition, the model may struggle with ambiguous images and query these for labeling (common scoring functions for this strategy and the next one are sketched after this list).
2. Query by Committee: Instead of using a single model, Query by Committee employs a collection of models. These models may have different architectures or be trained on slightly different subsets of the data. The data points on which the models disagree the most are selected for labeling. This strategy works well when there are many ambiguous data points that confuse even slightly different models.
3. Expected Model Change: This strategy looks for data points that, once labeled, would cause the most significant update to the model. By choosing data that strongly influences the model parameters, this method helps the model generalize faster with fewer queries.
4. Diversity Sampling: The model queries diverse data points that cover as many different regions of the input space as possible. This helps ensure that the model doesn't get stuck focusing on just a narrow part of the data distribution.
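To make the first two strategies concrete, here is a minimal sketch (not part of the original article) of the scoring functions they typically rely on. It assumes probs is an (n_samples, n_classes) array of predicted class probabilities and committee_preds is an (n_models, n_samples) array of hard labels produced by a committee; in every function, a higher score marks a more informative point to query.

import numpy as np

def least_confidence(probs):
    # 1 minus the top-class probability: high when the best guess is weak.
    return 1.0 - np.max(probs, axis=1)

def margin_uncertainty(probs):
    # Negated gap between the two most likely classes: high when they are close.
    sorted_probs = np.sort(probs, axis=1)
    return -(sorted_probs[:, -1] - sorted_probs[:, -2])

def entropy_uncertainty(probs):
    # Predictive entropy: peaks when all classes are equally likely.
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

def vote_entropy(committee_preds, n_classes):
    # Query-by-Committee disagreement: entropy of the committee's vote shares.
    n_models, n_samples = committee_preds.shape
    scores = np.zeros(n_samples)
    for c in range(n_classes):
        frac = np.mean(committee_preds == c, axis=0)
        nonzero = frac > 0
        scores[nonzero] -= frac[nonzero] * np.log(frac[nonzero])
    return scores

Because every score is oriented as "higher = more informative", a query step reduces to np.argmax for a single example or np.argsort for a batch.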
Real-World Applications of Active Learning
Active learning is especially useful in industries where labeled data is expensive to obtain. Some real-world applications include:
1. Healthcare: Medical image classification often requires the expertise of radiologists to label MRI scans or X-rays. By applying active learning, models can identify ambiguous cases and query radiologists only for those, significantly reducing the time and cost involved.
2. Autonomous Driving: In self-driving cars, labeling driving scenarios, obstacles, and road conditions involves human experts. Active learning allows the model to focus on edge cases, like unusual traffic situations, improving the car's ability to operate safely in diverse environments.
3. Natural Language Processing: Tasks like sentiment analysis or legal document classification require human experts to label sentences or paragraphs. Active learning can minimize the amount of human effort by selectively querying the sentences where the model is least confident.
4. Robotics: Robots equipped with machine learning algorithms can use active learning to improve their understanding of their environment. By querying uncertain perceptions, robots can quickly adapt to new situations without requiring exhaustive pre-training data.
Solving A Problem
Active learning can be applied in various domains, but the approach to solving the problem generally follows a simple workflow:
1. Start by training a model on a small labeled dataset.
2. Use the model to predict labels for the remaining unlabeled data.
3. Query an oracle for labels of the data points where the model is least certain.
4. Retrain the model using the newly labeled data and repeat the process (a generic sketch of this loop is shown below).
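This workflow is independent of any particular model or dataset. The following is a minimal, generic sketch of the pool-based loop; train, uncertainty, and oracle are hypothetical callables that you supply (for example, a scikit-learn fit wrapper and a human annotator), not part of any real library API.

from typing import Any, Callable, List, Tuple

def active_learning_loop(
    labeled: List[Tuple[Any, Any]],  # (example, label) pairs labeled so far
    pool: List[Any],                 # unlabeled examples
    train: Callable,                 # fits and returns a model on labeled pairs
    uncertainty: Callable,           # uncertainty(model, x) -> float; higher = less certain
    oracle: Callable,                # oracle(x) -> true label (e.g., a human expert)
    rounds: int = 5,
):
    model = train(labeled)  # step 1: train on the small labeled dataset
    for _ in range(rounds):
        if not pool:
            break
        scores = [uncertainty(model, x) for x in pool]       # step 2: score the pool
        idx = max(range(len(pool)), key=scores.__getitem__)  # step 3: most uncertain point
        x = pool.pop(idx)
        labeled.append((x, oracle(x)))                       # the oracle supplies the label
        model = train(labeled)                               # step 4: retrain and repeat
    return model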
Tracing An Example
Consider an example where we want to classify images of handwritten digits (as in the MNIST dataset). Instead of labeling thousands of images, we start by labeling just 50 images and let the model predict the rest.
1. The model is initially trained on 50 labeled examples.
2. The model then selects the 10 images it’s most uncertain about and asks a human expert to label those (a runnable sketch of this selection step follows the list).
3. The newly labeled data is added to the training set, and the model is retrained.
4. After a few iterations of querying, the model’s accuracy improves significantly, despite having labeled only a fraction of the entire dataset.
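As a concrete illustration of step 2, here is a small self-contained sketch of the batch query. It uses scikit-learn's built-in 8x8 digits dataset as a stand-in for MNIST (an assumption made for brevity) and picks the 10 images with the lowest top-class probability:

import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

# Stand-in for MNIST: scikit-learn's small 8x8 handwritten-digit images.
X, y = load_digits(return_X_y=True)
X_seed, y_seed, X_pool = X[:50], y[:50], X[50:]  # 50 labeled images, rest unlabeled

model = LogisticRegression(max_iter=1000).fit(X_seed, y_seed)
confidence = model.predict_proba(X_pool).max(axis=1)  # top-class probability per image
query_idx = np.argsort(confidence)[:10]               # the 10 least confident images
print("Pool indices to send to the human expert:", query_idx)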
Implementing The Solution
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Load the Iris dataset and hold out a test set for evaluation.
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)

# Start with a small labeled set; the rest acts as the unlabeled pool.
initial_samples = 10  # number of initially labeled examples
X_init, X_pool, y_init, y_pool = train_test_split(
    X_train, y_train, train_size=initial_samples, random_state=42
)

model = RandomForestClassifier(random_state=42)
model.fit(X_init, y_init)

for i in range(5):  # perform 5 rounds of querying
    # Predict class probabilities on the remaining pool data.
    probs = model.predict_proba(X_pool)
    # Confidence = probability of the predicted (top) class.
    confidences = np.max(probs, axis=1)
    # Query the most uncertain example (lowest top-class probability).
    query_idx = np.argmin(confidences)
    X_new = X_pool[query_idx : query_idx + 1]  # keep 2-D shape (1, n_features)
    y_new = y_pool[query_idx : query_idx + 1]
    # Add the newly labeled example to the training set.
    X_init = np.vstack([X_init, X_new])
    y_init = np.hstack([y_init, y_new])
    # Remove the queried example from the pool.
    X_pool = np.delete(X_pool, query_idx, axis=0)
    y_pool = np.delete(y_pool, query_idx, axis=0)
    # Retrain the model on the updated labeled dataset.
    model.fit(X_init, y_init)

accuracy = model.score(X_test, y_test)
print(f"Accuracy after active learning: {accuracy}")
Time And Space Complexity
The complexity of the active learning process depends on:
1. Training complexity: Training the random forest takes roughly O(n log n) per tree, where n is the number of labeled examples.
2. Querying complexity: Scoring the unlabeled pool takes O(m), where m is the size of the pool.
Overall, the time complexity of each iteration of the active learning loop is O(m + n log n). Since the number of unlabeled data points decreases after each query and the labeled set stays small, this approach is more efficient than labeling all of the data in advance.
Key Takeaways
1. Active learning reduces the need for extensive labeled data by querying a human for labels on uncertain examples.
2. Uncertainty sampling is a widely used strategy in which the model queries the data points it is least confident about.
3. The approach is particularly useful in domains where labeling data is expensive, such as healthcare and autonomous vehicles.