An accurate comprehension of the world largely hinges on understanding what motivates the people living in it. A person's action or reaction to any issue depends largely on the answer to the question: what kind of person are they?
In this OpenGenus article, we aim to create a Machine Learning model which can tell us exactly that.
Contents
- The Myers Briggs Type Indicator (MBTI)
- Algorithm : XGBoost Classifier
- Building the Model
- Code Implementation
- Applications
- Key-Takeaways
The Myers Briggs Type Indicator (MBTI)
Personality tests in the modern world generally take the form of questionnaires, their length being a significant factor in the accuracy of the results.
We here attempt to do this task with the help of a machine learning algorithm. We aim to take input textual data and classify them into specific personality traits following the rules of one of the common personality tests.
There is a multitude of personality tests littered across the internet. Some of the most common examples include the Big Five Inventory (BFI) and its variations, the Myers Briggs Type Indicator (MBTI), the Minnesota Multiphasic Personality Inventory (MMPI), and the Hogan Personality Inventory (HPI).
In this article, we use the classification labels of the MBTI to build our model.
The MBTI personality test categorises the entire human population into 16 distinctive labels using a combination of the following characteristics:
- Extroversion (E) or Introversion (I);
- Sensing (S) or Intuition (N);
- Thinking (T) or Feeling (F); and
- Judging (J) or Perceiving (P).
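Since each type takes one letter from each of the four pairs, the full label set is just their Cartesian product. A quick sketch (plain Python, no external dependencies) enumerating all 16:

```python
# Each MBTI label is one letter from each of the four dichotomies,
# so the full set of types is the Cartesian product of the pairs.
from itertools import product

dichotomies = [("E", "I"), ("S", "N"), ("T", "F"), ("J", "P")]
mbti_types = ["".join(combo) for combo in product(*dichotomies)]

print(len(mbti_types))  # 16
print(mbti_types[:3])   # ['ESTJ', 'ESTP', 'ESFJ']
```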
Algorithm : XGBoost Classifier
Our task is primarily a classification task. Thus, we need a robust classification algorithm. We have decided to use the XGBoost model: thanks to its advanced gradient boosting technique, XGBoost often achieves higher accuracy and better predictive performance than other algorithms.
XGBoost, or Extreme Gradient Boosting, is a highly efficient and flexible machine learning algorithm commonly used for supervised learning tasks such as classification and regression. It operates by building an ensemble of decision trees in a sequential manner, where each new tree aims to correct the errors made by the previous trees. This process, known as boosting, helps improve the overall accuracy and robustness of the model.
One of the key strengths of XGBoost is its speed and performance, achieved through advanced optimization techniques such as parallel processing and tree pruning. Additionally, XGBoost includes regularization parameters that help prevent overfitting, making it particularly effective for complex datasets.
Building the Model
Step 1: Data Preprocessing
The foremost task before building any model is the preprocessing of the dataset. As we aim to use text-based data, we go through a series of natural language processing (NLP) tasks to clean and prepare our data, so that noise does not interfere with the results.
This includes converting the whole dataset to lowercase, then removing all punctuation, special characters, numbers, stopwords and alphanumeric tokens from the data. We can also expand contractions where feasible. Additionally, for this model we need to remove any mentions of the labels themselves, for example "ENFJ" or "INFJ", directly present in the text of our dataset, as they may skew the results.
Then we perform lemmatization on the text data to group together inflected forms of the same base word, simplifying the model's grasp of the data.
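The cleaning steps above can be sketched as follows. This is a minimal illustration using only the standard library: the stopword list here is a tiny sample, and a real pipeline would use NLTK's full stopword list and a proper lemmatizer as shown in the code implementation section.

```python
import re

# Minimal cleaning sketch. SAMPLE_STOPWORDS is a tiny illustrative subset;
# a real pipeline would use NLTK's full stopword list and a lemmatizer.
MBTI_TYPES = {"intj", "intp", "entj", "entp", "infj", "infp", "enfj", "enfp",
              "istj", "isfj", "estj", "esfj", "istp", "isfp", "estp", "esfp"}
SAMPLE_STOPWORDS = {"a", "an", "the", "is", "of", "and", "to"}

def clean_text(text):
    text = text.lower()                    # 1. lowercase everything
    text = re.sub(r"[^a-z\s]", " ", text)  # 2. drop punctuation, digits, symbols
    tokens = [t for t in text.split()
              if t not in SAMPLE_STOPWORDS
              and t not in MBTI_TYPES]     # 3. drop stopwords and type labels
    return " ".join(tokens)

print(clean_text("An INFJ is quiet, 100% of the time!"))  # quiet time
```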
Step 2: Vectorization
Computers cannot directly interpret human language or concepts like semantics, symbolism and context. Therefore, we need to vectorize the text.
Vectorization is the process of converting text data into numerical vectors that can be used by machine learning algorithms. Since algorithms work with numerical data, vectorization transforms words, phrases, or entire documents into a format that models can understand and process.
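A toy bag-of-words counter makes the idea concrete; real projects would reach for scikit-learn's CountVectorizer or TfidfVectorizer, as we do in the code implementation section.

```python
# Toy bag-of-words vectorization: each document becomes a vector of
# word counts over a shared vocabulary.
def build_vocab(docs):
    vocab = sorted({word for doc in docs for word in doc.split()})
    return {word: i for i, word in enumerate(vocab)}

def vectorize(doc, vocab):
    vec = [0] * len(vocab)
    for word in doc.split():
        if word in vocab:        # words outside the vocabulary are ignored
            vec[vocab[word]] += 1
    return vec

docs = ["i love thinking", "i love feeling"]
vocab = build_vocab(docs)  # {'feeling': 0, 'i': 1, 'love': 2, 'thinking': 3}
print(vectorize("i love love thinking", vocab))  # [0, 1, 2, 1]
```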
Step 3: Fitting the model
We use label encoding to convert the categorical column of the 16 different personality types into numerical values, so that the data is easier to fit to the machine learning model.
The output of the label encoding and the vectorization is then fed to the XGBoost Classifier as input and the model is fitted.
The model is then trained and tested on the preprocessed data.
Code Implementation
First, we need to clean our data.
Python offers the NLTK (Natural Language Toolkit) library, which contains a list of stopwords to help with preprocessing.
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
For lemmatization we can use the built-in feature of NLTK:
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
After preprocessing is over, we need to vectorize the data.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=4000, stop_words='english')
vectorizer.fit(train_data.texts)
train_post = vectorizer.transform(train_data.texts).toarray()  # train_data.texts is the column containing the actual textual data in the dataset
test_post = vectorizer.transform(test_data.texts).toarray()
Next, label encoding converts the string labels into integers the classifier can work with.
from sklearn.preprocessing import LabelEncoder
target_encoder = LabelEncoder()
train_target = target_encoder.fit_transform(train_data.type)  # train_data.type is the column containing the label in the dataset
test_target = target_encoder.transform(test_data.type)  # transform only, so test labels reuse the fitted mapping
Then we fit our data into the XGBoost Classifier model as discussed previously.
from xgboost import XGBClassifier
model_xgb = XGBClassifier()
model_xgb.fit(train_post, train_target)
pred_xgb = model_xgb.predict(test_post)
pred_training_xgb = model_xgb.predict(train_post)
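The predictions can then be scored against the true labels. Accuracy is simply the fraction of matching predictions; the arrays below are hypothetical stand-ins for test_target and pred_xgb (scikit-learn's accuracy_score does the same job on the real arrays).

```python
# Accuracy = fraction of predictions that match the true labels.
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [0, 1, 2, 1]  # hypothetical encoded test labels
y_pred = [0, 1, 1, 1]  # hypothetical model predictions
print(accuracy(y_true, y_pred))  # 0.75
```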
Applications
Human Resources: HR departments can utilize such models to help categorize personnel by personality traits. This can help in better team building and thus better output. It can also be used to address workers' concerns and needs, fostering a community-like workplace.
Recruitment: It can also help immensely in the hiring process, streamlining it further. We can get a more specific idea of a candidate than a 30-minute interview may provide, helping to find the best fit for a job.
Education : Different students learn differently. The model can help in figuring out the individual needs of a student and can help tailor their academic experience to get the maximum benefit for them.
Crime Prevention and Security: The model can help identify individuals with certain personality traits that may indicate a higher risk of criminal behavior. Thus, it could be used in law enforcement and security to stay a step ahead and help prevent crime.
Market Analysis: It can be used to identify and categorize a company's customer base and its market, helping tailor marketing strategies, personalize content, and create targeted advertising campaigns that resonate better with different segments of the audience, thus enhancing customer experience.
It can also be used in political analysis, in diagnosing and treating mental health issues, in therapy, in detecting harmful online behaviors such as trolling or cyberbullying, and in improving content recommendations, to name a few more.
Key-takeaways
- There are several different types of personality tests. Here, we work with the Myers Briggs Type Indicator (MBTI).
- We use the XGBoost Classifier, for its accuracy, efficiency and speed, to categorize individuals into 16 different personality types.
- We first preprocess the data, then vectorize it, and finally fit it to the model.
- Its applications are widely varied, ranging from helping law enforcement to enhancing customer satisfaction. It can also be utilized in education, health, human resource management, politics, and more.