Genre Classification through Song Lyrics
Introduction
Music genre classification is a challenging task due to the inherent subjectivity and complexity of musical styles. However, by analyzing the textual content of song lyrics, we can extract meaningful features and train machine learning models to classify songs into predefined genres. In this OpenGenus article, we'll walk through a practical implementation of genre classification using song lyrics and machine learning algorithms.
Table of Contents
Section | Title |
---|---|
1 | Introduction |
2 | Data Collection and Preprocessing |
2.1 | Sample Data |
2.2 | Exploring Different Genres |
3 | Feature Extraction: TF-IDF Vectorization |
4 | Model Training: Logistic Regression |
5 | Model Evaluation and Prediction |
6 | The Code |
6.1 | Uploading Kaggle API Key |
6.2 | Downloading and Unzipping Dataset |
6.3 | Data Preprocessing |
6.4 | Saving Merged Data |
6.5 | Model Training |
6.6 | Predicting Genre |
7 | Output Examples |
8 | Other Approaches for Music Genre Classification |
9 | Conclusion |
Data Collection and Preprocessing
To begin our analysis, we need a dataset containing song lyrics labeled with their respective genres. Such data can be obtained from platforms like Kaggle, where various datasets are available for research and analysis. In our case, we will use a dataset consisting of song lyrics from 79 musical genres scraped from the Brazilian website Vagalume, available on Kaggle as "scrapped-lyrics-from-6-genres". Special thanks to the creator of this comprehensive dataset for their efforts. The dataset comprises two main files: "artists-data.csv" and "lyrics-data.csv". These files contain information about artists and their songs, including metadata such as genre, artist name, song title, and lyrics.
Sample Data
To better understand our dataset, let's take a closer look at some sample entries from the "artists-data.csv" and "lyrics-data.csv" files:
artists-data.csv
ALink | Artist | Genres | Songs | Popularity | Link |
---|---|---|---|---|---|
/ivete-sangalo/ | Ivete Sangalo | Pop; Axé; Romântico | 313 | 4.4 | /ivete-sangalo/ |
/claudia-leitte/ | Claudia Leitte | Pop; Axé; Romântico | 167 | 1.5 | /claudia-leitte/ |
... | ... | ... | ... | ... | ... |
lyrics-data.csv
ALink | SName | SLink | Lyric | language |
---|---|---|---|---|
/ivete-sangalo/ | Careless Whisper | /ivete-sangalo/careless-whisper.html | ... | en |
/claudia-leitte/ | Bandera | /claudia-leitte/bandera.html | ... | en |
... | ... | ... | ... | ... |
Exploring Different Genres
Our dataset encompasses a diverse range of musical genres. Here's a breakdown of some of the genres represented:
- Pop: Pop music is known for its catchy melodies and broad appeal, blending various musical elements to create vibrant and energetic tracks.
- Axé: Axé is a popular music genre in Brazil, characterized by its upbeat tempo and Afro-Brazilian influences.
- Romântico: Romantic music, as the name suggests, focuses on love and relationships, often featuring heartfelt lyrics and soulful melodies.
These genres represent just a fraction of the musical diversity in our dataset, which includes a total of 79 different genres.
After merging the datasets, we proceed with data preprocessing steps to clean and prepare the data for feature extraction and model training. This includes:
- Filtering by Language: We drop Portuguese-language rows and keep English (and Spanish) lyrics for this analysis.
- Handling Missing Values: We drop rows with missing values in the 'Lyric', 'language', and 'Genres' columns to ensure data integrity.
- Selecting Relevant Columns: We retain only the 'Lyric', 'language', and 'Genres' columns for further analysis.
Here are the first few rows of the merged dataset:
Lyric | language | Genres |
---|---|---|
Yo lo que quero en esta vida... | es | Pop; Axé; Romântico |
I feel so unsure... | en | Pop; Axé; Romântico |
Tiritas pa este corazón partío... | es | Pop; Axé; Romântico |
Don't let them fool, ya... | en | Pop; Axé; Romântico |
Feature Extraction: TF-IDF Vectorization
To represent song lyrics as numerical features suitable for machine learning algorithms, we employ the TF-IDF (Term Frequency-Inverse Document Frequency) vectorization technique. TF-IDF assigns weights to words based on their frequency in the lyrics and their rarity across all lyrics in the dataset. This process transforms each song's lyrics into a numerical vector, capturing the importance of words within the context of the entire corpus.
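As a quick, self-contained illustration (using a tiny made-up corpus rather than the actual dataset), scikit-learn's TfidfVectorizer, which we also use later in the article, turns raw text into a sparse matrix of TF-IDF weights:
from sklearn.feature_extraction.text import TfidfVectorizer

# A tiny made-up corpus, purely for illustration
toy_lyrics = [
    "love me tender love me true",
    "burn it down burn it all down",
    "dance all night under the lights",
]

# Each lyric becomes one row of TF-IDF weights over the learned vocabulary
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(toy_lyrics)
print(tfidf_matrix.shape)                      # (3, number_of_distinct_terms)
print(vectorizer.get_feature_names_out()[:5])  # a few of the learned terms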
Model Training: Logistic Regression
For genre classification, we choose the logistic regression algorithm due to its simplicity, efficiency, and effectiveness for multiclass classification tasks. After splitting the data into training and testing sets, we train the logistic regression model on the TF-IDF vectors of the song lyrics, using the genre labels as the target variable.
Model Evaluation and Prediction
Once the model is trained, we evaluate its performance using the testing set and measure metrics such as accuracy to assess its effectiveness in genre classification. Additionally, we implement functions to predict the genre of new song lyrics using the trained model.
The Code
from google.colab import files
uploaded = files.upload()
!mkdir -p /root/.kaggle
!mv kaggle.json /root/.kaggle/
This block of code imports the necessary library for uploading files in Google Colab, then it uploads the Kaggle API key file (kaggle.json), creates a directory named .kaggle in the root directory, and moves the uploaded kaggle.json file into this directory.
!kaggle datasets download -d neisse/scrapped-lyrics-from-6-genres
!unzip /content/scrapped-lyrics-from-6-genres.zip
We use the Kaggle CLI to download the dataset named scrapped-lyrics-from-6-genres and then unzip the downloaded archive.
import pandas as pd

# Load the lyrics and the artist metadata
df_lyrics = pd.read_csv('/content/lyrics-data.csv')
df_genre = pd.read_csv('/content/artists-data.csv')

# Join each song to its artist via the artist link columns
df_merged = pd.merge(df_lyrics, df_genre, left_on='ALink', right_on='Link', how='inner')
# Keep only the columns we need, drop missing values, and remove Portuguese-language rows
df_merged = df_merged[['Lyric', 'language', 'Genres']].dropna()
df_merged = df_merged[df_merged['language'] != 'pt']
This section reads the two CSV files into Pandas DataFrames (df_lyrics and df_genre), merges them on the artist link columns (ALink and Link), keeps only the 'Lyric', 'language', and 'Genres' columns, drops rows with missing values, and filters out Portuguese-language lyrics.
df_merged.to_csv('/content/merged_lyrics_data.csv', index=False)
print("Merged data saved to merged_lyrics_data.csv")
This part saves the cleaned and merged data to a new CSV file named merged_lyrics_data.csv.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load the merged data and sample 30% of it for faster experimentation
df_merged = pd.read_csv('/content/merged_lyrics_data.csv').sample(frac=0.3, random_state=42)

# Convert lyrics into TF-IDF feature vectors (top 5,000 terms)
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X = tfidf_vectorizer.fit_transform(df_merged['Lyric'])
y = df_merged['Genres']

# Train a multiclass logistic regression classifier and evaluate it on a 20% hold-out set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print("Accuracy:", accuracy)
This block loads the merged dataset, samples a fraction of it for faster processing, performs TF-IDF vectorization on the lyrics, splits the data into training and testing sets, initializes and trains a Logistic Regression model on the training data, and evaluates its accuracy on the testing data.
def predict_genre(lyrics):
    # Vectorize the input lyrics and return the single most likely genre
    lyrics_vectorized = tfidf_vectorizer.transform([lyrics])
    genre = model.predict(lyrics_vectorized)
    return genre[0]

def predict_genre_top_three(lyrics):
    # Print the three most probable genres along with their predicted probabilities
    lyrics_vectorized = tfidf_vectorizer.transform([lyrics])
    probabilities = model.predict_proba(lyrics_vectorized)[0]
    top_three_indices = probabilities.argsort()[-3:][::-1]
    top_three_genres = model.classes_[top_three_indices]
    top_three_probabilities = probabilities[top_three_indices]
    for genre, probability in zip(top_three_genres, top_three_probabilities):
        print("Genre:", genre, "| Probability:", probability)
input_lyrics = '...'
predicted_genre = predict_genre(input_lyrics)
print("Predicted Genre:", predicted_genre)
input_lyrics = '...'
predict_genre_top_three(input_lyrics)
These functions are defined for predicting the genre of input lyrics. predict_genre predicts a single genre, while predict_genre_top_three predicts the top three probable genres along with their probabilities.
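Note that the first output example below calls a predict_genre_loaded_model helper that is not defined in the code above; it implies the trained model was saved to disk and loaded back before prediction. A minimal sketch of how such a helper could be built, assuming joblib is used for persistence (an assumption, not code from the original notebook):
import joblib

# Hypothetical file names for persisting the trained artifacts
joblib.dump(model, '/content/genre_model.joblib')
joblib.dump(tfidf_vectorizer, '/content/tfidf_vectorizer.joblib')

# Load them back and predict, mirroring predict_genre above
loaded_model = joblib.load('/content/genre_model.joblib')
loaded_vectorizer = joblib.load('/content/tfidf_vectorizer.joblib')

def predict_genre_loaded_model(lyrics):
    # Same logic as predict_genre, but using the reloaded model and vectorizer
    lyrics_vectorized = loaded_vectorizer.transform([lyrics])
    return loaded_model.predict(lyrics_vectorized)[0]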
Output Examples
input_lyrics = 'When the days are cold\nAnd the cards all fold\nAnd the saints we see are all made of gold\nWhen your dreams all fail\nAnd the ones we hail\nAre the worst of all, and the bloods run stale\nI wanna hide the truth\nI wanna shelter you\nBut with the beast inside\nTheres nowhere we can hide\nNo matter what we breed\nWe still are made of greed\nThis is my kingdom come\nThis is my kingdom come\nWhen you feel my heat, look into my eyes\nIts where my demons hide\nIts where my demons hide\nDont get too close, its dark inside\nIts where my demons hide\nIts where my demons hide\nAt the curtains call\nIts the last of all\nWhen the lights fade out, all the sinners crawl\nSo they dug your grave\nAnd the masquerade\nWill come calling out at the mess youve made\nDont wanna let you down\nBut I am hell-bound\nThough this is all for you\nDont wanna hide the truth\nNo matter what we breed\nWe still are made of greed\nThis is my kingdom come\nThis is my kingdom come\nWhen you feel my heat, look into my eyes\nIts where my demons hide\nIts where my demons hide\nDont get too close, its dark inside\nIts where my demons hide\nIts where my demons hide\nThey say its what you make\nI say its up to fate\nIts woven in my soul\nI need to let you go\nYour eyes, they shine so bright\nI wanna save that light\nI cant escape this now\nUnless you show me how\nWhen you feel my heat, look into my eyes\nIts where my demons hide\nIts where my demons hide\nDont get too close, its dark inside\nIts where my demons hide\nIts where my demons hide'
predicted_genre_loaded_model = predict_genre_loaded_model(input_lyrics)
print("Predicted Genre using loaded model:", predicted_genre_loaded_model)
Predicted Genre using loaded model: Heavy Metal
input_lyrics = ' Heaven only knows when Im in hell None of my friends can even tell I wanna fucking die, but I never say it Sick of getting high, but I do the same shit I cant even cry, so I try to fake it I hate it I pray to God, let me die in my sleep I pray to God, let me die in my sleep Youre as sick as all the secrets you keep But the truth is, I dont wanna be me I pray to God, let me die in my sleep Now Im waking up and Im not dead Living off the words I know you said I feel like Im alive and Im gonna make it Maybe if I cry I dont have to fake it Im giving up my pain, so you can take it I hate it I pray to God I dont die in my sleep I pray to God I dont die in my sleep Im as sick as all these secrets I keep But the truth is, I can only be me I pray to God I dont die in my sleep I pray to God I dont die in my sleep I pray to God I dont die in my sleep I pray to God I dont die in my sleep And now Im sick of all the secrets I keep I pray to God, let me die in my- '
predicted_genre = predict_genre(input_lyrics)
print("Predicted Genre:", predicted_genre)
Predicted Genre: Hip Hop; Rap; Black Music
input_lyrics = ' Its way past restoring Lash out call it coping I should have known Yeah you keep me hoping This boat weve been rowing Is stuck on the shore Weve spent a while in this uncertain space But Ive realised that the pieces have changed Now that were past the fun You dont bat an eye All that we were, undone You dont bat an eye Maybe were overrun You dont bat an eye If you and me are done Why are you surprised Driving after midnight The rain on my headlight Got nowhere to go It now that I realise The fun in the daylight Were footprints on snow Weve spent a while in this uncertain space But Ive realized that the pieces have changed Now that were past the fun You dont bat an eye All that we were, undone You dont bat an eye Maybe were overrun You dont bat an eye If you and me are done Why are you surprised Every time we fall in love Knowing that well fall out again Thats okay, thats okay, thats okay Knowing that well fall back on Something thats just filled with pain Why wont it drive me insane Now that were past the fun You dont bat an eye All that we were, undone You dont bat an eye Maybe were overrun You dont bat an eye If you and me are done Why are you surprised '
predicted_genre = predict_genre(input_lyrics)
print("Predicted Genre:", predicted_genre)
Predicted Genre: Indie
Other Approaches for Music Genre Classification
While the logistic regression approach using TF-IDF provides a good starting point for genre classification based on song lyrics, deep learning techniques can potentially capture more intricate patterns and relationships in the data, leading to improved performance. Here are some approaches that can be employed for this task:
1. Recurrent Neural Networks (RNNs)
Overview
RNNs, particularly Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, are well-suited for sequential data like song lyrics. They can capture dependencies between words in a song and maintain a memory of previous words to understand the context better.
Implementation
- Convert lyrics into sequences of word embeddings (e.g., Word2Vec, GloVe).
- Use LSTM or GRU layers to process these sequences.
- Add Dense layers for classification.
Performance
RNNs can capture the sequential nature of lyrics and may outperform traditional methods when dealing with more complex genres with intricate lyrical patterns.
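As a rough sketch (not code from the original article), an LSTM-based classifier along these lines could be assembled in Keras, assuming the lyrics have already been tokenized into padded integer sequences and that the vocabulary size and number of genre labels below are placeholders:
from tensorflow.keras import layers, models

# Placeholder sizes; tune them for the real dataset
vocab_size = 20000   # tokenizer vocabulary size
num_genres = 79      # number of genre labels

rnn_model = models.Sequential([
    layers.Embedding(input_dim=vocab_size, output_dim=128),  # embeddings learned from scratch (or initialized from Word2Vec/GloVe)
    layers.LSTM(128),                                        # captures sequential dependencies between words
    layers.Dense(64, activation='relu'),
    layers.Dense(num_genres, activation='softmax'),          # one probability per genre
])

rnn_model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
# rnn_model.fit(padded_sequences, genre_ids, validation_split=0.1, epochs=5)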
2. Convolutional Neural Networks (CNNs)
Overview
CNNs are primarily used for image data but have been successfully applied to sequential data like text through techniques like 1D convolutions. They can identify local patterns in the text and can be combined with RNNs for hierarchical feature extraction.
Implementation
- Convert lyrics into sequences of embeddings.
- Apply 1D convolutional layers to capture local patterns.
- Use pooling layers to reduce dimensionality.
- Optionally, add LSTM or GRU layers for capturing sequential dependencies.
Performance
CNNs can capture local features in lyrics effectively but might not perform as well as RNNs in capturing long-range dependencies.
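A comparable 1D-CNN text classifier, again a sketch under the same assumptions about tokenized and padded input, might look like:
from tensorflow.keras import layers, models

cnn_model = models.Sequential([
    layers.Embedding(input_dim=20000, output_dim=128),
    layers.Conv1D(filters=128, kernel_size=5, activation='relu'),  # local, n-gram-like patterns
    layers.GlobalMaxPooling1D(),                                   # collapse to a fixed-size vector
    layers.Dense(64, activation='relu'),
    layers.Dense(79, activation='softmax'),                        # placeholder number of genre labels
])

cnn_model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])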
3. Transformer-based Models
Overview
Transformer architectures, such as BERT, GPT, and their variants, have achieved state-of-the-art results in various NLP tasks. They leverage self-attention mechanisms to capture global dependencies in the data effectively.
Implementation
- Fine-tune pre-trained transformer models on the lyrics dataset.
- Use the output embeddings for classification using additional layers (e.g., MLP).
Performance
Transformer-based models can capture both local and global features in lyrics and have shown impressive performance in various NLP tasks. They are likely to outperform other methods but may require more computational resources and data.
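As a sketch of the fine-tuning route (again not from the original article), a pre-trained BERT encoder from the Hugging Face transformers library can be loaded with a fresh classification head; the label count below is a placeholder:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=79,  # placeholder: one label per genre
)

# Tokenize a batch of lyrics and obtain genre logits (PyTorch tensors)
inputs = tokenizer(["When the days are cold ..."], truncation=True,
                   padding=True, max_length=256, return_tensors="pt")
outputs = bert_model(**inputs)
predicted_label_ids = outputs.logits.argmax(dim=-1)
# Fine-tuning would use the Trainer API or a standard PyTorch training loop on the labeled lyrics.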
Comparison of Performance/Accuracy
Model | Indicative Accuracy (%) | Pros | Cons |
---|---|---|---|
Logistic Regression | 70-75 | Simple, interpretable, fast training | Limited capacity to capture complex patterns |
RNN (LSTM/GRU) | 75-80 | Captures sequential dependencies, good for complex genres | May suffer from vanishing gradient problem |
CNN | 72-78 | Captures local features, computationally efficient | May not capture long-range dependencies |
Transformer-based Models | 80-85 | Captures both local and global features, state-of-the-art performance | Requires more computational resources, data |
The choice of model should depend on the complexity of the data, available computational resources and the desired balance between performance and interpretability. Experimenting with multiple architectures and fine-tuning hyperparameters can further enhance the model's performance for this task.
Conclusion
In conclusion, genre classification through song lyrics demonstrates the applicability of machine learning techniques in the domain of music analysis. By leveraging NLP and classification algorithms, we can automate the process of categorizing songs into different genres based solely on their textual content. This not only aids music enthusiasts in discovering new music but also provides valuable insights for music industry professionals in market segmentation and recommendation systems.