Project ideas for Data Science

In this article, we will get to know some ideas for data science projects.

Influencer Extraction from Social Networks
Traffic monitoring with satellite images
Extracting knowledge from informal texts
Language acquisition using neural networks
Real time sign-language interpretation
Personalized recommender systems
Disease detection
Music genre classification
Predicting financial trends
Movie analysis
Handwriting recognition
Customer Segmentation

Data is often called the oil of the 21st century, and many businesses around the globe use data science to solve an array of problems. Hands-on experience in working with data and solving real-world problems is what sets a good data scientist apart from others. Data science projects demonstrate our ability because they are practical applications of our skills, and they help us stand out. Discussed below are some data science project ideas.

Influencer Extraction from Social Networks

Social media generates huge amounts of data, as information is continuously shared and exchanged on platforms like Twitter, Instagram and Facebook. This data reveals trending topics, people's interests, the latest trends and much more. Mining knowledge from such network data in an ordered manner remains a challenge, which makes it a good project. We can analyze large-scale data from social networks and identify influencers, popular personalities or celebrities using graph analytics algorithms and visualizations. This is similar to how Instagram identifies and verifies the accounts of celebrities and influencers (marked with a blue tick after their name). A dataset from the Stanford Large Network Dataset Collection (SNAP) or any other dataset of our choice can be used for this project.

Dataset available: Stanford Large Network Dataset Collection (SNAP)
Models that can be used: Clustering, Random Forest
Related work: Detecting Influencers in Social Networks Through Machine Learning Techniques by Rishabh Makhija, Syed Ali & R. Jaya Krishna
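
As a starting point, the graph analytics step can be sketched with PageRank, one common way to score how influential a node is. This is an illustrative choice rather than the method of the related work above; the edge list and account names are hypothetical, and a real project would load an edge list from SNAP instead.

```python
from collections import defaultdict


def pagerank(edges, damping=0.85, iterations=50):
    """Score nodes of a directed graph given as (follower, followed) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Accounts that follow nobody ("dangling" nodes) spread their
        # rank evenly over the whole graph so that no rank is lost.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {node: base for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical toy network: ("ann", "dana") means ann follows dana.
toy_edges = [
    ("ann", "dana"), ("bob", "dana"), ("cy", "dana"),
    ("dana", "eve"), ("eve", "dana"), ("ann", "bob"),
]
ranks = pagerank(toy_edges)
top_account = max(ranks, key=ranks.get)
print(top_account, round(ranks[top_account], 3))
```

In this toy graph the account followed by most others ends up with the highest score. On real data, the scores could be fed as features into the clustering or random forest models listed above.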

Traffic monitoring with satellite images

The transportation sector is responsible for a large and growing share of greenhouse gas emissions, but reliable data on the amount of transport on roads at any given time are scarce. Many low and middle-income countries have limited ground-based traffic monitoring and surveying activities. We can use the satellite data that is becoming cheaper by the day and the recent advances in deep convolutional neural networks to develop a machine learning model that uses an object detection network to count vehicles in satellite images and predict average daily vehicle traffic from those counts.

Dataset available: Sentinel mission data
Models that can be used: Faster R-CNN for object detection and gradient-boosted tree-based models
Related work: Commercial Vehicle Traffic Detection from Satellite Imagery with Deep Learning by Moritz Blattner, Michael Mommert and Damian Borth

Extracting knowledge from informal texts

A lot of work has been done on analyzing social media data. The most common tasks are analyzing the sentiment of users about a topic or event, finding behavioral patterns across regions, and predicting stock market movements. These capture the sentiment or opinion of a crowd. Rather than concentrating on the behavior of a group, we can focus on the sentiment or mood of an individual. For example, a user's tweets can be used to assess the state of their mental health. Tweets can be rated according to the ANEW (Affective Norms for English Words) list, which provides a set of emotional ratings for a large number of words in the English language. Two values, the valence mean and the arousal mean, represent the emotional rating of each word. Based on these, we can estimate whether a tweet expresses a depressed mood. A continuous run of such tweets suggests that the user may be in the early stages of depression.

Dataset available: Twitter or data from any social media
Models that can be used: MapReduce framework on Apache Hadoop
Related work: Early Mental Health Problem Detection using Twitter Data by Harsha Vardhan Galla and Syam Sundar Rao Kolla
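
The rating step can be sketched as follows. The lexicon values here are hypothetical placeholders, not the real ANEW ratings, and the threshold of 4.0 and the rule for counting runs are assumptions made only for illustration.

```python
from statistics import mean

# Hypothetical placeholder ratings on a 1-9 scale: word -> (valence, arousal).
# These are NOT the real ANEW values; a real project would load the ANEW list.
TOY_LEXICON = {
    "happy": (8.0, 6.0),
    "love": (8.5, 6.5),
    "tired": (3.0, 3.0),
    "alone": (2.5, 4.0),
    "hopeless": (2.0, 4.5),
}


def score_tweet(text, lexicon=TOY_LEXICON):
    """Return (mean valence, mean arousal) over rated words, or None if none are rated."""
    words = [word.strip(".,!?;:\"'").lower() for word in text.split()]
    rated = [lexicon[word] for word in words if word in lexicon]
    if not rated:
        return None
    return mean(v for v, _ in rated), mean(a for _, a in rated)


def longest_low_valence_run(tweets, threshold=4.0, lexicon=TOY_LEXICON):
    """Length of the longest streak of consecutive tweets with mean valence below threshold.

    A tweet with no rated words breaks the streak (an assumption of this sketch).
    """
    longest = current = 0
    for tweet in tweets:
        score = score_tweet(tweet, lexicon)
        if score is not None and score[0] < threshold:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


timeline = ["Feeling tired and alone.", "Hopeless again", "I love mornings", "so tired"]
print(longest_low_valence_run(timeline))
```

Each tweet is scored independently, so the scoring function maps naturally onto the map phase of a MapReduce job, with the per-user aggregation done in the reduce phase.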

Language acquisition using neural networks

Neural networks have significantly influenced research in the cognitive sciences over the last decade, and language is one of the most important components of human cognition. We can choose one of the many available types of artificial neural network and build a model for language acquisition. It may be a model that acquires one language or one that acquires two languages simultaneously. The model can then be used to understand the cognitive processes of the human mind while acquiring languages.

Dataset available: CHILDES database
Models that can be used: Artificial neural network
Related work: Bilingual Processing On Neural Networks by Marc Tucker and Jie Li

Real time sign-language interpretation

Sign language is an essential tool to bridge the communication gap between hearing and hearing-impaired people. However, many hearing people do not understand sign language. As an attempt to bridge this gap, we can build a machine learning model that detects a sign using object detection and then interprets it. We can also use transfer learning to translate ASL signs to English. This can further be integrated into video calling applications so that members of the deaf community can be understood by others effectively.

Dataset available: Usually, image samples of gestures are captured during the experiment for training. A pretrained ssd_mobilenet model is also available for this purpose.
Models that can be used: Recurrent neural networks, LSTM and KNN
Related work: American Sign Language Recognition and Training Method with Recurrent Neural Network by C. K. M. Lee, Kam K.H. Ng, Chun-Hsien Chen, H.C.W. Lau, S.Y. Chung and Tiffany Tsoi.

Personalized recommender systems

Recommendation systems are primarily used in commercial applications. They help businesses increase sales by recommending products based on a customer's choices. Big companies like Netflix, YouTube, Facebook and Amazon all use recommendation systems in one way or another. This is a good project idea for showing that you are keeping up with current trends and can build the kind of models that are used extensively in industry.

Dataset available: Netflix Movies and TV Shows dataset
Models that can be used: Cluster models
Related work: Amazon.com Recommendations: Item-to-Item Collaborative Filtering by Greg Linden, Brent Smith, and Jeremy York
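
The item-to-item idea named in the related work can be sketched in a few lines: two items are similar when the same users rate them in a similar way. This is a simplified illustration using cosine similarity, not the production algorithm described in the paper, and the ratings below are hypothetical.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {item: rating on a 1-5 scale}.
RATINGS = {
    "u1": {"A": 5, "B": 4, "C": 1},
    "u2": {"A": 4, "B": 5},
    "u3": {"C": 5, "D": 4},
    "u4": {"A": 5, "B": 5, "D": 1},
}


def item_vectors(ratings):
    """Invert user -> item ratings into item -> {user: rating}."""
    vectors = {}
    for user, items in ratings.items():
        for item, rating in items.items():
            vectors.setdefault(item, {})[user] = rating
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similar_items(item, ratings, top_n=2):
    """Return the top_n (item, similarity) pairs most similar to `item`, best first."""
    vectors = item_vectors(ratings)
    scores = [
        (other, cosine(vectors[item], vector))
        for other, vector in vectors.items()
        if other != item
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


print(similar_items("A", RATINGS, top_n=1))
```

A customer who liked item A would then be recommended the items returned by `similar_items("A", RATINGS)`. With a real dataset, the similarity table would be computed once offline and looked up at recommendation time.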

Disease detection

With many diseases affecting people every day and new ones being added to an already lengthy list, the best way to combat any disease is to detect it at an early stage. We can develop a machine learning model that detects diseases from images of a patient's scans or from test results. This can be modeled for any disease: cancer, retinopathy, diabetes, Parkinson's and many more. For cancer detection, we can train the model using the various carcinoma datasets available, which provide images of malignant cells from previous cancer cases. We can then build a convolutional neural network, as these are well suited to image-based detection. The same approach can be used to detect plant diseases too!

Dataset available: Plant Diseases Dataset, Pneumonia Data
Models that can be used: CNN, Random forest
Related work: Plant Disease Detection Using Machine Learning by Shima Ramesh Maniyath, Vinod P V, Niveditha M and Pooja R.

Music genre classification

This is an interesting project idea if we are passionate about music. The idea is to classify music samples into genres based on the audio input. Classifying music automatically makes song selection a lot easier and quicker; otherwise, someone has to listen to every song in order to classify it. This matters particularly for services like Spotify, which host millions of music tracks. It can also help us find popular music genres and artists easily. We can use the GTZAN Music Genre Classification Dataset for this purpose or any other dataset of our choice.

Dataset available: GTZAN Music Genre Classification Dataset
Models that can be used: CNN

Predicting financial trends

This project idea has a broader scope than the previous ones. We can play around with data from the financial sector and develop many different projects. For example, using historical data, we can build a machine learning model that predicts the future stock market trend of a particular company, the profit or loss in a company's total revenue, its yearly turnover, future trends in currency exchange rates and much more. Since finance is a part of every industry, be it healthcare, food or entertainment, this project idea allows us to explore and experiment with various possibilities.

Dataset available: NIFTY-50 Stock Market Data
Models that can be used: Time series models
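
Before trying heavier time series models, it helps to have a simple baseline to beat. The sketch below uses simple exponential smoothing, which is one illustrative choice among many; the prices and the smoothing factor `alpha` are hypothetical and not taken from the NIFTY-50 data.

```python
def exponential_smoothing(series, alpha=0.5):
    """Return the smoothed level after each observation.

    The last level serves as the forecast for the next, unseen value.
    """
    if not series:
        raise ValueError("series must not be empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]
    levels = [level]
    for value in series[1:]:
        # New level: weighted mix of the latest value and the previous level.
        level = alpha * value + (1 - alpha) * level
        levels.append(level)
    return levels


def mean_absolute_error(series, alpha=0.5):
    """MAE of one-step-ahead forecasts, where levels[t-1] predicts series[t]."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    levels = exponential_smoothing(series, alpha)
    errors = [abs(actual - predicted) for actual, predicted in zip(series[1:], levels)]
    return sum(errors) / len(errors)


# Hypothetical closing prices, for illustration only.
prices = [10, 20, 30]
print(exponential_smoothing(prices)[-1])
print(mean_absolute_error(prices))
```

Comparing the error for a few values of `alpha` on held-out data gives a reference point. A more elaborate model is only worth keeping if it clearly beats this baseline.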

Movie analysis

This is yet another good project idea if you are interested in the entertainment industry. Again, there are many possibilities here. We can classify movies into genres, classify actors' dialog by emotion, find the actors and movies with the most dialog, find which genres did well at the box office in different periods and why, find which actors are disliked by critics but loved by fans, check whether movie adaptations are rated better than the books they are based on, predict movie ratings and much more.

Dataset available: Datasets specific to the project domain chosen are available on the internet.
Models that can be used: The type of model depends on the specific analysis being done, so almost any model can be relevant here.

Handwriting recognition

Recognizing handwritten characters is an application of pattern recognition on images. It has many useful applications, such as processing large sets of handwritten documents, converting handwritten text to speech for blind people, and language translation. This project too uses convolutional neural networks. The dataset depends on the type of handwritten data we want to recognize. For digits, we can use the popular MNIST dataset. For text, there are various datasets available for different languages. If we want to recognize drawings, we can use the Quick, Draw! Dataset by Google.

Dataset available: Quick, Draw! Dataset, MNIST
Models that can be used: CNN
Related work: Machine Learning for Handwriting Recognition by Preetha S, Afrid I M, Karthik Hebbar P and Nishchay S K

Customer Segmentation

Customer segmentation is one of the most popular data science project ideas, and many companies rely on it before launching their campaigns. It is an example of unsupervised learning, where customers are grouped into categories based on their age, gender, the products they buy, purchase frequency, spending habits and so on, using K-means clustering. This helps businesses identify their target audience so that campaigns and products can be tailored to customers' needs for better sales. One dataset that can be used for this purpose is the Mall_Customers dataset.

Dataset available: Mall_Customers dataset
Models that can be used: Clustering models
Related work: Customer Segmentation using machine learning by Aman Banduni and Prof Ilavendhan A
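
To make the clustering step concrete, here is a minimal sketch of K-means (Lloyd's algorithm) in plain Python. The customer rows are hypothetical (annual income, spending score) pairs, not rows from the Mall_Customers dataset, and taking the first k points as initial centroids is a simplification chosen to keep the example deterministic.

```python
from math import dist  # available in Python 3.8+


def k_means(points, k, iterations=100):
    """Cluster points with plain k-means; returns (labels, centroids).

    Initial centroids are simply the first k points (a simplification;
    libraries usually pick them with smarter schemes).
    """
    centroids = [tuple(point) for point in points[:k]]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: dist(point, centroids[c]))
            for point in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


# Hypothetical customers: (annual income, spending score).
customers = [(15, 80), (70, 20), (16, 85), (72, 15), (18, 78), (68, 22)]
labels, centroids = k_means(customers, k=2)
print(labels)
print(centroids)
```

With real data, features on different scales (for example age and income) should be standardized first, since K-means relies on Euclidean distance. The number of clusters k also has to be chosen, for instance by comparing results for several values.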