Multinomial/ Multimodal Naive Bayes


Multimodal naive bayes (also known as Multinomial Naive Bayes) is a specialized version of naive bayes designed to handle text documents using word counts as it's underlying method of calculating probability. Before diving into what multinomial naive bayes is, it's vital to understand the basics.

Basics

Assumption: Each feature has these properties
For instance, a feature could be a word

  1. Are independent
  2. Are equal

Bayes Theorem: P(A|B) = (P(B|A) * P(A)) / P(B)

  • This is a probabilistic equation that infers a probability from previous data.
Term Definition Formal Term
P(A|B) What is the probability of A given B Posterior
P(B|A) the probability of B given A Likelihood or Conditional probability
P(A) probability of A Prior or Class probability
P(B) probability of B Evidence

Naive Assumption
Given P(X1...Xn|Y) = product of (i = 1 to i = n) P(Xi|Y)
An example

  • P(I love you | Romance ) is equivalent to
  • P(I | Romance) x P(love | Romance) x P(you | Romance)
    • As you can see, the above equivalent equation is a product of the bayes theorem.

Multimodal Naive Bayes

Now what exactly is multimodal naive bayes?

Multimodal naive bayes is a specialized version of naive bayes designed to handle text documents using word counts as it's underlying method of calculating probability.

It's a simple but yet elegant model to handle classification that involve simple clsses that do not involve sentiment analysis (complex expressions of emotions such as sarcasm).
Based on an arbitrary document that is within your set of classes, for example, you have a set of classes that represent subjects in school (mathematics, art, music, and etc..). You are given a document with words closely related to art because they used a lot of keywords but there could be some mathematics, history, and etc too but after going through all the classes and determining their probability it turns out it's most likely art.
Even with low amounts of test data you can still produce an accurate or almost accurate classification model.

All of your classification results should be within your set of classes
Ex: set of classes = { Math, Science, History } and your results should only be either Math, Science, or History and nothing else like Music.

Before going through the steps of calculating the probabilties and determining the classification, there is a case where a document may have a bunch of new words that we have never seen which would result in a probability of zero, so how do we handle that special case?
Typically models will use a smoothing technique, and the one generally associated with multimodal naive bayes is the Laplacian smoothing / Add-one smoothing.

Laplacian Smoothing will be used to determine the probability of a word.
Equation: P(word) = (associatedClassCount(word) + 1) / ((Vocabulary size) + (total word count for class))

  • Associated class count for that word, so through all the corresponding documents that relate to that class, how many times did that word appear?
  • + 1 because we want to give some probability to words that we haven't encountered to avoid zero probability.
    • Because the probability is a product, so anything times zero will result in zero and mess up the results.
  • vocabulary size a set of words found shared among all the documents in the test data.
  • Total word count for that class.

Algorithm

Part 1 Training

  • Create a set to contain the classes and a set for vocabularly encountered.
  • Create a data structure that maintains which classification that document belongs to, the word count for each word in that document, and the total number of words,

Representation can vary

  1. Use a Testing Data Document
    • Associate that document with it's specific class name.
      • Add that class name to your set of classes if not added yet.
    • Go through each word in that document and keep track of the word with it's word count.
      • Add the words to the global vocabulary set.
  2. Repeat step 1 until the desired amount of testing data documents have been accounted for.
    • Ensure that your test results should be contained within the set of classes.

Part 2 Testing

Input: given a list of words / a document
Output: classification
Algorithm can vary if you have a different way of structing the data but the same underlying steps will need to be accounted for eventually

  1. Go through each classification
  2. A given class, get all associated test documents
  3. Determine the total word count for that class with the associated test documents.
  4. Go through each word in the list / document and calculate it's probability then product them.
    • find the total word count for that given word by summing the counts found in the documents associated to that class.
    • probability is determined by using the laplace smoothing equation mentioned earlier in the article.
    • Once the probability for the word is calculated, have a running product of the probability.
  5. Calculate the prior
    • associated class documents / total documents
  6. Multiply the running product probability with the prior
  7. If the product is a greater probability, update the best classification
  8. Once all classifications have been tested, return the best classificaiton.

Code

The algorithm seems long due to the length of text, but in reality it's not that difficult once you see it

import java.util.*;
public class MultimodalNaiveBayes {
	private class Document {
		private int docID;
		private String classification;
		private Map<String, Integer> wordCounts;
		private long totalWords = 0;

		public Document(int docID, String classification) {
			this.docID = docID;
			this.classification = classification;
			wordCounts = new HashMap<>();
		}

		public void addWord(String word) {
			wordCounts.compute(word, (k, v) -> v != null ? v + 1 : 1);
			totalWords++;
		}
	}

	private Map<Integer, Document> documents;
	private Set<String> classifications;
	private Set<String> vocab;

	public MultimodalNaiveBayes() {
		documents = new HashMap<>();
		classifications = new HashSet<>();
		vocab = new HashSet<>();
	}

	public void addDocument(int docID, String classification) {
		Document doc = new Document(docID, classification);
		documents.computeIfAbsent(doc.docID, k -> doc);
		classifications.add(classification);
	}

	public void addWordToDocument(int docID, List<String> words) {
        //Adding words to an associated document as long as it's a valid document
		documents.computeIfPresent(docID, (k, v) -> {
			words.forEach(word -> {
				v.addWord(word);
				vocab.add(word);
			});
			return v;
		});
	}

	public void addWordToDocument(int docID, String[] words) {
		addWordToDocument(docID, Arrays.asList(words));
	}
    
	public String testClassify(List<String> words) {
		String bestClassification = "";
		double highestProbability = Integer.MIN_VALUE, probability = 1.0;
		long totalWordCountForClassification = 0, wordCount = 0;

		for (String classification : classifications) {
			List<Document> associatedClassDocs = documents.values().stream().filter(doc -> doc.classification.equals(classification)).collect(Collectors.toList());

			totalWordCountForClassification = associatedClassDocs.stream().mapToLong(doc -> doc.totalWords).sum();
			
			probability = 1.0; //probability is set to 1.0 rather than 0.0 because anything multiplied by 0.0 is 0.0
			for (String word : words) {
				wordCount = associatedClassDocs.stream().mapToLong(doc -> doc.wordCounts.getOrDefault(word, 0)).sum();
				probability *= ((wordCount + 1) / ((double) totalWordCountForClassification + vocab.size())); 
			}

			double prior = associatedClassDocs.size() / (double) documents.size(); 
			probability *= prior;

			if (probability > highestProbability) {
				highestProbability = probability;
				bestClassification = classification;
			}
		}

		return bestClassification;
	}

	public static void main(String[] args) {
        //There are better ways to represent these documents but this is just an example.
		MultimodalNaiveBayes mnb = new MultimodalNaiveBayes();
        
        //Training
        //classes: { c, j } short for  { Chinese, Japanese }
        //Associating the test documents with their class
		mnb.addDocument(0, "c"); 
		mnb.addDocument(1, "c");
		mnb.addDocument(2, "c");
		mnb.addDocument(3, "j");
        //Adding the words to the test documents
		mnb.addWordToDocument(0, new String[] { "Chinese", "Beijing", "Chinese" });
		mnb.addWordToDocument(1, new String[] { "Chinese", "Chinese", "Shanghai" });
		mnb.addWordToDocument(2, new String[] { "Chinese", "Macao" });
		mnb.addWordToDocument(3, new String[] { "Tokyo", "Japan", "Chinese" });
        
        //Testing
		System.out.println(mnb.testClassify(Arrays.asList("Chinese", "Chinese", "Chinese", "Tokyo", "Japan")));
        //prints out c, for Chinese
	}
}

Pros and Cons

Pros

  • Extremely easy and simple to implement
  • Can out perform complex models with a small data set

Cons

  • Does not perform well for sentiment analysis as ordering of words is not applied
    • Ex: reviews may include sarcasm but can't be detected using this method.
  • Not applicable when the assumptions aren't met

Applications

  • Used as a classifier when classes are simple
    • Marking an email as spam or not spam
    • Classifying an article as sports, politics, technology, and etc

References