NLP 101 - N-Grams and Weights
Learn how n-grams are used in natural language processing and data science today!
N-Grams and Weights
One of the foundational ideas in Statistical NLP is the concept of an n-gram: the frequency with which a sequence of n text tokens appears (or is expected to appear) in a body of text.
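To make this concrete, here is a minimal sketch of collecting n-grams from a list of tokens with a sliding window. The token list and function name are illustrative, not taken from the article:

```python
# Slide a window of size n over a token list to collect n-grams.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "cat", "sat", "on", "the", "mat"]
print(ngrams(tokens, 2))
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
```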
N-grams are an aspect of Statistical NLP that draws on an area of AI called Machine Learning. Unlike other areas of programming, Machine Learning does not rely on explicitly programmed rules. Instead, these systems are frameworks capable of “learning” information from training data.
A 1-gram (or unigram) is simply the frequency with which a single token appears.
For example, if we have a corpus (i.e. a body of text) that has 10,000 words and the word “the” appears 320 times, then the probability that any randomly chosen word in the corpus will be “the” is 320/10,000, or 3.2%.
This probability is also called the weight of the n-gram and is a straightforward calculation using basic fractional arithmetic.
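Here is a sketch of that weight calculation in Python. The tiny corpus below is a stand-in for illustration, not the 10,000-word example above:

```python
from collections import Counter

# Count every token, then divide the count of "the" by the total
# number of tokens to get the unigram weight.
corpus = "the cat sat on the mat because the mat was warm".split()
counts = Counter(corpus)

weight = counts["the"] / len(corpus)  # count of "the" / total tokens
print(f"weight('the') = {counts['the']}/{len(corpus)} = {weight:.3f}")
# weight('the') = 3/11 = 0.273
```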
Each element of an n-gram is called a lexeme. The most common lexemes in a body of text are words, but punctuation, numbers, and abbreviations are also lexemes.
Preprocessing Tasks
Before calculating the weights of the n-grams in a text, it is customary to first preprocess the text to normalize certain of its features.
Case Sensitivity - The most common preprocessing step is to remove case sensitivity, typically by lowercasing all text. A word that is capitalized at the beginning of a sentence is then counted together with uncapitalized instances of the same word.
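A minimal sketch of this case folding; the sentence is illustrative:

```python
from collections import Counter

# Lowercase every token so "The" and "the" count as the same unigram.
text = "The dog chased the cat because the dog was bored"
tokens = [t.lower() for t in text.split()]
print(Counter(tokens)["the"])  # 3 -- capitalized "The" merges with "the"
```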
Tokenization - This separates text elements that may be connected, such as genitive (or possessive) constructions, into separate tokens. For instance, the possessive phrase “girl’s book” is split into three tokens: “girl”, “’s”, and “book”. This allows the root token “girl” to contribute to the weight of other instances of “girl”.
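A minimal sketch of this possessive-aware tokenization, using a deliberately simplified regular expression rather than a production tokenizer:

```python
import re

# Match a possessive clitic "'s" (straight or curly apostrophe) as its
# own token, or else a run of word characters.
def tokenize(text):
    return re.findall(r"['’]s|\w+", text)

print(tokenize("girl's book"))  # ['girl', "'s", 'book']
```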
Lemmatization - This involves identifying the various morphological forms of a word and counting them under a common base form. For example, “bring”, “brings”, “bringing”, and “brought” may all be counted toward the weight of the unigram “bring”. Depending on the NLP use case, we may want to preserve metadata about how many past-tense versus present-tense instances of a verb appear in a corpus. A related task, coreference resolution, identifies which nouns are associated with a given pronoun.
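A minimal sketch of lemmatization using a hand-built lookup table. Real systems use tools such as NLTK’s WordNetLemmatizer or spaCy; the table here covers only the example forms above:

```python
# Map inflected forms to their base form (lemma). This toy table is an
# assumption for illustration; it covers only the "bring" examples.
LEMMAS = {"brings": "bring", "bringing": "bring", "brought": "bring"}

def lemmatize(token):
    # Fall back to the token itself when no lemma is known.
    return LEMMAS.get(token, token)

tokens = ["bring", "brings", "bringing", "brought"]
print([lemmatize(t) for t in tokens])  # ['bring', 'bring', 'bring', 'bring']
```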
Natural Language Processing (NLP) has become a hot topic within AI in recent years. To stay ahead of the game, learn more about NLP with our educational NLP 101 series.
Script derived from Chris Irwin Davis, PhD: https://www.linkedin.com/pulse/how-mu...