NLP 101 - What Does NLP Mean?
Natural Language Processing (NLP) has become a hot topic within AI in recent years. To stay ahead of the game, learn more about NLP with our educational NLP 101 series.
In today’s world, artificial intelligence is integrated into nearly everything we do. However, we are still very much in the early stages when it comes to the development of AI. One of the primary traits associated with robust AI is the ability to understand human language.
Natural Language Processing, or NLP, is the intersection of computer science and linguistics, and is primarily concerned with exactly this: creating ways for computers to understand natural language data. Being able to create software that can reliably comprehend language has immense potential.
In fact, you probably already interact with real-world applications of NLP in day-to-day life. Personal assistants like Apple’s Siri or Amazon’s Alexa, online translation services, speech-to-text dictation for faster typing, and even general web searching all rely heavily on Natural Language Processing. These are just a few examples of what we can do with NLP.
But what does it actually mean to perform natural language processing?
Language can be incredibly complex, ambiguous, and full of exceptions and strange rules that sometimes confuse even humans. As such, Natural Language Processing involves a multitude of intertwined tasks in an effort to unravel the precise meaning of a given piece of language data. Some of the most notable of these tasks fall under the following categories: Text and Speech Processing, Morphological Analysis, Syntactic Analysis, and Semantic Analysis.
1. Text and Speech Processing
This is the most self-explanatory of these categories. Common tasks under this umbrella include:
(i) Optical Character Recognition, or OCR, for extracting text from sources that aren’t directly encoded as text, such as scanned PDFs or images.
(ii) Speech recognition and text-to-speech services for processing continuous snippets of audio and converting them to text or vice versa.
(iii) Tokenization, which is the process of splitting continuous text into separate words and other tokens, such as punctuation marks.
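To make tokenization concrete, here is a minimal sketch using only Python's standard library. The pattern and the function name `tokenize` are assumptions made for illustration; production tokenizers handle many more cases (URLs, hyphenation, languages without spaces between words).

```python
import re

# Hypothetical pattern: a word (optionally with internal apostrophes),
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:['’]\w+)*|[^\w\s]")


def tokenize(text: str) -> list:
    """Split raw text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("Siri, what's the weather today?")
print(tokens)
# ['Siri', ',', "what's", 'the', 'weather', 'today', '?']
```

Keeping punctuation as separate tokens, rather than discarding it, leaves later steps such as sentence boundary detection with the information they need.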
2. Morphological Analysis
Morphological analysis is concerned with the internal structure of the words in a text. This involves tasks like lemmatization, which reduces a word to its normalized dictionary form, or lemma. This can be useful when calculating something as simple as word counts. Depending on an engineer’s end goal, it may be beneficial to count words with the same lemma together, such as adding present tense am, past tense was, and past participle been all to the count of their shared lemma be.
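The word-count idea can be sketched as follows. The lookup table `LEMMAS` is a tiny hand-written stand-in assumed for this example; a real lemmatizer would cover the whole vocabulary and take context into account.

```python
from collections import Counter

# Hypothetical lookup table mapping inflected forms to their lemma.
LEMMAS = {
    "am": "be", "is": "be", "are": "be",
    "was": "be", "were": "be", "been": "be",
}


def lemma_counts(tokens):
    """Count tokens by lemma, falling back to the lowercased token itself."""
    return Counter(LEMMAS.get(token.lower(), token.lower()) for token in tokens)


counts = lemma_counts("I am where I was and where I have been".split())
print(counts["be"])
# 3
```

Here am, was, and been are each counted once under the shared lemma be, instead of being scattered across three separate entries.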
In addition to lemmatization, part-of-speech tagging, or POS-tagging, is also a key component of morphological analysis. It is just as important for computers as it is for humans to understand whether a word is a verb or a noun given a specific context.
For example, take the following two phrases:
(i) The code of conduct says her behavior is justified;
(ii) She conducted herself well.
Identifying the part of speech of conduct in each case (a noun in the first phrase, a verb in the second) is crucial to understanding the sentence correctly.
3. Syntactic Analysis
Syntactic analysis is centered around determining the correct grammatical structure of a phrase. Key NLP tasks which are concerned with the syntax of a language include:
(i) sentence boundary disambiguation, which determines where one sentence ends and the next begins.
(ii) dependency parsing, which determines the low-level, grammatical relationships between words and particles in a text.
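Sentence boundary disambiguation is harder than splitting on every period, because periods also appear in abbreviations. The sketch below is one simple approach under stated assumptions: the abbreviation list and the function name `split_sentences` are made up for illustration and are far from complete.

```python
# Hypothetical, deliberately short abbreviation list.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "e.g.", "i.e."}


def split_sentences(text):
    """Split on '.', '!' or '?' unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_with_terminator = token[-1] in ".!?"
        if ends_with_terminator and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep trailing text that lacks a final terminator
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("Dr. Lewis called his coworker. Was she in Canada? Yes."))
# ['Dr. Lewis called his coworker.', 'Was she in Canada?', 'Yes.']
```

The period after Dr is not treated as a boundary, while the other three terminators are.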
A high level of ambiguity can be found when conducting syntactic analysis. Take the following example sentence:
Lewis was intently listening to his coworker from Canada.
In this phrase, is Lewis’ coworker from Canada? Or is Lewis listening from Canada, perhaps on the phone or a video call? Both are valid answers and only the phrase’s grammatical structure can help disambiguate cases such as this.
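The two readings can be written down as two dependency analyses that differ in a single attachment. The head assignments below are hand-built for illustration, not the output of any parser, and the dictionary representation (each word mapped to its head) is an assumption that only works because every word in this sentence is unique.

```python
# Reading A: the coworker is from Canada ("Canada" attaches to "coworker").
COWORKER_FROM_CANADA = {
    "Lewis": "listening",
    "was": "listening",
    "intently": "listening",
    "listening": None,  # root of the sentence
    "to": "coworker",
    "his": "coworker",
    "coworker": "listening",
    "from": "Canada",
    "Canada": "coworker",
}

# Reading B: Lewis is listening from Canada ("Canada" attaches to "listening").
LISTENING_FROM_CANADA = {**COWORKER_FROM_CANADA, "Canada": "listening"}


def differing_attachments(parse_a, parse_b):
    """Return the words whose head differs between two analyses."""
    return {
        word: (parse_a[word], parse_b[word])
        for word in parse_a
        if parse_a[word] != parse_b[word]
    }


print(differing_attachments(COWORKER_FROM_CANADA, LISTENING_FROM_CANADA))
# {'Canada': ('coworker', 'listening')}
```

Everything else in the sentence is analyzed identically; the ambiguity comes down to which word Canada depends on.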
4. Semantic Analysis
Lastly, semantic analysis deals with the explicit (and implicit) meanings of words and phrases. This involves:
(i) lexical semantics, which deals with the meanings of individual words.
(ii) word sense disambiguation, which determines which meaning of a word is intended in a given context.
(iii) named entity recognition, or NER, which identifies proper nouns and other notable segments of a text and assigns them an appropriate label.
(iv) relationship extraction, which identifies relationships among entities, such as company A acquired company B. Recognizing named entities is what makes this kind of extraction possible, since the entities must be found before they can be related.
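As a rough illustration of relationship extraction, the sketch below pulls (buyer, relation, target) triples out of text with a single regular expression. Treating any run of capitalized words as an entity is a crude stand-in for real named entity recognition, and the pattern and function name are assumptions for this example only.

```python
import re

# Hypothetical entity pattern: one or more capitalized words in a row.
ENTITY = r"[A-Z]\w*(?: [A-Z]\w*)*"
ACQUISITION = re.compile(rf"(?P<buyer>{ENTITY}) acquired (?P<target>{ENTITY})")


def extract_acquisitions(text):
    """Return (buyer, 'acquired', target) triples found in the text."""
    return [
        (match.group("buyer"), "acquired", match.group("target"))
        for match in ACQUISITION.finditer(text)
    ]


print(extract_acquisitions("In the deal, Company A acquired Company B."))
# [('Company A', 'acquired', 'Company B')]
```

A single pattern like this misses paraphrases such as "was bought by", which is one reason real systems rely on proper entity recognition and a richer set of relation patterns or learned models.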
Now that we have a good idea of some of the tasks carried out when performing natural language processing, how are all of these tasks actually done? Today, NLP engineers largely make use of machine learning processes to achieve high levels of linguistic comprehension. There are two main classes of machine learning algorithms that have shown significant results in NLP: Deep Learning and Rule-Based Machine Learning.
Deep Learning
Deep Learning is the more popular method of the two and often involves heavy statistical and mathematical computations across incredibly large amounts of data. This yields impressive results. However, not only can it be difficult to implement if there is not enough data available, but it can also be difficult to know exactly what a deep learning algorithm is actually doing given its complexity, as well as how and when to tweak it. Hence, these algorithms are sometimes referred to as black boxes.
Rule-Based Machine Learning
Rule-based machine learning, on the other hand, relies on sets of programmed rules which tell the algorithm how to behave given a certain input. These rules are often formatted in some variety of IF...THEN statements and avoid the black box dilemma of deep learning based models, since bugs can be more easily found when the algorithm isn’t behaving as expected. The downside to these models, however, is that creating rules can be a labor-intensive process.
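The IF...THEN idea can be sketched with a toy part-of-speech tagger for the conduct example from earlier. The rules, word lists, and tag names are invented for illustration and are not taken from K-Extractor or any other system. Each rule carries a name, so the output shows which rule fired.

```python
# Hypothetical word lists used by the rules below.
DETERMINERS = {"the", "a", "an"}
PREPOSITIONS = {"of", "in", "on", "from"}
SUBJECT_PRONOUNS = {"i", "you", "he", "she", "we", "they"}

# Each rule: (name, IF-condition on the previous word, THEN-tag).
RULES = [
    ("after-preposition-or-determiner",
     lambda previous: previous in PREPOSITIONS | DETERMINERS, "NOUN"),
    ("after-subject-pronoun",
     lambda previous: previous in SUBJECT_PRONOUNS, "VERB"),
]


def tag_word(tokens, index):
    """Return (tag, rule_name) for tokens[index], using the first rule that fires."""
    previous = tokens[index - 1].lower() if index > 0 else ""
    for name, condition, tag in RULES:
        if condition(previous):
            return tag, name
    return "UNKNOWN", "no-rule-fired"


first = "The code of conduct says her behavior is justified".split()
second = "She conducted herself well".split()
print(tag_word(first, 3))   # ('NOUN', 'after-preposition-or-determiner')
print(tag_word(second, 1))  # ('VERB', 'after-subject-pronoun')
```

Because every decision is tied to a named rule, a wrong tag can be traced back to the rule that produced it. The cost is visible too: each new construction needs another hand-written rule.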
K-Extractor™: Solution Overview
At Lymba, we make use of essentially every aspect of NLP discussed up until this point. In particular, we specialize in creating question answering and semantic search solutions for enterprise clients. This is done with K-Extractor, Lymba’s fully-configurable, domain-agnostic NLP pipeline. You can think of K-Extractor as an NLP assembly line which conducts each of the tasks discussed here, as well as some other tasks for robustness. Some aspects of the pipeline, such as syntactic and semantic parsing, utilize deep learning, while other aspects make use of rule-based machine learning, such as named entity recognition and semantic relation extraction.
Hopefully you now have a foundational understanding of what natural language processing means, how it is done, and how Lymba can help implement an NLP pipeline to fit any language-related business needs.