NLP 101 - Ontologies
Learn more about ontologies and how they can be used in data science today with our NLP 101 series.
What is an Ontology?
There are two definitions for the word ontology:
the branch of metaphysics dealing with the nature of being
a set of concepts and categories in a subject area or domain that shows their properties and the relations between them.
These two definitions are related, although the second is the one we're more concerned with today.
Philosophers have written book after book trying to explain the “nature of being” as referenced in definition one, but definition two gives us a road map of how to break down the details of the "nature of being" without going into a philosophical dissertation.
This is part of a branch of linguistics called semantics, which is the study of meaning.
For example, the word "diamond" can have a very different meaning depending on the context, such as when used by a baseball player, a jeweler, or a poker player. Determining which one is applicable for a given situation can be difficult for humans, let alone computers.
To explain what an ontology is, we begin by discussing these related linguistic terms:
Lexicon
Folksonomy
Taxonomy
Glossary
They all differ by their level of complexity. But what do all of these terms have in common? They all define a collection of terms that are associated with a set of related information, or a knowledge domain. Let's visit each one beginning with the least complex.
1. Lexicon
A lexicon is just a collection of words, terms, and phrases associated with a knowledge domain. There is no other context provided in a lexicon. No definitions, no explanations of how terms are related, no prioritization of significance. A lexicon is roughly equivalent to a vocabulary.
2. Glossary
A glossary is like a lexicon but with glosses added.
A gloss is just a fancy linguistic term for a definition. Usually, each term has only a single definition provided: the one that is applicable to the knowledge domain associated with the glossary.
For example, the definition for diamond in a glossary of baseball wouldn't mention anything about the meaning of the word in jewelry or poker. Even though there is no inherent order or structure to a glossary, glossaries usually list terms alphabetically to make search and lookup easier for humans.
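To make the difference concrete, here is a minimal Python sketch of a lexicon next to a glossary for the same domain. The terms, the wording of the glosses, and the helper name `format_glossary` are all made up for illustration.

```python
# A lexicon is only a collection of terms: no definitions, no ordering.
baseball_lexicon = {"diamond", "inning", "bullpen"}

# A glossary attaches one gloss (definition) to each term, scoped to its domain.
# These glosses are illustrative wording, not quoted from a real glossary.
baseball_glossary = {
    "inning": "one of the divisions of a game in which each team bats once",
    "diamond": "the infield, named for the shape formed by the four bases",
    "bullpen": "the area where relief pitchers warm up",
}

def format_glossary(glossary):
    """Return 'term: gloss' lines, sorted alphabetically for human lookup."""
    return [f"{term}: {gloss}" for term, gloss in sorted(glossary.items())]

lines = format_glossary(baseball_glossary)
for line in lines:
    print(line)
```

The set has no order and no definitions. The dictionary has one definition per term, and the alphabetical order is applied only when the entries are displayed.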
3. Taxonomy
A taxonomy usually lacks the definitions of a glossary, but its terms are organized by their relationship to one another in the form of a hierarchical tree.
Most of the time, this relationship is hypernymy, also called the "is-a" relationship.
For example:
motorcycle is-a vehicle
vehicle is a hypernym of motorcycle
motorcycle is a hyponym of vehicle
Many of us encountered our first wide-reaching taxonomy in middle school, learning the biological tree of life, which is organized by kingdom, phylum, class, order, family, genus, and species.
A polar bear is a bear
A bear is a mammal
A mammal is a vertebrate
and so on.
Most of the time, attributes possessed by a concept higher in a taxonomy are also held by its sub-concepts. For example, an attribute of phylum Chordata is possessing a spinal cord. Therefore, because bears sit in a subtree of phylum Chordata, bears have a spinal cord.
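The inheritance idea can be sketched in a few lines of Python. This is a rough illustration, not a real taxonomy library: the dictionaries `HYPERNYM` and `ATTRIBUTES` and the function names are assumptions, and the "nurses its young" attribute is added only as a second example.

```python
# Each concept points to its hypernym (its "is-a" parent); the root has none.
HYPERNYM = {
    "polar bear": "bear",
    "bear": "mammal",
    "mammal": "vertebrate",
    "vertebrate": "chordate",
}

# Attributes are recorded once, at the most general concept that has them.
ATTRIBUTES = {
    "chordate": {"has a spinal cord"},
    "mammal": {"nurses its young"},  # illustrative extra attribute
}

def hypernym_chain(concept):
    """Return the concept followed by all of its ancestors, nearest first."""
    chain = [concept]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

def is_a(concept, ancestor):
    """True if ancestor is the concept itself or lies above it in the tree."""
    return ancestor in hypernym_chain(concept)

def inherited_attributes(concept):
    """Union of the attributes of the concept and of every ancestor."""
    attrs = set()
    for node in hypernym_chain(concept):
        attrs |= ATTRIBUTES.get(node, set())
    return attrs

print(hypernym_chain("polar bear"))
print(inherited_attributes("bear"))
```

Nothing is stored on "bear" directly. The spinal cord attribute is found by walking up the is-a links until the concept that carries it is reached.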
The ability to classify concepts and then draw inferences based on category memberships allows for great predictive analytical power. The same concept of "a bear is an animal" may appear in different taxonomies that are organized by different criteria. Perhaps another taxonomy classifies animals by geographic region or by their association with humans, such as farm animals, circus animals, or work animals.
4. Folksonomy
A folksonomy shares features of both a lexicon and a taxonomy, but it is not as deeply structured as a taxonomy.
The primary identifier of a folksonomy is that it is created organically through crowdsourcing metadata about a knowledge domain: usually by data mining hashtags or word counts.
One type of folksonomy is a word cloud: a graphical representation of how frequently a term appears in some collection of text. The more frequent the term, the larger it appears in the graphic.
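As a rough sketch of how such a folksonomy might be mined, the Python below counts hashtags across a few posts and scales each count to a font size. The posts are invented, and the linear scaling in `font_sizes` is just one simple assumption about how a word cloud could size its terms.

```python
import re
from collections import Counter

# Hypothetical crowd-written posts; any collection of text would do.
posts = [
    "Saw a polar bear today #arctic #wildlife",
    "Melting ice threatens the #arctic #climate",
    "#wildlife photography tips for the #arctic",
]

def hashtag_counts(texts):
    """Count how often each hashtag appears, ignoring case."""
    tags = []
    for text in texts:
        tags.extend(tag.lower() for tag in re.findall(r"#(\w+)", text))
    return Counter(tags)

def font_sizes(counts, smallest=10, largest=40):
    """Scale each count linearly so the most frequent term gets `largest`."""
    if not counts:
        return {}
    top = max(counts.values())
    return {term: smallest + (largest - smallest) * n / top
            for term, n in counts.items()}

counts = hashtag_counts(posts)
sizes = font_sizes(counts)
for term, n in counts.most_common():
    print(term, n, sizes[term])
```

No one designed the resulting vocabulary. The terms and their relative weight come entirely from what people happened to tag.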
5. Ontology
Like a taxonomy, an ontology entails relationships like hypernymy. An ontology may have hypernymy relationships from multiple overlapping taxonomies.
For example, in a single ontology, we can have both the zoological "a polar bear is a mammal" and the geographic "a polar bear is an Arctic animal".
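One way to picture overlapping taxonomies is to let each concept have a set of parents instead of a single parent. The Python below is a minimal sketch under that assumption; the `PARENTS` table and the `ancestors` function are illustrative names, not part of any ontology toolkit.

```python
# In an ontology a concept may have several hypernyms, one per taxonomy,
# so each concept maps to a set of parents rather than a single parent.
PARENTS = {
    "polar bear": {"bear", "Arctic animal"},  # zoological and geographic
    "bear": {"mammal"},
    "mammal": {"animal"},
    "Arctic animal": {"animal"},
}

def ancestors(concept):
    """Collect every concept reachable by following is-a links upward."""
    found = set()
    stack = [concept]
    while stack:
        for parent in PARENTS.get(stack.pop(), set()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

print(sorted(ancestors("polar bear")))
```

With a single parent per concept the structure is a tree. Allowing several parents turns it into a graph in which "polar bear" is reachable from both the zoological and the geographic branch.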
Ontologies also include other types of relationships, which are usually binary: they describe a relationship between exactly two concepts or entities. These relationships are commonly written either as xRy or in predicate form.
In xRy, x and y are entities and R is the relationship between them, as in "bear is-a mammal".
In predicate form, we write it as is-a(bear, mammal).
This form is consistent with the first-order logic representations used in the programming language Prolog.
Other binary relations commonly used in ontologies are part-whole, property, and value.
We could include "paw part bear", "color property bear", and "black value color". Connected together, these relations form the kind of graph that graph databases refer to.
Each binary relation, together with the two entities x and y that it connects, is called a "semantic triple". Triples of this kind are stored in specialized databases called triple stores. Triples can be connected whenever they share an entity. This is how graph databases are constructed from semantic triples.
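The examples above can be written out as a toy in-memory triple store in Python. This is only a sketch: real triple stores have their own storage formats and query languages, and the helper names `predicate_form`, `build_graph`, and `match` are assumptions made for this illustration.

```python
from collections import defaultdict

# Each triple is (x, R, y), using the examples from the text.
triples = [
    ("bear", "is_a", "mammal"),
    ("paw", "part", "bear"),
    ("color", "property", "bear"),
    ("black", "value", "color"),
]

def predicate_form(triple):
    """Render (x, R, y) as R(x, y), the predicate form described above."""
    x, relation, y = triple
    return f"{relation}({x}, {y})"

def build_graph(triples):
    """Index triples by every entity they mention, so shared entities link them."""
    graph = defaultdict(list)
    for triple in triples:
        x, _, y = triple
        graph[x].append(triple)
        graph[y].append(triple)
    return graph

def match(triples, x=None, relation=None, y=None):
    """Return triples matching the given fields; None acts as a wildcard."""
    return [t for t in triples
            if (x is None or t[0] == x)
            and (relation is None or t[1] == relation)
            and (y is None or t[2] == y)]

graph = build_graph(triples)
print([predicate_form(t) for t in graph["bear"]])
print(match(triples, y="bear"))
```

Looking up "bear" in the index returns three triples, which is the sense in which triples are connected through a shared entity. The `match` function is a simple stand-in for the pattern queries a real triple store would offer.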
Hopefully you now have a better understanding of how ontologies can be leveraged by data scientists. Thanks for watching, from Lymba. Please reach out to us with any questions or for help on your next project.