Language is confusing, imprecise, and often illogical. The fact that “Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo” is a grammatically correct English sentence is enough to make you tear your hair out. But to be a useful tool, Salience needs to distinguish, with some degree of accuracy, between a city in New York, a large, shaggy member of the bovine family, and the act of bullying or intimidating.
First, a little Linguistics 101: homonyms are words that are spelled the same and sound the same but have different meanings. Technically, Salience differentiates between homographs, which are words that are spelled the same and have different meanings but may be pronounced differently. However, most people are more familiar with the term homonym, which is used colloquially as a blanket term, so it’s the term I’ll be using here.
When Salience runs across a word with multiple meanings, the first thing it does is run the word through a part-of-speech tagger, which identifies whether the word in question is part of a noun phrase, a verb phrase, and so on. If the word is a noun and there’s only one noun definition, the problem is solved. If there are multiple noun definitions, we’ve at least ruled out the verb definitions and narrowed our focus, and we move on to the next step.
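To make the filtering step concrete, here is a minimal sketch in Python. It is not Salience's actual code or API: the `SENSES` table, the tag names, and both function names are made up for illustration, and the tag is assumed to come from whatever part-of-speech tagger you already have.

```python
# Hypothetical sense inventory: word -> list of (part_of_speech, gloss).
# The entries and tag names are illustrative assumptions, not Salience data.
SENSES = {
    "buffalo": [
        ("NOUN", "large bovine animal"),
        ("PROPN", "city in New York"),
        ("VERB", "to bully or intimidate"),
    ],
    "bat": [
        ("NOUN", "flying mammal"),
        ("NOUN", "club used to hit a ball"),
        ("VERB", "to strike with a bat"),
    ],
}


def filter_senses_by_pos(word, pos_tag):
    """Keep only the senses whose part of speech matches the tagger's output."""
    return [gloss for pos, gloss in SENSES.get(word.lower(), []) if pos == pos_tag]


def is_resolved(candidates):
    """A word counts as disambiguated once exactly one candidate sense remains."""
    return len(candidates) == 1


verb_senses = filter_senses_by_pos("buffalo", "VERB")  # one sense left: solved
noun_senses = filter_senses_by_pos("bat", "NOUN")      # two left: keep going
```

With this toy table, tagging “buffalo” as a verb leaves a single sense, while tagging “bat” as a noun still leaves two candidates for the later steps to sort out.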
The next step is an entity tagger, which uses context and other clues, such as capitalization, to identify entities and proper nouns. This catches homonyms (technically capitonyms) like Apple versus apple. Of course, some cases are more difficult. In German, all nouns are capitalized, so the most obvious clue doesn’t exist. Likewise, the informal language people use on platforms like Twitter means that the conventions of grammar are often eschewed, making it harder for our engine to differentiate.
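As a rough illustration of the capitalization clue and its limits, here is a small Python sketch. It covers only capitalization, which is just one of the signals a real entity tagger would use. The function name, the `all_nouns_capitalized` flag, and the three labels are assumptions invented for this example.

```python
def capitalization_signal(tokens, index, all_nouns_capitalized=False):
    """Return what capitalization alone suggests about tokens[index].

    Hypothetical heuristic for illustration only:
      "proper"        -> capitalized in the middle of a sentence
      "common"        -> lowercase (weak evidence: informal text often
                         lowercases names too)
      "uninformative" -> the clue cannot be trusted in this position/language
    """
    token = tokens[index]
    if all_nouns_capitalized:
        # e.g. German, where every noun is capitalized
        return "uninformative"
    if token[:1].isupper():
        # The first word of a sentence is capitalized regardless of meaning.
        return "uninformative" if index == 0 else "proper"
    return "common"


english = "I bought an Apple laptop".split()
german = "Der Apfel ist rot".split()
signal_en = capitalization_signal(english, 3)
signal_de = capitalization_signal(german, 1, all_nouns_capitalized=True)
```

Here the English “Apple” yields a proper-noun signal, while the German “Apfel” yields no usable signal at all, which is why other context has to carry the decision in those cases.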
If ambiguity still exists, Salience uses context and queries to make the most appropriate estimation. Our topics feature, which draws on the wealth of knowledge that Wikipedia provides, can also help identify which homonym is in use. If animals, mammals, and/or wings are also being discussed in the text, Salience recognizes that “bat” doesn’t mean a wooden object used to strike balls in baseball or cricket. For anything important that Salience might miss, customizable queries can help fill the gap, ensuring that the things that matter to the customer are as accurate as possible.
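The “bat” example can be approximated with a simple cue-word overlap, sketched below in Python. This is a stand-in technique, not how the topics feature itself works: the `TOPIC_CUES` table, the sense labels, and `pick_sense` are hypothetical, and a real system would use far richer context than a handful of keywords.

```python
import re

# Hypothetical cue words per sense; a real system would derive these from
# a much larger knowledge source rather than a hand-written table.
TOPIC_CUES = {
    "bat": {
        "flying mammal": {"animals", "mammals", "wings", "cave", "nocturnal"},
        "sports equipment": {"baseball", "cricket", "ball", "pitcher", "innings"},
    },
}


def pick_sense(word, text):
    """Pick the sense whose cue words overlap most with the surrounding text.

    Returns None when no cue word appears, i.e. the context is no help.
    Ties go to whichever sense is listed first in TOPIC_CUES.
    """
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    scores = {sense: len(cues & tokens) for sense, cues in TOPIC_CUES[word].items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


animal = pick_sense("bat", "Bats are mammals with leathery wings.")
sport = pick_sense("bat", "He swung the bat and hit the ball past the pitcher.")
```

When the text offers no cue words at all, `pick_sense` returns `None`, which is the kind of gap a customer-defined query would be expected to fill.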
Words that look the same can also confuse sentiment analysis. That’s why Salience has something called a Subjective model. The model helps differentiate between subjective phrases and what we like to call perfunctory phrases. Subjective phrases, like “This has been a good day”, contain an opinion and are ripe for sentiment scoring. In a perfunctory phrase like “have a good day”, “good” is almost neutral and shouldn’t be scored with the same weight as in the former example. Although the word “good” technically has the same definition in both, the way we use it gives it vastly different connotations.
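To show the idea of weighting “good” differently, here is a toy Python sketch. It is not the Subjective model: the lexicon values, the list of perfunctory phrases, and the down-weighting factor are all assumptions chosen for illustration, and the whole input is treated as a single phrase.

```python
import re

# All values below are illustrative assumptions, not Salience's real data.
LEXICON = {"good": 0.5, "bad": -0.5}
PERFUNCTORY_PHRASES = ["have a good day", "good morning"]
PERFUNCTORY_WEIGHT = 0.1  # how much a perfunctory phrase still counts


def score_sentiment(phrase):
    """Sum lexicon scores, down-weighted if the phrase is perfunctory."""
    lowered = phrase.lower()
    is_perfunctory = any(p in lowered for p in PERFUNCTORY_PHRASES)
    weight = PERFUNCTORY_WEIGHT if is_perfunctory else 1.0
    words = re.findall(r"[a-z]+", lowered)
    return weight * sum(LEXICON.get(w, 0.0) for w in words)


subjective = score_sentiment("This has been a good day")
perfunctory = score_sentiment("Have a good day")
```

Both phrases contain the same word with the same lexicon entry, yet the perfunctory one ends up with a much smaller score, which is the behavior described above.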