Extracting Context

Overview:

Lexalytics provides a number of methods for determining context/automatic topic detection/theme extraction. These methods operate completely automatically, and provide a wealth of information on "what" is being discussed. In other words, with no tuning, you can:

  • extract buzz
  • see what the most negative aspects were on your survey verbatims
  • understand the context behind the sudden blip in media coverage
  • see a competitive shift as soon as it happens
  • detect subtle changes in consumer perception, and be able to act on them 

The two most interesting features in Salience are "themes" and "facets" - which use different technologies and are appropriate for different use cases.   

Detailed Information:

It turns out that nouns and noun phrases are the most generically useful parts of speech with which to determine context. To be even more specific, it’s those nouns that you’re not reaching through named entity recognition and extraction. Named entity extraction is a text analytics process that deals (roughly speaking) with proper nouns. Here we’re considering those nouns that, largely, are not proper nouns. There may be proper nouns that are picked up as part of the “noun analysis” that you’re doing, but that is because they were not identified as a named entity.

Consider the following sentence. It’s politically controversial, but gives a good example of how important it is to separately recognize the named entities and the context.

President Barack Obama did a great job with that awful oil spill.

Named entity recognition and extraction will give you “President Barack Obama” as a person. Sentiment analysis will note a positive sentiment pointed back towards the person “President Barack Obama”. However, without understanding the additional nouns, you’ll have no idea of the context in which President Barack Obama is receiving praise.

And so, other than a vague positive sentiment, you don’t really know anything; as opposed to knowing that some author (or someone being quoted by some author) is giving thumbs up to President Barack Obama’s mad oil spill handling skills.

Extracting non-entity phrases is an excellent next step to greater understanding of the content.

We're going to talk about five computational techniques for extracting these contextual phrases, and finish with a summary: clustering, N-grams, noun-phrase extraction, themes, and facets.

Clustering

Clustering dynamically creates topic categories across a set of content.  Clustering works by simultaneously examining many documents and automatically extracting a set of phrases that best represent the relationships between the documents. The phrases are typically some sort of n-gram (see the next section for N-grams).

Clustering is less useful than other techniques for ongoing analysis of content, in that it is designed to show a snapshot when presented with some large data set. The results of the clustering will change as documents are added and/or deleted from the data set.

Clustering is most useful as a navigational technique to augment search. Clustering can also be used as an analytical technique for looking at a large set of data and “getting the lay of the land”. As such, clustering algorithms are typically optimized around how many documents can be clustered in a given period of time (say, the amount of time you’re willing to wait for search results). 

The main problem with clustering is that its very nature requires more than one piece of text, and the addition of more text changes the clusters. In order to do things like “emerging topic detection”, you need to analyze a single piece of text in isolation from other pieces of text and then store the results for that text. Then when you get another piece of text, analyze it and store the results from that. This way you have results that you can trend over time. 

You can still “cluster” when using results from text processed a piece at a time, but you can also do many other interesting operations that aren’t easy with pure clustering.   
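The “analyze one piece of text, store the result, repeat” approach can be sketched in a few lines. This is only an illustration: the `analyze` function is a toy word counter standing in for a real per-document analysis, and the `store`, `ingest`, and `trend` names, the dates, and the sample texts are all made up for the example.

```python
from collections import Counter, defaultdict
from datetime import date

def analyze(text):
    """Toy stand-in for per-document analysis: count lowercase words."""
    return Counter(text.lower().split())

# Hypothetical result store: accumulated counts keyed by document date.
store = defaultdict(Counter)

def ingest(day, text):
    """Analyze one document in isolation and fold its counts into that day's total."""
    store[day] += analyze(text)

def trend(phrase):
    """Return (day, count) pairs for a phrase, oldest day first."""
    return [(day, store[day][phrase]) for day in sorted(store)]

ingest(date(2010, 5, 1), "oil prices rise")
ingest(date(2010, 5, 2), "oil spill spreads")
ingest(date(2010, 5, 2), "spill cleanup begins")

print(trend("spill"))
```

Because each document is processed on its own, adding a new document never changes the stored results for earlier ones, which is what makes the counts comparable over time.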

Advantages of Clustering

  • Very low latency processing of many thousands of documents for navigational purposes

Disadvantages

  • Not useful for trending
  • Any change in documents changes the clusters
  • Limited to words that appear in the text

N-Grams

N-grams are combinations of one or more consecutive words: one word is a mono-gram, two a bi-gram, three a tri-gram, and so on. N is rarely more than 3, unless you’re looking for a specific slogan or turn of phrase. Words are not restricted to any part-of-speech class, so you’re going to get any and all strings. This matters because you can often improve your signal-to-noise ratio simply by filtering out all words of a particular part-of-speech class (nouns, verbs, adjectives, adverbs, etc.).

Monograms vs. bi-grams vs. tri-grams

Consider these phrases: “crazy good” and “stone cold crazy” (as well as the original sentence, “President Barack Obama did a great job with that awful oil spill”).

 

Phrases extracted from “crazy good” and “stone cold crazy”:

  • Mono-grams: crazy (2), cold, good, stone
  • Bi-grams: crazy good, cold crazy, stone cold
  • Tri-grams: stone cold crazy

Phrases extracted from the President Obama sentence:

  • Mono-grams: a, awful, barack, did, great, job, obama, oil, president, spill, that, with
  • Bi-grams: a great, awful oil, barack obama, did a, great job, job with, obama did, oil spill, president barack, that awful, with that
  • Tri-grams: a great job, awful oil spill, barack obama did, did a great, great job with, job with that, obama did a, president barack obama, that awful oil, with that awful

Results:

  • Mono-grams: Not specific enough. Generally not used for “phrase extraction”; good for other things.
  • Bi-grams: Just right. Most often used.
  • Tri-grams: Very specific, misses important phrase. Used; gives very specific phrases.
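The n-gram breakdown for the two short phrases can be reproduced with a few lines of code. This is a minimal sketch: the `ngrams` helper is a name chosen for this example, and it splits on whitespace only, so it does no punctuation handling.

```python
from collections import Counter

def ngrams(text, n):
    """Return every run of n consecutive words, lowercased."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

phrases = ["crazy good", "stone cold crazy"]

for n in (1, 2, 3):
    # Count each n-gram across both phrases.
    counts = Counter(gram for phrase in phrases for gram in ngrams(phrase, n))
    print(n, sorted(counts.items()))
```

Note that nothing here looks at parts of speech: every run of words is returned, which is exactly why n-grams need help from stop words.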

N-grams and stop words

The biggest problem with n-grams as a phrase extraction technique is that the algorithm is promiscuous: it picks up every sequence of words, interesting or not.

Stop words let you make a list of terms to exclude from analysis.  Classic stop words are things like: a, an, the, of, for, and… In addition to these very common examples, each domain has a set of words that are statistically too common to be interesting.

With most stop lists, all of the words “crazy, good, stone, cold” would probably make it through. Unless, perhaps, you were working on data for the “Cold Stone Creamery” (for those not in the USA, that’s an ice cream parlo(u)r), and you’d stopped the words in your name.

Now, it’s important to note that if you “stopped” the phrase “cold stone creamery” that’s very different than stopping “cold”, “stone”, and “creamery”, as follows:

In the “cold stone creamery” case, if you got the phrase “cold as a fish”, that phrase would make it through and be decomposed into n-grams as appropriate.

In the “cold”, “stone”, and “creamery” case, if you got the phrase “cold as a fish”, that phrase would be chopped down to just “fish” (as most stop lists will also have the words “as” and “a” in them along with “cold”, “stone”, and “creamery”).

N-gram stop words generally stop entire phrases in which they appear. For example, the phrase “for example” would be stopped if the word “for” was in the stop list (which it generally would be). For the case “cold as a fish”, the whole phrase would be stopped out rather than reduced to “cold fish”, which is not the relevant phrase.
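The three stopping behaviors described above can be compared side by side. This is a sketch under assumptions: the two stop lists and all three function names are invented for the example, and the phrase-level version uses a plain substring replace, which is cruder than what a production stop list would do.

```python
WORD_STOPS = {"a", "as", "for", "cold", "stone", "creamery"}  # hypothetical word list
PHRASE_STOPS = {"cold stone creamery"}                        # hypothetical phrase list

def word_level_stop(text, stop_words):
    """Drop each stopped word and keep whatever is left."""
    return [w for w in text.lower().split() if w not in stop_words]

def phrase_level_stop(text, stop_phrases):
    """Remove only exact occurrences of a stopped phrase."""
    lowered = text.lower()
    for phrase in stop_phrases:
        lowered = lowered.replace(phrase, " ")
    return lowered.split()

def ngrams_without_stop_words(text, n, stop_words):
    """Build n-grams, then discard any n-gram containing a stopped word."""
    words = text.lower().split()
    grams = (words[i:i + n] for i in range(len(words) - n + 1))
    return [" ".join(g) for g in grams if not any(w in stop_words for w in g)]

print(phrase_level_stop("cold as a fish", PHRASE_STOPS))
print(word_level_stop("cold as a fish", WORD_STOPS))
print(ngrams_without_stop_words("cold as a fish", 2, WORD_STOPS))
```

With the phrase-level list, “cold as a fish” passes through untouched; with the word-level list only “fish” survives; and when whole bi-grams are stopped, nothing is left.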

Advantages of N-grams

  • You’ll catch everything that you don’t stop out, without any regard to parts of speech or anything else
  • Computationally simple, easy to conceptually understand

Disadvantages

  • Promiscuous: requires long list of stop words to be interesting
  • Simple count does not necessarily give an indication of “importance” to text or of its importance to an entity
  • Limited to words that appear in the text

 

Noun-Phrase Extraction

Noun phrases are part-of-speech patterns that include a noun. They can include whatever other parts of speech make sense, and can include multiple nouns.

As a consequence of English language ordering, a noun generally ends the phrase.  

Some common noun phrase patterns are:

  • Noun
  • Nouns
  • Adjectives Noun
  • Verb (Adjectives) Noun

Note that there is absolutely no reason why you can’t have verb phrases or whatever other part-of-speech patterns you care to use. However, nouns are most generally useful to understand the context of a conversation – if you want to know “what” is being discussed. Verbs help with understanding what those nouns are doing to each other, but it simplifies things considerably to just consider and work with noun phrases.

Noun phrase extraction takes into account part-of-speech types. Many stop words are stopped simply because they are a part of speech that is uninteresting from a statistical standpoint of understanding meaning. Because you’re being very specific about classes of words that are interesting, most common stop words are eliminated automatically. Stop lists can also be used with noun phrases, but it’s not quite as critical to use them as it is with n-grams.

Noun phrase extraction would provide both phrases (assuming appropriate patterns).

Input Phrases                      Extracted Phrases
crazy good, stone cold crazy       crazy good; stone cold crazy
President Barack Obama sentence    great job; awful oil spill
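To make the pattern matching concrete, here is a minimal sketch of the “Adjectives Noun” and “Nouns” patterns applied to the President Obama sentence. Several things are assumptions for the sake of the example: the part-of-speech tags are assigned by hand in Penn Treebank style (a real system would run a tagger), proper nouns are skipped on the assumption that named entity extraction handles them, and `noun_phrases` is a made-up helper name.

```python
# Hand-tagged tokens; the tags are an assumption, not the output of a real tagger.
TAGGED = [
    ("President", "NNP"), ("Barack", "NNP"), ("Obama", "NNP"),
    ("did", "VBD"), ("a", "DT"), ("great", "JJ"), ("job", "NN"),
    ("with", "IN"), ("that", "DT"), ("awful", "JJ"),
    ("oil", "NN"), ("spill", "NN"),
]

def noun_phrases(tagged):
    """Collect runs of zero or more adjectives followed by one or more common nouns."""
    phrases = []
    i = 0
    while i < len(tagged):
        j = i
        while j < len(tagged) and tagged[j][1] == "JJ":           # leading adjectives
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1] in ("NN", "NNS"):  # the nouns themselves
            k += 1
        if k > j:  # the run contains at least one noun
            phrases.append(" ".join(word for word, _tag in tagged[i:k]))
            i = k
        else:
            i += 1
    return phrases

print(noun_phrases(TAGGED))
```

Words such as “did”, “a”, “with”, and “that” never match the pattern, which is why far fewer stop words are needed than with n-grams.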

Consider this article:

Yahoo wants to make its Web e-mail service a place you never want to -- or more importantly -- have to leave to get your social fix.

The company on Wednesday is releasing an overhauled version of its Yahoo Mail Beta client that it says is twice as fast as the previous version, while managing to tack on new features like an integrated Twitter client, rich media previews and a more full-featured instant messaging client.

Yahoo says this speed boost should be especially noticeable to users outside the U.S. with latency issues, due mostly to the new version making use of the company's cloud computing technology. This means that if you're on a spotty connection, the app can adjust its behavior to keep pages from timing out, or becoming unresponsive.

Besides the speed and performance increase, which Yahoo says were the top users requests, the company has added a very robust Twitter client, which joins the existing social-sharing tools for Facebook and Yahoo. You can post to just Twitter, or any combination of the other two services, as well as see Twitter status updates in the update stream below. Yahoo has long had a way to slurp in Twitter feeds, but now you can do things like reply and retweet without leaving the page.

If asynchronous updates are not your thing, Yahoo has also tuned its integrated IM service to include some desktop software-like features, including window docking and tabbed conversations. This lets you keep a chat with several people running in one window while you go about with other e-mail tasks.

--Source: CNN

There are scads of noun phrases inside of here. Which ones are the important ones? You can simply frequency count them… But that doesn’t give any indication of whether these were lexically important or not. (Meaning, are they representative of the main topics of the content, as a human would read it, or are they tangential.)

Advantages of Noun Phrase Extraction

  • Restricts to phrases matching certain part of speech patterns, fewer stop words needed

Disadvantages

  • No way to tell if one noun phrase is more contextually relevant than another noun phrase
  • Limited to words that occur in the text

 

 

Themes: Best of All Worlds

In the opinion mining process, themes are noun phrases with contextual relevance scores. Themes are extracted exactly as described in noun phrase extraction. Once extracted, themes are then scored for contextual relevance using lexical chaining.


Theme Extraction and Scoring

First, potential themes are extracted based on part-of-speech patterns. Then the chains are scored, and themes that belong to the highest-scoring chain (a chain being a set of sentences chained together) get the highest scores. If there are fewer than four chains, the algorithm gracefully degrades to scoring purely by count.
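The general shape of chain-based scoring can be illustrated with a toy example. To be clear, this is not the Lexalytics algorithm: it assumes that two sentences belong to the same chain whenever they share a word, it uses the number of sentences in a chain as the chain’s score, and the sample sentences, theme lists, and function names are all invented. Only the “fall back to counting below four chains” rule comes from the description above.

```python
from collections import Counter

# Each entry: (content words in the sentence, candidate themes found in it).
SENTENCES = [
    ({"oil", "spill", "gulf"}, ["oil spill"]),
    ({"oil", "cleanup", "crews"}, ["cleanup crews"]),
    ({"spill", "coast"}, []),
    ({"weather", "sunny"}, ["sunny weather"]),
    ({"traffic", "jam"}, ["traffic jam"]),
    ({"election", "poll"}, ["election poll"]),
]

def build_chains(sentences):
    """Group sentences into chains: a sentence joins every chain it shares a word with."""
    chains = []  # each chain: {"words": set of words, "members": [sentence indexes]}
    for index, (words, _themes) in enumerate(sentences):
        merged = {"words": set(words), "members": [index]}
        untouched = []
        for chain in chains:
            if chain["words"] & set(words):
                merged["words"] |= chain["words"]
                merged["members"] += chain["members"]
            else:
                untouched.append(chain)
        chains = untouched + [merged]
    return chains

def score_themes(sentences, min_chains=4):
    """Score themes by the size of their largest chain, or by raw count
    when there are fewer than min_chains chains."""
    chains = build_chains(sentences)
    if len(chains) < min_chains:
        return dict(Counter(t for _words, themes in sentences for t in themes))
    scores = {}
    for chain in chains:
        for index in chain["members"]:
            for theme in sentences[index][1]:
                scores[theme] = max(scores.get(theme, 0), len(chain["members"]))
    return scores

print(score_themes(SENTENCES))
```

In this toy data the three oil-related sentences form one chain, so “oil spill” and “cleanup crews” outscore the themes that appear in isolated sentences, even though every theme occurs exactly once.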

“crazy good” and “stone cold crazy”

Noun phrase extraction would provide both phrases (assuming appropriate patterns). With theme extraction, their scores would be different depending on where they occurred in the text (e.g. if they were associated with a central theme or with a tangential thread).

President Barack Obama did a great job with that awful oil spill.

The above sentence yields the same noun phrases as with straight noun phrase extraction (great job, awful oil spill). However, the score for each would be highly dependent on where this sentence fit in the grand scheme of things. Meaning, if there were further sentences that referenced concepts relating to oil, that would boost the score of the “oil spill” theme.

Consider the same article as from Noun Phrase Extraction:

Yahoo wants to make its Web e-mail service a place you never want to -- or more importantly -- have to leave to get your social fix.

The company on Wednesday is releasing an overhauled version of its Yahoo Mail Beta client that it says is twice as fast as the previous version, while managing to tack on new features like an integrated Twitter client, rich media previews and a more full-featured instant messaging client.

Yahoo says this speed boost should be especially noticeable to users outside the U.S. with latency issues, due mostly to the new version making use of the company's cloud computing technology. This means that if you're on a spotty connection, the app can adjust its behavior to keep pages from timing out, or becoming unresponsive.

Besides the speed and performance increase, which Yahoo says were the top users requests, the company has added a very robust Twitter client, which joins the existing social-sharing tools for Facebook and Yahoo. You can post to just Twitter, or any combination of the other two services, as well as see Twitter status updates in the update stream below. Yahoo has long had a way to slurp in Twitter feeds, but now you can do things like reply and retweet without leaving the page.

If asynchronous updates are not your thing, Yahoo has also tuned its integrated IM service to include some desktop software-like features, including window docking and tabbed conversations. This lets you keep a chat with several people running in one window while you go about with other e-mail tasks.

--Source: CNN

In this case, the top 5 themes are:

Theme                         Score
Cloud computing technology    4.11
Including window docking      2.976
Mail service                  2.672
Top users requests            2.66
Rich media previews           2.635

You can see that those themes do a reasonable job of conveying the actual context of the article. The addition of contextual scoring information is hugely useful in determining what’s really important in the text, and is useful to compare across many articles across periods of time (to see what’s emerging, etc).

Specific to our text analytics software at Lexalytics, themes also carry the advantage of being scored for sentiment. This is particularly important when considering a case like the President Obama sentence, where it’s important to be able to distinguish between the positive perception of the President and the negative perception of the theme “oil spill”.

Advantages to Theme Extraction and Scoring

  • Restricts to phrases matching certain part-of-speech patterns, separating more wheat from the chaff
  • Scored based on contextual importance
  • Sentiment analysis scores for themes

Disadvantages

  • Limited to words in the text (true for all algorithms)

 

 

Salience Facets: A New Way

Get more out of your text analysis software with Salience Facets from Lexalytics.

These are not "search facets", even though they could be used as search facets - they provide more information than clustering-based search facets. 

Salience Facets represent a completely new way to analyze social media and perform text mining.

Salience Five is the first text analytics tool to be able to directly track how real people are describing their real experiences, without pre-configuring a large taxonomy.

Take the sentence “My bed was hard.” No other text analytics product can actually take a collection of hotel reviews and automatically extract “bed” as being an important aspect (or, as we call it, “facet”). Facets have “attributes”, so you can quickly see that there were 10 people who said it was hard, and three people who said it was uncomfortable. Not only do you know that it was negative (from the sentiment analysis), you know why.  

Facets are intended to handle cases that aren’t handled well by themes. Themes present the best combination of intelligence and sentiment scoring for noun phrases, but sometimes you don’t have a good noun phrase to work with, but there’s still meaning and intent to be extracted.

Facets rely on “Subject Verb Object” (SVO) parsing.  So, in the case above, “Bed” is the subject, “was” is the verb, and “hard” is the object. In our case, “Bed” is the facet and “hard” is the attribute.

Because of the nature of SVO parsing, we require a collection of content. Any given document is going to have lots of SVO sentences, so we only bubble up to the top those facets or attributes that occur at least twice.
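A rough feel for facets and attributes can be had from a toy extractor. This is a sketch under heavy assumptions: a single regular expression for “X was/is/were/are Y” stands in for real SVO parsing, the reviews are invented, `extract_facets` is a made-up name, and the “at least twice” threshold is applied to the facet’s total count across the collection.

```python
import re
from collections import Counter, defaultdict

# Toy pattern only; genuine SVO parsing needs a full syntactic parser.
PATTERN = re.compile(r"\b(\w+) (?:was|is|were|are) (?:very |really )?(\w+)\b", re.IGNORECASE)

def extract_facets(documents, min_count=2):
    """Count (facet, attribute) pairs across a collection and keep facets seen often enough."""
    pairs = defaultdict(Counter)
    for document in documents:
        for facet, attribute in PATTERN.findall(document):
            pairs[facet.lower()][attribute.lower()] += 1
    return {
        facet: dict(attributes)
        for facet, attributes in pairs.items()
        if sum(attributes.values()) >= min_count
    }

reviews = [
    "My bed was hard.",
    "The bed was hard and the room was small.",
    "Our bed was uncomfortable.",
    "The pool is great.",
]

print(extract_facets(reviews))
```

Here “bed” surfaces as a facet because it is described three times across the collection, while “room” and “pool” are each mentioned only once and stay below the threshold.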

As such, it should be noted that Salience Facets only work with collections processing in Salience Five. Please see this URL for a discussion of collections processing.

Here’s a specific example, based on a collection of 165 reviews of a cruise liner (there were other facets, but we picked 2 to show you):

Facet    Positive Documents    Neutral Documents    Negative Documents
Ship     45                    127                  14
Food     36                    44                   0

 

Top 5 Attributes for “Ship”

Attribute    Count
Beautiful    22
Clean        8
New          8
Huge         6
Nice         6

 

Top 5 Attributes for “Food”

Attribute    Count
Excellent    14
Good         12
Great        10
Best         4
Fabulous     4

 

So, yes, it’s a new ship and they’re doing very well with it.

One interesting feature that we’ve added to our Facet processing is the ability to combine Facets based on semantic similarity via our Wikipedia™ based Concept Matrix. We combine attributes based on word stem, and Facets on the semantic distance.

Consider the following example:

[Figure: facet-rollup]

You can see how Enterprise and Company are combined into a single facet – which gives richer information by combining the attributes from both.
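The roll-up idea can be sketched as follows. Everything here is a stand-in: `FACET_GROUPS` is a hypothetical hand-written lookup playing the role of a semantic-similarity measure (it is not the Concept Matrix), `stem` is a deliberately crude suffix stripper rather than a real stemmer, and the facet counts are invented.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for semantic similarity: facet -> group label.
FACET_GROUPS = {"enterprise": "company", "company": "company"}

def stem(word):
    """Very crude stemmer: strip one of a few common English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def roll_up(facets):
    """Merge facets that share a group, and merge attributes that share a stem."""
    merged = defaultdict(Counter)
    for facet, attributes in facets.items():
        group = FACET_GROUPS.get(facet, facet)
        for attribute, count in attributes.items():
            merged[group][stem(attribute)] += count
    return {group: dict(attributes) for group, attributes in merged.items()}

facets = {
    "enterprise": {"growing": 2, "large": 1},
    "company": {"grows": 1, "large": 3},
}

print(roll_up(facets))
```

The two facets collapse into one, and “growing” and “grows” are counted together, so the combined facet carries the attributes of both.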

Determining Context: Summary and Futures

Theme extraction and scoring provides a highly valuable combination: noun phrases that are scored for context. The theme extraction algorithm degrades nicely with shorter content. There is nothing to prevent you from running multiple algorithms on your text; Lexalytics supports both n-gram and theme extraction.

We explicitly do not support clustering – we are a stateless engine that works on a single piece of content at a time. Clustering is most useful for making temporary groupings from large lists of search results, and not as useful for ongoing trending and analysis – where the clusters change depending on your data set.

The primary direction for future development is to reach beyond the boundaries of a single piece of text. Right now, all themes are extracted exactly from words in the text. The next big challenge is to map these themes to higher-level concepts that can then be easily “rolled up” and compared across different texts that use different words to denote the same concepts.

In the meantime, themes still provide an excellent view of the context of conversations, and are useful on all lengths of content – from tweets up to hundred-page secondary research reports.