Excerpted from
Satnam Alag
MEAP Release: February 2008
Softbound print: August 2008 (est.) | 425 pages
ISBN: 1933988312
This article is taken from the book Collective Intelligence in Action. This segment shows an
example of how intelligence can be extracted from text.
Text processing involves a number of steps, including creating tokens from the text, normalizing the text,
removing common words that are not helpful, stemming the words to their roots, injecting synonyms, and
detecting phrases.
At this stage it is helpful to go through an example of how the term vector can be computed by analyzing text.
The intent of this section is to demonstrate the concepts while keeping things simple, so we will develop simple
classes for this example.
Remember, the typical steps involved in text analysis are shown in Figure 4.8:
1. Tokenization: parse the text to generate terms. Sophisticated analyzers can also extract phrases from the
text.
2. Normalization: convert the terms into a normalized form, for example, lowercase.
3. Eliminate stop words: remove common words that are not helpful.
4. Stemming: convert the terms into their stemmed form, i.e., remove plurals.
[Figure 4.8 shows the flow: Tokenization → Normalize → Eliminate Stop Words → Stemming]
For Source Code, Sample Chapters, the Author Forum and other resources, go to
http://www.manning.com/alag
Figure 4.8: Typical steps involved in analyzing text
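The four steps in Figure 4.8 can be sketched end to end in a few lines of Java. This is a minimal illustration, not the book's code: the stop word list is abbreviated, and the stem method is a crude plural-stripper standing in for a real stemmer.

```java
import java.util.*;

public class TextAnalysisSketch {
    // an abbreviated stop word list, purely for illustration
    private static final Set<String> STOP_WORDS = new HashSet<String>(
            Arrays.asList("a", "all", "and", "is", "the", "their", "to"));

    // tokenize, normalize, eliminate stop words, and stem
    public static List<String> analyze(String text) {
        List<String> terms = new ArrayList<String>();
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            // normalize: lowercase and strip the punctuation "," and "."
            String token = tokenizer.nextToken().toLowerCase().replaceAll("[.,]", "");
            if (token.length() == 0 || STOP_WORDS.contains(token)) {
                continue; // eliminate stop words and empty tokens
            }
            terms.add(stem(token));
        }
        return terms;
    }

    // crude plural stemming; a real analyzer would use a Porter stemmer
    private static String stem(String token) {
        if (token.endsWith("s") && token.length() > 3) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }
}
```

Note how "Users" and "users." both normalize and stem to the single term "user".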
In this section we will first set up the example that we will use. We will begin with a simple but naïve way to
analyze the text: simply tokenizing it, analyzing the title and body, and taking term frequency into account.
Next, we will show the results of the analysis after eliminating the stop words, followed by the effect of stemming.
Lastly, we will show the effect of detecting phrases on the analysis.
Body: “Web2.0 is all about connecting users to users, inviting users to participate and applying their collective
intelligence to improve the application. Collective intelligence enhances the user experience”
There are a few interesting things to note about the blog entry.
The blog entry discusses collective intelligence and Web2.0, and is pertinent to how they affect users.
We have talked about metadata and the term vector; the code for this is fully developed in Chapter 8. So as
not to confuse things, for this example simply think of metadata as being represented by an implementation of the
interface MetaDataVector, as shown in listing 4.1.
package com.alag.ci;

import java.util.List;

public interface MetaDataVector {
    // Method names here are illustrative; the full interface is developed in Chapter 8.
    // TagMagnitude pairs a term with its relative weight.
    List<TagMagnitude> getTagMagnitudes();

    // combine this vector with another MetaDataVector
    MetaDataVector add(MetaDataVector other);
}
We have two methods: the first for getting the terms and their weights, and the second to add another
MetaDataVector. Further, assume that we have a way to visualize this MetaDataVector; after all, it consists of
tags or terms and their relative weights.¹
¹ If you really want to see the code for the implementation of the MetaDataVector, jump ahead to Chapter 8 or
download the available code.
Let us define an interface MetaDataExtractor for the algorithm that will extract metadata – in the form of
keywords or tags – by analyzing the text. This is shown in listing 4.2.
package com.alag.ci.textanalysis;

import com.alag.ci.MetaDataVector;

public interface MetaDataExtractor {
    // analyze the title and body to generate the term vector for the text
    MetaDataVector extractMetaData(String title, String text);
}
The interface has only one method extractMetaData that analyzes the title and body of text to generate a
MetaDataVector. The MetaDataVector in essence is the term vector for the text being analyzed.
Figure 4.9 shows the hierarchy of increasingly complex text analyzers that we will use in the next few
sections. First, we will use a simple analyzer to create tokens from the text. Next, we will remove the common
words. This will be followed by taking care of plurals. Lastly, we will detect multi-term phrases.
Figure 4.9: The hierarchy of analyzers used to create MetaData from text
With this background, we are now ready to have some fun and work through some code to analyze our blog
entry!
Naïve Analysis
Let’s begin by simply tokenizing the text, normalizing it, and getting the frequency count associated with each
term. We will also analyze the title and body separately and then combine the information from each. For this
we use SimpleMetaDataExtractor, a naïve implementation of our analyzer, whose implementation is
shown in listing 4.3.
package com.alag.ci.textanalysis.impl;

import java.util.*;
import com.alag.ci.*;
import com.alag.ci.impl.*;
import com.alag.ci.textanalysis.MetaDataExtractor;

public class SimpleMetaDataExtractor implements MetaDataExtractor {
    // map from token to its unique id, and the next id to hand out
    private Map<String, Long> idMap = null;
    private Long currentId = null;

    public SimpleMetaDataExtractor() {
        this.idMap = new HashMap<String, Long>();
        this.currentId = new Long(0);
    }
Since the title provides valuable information, as a heuristic let us say that the resulting MetaDataVector is a
combination of the MetaDataVectors for the title and the body. Note that as tokens or tags are extracted from the
text, we need to provide them with a unique id, and the method getTokenId takes care of this for our example. In
your application, you will probably get it from the tags table.
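As a sketch of how such ids might be handed out in memory, consider the following. The class and field names here are illustrative, not the book's implementation; a real application would look the id up in the tags table instead.

```java
import java.util.*;

public class TokenIdRegistry {
    private final Map<String, Long> idMap = new HashMap<String, Long>();
    private long currentId = 0L;

    // return a stable unique id for a token, minting a new one on first sight
    public Long getTokenId(String token) {
        Long id = idMap.get(token);
        if (id == null) {
            id = Long.valueOf(++currentId);
            idMap.put(token, id);
        }
        return id;
    }
}
```

Asking for the same token twice returns the same id; a new token gets the next id.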
Here, we create a MetaDataVector for the title and the body and then simply combine the two.
The remaining piece of code, shown in listing 4.3, is a lot more interesting.
    protected String normalizeToken(String token) {
        // lowercase the token and remove the punctuation "," and "."
        String normalizedToken = token.toLowerCase().replaceAll("[.,]", "");
        return normalizedToken;
    }
}
Here, we use a simple StringTokenizer to break the text into individual words.
We want to normalize the tokens so that they are case insensitive, i.e., “user” and “User” are the same word
for us, and also to remove the punctuation “,” and “.”.
The normalizeToken method simply lowercases the tokens and removes the punctuation.
We may not want to accept all the tokens, so we have a method acceptToken to decide whether a token is to be
accepted.
if (acceptToken(token)) {
The logic behind the method is fairly simple: find the tokens, normalize them, see if they are to be accepted,
and then keep a count of how many times they occur. The title and body are equally weighted to create the
resulting MetaDataVector. With this, we have met our goal of creating a set of terms and their relative weights to
represent the metadata associated with the content.
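That tokenize-normalize-accept-count logic can be sketched as follows, assuming the naive extractor's acceptToken accepts every non-empty token (the class and method names here are illustrative):

```java
import java.util.*;

public class TermFrequencySketch {
    // find the tokens, normalize them, filter them, and count occurrences
    public static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            // lowercase and strip the punctuation "," and "."
            String token = tokenizer.nextToken().toLowerCase().replaceAll("[.,]", "");
            if (acceptToken(token)) {
                Integer count = counts.get(token);
                counts.put(token, count == null ? 1 : count + 1);
            }
        }
        return counts;
    }

    // the naive extractor accepts every non-empty token
    private static boolean acceptToken(String token) {
        return token.length() > 0;
    }
}
```

Running this over the blog body would produce exactly the kind of raw counts visualized in the tag clouds below, stop words included.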
A tag cloud is a very useful way to visualize the output of the algorithm. First, let us look at the title, as shown
in Figure 4.10. The algorithm tokenizes the title and extracts four equally weighted terms: “and”, “collective”,
“intelligence”, and “web2.0”. Note that “and” appears as one of the four terms, and that “collective” and “intelligence”
are two separate terms.
Figure 4.10: The tag cloud for the title – it consists of four terms
Similarly, the tag cloud for the body of the text is shown in Figure 4.11. Notice that words such as “the” and “to”
occur frequently, and that “user” and “users” are treated as separate terms. There are a total of 20 terms in the body.
Figure 4.11: The tag cloud for the body of the text
Combining the vectors for both the title and the body we get the resulting MetaDataVector whose tag cloud
is shown in Figure 4.12.
Figure 4.12: The resulting tag cloud obtained by combining the title and the body
The three terms “collective”, “intelligence”, and “web2.0” stand out. However, there are quite a few noise
words, such as “all”, “and”, “is”, “the”, and “to”, that occur so frequently in the English language that they don’t add
much value. Let us next enhance our implementation by eliminating these terms.
Listing 4.4 Implementation of SimpleStopWordMetaDataExtractor
package com.alag.ci.textanalysis.impl;

import java.util.*;

public class SimpleStopWordMetaDataExtractor extends SimpleMetaDataExtractor {
    // a short illustrative stop word list; a real application will use a much longer one
    private static final String[] stopWords =
            {"a", "all", "and", "is", "of", "the", "their", "to"};
    private Map<String, String> stopWordsMap = null;

    public SimpleStopWordMetaDataExtractor() {
        this.stopWordsMap = new HashMap<String, String>();
        for (String s : stopWords) {
            this.stopWordsMap.put(s, s);
        }
    }
This class has a dictionary of terms that are to be ignored – in our case this is a simple list; for your application
this list will be a lot longer.
The acceptToken method is overridden so that it does not accept any token that is in the stop word list.
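The overridden check can be sketched as follows, assuming an abbreviated stop word dictionary (the standalone class here is illustrative; in the extractor the lookup would go against stopWordsMap):

```java
import java.util.*;

public class StopWordFilterSketch {
    // an abbreviated stop word dictionary for illustration
    private static final Set<String> STOP_WORDS = new HashSet<String>(
            Arrays.asList("a", "all", "and", "is", "the", "their", "to"));

    // accept a token only if it is non-empty and not a stop word
    public static boolean acceptToken(String token) {
        return token != null && token.length() > 0 && !STOP_WORDS.contains(token);
    }
}
```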
Figure 4.13 shows the tag cloud after removing the stop words – we now have 14 terms, down from the original
20. The terms “collective”, “intelligence”, and “web2.0” stand out. But “user” and “users” are still fragmented
and treated as separate terms.
Figure 4.13: The Tag Cloud after removing the stop words
To combine “user” and “users” as one term we need to stem the words.
Stemming
Stemming is the process of converting words to their stemmed form. There are fairly complex algorithms for doing
this, Porter stemming being the most commonly used.
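The Porter algorithm itself is beyond this excerpt, but a tiny suffix-stripping stemmer conveys the idea. This is an illustrative stand-in, not the Porter algorithm:

```java
public class PluralStemmerSketch {
    // strip common plural endings; real code should use a Porter stemmer
    public static String stem(String term) {
        if (term.endsWith("ies") && term.length() > 4) {
            return term.substring(0, term.length() - 3) + "y";
        }
        if (term.endsWith("s") && !term.endsWith("ss") && term.length() > 3) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }
}
```

This is enough to merge "users" into "user", which is all our example needs.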
There is only one plural in our example: “users”. For now we will enhance our implementation with
SimpleStopWordStemmerMetaDataExtractor, whose code is shown in listing 4.5.
Here, we override the normalizeToken method. First, it checks to make sure that the token is not a stop
word.
protected String normalizeToken(String token) {
    if (acceptToken(token)) {
        token = stem(super.normalizeToken(token)); // stem() is an assumed helper standing in for a stemmer such as Porter's
    }
    return token;
}
Figure 4.14 shows the tag cloud obtained by stemming the terms. The algorithm merges “user” and “users”
into one term and bubbles “user” up.
We now have four terms – “collective”, “intelligence”, “user”, and “web2.0” – to describe the blog entry. But
“collective intelligence” is really one phrase, so let us enhance our implementation to detect this term.
Detecting Phrases
“Collective intelligence” is the only two-term phrase that we are interested in. For this we will implement
SimpleBiTermStopWordStemmerMetaDataExtractor, the code for which is shown in listing 4.6.
import java.util.*;
import com.alag.ci.MetaDataVector;
#1 Store the normalized tokens in the order they appear
#2 Take two tokens at a time and check if they are valid
#3 Phrases are tested for validity against a phrase dictionary
Here, we overwrite the getMetaDataVector method. We create a list of valid tokens and store them in a
list allTokens.
Next, the following code combines two tokens to check if they are valid tokens.
In our case, there is only one valid phrase “collective intelligence” and the check is done in the method.
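The three annotated steps can be sketched as follows, assuming a phrase dictionary containing only "collective intelligence" (the class and method names are illustrative):

```java
import java.util.*;

public class BiTermPhraseSketch {
    // the phrase dictionary; in our example it holds a single valid phrase
    private static final Set<String> PHRASE_DICTIONARY =
            new HashSet<String>(Collections.singleton("collective intelligence"));

    // take two normalized tokens at a time and append any valid phrase
    public static List<String> withPhrases(List<String> tokens) {
        List<String> result = new ArrayList<String>(tokens);
        for (int i = 0; i < tokens.size() - 1; i++) {
            String candidate = tokens.get(i) + " " + tokens.get(i + 1);
            if (PHRASE_DICTIONARY.contains(candidate)) {
                result.add(candidate);
            }
        }
        return result;
    }
}
```

Applied to the stemmed title tokens, this keeps the single terms and adds "collective intelligence" as an extra term.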
Figure 4.15 shows the tag cloud for the title of the blog after using our new analyzer. As desired, there are four
terms: “collective”, “collective intelligence”, “intelligence”, and “web2.0”.
Figure 4.15: Tag cloud for the title after using the bi-term analyzer
The combined tag cloud for the blog now contains 14 terms, as shown in Figure 4.16. Five tags stand
out: “collective”, “collective intelligence”, “intelligence”, “user”, and “web2.0”.
Figure 4.16: Tag cloud for the blog after using a bi-term analyzer
Using phrases in the term vector can help improve finding other similar content. For example, if we had
another article, “Intelligence in a Child”, with tokens “intelligence” and “child”, there would be a match on the term
“intelligence”. However, if our analyzer were intelligent enough to extract only “collective intelligence”, without the
separate terms “collective” and “intelligence”, there would be no match between the two pieces of content.
Hopefully, this gives you a good overview of how text can be analyzed automatically to extract relevant
keywords or tags and build a MetaDataVector.
Now every Item in your application has an associated MetaDataVector. As Users interact on your site, you
can use the MetaDataVectors associated with the Items to develop a profile for each user. Finding items similar to
a given item amounts to finding items that have similar MetaDataVectors.
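The similarity computation itself is developed later in the book; as a simple illustration of what "similar MetaDataVectors" can mean, here is a cosine similarity over sparse term-weight maps. The map representation is an assumption of this sketch, not the Chapter 8 API.

```java
import java.util.*;

public class VectorSimilaritySketch {
    // cosine similarity between two sparse term-weight vectors
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w; // only shared terms contribute
            }
            normA += e.getValue() * e.getValue();
        }
        for (double w : b.values()) {
            normB += w * w;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty vector is similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Two items whose term vectors share heavily weighted terms such as "collective intelligence" score close to 1; items with no terms in common score 0.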