
Extracting Intelligence Step by Step

Excerpted from

Collective Intelligence in Action


EARLY ACCESS EDITION

Satnam Alag
MEAP Release: February 2008
Softbound print: August 2008 (est.) | 425 pages
ISBN: 1933988312

This article is taken from the book Collective Intelligence in Action. This segment shows an
example of how intelligence can be extracted from text.

Text processing involves a number of steps, including creating tokens from the text, normalizing the text,
removing common words that are not helpful, stemming words to their roots, injecting synonyms, and
detecting phrases.

At this stage it is helpful to go through an example of how the term vector can be computed by analyzing text.
The intent of this section is to demonstrate the concepts and to keep things simple; therefore, we will develop
simple classes for this example.

Remember, the typical steps involved in text analysis are shown in Figure 4.8:

1. Tokenization: parse the text to generate terms. Sophisticated analyzers can also extract phrases from the
text.

2. Normalization: convert the terms into lowercase.

3. Eliminating stop words: eliminate terms that appear very often and add little value.

4. Stemming: convert the terms into their stemmed form, i.e., remove plurals.

Figure 4.8: Typical steps involved in analyzing text (tokenization, normalization, stop word elimination, stemming)

In this section we will first set up the example that we will use. We will begin with a simple but naïve way to
analyze the text: simply tokenizing it, analyzing the title and body, and taking term frequency into account.
Next, we will show the results of eliminating the stop words, followed by the effect of stemming.
Lastly, we will show the effect of detecting phrases on the analysis.

Setting up the Example


Let us assume that a reader has posted the following blog entry:

Title: “Collective Intelligence and Web2.0”

Body: “Web2.0 is all about connecting users to users, inviting users to participate and applying their collective
intelligence to improve the application. Collective intelligence enhances the user experience.”

There are a few interesting things to note about the blog entry:
 The blog entry discusses collective intelligence and Web2.0, and is pertinent to how they affect users

 Notice the number of occurrences of “user”: “users”, “users,”, “user”

 The title provides valuable information about the content

We have talked about metadata and the term vector; the code for this is fully developed in Chapter 8. So as
not to confuse things, for this example simply think of metadata as being represented by an implementation of the
interface MetaDataVector, as shown in listing 4.1.

Listing 4.1 The MetaDataVector Interface

package com.alag.ci;

import java.util.List;

public interface MetaDataVector {

    public List<TagMagnitude> getTagMetaDataMagnitude(); <#1>
    public MetaDataVector add(MetaDataVector other); <#2>
}

#1 gets the sorted list of non-zero terms and their weights
#2 gives the result from adding another MetaDataVector

We have two methods: the first for getting the terms and their weights, and the second to add another
MetaDataVector. Further, assume that we have a way to visualize this MetaDataVector; after all, it consists of
tags or terms and their relative weights.¹

¹ If you really want to see the code for the implementation of the MetaDataVector, jump ahead to Chapter 8 or
download the available code.
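
The interface above references a TagMagnitude type, which is also developed in Chapter 8. As a rough mental
model, you can think of it along the lines of the following minimal sketch; the accessor names here are
illustrative assumptions, not the book's actual API:

package com.alag.ci;

// A minimal sketch of what TagMagnitude might look like. The real
// interface is developed in Chapter 8; the method names below are
// illustrative assumptions only.
public interface TagMagnitude {

    // the term (tag) this entry represents
    public String getTag();

    // the relative weight of the term in the term vector
    public double getMagnitude();
}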

Let us define an interface MetaDataExtractor for the algorithm that will extract metadata, in the form of
keywords or tags, by analyzing the text. This is shown in listing 4.2.

Listing 4.2 The MetaDataExtractor Interface


package com.alag.ci.textanalysis;

import com.alag.ci.MetaDataVector;

public interface MetaDataExtractor {

    public MetaDataVector extractMetaData(String title, String body);
}

The interface has only one method, extractMetaData, which analyzes the title and body of the text to generate a
MetaDataVector. The MetaDataVector is, in essence, the term vector for the text being analyzed.

Figure 4.9 shows the hierarchy of increasingly complex text analyzers that we will use in the next few
sections. First, we will use a simple analyzer to create tokens from the text. Next, we will remove the common
words. This will be followed by taking care of plurals. Lastly, we will detect multi-term phrases.

Figure 4.9: The hierarchy of analyzers used to create MetaData from text

With this background, we are now ready to have some fun and work through some code to analyze our blog
entry!

Naïve Analysis
Let’s begin by simply tokenizing the text, normalizing it, and getting the frequency count associated with each
term. We will also analyze the title and body separately and then combine the information from each. For this
we use SimpleMetaDataExtractor, a naïve implementation of our analyzer, whose implementation is
shown in listing 4.3.

Listing 4.3 Implementation of the SimpleMetaDataExtractor


package com.alag.ci.textanalysis.impl;

import java.util.*;
import com.alag.ci.*;
import com.alag.ci.impl.*;
import com.alag.ci.textanalysis.MetaDataExtractor;

public class SimpleMetaDataExtractor implements MetaDataExtractor {

    private Map<String, Long> idMap = null; <#1>
    private Long currentId = null; <#2>

    public SimpleMetaDataExtractor() {
        this.idMap = new HashMap<String,Long>();
        this.currentId = new Long(0);
    }

    public MetaDataVector extractMetaData(String title, String body) {
        MetaDataVector titleMDV = getMetaDataVector(title); <#3>
        MetaDataVector bodyMDV = getMetaDataVector(body);
        return titleMDV.add(bodyMDV);
    }

    private Long getTokenId(String token) { <#4>
        Long id = this.idMap.get(token);
        if (id == null) {
            id = this.currentId++;
            this.idMap.put(token, id);
        }
        return id;
    }

#1 keeps a Map of all the text/tags that are found
#2 variable used to generate unique ids for tokens found
#3 uses a heuristic of placing equal weight on title and body
#4 generates unique ids for text/tags that are found

Since the title provides valuable information, as a heuristic let us say that the resulting MetaDataVector is a
combination of the MetaDataVector for the title and the one for the body. Note that as tokens or tags are extracted
from the text, we need to provide them with a unique id; the method getTokenId takes care of this for our example.
In your application, you will probably get the id from the tags table.

The following code extracts metadata for the article:

MetaDataVector titleMDV = getMetaDataVector(title);
MetaDataVector bodyMDV = getMetaDataVector(body);
return titleMDV.add(bodyMDV);

Here, we create a MetaDataVector for the title and one for the body, and then simply combine the two.
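
Incidentally, if you wanted the title to count for more than the body, one simple variation that stays within the
MetaDataVector interface of listing 4.1 is to add the title vector in more than once. This is only a sketch, and it
assumes that add returns a new combined vector rather than mutating its receiver:

// A sketch of weighting the title more heavily by adding its vector
// twice. Assumes add returns a new combined vector; this is an
// illustration, not the book's chosen heuristic.
public MetaDataVector extractMetaDataTitleWeighted(String title, String body) {
    MetaDataVector titleMDV = getMetaDataVector(title);
    MetaDataVector bodyMDV = getMetaDataVector(body);
    return titleMDV.add(titleMDV).add(bodyMDV);
}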

As new tokens are extracted, a unique id is assigned to them by the following code:

private Long getTokenId(String token) {
    Long id = this.idMap.get(token);
    if (id == null) {
        id = this.currentId++;
        this.idMap.put(token, id);
    }
    return id;
}

The remaining piece of code, shown in listing 4.3, is a lot more interesting.

Listing 4.3 Continuing with the implementation of SimpleMetaDataExtractor


protected MetaDataVector getMetaDataVector(String text) {
    Map<String,Integer> keywordMap = new HashMap<String,Integer>();
    StringTokenizer st = new StringTokenizer(text); <#1>
    while (st.hasMoreTokens()) {
        String token = normalizeToken(st.nextToken()); <#2>
        if (acceptToken(token)) { <#3>
            Integer count = keywordMap.get(token);
            if (count == null) {
                count = new Integer(0);
            }
            count++;
            keywordMap.put(token, count); <#4>
        }
    }
    MetaDataVector mdv = createMetaDataVector(keywordMap); <#5>
    return mdv;
}

protected boolean acceptToken(String token) { <#6>
    return true;
}

protected String normalizeToken(String token) { <#7>
    String normalizedToken = token.toLowerCase().trim();
    if ( (normalizedToken.endsWith(".")) || (normalizedToken.endsWith(",")) ) {
        int size = normalizedToken.length();
        normalizedToken = normalizedToken.substring(0, size - 1);
    }
    return normalizedToken;
}
}

#1 uses a simple StringTokenizer (space delimited)
#2 normalizes the token
#3 should we accept this token as a valid token?
#4 keeps a frequency count
#5 creates a MetaDataVector
#6 method to decide whether a token is to be accepted
#7 converts to lowercase and removes trailing punctuation

Here, we use a simple StringTokenizer to break the text into individual tokens.

StringTokenizer st = new StringTokenizer(text);
while (st.hasMoreTokens()) {

We want to normalize the tokens so that they are case insensitive, i.e., “user” and “User” are the same word
for us, and also to remove the trailing punctuation “,” and “.”.

String token = normalizeToken(st.nextToken());

The normalizeToken method simply lowercases the token and removes trailing punctuation:

protected String normalizeToken(String token) {
    String normalizedToken = token.toLowerCase().trim();
    if ( (normalizedToken.endsWith(".")) || (normalizedToken.endsWith(",")) ) {
        int size = normalizedToken.length();
        normalizedToken = normalizedToken.substring(0, size - 1);
    }
    return normalizedToken;
}

We may not want to accept all the tokens, so we have a method acceptToken to decide whether a token is to be
accepted:
if (acceptToken(token)) {

All tokens are accepted in this implementation.

The logic behind the method is fairly simple: find the tokens, normalize them, see if they are to be accepted,
and then keep a count of how many times they occur. The title and body are equally weighted to create the
resulting MetaDataVector. With this we have met our goal of creating a set of terms and their relative weights to
represent the metadata associated with the content.
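
To make the flow concrete, here is a minimal sketch of how the extractor might be exercised on our blog entry.
It assumes the MetaDataVector implementation from Chapter 8 is available, and it uses the illustrative
TagMagnitude accessors sketched earlier, which are assumptions rather than the book's confirmed API:

package com.alag.ci.textanalysis.impl;

import com.alag.ci.MetaDataVector;
import com.alag.ci.TagMagnitude;
import com.alag.ci.textanalysis.MetaDataExtractor;

// A sketch of driving the naive extractor over the sample blog entry.
// Assumes the Chapter 8 implementations are available on the classpath.
public class SimpleMetaDataExtractorExample {

    public static void main(String[] args) {
        String title = "Collective Intelligence and Web2.0";
        String body = "Web2.0 is all about connecting users to users, " +
            "inviting users to participate and applying their collective " +
            "intelligence to improve the application. Collective " +
            "intelligence enhances the user experience";

        MetaDataExtractor extractor = new SimpleMetaDataExtractor();
        MetaDataVector mdv = extractor.extractMetaData(title, body);

        // print each term and its relative weight
        for (TagMagnitude tm : mdv.getTagMetaDataMagnitude()) {
            System.out.println(tm.getTag() + ": " + tm.getMagnitude());
        }
    }
}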

A tag cloud is a very useful way to visualize the output of the algorithm. First, let us look at the title, as shown
in Figure 4.10. The algorithm tokenizes the title and extracts four equally weighted terms: “and”, “collective”,
“intelligence” and “web2.0”. Note that “and” appears as one of the four terms and that “collective” and
“intelligence” are two separate terms.

Figure 4.10: The tag cloud for the title – it consists of four terms

Similarly, the tag cloud for the body of the text is shown in Figure 4.11. Notice that words such as “the” and “to”
occur frequently, and that “user” and “users” are treated as separate terms. There are a total of 20 terms in the body.

Figure 4.11: The tag cloud for the body of the text

Combining the vectors for both the title and the body, we get the resulting MetaDataVector, whose tag cloud
is shown in Figure 4.12.

Figure 4.12: The resulting tag cloud obtained by combining the title and the body

The three terms “collective”, “intelligence”, and “web2.0” stand out. However, there are quite a few noise
words, such as “all”, “and”, “is”, “the”, and “to”, that occur so frequently in the English language that they don’t add
much value. Let us next enhance our implementation by eliminating these terms.

Removing Common Words


Commonly occurring terms are also called stop terms (see Section 2.2) and can be specific to the language and
domain. We will implement SimpleStopWordMetaDataExtractor to remove these stop words. The code for this
is shown in listing 4.4.

Listing 4.4 Implementation of SimpleStopWordMetaDataExtractor

package com.alag.ci.textanalysis.impl;

import java.util.*;

public class SimpleStopWordMetaDataExtractor extends SimpleMetaDataExtractor {

    private static final String[] stopWords =
        {"and","of","the","to","is","their","can","all", ""}; <#1>
    private Map<String,String> stopWordsMap = null;

    public SimpleStopWordMetaDataExtractor() {
        this.stopWordsMap = new HashMap<String,String>();
        for (String s: stopWords) {
            this.stopWordsMap.put(s, s);
        }
    }

    protected boolean acceptToken(String token) { <#2>
        return !this.stopWordsMap.containsKey(token);
    }
}

#1 dictionary of stop words
#2 don’t accept the token if it is a stop word

This class has a dictionary of terms that are to be ignored. In our case this is a simple list; for your application,
the list will be a lot longer.

private static final String[] stopWords =
    {"and","of","the","to","is","their","can","all", ""};

The acceptToken method is overridden so as not to accept any tokens that are in the stop word list:

protected boolean acceptToken(String token) {
    return !this.stopWordsMap.containsKey(token);
}

Figure 4.13 shows the tag cloud after removing the stop words; we now have 14 terms, down from the original 20.
The terms “collective”, “intelligence”, and “web2.0” stand out. But “user” and “users” are still fragmented
and treated as separate terms.

Figure 4.13: The Tag Cloud after removing the stop words

To combine “user” and “users” into one term, we need to stem the words.

Stemming
Stemming is the process of converting words to their stemmed, or root, form. There are fairly complex algorithms
for doing this; the Porter stemmer is the most commonly used.

There is only one plural in our example: “users”, which should map to the same term as “user”. For now we will
enhance our implementation with SimpleStopWordStemmerMetaDataExtractor, whose code is shown in listing 4.5.

Listing 4.5 Implementation of SimpleStopWordStemmerMetaDataExtractor


package com.alag.ci.textanalysis.impl;

public class SimpleStopWordStemmerMetaDataExtractor extends
        SimpleStopWordMetaDataExtractor {

    protected String normalizeToken(String token) {
        if (acceptToken(token)) { <#1>
            token = super.normalizeToken(token);
            if (token.endsWith("s")) { <#2>
                int index = token.lastIndexOf("s");
                if (index > 0) {
                    token = token.substring(0, index);
                }
            }
        }
        return token;
    }
}

#1 if the token will be rejected, don’t bother normalizing it
#2 naïvely remove a trailing “s”

Here, we override the normalizeToken method. First, it checks that the token is not a stop word:
protected String normalizeToken(String token) {
    if (acceptToken(token)) {
        token = super.normalizeToken(token);
Then it simply removes “s” from the end of terms.
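
Stripping a trailing “s” is deliberately crude: it would mangle words such as “his”, and it misses plurals such as
“entries”. A slightly less naïve sketch, still nowhere near a real Porter stemmer, might also handle “ies” and “es”
endings:

// A sketch of slightly less naive suffix stripping. A production
// system would use a real stemmer such as the Porter stemmer; this
// method is only an illustration of the idea.
protected String stemToken(String token) {
    if (token.endsWith("ies") && token.length() > 4) {
        // entries -> entry
        return token.substring(0, token.length() - 3) + "y";
    }
    if (token.endsWith("es") && token.length() > 3) {
        // matches -> match
        return token.substring(0, token.length() - 2);
    }
    if (token.endsWith("s") && !token.endsWith("ss") && token.length() > 3) {
        // users -> user, but leave "class" alone
        return token.substring(0, token.length() - 1);
    }
    return token;
}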

Figure 4.14 shows the tag cloud obtained by stemming the terms. The algorithm now treats “user” and “users”
as one term, and “user” bubbles up.

Figure 4.14: The tag cloud after normalizing the terms

We now have four terms, “collective”, “intelligence”, “user”, and “web2.0”, to describe the blog entry. But
“collective intelligence” is really one phrase, so let us enhance our implementation to detect this term.

Detecting Phrases
“Collective intelligence” is the only two-term phrase that we are interested in. For this we will implement
SimpleBiTermStopWordStemmerMetaDataExtractor, the code for which is shown in listing 4.6.

Listing 4.6 Implementation of SimpleBiTermStopWordStemmerMetaDataExtractor


package com.alag.ci.textanalysis.impl;

import java.util.*;

import com.alag.ci.MetaDataVector;

public class SimpleBiTermStopWordStemmerMetaDataExtractor extends
        SimpleStopWordStemmerMetaDataExtractor {

    protected MetaDataVector getMetaDataVector(String text) {
        Map<String,Integer> keywordMap = new HashMap<String,Integer>();
        List<String> allTokens = new ArrayList<String>();
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            String token = normalizeToken(st.nextToken());
            if (acceptToken(token)) {
                Integer count = keywordMap.get(token);
                if (count == null) {
                    count = new Integer(0);
                }
                count++;
                keywordMap.put(token, count);
                allTokens.add(token); <#1>
            }
        }
        String firstToken = allTokens.get(0);
        for (String token: allTokens.subList(1, allTokens.size())) {
            String biTerm = firstToken + " " + token;
            if (isValidBiTermToken(biTerm)) { <#2>
                Integer count = keywordMap.get(biTerm);
                if (count == null) {
                    count = new Integer(0);
                }
                count++;
                keywordMap.put(biTerm, count);
            }
            firstToken = token;
        }
        MetaDataVector mdv = createMetaDataVector(keywordMap);
        return mdv;
    }

    private boolean isValidBiTermToken(String biTerm) { <#3>
        if ("collective intelligence".compareTo(biTerm) == 0) {
            return true;
        }
        return false;
    }
}

#1 store the normalized tokens in the order they appear
#2 take two tokens at a time and check if they are valid
#3 phrases are tested for validity against a phrase dictionary

Here, we override the getMetaDataVector method. We create a list of the valid tokens, in the order they appear,
and store them in the list allTokens.

Next, the following code combines adjacent tokens, two at a time, to check whether they form a valid phrase:

String firstToken = allTokens.get(0);
for (String token: allTokens.subList(1, allTokens.size())) {
    String biTerm = firstToken + " " + token;
    if (isValidBiTermToken(biTerm)) {

In our case, there is only one valid phrase, “collective intelligence”, and the check is done in the method
isValidBiTermToken:

private boolean isValidBiTermToken(String biTerm) {
    if ("collective intelligence".compareTo(biTerm) == 0) {
        return true;
    }
    return false;
}
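
In a real application, the hard-coded comparison would give way to a dictionary of known phrases. Here is a
minimal sketch of a dictionary-based check that could replace it; the phrases in the set are illustrative:

// A sketch of a dictionary-based phrase check. The contents of the
// set are illustrative; a real dictionary would be much larger and
// probably loaded from storage.
private Set<String> phraseDictionary = new HashSet<String>(
    Arrays.asList("collective intelligence", "user experience"));

private boolean isValidBiTermToken(String biTerm) {
    return this.phraseDictionary.contains(biTerm);
}

Since the listing already imports java.util.*, the Set, HashSet, and Arrays classes used here are available.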

Figure 4.15 shows the tag cloud for the title of the blog after using our new analyzer. As desired, there are four
terms: “collective”, “collective intelligence”, “intelligence”, and “web2.0”.

Figure 4.15: Tag cloud for the title after using the bi-term analyzer

The combined tag cloud for the blog now contains 14 terms, as shown in Figure 4.16. Five tags stand out:
“collective”, “collective intelligence”, “intelligence”, “user”, and “web2.0”.

Figure 4.16: Tag cloud for the blog after using a bi-term analyzer

Using phrases in the term vector can help improve finding other similar content. For example, if we had
another article, “Intelligence in a Child”, with tokens “intelligence” and “child”, there would be a match on the term
“intelligence”. However, if our analyzer were intelligent enough to extract only “collective intelligence”, without the
separate terms “collective” and “intelligence”, there would be no match between the two pieces of content.

Hopefully, this gives you a good overview of how text can be analyzed automatically to extract relevant
keywords or tags and build a MetaDataVector.

Now, every Item in your application has an associated MetaDataVector. As Users interact on your site, you
can use the MetaDataVectors associated with the Items to develop a profile for each user. Finding items similar to
a given item amounts to finding items that have a similar MetaDataVector.
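
How might “similar” be computed? A common choice is the cosine similarity between two term vectors. Here is a
minimal sketch over plain maps from term to weight; the book's MetaDataVector implementation in Chapter 8 may
compute similarity differently:

import java.util.Map;

// A sketch of cosine similarity between two term vectors represented
// as maps from term to weight. For non-negative weights the result
// lies between 0 and 1; higher means more similar.
public class TermVectorSimilarity {

    public static double cosineSimilarity(Map<String, Double> a,
                                          Map<String, Double> b) {
        double dotProduct = 0.0;
        for (Map.Entry<String, Double> entry : a.entrySet()) {
            Double weightInB = b.get(entry.getKey());
            if (weightInB != null) {
                dotProduct += entry.getValue() * weightInB;
            }
        }
        double normA = 0.0;
        for (double w : a.values()) {
            normA += w * w;
        }
        double normB = 0.0;
        for (double w : b.values()) {
            normB += w * w;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}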
