On the Use of Self-Organizing Maps to Classify Journal Articles

Jason Fong
University of California, Los Angeles
Computer Science
jfong@cs.ucla.edu
Abstract
The feasibility of using self-organizing maps for classifying a collection of science journal articles
is considered. In particular, the effect of using only short article abstracts is examined. In
addition, the use of multiple self-organizing maps over successive time periods to gain insight into
the evolution of a field of science is considered.

1 Introduction
It is often useful to be able to create a visualization of the relationships between different
areas of a particular field of research. This can allow one to spot trends and find areas that might
be interesting to investigate further. Such a visualization can also be useful in examining the
evolution of a field of research and how it changes over time.
One possible method of constructing this visualization is to create a two-dimensional map
of the articles with related articles located spatially close to each other. A common unsupervised
learning approach for constructing such a map is to use self-organizing maps. There are many
prior works [4],[5],[6],[7] showing that self-organizing maps are effective in classifying a
collection of text documents and building two-dimensional maps.
The document collection that is considered here is a collection of 22,732 articles from the
Virtual Journal of Nanoscale Science & Technology (http://www.vjnano.org). These are articles
from various scientific journals with topics relating to nanotechnology. For reasons that will be
discussed in Section 3, only the article abstracts are used to build the self-organizing maps. While
previous studies have used full document texts to build self-organizing maps, this study will
explore the possibility of using only much shorter article abstracts to build self-organizing maps
on a collection of science journal articles.

2 Prior Work
In [4], Kohonen created a self-organizing map (SOM) using over a million documents
from 80 different Usenet newsgroups. These documents are newsgroup postings covering a
variety of very different topics. The resulting SOM was successful in grouping together
newsgroup postings with similar topics. In [5], Kohonen et al. used similar methods to create a
SOM using patent documents.
In [7], Merkl and Rauber created a SOM using the 1990 edition of the CIA Factbook.
This is a collection of 245 text documents with information on the various countries and regions
in the world. The full texts of these documents were used to construct a SOM. The resulting map
had many geographically close countries located close together. There were also groupings for
items related for other reasons, such as the communist countries (as of 1990) or the various
oceans in the world.
In [6], Merkl created a self-organizing map using the manual pages from the NIH Class
Library. The NIH Class Library is a library for the C++ programming language for storing and
retrieving arbitrarily complex data structures. The resulting SOM successfully grouped together
related functions. This demonstrates that maps can be successfully created for documents of a
very technical nature.
These prior studies have shown that self-organizing maps can be very successful for
classifying text documents in a number of different fields. These prior works created SOMs based
on sizeable bodies of text. This study is different from those prior studies in that it focuses on
nanotechnology journal articles, and that it attempts to construct SOMs using only the smaller
amount of text available in the article abstracts.

3 Source Documents
The collection of documents used to create the self-organizing maps in this study is a set
of articles from the Virtual Journal of Nanoscale Science & Technology (VJN). These articles
come from 52 different journals and consist of 22,732 articles from January 2000 through
September 2005. The collection consists of articles from various science journals that include
articles concerning nanotechnology. This includes articles in the following subject areas:

Advances in Fabrication and Processing
Structural Properties
Electronic Structure and Transport
Nanomagnetism and Spintronics
Imaging Science and Technology
Optical Properties and Quantum Optics
Micro and Nano Electromechanical Systems (MEMS / NEMS)
Carbon Nanotubes, C60, and Related Studies
Quantum Coherence, Computing, and Information Storage
Supramolecular and Biochemical Assembly
Organic-Inorganic Hybrid Nanostructures
Surface and Interface Properties
Chemical Synthesis Methods
Only the abstract text of each article is used in training the self-organizing map. This is
done for two reasons. The first reason is that the abstract text is much shorter than the full article
text. For a large collection of documents, this saves disk storage and reduces the computation
time needed to select terms and construct the document representations (more on this in Section
4). The collection of documents used in this study is relatively small and would not run into
excessive difficulties if the full text of the documents were used. However, if the process used in
this study is to be extended to larger document collections, it would be useful to know if building
a SOM using only article abstracts can still produce useful results.
The second reason to only use abstracts is that they can be more easily obtained than full
article texts. In the case of the VJN articles, the full article text is available on the VJN website.
However, for a general application to articles outside of the VJN, full article texts are not always
easily available without charge.

4 Document Representation
The SOMs considered in this study operate on neurons with weights expressed as floating
point numbers. Thus, a numeric representation of each document is needed in order to construct
the self-organizing map. A common approach is to select a set of terms that are relevant to the
collection of documents and represent each document as a vector of numeric values that
correspond to the importance of each term in a document.
For the construction of the self-organizing map, the set of terms was chosen by selecting
the 200 terms that appear in the largest number of different documents. A term is considered to be a
group of characters separated by whitespace, but excluding the following:

a single letter
a single letter followed by numbers (probably a variable in an equation)
mathematical formulas
numbers
common words such as "the", "it", "and", "is", etc.
Before selecting the 200 terms, the remaining terms were stemmed in order to combine
multiple forms of a term into a single term. For example, the words "walk", "walks", and
"walking" would all be stemmed to the same term "walk". The stemming was performed using the
Snowball stemmer included in the Tsearch2 [2] full-text search extension for the PostgreSQL
database management system, which uses the stemming algorithm by Porter [1]. This
system was used since the VJN data was already stored in a PostgreSQL database,
so the Tsearch2 extension was the most convenient method for stemming the terms.
The number of terms was set to 200 because the terms ranked below the top 200
appear in fewer than 5% of the documents. This cutoff is somewhat arbitrary, but it
still resulted in groupings that appeared reasonable.
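The term-selection step described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the Tsearch2-based pipeline used in the study: the stop-word list, the regular expressions, the function names `tokenize` and `select_terms`, and the three toy abstracts are all hypothetical, and stemming is omitted.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the study's actual list is not given.
STOP_WORDS = {"the", "it", "and", "is", "of", "a", "in", "on", "at", "are"}

def tokenize(text):
    """Split on whitespace and drop the token types excluded in the text."""
    terms = []
    for raw in text.lower().split():
        token = raw.strip(".,;:()[]\"'")
        if len(token) <= 1:                               # empty or a single letter
            continue
        if re.fullmatch(r"[a-z]\d+", token):              # single letter followed by numbers
            continue
        if not re.fullmatch(r"[a-z]+(?:-[a-z]+)*", token):  # numbers, formulas, symbols
            continue
        if token in STOP_WORDS:                           # common words
            continue
        terms.append(token)
    return terms

def select_terms(abstracts, n_terms):
    """Return the n_terms terms that appear in the most distinct documents."""
    doc_freq = Counter()
    for text in abstracts:
        doc_freq.update(set(tokenize(text)))              # count each term once per document
    # Sort by descending document frequency, breaking ties alphabetically.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n_terms]]

# Toy abstracts, invented for illustration only.
abstracts = [
    "Carbon nanotubes are grown on a silicon substrate.",
    "Quantum dots on silicon show strong emission at x2 = 5.",
    "The carbon nanotubes emit electrons in a strong field.",
]
print(select_terms(abstracts, 3))
```

In the study the same ranking would be applied to stemmed terms with `n_terms` set to 200; the alphabetical tie-break here is an arbitrary choice made only to keep the sketch deterministic.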
With the 200 terms selected, the importance of each term in a document was determined
using the term frequency-inverse document frequency (tf-idf) method [9]. This is computed as
follows:

$$ \mathrm{tfidf} = \mathrm{tf} \cdot \mathrm{idf} $$

$$ \mathrm{tf} = \frac{n_i}{\sum_k n_k} $$

$$ \mathrm{idf} = \log \frac{|D|}{|d_i|} $$

The term tf is the term frequency and measures how important a term is to a particular
document by measuring how often a term appears in a document with respect to the total number
of terms in the document. The tf value increases for terms that make up a greater fraction of the
total terms in a document. $n_i$ is the number of times the term $i$ appears in the document.
$\sum_k n_k$ is the total number of terms in the document.
The term idf is the inverse document frequency and measures how important a term is to
identifying a specific document within the entire document collection. The idf value is greater for
terms that appear in fewer documents since such a term will match with a smaller subset of the
collection of documents. $|D|$ is the total number of documents in the collection. $|d_i|$ is the number
of documents in which the term $i$ appears.
Each input document is represented by a vector of the tf-idf values for each of the 200
terms. Each element of this vector represents how important a term is to a document. This vector
can be imagined as representing a point in a 200-dimensional space. Points that are spatially close
to each other are likely to represent documents with similar content since the tf-idf values for
their terms are similar.
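As a rough illustration of the tf-idf representation described above, the following Python sketch builds one vector per document and compares documents by Euclidean distance. The function names, the toy stemmed documents, and the five-term vocabulary are hypothetical. Two further assumptions are made because the text does not settle them: the natural logarithm is used for idf, and the tf denominator counts every token in the document rather than only the selected terms.

```python
import math

def tfidf_vectors(docs, terms):
    """docs: list of token lists; terms: the selected vocabulary.
    Returns one tf-idf vector (list of floats) per document."""
    n_docs = len(docs)                                   # |D|
    # |d_i|: number of documents in which term i appears
    doc_freq = {t: sum(1 for d in docs if t in d) for t in terms}
    vectors = []
    for doc in docs:
        total = len(doc)                                 # sum_k n_k (assumed: all tokens)
        vec = []
        for t in terms:
            if total == 0 or doc_freq[t] == 0:
                vec.append(0.0)
                continue
            tf = doc.count(t) / total                    # n_i / sum_k n_k
            idf = math.log(n_docs / doc_freq[t])         # log(|D| / |d_i|)
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy documents made of stemmed terms, invented for illustration only.
docs = [
    ["quantum", "dot", "electron", "dot"],
    ["quantum", "dot", "energi", "state"],
    ["carbon", "nanotub", "electron", "wall"],
]
terms = ["quantum", "dot", "electron", "carbon", "nanotub"]
vecs = tfidf_vectors(docs, terms)

# The two quantum-dot documents should lie closer together than
# a quantum-dot document and the nanotube document.
print(euclidean(vecs[0], vecs[1]) < euclidean(vecs[0], vecs[2]))
```

In the study each vector would have 200 elements rather than five.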

5 Self-Organizing Maps
Self-organizing maps have become a common tool for applying unsupervised learning to
the classification of text documents. The SOMs used in this study consist of a square array of
neurons. Each neuron i in the array has a weight vector assigned to it. This weight vector contains
200 elements, with one element corresponding to each of the 200 terms used to represent the
input documents.
The activation of each of these neurons is calculated as the Euclidean distance between
the input vector and the weight vector of the neuron. After these values are calculated for all of
the neurons with respect to a particular input document, a single neuron is selected as the
winner. The winner neuron $c$ is the neuron that is most similar to the input vector. This is
determined by finding the neuron with the minimum Euclidean distance from the input
vector for the input document:

$$ c : \| x(t) - m_c(t) \| = \min_i \| x(t) - m_i(t) \| $$

The winner neuron $c$ is thus the neuron with the lowest activation as calculated
above. $x(t)$ is the tf-idf vector of the input document. $m_i(t)$ is the weight vector of neuron $i$.
$m_c(t)$ is the weight vector of the winner neuron.
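A minimal Python sketch of the winner selection described above is given here. The dictionary layout (grid position mapped to weight vector), the function name `find_winner`, and the three-element toy vectors are assumptions made for illustration; in the study the vectors have 200 elements.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def find_winner(x, weights):
    """weights: dict mapping grid position (row, col) -> weight vector.
    Returns the position of the neuron with the lowest activation,
    i.e. the smallest Euclidean distance to the input vector x."""
    return min(weights, key=lambda pos: euclidean(x, weights[pos]))

# Hypothetical 2x2 map with three-element weight vectors.
weights = {
    (0, 0): [0.9, 0.1, 0.0],
    (0, 1): [0.5, 0.5, 0.0],
    (1, 0): [0.0, 0.2, 0.8],
    (1, 1): [0.1, 0.1, 0.1],
}
x = [0.6, 0.4, 0.0]
print(find_winner(x, weights))
```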
After the winner neuron is selected, its weight vector is adjusted to move it a bit closer to
the input vector. This makes the neuron more similar to the input document and more likely to be
selected as the winner the next time the same document is presented to the SOM. In addition, a
number of neurons around the winner are also adjusted so that similar documents will tend to be
drawn toward locations near the winner neuron. The number of neighboring neurons affected in
this way decreases with time and is controlled by the following neighborhood function:

$$ h_{ci}(t) = \exp\left( -\frac{\| r_c - r_i \|^2}{2 \sigma^2(t)} \right) $$

This neighborhood function is a Gaussian that scales down the amount of the weight
adjustments with distance away from the winner neuron. The amount that it scales down is
greater for neurons at a greater distance from the winner neuron. $r_i$ is the two-dimensional vector
for the location of a neuron $i$ in the two-dimensional SOM. $r_c$ is the two-dimensional vector for
the location of the winner neuron. $\| r_c - r_i \|$ is the distance between a neuron $i$ and the winner
neuron. The usual process is to initially set the size of this adaptation neighborhood to a wide area
and then to gradually reduce this area until only the winner neuron is adapted. This leads to an
initial formation of large clusters and then finer adjustments toward the end of the training
iterations. This reduction in the size of the adaptation area is determined by the time-varying
parameter $\sigma(t)$.
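The Gaussian neighborhood and its shrinking width can be sketched as follows. The neighborhood function follows the formula in the text; the decay schedule `sigma_at`, including its start and end values, is purely hypothetical, since the schedule used in the study is not stated.

```python
import math

def neighborhood(r_c, r_i, sigma):
    """Gaussian neighborhood value for grid positions r_c (winner) and r_i."""
    dist_sq = (r_c[0] - r_i[0]) ** 2 + (r_c[1] - r_i[1]) ** 2
    return math.exp(-dist_sq / (2.0 * sigma ** 2))

def sigma_at(t, t_max, sigma_start=3.0, sigma_end=0.5):
    """Hypothetical exponential decay of the neighborhood width over time."""
    return sigma_start * (sigma_end / sigma_start) ** (t / t_max)

# Neighborhood values along one row of a 5x5 map, winner in the center.
winner = (2, 2)
for t in (0, 50, 100):
    s = sigma_at(t, 100)
    row = [round(neighborhood(winner, (2, col), s), 3) for col in range(5)]
    print(t, round(s, 3), row)
```

Early in training the values fall off slowly with distance, so many neighbors are adjusted; late in training only the winner receives a value near 1.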
As each document is presented to the SOM, the affected neurons calculate their new
weights according to the following formula:

$$ m_i(t+1) = m_i(t) + \alpha(t) \cdot h_{ci}(t) \cdot [x(t) - m_i(t)] $$

$m_i(t)$ is the current value of the weight vector of neuron $i$. $m_i(t+1)$ is the new value of the
weight vector of neuron $i$. $[x(t) - m_i(t)]$ is the difference between the input vector for the
input document and the neuron's weight vector. $\alpha(t)$ is the time-varying learning rate. This learning rate
decreases with time so that finer adjustments are made to the neuron weights as the training
process progresses. $h_{ci}(t)$ is the previously discussed neighborhood function that controls the size
of the area affected by the weight adjustment.
The result of an adjustment to a neuron's weight vector is to move the neuron a bit closer
to the position of the input document vector. This makes the winner neuron more similar to the
input document so that the next time the same document or similar documents are presented, the
neuron will be more likely to win. The neighboring neurons are also made a bit more similar
(but less so than the winner neuron) so that they are more likely to recognize the document or
similar documents. The end result is that neurons recognizing similar documents are located in
spatially close positions.
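The weight adjustment can be illustrated with a short Python sketch. The function name `update_weights`, the dictionary layout, and the fixed values chosen for the learning rate and the neighborhood width are assumptions for this example only.

```python
import math

def update_weights(weights, x, winner, alpha, sigma):
    """Move every weight vector toward x, scaled by the learning rate alpha
    and the Gaussian neighborhood around the winner. Updates in place.
    weights maps grid position (row, col) -> weight vector (list of floats)."""
    for pos, m in weights.items():
        dist_sq = (winner[0] - pos[0]) ** 2 + (winner[1] - pos[1]) ** 2
        h = math.exp(-dist_sq / (2.0 * sigma ** 2))
        for k in range(len(m)):
            m[k] += alpha * h * (x[k] - m[k])

# Hypothetical 1x3 map; all neurons start at the origin.
weights = {(0, 0): [0.0, 0.0], (0, 1): [0.0, 0.0], (0, 2): [0.0, 0.0]}
x = [1.0, 1.0]
update_weights(weights, x, winner=(0, 0), alpha=0.5, sigma=1.0)
for pos in sorted(weights):
    print(pos, [round(v, 4) for v in weights[pos]])
```

The winner moves halfway to the input, while its neighbors move by progressively smaller amounts as their grid distance from the winner grows.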

6 Constructing a Self-Organizing Map
The self-organizing map was constructed using the Java Object Oriented Neural Engine
(Joone) [3]. This is a toolkit for building models of neural networks using the Java programming
language. The network was modeled as follows:

[Figure: Input Neuron Layer, connected by a Kohonen Synapse to the Winner-Take-All Neuron Layer]

The first layer of neurons is an input layer consisting of 200 neurons. Each of these
neurons corresponds to one of the 200 document description terms. This input layer receives
values from a file containing the tf-idf values for each term in each document.
The second (and last) layer is the winner-take-all (WTA) layer consisting of a rectangular
array of neurons. The arrangement of these neurons corresponds to points in the resulting
two-dimensional map. The response of this layer is that the neuron with the lowest input value is
selected as the winner and outputs a value of 1. All other neurons output a value of 0.
The input layer and the winner-take-all layer are connected by a Kohonen synapse. This
synapse handles the algorithm for a self-organizing map by adjusting the weight vector
assignments for each of the neurons in the WTA layer. When a document is presented to the
SOM, the input layer presents the Kohonen synapse with an input vector (the tf-idf values for
each of the 200 terms). The Kohonen synapse then calculates the Euclidean distance between the
input vector and each of the current weight vectors of the neurons in the WTA layer. These
Euclidean distances are sent to the WTA layer, which selects the winner and informs the Kohonen
synapse. The Kohonen synapse then adjusts the weights of the winner neuron and some number
of neighboring neurons according to the SOM algorithm described in Section 5.
In this study the problem space can be visualized as a 200-dimensional space. The tf-idf
values of the input documents and the weight vector values of the WTA array neurons can be
visualized as occupying a point in this 200-dimensional space that corresponds with their vector
value. The weight adjustments of the WTA array neurons can be visualized as a process that
moves a WTA array neuron closer to the position of the input document.
A single iteration of the SOM training is complete when each document has been
presented to the SOM and the weights of the neurons in the WTA layer have been adjusted
accordingly. The training iterations are repeated until the weights of the WTA neurons change by
a negligible amount.
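The training procedure can be sketched end to end in plain Python. This is not the Joone network used in the study: the function names, the linear decay schedules for the learning rate and neighborhood width, the convergence tolerance, the random initialization, and the toy two-dimensional inputs are all assumptions made for illustration.

```python
import math
import random

def train_som(vectors, rows, cols, max_iters=50, tol=1e-4,
              alpha0=0.5, sigma0=None, seed=0):
    """Train a rows x cols SOM. One iteration presents every document once;
    training stops early when the largest single weight change in an
    iteration falls below tol."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    sigma0 = sigma0 or max(rows, cols) / 2.0
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for it in range(max_iters):
        # Hypothetical linear decay of learning rate and neighborhood width.
        frac = 1.0 - it / max_iters
        alpha = alpha0 * frac
        sigma = max(sigma0 * frac, 0.3)
        max_change = 0.0
        for x in vectors:
            # Minimizing the squared distance selects the same winner
            # as minimizing the Euclidean distance.
            winner = min(weights, key=lambda p: sum(
                (a - b) ** 2 for a, b in zip(x, weights[p])))
            for pos, m in weights.items():
                d2 = (winner[0] - pos[0]) ** 2 + (winner[1] - pos[1]) ** 2
                h = math.exp(-d2 / (2.0 * sigma ** 2))
                for k in range(dim):
                    delta = alpha * h * (x[k] - m[k])
                    m[k] += delta
                    max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return weights

def assign(vectors, weights):
    """Map each document to the grid cell of its winner neuron."""
    return [min(weights, key=lambda p: sum(
        (a - b) ** 2 for a, b in zip(x, weights[p]))) for x in vectors]

# Toy inputs forming two loose clusters, invented for illustration only.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
som = train_som(docs, rows=2, cols=2)
cells = assign(docs, som)
print(cells)
```

For the maps in the study, `rows` and `cols` would be 8 or 5 and each input vector would hold 200 tf-idf values.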
A number of different SOMs were created using this process. In order to observe a map
of the entire document collection, two maps trained with the entire document collection were
created. One used an 8x8 array of neurons in the WTA layer, and the other used a 5x5 array of
neurons in the WTA layer. In order to explore the possibility of using SOMs to observe the
change of a research field over time, additional maps were created with subsets of the entire
document collection. These subsets were taken so that each set contained the articles from one
half of a year. Twelve more maps were created, one for each half-year from 2000 to 2005.
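One way to form the half-year subsets is sketched here. The record layout, the article identifiers, and the choice of January to June and July to December as the two halves are assumptions; the text does not state exactly where the boundary was drawn.

```python
from collections import defaultdict
from datetime import date

def half_year_key(d):
    """Return (year, half) where half is 1 for Jan-Jun and 2 for Jul-Dec."""
    return (d.year, 1 if d.month <= 6 else 2)

def split_by_half_year(articles):
    """articles: iterable of (article_id, publication_date) pairs.
    Returns a dict mapping (year, half) -> list of article ids."""
    subsets = defaultdict(list)
    for article_id, pub_date in articles:
        subsets[half_year_key(pub_date)].append(article_id)
    return dict(subsets)

# Hypothetical records, invented for illustration only.
articles = [
    ("a1", date(2000, 1, 15)),
    ("a2", date(2000, 7, 3)),
    ("a3", date(2000, 6, 30)),
    ("a4", date(2005, 9, 12)),
]
for key, ids in sorted(split_by_half_year(articles).items()):
    print(key, ids)
```

Each resulting subset would then be used to train its own map.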

7 Evaluation Methodology
Even though the document collection is relatively small with 22,732 documents, it is still
too large to manually check all of the document classifications. However, a less stringent
verification can be performed by using ordinary human intelligence and some familiarity with
topics in the nanotechnology field. The top terms in each SOM group can be examined to verify
that they are reasonable for the nanotechnology field. Also, we can take a small random sample of
articles in each group and check that the article titles and abstracts fit with the assigned group.
The full article texts are also available, so those can be examined to further confirm that an article
has been properly classified.
After the self-organizing maps were trained, they were used to create tables with key
terms for each group. The key terms were chosen by selecting the terms that occur the most often
in a group, and also the terms that occur in the most documents in a group. This creates tables that
are somewhat cluttered and difficult to understand. In order to fit the tables on a page and
to make them easier to understand, the tables shown in the results in Section 8 are abbreviated
versions of these tables. Most of the terms appear in both the "appears most often" and the
"appears in the most documents" lists, so the two lists can usually be combined. The abbreviated
tables are created by first selecting the top 5 terms from the "appears most often" list. If this does
not include the top two terms from the "appears in the most documents" list, then those two terms
replace the 4th and 5th terms from the "appears most often" list. This process attempts to strike
some balance between terms that appear very often in a few documents and terms that occur in
many documents.
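The abbreviation rule above can be sketched as a small Python function. The function name and the example rankings are hypothetical. The text does not say what happens when only one of the two required terms is missing, so this sketch makes an assumption: each missing term displaces the lowest-ranked term that is not itself required.

```python
def abbreviate(most_often, most_docs, size=5, must_have=2):
    """Combine the two ranked term lists into one abbreviated list.
    most_often: terms ranked by total occurrences in the group.
    most_docs:  terms ranked by number of documents in the group."""
    top = list(most_often[:size])
    required = most_docs[:must_have]
    missing = [t for t in required if t not in top]
    # Make room for each missing term by dropping the lowest-ranked
    # term that is not itself one of the required terms.
    for _ in missing:
        for idx in range(len(top) - 1, -1, -1):
            if top[idx] not in required:
                del top[idx]
                break
    return top + missing

# Hypothetical rankings for one group, using stems of the kind seen in the maps.
most_often = ["dot", "quantum", "electron", "state", "energi", "layer"]
most_docs = ["quantum", "structur", "dot"]
print(abbreviate(most_often, most_docs))
```

When both required terms are absent from the top five, the sketch reduces to the rule as stated: they replace the 4th and 5th terms.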
In the resulting maps, some cells contain many documents while other cells contain only
a few. The cells that contain many documents are likely to be actual classification groups, while
the cells that contain only a few documents are likely the result of documents that were not
strongly associated with any of the larger groups. In order to more easily distinguish the larger
groups, the table cells with larger groups of articles are in bold. Also included is a count of the
number of occurrences of each term in a group.

8 Results
Even though using only the article abstracts gives a relatively small amount of text to
describe each document, the resulting self-organizing maps appear to be effective at classifying
the documents. This is not an exhaustive analysis since there are too many documents to be able
to completely verify each of them. There are also some documents that seem to be out of place in
some classifications, but this is to be expected since self-organizing maps are not perfect in their
classifications.

The following is an 8x8 self-organizing map of the entire document collection. Each line lists the
top terms of one cell, with the number of occurrences of each term in parentheses:

nanotub(1801), carbon(1476), wall(920), electron(762), singl(707)
imperfect(3), detector(2), copi(2), alon(1), scheme(1)
micro(704), quantum(114), time(113), structur(112), temperatur(108)
fullerit(2), action(1), irradi(1), undergo(1), soften(1)
sampl(1047), measur(275), temperatur(246), field(212), tip(205)
laser(1255), quantum(319), puls(259), optic(235), temperatur(230)
state(1741), entangl(604), quantum(590), local(450), qubit(243)
mode(1450), photon(254), frequenc(248), crystal(199), structur(176)
fulleren(3), obtain(3), character(2), symmetri(2), microscopi(1)
phase(1602), transit(390), temperatur(332), structur(261), quantum(224)
analyz(4), use(3), fit(2), nano(2), langevin(2)
current(1640), voltag(375), electron(340), quantum(279), field(270)
reson(1435), frequenc(393), quantum(263), electron(222), field(221)
photon(10), pair(4), entangl(4), scheme(4), state(2)
polar(1089), spin(482), electron(219), quantum(219), field(193)
drug(4), system(2), deliveri(2), therapeut(2), nanoparticl(1)
dot(4712), quantum(3561), electron(1374), state(1011), energi(820)
copi(7), provid(5), distil(3), entangl(3), suffic(2)
atom(1392), electron(283), structur(272), cluster(227), surfac(218)
free(1), tini(1), float(1), field(1), driven(1)
emiss(1564), field(704), electron(386), current(327), nanotub(290)
electron(2592), quantum(2161), energi(1821), structur(1803), field(1621)
entangl(3), present(2), thermal(1), mirror(1), antiferromagnet(1)
cell(4), mechan(3), mechanotransd(2), receptor(2), vertebr(2)
bound(2), distil(2), entangl(2), protocol(2), bipartit(1)
devic(1104), electron(237), fabric(181), gate(174), base(160)
synthesi(2), jet(1), boron(1), arc(1), scandium(1)
conduct(1705), electron(551), temperatur(455), quantum(383), transport(292)
bind(3), spacer(2), mesh(2), ion(2), network(2)
paint(1), substrat(1), glass(1), inexpens(1), circuitri(1)
oper(461), quantum(290), state(248), entangl(173), qubit(168)
pattern(666), surfac(124), substrat(123), fabric(112), process(102)
photon(2112), crystal(1382), structur(518), optic(462), dimension(415)
layer(1961), thick(440), structur(401), quantum(265), dot(254)
defect(3), perfect(2), combin(2), object(2), problem(2)
charg(1173), electron(345), quantum(277), state(233), energi(191)
correl(606), quantum(236), electron(210), state(180), system(168)
signal(1), flash(1), light(1), ring(1), molecul(1)
nanodevic(1), properti(1), retain(1), gallium(1), macroscop(1)
spin(4681), electron(1152), magnet(851), quantum(836), field(672)
worldwid(1), quantum(1), mark(1), commun(1), world(1)
interfac(767), structur(157), layer(150), electron(141), surfac(121)
beam(624), electron(240), fabric(147), optic(141), structur(139)
thermal(626), conduct(299), temperatur(223), measur(104), effect(95)
xe(5), load(3), indent(3), valu(3), express(3)
si(1839), ge(363), surfac(319), structur(250), layer(239)
particl(1573), size(337), magnet(244), nanoparticl(193), interact(180)
forc(1121), measur(273), tip(246), atom(234), microscop(178)
silicon(870), structur(158), high(144), oxid(132), electron(130)
tunnel(1355), electron(463), quantum(296), junction(288), barrier(264)
molecul(1527), electron(351), singl(314), structur(286), surfac(259)
assembl(790), self(690), selfassembl(590), structur(285), surfac(189)
nanotub(2457), carbon(1008), electron(314), wall(288), singl(277)
band(1038), gap(876), photon(497), structur(395), crystal(255)
tip(4), virus(4), icosahedr(3), rna(3), genom(3)
wave(709), function(170), electron(153), quantum(120), field(100)
growth(1159), surfac(288), temperatur(249), deposit(207), substrat(206)
lattic(646), structur(155), crystal(136), electron(109), dimension(108)
surfac(2192), energi(376), structur(273), atom(268), electron(258)
nm(1422), diamet(392), structur(298), nanowir(285), electron(264)
magnet(3738), field(1759), spin(582), electron(473), effect(471)
film(2473), thin(611), deposit(487), temperatur(459), substrat(358)
vortex(4), vortexantivortex(3), type(3), antivortex(3), superconductor(2)
scatter(999), electron(362), quantum(246), effect(198), well(148)

The validity of this map was checked by randomly sampling documents from the
groups and verifying that they are reasonably related and that they match the terms identified for
the group. For the most part this held true. The following is a small sample of article titles
found within the groups:

This group appears to be about carbon nanotubes:

nanotub(1801), carbon(1476), wall(920), electron(762), singl(707)

Low temperature burnable carbon nanotube paste component for carbon nanotube field
Screen printed carbon nanotube field emitter array for lighting source application
Density functional theory calculations of energy-loss carbon near-edge spectra of small diameter armchair and zigzag nanotubes: Core-hole, curvature, and momentum-transfer orientation effects
Spindt tip composed of carbon nanotubes
Carbon Nanotube Single-Electron Transistors at Room Temperature
Structural Determination of Isolated Single-Wall Carbon Nanotubes by Resonant Raman
Single-Molecule Torsional Pendulum
Radial-breathing-like phonon modes of double-walled carbon nanotubes
Theoretical study of the adsorption of H2 on (3,3) carbon nanotubes
Adhesion between single-walled carbon nanotubes

This group appears to be about quantum dots:

dot(4712), quantum(3561), electron(1374), state(1011), energi(820)

Quantum dots in magnetic fields: Thermal response of broken-symmetry phases
On the nature of quantum dash structures
Temporal variation in photoluminescence from single InGaN quantum dots
Effect of carrier hopping and relaxing on photoluminescence line shape in self-organized InAs quantum dot heterostructures
GaAs buffer layer morphology and lateral distributions of InGaAs quantum dots
Self-assembled quantum-dot molecules by molecular-beam epitaxy
Growth of high optical quality InAs quantum dots in InAlGaAs/InP double heterostructures
Incompressible states in double quantum dots
Growth and magnetic properties of self-assembled (In, Mn)As quantum dots
Fine structure of trions and excitons in single GaAs quantum dots

A self-organizing map of size 5x5 neurons was also created to explore the effect of
smaller map sizes. The smaller map will have a more coarse-grained classification of the
documents since there are fewer cells available and some merging of classifications will likely
occur. However, the smaller maps can be built more quickly since fewer neurons need to be updated at
each learning iteration.

The following is a single 5x5 self-organizing map of the entire document collection. Each line lists
the top terms of one cell, with the number of occurrences of each term in parentheses:

cell(4), mechan(3), reveal(3), receptor(2), organ(2)
electron(5297), quantum(3734), temperatur(3070), effect(3068), structur(2901)
forc(1457), tip(430), measur(423), atom(405), microscop(287)
dot(5877), quantum(4535), electron(1777), state(1348), energi(1081)
surfac(3579), layer(1935), structur(1793), growth(1437), substrat(1144)
state(1856), quantum(920), entangl(832), local(790), qubit(455)
free(1), tini(1), float(1), field(1), liquid(1)
film(3074), thin(782), deposit(597), temperatur(579), surfac(526)
bind(3), spacer(2), mesh(2), ion(2), network(2)
laser(1559), quantum(394), optic(364), puls(340), temperatur(291)
nanotub(4592), carbon(2531), wall(1064), electron(1016), singl(976)
spin(5700), electron(1448), magnet(1102), polar(1077), quantum(1009)
particl(1789), size(379), magnet(275), nanoparticl(237), interact(199)
signal(1), flash(1), light(1), ring(1), molecul(1)
si(2029), ge(408), surfac(384), layer(310), structur(288)
worldwid(1), quantum(1), mark(1), commun(1), world(1)
photon(2970), crystal(1913), structur(937), optic(715), dimension(672)
paint(1), substrat(1), glass(1), inexpens(1), circuitri(1)
nanodevic(1), properti(1), retain(1), gallium(1), macroscop(1)
emiss(1877), field(765), electron(501), current(390), quantum(347)
drug(4), system(2), deliveri(2), therapeut(2), nanoparticl(1)
magnet(4979), field(2459), spin(838), electron(773), temperatur(741)
phase(2165), transit(569), temperatur(470), structur(422), system(354)
mode(1824), frequenc(328), photon(300), reson(276), optic(264)
molecul(1876), electron(520), singl(407), structur(395), surfac(343)

The large groups found in the 5x5 map also appear as large groups in the 8x8 map.
However, some of the large groups in the 8x8 map do not appear in the 5x5 map. Those missing
groups have probably been merged into more dominant groups in the 5x5 map. This is to be
expected since the 5x5 map can support fewer groups. An interesting result is that the small
groups of documents in the 5x5 map also appear as the same small groups in the 8x8 map. This
suggests that those small groups are not just spurious results from the SOM building process.
Those small groups could signify a new emerging area of study that does not fit with the existing
areas. However, the small groups could also be established areas of study that just happen to have
very few articles published.
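The persistence of small groups across the two maps can be checked with a simple set overlap. The sketch below applies the Jaccard index, which is a measure chosen here for illustration and not one used in the study, to the top terms of three small cells copied from the two maps. The short labels "drug", "cell", and "free" are just the first term of each cell.

```python
def jaccard(a, b):
    """Jaccard index of two term lists: |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Top terms of three small cells, copied from the 8x8 and 5x5 maps.
cells_8x8 = {
    "drug": ["drug", "system", "deliveri", "therapeut", "nanoparticl"],
    "cell": ["cell", "mechan", "mechanotransd", "receptor", "vertebr"],
    "free": ["free", "tini", "float", "field", "driven"],
}
cells_5x5 = {
    "drug": ["drug", "system", "deliveri", "therapeut", "nanoparticl"],
    "cell": ["cell", "mechan", "reveal", "receptor", "organ"],
    "free": ["free", "tini", "float", "field", "liquid"],
}
for name in cells_8x8:
    print(name, round(jaccard(cells_8x8[name], cells_5x5[name]), 2))
```

The "drug" cell is identical in both maps, while the other two share most of their top terms.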
The time to construct the different sizes of SOMs appears to be roughly proportional to
the number of neurons in the WTA layer. This is expected since as each document is presented to
the SOM, each neuron in the WTA layer needs to calculate its Euclidean distance from the input.
The time to complete the calculations for the entire WTA layer would then be roughly
proportional to the number of elements in the layer. This holds true for the observed times to
construct the SOMs. The time to construct the SOM for the 8x8 array was approximately 5 hours
and 45 minutes on a 2.4 GHz Intel Pentium 4. The time to construct the SOM for a 5x5 array was
approximately 2 hours. The ratio of these construction times, about 2.9 to 1, is a bit more than the
64:25 (2.56 to 1) ratio of neurons in the 8x8 SOM versus the 5x5 SOM. This is
reasonable considering that the 5x5 SOM would probably need less time to adjust the weights
since the 5x5 SOM has a greater proportion of neurons on the edges of the map. Such neurons
would have fewer neighbors, so the time to adjust the neighbors' weights would be lower.
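The arithmetic behind this comparison can be reproduced with a few lines of Python. The timings are the approximate values reported above; the helper `edge_fraction` is a hypothetical name, and it counts every neuron on the border of a square grid as an edge neuron.

```python
def edge_fraction(n):
    """Fraction of neurons lying on the border of an n x n grid."""
    interior = max(n - 2, 0) ** 2
    return (n * n - interior) / (n * n)

minutes_8x8 = 5 * 60 + 45   # approximately 5 hours 45 minutes
minutes_5x5 = 2 * 60        # approximately 2 hours

time_ratio = minutes_8x8 / minutes_5x5
neuron_ratio = (8 * 8) / (5 * 5)

print(f"time ratio:        {time_ratio:.3f}")
print(f"neuron ratio:      {neuron_ratio:.3f}")
print(f"edge fraction 5x5: {edge_fraction(5):.4f}")
print(f"edge fraction 8x8: {edge_fraction(8):.4f}")
```

The time ratio (2.875) exceeds the neuron ratio (2.56), and 64% of the neurons in the 5x5 map lie on its border compared with about 44% in the 8x8 map, which is consistent with the explanation given.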
Even though the 5x5 map is smaller and coarser grained in detail, it still appears to be
useful. Many of the large groups, as well as the small groups, appear in both the 8x8 and 5x5
SOMs. As can be seen by the sample maps, the smaller 5x5 map is easier to mentally grasp since
there are far fewer groups and terms to consider. However, this ease of quick understanding
comes at the cost of a loss of fine detail, as can be seen by the greater variety of groups in the 8x8
map.
With the ease of quick understanding in mind, more 5x5 maps were created for subsets of
the entire document collection. The subsets were for halves of each year from 2000 to 2005. A
total of twelve 5x5 maps for the subsets were created. (For the sake of brevity, these maps are not
included in this paper.) These half-year maps can be used to analyze the change of the
nanotechnology field over time.
The resulting maps revealed some changes in the groups over time. Some groups
remained roughly the same throughout the different time periods. Examples of these are the group
involving quantum dots and the group involving magnetic fields. Some other groups showed
changes over the time periods. An interesting example of this is the group on carbon nanotubes.
The first time period analyzed is the first half of 2000. In this time period, there is a fairly
large number of articles in a group where "carbon" and "nanotube" are the top two terms. The last
time period analyzed is the second half of 2005. In this time period, there is no group where
"carbon" and "nanotube" are the top terms. However, "carbon" and "nanotube" are found together in
multiple other groups, but as lower ranked terms.
One possible explanation for this is that the year 2000 was still relatively soon after the
discovery of nanotubes, so there were many articles that focused on carbon nanotubes. By
2005, other areas of nanotechnology may have emerged and spurred many
articles focused on those areas. The dispersion of "nanotube" and "carbon" into various other cells
could suggest that carbon nanotubes may be receiving less exclusive attention.
Another possible explanation is that carbon nanotubes became so common in
nanotechnology research that the importance of the "carbon" and "nanotube" terms decreased in the
tf-idf measure. If a term appears in many documents, then the idf part would decrease in value.
Thus, if these terms appeared in many documents, then their tf-idf values would decrease
and those terms might not be as likely to be dominant in a classification group.
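The effect of widespread use on the idf part can be illustrated numerically. The collection size and the document counts in this sketch are hypothetical and are not taken from the VJN data; the natural logarithm is assumed.

```python
import math

def idf(n_docs, n_docs_with_term):
    """Inverse document frequency: log(|D| / |d_i|)."""
    return math.log(n_docs / n_docs_with_term)

n_docs = 1000  # hypothetical collection size for one time period
for with_term in (50, 250, 500, 1000):
    share = with_term / n_docs
    print(f"{share:.0%} of documents -> idf = {idf(n_docs, with_term):.3f}")
```

As a term spreads to a larger share of the documents its idf shrinks toward zero, so its tf-idf weight falls even if it is used just as often within each document.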
The correctness of this explanation will take more than a self-organizing map to
determine. However, whether or not it is correct is not the focus of this study. What is important
is that the use of the self-organizing maps highlighted an aspect of the collection of
nanotechnology articles that may warrant closer examination by other means.

9 Future Work
In [6], Merkl discusses a method of using hierarchical maps to build large SOMs. This
approach has the advantage of being faster than building the entire map as one large map, and the
hierarchical structure helps to focus finer grained analysis on documents that are more likely to be
related. This approach could be attempted on the VJN article maps in order to further classify the
large groups into more specifically defined groups.
The ability to recognize categories in nanotechnology research could possibly be
improved by using terms that are known to be important to the field of nanotechnology.
Generating a list of such terms is not trivial, however, so this study used a generic approach using
all terms in the documents. Such a list of nanotechnology terms could potentially improve the
accuracy of the identification of subfields in nanotechnology since the terms used to construct the
SOMs would be related to the field. In [8], Lagus and Kaski suggest some methods for term
selection and labeling of map regions that may be useful.

10 Conclusions
This study demonstrated that self-organizing maps can be successfully used to classify
documents from a collection of journal articles. In addition, this classification can be successful
even if only a relatively short abstract text is available for each article. The sizes of the maps do
not need to be very large in order for the maps to begin to provide useful results. A small 5x5
map contained many of the details available in a larger 8x8 map. However, the larger maps
contain more different classifications than the smaller maps due to classifications being merged
together in the smaller maps. In addition, subsets of documents taken over different time periods
were shown to be useful in gaining some insight into the evolution of a field of research.

References
[1] M. F. Porter, "An Algorithm for Suffix Stripping," Readings in Information Retrieval, San
Francisco: Morgan Kaufmann, 1997, pp. 313-316
[2] Tsearch2 - full text extension for PostgreSQL,
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2
[3] Joone - Java Object Oriented Neural Engine, http://www.jooneworld.com
[4] T. Kohonen, "Self-organization of very large document collections: State of the art," Proc. of
the Int'l Conf. on Artificial Neural Networks (ICANN'98), Skövde, Sweden, 1998
[5] T. Kohonen, S. Kaski, K. Lagus, J. Salojärvi, J. Honkela, V. Paatero, A. Saarela, "Self
organization of a massive document collection," IEEE Transactions on Neural Networks, Vol.
11, No. 3, pp. 574-585, 2000
[6] D. Merkl, "Exploration of text collections with hierarchical feature maps," Proc. Int'l ACM
SIGIR Conf. on R&D in Information Retrieval (SIGIR'97), Philadelphia, PA, 1997
[7] D. Merkl, A. Rauber, "Document classification with unsupervised artificial neural networks,"
Soft Computing in Information Retrieval: Techniques and Applications, Vol. 50, pp. 102-121,
Heidelberg: Physica Verlag, 2000
[8] K. Lagus, S. Kaski, "Keyword selection method for characterizing text document maps," Proc.
of the Int'l Conf. on Artificial Neural Networks (ICANN'99), Edinburgh, UK, 1999
[9] G. Salton, C. Buckley, "Term-weighting approaches in automatic text retrieval," Information
Processing & Management, Vol. 24, Iss. 5, pp. 513-523, 1988
