
Documentation of the Jaguar Project

(for Quantitative Corpus Analysis)
Author: Rogelio Nazar
Email: rogelio.nazar@upf.edu
Originally written in Spanish on the 23rd of January, 2008.
Translated into English on September 10th, 2010.
URL of this Document: http://melot.upf.edu/jaguar/doc/DocumentationJaguar.pdf
Table of Contents
1 Introduction......................................................................................................................................5
1.1 List of functions of the program and overview of the documentation......................................5
1.1.1 Corpus...............................................................................................................................6
1.1.2 Analysis of the vocabulary of the corpus..........................................................................6
1.1.3 Extraction of concordances...............................................................................................6
1.1.4 Sorting of n-grams............................................................................................................7
1.1.5 Measures of association....................................................................................................7
1.1.6 Measures of distribution....................................................................................................7
1.1.7 Measures of similarity.......................................................................................................7
1.2 Use of the interface...................................................................................................................8
1.3 Use of the module.....................................................................................................................8
1.3.1 Installation of the code in a local computer......................................................................8
1.3.2 Instantiation of the module.............................................................................................10
2 Creation of the corpus....................................................................................................................10
2.1 General description.................................................................................................................10
2.2 Use of the interface.................................................................................................................11
2.2.1 Uploading texts to the server..........................................................................................11
2.2.2 Extraction of a corpus from the web...............................................................................13
2.2.3 Exporting the corpus.......................................................................................................15
2.3 Use of the module...................................................................................................................15
2.3.1 Assigning a path on a local computer.............................................................................15
2.3.2 File Management.............................................................................................................16
2.3.3 Downloading a corpus using URL addresses.................................................................17
2.3.4 Downloading a corpus with the help of a search engine.................................................17
2.3.5 Exporting the corpus.......................................................................................................18
3 First analysis of the corpus.............................................................................................................19
3.1 General description.................................................................................................................19
3.1.1 Indexing the corpus.........................................................................................................19
3.1.2 Analysis of the vocabulary..............................................................................................19
3.1.3 Coefficients of lexical richness.......................................................................................20
3.1.4 Charting the vocabulary growth......................................................................................20
3.2 Use of the interface.................................................................................................................22
3.3 Use of the module...................................................................................................................23
3.3.1 Analysis of vocabulary richness.....................................................................................23
3.3.2 Automatic detection of the language of the documents..................................................24
3.3.3 Analysis of the vocabulary and indexing........................................................................24
4 Extraction of Concordances (KWIC).............................................................................................25
4.1 General Description................................................................................................................25
4.2 Use of the interface.................................................................................................................26
4.3 Use of the module...................................................................................................................28
5 Sorting of n-grams..........................................................................................................................29
5.1 General Description................................................................................................................29
5.2 Use of the interface.................................................................................................................31
5.3 Use of the module...................................................................................................................33
6 Measures of association..................................................................................................................34
6.1 General description.................................................................................................................34
6.1.1 Mean, variance and standard deviation of lexical co-occurrence...................................35
6.1.2 Measures of association based on frequencies................................................................37
6.2 Use of the interface.................................................................................................................43
6.2.1 Association measures based on the distance of the co-occurring units..........................43
6.2.2 Measures of association based on frequency..................................................................45
6.3 Use of the module...................................................................................................................46
7 Measures of term distribution.........................................................................................................52
7.1 General description.................................................................................................................52
7.1.1 Inverse Document Frequency.........................................................................................55
7.1.2 Dispersion Coefficient....................................................................................................56
7.1.3 Other measures of dispersion..........................................................................................57
7.1.4 Diachronic analysis of the frequencies of terms.............................................................57
7.2 Use of the interface.................................................................................................................58
7.3 Use of the module...................................................................................................................61
8 Measures of similarity....................................................................................................................65
8.1 General description.................................................................................................................65
8.1.1 Measures for vector comparison.....................................................................................66
8.1.2 Similarity Coefficients....................................................................................................67
8.1.3 Distance measures...........................................................................................................68
8.2 Use of the interface.................................................................................................................69
8.3 Use of the module...................................................................................................................72
9 Concluding remarks........................................................................................................................76
10 Bibliography.................................................................................................................................77
1 Introduction
This is the documentation of the software Jaguar, a tool for corpus exploitation. This software can
analyze textual corpora provided by the user or drawn from the web, and it is currently available
as a web application and as a Perl module. The functions that are available at this moment are:
vocabulary analysis of corpora, concordance extraction, n-gram sorting, and measures of
association, distribution and similarity.
Jaguar is essentially a Perl module instantiated as a web application ( http://jaguar.iula.upf.edu/ ). A
web application has the advantage of being executable on any platform without installation
procedures. As a Perl module, it can be used with a basic notion of the Perl programming language,
which is commonly used among linguists. Using the module, the user is capable of building his or
her own sequence of procedures, taking the output of a process to be the input of another process.
The web interface, on the other hand, is intended for users less familiar with programming
languages. It has the limitation that only one procedure can be executed at a time, meaning that the
output of a process has to be manually fed as input for the next one. Future versions of this
program will include some capability for building pipelines or working plans through the graphical
interface. The other limitation is that the server where the application is currently installed cannot
process a corpus larger than 25-30 Megabytes of plain text.
Considering that there are two types of users for this program, each of the functions that will be
explained in this documentation will be divided into three sections: a first one with a general
introduction, a second with instructions for the use of the graphic interface and a third with an
explanation of how to use the module. Hopefully, this documentation will offer some guidance in
the use of the module even for those who do not have previous programming experience, because
all the elements of the syntax are explained and exemplified.
1.1 List of functions of the program and overview of the documentation
In its current state, Jaguar is divided into seven main groups of functions:
Corpus creation
Plot of vocabulary growth
Extraction of concordances
Sorting of n-grams
Measures of association
Measures of distribution
Measures of similarity
1.1.1 Corpus
The first function of the program consists in the selection and constitution of the corpus. The user
has the possibility to work with his or her own corpus or, alternatively, to download the corpus
automatically from the web. There are also functions to export the corpus as a single file in XML
code.
In the case of the web-based version, the corpus can be constituted by uploading files to the server
or by querying a search engine and downloading the results from the web. In both cases, the files are
aggregated into a single corpus. In the case of the module, the user can assign a corpus simply by
indicating a path to a local folder, or obtain it from the Internet via a search engine or a list of URL
addresses. The size of the corpus depends on hardware limitations, but in any case this program has
never been tested with, and was never meant to be used with, a corpus larger than 30 Megabytes of
plain text.
1.1.2 Analysis of the vocabulary of the corpus
This first analysis offers information such as the type/token ratio and plots of vocabulary growth. It is a
necessary previous step for most operations because it is at this stage that the corpus is indexed.
1.1.3 Extraction of concordances
Like most corpus tools, this program includes the extraction of concordances or KWIC (keyword in
context). The program accepts a string of text or a regular expression as input and it outputs the
contexts that match the query. The size of the context can be specified in number of words or a full
sentence.
1.1.4 Sorting of n-grams
This function will output a vector with the n-grams of the corpus (1 ≤ n ≤ 5) sorted by decreasing
frequency order. This is one of the simplest methods to discover collocations, terminology or
multiword expressions.
1.1.5 Measures of association
This section offers different measures of statistical association between lexical units. The first one
analyses the variance of the distance that separates two units in different contexts by means of
histograms of lexical distance. Different parameters, but mainly the mode and variance of the
sample, allow one to observe patterns of co-occurrence that cannot be captured by simple frequency-based
n-gram sorting. Other measures of association between lexical units based on probabilistic
criteria are the well-known t-test, chi-square, Mutual Information and Cubic Mutual Information.
1.1.6 Measures of distribution
This section is devoted to the analysis of the distribution of the frequencies of a term within the
collection of documents that constitutes the corpus. The number of times that a term occurs in a
document can be useful to determine how relevant or specific the term is to that document.
However, one can also obtain more information by taking into account not just the frequency of
occurrence but also how the term is distributed across the documents. Similarly, if the analyzed
corpus is a chronologically ordered collection of documents, it is interesting to know not only how
frequent a term is at a certain period of time but also if the frequency shows any tendency over time.
1.1.7 Measures of similarity
The functions grouped in this section take as input a set of vectors and offer indexes of pairwise
similarity. In the simplest case, one of these functions will accept two vectors, which must be
presented as plain text files with a component in each line and a corresponding numeric value
separated by a tab. Some built-in functions will automatically transform text to vectors prior to the
operation. Typically, a user may be interested in obtaining approximate matches of a query
expression in a corpus by an orthographic similarity measure (which cannot be done with the simple
KWIC function) or to compare documents, selecting the n most similar documents of a corpus in
reference to one that is selected by the user.
1.2 Use of the interface
As already said, the web interface user does not need to install the source code. The graphic
interface can be accessed by registering as a user with a form such as the one shown in Figure 1,
which will ask for name, e-mail address, affiliation and an optional field for comments. Currently,
the interface is offered in English, Catalan and Spanish. Regular updates provide new information
and are displayed below the links of the left navigation bar.
Figure 1: Home Page of the Jaguar Project ( http://jaguar.iula.upf.edu )
1.3 Use of the module
1.3.1 Installation of the code on a local computer
Unfortunately, it is not possible to freely distribute the source code of the program which is
necessary to install the module. The way to obtain it has to be by direct request to the Institute for
Applied Linguistics ( http://www.iula.upf.edu ).
The module is a zip file that has to be uncompressed somewhere on a local computer which has Perl
already installed. Choosing an arbitrary folder will force the user to specify the path in every piece of code
that he or she writes. Thus, it may be convenient to uncompress the zip file in a default library
folder, which in Linux (Ubuntu) may be something like:
/usr/share/perl5
In Windows, instead, the path can look like:
C:/Perl/lib
Installation on platforms other than Linux is not recommended, since there are dependencies on
other Linux programs that will probably be hard to find and, thus, some functionality will be lost,
for example the conversion of documents between encodings and formats. The majority of the
functions should work fine in Windows, but in general the program is poorly tested on that platform.
It is a much better idea to use Jaguar on Linux, especially since Linux is free and it is not
necessary to remove Windows to install it.
Optionally, there also exists a Perl script named GDGraph.pl, which is a file that can be executed
online (thus, installed on a server) and is a very convenient solution for the generation of images in
html results. In Linux (Ubuntu), the path for that script may be something like:
/usr/lib/cgi-bin/
If there is no server available and this script is not installed, then the images will be generated and
stored on disk.
The dependencies that are needed for specific functions are listed with their URLs in Table 1. LWP
is used to download corpora from the web. The Yahoo API is not actually installed on the local
computer; it is a search engine service. The user must register at the provided URL and
request an ID number, which Jaguar will expect in order to query the search engine.
GD::Graph is used for the generation of images. File::Type is used to guess the type of the files,
and the last three tools are used to convert the files to plain text.
Component                        URL
LWP                              http://www.linpro.no/lwp/
Yahoo Api Boss                   http://developer.yahoo.com/search/boss/boss_guide/
GD::Graph                        http://search.cpan.org/~mverb/GDGraph-1.43/Graph.pm
File::Type                       http://search.cpan.org/dist/File-Type/
pdftotext (Noonburg, 2004)       http://www.foolabs.com/xpdf/README
pstotext (Birrell et al., 1995)  http://pages.cs.wisc.edu/~ghost/doc/pstotext.htm
antiword (van Os, 2003)          http://www.winfield.demon.nl/index.html
Table 1: Dependencies of the module
1.3.2 Instantiation of the module
The first step to use the module is to begin a Perl script. It is not mandatory to have previous
experience in Perl programming. It should be enough to just download the provided examples and
change the parameter values according to the user's purposes. The code in Example 1 shows how
Jaguar is instantiated in a variable with the arbitrary name $J. Please notice that it is not possible to
copy and paste source code from the pdf version of this documentation. Follow the "Download this
example" links instead, which appear next to each example box.
Download this example
use strict;
use Jaguar;
my $J = Jaguar->new();
Example 1: Instantiation of the Jaguar module
2 Creation of the corpus
2.1 General description
This section will not attempt to give a theoretical definition of what a corpus is. We assume a
corpus is a collection of documents gathered together according to some given criterion. There are
currently two ways of obtaining the corpus:
1) To upload texts to the server (or to assign a path in case of using the module)
2) To download the texts from the web.
Once the corpus has been created it will also be possible to export it, either with the web interface or
with the module, as described at the end of this chapter.
2.2 Use of the interface
After registration, the user will be allowed to open a session, create a corpus and exploit it. A
corpus that is created will remain available in subsequent sessions.
2.2.1 Uploading texts to the server
In order to work with a corpus provided by the user, the files are uploaded to the server using a
regular web form. The best idea is to upload plain text files encoded in UTF-8 if the language uses
diacritics. The program will accept and attempt to convert files uploaded in other formats such
as Word, Pdf, PostScript, Zip, html, xml or plain text files encoded in Latin 1; however, conversion
always entails a probability of error.
To upload the files one has to use the upload file button, shown in Figure 2. Any code between
< and > brackets will be deleted. Any number of files can be uploaded, but the server will not be
able to process more than 30 Megabytes of plain text. There is also a checkbox, next to the
browse button, which has to be checked only if the files are vectors (tabulated data) instead of
plain text. A vector or tabulated data file is also in plain text format, but the data is displayed in the
form of a table where each element is on a line, followed by a tab and then a numeric value. These files
are used for specific functions that will be discussed in Section 8.
Figure 2: Html form to upload a corpus
Files can be accumulated and listed at any moment with the button list files shown in Figure 3, as
well as deleted (with the button delete) or exported (as shown in Section 2.2.3 ).
Figure 3: Listing the files of the corpus
2.2.2 Extraction of a corpus from the web
This function will allow the user to download a corpus from the web by querying search engines or
a list of URL addresses provided by the user. It is important to take into account that the only thing
the program does at this stage is to download the documents that the search engine returns; there
are no criteria for classification. It is expected, therefore, that the corpus will be noisy. To
download an LSP corpus from the web, the Wska (1) module should be used, which is an
independent application launched in parallel to Jaguar.
The following parameters are passed through the html form shown in Figure 4.
1) the query expression (up to ten words)
2) the number of documents to download
3) the language of the documents
4) the type of document (html / pdf / ps / doc)
5) if a cache is to be used
1 The Wska module can be executed online at the URL http://melot.upf.edu/wuska ; however, it is still not
documented and the interface is currently only available in Spanish.
Figure 4: Html form to query a search engine
Figure 5: Process of corpus downloading
If we query the web form with the expression Chomsky Kant, the program will start downloading
documents that include both words. If the query is enclosed in quotation marks, it will only
download documents matching the query as an exact phrase. Figure 5 shows the process of
downloading the documents, with a green progress bar indicating the percentage of documents
downloaded and converted. The program will report the cases where a document could not be
reached.
2.2.3 Exporting the corpus
The corpus can be exported at any moment as a single xml document. The function allows exporting
either the whole corpus or only some of the documents. When listed, each file has a checkbox that can be used to
indicate which files should be exported. After pressing the export button, a link with the name
exported.xml will appear.
2.3 Use of the module
Similarly to the web interface, when using the module the user can either analyze his or
her own documents or download a corpus from the web, using lists of URL addresses or by
querying a search engine.
2.3.1 Assigning a path on a local computer
To use files stored in a local computer as a corpus, one only needs to assign the exact (relative or
absolute) path of the folder containing those files. Ideally, those files should be in plain text encoded
in UTF8; however, the program will try to automatically recognize and convert other formats.
Download this example

$J->corpus(path       => "/path/to/the/corpus/",
           language   => "En",
           encodingIn => "utf8"
          );

Example 2: Assigning a corpus on a local computer


As said before, it is not strictly necessary to explicitly indicate the language of the documents,
because the program will try to detect it automatically. Currently, the
languages that the program is able to recognize are the following:
'Ca'=>"Catalan",
14
'Es'=>"Spanish",
'En'=>"English",
'Fr'=>"French",
'It'=>"Italian",
'De'=>"German",
'Ne'=>"Flemish",
'Tr'=>"Turk".
The inclusion of a new language model is a fairly simple procedure, although in its current
implementation the program will not allow the user to do it. Users interested in working with
languages other than those listed above are invited to send a request by e-mail.
It could also be the case that the user does not wish to indicate a path to the corpus, but rather to assign
a string of text already stored in a Perl variable, as in Example 3 with the variable $corpus.
Download this example
$J->corpus(text => $corpus);
Example 3: Assigning a corpus as a string of text contained in a scalar variable
2.3.2 File Management
The module makes it easy to read and write files and folders on the local computer. A folder will be
loaded as an array of file names. A file will be loaded as an array of lines of text. In order to write a
file, a path and a text string (in a scalar variable) should be assigned. If the path does not exist, the
program will attempt to create it. The operation will always return a true value, even in case of
error, with the corresponding error message replacing the content of the file.
Download this example
my @dir  = $J->readDirectory("/path/to/the/directory");
my @file = $J->readFile("/path/to/the/file");
$J->writeFile("/path/to/the/file", $content);
# $content is a string of text
Example 4: File Management
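As a quick illustration of how these three methods combine (a minimal sketch, not one of the distributed examples; the folder names are hypothetical and it is assumed that readDirectory returns bare file names, as described above):

use strict;
use Jaguar;

my $J = Jaguar->new();

# read the file names contained in a folder
my @dir = $J->readDirectory("/path/to/the/directory");

for my $name (@dir) {
    # each file is loaded as an array of lines of text
    my @lines = $J->readFile("/path/to/the/directory/$name");
    # join the lines into a single string and write a copy elsewhere;
    # whether the lines keep their newline characters is not specified,
    # so one is added here when joining
    $J->writeFile("/path/to/copies/$name", join("\n", @lines));
}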
2.3.3 Downloading a corpus using URL addresses
Another possibility to load a corpus is by providing a list of URLs where the corpus is located.
The list of URL addresses can be contained in a variable, named $URLlist in Example 5. The
addresses should be separated by a new line character (\n). The variable $URLlist can also be the
name of a local file containing the URL addresses. It is important to activate the clean parameter in
order to have the program convert the different file formats to plain text and to UTF8 encoding. If
this parameter is activated, then it is also necessary to assign a path where the program can create
(and later delete) temporary files that are needed for the conversion process. From that moment the
downloaded content, in plain text, will be loaded into memory in the Jaguar object. The verbose
parameter is useful for debugging, but it also provides information about the amount of information
that is lost during the conversion process.
Download this example
$J->corpus(url     => $URLlist,
           verbose => 1,
           clean   => 1,
           path    => "/tmp/",
          );
Example 5: Downloading a corpus through a list of URL addresses
2.3.4 Downloading a corpus with the help of a search engine
Jaguar includes a specific module called WebCorpus that accepts a query expression and other
parameters as input and outputs documents from the web in the form of a hash table that has the
URL addresses as keys and the plain text content of the documents as values. This is done by
querying the Yahoo Boss platform (2) which, as already said, will require an ID number from the user.
Example 6 shows a query to the module, and is divided into two main parts. The first one is
where the query is made, and the second part implements a loop printing the contents of the
downloaded documents. The most important is the first part, because the second will depend on
what the user wants to do with the downloaded content.
2 http://developer.yahoo.com/search/boss/boss_guide/
Download this example
use Jaguar::WebCorpus;
my $B = WebCorpus->new(
        query   => "respiratory germs",
        verbose => 1,
        tmp     => "/tmp/",
        lang    => "En",
        clean   => 1,
        type    => "pdf",
        docs    => 30,
        apiKey  => "*********",   # Yahoo Boss Api Key.
);
foreach my $key (keys %{$B}) {
        print "\n$key => $B->{$key}";
}
Example 6: Querying a search engine using Jaguar.
2.3.5 Exporting the corpus
This method will export the corpus, and the needed parameters are the following:
format => txt by default, but can be binary or xml.
name => The full path including the destination file name.
files => This is an optional parameter to specify the names of the files (a subcorpus) to be exported,
and is specified as an array reference. By default, all documents are exported.
verbose => Verbosity can be set to any true value or to very. Non-verbose output is given
by default.
Download this example
my $i = $J->export(verbose => 'very',
                   format  => 'xml',
                   name    => 'exportedCorpus',
                   files   => \@names,   # this is optional
                  );
Example 7: Method to export the corpus
The code in Example 7 will generate an XML file with a syntax similar to the one shown in
Example 8, which shows a fragment of an English version of Kant's Critique of Pure Reason that
the program downloaded from the web.
<corpus>
<title>salidaXML(1)</title>
<docs>
<doc1>
<url><![CDATA[acon.txt]]></url>
<content>
<![CDATA[Critique of Pure Reason

(Analytic of Concepts)
P092
TRANSCENDENTAL DOCTRINE OF
ELEMENTS
SECOND PART
TRANSCENDENTAL LOGIC
INTRODUCTION
IDEA OF A TRANSCENDENTAL LOGIC
I
LOGIC IN GENERAL
OUR knowledge springs from two fundamental sources of the
mind; the first is the capacity of receiving representations
(receptivity for impressions), the second is the power of know-
ing an object through these representations (spontaneity [in
the production] of concepts). Through the first an object is
given to us, through the second the object is thought in
relation to that [given] representation (which is a mere
determination of the mind). Intuition and concepts constitute,
therefore, the elements of all our knowledge, so that neither
concepts without an etc...
Example 8: Sample of exported corpus
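Since the export is plain XML, it can be read back with any standard XML library. The following sketch is not part of Jaguar; it assumes that the CPAN module XML::LibXML is installed and that the exported file (here called exportedCorpus) is well formed:

use strict;
use XML::LibXML;

# load the file produced by the export method
my $doc = XML::LibXML->load_xml(location => 'exportedCorpus');

# the documents are wrapped in <doc1>, <doc2>, ... elements inside <docs>
for my $node ($doc->findnodes('/corpus/docs/*')) {
    my $url     = $node->findvalue('./url');
    my $content = $node->findvalue('./content');
    print "$url: ", substr($content, 0, 60), "...\n";
}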
3 First analysis of the corpus
3.1 General description
The operations that are usually performed right after defining a corpus are the following: 1) indexing
the corpus, 2) analysis of the vocabulary growth and 3) analysis of the type-token ratio. Automatic
detection of the language of the documents will also be performed during this stage, and this is
important because the user will most likely be interested in having documents written in a single
language.
3.1.1 Indexing the corpus
It is necessary to index the corpus in order to undertake the more sophisticated functions of
the program. Only KWIC and n-gram sorting are possible without indexing. The corpus will have to
be re-indexed every time it is altered by adding or deleting documents.
3.1.2 Analysis of the vocabulary
The analysis of the vocabulary consists of a table with values expressing the evolution of the
vocabulary growth in a corpus divided into parts (usually documents or arbitrary partitions of the
same size). Table 2 shows the result of the analysis corresponding to the already mentioned English
version of the Critique of Pure Reason by Kant downloaded from the web. In this case, the
segments correspond to chapters of the book. The first column of the table gives an id number for each
part, followed by the name of the part, its extension (in tokens), the cumulative extension, the
vocabulary of the document, the vocabulary growth and the cumulative vocabulary.
Part ID  Part Name  Size   Cumulative Size  Vocabulary  Vocabulary Growth  Cumulative Vocabulary
1        antin.txt  38357  38357            1594        1594               1594
2        ancon.txt  29732  68089            1200        389                1983
3        prefs.txt  20293  88382            1244        342                2325
4        acon.txt   29732  118114           1200        0                  2325
5        dmeth.txt  37775  155889           1853        541                2866
6        paral.txt  33210  189099           1460        248                3114
7        ideal.txt  34351  223450           1562        268                3382
8        anpri.txt  44538  267988           1622        265                3647
9        aesth.txt  9774   277762           556         21                 3668
Table 2: Result of the first analysis
3.1.3 Coefficients of lexical richness
Different coefficients have been proposed to measure the vocabulary richness of a text or
corpus, often called the temperature of discourse (Mandelbrot, 1961). These coefficients are
different variants of the most basic one, the type-token ratio, i.e. the ratio between the vocabulary
and the size of the text. The main problem of lexical richness coefficients is that they are sensitive
to the size of the analyzed text, thus one cannot compare samples of different sizes. Jaguar
implements a coefficient proposed by Herdan (1964), shown in Equation 1, which is intended to
make up for the size effect, at least in part.

H(d) = log(types(d)) / log(tokens(d))     (1)
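As an illustration of Equation 1, the coefficient can be computed directly from a string with plain Perl (a minimal sketch with a naive tokenization, independent of the Jaguar methods described in Section 3.3.1):

use strict;

my $text = "a rose is a rose is a rose";

# naive tokenization: lowercase the text and split on non-letter characters
my @tokens = grep { length } split /[^a-z]+/, lc $text;
my %seen;
my @types  = grep { !$seen{$_}++ } @tokens;

my $ttr    = @types / @tokens;                          # type-token ratio
my $herdan = log(scalar @types) / log(scalar @tokens);  # Equation 1

printf "tokens=%d types=%d ttr=%.3f Herdan=%.3f\n",
       scalar @tokens, scalar @types, $ttr, $herdan;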
3.1.4 Charting the vocabulary growth
The information obtained in the previous section can be graphically represented to reflect in a more
intuitive way the rate of vocabulary growth at each part of the corpus. Ideally, in a representative
sample, this vocabulary growth should tend to zero. An asymptotic curve would be indicative of the
fact that a point has been reached at which few new words are added to the collection as more
documents are added.
Figures 6 and 7 show, respectively, the curves of vocabulary growth and size of the text of the
different chapters of the same English version of the Critique of Pure Reason.
Figure 6: Cumulative extension in chapters of the Critique of Pure Reason
Figure 7: Vocabulary growth of the Critique of Pure Reason
The function of vocabulary growth is well known in linguistics and in the field of information
retrieval (Heaps, 1978). The logarithm of the vocabulary (D) is a linear function of the logarithm of
the extension (N), thus D = kN^β, where k and β are constants (9 ≤ k ≤ 35 and 0.5 ≤ β ≤ 0.66).
Figure 8 represents the vocabulary growth as a function of size in IULA's Spanish LSP Corpus (3).
3 http://bwananet.iula.upf.edu [accessed May 2009].
[Log-log plot titled "Vocabulary as a Function of Extension", with extension on the horizontal axis and vocabulary on the vertical axis]
Figure 8: Relation between vocabulary growth and size of the collection
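To give a concrete sense of this relation, the following toy sketch (not part of Jaguar; the constants are arbitrary values within the ranges quoted above) predicts the vocabulary size D for growing extensions N:

use strict;

# Heaps' law: D = k * N**beta, with arbitrary constants inside
# the ranges quoted above (9 <= k <= 35, 0.5 <= beta <= 0.66)
my $k    = 20;
my $beta = 0.6;

for my $N (1_000, 10_000, 100_000, 1_000_000) {
    my $D = $k * $N ** $beta;
    printf "extension %8d  ->  predicted vocabulary %8.0f\n", $N, $D;
}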
3.2 Use of the interface
The process of corpus indexing and analysis of the type-token ratio is triggered by the index button,
which is located on all pages until it is pressed and is always present on the corpus page to allow
re-indexing the corpus as needed. The output of this process (Figure 9) is the table already explained
in Section 3.1, with the automatic detection of the language of the documents, the size, the
vocabulary growth and the vocabulary richness of each document expressed as Herdan's coefficient.
Figure 9: Indexing the corpus
3.3 Use of the module
3.3.1 Analysis of vocabulary richness
There are different methods to analyze the vocabulary richness of a text or corpus. Example 9
shows a complete script with an instantiation of the module, assignment of a corpus with a variable
$corpus and extraction of different values, namely the extension (tokens), vocabulary (types), the
type-token ratio (ttr) and Herdan's coefficient.
Download this example
use Jaguar;
my $J = Jaguar->new;
$J->corpus(text => $corpus);
print $J->tokens;
print $J->types;
print $J->ttr;
print $J->Herdan;
Example 9: Methods to analyze the extension, vocabulary and their ratio
3.3.2 Automatic detection of the language of the documents
This method guesses the language of a given document, provided the language is known to the program.
The method is shown in Example 10. The chances of success are greatly influenced by the size of
the text: the larger, the better. Again, the program has currently been trained to recognize the
following languages (others will be added progressively): 'Ca' => "Catalan", 'Es' => "Spanish",
'En' => "English", 'Fr' => "French", 'It' => "Italian", 'De' => "German", 'Ne' => "Flemish",
'Tr' => "Turkish".
Download this example
my $text = "This is a string of text written in some language";
print $J->languageDetection(
      sample  => $text,
      verbose => "very",
);
Example 10: Method for language detection
3.3.3 Analysis of the vocabulary and indexing
The method for indexing the corpus (shown in Example 11) needs to be executed once at the
beginning of the analysis. From then on, any number of different analyses of the corpus can be
undertaken, until the corpus is altered by the addition or elimination of texts, in which case the corpus
will have to be indexed again.
Download this example
$Jaguar->index(corpusPath  => "/path/to/corpus/directory/",
               indexPath   => "/path/to/write/index/",
               language    => "En",
               encodingIn  => "utf8",
               encodingOut => "utf8",
               clean       => 1,
               verbose     => "very",
               n           => 3,
              );
Example 11: Method for indexing the corpus
The parameters are:
corpusPath => Full path of a directory containing the files of the corpus
indexPath => Path of a directory where the index file will be stored.
language => Any of the languages known by the program, shown in Section 3.3.2 ('Ca', 'Es', 'En',
'Fr' , 'It' , 'De', 'Ne', 'Tr').
limitN => The maximum number of n, which is the number of words to be indexed as n-grams that
will constitute the entries of the index. With value 1, only single words are indexed.
With 2, single words and bigrams (sequences of two words) are included. With 3, single
words, bigrams and trigrams and so on. It is not recommended to index with low
frequency and high n because of the exponential growth of computational cost.
limitFrec => The minimum frequency of the units to be indexed (again, an important factor
because it would not be convenient to index large chains of words of frequency 1).
clean => If the corpus contains some sort of tagging (such as html, sgml or xml), if it is in different
languages or document formats (doc, pdf, ps), or if it is in heterogeneous encodings, activating
this parameter makes the program attempt to strip the tags and convert the documents to clean
plain text in UTF8 encoding, eliminating those documents which are not in the specified
language.
verbose => By default, the program works in silence. It becomes verbose if this parameter is set to
any true value and very verbose if the parameter is set to very.
4 Extraction of Concordances (KWIC)
4.1 General Description
The term KWIC, or keyword in context, refers to the process of retrieving contexts of occurrence of a
given expression within the analyzed corpus. The contexts are defined as windows of a
parameterizable size. Table 3 shows some examples of contexts of the word innate in the Chomsky
Kant corpus which was downloaded from the web on August 30th, 2010.
#   left context | query | right context
1)  kant, chomsky and the problem of knowledge language as | innate | mental organ
2)  a universal grammar underlying all languages and corresponding to an | innate | capacity of the human brain. chomsky and other linguists
3)  , believing that individual human beings are born with no | innate | mental content; it is a blank slate, tabula
5)  am i selecting a signal from a finite behavioural repertoire | innate | or learned. furthermore, it is wrong to think
6)  interplay of available data, heuristic procedures, and the | innate | schematism that restricts and conditions the form of the acquired
7)  conditions that must be met by any such assumption about | innate | structure are moderately clear. thus, it appears to
8)  solve in a satisfactory way. we must postulate an | innate | structure that is rich enough to account for the disparity
9)  to data. at the same time, this postulated | innate | mental structure must not be so rich and restrictive as
10) exact character of the complexity that can be postulated as | innate | mental structure. the factual situation is obscure enough to
11) much difference of opinion over the true nature of this | innate | mental structure that makes acquisition of language possible. however
12) problem for tomorrow is that of discovering an assumption regarding | innate | structure that is sufficiently rich, not that of finding
13) notion of plausibility, no a priori insight into what | innate | structures are permissible, that can guide the search for
Table 3: Some contexts of occurrence of the word innate
Given that currently the program does not offer lemmatization of the texts, the KWIC searches can
only be done by word form. In Chapter 8 of this documentation, another type of KWIC is offered,
this time not by string matching but rather based on approximate string matching using
orthographic similarity coefficients.
4.2 Use of the interface
The KWIC link, in the left navigation bar, directs the user to the page to obtain the contexts of
occurrence of a given query expression in the corpus. The user will then be presented with a form
such as the one in Figure 10, which includes a field for the query expression and another to specify
the size of the context window in number of words at each side or, alternatively, a context of a
sentence. The screen with the results is also shown in Figure 11.
Figure 11: Some of the contexts retrieved
Figure 10: Web form for KWIC extraction
4.3 Use of the module
The method to extract the concordances is shown in Example 12. The two necessary parameters are
the query expression, which is not case sensitive, and the size of the context window measured in
number of words at each side. Alternatively, the context can be a full sentence, by specifying the
value sentence. It must be said that the sentence recognition is rather naive, because it does not
include strategies for the disambiguation of the punctuation marks. However, in the vast majority of
the cases the detection of the sentence boundaries will be correct.
Download this example
my @matches = $Jaguar->kwic(query  => "innate",
                            window => 10);

print "\nConcordances: " . (scalar(@matches));

for my $i (0..$#matches) {
    print "\n" . ($i+1) . ") "
        . $matches[$i]->{'left'} .
        "\t#" . $matches[$i]->{'query'} . "#\t" .
        $matches[$i]->{'right'};
}

Example 12: Method for the extraction of concordances


As shown in the example code, the result of the operation is stored in an array. The number of
matches is the number of elements in the array, and in order to display the context, a loop has to be
implemented. Each element of the array is composed of the left context:
$matches[$i]->{'left'}.
The query expression:
"\t#".$matches[$i]->{'query'}."#\t".
and finally the right side of the context:
$matches[$i]->{'right'};
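For reference, the sentence-sized context mentioned above can be requested with a minimal variation of Example 12 (a sketch only, assuming the value sentence is passed through the same window parameter; the handling of the result is identical):

# retrieve whole sentences instead of a fixed window of words
my @sentences = $Jaguar->kwic(query  => "innate",
                              window => "sentence");

print "\nConcordances: " . scalar(@sentences);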
5 Sorting of n-grams
5.1 General Description
We can define an n-gram as a sequence of n units in a text. Thus, we can imagine a text as a series
of points in a line W and w_i as the position of the i-th element in such line. A bigram (an n-gram of
n=2) is then defined as:

w_i w_(i+1)

Similarly, a trigram (an n-gram of n=3) is defined as:

w_i w_(i+1) w_(i+2)

A unit can be a character, a word, a Part of Speech tag or something else, depending on how one
defines a token. If we define tokens as single words, then the sequence New York would be a
bigram, and New York Post a trigram. If we sort the bigrams of the collection of articles of a
newspaper like the New York Post by decreasing frequency order, of course we will find that the
bigram New York is among the most frequent. However, there will also be other frequent but less
interesting bigrams such as of the. If the purpose of an n-gram analysis is to find significant or
interesting combinations of words such as collocations or multiword expressions, then this kind of
non-interesting bigrams can be filtered out using a stoplist, a list of function words (such as the,
and, that, for, are, this, not, with, etc.) which are highly frequent in most corpora and are, thus,
non-informative. Chapter 6, however, presents a series of more sophisticated alternatives, mainly by
sorting n-grams not just by decreasing frequency order but by the statistical significance of the
combination of their elements.
Sorting n-grams by decreasing frequency is a fairly simple operation and it can be done with a few
lines of code in any programming language. The function for n-gram frequency sorting accepts a text
or corpus as input, tokenizes the text (separating words from punctuation marks and other symbols)
and sorts the n-grams, filtering stopwords if instructed to do so. Table 4 shows an example with the
30 most frequent bigrams from the corpus downloaded earlier from the web with the query
expression Chomsky Kant. The table shows the frequency rank, the bigram, the absolute frequency
and the relative frequency.
Rank  Unit  Absolute Freq.  Relative Freq.
1) noam chomsky 99 0.00083898
2) new york 51 0.00043220
3) innate knowledge 43 0.00036440
4) human language 41 0.00034745
5) human nature 39 0.00033051
6) university press 30 0.00025424
7) universal grammar 27 0.00022881
8) jean piaget 26 0.00022034
9) standard argument 26 0.00022034
10) intuitiondeduction thesis 24 0.00020339
11) continuum studies 24 0.00020339
12) accept cookies 24 0.00020339
13) human beings 24 0.00020339
14) color experiences 23 0.00019491
15) reflecting abstraction 23 0.00019491
16) cognitive structures 22 0.00018644
17) knowledge thesis 22 0.00018644
18) spoken language 21 0.00017796
19) external world 20 0.00016949
20) problem situation 20 0.00016949
21) generative grammar 19 0.00016102
22) language acquisition 19 0.00016102
23) golden rule 19 0.00016102
24) cognitive-behavioral therapy 18 0.00015254
25) united states 18 0.00015254
26) cognitive therapy 17 0.00014407
27) innate concept 17 0.00014407
28) sense experience 17 0.00014407
29) knowing what 17 0.00014407
30) chomsky hopes 16 0.00013559
Table 4: Sorting bigrams by frequency
If we look at the relation between the frequency of a bigram and its rank, we can see that frequency
is (roughly) a function of the rank, such that:

f(r) = c / r

where c is a constant value obtained by multiplying frequency by rank:

c = f · r

Among linguists, this function is attributed to J.B. Estoup, who described it in 1916, but it was later
popularized by G. Zipf from 1935 onwards. The interest in the so-called Zipf Law began to decay,
however, after the work of Mandelbrot (1961, 1983), who reformulated it in order to reach a better
fit, particularly in the upper and lower portions of the frequency rank:

f(r) = P(r + p)^(-B)
In Mandelbrot's formula, p and B are constant parameters, although Herdan (1964) believes these
parameters actually depend on the size of the corpus.
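As a rough illustration of the rank-frequency relation above, the following toy sketch in plain Perl (independent of Jaguar; the word counts are invented) sorts a frequency list and prints the product f · r for each rank, which under Zipf's law should stay roughly constant:

use strict;

# invented frequency counts, for illustration only
my %freq = (the => 120, of => 90, language => 30, grammar => 15, innate => 10);

my @sorted = sort { $freq{$b} <=> $freq{$a} } keys %freq;

my $rank = 0;
for my $word (@sorted) {
    $rank++;
    printf "%2d) %-10s f=%4d  f*r=%4d\n",
           $rank, $word, $freq{$word}, $freq{$word} * $rank;
}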
As already said, these results can be filtered with the help of a stoplist, not listing any bigrams that
start or end with a member of such stoplist. Another alternative is to use a (previously acquired)
model of the language of the texts, which is nothing else than a collection of documents
representative of general language (mostly press articles). The program contains this kind of model
for most European languages, and it is fairly easy to build a new model, because the model is
nothing else than a set of n-gram frequency lists. With the help of these models, the program is able
to eliminate from the n-gram lists those units that have a frequency in the model greater than a value
passed as a parameter. In a sense, this works like a gradually adjustable stoplist.
5.2 Use of the interface
For the extraction of n-gram frequency lists, the user must follow the n-grams link in the left
navigation bar to find a query form such as the one shown in Figure 12.
Figure 12: Form for the extraction of n-grams
Figure 13: Result of an extraction of n-grams
The parameters that can be set are the following:
a) the number of n, which can be 1 for the extraction of single words, 2 for bigrams, 3 for
trigram and so on.
b) to list only n-grams that include a given component
c) to ignore n-grams that include numbers
e) to ignore n-grams that are shorter than a given number of characters
f) to ignore n-grams with an extension greater than a given number of characters
h) to ignore n-grams that begin or end with words that have an absolute frequency in the
reference corpus greater than a given number
i) the language of the corpus (if not already detected)
j) to ignore n-grams that begin or end with a member of the stoplist
k) the minimum frequency
l) the maximum number of results
5.3 Use of the module
Example 13 shows how to perform a query to the module at any number of n.
Download this example
my @bigrams = $J->ngrams(frecMin       => 2,
                         n             => 2,
                         ignoreNumbers => 1,
                         stoplist      => 1,
                         resultMax     => 1000
                        );
for my $i (0..$#bigrams) {
    print "\n" . ($i+1) . ") " . $bigrams[$i]->[0] . "\t" . $bigrams[$i]->[1];
}
Example 13: Method for the extraction of n-grams
As in the case of the extraction of concordances, the result of the n-gram extraction is an array, and
we need a loop to print the array elements. It is a bi-dimensional array, meaning that each array
element is itself a reference to another array. The first element of this second array is the n-gram,
the second is the absolute frequency, and the third is the relative frequency ($bigrams[$i]->[2],
multiplied by 1000 to make it easier to read).
As in Section 5.2 , the parameters that can be adjusted are the following:
n => the number of n, which can be 1 for the extraction of single words, 2 for bigrams, 3 for
trigram and so on.
component => to list only n-grams that include a given component
ignoreNumbers => to ignore n-grams that include numbers
sizeMin => to ignore n-grams that are shorter than a given number of characters
sizeMax => to ignore n-grams with an extension greater than a given number of characters
reFrec => to ignore n-grams that begin or end with words which have an absolute
frequency in the reference corpus greater than a given number
language => the language of the corpus (if not already detected)
stoplist => to ignore n-grams that begin or end with a member of the stoplist
freqMin => the minimum frequency
resultMax => the maximum number of results
order => sorting will be by decreasing frequency by default, however, order can also be set
to alphabetical.
punctuation => n-grams with punctuation marks are not sorted by default, however if this
parameter is active they will be sorted.
hashResult => if set to the value 'hash', it will return a hash with the n-grams as keys and the
frequencies as values, instead of an array which is the default behavior. The value
'fullHash', in turn, will return the full structure with the key 'abs' for the absolute
frequency, 'rel' for the relative frequency, 'mod' for the expected frequency (which is the
frequency of an element in the language model) in case the reFrec parameter is active.
raw => This parameter can be activated if the user knows that the corpus is clean, that is, if it
does not contain non-linguistic elements and other noise typically found in corpora
downloaded from the web. Its activation significantly speeds up the processing.
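As an example of the hashResult parameter described above (a sketch only: the parameter names and the keys 'abs' and 'rel' are taken from the list above, and it is assumed that the method returns a plain hash in this mode):

# return the n-grams as a nested structure instead of an array
my %result = $J->ngrams(n          => 2,
                        frecMin    => 2,
                        stoplist   => 1,
                        hashResult => 'fullHash'
                       );

for my $ngram (keys %result) {
    # 'abs' = absolute frequency, 'rel' = relative frequency
    print "$ngram\t$result{$ngram}{'abs'}\t$result{$ngram}{'rel'}\n";
}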
6 Measures of association
6.1 General description
It was already mentioned in the previous chapter that an important property of the combination of
lexical units in discourse is the fact that certain units show a significant statistical association while
others do not. The study of this phenomenon is at the heart of the analysis of collocations, one of
the most active fields of research in linguistics. The techniques that are offered in the present
chapter can be applied to the analysis of collocations, but not exclusively, because linguists or
terminologists may have other motivations to use the same techniques. A syntagmatic construction
which is consolidated in a given corpus may be, for instance, indicative of the coinage of a
multiword term. Another possible application could be to find words or terms which are
semantically or conceptually associated, rather than forming a single terminological unit. The way
in which this part of the program is structured follows the analysis of collocations as presented by
Manning and Schütze (1999).
The first type of analysis that the program offers is the study of co-occurrence on the basis of the
variance that is found in the relative distance that two lexical units show in a context window. The
rest of the measures commented in this chapter, t-score, chi-square and Mutual Information, do not
take distance between collocates into account (assuming that they are in consecutive order) but
rather the statistical significance in terms of frequency of co-occurrence. Optionally, these measures
can be applied using reference corpora of general language, which help to inform about normal
combinations in the language in order to exclude them from the analysis, given that these combinations
are usually uninteresting for terminologists. These measures may be complementary to those described
in the next chapter, which take into account not only the frequency of occurrence or co-occurrence
but also the way in which units are distributed across the documents or given partitions
of the corpus.
6.1.1 Mean, variance and standard deviation of lexical co-occurrence
The measures studied in this section can be useful to study words that are syntagmatically related
but do not necessarily appear in consecutive order, as was the case in the analysis of n-grams
described in Chapter 5. Here we can analyze more complex situations in which two units appear at
different distances from each other but still show a tendency or pattern.
The histogram in Figure 14 shows the co-occurrence pattern of the words stock and market in the
corpus that was already used in Chapter 5. The histogram shows the contexts of occurrence of the
word market, which is in position 0 of the histogram, and then 10 positions to the left and right. The
vertical axis shows the absolute frequency of the collocate, stock, in every position. The histogram
shows, in this case, that position -1 is the most frequent, thus it can be concluded that,
according to this sample, stock market is a consolidated syntagmatic expression.
Figure 14: Histogram of stock and market
A similar situation is shown in Figure 15 for the pair stock and value. In this case we have the word
stock in position 0 and we see that most of the occurrences of the word value are concentrated in
position +1, thus we can conclude that stock value is a consolidated expression.
Figure 15: Histogram of stock and value
The same situation is given again in Figure 16 for the bigram capital stock, where stock is in
position 0 and most occurrences of capital are concentrated in position -1.
Figure 16: Histogram of stock and capital
Figure 17: Histogram of stock and place
Figure 17 shows, instead, a different situation for the pair stock and place. Both units seem to have
a strong co-occurrence in the same context window; however, their collocational distance does not
show a pattern. Having the word stock in position 0 of the histogram, we can see that the
occurrences of place are evenly distributed to the left and right without indicating a pattern.
In order to characterize these tendencies (and to rank collocates accordingly), we can use the widely
known measures of central tendency: mode, mean and standard deviation. The mode is simply the
most frequent value within a sample. In the examples above, it was -1 for Figure 14, +1 for Figure
15, and so on. The mean is the sum of all values divided by the number of values in the sample. In
Equation 2, the mean of sample X is expressed as the expected value of the sample.

E(X) = Σ x · p(x)     (2)

The standard deviation, expressed as σ and defined in Equation 3, is a way to measure the
difference between the individual values of a sample with respect to the mean. n is the number of
values in the sample, x_i is each individual value and x̄ is the mean of the sample.

σ = sqrt( Σ_{i=1..n} (x_i − x̄)² / (n − 1) )     (3)
The standard deviation is useful to determine which pairs of units show a tendency to appear at a
determined distance. As we saw in the examples above, even when that distance is not fixed, it will
show up as a peak at some part of the histogram. Pairs of units such as those shown in Figures 14,
15 and 16 show a tendency of the values to concentrate in certain positions. This is reflected by the
standard deviation, which can thus be used to sort the units.
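To make the three statistics concrete, the following plain Perl sketch (independent of the Jaguar methods presented in Section 6.3; the sample of distances is invented) computes the mode, the mean and the standard deviation of Equation 3:

use strict;

# invented sample: signed distances between two words in several contexts
my @dist = (-1, -1, -1, 2, -1, -1, 3, -1);

# mode: the most frequent value in the sample
my %count;
$count{$_}++ for @dist;
my ($mode) = sort { $count{$b} <=> $count{$a} } keys %count;

# mean: the sum of the values divided by their number
my $sum = 0;
$sum += $_ for @dist;
my $mean = $sum / @dist;

# standard deviation (Equation 3), with the n-1 denominator
my $ss = 0;
$ss += ($_ - $mean) ** 2 for @dist;
my $sd = sqrt($ss / (@dist - 1));

printf "mode=%d mean=%.2f sd=%.2f\n", $mode, $mean, $sd;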
6.1.2 Measures of association based on frequencies
This section of the program is to be used as an instrument for determining the statistical significance
of a given pair of co-occurring units, or to obtain a list of all pairs of units ordered according to this
significance (instead of just by plain frequency, as in Chapter 5). The measures that are described
are very well known in linguistic circles: t-test, chi-square and Mutual Information. In a sense, the
measure described in Section 6.1.1 is also based on the frequency of co-occurrence; however, in this
case the measures do not take into account the positions within the context of every occurrence.
Thus, the reason why these measures are grouped under this title is that, as currently implemented,
they are based on plain frequencies.
The main problem that these measures are supposed to address is to determine whether a certain
co-occurrence is relatively highly frequent because both elements form a syntagmatic unit, or if, on
the contrary, the co-occurrence is very frequent only because one or both of its elements are also very
frequent and also engage with other units. To put it very graphically (and excuse the metaphor), if
instead of word co-occurrences we were measuring monogamy in committed relationships, then
these measures could be applied in exactly the same way: those couples which tend to be faithful
and not to engage in relations with other people (or not as frequently) will obtain high ratings by
these measures. This happens independently of the absolute frequency of the relations of a
particular couple, because the score is always relative to the frequency of the relations they might
have with others.
In statistical terms, we can cast the problem using a null hypothesis according to which, in this case,
the elements of the analyzed pair are not intrinsically related, thus their co-occurrence is not
motivated but instead given by chance. This is to say that the probability of co-occurrence of both
elements is the same as the product of their individual probabilities, as shown in Equation 4.

P(w1 w2) = P(w1) P(w2)     (4)
In general, however, what a linguist will want is not to test a null hypothesis of this type, that is, to
determine whether a given co-occurrence pair is or is not statistically significant. More frequently,
in the application of quantitative methods in linguistics one is more interested in discovering new
data from the corpus than in testing a particular case. Therefore, the more convenient way to
apply these measures is to sort all the pairs of units of the corpus according to their significance
value.
Before starting the description of each measure, it is important to distinguish between their
application with and without parameters. This difference is not of secondary importance, because it
determines whether or not a previous model of the analyzed language is to be used to obtain the
expected word frequencies (the notion of language model was introduced in Section 5.1). Without
the application of this parameter, the program has no way to know that a word such as what is more
frequent in general language (and less informative) than another one such as haploid. Instead, the
program has to rely on data obtained from the same corpus it is analyzing, and this is only
possible when analyzing a corpus of considerable size (more than half a million tokens). Both
possibilities have advantages and disadvantages. The application of the model always entails a
genre and lexical bias, because in all cases the models are made from press articles.
T-test
As in the case of the rest of the measures that will be described below, the t-test is used here with the purpose of reordering lists of bigrams such as those shown in Chapter 5, this time by their t-score. As shown in Equation 5, the test consists of comparing the mean of an observed sample (\bar{x}) with the mean it should have if the null hypothesis were true (\mu, the expected frequency), in relation to the variance (s^2) and size (N) of the sample.
t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}}    (5)
Back to our example of the bigram stock market, we can analyze its significance using this time a corpus of 13 megabytes of plain text (a random sample of 100,000 sentences) downloaded from the Wall Street Journal and the Financial Times. The bigram in question appears 281 times in this corpus, while the component market occurs 2,521 times and stock 1,507. These are only approximate figures, because in order to determine the exact frequency of occurrence of each element one should take into account tokenization problems derived from the fact that there will also be occurrences of the plural form markets (676 occurrences) and in initial uppercase letter, Market (79). The same happens in cases such as Stocks (27) and Stock (449). If we ignore letter case, we obtain 3,276 occurrences of market and 1,983 occurrences of stock. For convenience, we will ignore other realizations such as stock-market (19) and stockmarket (5). The analyzed corpus has an extension of 2,121,057 tokens, which means that the probability of occurrence of the bigram stock market, in the case of the null hypothesis being true, should be:

P(stock market) = P(stock) P(market)
= (1,983 / 2,121,057) x (3,276 / 2,121,057)
≈ (9.3491 x 10^{-4}) x 0.0015 ≈ 1.4023 x 10^{-6}
The expected frequency of stock market as a random event is very low. The observed frequency is instead much higher:

(281 / 2,121,057) = 1.324 x 10^{-4}

The question is how to determine whether the difference between the observed and expected frequency can be produced by mere chance or not. Replacing the symbols in the t-test by the numbers we have in this case, as shown in Equation 6 (\bar{x} and s^2 are expected to be approximately the same in large samples), we obtain the value 16.58. As a reference, the probability of obtaining a result of 2.576 by chance is 0.005, and the higher the number, the less probable it is that the result is given by chance. Consequently, we can reject with confidence the null hypothesis that states that the elements stock and market are statistically independent.
t = \frac{1.324 \times 10^{-4} - 1.402 \times 10^{-6}}{\sqrt{1.324 \times 10^{-4} / 2121057}} \approx 16.58    (6)
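The arithmetic of Equation 6 can be reproduced with a few lines of Perl. This is only a sketch of the calculation using the frequencies quoted above, not the code that Jaguar runs internally; the variable names are ours.

use strict;
use warnings;

my $N        = 2_121_057;   # tokens in the corpus
my $f_stock  = 1983;        # occurrences of "stock" (ignoring case)
my $f_market = 3276;        # occurrences of "market" (ignoring case)
my $f_bigram = 281;         # occurrences of the bigram "stock market"

# Expected probability under the null hypothesis (Equation 4)
my $expected = ($f_stock / $N) * ($f_market / $N);

# Observed probability of the bigram
my $observed = $f_bigram / $N;

# t-score (Equation 5), approximating the variance by the observed mean
my $t = ($observed - $expected) / sqrt($observed / $N);

printf "t = %.2f\n", $t;    # prints a value close to 16.58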
Chi-square
The chi-square test is another way to measure the difference between some observed and expected values. As was the case with the previous measure, usually one is not interested in testing the significance of a particular pair of units, but rather in sorting the bigrams of a corpus according to the score they obtain with the chi-square. To apply the test to co-occurrence data, the data has to be tabulated in order to register all the possible events. These events are represented in the 2 x 2 contingency table shown in Table 5: O11, in the upper left cell, represents the number of times terms t1 and t2 co-occur in the same context (a sentence or a fixed-size context window). O12, the upper right cell in the table, represents the number of contexts of occurrence of term t2 without t1. O21, in the lower left corner, represents the number of contexts of occurrence of t1 without t2. O22 is the number of contexts where neither of them occurs. N is the total number of tokens. R and C represent the marginal frequencies, the sum of the frequencies of the rows and columns, respectively. The marginal frequencies are used to calculate the expected frequencies, as shown in Table 6.
            t1          ¬t1
t2          O11         O12         = R1
¬t2         O21         O22         = R2
            = C1        = C2        = N
Table 5: 2 x 2 contingency table
            t1                      ¬t1
t2          E11 = (R1 x C1) / N     E12 = (R1 x C2) / N
¬t2         E21 = (R2 x C1) / N     E22 = (R2 x C2) / N
Table 6: Expected frequencies computed from the marginal frequencies
As defined in Equation 7, with n as the total number of cells in the table, the rationale of the chi-square test is to sum the differences between observed and expected values under the assumption that, the larger this total difference is, the less probable it is that the frequency of co-occurrence is due to chance.
X^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}    (7)
In the particular case of a 2 x 2 table, the test can be computed directly from the observed values, as shown in Equation 8.

X^2 = \frac{N (O_{11} O_{22} - O_{12} O_{21})^2}{(O_{11}+O_{12}) (O_{11}+O_{21}) (O_{12}+O_{22}) (O_{21}+O_{22})}    (8)
To exemplify with a given bigram, Table 7 shows the case of the bigram more often. In the same corpus used earlier in this section, there are 12 occurrences of the expression more often. In this corpus there are 454 bigrams in which the second component is often but the first one is not more; there are 4,943 in which the first is more and the second is not often, and there are 2,115,647 bigrams in which neither of the two components is more or often.
                    w1 = more                   w1 ≠ more
w2 = often          12                          454
                    e.g. "more often"           e.g. "happens often"
w2 ≠ often          4943                        2115647
                    e.g. "more frequently"      e.g. "almost never"
Table 7: Contingency table for more often
If we apply Equation 8 to our numbers, as shown in (9), we obtain a number close to 109.646. A chi-square value of 3.841 has a 0.05 probability of being the result of pure chance, and any number higher than that is even less probable, thus we can be confident that the combination is not random.
X^2 = \frac{2121056 \, (12 \times 2115647 - 454 \times 4943)^2}{(12+454)(12+4943)(454+2115647)(4943+2115647)} \approx 109.646    (9)
As said earlier in the description of the t-test, usually one is not interested in determining whether the combination was or was not the product of chance (after all, we already knew that more often is not a random combination because language is not random). As in the case of the rest of the measures described in this chapter, the purpose of the application of the chi-square test is to rank the bigrams or pairs of lexical units of a corpus according to the significance of their association.
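As a hedged illustration, the 2 x 2 computation of Equation 8 applied to the counts of Table 7 reduces to the following Perl sketch; this is plain arithmetic, not Jaguar's internal implementation.

use strict;
use warnings;

# Observed counts from Table 7 for the bigram "more often"
my ($O11, $O12, $O21, $O22) = (12, 454, 4943, 2_115_647);
my $N = $O11 + $O12 + $O21 + $O22;

# Chi-square for a 2 x 2 contingency table (Equation 8)
my $chi2 = $N * ($O11 * $O22 - $O12 * $O21) ** 2
         / ( ($O11 + $O12) * ($O11 + $O21) * ($O12 + $O22) * ($O21 + $O22) );

printf "chi-square = %.3f\n", $chi2;    # close to 109.6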
Mutual Information
The concept of Mutual Information (MI) derives from Information Theory and represents the amount of information given by an event i about a possible event j (Church and Hanks, 1991; Manning and Schütze, 1999). As shown in Equation 10, it is the ratio between the probability of co-occurrence of events i and j and the product of their independent probabilities, expressed in binary digits by the logarithm of base 2.

MI(i, j) = \log_2 \frac{P(i, j)}{P(i) P(j)}    (10)
As in the previous cases of the t-test and chi-square, the higher the MI weight of a given combination, the higher its significance. In an extreme case, the highest MI would be obtained when one event only occurs when the other occurs. In the opposite case, the lowest MI would be for the case in which, given one event, a large number of different possible events can occur. As in most measures of this kind, MI should not be applied to very infrequent events. If we apply the MI formula to the case of the bigram more often, analyzed above, as shown in (11), we obtain a value of 3.4624 bits. As already mentioned for the previous measures, this value is not very meaningful in isolation; it is rather used to sort the bigrams (or pairs of units) of a corpus according to their MI score. Daille (1994) reports good results for the extraction of multiword terminology using a variant of MI, the Cubic Mutual Information, which is similar to the original but with the numerator raised to the power of 3, as shown in Equation 12.
MI(more, often) = \log_2 \frac{12/2121065}{(4955/2121065) \cdot (466/2121065)} \approx 3.4624    (11)
MI3(i, j) = \log_2 \frac{P(i, j)^3}{P(i) P(j)}    (12)
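Both variants reduce to a few arithmetic operations. The sketch below reproduces the value in (11) for more often with plain Perl; log2 is defined from the natural logarithm, and the counts are those used above.

use strict;
use warnings;

my $N       = 2_121_065;    # number of bigrams in the corpus
my $f_pair  = 12;           # co-occurrences of "more often"
my $f_more  = 4955;         # occurrences of "more"
my $f_often = 466;          # occurrences of "often"

sub log2 { return log($_[0]) / log(2) }

my $p_pair = $f_pair  / $N;
my $p_i    = $f_more  / $N;
my $p_j    = $f_often / $N;

my $mi  = log2( $p_pair      / ($p_i * $p_j) );   # Equation 10
my $mi3 = log2( $p_pair ** 3 / ($p_i * $p_j) );   # Equation 12 (Daille, 1994)

printf "MI = %.4f bits, MI3 = %.4f bits\n", $mi, $mi3;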
6.2 Use of the interface
6.2.1 Association measures based on the distance of the co-occurring units
This part of the interface of the program offers the following possibilities. The first one is the computation of statistical association using the standard deviation of the sample of distances between collocates, and is appropriate to find collocates that, even when the distance between them is flexible, show a tendency to appear at a given distance (two or three positions, etc.), as was already explained in Section 6.1.1, in the general description offered by this chapter.
The corresponding section of the interface is reached by following the link association in the left navigation bar. This will lead to the two possibilities of Histograms of lexical distances and association coefficients between lexical units. Following the first option, the user will find a form for querying the system, as shown in Figure 18. The parameters are, firstly, the query expression in question, the number of histograms in the output, and the criterion to rank the histograms, which can be the variance, the mode or the frequency.
Figure 18: Form for the extraction of collocations according to variance
The rest of the parameters are similar to the ones shown in the form for the sorting of n-grams (Section 5.2): the possibility to ignore numbers, to ignore units shorter than a given length (in number of characters) or to ignore units which have a frequency in the language model equal to or greater than a given number, the language of the corpus, the inclusion of a stoplist and a minimum frequency of the units in the corpus.
This interrogation will have as a result a list of histograms, as shown in Figure 19, with the analysis of the collocates of a given query expression, as described in Section 6.1.1. The order of the histograms is given by the selected parameter, among variance, mode and frequency. Each histogram includes a randomly selected example of context of occurrence of both units.
6.2.2 Measures of association based on frequency
This section is divided mainly into two options. One of them is to rank all the n-grams of the corpus according to their syntagmatic association with a term introduced by the user. The second option is to sort all the n-grams of the corpus according to the association of their components.
Figure 20 shows a screenshot of the form for the analysis with these association measures. First, there is an optional input field in case the user only wants to list n-grams which have a specific component. The following two buttons offer the alternatives of listing only n-grams which include the component entered in the previous input field, or listing all indexed n-grams. Subsequently, the user will find two checkboxes with the options to ignore n-grams that contain numbers or punctuation signs. Then comes the choice of the desired association measure (t-test, chi-square, Mutual Information or Cubic Mutual Information). The next checkbox activates the use of a reference corpus of general language (whose function has been explained in Section 6.1). Finally, the language of the analyzed corpus can be defined manually if it has not been automatically detected already.
Figure 19: Units that co-occur with market in a corpus downloaded from the web
6.3 Use of the module
As has already been explained in the previous sections, there are basically two ways of using this module for the analysis of the corpus with measures of association. The first one is the study of the lexical distances by means of histograms. The second is to sort the n-grams of the corpus according to their syntagmatic association.
The first way to use the module is shown in Example 14. This method will return the results either in html format or as a data structure. Some of the parameters that can be adjusted have already been introduced in Section 5.3; however, there are also new ones.
Figure 20: Form for the sorting of n-grams according to statistical measures
Download this example
my @a = $J->association(
    frecMin    => 5,
    measure    => 'variance',
    query      => "something",
    ignumer    => 1,
    stoplist   => 1,
    sincontrol => 1,
    reFrec     => 200,
    verbose    => 0,
    result     => "html",
    context    => 10,
    image      => "./img/",
    link       => "http://melot.upf.edu/cgi-bin/jaguar.pl?nav=kwic&query=",
    histograms => 30,
);
Example 14: Method for the analysis of collocations using histograms of lexical distance
query => This is the query expression, which is necessary to perform the analysis.
order => This is the criterion according to which the histograms will be sorted. The three possible
values are frequency, mode or variance. The default value is mode.
frecMin => The minimum frequency of the units in the corpus.
context => Extension of the context window in number of words. In this case context windows
cannot have the extension of a sentence because histograms need to have a regular size.
stoplist => If activated with any true value, the program will ignore n-grams that begin or end with
a member of the stoplist.
resultMax => The maximum number of results.
With respect to the way to manipulate the resulting data structure, the procedure is shown in Example 15. What we have there is a loop that runs through an array arbitrarily named @a, as shown in Example 14, which is the variable that contains the results. The code in Example 15 might seem complicated because it shows loops inside other loops; however, this is necessary since the result is a complex data structure. This code would not be necessary if the user only needs to print the result in html format. However, data structures have a lot more potential.
Download this example
foreach my $i (@a) {
    print "\n\n";
    foreach my $ii (keys %{$i}) {
        if ($ii eq "histogram") {
            foreach my $iii (@{$i->{$ii}}) {
                foreach my $iiii (keys %{$iii}) {
                    print "\n\t$iiii=>$iii->{$iiii}";
                }
            }
        } else {
            print "\n$ii=>$i->{$ii}";
        }
    }
}
Example 15: Loops to print the result of the analysis of collocations with histograms
Let us analyze the data structure in its different levels. In the first level we find an array of the different collocates of the query. The order of the elements in the array is given by the parameter order that was set before the analysis. Each element in this data structure is presented in the form of attribute-value pairs, which can be numeric or alphanumeric, depending on the case. The most complex element is the one named histogram, because it is the one that contains the data for the histograms. This element is actually a reference to another list, which in turn has elements that are also presented as attribute-value pairs.
For illustration, consider Example 16, where we see the result of the first cycle of an execution of the module with the query expression palabra (word, in Spanish) in a corpus consisting of all the editions of EL PAIS newspaper between January and February 2007. The first collocate, and thus the first cycle of the execution, corresponds to the word tomó (took), which is typically found in the collocation tomar la palabra (to take the floor). The key -10 in the hash histogram corresponds to the position -10 in the histogram, that is, ten positions to the left of the target word, palabra in this case, located in position 0.
form=>tomó
histogram=>(
-10=>0
-9=>0
-8=>0
-7=>0
-6=>0
-5=>0
-4=>2
-3=>0
-2=>7
-1=>0
0=>0
1=>0
2=>0
3=>0
4=>0
5=>0
6=>0
7=>0
8=>0
9=>0
10=>0
)
mode=>-2 (7 cases)
example=>...banda actuó ayer de portavoz y, cuando tomó la
palabra las más de 500 personas congregadas en el frontón
guardaron...
standardev=>0.87
variance=>0.75
frequency=>9
mean=>-2.20
Example 16: Data structure of the result of the analysis with histograms of the Spanish word palabra
The rest of the ways to use the module to extract n-grams by measures of association are based on co-occurrence frequencies and not on the distance between collocates, as was already explained in Section 6.1.2. Example 17 shows that the way to use the module in this case is very similar to the previous one (most of the parameters have already been used in Section 5.3). Basically, the only new parameter is measure, which is the one that determines how the resulting list of n-grams will be sorted.
Download this example
my @a = $J->association(
    n          => 2,
    frecMin    => 5,
    measure    => 'MI',
    ignumer    => 1,
    stoplist   => 1,
    sincontrol => 1,
    verbose    => "very",
    result     => "html",
);
Example 17: Method for the analysis with measures of association
n => As it was the case in the list of n-grams sorted by frequency, n is the number of components
that the n-grams will have. Typically, these will be bigrams (the default value), but in
principle there is no reason not to do the analysis with larger values of n.
measure => The value of this parameter is MI by default, but it can also be t-score, chi-square
or MI3.
component => To list only n-grams that include a given component.
result => If set to html, the results of this analysis will be printed as an HTML table. By default,
however, the result will be an array with the lexical units as elements sorted according
to their corresponding value, as determined by the selected association measure. Each
element of this array is a hash table, where the key is the unit itself and the value the
coefficient obtained with the selected measure.
reFrec => This is the parameter that activates the use of a reference corpus of general language,
which was commented on in Section 6; it provides the expected frequency of a given
lexical unit. It is obviously necessary that the language of the corpus is one of the
languages for which the program has a model.
As in the previous case of the analysis with histograms, the code in Example 18 shows how the resulting data structure can be traversed with the help of a loop. In this case we do not have a complex data structure as was the case with histograms, therefore only two loops are needed. The first one is to traverse the elements of the array. It is important to remember that the order of the elements in the array is given by the value obtained with the association measure. The most informative elements will be in the first positions and the least interesting at the end. Each element of the first array is in turn a hash table, thus we need to run a new loop inside each element. The keys of this hash are the n-gram in question, its frequency in the corpus, the expected frequency (obtained from the model of general language) and, finally, the value obtained with the selected measure. As was explained in Section 6.1, this score has no intrinsic value. It is there for the only purpose of ranking the set of units of the analyzed corpus.
Download this example
foreach my $i (@a) {
    foreach my $ii (keys %{$i}) {
        print "\n\n\n$ii=>$i->{$ii}";
    }
}
Example 18: Loop to print the results of the analysis with measures of association
Example 19 shows the first three positions of the array that results from the application of measures of association to the same corpus of EL PAIS newspaper used in Example 16. It can be seen that all three of them correspond to proper nouns that, even when they do not have a high frequency in the analyzed corpus, obtain a high score with Mutual Information because they have null frequency in the language model of Spanish that the program has. Example 20, on the contrary, shows the last three positions of the array, which correspond to the least informative units, units which have a high frequency in the reference corpus and, therefore, receive a low Mutual Information score.
forma=>hasty pudding
frec=>3
MI=>19.1928
mod=>||
forma=>kyra sedgwick
frec=>3
MI=>19.1928
mod=>||
forma=>yannis kontos
frec=>3
MI=>19.1928
mod=>||
Example 19: First three positions of an analysis of bigrams using Mutual Information
forma=>ayer le
frec=>6
MI=>-2.8558
mod=>4035|7644|
forma=>menos le
frec=>3
MI=>-2.8715
mod=>2357|7644|
forma=>todos le
frec=>3
MI=>-3.2470
mod=>2986|7644|
Example 20: Last three positions in the same analysis
7 Measures of term distribution
7.1 General description
In this section, Jaguar offers alternatives for the study of the distribution of the occurrences of a term or terms, either within a document divided into parts or across a collection of documents. The total frequency of a given term in a document can be useful to evaluate the relevance of such term to the particular document. However, we can also obtain more information if we take into consideration not only the frequency of occurrence but also the way in which the occurrences are distributed, i.e., whether most are concentrated at the beginning of the document or some other section, or whether they are more or less homogeneously distributed. If we observe that the occurrences of a term are concentrated at the beginning of the document, we might suspect that such term is only used in that document to introduce the reader to the subject but does not necessarily pertain to the core subject of discourse. If, on the contrary, the term is regularly distributed and it is not a very frequent word of the language in question, then it is to be expected that the term will be a good descriptor of the content of the text.
Because of the way the program is structured, there is no difference between the internal divisions of a document and a collection of different documents. If a document has internal divisions, the program will divide such document into different files with a suffix in the name indicating the number of the part.
The program offers different measures of the distribution of terms across the parts of the documents, extracting lists of terms sorted by the weight such terms obtain with measures such as the document frequency, the inverse document frequency or IDF (Sparck Jones, 1972), a dispersion coefficient (Juilland and Chang-Rodríguez, 1964) and a distribution measure original to this program.
Let us analyze a first example. One of the functions of the program is to accept a given term or set of terms and plot the occurrences of such terms within the document or corpus. Using, again, the English version of the Critique of Pure Reason downloaded from the web, Table 8 shows the result of plotting the relative frequency (relative to the size of each part) of the terms intuition, empirical and concepts.
Section      Form        Relative Frequency   Absolute Frequency
antin.txt    intuition   0.00148604           57
             empirical   0.0043799            168
             concepts    0.00130354           50
ancon.txt    intuition   0.00665949           198
             empirical   0.00386789           115
             concepts    0.00565048           168
prefs.txt    intuition   0.0011334            23
             empirical   0.00147834           30
             concepts    0.00231607           47
acon.txt     intuition   0.00665949           198
             empirical   0.00386789           115
             concepts    0.00565048           168
dmeth.txt    intuition   0.00137657           52
             empirical   0.00174719           66
             concepts    0.00288551           109
paral.txt    intuition   0.00231858           77
             empirical   0.00198735           66
             concepts    0.00246914           82
ideal.txt    intuition   0.00023289           8
             empirical   0.00256179           88
             concepts    0.00253268           87
anpri.txt    intuition   0.00410885           183
             empirical   0.00287395           128
             concepts    0.00386187           172
aesth.txt    intuition   0.0108451            106
             empirical   0.00276243           27
             concepts    0.00194393           19
Table 8: Results of the analysis of the distribution of a set of terms
The function described above can be useful for exploratory or visualization procedures; however, what will more often be needed is not to analyze the case of a particular term or set of terms but rather to obtain lists of all the terms or n-grams of a corpus according to the way they are distributed.
Different measures have different effects; however, all of them are based on the study of the relations that exist between the three variables presented in Table 9.
tf_{i,j}   Frequency of term i in document j.
df_i       Number of documents in which term i occurs.
cf_i       Frequency of term i in the whole collection.
Table 9: Variables for the analysis of term distribution in corpora
7.1.1 Inverse Document Frequency
This is an algorithm that weights a given term in a document by means of estimating the probability of occurrence of such term in such document. It is, thus, a model that formalizes the expectation for a term w_i occurring k times in a document d. Equation 13 shows how the weight is calculated, assuming that tf_{i,j} > 0, where n is the total number of documents in the collection.

w(i, j) = (1 + \log(tf_{i,j})) \cdot \log \frac{n}{df_i}    (13)
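Read this way, the weight of Equation 13 can be sketched in a few lines of Perl. The counts below are invented for the illustration and are not taken from any corpus shipped with the program.

use strict;
use warnings;

my $tf = 57;    # frequency of the term in the document (hypothetical)
my $df = 3;     # number of documents of the collection containing it (hypothetical)
my $n  = 9;     # total number of documents in the collection (hypothetical)

# Weight of Equation 13, defined only when tf > 0
my $w = (1 + log($tf)) * log($n / $df);
printf "w(i,j) = %.4f\n", $w;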
7.1.2 Dispersion Coefficient
Juilland and Chang-Rodríguez (1964) describe a measure to identify the core vocabulary of a language, the set of units that is supposed to be basic to the language and does not pertain to a specific subject field, thus excluding terminology and proper nouns. This distinction should be clear: there are lexical units whose occurrence is not conditioned by the subject of the text, while there are others that have a high content density and are thus totally dependent on the subject. A word such as example is a word of the core English vocabulary, a word that every person should know to be considered a competent speaker of English. The term haploid, on the contrary, does not form part of the core vocabulary and its occurrence in discourse will depend on the topic of the text. Referring expressions in general, such as proper nouns and specialized terminology, do not form part of the language. They are, instead, units of the knowledge of the world that a given speaker has. Obviously, both planes are inherently related, and thus we expect that a person fluent enough in a language will also have a certain degree of knowledge of the world that is (usually) referred to by the community speaking that language.
Juilland and Chang-Rodríguez reported an experiment on the Spanish language in which they analyzed a corpus with an extension of half a million tokens, divided into five parts of the same size. Each part represented a different textual genre, such as literature, technical documentation, press articles and others. By extracting those units that were more homogeneously distributed across that collection, they were able to identify those lexical units within the corpus that pertain to this set of the core vocabulary of a language.
The dispersion index ranges from 0 to 1. Values close to 1 indicate that a given lexical unit is homogeneously distributed across the collection and, thus, closer to the set of the core vocabulary of a language. For any given lexical unit, Equation 14 defines its dispersion index (D), where V is the variation coefficient, defined in turn in Equation 15. The variable n is the number of partitions of the corpus. In Equation 15, \sigma is the standard deviation, defined in Equation 16, and tf_i the observed relative frequency of the lexical unit in the i-th partition of the corpus. Etf_i, in Equation 16, is the expected relative frequency in such partition. The most simple way of calculating the expected frequency is to divide the relative frequency of the lexical unit in the whole corpus by the total number of partitions. In this way, the expected frequency assumes that the unit is homogeneously distributed across all partitions.
D = 1 - \frac{V}{\sqrt{n-1}}    (14)

V = \frac{\sigma}{\frac{1}{n} \sum_{i=1}^{n} tf_i}    (15)

\sigma = \sqrt{\frac{\sum_{i=1}^{n} (tf_i - Etf_i)^2}{n}}    (16)
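A minimal sketch of Equations 14-16 in Perl, computing the dispersion index of a single lexical unit from hypothetical relative frequencies observed in n partitions of equal size (in that case the expected relative frequency coincides with the mean of the observed ones):

use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical relative frequencies of one unit in each of n partitions
my @tf = (0.0021, 0.0019, 0.0023, 0.0018, 0.0022);
my $n  = scalar @tf;

# Expected relative frequency under a homogeneous distribution
my $mean = sum(@tf) / $n;

# Standard deviation (Equation 16) and variation coefficient (Equation 15)
my $sigma = sqrt( sum(map { ($_ - $mean) ** 2 } @tf) / $n );
my $V     = $sigma / $mean;

# Dispersion index (Equation 14): values close to 1 indicate homogeneity
my $D = 1 - $V / sqrt($n - 1);
printf "D = %.4f\n", $D;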
7.1.3 Other measures of dispersion
Apart from the two measures described above, which are very well known in linguistic circles, Jaguar also offers the possibility to sort the vocabulary of a corpus by document frequency, which is another simple way to measure the dispersion of the vocabulary, and also a measure, original to the program, which takes into consideration both the term frequency and the document frequency in relation to a reference corpus, a previously acquired model of general language (already mentioned in Sections 5.1 and 6.1.2). This measure, defined in Equation 17, will thus assign a high score to a lexical unit i if: 1) it has a high frequency in the whole corpus, 2) it is well dispersed in the corpus (having a large document frequency) and 3) it has a low frequency in the reference corpus.

w_i = \log \frac{cf_i \sqrt{df_i}}{Ef_i}    (17)
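Following the reading of Equation 17 given above, the score for a single unit could be sketched as follows; the three counts are invented, and Ef stands for the frequency of the unit in the reference corpus plus one.

use strict;
use warnings;

my $cf = 75;   # frequency of the unit in the whole corpus (hypothetical)
my $df = 8;    # number of documents in which it occurs (hypothetical)
my $Ef = 1;    # frequency in the reference corpus plus one (hypothetical)

# Weight of Equation 17: frequent, well-dispersed units that are rare
# in the reference corpus obtain the highest scores.
my $w = log( $cf * sqrt($df) / $Ef );
printf "w = %.4f\n", $w;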
7.1.4 Diachronic analysis of the frequencies of terms
Conceptually, there is nothing new in this subsection with respect to what has been said in this chapter. This section is only meant to show another example of how these units could be applied to a particular kind of analysis. If we divide a corpus into partitions that represent periods of time, then we can undertake interesting diachronic studies of the vocabulary. To use Jaguar in this way, it is necessary to encode the names of the partitions in such a way that an order can be inferred, such as using numbers in increasing order, as sketched below.
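For instance, a corpus of yearly partitions could simply use the year as file name; a lexicographic sort of such names already reproduces the chronology. The path below is of course hypothetical.

# Hypothetical partition files: 1976.txt, 1977.txt, ..., 2007.txt
my @partitions = sort glob("/path/to/corpus/*.txt");
print "$_\n" for @partitions;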
Figure 21: Evolution of the relative frequencies of the Spanish words hombre (man) and mujer (woman) in EL PAIS
newspaper from the year 1976 to 2007
Figure 21 shows an example of a corpus partitioned in chronological order. This corpus represents the collection of all the editions of EL PAIS newspaper from the year 1976 to 2007, and the two lines plotted in the graph correspond to the relative frequencies of the Spanish words hombre (man) and mujer (woman). It is an interesting picture of the change in sexist language in the Spanish press and possibly of the fact that there are now more women taking part in events of public life. In the first years of the sample, hombre is much more frequent than mujer, but that tendency progressively corrects itself until the last year of the collection.
7.2 Use of the interface
The section of the web interface for the analysis of term distribution is divided into two main types of interrogation. The first one is accomplished by entering a term or a list of terms as input; the output will be a graph plotting the curves of the relative frequencies of the units in the corpus, along with the same data in tabulated form. The second option for the analysis consists of the sorting of all the n-grams of the collection according to the available distribution coefficients.
Figure 22 shows a screenshot of the form to introduce a term or a set of terms in order to plot how
the term is distributed across the corpus. This form appears when following the link distribution
in the left navigation bar and, subsequently, by clicking Introduce one or more queries and obtain
curves of their distribution....
Figure 22: Form for the analysis of the distribution of a given set of terms
Figure 23 shows the result of an interrogation with a set of terms in a corpus composed of the different chapters of the Philosophy of Mind by Hegel. The terms are universality, subjectivity, selfconsciousness, actuality, intuition, immediacy, individuality, infinite, judgment and finite, a set of terms that, according to their spread distribution across the chapters of the work, are key concepts in the text.
Figure 23: Plot of the relative frequencies of several terms in a collection of documents
To mention another example, using the corpus of 50 documents downloaded from the web in Section 2.2.2 with the query expression Chomsky Kant, it is now possible to analyze the distribution of different terms such as Noam Chomsky, generative grammar, context-free, context-sensitive, anarcho-syndicalism. The resulting graph would show the two facets of Chomsky, as a linguist and as a political activist. One can clearly see that Chomsky appears with different terms depending on the document. It is normal to see the name of Chomsky co-occurring in some documents with the terms context-free or context-sensitive; however, the terms Chomsky, context-free and anarcho-syndicalism will rarely be found in the same document other than an encyclopedic text.
Figure 24 shows the result of the second option for the analysis of term distribution, which is presented in the form of tabulated data, with a list of terms from the corpus sorted according to the weight given by the selected coefficient. In the case of this particular screenshot, the program is sorting the vocabulary of the Philosophy of Mind used earlier according to Jaguar's distribution coefficient (defined in Section 7.1.3). For each record in the table we can observe its frequency in the corpus (column frec); the document frequency (column docs); and the frequency of that unit in the model of general language plus 1. The last column corresponds to the weight of the unit according to the coefficient.
Figure 24: Sorting terms according to their distributional properties
7.3 Use of the module
The methods for the analysis of the distribution of terms are divided into two types. The first method accepts as input a term or a series of terms and outputs their distribution across the corpus, in graphical and tabulated data forms. The second method consists of extracting lists of terms from the corpus according to the weight they obtain by the selected distribution coefficient. As is the case for the association measures described in Chapter 6, in this section the corpus needs to be previously indexed (as described in Section 3.3.3).
Download this example
my @terms = ("universality", "subjectivity", "selfconsciousness",
    "actuality", "intuition", "immediacy", "individuality", "infinite",
    "judgement", "finite");

my @result = $J->distribution(
    frecMin   => 2,
    pathIndex => "/path/to/the/index/",
    language  => 'En',
    terms     => \@terms,
    result    => "html",
);
Example 21: Method for the analysis of term frequency distribution in a corpus
As a result of the code presented in Example 21, Jaguar will print html code for a table with the results if the parameter result has the value html. In any case, the data can be stored in an array, as in the case of the example with the array @result. The structure of this array can be sketched as follows:
$result[$case]->[$counter]->{$part}->{$form}->{'fabs'}
The first level of the array is called a case. The purpose of this first division into cases is to allow for the possibility of querying with different sets of units in a single operation, for instance, for comparative studies. The second level ($counter) is the number that defines the position of a given document or partition of the corpus within the corpus. This order is essential in particular when undertaking diachronic studies. The next level is the name of the document or partition ($part), followed by the analyzed form or term ($form) and, finally, the relative or absolute frequency of such form ('frel' and 'fabs'). Example 22 shows how to write the different loops to travel through the different levels of the array.
Download this example
foreach my $i (@result) {
    foreach my $ii (@{$i}) {
        foreach my $iii (keys %{$ii}) {
            print "\n" . $iii . "=>";
            foreach my $iiii (keys %{$ii->{$iii}}) {
                print "\n\t" . $iiii . "=>" . $ii->{$iii}->{$iiii};
            }
        }
    }
}
Example 22: Loops to travel through the resulting data structure
If, on the contrary, the user prefers to obtain lists of terms based on the weight they obtain
according to different distribution coefficients, then the way to define the interrogation is given in
Example 23.
Download this example
my @result = $J->distribution(
    frecMin    => 2,
    pathIndex  => "/path/to/the/index/",
    ignumber   => 1,
    stoplist   => 1,
    resultMax  => 10000000,
    result     => "html",
    language   => 'En',
    extraction => 1,
    measure    => "Jaguar",
);
Example 23: Method for the sorting of terms from the corpus according to distribution coefficients
The available parameters in this case are:
extraction => This is the parameter that tells Jaguar whether the interrogation method is to analyze
a term or a set of terms provided as input, or whether all the vocabulary of the corpus is to be
analyzed. If activated with any true value, the second option will be assumed. The default
behavior is to expect terms as input.
measure => (also accepted as coefficient). This parameter defines which distribution coefficient is to be
used for the sorting of terms. The values can be DocFreq, Jaguar, tfidf or
dispersion. For a detailed explanation see Section 7.1.
The rest of the parameters have already been used. The parameter result with value html will print the result of the analysis in html tables, including the graphs. If, instead of printing html, the user prefers to obtain a data structure, this will be stored, again, in an array, as in the case of the arbitrarily selected name @result. Example 24 shows how to loop through the data structure. The result of this code is given in Example 25.
Download this example
foreach my $i (@result) {
    foreach my $ii (keys %{$i}) {
        print "\n" . $ii . "=>";
        foreach my $iii (keys %{$i->{$ii}}) {
            print "\n===>" . $iii . "=>" . $i->{$ii}->{$iii};
        }
    }
}
Example 24: Loops to travel through the resulting data structure
universality=>
===>freq=>75
===>weight=>1.8808
===>docs=>8
===>mod=>1
subjectivity=>
===>freq=>73
===>weight=>1.8692
===>docs=>7
===>mod=>1
selfconsciousness=>
===>freq=>59
===>weight=>1.7782
===>docs=>4
===>mod=>1
Example 25: First three results of the code presented in Example 24
8 Measures of similarity
8.1 General description
It is possible to calculate the similarity that exists between different types of complex objects such as texts using measures of vector similarity. A vector can represent a variety of things: a document, the beam of lexical co-occurrences of a given lexical unit, the predicates associated to a given name, etc. A vector is, in essence, a sequence of values, as shown in (18). The number of different values defines the dimensionality of the vector, n, where x_i is each component.

x = (x_1, x_2, x_3, \ldots, x_n)    (18)
The most intuitive manner to conceive a vector is perhaps as the row of a matrix, as the one shown in Table 10, for a binary term x document matrix.
term1 term2 term3 ...
doc1 1 0 1
doc2 0 1 1
doc3 0 1 0
...
Table 10: Term x Document matrix
A term should be interpreted here as an abstract entity. It could also be called an event or instance. An event can represent an orthographic word, as a string of characters between blank spaces, an n-gram, a multiword expression, a noun phrase, etc., depending on user needs.
A measure of similarity between strings of characters can, among other possibilities, be used as a form of pseudo-lemmatization, when one is forced to work with non-lemmatized texts. Applied in this way, for example, a similarity coefficient could detect the relation between units such as disease and diseases, or longer strings such as variants of the same term, e.g. lung surface and surface of the lungs.
The user can analyze vectors if they are in the form of tabulated data files (plain text files). However, a typical situation would be that the user enters a string of text and the program outputs strings from the corpus which show a morphological resemblance according to a selected similarity coefficient. The same can happen with documents. The user can select a given document as reference and the program will output the rest of the documents of the collection sorted according to the similarity they have to such reference document, using one of the available similarity coefficients. The similarity coefficients available at the moment are Matching, Jaccard, Dice, Overlap and Cosine, and also the Euclidean and Manhattan distances.
8.1.1 Measures for vector comparison
Given a set of vectors, the most simple operations that can be performed are the calculation of the union, the intersection and the difference. If we have vectors that represent, for instance, lists of words and their frequency in different documents, then the program will offer the following possibilities:
1. The union vector, including all the components and the sum of their values
2. The intersection vector, with the components in common and the sum of their values
3. A set of vectors with the components that are only found in each vector
8.1.2 Similarity Coefficients
The following are the similarity coefficients currently implemented in the program (all of them described in more detail in Manning and Schütze, 1999):
Matching
This measure, defined in (19), is appropriate for the comparison of binary vectors. It simply counts the number of dimensions that two vectors have in common.

|X \cap Y|    (19)
Dice
Dice is similar to matching but it normalizes the comparison by dividing the intersection by the number of dimensions with value 1 in each vector, as shown in (20). The symbol |X| is the cardinality of set X, i.e., the number of components. The numerator is multiplied by 2 in order to obtain a value in the range 0-1, with 1 being the value corresponding to total similarity.

\frac{2 |X \cap Y|}{|X| + |Y|}    (20)
Jaccard
Jaccard, defined in (21), is similar to Dice but it introduces a penalization for those vectors which have relatively few components in common in proportion to the components that they do not share.

\frac{|X \cap Y|}{|X \cup Y|}    (21)
Overlap
The overlap coefficient, defined in (22), will assign the maximum value to the comparison of two vectors if the components of one are all present in the other, regardless of the fact that there may be many components in the second vector that are not present in the first.

\frac{|X \cap Y|}{\min(|X|, |Y|)}    (22)
Cosine
The Cosine coefficient, defined in (23), also penalizes those comparisons with few components in common with respect to the number of components that differ. The value of this coefficient ranges from 1 for identical vectors, through 0 for orthogonal (not similar) vectors, to -1 for opposite vectors.

\frac{|X \cap Y|}{\sqrt{|X|} \cdot \sqrt{|Y|}}    (23)
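To make the formulas concrete, the following Perl sketch computes the five coefficients over two small binary vectors represented as hashes; the component names and values are invented and this is not the module's internal code.

use strict;
use warnings;
use List::Util qw(min);

# Two hypothetical binary vectors (component => 1)
my %X = (term1 => 1, term3 => 1);
my %Y = (term2 => 1, term3 => 1);

my $n_x     = scalar keys %X;
my $n_y     = scalar keys %Y;
my $n_inter = scalar grep { exists $Y{$_} } keys %X;    # |X intersection Y|
my %union   = (%X, %Y);
my $n_union = scalar keys %union;                       # |X union Y|

my $matching = $n_inter;                                 # (19)
my $dice     = 2 * $n_inter / ($n_x + $n_y);             # (20)
my $jaccard  = $n_inter / $n_union;                      # (21)
my $overlap  = $n_inter / min($n_x, $n_y);               # (22)
my $cosine   = $n_inter / sqrt($n_x * $n_y);             # (23)

printf "matching=%d Dice=%.2f Jaccard=%.2f overlap=%.2f cosine=%.2f\n",
       $matching, $dice, $jaccard, $overlap, $cosine;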
8.1.3 Distance measures
The only difference between a similarity coefficient and a distance coefficient is that, while the first indicates more similarity as the value increases, the opposite occurs in the case of a distance measure: the greater the value, the less similar two objects are. The two distance measures implemented at the moment are well known in linguistic circles: the Euclidean and Manhattan distances.
Euclidean Distance
Probably the most simple measure, the Euclidean distance accumulates the squared differences between the values that correspond to each component in each vector, as shown in (24).

\sqrt{\sum_{i=1}^{n} (X_i - Y_i)^2}    (24)
Manhattan Distance
Also known as the Minkowski, taxi cab or city block distance, it calculates the distance between two vectors as defined in (25). An intuitive way to conceive this distance is to imagine a space that can only be traversed in straight lines, as when one is traveling in a taxi cab in Manhattan.

\sum_{i=1}^{n} |X_i - Y_i|    (25)
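Under the same hash representation, but now with numeric values, both distances amount to one loop over the union of the components; a minimal sketch, again with invented data:

use strict;
use warnings;

my %X = (term1 => 3, term2 => 0, term3 => 1);
my %Y = (term1 => 1, term2 => 2, term3 => 1);

# Union of the components of both vectors
my %seen = map { $_ => 1 } (keys %X, keys %Y);

my ($sumsq, $manhattan) = (0, 0);
for my $c (keys %seen) {
    my $diff = ($X{$c} // 0) - ($Y{$c} // 0);
    $sumsq     += $diff ** 2;     # accumulates for the Euclidean distance (24)
    $manhattan += abs($diff);     # Manhattan distance (25)
}
my $euclidean = sqrt($sumsq);

printf "Euclidean=%.2f Manhattan=%d\n", $euclidean, $manhattan;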
8.2 Use of the interface
The use of the interface will depend on the type of file that has been uploaded as a corpus. The program can, in principle, compute the similarity between vectors independently of what these vectors represent. In this case, the data has to be presented in plain text files in tabulated form, with one component per line and the values of the components separated by a tab character.
Having this kind of files as corpus (one file per vector), the most simple operation can be to extract the intersection, the union and the difference of the vectors. Another possibility is to select a given vector as reference and to rank the rest of the vectors according to the similarity they obtain with a given similarity coefficient or distance.
If instead of tabulated data files the corpus is composed of free text, then similar operations can be performed. For instance, one can submit a string of characters as input and the program will return contexts of occurrence of similar strings (approximate string matching). As explained in Section 8.1, the practical utility that this kind of operation can have is, for instance, the possibility to find in a text lexical units that are not exactly the same as our query expression. The same can be done at the document level. Given one document of the corpus as reference, the rest of the documents will be sorted according to the selected coefficient.
Following the link similarity in the left navigation bar, the user will find the three options of this section (Figure 25): 1) to submit a word and obtain orthographically similar words from the corpus, 2) to submit a document and obtain similar documents from the corpus and 3) to compare vectors directly (with a corpus composed of tabulated data files). Figures 26 and 27 show screenshots of the input and output of an extraction of forms from the corpus using approximate string matching on the basis of orthographic similarity coefficients.
Figure 25: Options for the interrogation in the analysis of similarity
Figure 26: Form for the extraction of concordances using approximate string matching
Figure 27: Result of the extraction of forms using approximate string matching
8.3 Use of the module
The most simple way of using the similarity functions of the Jaguar module is given in Example 26. It is the comparison of two strings of text, $A and $B. The output of this code is the number 0.476, which represents the Dice similarity of the two words taking the bigrams of characters as components of the vectors.
Download this example
my $A = "universality";
my $B = "externality";
print $J->similarity(
    objets  => "expressions",
    objet1  => $A,
    objet2  => $B,
    measure => "Dice",
);
Example 26: Method for comparing two strings of text
If instead of comparing two strings of text (which will rarely be the case) the idea is to extract the n strings from a (previously indexed) corpus that show the highest similarity to an input string, then the code to be used is the one displayed in Example 27.
Download this example
my @sim = $J->similarity(
    pathIndex => "/path/to/the/index/",
    objets    => "expressions",
    component => "universality",
    measure   => "Dice",
    threshold => 0.001,
    result    => "html",
);
foreach my $m (@sim) {
    print "\n$m->{unit}\t$m->{value}";
}
Example 27: Method to compare a string of text against the whole corpus
The code in Example 27 will have as a result the sorting of all the vocabulary of the corpus according to its orthographic similarity to the word universality. Another useful alternative would be to compare two lists of expressions, as shown in Example 28.
Download this example
my $A = "/path/to/a/file/with/a/list/of/words.txt";
my $B = "/path/to/another/file/with/a/list/of/words.txt";
print $J->similarity(
    objets    => "lists",
    listA     => $A,
    listB     => $B,
    measure   => "Dice",
    charN     => 2,
    threshold => 0.75,
);
Example 28: Method for the comparison of lists of expressions
If instead of comparing expressions the idea is to compare documents, the method would be very similar, as shown in Example 29. The difference is that we need to set the parameter objects to the value documents and the parameter referenceDocument to the name of the document that is going to be taken as reference, which of course has to be included in the indexed corpus. The path also has to be indicated, as in previous examples, using the parameter pathIndex.
Download this example
my @sim = $J->similarity(
    pathIndex         => "/path/to/the/index/",
    objets            => "documents",
    referenceDocument => "20070111.txt",
    coefficient       => "Dice",
    threshold         => 0.00001,
);
foreach my $m (@sim) {
    print "\n$m->{document}\t$m->{value}";
}
Example 29: Method for the sorting of documents according to their similarity to one provided as reference
The rest of the methods that are explained in this section are useful for comparing vectors directly. As was already explained, vectors are text files in which each line represents a component, and the numeric values of those components are separated by a tab character. If the value is not specified or if it is 0, it will be interpreted that such component is not present in the vector.
The first possibility, thus, is to submit a reference vector, as in Example 30, and to list the rest of the vectors according to the selected similarity coefficient or distance. Notice that in this case the result is an array which, as was the case with the n-grams method, has the elements sorted according to the resulting values, and it is a two-dimensional array where each element is itself a new list of two values: the name of the vector and the similarity value.
Download this example
my @sim = $J->similarity(
    path            => "/path/to/the/folder/of/vectors/",
    objets          => "vectors",
    referenceVector => "20070111.txt",
    coefficient     => "Dice",
    threshold       => 0.00001,
);
foreach my $m (@sim) {
    print "\n$m->[0]\t$m->[1]";
}
Example 30: Method for sorting vectors according to the similarity they have with a reference vector
Finally, the last possibility is to compare all vectors to obtain the union, the intersection and/or the difference between them. Example 31 shows how the interrogation is to be made. It is very similar to the previous example, only that in this case it is not necessary to specify a reference vector, but any (or all) of the following: union, intersection and difference. The result will be a data structure in the form of a hash table which has the selected operations as keys of the first level, each of the vectors involved as keys in the second level and, in a third level, the list of elements sorted according to the value that has been set for the parameter order. If the value of such parameter is value, the elements will be sorted according to their original value. If, instead, the value is set to alphabetical, the elements will be sorted in alphabetical order.
The code in Example 31 is slightly more complex than in the previous case, depending on the activation of the parameter difference. If the difference of the vectors has to be computed, the resulting data structure needs a new dimension for each vector, which was not necessary in the previous cases. This affects the way in which the results are printed. Notice the condition if ($op eq "difference") after the first loop. Of course, this condition is not necessary if differences are not computed.
Download this example
my %sim = $J->similarity(
    path         => "/path/to/the/vectors/",
    objects      => "vectors",
    union        => 1,
    intersection => 1,
    difference   => 1,
    result       => "html",
);
foreach my $op (keys %sim) {
    if ($op eq "difference") {
        foreach my $vector (keys %{$sim{$op}}) {
            foreach my $rank (@{$sim{$op}{$vector}}) {
                print "\n$rank->[0]\t$rank->[1]";
            }
        }
    } else {
        foreach my $rank (@{$sim{$op}}) {
            print "\n$rank->[0]\t$rank->[1]";
        }
    }
}
Example 31: Method for the extraction of the union, the intersection and the difference between vectors
9 Concluding remarks
The Jaguar project was conceived as a tool for the introduction to quantitative methods in
linguistics, aimed mainly at students or beginners. However, the program surely has enough power
to be used in scientific research, especially when used as a Perl module (because of the already
mentioned hardware limitation of the server). With the Jaguar module, a researcher is in possession
of a set of tools that can be organized creatively in data flows for purposes different from those shown as examples in this documentation.
Jaguar has existed as a web application running on IULA's web servers since June 2006, and it has been (slowly but) continuously progressing, with new functions and parameters being added thanks to the requirements of its users. By September 2010, the web-based interface of Jaguar had almost 200 registered users, and their feedback and bug reports have been of great help.
10 Bibliography
Birrell, A., McJones, P., Lang, R. & Goatley, H. (1995). Pstotext - extract ASCII text from a
PostScript or PDF file. README file. http://pages.cs.wisc.edu/~ghost/doc/pstotext.htm
Church, K., Hanks, P. (1991). Word Association Norms, Mutual Information and Lexicography.
Computational Linguistics, Vol. 16:1, pp. 22-29.
Daille, B. (1994). Approche mixte pour l'extraction automatique de terminologie: statistiques
lexicales et filtres linguistiques. Thèse de Doctorat en Informatique Fondamentale.
Université Paris 7.
Evert, S. (2004). The Statistics of Word Cooccurrences. PhD Thesis, IMS, University of Stuttgart.
Gale, W., Church, K. (1991). Identifying Word Correspondences in Parallel Text. Fourth Darpa
Workshop on Speech and Natural Language, Asilomar, pp. 152-157.
Herdan, G. (1964). Quantitative Linguistics. Washington, Butterworths.
Juilland, A., Chang-Rodríguez, E. (1964). Frequency Dictionary of Spanish Words. The Hague:
Mouton.
Kilgarriff, A. (2004). The Sketch Engine. In Proceedings of Euralex, pp. 105-116.
Lancia, F. (2007). Word Co-occurrence and Similarity in Meaning. In Salvatore, S., Valsiner, J.
(eds.), Mind as Infinite Dimensionality, Roma, Ed. Carlo Amore (forthcoming, 2008).
Mandelbrot, B. (1961). On the theory of word frequencies and Markovian models of discourse.
Structure of Language and its Mathematical Aspects, Proceedings of the Symposia on
Applied Mathematics, v. 12, American Mathematical Society, pp. 190-219.
Mandelbrot, B. (1983). Los objetos fractales, Madrid, Tusquets.
Manning, C., Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT
Press.
Marcus, S., Nicolau, E., Stati, S. (1978). Introducción a la lingüística matemática, Barcelona,
Teide.
Muller, C. (1973). Estadística Lingüística, Madrid, Ed. Gredos.
Noonburg, D. (2004). Pdftotext - Portable Document Format (PDF) to text converter (version
3.00). README File. http://www.foolabs.com/xpdf/README
Quasthoff, U., Richter, M., Biemann, C. (2006). Corpus portal for search in monolingual corpora.
In Proceedings of the LREC 2006, Genoa, Italy.
Sparck Jones, K. (1972). A statistical interpretation of term specificity and its application in
retrieval. Journal of Documentation, 28 (1), pp. 11-21.
Van Os, A. (2003). Antiword Linux Command. README File.
http://www.winfield.demon.nl/index.html