Vous êtes sur la page 1sur 2

23/11/2017 COS 126: Assignment 8.

Text indexing

COS 126
Programming Assignment 8

Text indexing Due: Wednesday, 11:59pm

Write a program that builds an index from a large piece of text, then does fast searches to find the position of
``query'' strings within the text.

Your program should take two filenames as command-line arguments: the text file (which we refer to as the
corpus) and a file containing the queries. Assume that both files contain only lower-case letters, blanks, and
newline characters, with queries in the query file delimited by newline characters. This is not much of a
restriction, since you can convert any file to this format by using tr filters (useman tr to find out about this
process, if you are interested).

Your program should read the corpus, store it as a (potentially huge) string, and possibly build an index for it,
as discussed below. Then it should read queries one by one, assuming one query per line in the second named
file on the command line, and print out the position in the corpus where each query first appears. For
example, for the corpus /u/cs126/data/corpus0:

it was the best of times it was the worst of times it was the age of
wisdom it was the age of foolishness it was the epoch of belief
it was the epoch of incredulity it was the season of light it was
the season of darkness it was the spring of hope it was the winter
of despair

and the query file /u/cs126/data/query0, the result of a.out corpus0 query0 should be
69 wisdom
176 season
87 age of foolishness
-- age of fools
19 times

indicating that wisdom appears at character position 69, age of fools does not appear, the first occurrence
of times is at character position 19, and so forth.

There are numerous different ways to solve this problem, all with different characteristics in terms of ease of
implementation, space requirements, and time requirements. Part of your task for this assignment is to
assimilate this information to help you decide which method to use and how to apply it to this specific task.

You may begin with a brute-force implementation based on Program 3.15 in the book. That is, don't build an
index at all: just search through the corpus for each query string. This solution is fine if the corpus is tiny, or
if there are not many queries. A complete solution along these lines will be worth 7 points for this program.
But this method is much too slow when the corpus is huge and there are a huge number of queries, so to get a
10 on this program, you will need to implement a faster solution.

One method for making the search fast is to do a pointer sort on the corpus (one pointer for each character
position), then use binary search. If you wish to take this approach, you can use the functions qsort and
bsearch from the standard C library. Your main challenges in this approach are to fully understand Program
3.17; develop a similar program to build an index of pointers that access the keys in sorted order (as in
Figure 3.13); and figure out the necessary interface to bsearch to use the index to do the queries. In
particular, you need appropriate ``comparison'' functions (different ones!) for the sort and the search.

Another possible approach is to start with Program 12.10. This code basically provides a full solution, but to
get it working, you have to make a number of small changes, both because it differs from the problem
specified here in a number of details, and because various small pieces of code are missing. You may want to
edit this code, or write your own code from scratch. Again, the ``comparison'' function has to be carefully
considered.

http://www.cs.princeton.edu/courses/archive/spring03/cs126/assignments/textindex.html 1/2
23/11/2017 COS 126: Assignment 8. Text indexing

The files corpus1 and query1 in /u/cs126/data comprise a larger test case, and the files corpus2 and
query2 comprise a huge test case. If your program handles the huge case, great! If it doesn't, try to
understand the reasons why and document them in your readme file, but don't spend an inordinate amount of
time on new code to handle the huge case, because there aren't that many points available between the 7 for
the brute-force solution and the 10 for a perfect solution.

Important note: the ``training wheels are off'' on this assignment---you're starting with a blank screen, not
our code (although you do have a lot of code from the book to work with). This means that you must pay
particular attention to the process of building up your program from scratch. Consider carefully how your
program should be structured and how you are going to implement and debug the various functions before
plunging in. You may wish to add code just for your own use in debugging (for example, to print out the
contents of any data structure that you build). Get the brute-force solution working first. Not only will it
provide a framework for more complicated solutions (like the code we usually provide) but also it will give
you a way to compute correct output for comparison when debugging more complicated solutions.

Extra Credit: Modify your program to count the number of occurrences of each query string.

Copyright 1998 Robert Sedgewick

http://www.cs.princeton.edu/courses/archive/spring03/cs126/assignments/textindex.html 2/2

Vous aimerez peut-être aussi