
Review of Search Engines: Information Retrieval in Practice

by Croft, Metzler and Strohman


B. Barla Cambazoglu
Yahoo! Research
Barcelona, Spain
barla@yahoo-inc.com

Despite the widespread public use of Web search engines, their internal design details largely remain a black art. The speculation is that there is a significant knowledge gap between what is published by academia and what is guarded behind the doors of large-scale search companies. "Search Engines: Information Retrieval in Practice" is one of the few books that attempt to cover the issues involved in search engine design and is probably the most comprehensive book published so far on this topic. Unfortunately, the book falls short of being a complete search engine guide, as its content is dominated by topics from information retrieval, text processing, and statistics. More precisely, the focus of the book is biased towards the "search" rather than the "engines": in most places, discussions of effectiveness dominate those of efficiency by a great margin. However, the book stands as a very solid IR book.

The book claims to have a systems-oriented view of search engines, which is not very evident from its content and organization. It occasionally goes beyond its main perspective and covers topics from fields that are closely related to search engines, such as machine learning, information sciences, and social networks. The search problem in the book is restricted to text search; however, it is by no means restricted to Web search, as the book touches (though briefly) on a wide variety of search applications such as sponsored search, XML search, verticals, and social search.

The primary target audience for the book is undergraduates in computer science. Hence, the book is written in a language that most people with a basic understanding of computers and the Web can follow (though certain sections of the book may require a background in probability and statistics).
The book is organized in quite a top-down fashion, with an evident desire to cover and categorize every concept under a heading, which occasionally makes the placement of concepts in the hierarchy and the sequencing of sections debatable. In general, the relative coverage of the topics seems to reflect the authors' personal knowledge of those topics rather than the topics' importance in practice. Some of these issues will be pointed out throughout the detailed discussion of chapters.

The book is composed of 11 chapters. The first two chapters are introductory, providing an overview of key concepts and issues in search engine design. Chapters three through eight, where the basic building blocks of a search engine are explored, form the core of the book. The final three chapters can be seen as supplementary chapters or extensions, as they do not fully fit into the theme of the book. The rest of this review goes through the chapters of the book in sequence, summarizing their content.

The book starts with Chapter 1 (The Big Issues), a short chapter that tries to establish a connection between information retrieval and search engines. The "big issues" in information retrieval are summarized as relevance, evaluation, and information needs. For search engines, these issues are extended to include performance, data incorporation, scalability, adaptability, and specific problems. This chapter provides a sufficiently well-written overview of search engines.

Chapter 2 (Architecture of a Search Engine) goes into more depth on the issues raised in the previous chapter and discusses search engines from an architectural point of view. A basic but general architecture is described, in which the functions of the search engine are divided into two processes: the indexing process (further divided into text acquisition, text transformation, and index creation) and the query process (further divided into user interaction, ranking, and evaluation). More than 20 different search engine components are classified under these headings. Although this chapter provides very good coverage of the relevant topics, the classification of topics is occasionally problematic (e.g., caching is classified under the data distribution heading). Perhaps it would have been better to replace the "performance optimization" and "distribution" headings with a single "search performance" heading and to discuss performance issues at different granularities, such as node-level, cluster-level, and data-center-level issues. As a side note, in this chapter, the topics under each heading are listed in boxes in the page margins. This is a very useful convention, which is, unfortunately, not followed in the succeeding chapters.

Issues related to the discovery, acquisition, conversion, and storage of text are discussed in Chapter 3 (Crawls and Feeds). The topics covered also include character encoding issues, duplicate document detection, and identification of document structures.
Interestingly, only 15 pages are dedicated to Web crawling, which is probably one of the most important components of a search engine, and almost the entire "The Web Crawler" section is allocated to the politeness issue, which should have its own subsection. Consequently, some important concepts are either omitted (e.g., coverage and seed selection) or not discussed in much detail (e.g., the issues in hidden Web crawling are stated but none of the solutions are reviewed, and distributed crawling is not explained well). Moreover, the level of detail varies widely. For example, while the discussion of politeness remains at the level of "angry site owners", the freshness discussion involves integrals over Poisson distributions. Sections 3.7 (Detecting Duplicates) and 3.8 (Removing Noise) are fun to read, but the chapter remains the weakest of the book.

Chapter 4 (Processing Text) is the strongest chapter in the book (together with Chapter 5, but excluding Sections 5.6 and 5.7). The three main topics that the chapter discusses are text statistics, document parsing, and link analysis. Several interesting text processing problems are investigated from a statistical perspective. These sample problems are well selected and could be very motivating for students. Document parsing covers standard issues such as tokenization, stopword elimination, stemming, and phrase identification. The section on link analysis is mainly dedicated to PageRank and, to a lesser extent, discusses link spam. Other topics included in this chapter are text processing in non-English languages and information extraction.

The core of the book is Chapter 5 (Ranking with Indexes), where the index creation, posting list compression, and query processing components are explained. The chapter starts with a review of alternative inverted index structures, that is, different storage and ordering possibilities for postings. There is also a short section describing auxiliary data structures used in ranking. Unfortunately, this section omits some important auxiliary structures such as the document feature array. The discussion of list compression, although it misses some important literature, is well written and provides a very good overview of the basics. The subsection about parallelism and data distribution (including MapReduce), however, is quite distracting, as it does not fit well into the flow of the text. Perhaps it would have been better to place all performance-related issues in a separate chapter rather than raising them irregularly in the form of short discussions. The final section is on query processing, including various query evaluation techniques, optimization strategies, distributed evaluation, and caching. Given the vast amount of literature, the discussion provided on this important topic appears to be a bit limited (only 17 pages, compared to 6 pages solely on spell checking in Section 6.2.2).

The book continues in Chapter 6 (Queries and Interfaces) with the user-facing side of search engines. The two main headings of this chapter are query transformation and result generation. The first heading includes issues such as spell checking, query suggestion, query expansion, relevance feedback, and personalization. This part is fairly well written and provides an abundance of examples that make understanding easier. The second heading includes creation of snippets, sponsored search, and result clustering. It is a bit questionable why sponsored search is discussed under this heading, but otherwise, the provided content is satisfactory.
Also, issues such as diversity and result uniquing (e.g., not allowing more than a certain number of results from the same host) are, for some reason, completely omitted. In this chapter, it would perhaps be good to display a full search engine result page, together with the technical terms used to describe the entities on this page (e.g., search shortcuts).

Chapter 7 (Retrieval Models) is the longest chapter (64 pages). The chapter starts with the basic information retrieval models, such as the Boolean and vector space models, and then moves on to others, such as probabilistic models and language models. A brief discussion of machine-learned ranking is also included. This chapter requires some background in probability and may be more difficult for an undergraduate student to follow than the other chapters of the book.

Another well-written chapter is Chapter 8 (Evaluating Search Engines), in which various metrics employed in search engine evaluation are discussed. Although slightly repetitive due to textual overlap with other chapters, this chapter is very informative. The metrics are discussed under two headings, efficiency metrics and effectiveness metrics (a quite natural and standard separation). The chapter covers most of the effectiveness metrics, whereas the coverage of efficiency metrics is somewhat incomplete, especially for those regarding performance in content acquisition.

Chapter 9 (Classification and Clustering), as the name suggests, discusses a slightly different topic. The chapter first provides an overview of the data classification problem with a few illustrative algorithms (e.g., naive Bayes and support vector machines). Also, some classification applications are discussed in the context of search engines (namely, spam, sentiment, and online advertising). Then, the clustering problem is discussed and several example algorithms are given (e.g., hierarchical, k-means, and k nearest neighbor clustering). Overall, the chapter covers the topic in sufficient detail and is very well illustrated with diagrams.

Social search is defined in Chapter 10 (Social Search) as searching in environments where a community of users actively participates in the search process. Some topics covered by the chapter are searching and browsing manually tagged items, community-based question answering, collaborative search, filtering, and recommendation. Although they do not seem to fit well, peer-to-peer search, metasearch, and distributed search are also included in this chapter.

Chapter 11 (Beyond Bag of Words) covers a mixture of topics, some of which are mentioned in the previous chapters. This chapter, although slightly repetitive, covers many topics, such as XML retrieval, entity search, and question answering, as well as image, video, and music search.

Each chapter in the book ends with a section that contains a summary of the chapter and pointers for further reading. In general, these sections provide references to the key papers that started the line of research on particular topics discussed in the chapter. Overall, the authors seem to have done a good job in selecting these references. Each chapter also includes a single "Exercises" section. Some of the exercises in these sections require working with the search engine code (called Galago) that accompanies the book.

A stylistic concern is that some figures in the book appear to have been drawn using different tools. It would have been better to have some homogeneity in line widths, arrowhead styles, and shading of the objects in the figures (and also in the tables).
Finally, it would have been better to place the pieces of text discussing Galago in an appendix rather than scattering them throughout the book.

Some final comments can be made, taking existing IR books into account. About two decades after the publication of the two classical IR books by Van Rijsbergen (1979) and Salton & McGill (1983), we now have more than a dozen high-quality IR books available in bookstores. Unfortunately, most of these books have either a very specific perspective or a limited scope (Korfhage, 1997; Grossman & Frieder, 2004; Van Rijsbergen, 2004; Berry & Browne, 2005; Ingwersen & Järvelin, 2005; Langville & Meyer, 2006; Meadow, Boyce, Kraft, & Barry, 2007; Hersh, 2009). Unlike these books, Search Engines: Information Retrieval in Practice manages to keep a general perspective and still have extensive coverage. The currency of the book is yet another advantage (the two other relatively new IR books with similar coverage are by Kowalski & Maybury (2005) and by Manning, Raghavan, & Schütze (2008)). It is not too daring to say that the book is a candidate to soon become one of the most widely used and highly respected IR books (e.g., Baeza-Yates & Ribeiro-Neto, 1999; Witten, Moffat, & Bell, 1999; Chakrabarti, 2002). In conclusion, although not complete, the book is quite comprehensive and well written.

As the authors state in the book, there are not enough courses that focus on IR or search engines. Nevertheless, the book is especially suitable for undergraduate students and can be used as the main textbook of a computer science course (other candidates are Belew, 2001 and Chowdhury, 2003).

References:
Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern Information Retrieval (2nd ed.). Addison-Wesley-Longman.
Belew, R. (2001). Finding Out About. Cambridge University Press.
Berry, M. W., & Browne, M. (2005). Understanding Search Engines: Mathematical Modeling and Text Retrieval (2nd ed.). SIAM.
Chakrabarti, S. (2002). Mining the Web: Analysis of Hypertext and Semi Structured Data. Morgan Kaufmann.
Chowdhury, G. G. (2003). Introduction to Modern Information Retrieval (2nd ed.). Neal-Schuman.
Grossman, D. A., & Frieder, O. (2004). Information Retrieval: Algorithms and Heuristics. Springer.
Hersh, W. R. (2009). Information Retrieval: A Health and Biomedical Perspective (3rd ed.). Springer.
Ingwersen, P., & Järvelin, K. (2005). The Turn: Integration of Information Seeking and Retrieval in Context. Springer.
Korfhage, R. R. (1997). Information Storage and Retrieval. John Wiley & Sons.
Kowalski, G., & Maybury, M. T. (2005). Information Storage and Retrieval Systems (2nd ed.). Springer.
Langville, A. N., & Meyer, C. D. (2006). Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press.
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
Meadow, C. T., Boyce, B. R., Kraft, D. H., & Barry, C. L. (2007). Text Information Retrieval Systems (3rd ed.). Academic Press.
Salton, G., & McGill, M. J. (1983). Introduction to Modern Information Retrieval. McGraw-Hill.
Van Rijsbergen, C. J. (1979). Information Retrieval (2nd ed.). Butterworth-Heinemann.
Van Rijsbergen, C. J. (2004). The Geometry of Information Retrieval. Cambridge University Press.
Witten, I. H., Moffat, A., & Bell, T. C. (1999). Managing Gigabytes: Compressing and Indexing Documents and Images (2nd ed.). Academic Press.
