Tuesday, January 27, 2009

[week 4] Reading Notes

In the last paragraph of chapter 1, the authors talk about the practical use of regular expressions for searching. I wanted to test the behavior of web search engines (Google, Yahoo, MSN) in their response to a simple query involving a regular expression: "hello" and then "hel*o". As a wildcard, the * generally means zero or more characters (in a strict regular expression, * instead applies to the preceding character, so hel*o would mean "he", then zero or more "l"s, then "o"). The results are presented in the tables below (showing the first two results per query in each search engine).

Query "hello":


Query "hel*o":
(sorry for posting screenshots, but blogger doesn't have a good table tool in its WYSIWYG editor)

First, I saw that there is some treatment of the * symbol, since there is no result with the exact match hel*o. Now the question is whether it is interpreted just as it would be in a regular expression match.

If regular expressions as we know them were supported by these search engines, I would have expected more results for the search hel*o than for hello, but the opposite was true.

It seems that search engines substitute the * with more than just a single alphanumeric character. The results showed matches with special characters, such as - (hyphen), and with several characters concatenated. However, they don't seem to substitute * with each letter of the alphabet and then perform successive queries such as helao, helbo, helco, etc.
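The difference between the two interpretations I'm contrasting can be sketched in a few lines of Python; the word list is my own toy sample, just to make the two match sets visible:

```python
import re
import fnmatch

# In a true regular expression, * applies to the preceding element:
# "hel*o" means "he", then zero or more "l", then "o".
def regex_matches(words):
    pattern = re.compile(r"^hel*o$")
    return [w for w in words if pattern.match(w)]

# In glob-style wildcard syntax (closer to what the engines seem to do),
# * stands for any run of characters, including special ones like "-".
def glob_matches(words):
    return [w for w in words if fnmatch.fnmatchcase(w, "hel*o")]

words = ["heo", "helo", "hello", "helllo", "help-o"]
print(regex_matches(words))  # -> ['heo', 'helo', 'hello', 'helllo']
print(glob_matches(words))   # -> ['helo', 'hello', 'helllo', 'help-o']
```

Note that only the glob interpretation accepts help-o (with the hyphen), which is exactly the kind of match the screenshots showed.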

Finally, the difference in the number of results returned by each search engine is incredible: Yahoo returned more than one and a half billion pages for the query hello, whereas Google returned 476 million and MSN 81 million.

Sunday, January 25, 2009

[week 3] Muddiest Points

My first muddiest point refers to the data structures used for access methods. Are hash tables, binary trees, and B-trees the only options? Is there no other structure, such as a linked list or a doubly linked list, commonly used for this purpose?
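My own guess at why linked lists don't appear on that slide: they only allow a linear O(n) scan, while even a plain sorted array already supports O(log n) binary search. A minimal sketch, with toy terms of my own invention:

```python
import bisect

# A sorted array of terms: lookup is O(log n) via binary search.
# A linked list holding the same terms could only be scanned front
# to back, O(n) per lookup, which is why it is a poor access method.
terms = sorted(["boat", "house", "information", "retrieval", "system"])

def contains(term):
    i = bisect.bisect_left(terms, term)
    return i < len(terms) and terms[i] == term

print(contains("house"))   # -> True
print(contains("castle"))  # -> False
```

Hash tables and B-trees improve further on this for exact lookup and for disk-resident dictionaries, respectively.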

About the compression methods, I just wonder how indexers deal with character encodings, for example Unicode UTF-8, UTF-16, or ISO-8859. Should these encodings change the way IR systems index or compress the data? If so, in what way?
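At least at the byte level the encoding clearly matters, since the same text occupies different byte sequences under each encoding. A quick check in Python (the sample word is my own):

```python
# The same string becomes a different byte sequence under each encoding,
# so an indexer that compresses raw bytes sees different input depending
# on the declared encoding of the document.
s = "niño"  # 4 characters, one of them non-ASCII
utf8 = s.encode("utf-8")        # ñ takes two bytes
utf16 = s.encode("utf-16-le")   # two bytes per character
latin1 = s.encode("iso-8859-1") # ñ fits in a single byte
print(len(utf8), len(utf16), len(latin1))  # -> 5 8 4
```

Presumably indexers normalize everything to one internal representation (e.g. Unicode code points) before compressing, but that is exactly the part I'd like to see spelled out.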

Tuesday, January 20, 2009

[week 3] Reading Notes

This week we read about Index construction and Index compression. About the first topic, I have one doubt:
  • At the end of page 72, talking about dynamic indexing, the authors say "where n is the size of the auxiliary index". Maybe I just missed some part of the reading, but I don't see how they measure the size of the index: the number of terms in the dictionary, the number of postings, the sum of both, etc. I don't know.
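To make my doubt concrete, here is a toy version of the auxiliary-index idea as I understood it; measuring the auxiliary index by its number of postings (AUX_LIMIT below) is my own assumption, which is precisely the point I'm unsure about:

```python
from collections import defaultdict

# Toy dynamic indexing: new postings go into a small in-memory auxiliary
# index, which is merged into the main index once it holds AUX_LIMIT
# postings (counting postings is MY assumption about what "size" means).
AUX_LIMIT = 4
main_index = defaultdict(set)
aux_index = defaultdict(set)

def add_document(doc_id, tokens):
    for t in tokens:
        aux_index[t].add(doc_id)
    if sum(len(p) for p in aux_index.values()) >= AUX_LIMIT:
        merge()

def merge():
    for term, postings in aux_index.items():
        main_index[term] |= postings
    aux_index.clear()

def search(term):
    # Queries must consult both indexes until the next merge.
    return main_index.get(term, set()) | aux_index.get(term, set())
```

Whether n counts postings, terms, or bytes changes when merges fire and therefore the amortized cost the book derives, which is why the definition matters.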
In the second topic:
  • On page 79 it is said that "in this chapter we define a posting as a docID in a posting list"... does this imply that they didn't assume the same definition in previous chapters?
  • On page 81, Heaps' law suggests that "(i) the dictionary size continues to increase with more documents in the collection, rather than a maximum vocabulary size being reached". Is this always true, even for specific (and sometimes large) domains?
  • In the description on pages 84-85 of a long-string-based dictionary as a replacement for fixed-width entries, it is clear that a good amount of space is saved. However, I think the cost of the operations for maintaining the index (deleting terms from the dictionary, adding or deleting postings) can be higher in the string-based case, making this solution not so good for dynamic collections.
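To see where my maintenance worry comes from, here is a toy version of the dictionary-as-a-string scheme (the terms are my own examples): all terms are concatenated into one long string, and each entry keeps only an offset into it, with a term's length implied by the next offset.

```python
# Toy dictionary-as-a-string: instead of fixed-width entries (e.g. 20
# bytes per term, mostly wasted on padding), concatenate all sorted
# terms and store one offset per term into the combined string.
terms = sorted(["compression", "index", "information", "retrieval"])
big_string = "".join(terms)
offsets = []
pos = 0
for t in terms:
    offsets.append(pos)
    pos += len(t)

def term_at(i):
    end = offsets[i + 1] if i + 1 < len(offsets) else len(big_string)
    return big_string[offsets[i]:end]

print(term_at(2))  # -> "information"
```

Deleting or inserting a term in the middle forces rebuilding the string and shifting every later offset, which is the maintenance cost I suspect makes this less attractive for dynamic collections.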
That's all this week, folks...

Monday, January 19, 2009

[week 2] Muddiest points

During this week's class, some of the points I had addressed in my week-2 reading notes were discussed. I want to emphasize the discussion about indexing techniques for languages such as Chinese and Arabic. Although I am not a native English speaker, many techniques such as tokenization and stemming make sense in my native language, Spanish, so it is not so difficult to figure out how they can improve an IR approach. But in a language like Chinese or Arabic, things are more difficult. Written Chinese generally doesn't use spaces, so word segmentation does not guarantee a unique tokenization (example taken from the class slides):
下雨天留客天留我不留
can be:
  1. 下雨,天留客,天留,我不留!It’s raining. The lord wants the guest to stay. Although the lord wants, I don't
  2. 下雨天,留客天,留我不?留!It's a raining day. It is a day to let the guest stay. Will you let me stay? Stay!
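One classic baseline for segmenting unsegmented text is the greedy "maximum matching" heuristic; a sketch of it below, using an English toy dictionary of my own invention only so the example stays readable:

```python
# Greedy maximum matching: at each position, take the longest dictionary
# word that fits (up to max_len characters); fall back to a single
# character when nothing matches.
def max_match(text, dictionary, max_len):
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

vocab = {"the", "tab", "table"}
print(max_match("thetable", vocab, max_len=5))  # -> ['the', 'table']
print(max_match("thetable", vocab, max_len=3))  # -> ['the', 'tab', 'l', 'e']
```

The same input segments differently depending on the dictionary and window size, which mirrors the ambiguity in the Chinese example above: greedy matching commits to one reading and can never recover the other.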

Besides, the Arabic language makes frequent use of infixes (another example taken from the class slides).

The open research question on the slide is: what's the most effective stemming strategy in Arabic? At this point I think that, more than stemming, an approach based on context (noting surrounding words) can be more successful when implementing an IR approach for the Arabic language. For Chinese, stemming doesn't make much sense, since it has very little inflectional morphology.

Tuesday, January 13, 2009

[week 2] Reading Notes

I will start this post with a comment on section 1.2 of IIR. The authors show the basics of building an inverted index, and at the end of the 3rd paragraph on page 9 they say "this inverted index structure is essentially without rival as the most efficient structure for supporting ad hoc text search". In the first class we saw that ad hoc text search is just one of the retrospective information needs. So, is this index not the best structure for other user needs, such as comprehensive exploration or question answering? What are other indexing structures, and how are they best suited to other kinds of user needs?
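For reference while I read, here is a minimal inverted index in the spirit of section 1.2 (the three documents are toy examples of mine): each term maps to the set of docIDs containing it, and an AND query intersects the postings.

```python
from collections import defaultdict

# Build a tiny inverted index: term -> set of docIDs, then answer a
# boolean AND query by intersecting the postings sets.
docs = {1: "new home sales top forecasts",
        2: "home sales rise in july",
        3: "increase in home sales in july"}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def and_query(*terms):
    postings = [index[t] for t in terms]
    return sorted(set.intersection(*postings))

print(and_query("home", "sales", "july"))  # -> [2, 3]
```

Even this toy version shows why the structure suits ad hoc search so well: each query only touches the postings of its own terms, never the full collection.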

In section 2.1.2, Choosing a document unit, the authors explain that very long documents can sometimes be split for the purpose of indexing. In fact, I can recall that when I have wanted to update my Outlook client, I have had to update just one file (.pst extension) which is supposed to contain everything (contacts, e-mail messages, etc.)... how is this file indexed in memory?

Section 2.2.1, Tokenization, raises some questions, such as: do search engines like Google index documents written in different languages in different ways? If so, how do they decide the indexing technique for documents written in more than one language with very different characteristics, like Chinese, Arabic, or English? Do they index the same document with different approaches, or index parts of the file as separate documents?

I also want to comment on section 2.2.3, Normalization. I am from Chile and I speak Spanish, and as I have been working with computers for at least 11 years, it's common for me to write queries without using accents or diacritics, as stated in the first paragraph of page 28. Indeed, I sometimes still limit the names of files and folders to 8 characters (as in the old DOS times, lol).
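One common way to normalize queries like mine is to decompose accented characters and drop the combining marks, so an accented and an unaccented spelling index to the same term; a short sketch (the Spanish sample words are my own):

```python
import unicodedata

# Decompose each character (NFD) so accents become separate combining
# marks, then drop those marks: "canción" and "cancion" become equal.
def strip_diacritics(s):
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(strip_diacritics("canción"))  # -> "cancion"
print(strip_diacritics("peña"))     # -> "pena"
```

Of course this can conflate genuinely different words (peña vs. pena), which is exactly the trade-off the book discusses for this kind of normalization.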

[week 1] Muddiest Points

After reviewing my notes and checking the video of the class (really good! I didn't know about Panopto before), here are my muddiest points for week 1:

1. We discussed various definitions during the class such as Information, the DIKW (Data, Information, Knowledge and Wisdom) hierarchy and, of course, the definition of Information Retrieval. The definition selected was one from Nicholas J. Belkin (1980):
Information retrieval is a problem-oriented discipline, concerned with the problem of the effective and efficient transfer of desired information between human generator and human user.
I can infer that what is better known as "user needs" in IR is described here as "desired information", but this definition tends to put more emphasis on the "effective and efficient transfer" of information than on the different kinds of "user needs". I wonder how the focus of IR research has changed over the years; I mean, on which parts of the IR field researchers concentrated more before and after the creation of the World Wide Web.

2. We also saw some different approaches to grasping how fast the generation of information is growing: comparing the size in TB of Lexis-Nexis and DIALOG with the Library of Congress, a graph showing the host count year by year on the Internet, and the information stored in different media (digital devices, books, etc.). Anyway, I wonder what the methodology is behind the 5,187,130 TB of data recorded on magnetic devices in 2002, as described by Lyman and Varian of UC Berkeley.

3. When looking at the IR problem just from the user's perspective, what we see is their information need, which can be classified as retrospective or prospective. Are there other classifications of "user needs" that can influence different paradigms or algorithms of Information Retrieval?

4. Finally, I would just like to point out that, regarding the types of document representation mentioned in class (Bibliographic, Full-text, and Directory), I find this classification a little strange. My sense is that "Directory" is just part of the "Bibliographic" representation... one of the "metadata" fields of a document represented bibliographically could be the categories from a "Directory" category path.