Tuesday, January 13, 2009

[week 2] Reading Notes

I will start this post with a comment on section 1.2 of IIR. The authors show the basics of building an inverted index, and at the end of the 3rd paragraph on page 9 they say "this inverted index structure is essentially without rival as the most efficient structure for supporting ad hoc text search". In the first class we saw that ad hoc text search is just one kind of retrospective information need. So, is this index not the best structure for other user needs, such as comprehensive exploration or question answering? What other indexing structures are there, and which kinds of user needs is each best suited for?
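To make the structure concrete, here is a minimal sketch of an inverted index and of the postings-list intersection used to answer a two-term AND query. It is not the book's code: the function names `build_inverted_index` and `intersect`, the whitespace tokenizer, and the three toy documents are all assumptions made for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted list of the IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenization: lowercase and split on whitespace (an assumption).
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def intersect(p1, p2):
    """Walk two sorted postings lists in step, keeping IDs present in both."""
    result, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

# Toy collection, invented for this example.
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}
index = build_inverted_index(docs)
print(intersect(index["home"], index["july"]))  # [2, 3]
```

The point of keeping each postings list sorted is that the intersection only needs one pass over both lists, which is a large part of why the structure works so well for ad hoc keyword queries.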

In section 2.1.2, Choosing a document unit, the authors explain that very long documents can sometimes be split for the purpose of indexing. In fact, I now recall that whenever I have wanted to update my Outlook Express, I have had to update just one file (.pst extension), which is supposed to contain everything (contacts, e-mail messages, etc.)... How is this file indexed in memory?

Section 2.2.1, Tokenization, raises some questions. Do search engines like Google index documents written in different languages in different ways? If so, how do they choose the indexing technique for a document written in more than one language, when the languages have very different characteristics, like Chinese, Arabic and English? Do they index the same document with several approaches, or do they index parts of the file as separate documents?

I also want to comment on section 2.2.3, Normalization. I am from Chile and I speak Spanish, and since I have been working with computers for at least 11 years, it's common for me to write queries without accents or diacritics, as described in the first paragraph of page 28. Indeed, I sometimes limit the names of files and folders to 8 characters (as in the old DOS days, lol).
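As a rough illustration of how an indexer could make an unaccented query match accented text, the sketch below strips diacritics using Python's standard `unicodedata` module. The helper names `strip_diacritics` and `normalize` are made up for this example, and this is only one possible normalization, not necessarily what any real search engine does.

```python
import unicodedata

def strip_diacritics(text):
    """Decompose characters (NFD), then drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize(text):
    """Hypothetical term normalizer: remove diacritics and lowercase."""
    return strip_diacritics(text).lower()

print(normalize("Información"))  # informacion
print(normalize("informacion") == normalize("Información"))  # True
```

One trade-off is that this also maps "ñ" to "n", so Spanish words such as "peña" and "pena" collapse to the same term, which is the kind of loss of distinction that normalization can cause.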
