Tuesday, April 14, 2009

[week 13] Muddiest Points

This week's topic was Text Classification and Clustering. My two cents:
  • What is the relationship between Support Vector Machines (SVMs) and Neural Networks?
  • I didn't clearly understand the concepts of Total Benefit and Total Cost for the K-means clustering algorithm (I sketch the standard K-means objective below as a point of comparison).
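The Total Benefit/Total Cost formulation from the slides is not in the book as far as I can tell, so to have something concrete to compare it against, here is a minimal sketch of the standard K-means objective (the within-cluster sum of squared distances). Nothing here comes from the lecture; the data and parameters are whatever you feed it.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain K-means: alternate assignment and centroid update to (locally)
    minimize the within-cluster sum of squared distances."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its cluster
        # (no handling of empty clusters in this toy version)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    # the objective K-means is minimizing
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    cost = dists[np.arange(len(X)), labels].sum()
    return labels, centroids, cost

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
labels, centroids, cost = kmeans(X, k=2)
print(labels, cost)
```

My guess is that the slides' Total Cost is this objective (or something proportional to it) and Total Benefit is its reduction relative to some baseline, but that is exactly the part I would like to have confirmed.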

That's it.

Monday, April 6, 2009

[week 12] Muddiest Points

As the term gets closer to its end, my posts get shorter... well, my two cents:

1. In slide 37, Dr. He introduced active feedback, but I missed additional references on this topic. Is there more research on active feedback mechanisms?

2. Is there any explanation for why click-through data produces more improvement than query history in active feedback?

That's it. I'll be looking forward to seeing one of my questions answered during next week's class :)

Friday, March 20, 2009

[week 11] Muddiest Points

This week the topic was "Web Search", and these are my two cents:

1. When did search engines (before Google) start to consider the link structure of the web as something important to incorporate into their algorithms? I know the HITS algorithm performs link analysis, but when did commercial search engines start using it seriously? I want to find out whether this was the drop that tipped the scale in favor of Google. (A small sketch of the HITS iteration follows this list.)

2. In the model of the Web as a core plus incoming and outgoing links... how is the 22% of disconnected pages calculated? If those pages are disconnected, how can we be sure the figure isn't smaller or larger?

3. How does a search engine decide where to start crawling? What are the most common heuristics for making this decision?
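Since question 1 mentions HITS, here is a minimal sketch of the hub/authority iteration as I understand it; the little link graph is completely made up, and real engines obviously do something far more elaborate.

```python
# Toy HITS iteration: authorities are pointed to by good hubs,
# hubs point to good authorities. The link graph here is invented.
links = {  # page -> pages it links to
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
pages = list(links)
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(20):
    # authority score: sum of the hub scores of the pages linking to it
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # hub score: sum of the authority scores of the pages it links to
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    # normalize so the scores don't blow up
    na = sum(v * v for v in auth.values()) ** 0.5
    nh = sum(v * v for v in hub.values()) ** 0.5
    auth = {p: v / na for p, v in auth.items()}
    hub = {p: v / nh for p, v in hub.items()}

print(sorted(auth.items(), key=lambda kv: -kv[1]))
```

Page c ends up as the strongest authority simply because everything points to it, which is the intuition I take away from the lecture.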

Tuesday, March 17, 2009

[week 10] Spring Break

No comments about readings this week. I have been working on my final project, which I am developing with my classmate Parot Ratnapinda. We are using the TREC Genomics dataset, Lucene to index, and Carrot2 to cluster and visualize the results.

Tuesday, March 3, 2009

[week 9] Reading Notes

This week the lecture is about Information Visualization. This is probably the area where books get outdated most quickly. Anyway, it was very interesting to review the data types and tasks of the taxonomy for identifying visualizations that is presented in this week's secondary reading. A couple of questions:

1. I know a system called SpaceTime 3D, an application which gives enriched visualizations of the results of queries run on Google and other sites. Despite the name, I am not sure whether this is a real 3D World data type or a 2D Map, or maybe a combination of both.

2. Does HayStaks (a Firefox plugin to store and reuse queries within an organization) correspond to the challenge category of "Collaborate with others"?

[week 8] Midterm

Lots of muddiest points for the midterm... I will post something when the corrected exam is available.

[week 7] Muddiest Points

Like the last post, this one is about Relevance Feedback. First, I would like to know more details about the Wilcoxon signed-rank test. I know, I know... I can google it by myself. (A tiny example of how it gets used to compare two systems is sketched below.)
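In the meantime, a minimal sketch of how I understand the test being used in IR evaluation: compare two systems on the same queries using their per-query average precision. All the numbers below are invented.

```python
from scipy.stats import wilcoxon

# Invented per-query average precision for two systems over the same 10 queries
system_a = [0.42, 0.31, 0.56, 0.22, 0.48, 0.39, 0.61, 0.28, 0.35, 0.50]
system_b = [0.40, 0.35, 0.51, 0.30, 0.45, 0.44, 0.58, 0.33, 0.41, 0.49]

# Paired, non-parametric test on the per-query differences
statistic, p_value = wilcoxon(system_a, system_b)
print(statistic, p_value)  # a small p-value suggests the difference is not chance
```

The point, as I understand it, is that the test is paired (same queries for both systems) and does not assume the per-query differences are normally distributed, unlike a paired t-test.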

Another question: relevance feedback seems too attached to one specific user's needs. If the algorithm updates weights based on user feedback, how can we be sure that another user issuing the same or a similar query has the same information need? Even if they have the same information need, how can we be sure they will judge the results as relevant in the same way? Even more, how is it possible that a model with two such weak assumptions (users like to provide feedback, and we can obtain reliable feedback from users) can be successful?

Tuesday, February 17, 2009

[week 7] Reading Notes

Relevance Feedback

  • In section 9.1, the authors write: “RF can also be effective in tracking a user’s evolving information need”. I am not so sure about this, because if the user understands her information need better, she will probably reformulate her query instead of waiting for the system to recalculate a result set.

Foundations of the Rocchio Algorithm

  • I don't understand how formula (9.2), the optimal query expressed as the difference between the centroids of the relevant and the nonrelevant documents, is derived from formula (6.10) for vector similarity.
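My attempt to reconstruct the derivation (hand-waving the length normalization of the centroids, which I think the chapter also glosses over):

```latex
% (9.1): find the query that best separates relevant from nonrelevant documents
\vec{q}_{opt} = \arg\max_{\vec{q}}
  \left[ \operatorname{sim}(\vec{q}, \mu(C_r)) - \operatorname{sim}(\vec{q}, \mu(C_{nr})) \right]

% (6.10): cosine similarity
\operatorname{sim}(\vec{q}, \vec{v}) = \frac{\vec{q} \cdot \vec{v}}{|\vec{q}|\,|\vec{v}|}

% For a fixed-length \vec{q} the objective is linear in \vec{q}, so it is maximized
% by pointing \vec{q} along the difference of the (normalized) centroids, giving
% (9.2) up to scaling:
\vec{q}_{opt} = \mu(C_r) - \mu(C_{nr})
  = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j
  - \frac{1}{|C_{nr}|} \sum_{\vec{d}_j \in C_{nr}} \vec{d}_j
```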

The Rocchio Algorithm

  • The book states that relevance feedback has been shown to be useful for increasing recall in situations where recall is important, and one of the reasons is that the technique expands the query. Is this always true? When I expand a query and do a Boolean match of the query terms with AND, adding more terms can increase precision but not necessarily recall. Now, I know that in this case the model is a VSM and not a Boolean one, but is it always true that adding terms to the query increases recall?
  • Besides, I don't clearly understand the Ide dec-hi variant (a sketch of the basic Rocchio update, which as far as I know Ide dec-hi modifies, is below).
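A minimal sketch of the Rocchio update as I read it in the chapter; the toy vectors and the parameter values are illustrative, not taken from the book's examples.

```python
import numpy as np

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio relevance feedback: move the query toward the centroid of the
    relevant documents and away from the centroid of the nonrelevant ones."""
    q = alpha * np.asarray(query, dtype=float)
    if relevant:
        q += beta * np.mean(relevant, axis=0)
    if nonrelevant:
        q -= gamma * np.mean(nonrelevant, axis=0)
    # negative term weights are usually clipped to zero
    return np.maximum(q, 0.0)

# Toy 4-term vocabulary; the weights are invented tf-idf values
q0 = [1.0, 0.0, 1.0, 0.0]
rel = [[0.9, 0.1, 0.8, 0.0], [0.7, 0.0, 0.9, 0.1]]
nonrel = [[0.0, 0.8, 0.1, 0.9]]
print(rocchio(q0, rel, nonrel))
```

Ide dec-hi, as far as I understand it, keeps this shape but drops the normalizations and subtracts only the single highest-ranked nonrelevant document.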

[week 6] Muddiest Points

This week's lecture was about Evaluation of IR Systems. In addition to the weekly book reading, we read two papers:
  • Karen Sparck Jones. What's the value of TREC: is there a gap to jump or a chasm to bridge? ACM SIGIR Forum, Volume 40, Issue 1, June 2006.
  • Kalervo Järvelin, Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS), Volume 20, Issue 4 (October 2002), pages 422-446.
TREC
Through the IIR book chapter and the first paper, I was introduced to TREC (Text REtrieval Conference):
1. I was a little surprised, when reading the paper about TREC, by the author's statement: <<...this progress... has not so far enabled the research community it represents to say: 'if your retrieval case is like this, do this' as opposed to 'well, with tuning, this sort of thing could serve you alright'.>> Despite Google's success, is this still the feeling in the IR community?
2. What are the sources used to generate the content of the biology and law tracks of TREC?
3. How do they (the TREC consortium) decide to stop a track or to create a new one?

Interpolation on P-R graphs
About the precision-recall graphs, I wonder whether the interpolation process may, in some cases, produce a misleading picture of the performance of an IR system. (A small sketch of interpolated precision is below.)
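To make the question concrete, here is a minimal sketch of the standard interpolation rule as I understand it (interpolated precision at recall level r is the maximum precision at any recall level greater than or equal to r). The precision/recall points are invented.

```python
# Interpolated precision: p_interp(r) = max precision at any recall level >= r.
# The (recall, precision) points below are invented.
points = [(0.1, 0.9), (0.2, 0.6), (0.3, 0.7), (0.5, 0.4), (0.8, 0.45), (1.0, 0.2)]

def interpolated_precision(points, r):
    return max((p for rec, p in points if rec >= r), default=0.0)

for r in [i / 10 for i in range(11)]:
    print(f"recall={r:.1f}  p_interp={interpolated_precision(points, r):.2f}")
```

Note how the dip in precision at recall 0.2 disappears after interpolation; that is exactly the kind of smoothing I was wondering about.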

Tuesday, February 10, 2009

[week 6] Reading Notes

After reading chapter 8 of IIR and the paper about the TREC collection, I started to look more deeply into the different domains covered by this project. It was especially interesting to find a TREC Genomics track and also a TREC Legal track.

About the first one, I read some papers and found out that the project ran from 2003 to 2007. In a paper written by the leader of the project, William Hersh, he states that the project was a success but there was not much advancement over the state of the art of IR. In his words:

<<...As with all TREC activity, the short cycle of experimentation and reporting of results has prevented more detailed investigation of different approaches. However, there emerged some evidence that some resources from the genomics/bioinformatics could contribute to improving retrieval, especially controlled lists of terminology used in query expansion, although their improvement over standard state-of-the-art IR was not substantial. ...>>

What is the real gain of creating these test datasets? Is there any new algorithm based on experiments using TREC data, for example?

Another question I raised was whether there is research on automatically creating queries with NLP from the description of the user need.

Tuesday, February 3, 2009

[week 5] Reading Notes

Two questions came up this week:

1. The binary independence model assumes that terms occur in documents independently, and the authors say that even though the assumption is not right, in practice the model performs satisfactorily on some occasions. Is there any explanation for this result? Is this practical evidence limited to English, or does it also hold for Chinese and Arabic? (I write out where the independence assumption enters after question 2.)

2. In chapter 12, the authors say that most of the time the STOP and (1 - STOP) probabilities are omitted from the language model. If this means we are no longer modeling a well-formed language (according to Equation 12.1), why do the authors do this?
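For question 1, here is where I understand the independence assumption to enter, written from the chapter's definitions (my notation may differ slightly from the book's):

```latex
% Documents are binary term-incidence vectors \vec{x}; term occurrences are
% assumed independent given relevance (this is the assumption in question):
P(\vec{x} \mid R=1, q) = \prod_t P(x_t \mid R=1, q), \qquad
P(\vec{x} \mid R=0, q) = \prod_t P(x_t \mid R=0, q)

% Ranking by the odds of relevance then collapses to a sum of per-term weights
% over the query terms present in the document (the retrieval status value):
RSV_d = \sum_{t \in q \cap d} \log \frac{p_t (1 - u_t)}{u_t (1 - p_t)},
\qquad p_t = P(x_t = 1 \mid R=1, q), \quad u_t = P(x_t = 1 \mid R=0, q)
```

The claim in the chapter, if I read it correctly, is that ranking only needs the relative order of these RSV scores, so even badly estimated (and clearly dependent) terms can still produce a reasonable ordering.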

See you on Thursday in class!

Monday, February 2, 2009

[week 4] Muddiest Points

I always write long posts, but this week I'll make a long story short:
  • In slide 47, the professor presented the Lucene term weighting and the SMART term weighting... what is the term weighting used by Lemur?
  • I have been investigating gene databases, and I wonder whether Boolean search or the vector space model is used for searching these collections. Is there any other popular way to index this kind of data?
  • In slide 67, weights for the terms in a query are "guessed"... what are the most popular ways to estimate their importance, i.e., their weights? Besides, is there any common scale (0-1, 1-10) used for this purpose? (One common answer, tf-idf weighting of query terms, is sketched after this list.)
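A minimal sketch of that tf-idf idea, roughly SMART-style logarithmic tf times idf without the length normalization; the toy collection and query are made up.

```python
import math
from collections import Counter

# Toy collection; in practice the document frequencies come from the index
docs = [
    "gene expression in yeast",
    "yeast protein interaction database",
    "gene database search",
]
N = len(docs)
df = Counter(t for d in docs for t in set(d.split()))

def tfidf_query_weights(query):
    """tf-idf weights for query terms: (1 + log tf) * log(N / df)."""
    tf = Counter(query.split())
    return {
        t: (1 + math.log(tf[t])) * math.log(N / df[t])
        for t in tf
        if df[t] > 0
    }

print(tfidf_query_weights("gene database"))
```

On the scale question: nothing here forces a 0-1 range; in practice the weights only get normalized when the whole query vector is length-normalized for cosine similarity.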

Tuesday, January 27, 2009

[week 4] Reading Notes

In the last paragraph of chapter 1, the authors talk about the practical use of regular expressions for searching. I wanted to test the behavior of web search engines (Google, Yahoo, MSN) in response to a simple query involving a wildcard: "hello" and then "hel*o". As a shell-style wildcard, the * means 0 or more characters (in a proper regular expression, hel*o would mean "he", zero or more "l"s, then "o"). The results are presented below (showing the first two results for each query in each search engine).

Query "hello":


Query "hel*o":
(sorry for posting screenshots, but blogger doesn't have a good table tool in its WYSIWYG editor)

First, I saw that there is some treatment of the * symbol, since there is no result with the exact string hel*o. Now, the question is whether it is expanded just as in a regular expression match.

If regular expressions as we know them were being applied by these search engines, I would have expected more results for hel*o than for hello, but it was the opposite.

It seems that search engines replace the * with things other than a single alphanumeric character. The results showed matches with special characters, such as - (hyphen), or with several characters concatenated together. However, they don't seem to substitute each letter of the alphabet for the * and then perform successive queries such as helao, helbo, helco, etc. (A small sketch contrasting the regex and wildcard readings of hel*o is below.)
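A minimal sketch of the difference between the two readings of hel*o, just to pin down what I would have expected; the word list is made up.

```python
import fnmatch
import re

words = ["hello", "helo", "heo", "helio", "hel-o", "helllo", "help"]

# Regex reading: * means zero or more of the preceding character ("l")
regex_matches = [w for w in words if re.fullmatch(r"hel*o", w)]

# Shell-style wildcard reading: * means zero or more of any character
wildcard_matches = [w for w in words if fnmatch.fnmatch(w, "hel*o")]

print("regex   :", regex_matches)     # ['hello', 'helo', 'heo', 'helllo']
print("wildcard:", wildcard_matches)  # ['hello', 'helo', 'helio', 'hel-o', 'helllo']
```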

Finally, I noted the difference in the number of returned results between search engines; it's incredible (Yahoo returned more than one and a half billion pages for the query hello, whereas Google returned 476 million and MSN 81 million).

Sunday, January 25, 2009

[week 3] Muddiest Points

My first muddiest point refers to the data structures used for access methods. Are hash tables, binary trees and B-trees the only options? Is there no other structure, like a linked list or a doubly linked list, commonly used for this purpose?

About the compression methods, I just wonder how indexers deal with character encodings, for example Unicode UTF-8, UTF-16 or ISO-8859. Should these encodings change the way IR systems index or compress the data? In which way? (A small sketch of how the byte size of the same text varies by encoding is below.)
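To give myself a concrete sense of why the encoding matters for compression, here is a minimal sketch comparing the byte length of the same strings under different encodings; the sample strings are arbitrary.

```python
# Same text, different byte sizes depending on the encoding,
# which is what the postings/dictionary compression ultimately sees.
samples = ["hello", "año", "下雨天留客"]

for text in samples:
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-le")
    latin1 = text.encode("iso-8859-1", errors="replace")
    print(f"{text!r}: utf-8={len(utf8)}B  utf-16={len(utf16)}B  iso-8859-1={len(latin1)}B")
```

So the "same" text can be a quite different byte stream depending on the encoding, which is why I assume indexers normalize everything to a single encoding (probably UTF-8) before indexing and compressing.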

Tuesday, January 20, 2009

[week 3] Reading Notes

This week we read about Index construction and Index compression. About the first topic, I have one doubt:
  • At the end of page 72, talking about dynamic indexing, the authors say "where n is the size of the auxiliary index". Maybe I just missed some part of the reading, but I don't see how they measure the size of the index: number of terms in the dictionary, number of postings, the sum of both, etc. I don't know.
In the second topic:
  • On page 79 it is said that "in this chapter we define a posting as a docID in a postings list"... does this imply that they didn't assume the same definition in previous chapters?
  • On page 81, Heaps' law suggests that "(i) the dictionary size continues to increase with more documents in the collection, rather than a maximum vocabulary size being reached". Is this always true, even for specific (and sometimes large) domains? (A quick back-of-the-envelope calculation with Heaps' law follows this list.)
  • In the description on pages 84-85 of the dictionary-as-a-string scheme as a replacement for fixed-width entries, it is clear that a good amount of space is saved. However, I think the cost of the operations for maintaining the index (deleting terms from the dictionary, adding or deleting postings) can be higher in the second case, making this solution not so good for dynamic collections.
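The Heaps' law calculation I had in mind. The parameter values k = 44 and b = 0.49 are the ones I remember the book fitting to Reuters-RCV1; the token counts are arbitrary.

```python
# Heaps' law: vocabulary size M = k * T^b, where T is the number of tokens.
# k and b below are textbook values (RCV1, if I remember right), not mine.
def heaps(tokens, k=44, b=0.49):
    return k * tokens ** b

for tokens in [1_000_000, 100_000_000, 10_000_000_000]:
    print(f"{tokens:>14,} tokens -> ~{heaps(tokens):,.0f} distinct terms")
```

Even at ten billion tokens the formula still predicts growth, which I take to be the point of statement (i).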
That's all this week, folks...

Monday, January 19, 2009

[week 2] Muddiest points

During this week's class, some of the points I addressed in my week-2 reading notes were discussed. I want to emphasize the discussion about indexing techniques for languages such as Chinese and Arabic. Although I am not a native English speaker, techniques such as tokenization and stemming make sense in my native language, Spanish, so it is not so difficult to figure out how they can improve an IR approach. But in a language like Chinese or Arabic, things are more difficult. Chinese often doesn't use spaces, so word segmentation does not guarantee a unique tokenization (example taken from the class slides; a toy segmenter sketch follows it):
下雨天留客天留我不留
can be:
  1. 下雨,天留客,天留,我不留!It’s raining. The lord wants the guest to stay. Although the lord wants, I don't
  2. 下雨天,留客天,留我不?留!It's a raining day. It is a day to let the guest stay. Will you let me stay? Stay!
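To see how a naive segmenter can pick the wrong reading, here is a toy greedy maximum-matching segmenter. I use an invented English-letter example and lexicon because I can't judge Chinese segmentations myself; the failure mode (the longest match committing to the wrong word) is the same idea.

```python
# Greedy maximum matching: repeatedly take the longest lexicon word that
# matches at the current position, falling back to a single character.
lexicon = {"the", "them", "theme", "men", "dine", "in", "here"}
max_len = max(len(w) for w in lexicon)

def max_match(text):
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
        else:  # no lexicon word matches: emit a single character
            tokens.append(text[i])
            i += 1
    return tokens

# Intended reading: "the men dine here" -- greedy matching commits to "theme"
print(max_match("themendinehere"))  # ['theme', 'n', 'dine', 'here']
```

With real Chinese the lexical ambiguity is much worse, which I take to be why the slide example admits two defensible readings.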

Besides, the Arabic language makes frequent use of infixes (another example taken from the class slides).

The open research question on the slide is: what's the most effective stemming strategy in Arabic? At this point I think that, more than stemming, an approach based on context (noting the surrounding words) can be more successful when implementing IR for Arabic. For Chinese, stemming doesn't make much sense, since it is not a morphologically rich language.

Tuesday, January 13, 2009

[week 2] Reading Notes

I will start this post with a comment on section 1.2 of IIR. The authors show the basics of building an inverted index, and at the end of the third paragraph on page 9 they say that "this inverted index structure is essentially without rival as the most efficient structure for supporting ad hoc text search". In the first class we saw that ad hoc text search is just one of the retrospective information needs. So, is this index not the best structure for other user needs, such as comprehensive exploration or question answering? What are the other indexing structures, and how are they suited to other kinds of user needs? (A tiny sketch of the basic inverted index construction is below.)
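For reference, a minimal sketch of the construction the chapter describes; tokenization here is just a lowercase split, and the three toy documents are meant to echo the small example collection in chapter 1.

```python
from collections import defaultdict

# Toy collection
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}

# Build the inverted index: term -> sorted list of docIDs (the postings list)
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)
index = {term: sorted(ids) for term, ids in index.items()}

# A Boolean AND query is then an intersection of postings lists
def and_query(*terms):
    postings = [set(index.get(t, [])) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(index["sales"])             # [1, 2, 3]
print(and_query("home", "july"))  # [2, 3]
```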

In section 2.1.2, Choosing a document unit, the authors explain that sometimes very long documents can be split for the purpose of indexing. In fact, I now recall that when I have wanted to update my Outlook Express, I have had to update just one file (.pst extension) which is supposed to contain everything (contacts, e-mail messages, etc.)... how is this file indexed in memory?

Section 2.2.1, Tokenization, raises some questions, such as: do search engines like Google index documents written in different languages in different ways? If so, how do they decide the indexing technique for documents written in more than one language with very different characteristics, like Chinese, Arabic or English? Do they index the same document with different approaches, or index parts of the file as separate documents?

I also want to comment on section 2.2.3, Normalization. I am from Chile and I speak Spanish, and as I have been working with computers for at least 11 years, it's common for me to write queries without accents or diacritics, as stated in the first paragraph of page 28. Indeed, I sometimes limit the names of files and folders to 8 characters (as in the old DOS times, lol). (A small sketch of diacritic removal is below.)
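A minimal sketch of the kind of normalization the section describes, removing diacritics so that accented and unaccented forms match; the example words are mine.

```python
import unicodedata

def strip_diacritics(text):
    """Decompose characters (NFD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Spanish queries typed with and without accents normalize to the same form
print(strip_diacritics("peña"))      # pena
print(strip_diacritics("canción"))   # cancion
print(strip_diacritics("cancion"))   # cancion
```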

[week 1] Muddiest Points

After reviewing my notes and checking the video of the class (really good! I didn't know about Panopto before), here are my muddiest points for week 1:

1. We discussed various definitions during the class such as Information, the DIKW (Data, Information, Knowledge and Wisdom) hierarchy and, of course, the definition of Information Retrieval. The definition selected was one from Nicholas J. Belkin (1980):
Information retrieval is a problem-oriented discipline, concerned with the problem of the effective and efficient transfer of desired information between human generator and human user.
I can infer that what is better known as "user needs" in IR is described here as "desired information", but this definition puts more emphasis on the "effective and efficient transfer" of information than on the different kinds of "user needs". I wonder how the focus of IR research has changed over the years; I mean, which parts of the IR field researchers have concentrated on before and after the creation of the World Wide Web.

2. We also saw different approaches to appreciating how fast the generation of information is growing: comparing the size in TB of Lexis-Nexis and DIALOG with the Library of Congress, a graph showing the year-by-year host count on the Internet, and the information stored in different media (digital devices, books, etc.). Anyway, I wonder what methodology was used to obtain the 5,187,130 TB of data recorded on magnetic devices in 2002, as described by Lyman and Varian of UC Berkeley.

3. Looking at the IR problem just from the user perspective, what we see is the need for information, which can be classified as retrospective or prospective. Are there other classifications of "user needs" that can influence different paradigms or algorithms of Information Retrieval?

4. Finally, I would just like to point out that the types of document representation mentioned in class (Bibliographic, Full-text and Directory) seem a little strange as a classification. My feeling is that "Directory" is just part of the "Bibliographic" representation... one of the "metadata" fields of a bibliographically represented document can be the categories from a "Directory" category path.