Tuesday, February 17, 2009

[week 7] Reading Notes

Relevance Feedback

  • In section 9.1, the authors write: “RF can also be effective in tracking a user’s evolving information need”. I am not so sure about this: if the user understands her information need better, she will probably reformulate her query instead of waiting for the system to recalculate a result set.

Foundations of the Rocchio Algorithm

  • I don’t understand how formula (9.2),

        \vec{q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j \;-\; \frac{1}{|C_{nr}|} \sum_{\vec{d}_j \in C_{nr}} \vec{d}_j

    is derived from formula (6.10) for vector similarity.
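My rough guess at the missing step, assuming (as in chapter 6) that sim is the cosine of (6.10) and writing \mu(C_r), \mu(C_{nr}) for the centroids of the relevant and nonrelevant documents: the optimal query maximizes

    \text{sim}(\vec{q}, \mu(C_r)) - \text{sim}(\vec{q}, \mu(C_{nr})) = \frac{\vec{q}}{|\vec{q}|} \cdot \left( \frac{\mu(C_r)}{|\mu(C_r)|} - \frac{\mu(C_{nr})}{|\mu(C_{nr})|} \right)

and a dot product of a unit vector with a fixed vector is largest when the unit vector points in the fixed vector’s direction, so the optimal query is the difference of the (normalized) centroids. Equation (9.2) seems to drop the normalizing lengths; I’d like to check this in class.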

The Rocchio Algorithm

  • The book states that relevance feedback has been shown to be useful for increasing recall in situations where recall is important, and one of the reasons given is that the technique expands the query. Is this always true? If I expand the query and do a Boolean match of the query terms with AND, adding more terms can increase precision but not necessarily recall. I know that chapter 9 is about the VSM and not a Boolean model, but is it always true that adding terms to a query increases recall? (See the little sketch after this list.)
  • Besides, I don’t clearly understand the Ide dec-hi variant.
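To convince myself about the Boolean case, I wrote a tiny sketch. The documents and the “expand then AND” strategy are my own toy example, not anything from the book:

    docs = {
        "d1": {"car", "insurance"},
        "d2": {"car", "auto", "insurance"},
        "d3": {"auto", "policy"},
    }
    relevant = {"d1", "d2", "d3"}  # pretend all three documents are relevant

    def boolean_and(query):
        # Boolean conjunctive matching: every query term must appear
        return {d for d, terms in docs.items() if query <= terms}

    def vsm_match(query):
        # crude stand-in for the VSM: any term overlap yields a nonzero score
        return {d for d, terms in docs.items() if query & terms}

    original = {"car"}
    expanded = {"car", "auto"}  # expansion adds a synonym

    for q in (original, expanded):
        r_and = len(boolean_and(q) & relevant) / len(relevant)
        r_vsm = len(vsm_match(q) & relevant) / len(relevant)
        print(sorted(q), "AND recall:", round(r_and, 2), "VSM recall:", round(r_vsm, 2))

Running this, expansion raises recall from 2/3 to 3/3 under the overlap score but drops it from 2/3 to 1/3 under AND: with conjunctive semantics, adding a term can only shrink the result set, so the recall gain from expansion must come from disjunctive or weighted matching.

On Ide dec-hi, what I could gather is that it is the Rocchio update with all weights set to 1 and with only the single highest-ranked nonrelevant document subtracted, instead of the centroid of all nonrelevant documents; I would like to confirm this in class.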

[week 6] Muddiest Points

This week's lecture was about Evaluation of IR Systems. In addition to the weekly book reading, we read two papers:
  • Karen Sparck Jones. “What’s the value of TREC: is there a gap to jump or a chasm to bridge?” ACM SIGIR Forum, Volume 40, Issue 1, June 2006.
  • Kalervo Järvelin, Jaana Kekäläinen. “Cumulated gain-based evaluation of IR techniques.” ACM Transactions on Information Systems (TOIS), Volume 20, Issue 4, October 2002, pages 422–446.
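The cumulated gain measures in the second paper were new to me, so before class I sketched them in code to check my understanding (the log-base parameter b=2 and the toy gain vector are my own choices; the paper allows other discount bases):

    import math

    def dcg(gains, b=2):
        # Discounted cumulated gain (Järvelin & Kekäläinen):
        # the gain at rank i is divided by log_b(i) once i >= b.
        total = 0.0
        for i, g in enumerate(gains, start=1):
            total += g if i < b else g / math.log(i, b)
        return total

    def ndcg(gains, b=2):
        # normalize by the DCG of the ideal (descending) ordering
        return dcg(gains, b) / dcg(sorted(gains, reverse=True), b)

    print(round(ndcg([3, 2, 3, 0, 1, 2]), 3))

(Here the ideal vector is just the same gains re-sorted, which assumes the ranking retrieved every relevant document; the paper builds the ideal vector from the full recall base.)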
TREC
Through the IIR book chapter and the first paper, I was introduced to TREC (Text REtrieval Conference):
1. I was a little surprised when reading the paper about TREC by this claim of the author: <<...this progress... has not so far enabled the research community it represents to say: 'if your retrieval case is like this, do this' as opposed to 'well, with tuning, this sort of thing could serve you alright'.>> Despite Google's success, is this still the feeling in the IR community?
2. What are the sources used to generate the content of the Genomics and Legal tracks of TREC?
3. How do the TREC organizers decide to stop a track or to create a new one?

Interpolation on P-R graphs
Regarding precision-recall graphs, I wonder whether the interpolation process may, in some cases, produce a misleading picture of an IR system's performance.
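To make the worry concrete, here is my reading of interpolated precision from chapter 8, tried on some made-up precision-recall points:

    def interp_precision(points, r):
        # IIR ch. 8: interpolated precision at recall level r is the
        # maximum precision observed at any recall level >= r
        return max((p for rec, p in points if rec >= r), default=0.0)

    # (recall, precision) pairs collected while walking down a toy ranked list
    pr = [(0.2, 1.0), (0.4, 0.67), (0.4, 0.5), (0.6, 0.5), (0.8, 0.44), (1.0, 0.5)]
    for r in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(r, interp_precision(pr, r))

Because of the max over everything to the right, the single good point at recall 1.0 props the interpolated curve up at recall 0.75 as well (0.5 instead of the measured 0.44), which is exactly the kind of flattering effect I was wondering about.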

Tuesday, February 10, 2009

[week 6] Reading Notes

After reading chapter 8 of IIR and the paper about the TREC collection, I started to look more deeply into the different domains covered by this project. It was especially interesting to find the TREC Genomics track and also the TREC Legal track.

About the first one, I read some papers about it and found out that the track ran from 2003 to 2007. In this paper written by the leader of the project, William Hersh, he states that the project was a success but that there was not much advancement over the state of the art of IR. In his words:

<<...As with all TREC activity, the short cycle of experimentation and reporting of results has prevented more detailed investigation of different approaches. However, there emerged some evidence that some resources from the genomics/bioinformatics could contribute to improving retrieval, especially controlled lists of terminology used in query expansion, although their improvement over standard state-of-the-art IR was not substantial. ...>>

What is the real gain in creating these test datasets? Is there any new algorithm based on experiments using TREC data, for example?

Another question I raised was whether there is any research on automatically creating queries, using NLP, from the description of the user's information need.

Tuesday, February 3, 2009

[week 5] Reading Notes

Two questions arose this week:

1. The binary independence model assumes that terms occur in documents independently, and the authors say that, although the assumption is not right, in practice the model performs satisfactorily on some occasions. Is there any explanation for this result? Is this practical evidence only for English, or does it also hold for Chinese and Arabic? (See my notes after question 2.)

2. In chapter 12, the authors say that most of the time the STOP and (1 − STOP) probabilities are omitted from the language model. If this means the model no longer defines a well-formed language (according to Equation 12.1), why do the authors do this? (A sketch follows below.)
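For question 1, this is the independence assumption as I read it in chapter 11, with term occurrences modeled as Bernoulli variables that are independent of each other given relevance:

    P(\vec{x} \mid R=1, \vec{q}) = \prod_{t} P(x_t \mid R=1, \vec{q})

For question 2, here is a toy sketch of unigram query-likelihood ranking with the STOP factors omitted (the documents echo the frog/toad example from chapter 12; the Jelinek-Mercer smoothing and the rest are my own choices). Since every document model is scored against the same fixed-length query, a constant length-dependent factor would not change the ranking, which I suspect is why the authors drop it:

    from collections import Counter

    def query_likelihood(query, doc, collection, lam=0.5):
        # P(q | M_d) under a smoothed unigram model, STOP omitted
        doc_tf, col_tf = Counter(doc), Counter(collection)
        score = 1.0
        for t in query:
            p_doc = doc_tf[t] / len(doc)
            p_col = col_tf[t] / len(collection)
            score *= lam * p_doc + (1 - lam) * p_col
        return score

    d1 = ["frog", "said", "that", "toad", "likes", "frog"]
    d2 = ["toad", "likes", "frog"]
    collection = d1 + d2
    for d in (d1, d2):
        print(query_likelihood(["frog", "likes"], d, collection))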

See you on Thursday in class!

Monday, February 2, 2009

[week 4] Muddiest Points

I always write long posts, but this week I'll make a long story short:
  • In slide 47, the professor presented the Lucene term weighting and the SMART term weighting... which is the term weighting used by Lemur?
  • I have been investigating gene databases, and I wonder whether Boolean search or the vector space model is used for searching these collections. Is there any other popular way to index this kind of data in particular?
  • In slide 67, weights for terms in a query are “guessed”... what are the most popular ways to guess their importance, i.e., their weighting? Besides, is there any common scale (0-1, 1-10) used for this purpose? (See the sketch below.)
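While waiting for an answer, here is the most standard guess I know of for term weights: SMART-style “ltc” tf-idf with cosine normalization (my illustration only, not necessarily what Lemur or the slides use):

    import math

    def ltc_weights(term_counts, df, N):
        # SMART ltc: logarithmic tf, idf, then cosine normalization.
        # term_counts: term -> raw count in the query or document
        # df: term -> document frequency; N: number of documents
        w = {t: (1 + math.log10(tf)) * math.log10(N / df[t])
             for t, tf in term_counts.items() if tf > 0 and df.get(t)}
        norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
        return {t: x / norm for t, x in w.items()}

    print(ltc_weights({"gene": 2, "database": 1},
                      {"gene": 120, "database": 800}, 10000))

One nice side effect of the cosine normalization is that the weights land in [0, 1], which partially answers my own question about scales.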