This week the topic was "Web Search", and here are my two cents:
1. When did search engines (before Google) start to consider the link structure of the web as an important signal to incorporate into their ranking algorithms? I know the HITS algorithm performs link analysis, but when did commercial search engines start using it seriously? I want to find out whether this was the drop that tipped the scale in favor of Google. (A sketch of what HITS actually computes follows these questions.)
2. In the bow-tie model of the Web (a strongly connected core plus incoming and outgoing components), how is the 22% of disconnected pages calculated? If those pages are disconnected from the rest, how can we be sure the real figure is not higher or lower? (A component-counting sketch follows below.)
3. How does a search engine decide where to start crawling? What are the most common heuristics for making this decision? (A crawl-ordering sketch follows below.)
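
On question 1, to make "link analysis" concrete: here is a minimal power-iteration sketch of Kleinberg's HITS on a toy link graph. The graph and the iteration count are illustrative assumptions, not how any commercial engine actually implemented it.

```python
import numpy as np

def hits(adjacency: np.ndarray, iterations: int = 50):
    """Power-iteration sketch of Kleinberg's HITS.

    adjacency[i, j] == 1 means page i links to page j.
    Returns (hub_scores, authority_scores), each L2-normalized.
    """
    n = adjacency.shape[0]
    hubs = np.ones(n)
    auths = np.ones(n)
    for _ in range(iterations):
        # A page is a good authority if good hubs point to it...
        auths = adjacency.T @ hubs
        # ...and a good hub if it points to good authorities.
        hubs = adjacency @ auths
        auths /= np.linalg.norm(auths)
        hubs /= np.linalg.norm(hubs)
    return hubs, auths

# Toy graph: pages 0 and 1 both link to page 2, so page 2
# should come out as the strongest authority.
A = np.array([[0, 0, 1],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)
hubs, auths = hits(A)
print("hubs:", hubs.round(3), "authorities:", auths.round(3))
```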
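
On question 2, my understanding is that figures like that 22% come from taking a large crawl snapshot, treating every hyperlink as undirected, computing weakly connected components, and counting the pages that fall outside the giant component; the answer is only as good as the crawl's coverage, which is exactly why the figure could be off in either direction. A rough union-find sketch of that computation (the toy graph is made up, and this is not the actual pipeline from the bow-tie study):

```python
from collections import defaultdict

def disconnected_fraction(n: int, edges: list[tuple[int, int]]) -> float:
    """Share of pages outside the giant weakly connected component.

    n     -- number of pages in the crawl sample
    edges -- (src, dst) hyperlinks between those pages
    """
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for src, dst in edges:
        root_a, root_b = find(src), find(dst)
        if root_a != root_b:
            parent[root_a] = root_b

    # Tally component sizes; the largest is the "giant" component.
    sizes = defaultdict(int)
    for node in range(n):
        sizes[find(node)] += 1
    return 1.0 - max(sizes.values()) / n

# Toy sample: pages 0-3 form one component, 4-5 another, 6 is isolated,
# so 3 of the 7 pages lie outside the giant component.
print(disconnected_fraction(7, [(0, 1), (1, 2), (2, 3), (4, 5)]))
```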
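
On question 3, one cheap heuristic I have seen described is to seed the frontier with hand-picked hubs (directories, popular sites, submitted URLs) and then prioritize URLs by how many in-links the crawl has seen for them so far; real crawlers also weigh things like PageRank estimates and per-domain politeness. A sketch under those assumptions, where `fetch_links` is a hypothetical stub that downloads a page and returns its outlinks:

```python
import heapq

def crawl(seeds, fetch_links, max_pages=100):
    """Seed-and-prioritize crawl ordering (in-link-count heuristic)."""
    in_links = {url: 1 for url in seeds}     # in-link counts seen so far
    frontier = [(-1, url) for url in seeds]  # max-heap via negated counts
    heapq.heapify(frontier)
    visited = set()

    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue  # duplicate heap entry; URL already crawled
        visited.add(url)
        for out in fetch_links(url):
            in_links[out] = in_links.get(out, 0) + 1
            # Re-push with the updated count instead of editing the heap.
            heapq.heappush(frontier, (-in_links[out], out))
    return visited

# Toy web: a directory-style seed page links out to two sites.
toy_web = {
    "dmoz.org": ["a.com", "b.com"],
    "a.com": ["b.com"],
    "b.com": [],
}
print(crawl(["dmoz.org"], lambda url: toy_web.get(url, [])))
```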