Monday, January 19, 2009

[week 2] Muddiest points

During this week's class, some of the points I raised in my week-2 reading notes were discussed. I want to highlight the discussion about indexing techniques for languages such as Chinese and Arabic. Although I am not a native English speaker, techniques such as tokenization and stemming make sense in my native language, Spanish, so it is not hard for me to see how they can improve an IR approach. But in a language such as Chinese or Arabic, things are more difficult. Chinese is written without spaces between words, so word segmentation does not guarantee a unique tokenization (example taken from the class slides):
下雨天留客天留我不留
can be:
  1. 下雨,天留客,天留,我不留! It's raining. The lord wants the guest to stay. Although the lord wants it, I don't.
  2. 下雨天,留客天,留我不?留! It's a rainy day. It is a day to let the guest stay. Will you let me stay? Stay!
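The ambiguity in this example can be reproduced with a small dictionary-based segmenter. This is only a sketch: the function name `segmentations` and the toy lexicon are my own assumptions for illustration (they are not from the slides), and a real segmenter would use a large dictionary or a statistical model and then rank the candidates.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into words found in `lexicon`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results


# Hypothetical toy lexicon, built only from the chunks of the two readings.
lexicon = {"下雨", "天留客", "天留", "我不留", "下雨天", "留客天", "留我不", "留"}
sentence = "下雨天留客天留我不留"

candidates = segmentations(sentence, lexicon)
for words in candidates:
    print(" / ".join(words))
```

Both readings from the slide show up among the candidates, so the dictionary alone cannot decide between them; some extra signal (word frequencies, surrounding context) is needed to pick one.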

Besides, Arabic makes frequent use of infixes (another example taken from the class slides).

The open research question on the slide is: what is the most effective stemming strategy in Arabic? At this point I think that, more than stemming, an approach based on context (looking at the surrounding words) could be more successful when implementing IR for Arabic. For Chinese, stemming doesn't make much sense, since the language has very little inflectional morphology.
