Tuesday, January 27, 2009

[week 4] Reading Notes

In the last paragraph of chapter 1, the authors talk about the practical use of regular expressions for searching. I wanted to see how web search engines (Google, Yahoo, MSN) respond to a simple query involving a regular expression: "hello" and then "hel*o". In a regular expression, * means zero or more repetitions of the preceding element, so hel*o matches heo, helo, hello, helllo, and so on (in shell-style wildcards, by contrast, * stands for zero or more arbitrary characters). The results are presented in the tables below (showing the first two results for each query in each search engine).

Query "hello":


Query "hel*o":
(sorry for posting screenshots, but blogger doesn't have a good table tool in its WYSIWYG editor)

First, I saw that the * symbol gets some special treatment, since no result contains the exact string hel*o. The question now is whether it is interpreted the way it would be in a regular expression match.

If these search engines handled regular expressions as we know them, I would have expected more results for the search hel*o than for hello, since every page matching hello also matches hel*o. But the opposite happened.
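To make that expectation concrete, here is a minimal sketch in Python using the standard re module. The word list is invented for illustration (it is not taken from any search engine), and whole-word matching is an assumption about how a query would be compared against page terms.

```python
import re

# In regex syntax, * applies to the preceding element:
# "hel*o" means "he", then zero or more "l", then "o".
pattern = re.compile(r"hel*o")

# Hypothetical vocabulary, made up for this example.
words = ["heo", "helo", "hello", "helllo", "help", "hel-o", "helao"]

# Words matched in full by the regular expression.
regex_matches = [w for w in words if pattern.fullmatch(w)]

# Words matched by the plain query "hello".
literal_matches = [w for w in words if w == "hello"]

print(regex_matches)    # ['heo', 'helo', 'hello', 'helllo']
print(literal_matches)  # ['hello']
```

Everything that matches hello also matches hel*o, so a true regex search could never return fewer results for hel*o than for hello.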

It seems that the search engines treat * as a placeholder for something other than a single alphanumeric character. In the results it stood for special characters, such as - (hyphen), or for more than one character, combined with alphanumeric characters. However, they don't seem to substitute each letter of the alphabet for * and then perform successive queries such as helao, helbo, helco, etc.
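For reference, the per-letter substitution that the engines do not appear to perform could be sketched like this in Python. The function name expand_single_letter is hypothetical, and restricting the alphabet to lowercase ASCII letters is an assumption made to keep the example small.

```python
import string

def expand_single_letter(query):
    """Replace the first * in the query with each lowercase letter in turn."""
    head, star, tail = query.partition("*")
    if not star:
        # No wildcard present: the query is its own only expansion.
        return [query]
    return [head + letter + tail for letter in string.ascii_lowercase]

candidates = expand_single_letter("hel*o")
print(candidates[:3])   # ['helao', 'helbo', 'helco']
print(len(candidates))  # 26
```

An engine working this way would issue 26 separate queries for one wildcard, and hello itself would be among them.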

Finally, the difference in the number of results returned by each search engine is incredible: Yahoo returned more than one and a half billion pages for the query hello, whereas Google returned 476 million and MSN 81 million.
