Thursday, January 28, 2010

Do search engines do stemming?

Hi all: In response to the question raised by Dr. Kambhampati today in class, I did a quick search comparing 4 search engines and here is what I got:

I tried a very short query: (1) boiled beancurd vs. (2) boil beancurd. And my observation based on this query is the following.

Google: YES
Cuil: NO
Bing: NO
Baidu: NO
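
As a point of reference, here is a minimal sketch of what stemming does to the two queries. The `toy_stem` rule below is a made-up suffix stripper for illustration only; it is not the Porter stemmer and says nothing about what any of these engines actually runs.

```python
def toy_stem(word):
    """Strip one common English suffix (hypothetical rule, illustration only)."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Keep at least 3 characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(query):
    """Split a query on whitespace and stem each term."""
    return [toy_stem(term) for term in query.split()]


q1 = normalize("boiled beancurd")
q2 = normalize("boil beancurd")
print(q1, q2, q1 == q2)
```

With this rule both queries normalize to the same terms, which is why an engine that stems would be expected to return largely overlapping results for (1) and (2).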

I'm providing the first 3 results returned by the search engine here just for comparison. I'd like to hear what you guys have found.

-Xiaolong


0: GOOGLE:

a) "boiled"
Vegetables and Vegetarian Dishes - Water-boiled Beancurd
2) Put box containing beancurd on fire, bring to a slight boil and remove ... (2) Before being put in stock, the pre-boiled beancurd should be drained lest ...
Boiled bean curd in hot sauce - Food & Wine - What's On Xiamen
Boiled bean curd in hot sauce Price: RMB 47. Menu Detail. Title:Boiled bean curd in hot sauce; Price:RMB 47; Date and Time: Lunch & Dinner ...
Tofu - Wikipedia, the free encyclopedia
Tofu (豆腐, tofu), or bean curd is a soft white food made by coagulating soy .... stirred into boiled soy milk until the mixture curdles into a soft gel. ...
b) "boil"

Yum Recipes : page 1 on Recipe that use bean curd
Bean Curd with Pineapple Cut drained bean curd into small cubes. Drain syrup from the pineapple, retaining 1 tb of it. Dissol... Buddhist Monk's Soup Boil ...
Vegetables and Vegetarian Dishes - Water-boiled Beancurd
2) Put box containing beancurd on fire, bring to a slight boil and remove ... (2) Before being put in stock, the pre-boiled beancurd should be drained lest ...
Korean Bean Curd Soup Recipes with Videos | ifood.tv

1: CUIL:

a) "boiled"

Masala Bites - Corn and Beancurd Porridge
Rinse beancurd and wipe dry. Heat 1/2 tbsp oil and saute ginger slice and diced kale. Put in beancurd cubes and bring to boil again. Stir well. Bring to boil. Sesame Oil - 1 pinch. Cornflou - 2 tbsp.
Food Reviewer - dawn - Review Comments
1) Crispy Fried Homemade Beancurd: This was an oldie but goodie, and a dependable, solid Singaporean dish. I believe you can get a reasonably-sized dish (at least 4 large pieces of beancurd) for only S$8 if you go down there. 2) Double Boiled In fact, I am a rather abnormal one. Nevertheless, I could appreciate this.
O N O F R E E. C O N S O L A C I...
Description: Presented here is a simplified version that is fast and easy to make, this is a highly flexible dish since most ingredients are just boiled. You may replace or add your favourite ingredients to the suggested ones below.

b) "boil"

Famous Authentic Chinese Recipes - Ma Po Beancurd
Trim the old skin off the beancurd, cut it into 1cm oblique dices, boil them in boiling water for a while to remove the taste of gypsum, scoop them out and drain. Enjoy cooking Chinese food with these free Chinese beancurd recipes! Famous Authentic Chinese recipe for Mapo Beancurd.

Boiled Shredded Dry Bean Curd - News Center - Tourochina.com
It is also called dried soybean curd threads with chicken soup. First, the bean curd is sliced as thin as hair, and then it is accompanied with sliced chicken and bamboo shoots.

DUNLOP March Cookbooks of Month: Vegetables and Bean Curd - Home Cooking - Chowhound
YOU MUST REMEMBER TO RINSE THE TIANJIN (AND SQUEEZE IT DRY) BEFORE YOU USE IT otherwise your dish will be way way too salty. Dunlop mentions this on page 27 of RC, but I haven't found a similar warning in LOP. We made a number of substitutions in the recipe so I could count it as "core" for Weight Watchers.

2: BAIDU:

a) "boiled"

Deep Fried Beancurd with Mushrooms
Cut the beancurd into slices 1 cm thick, scald in boiled water2)?? Drain water and dish up beancurd, mix well with 1 tbsp of light soy3)?? Soak..

Kelly 愿每一位有缘的朋友都能够圆梦 - Kelly - 网易博客
水饺Boiled dumplings 蒸饺Steamed dumplings 馒头Steamed buns 割包Steamed ... bean curd in casserole | beancurd stewed in earthen pot 砂锅豆腐 bean ...

bob's blog | binghoo.com
1) Cut the beancurd into slices 1 cm thick, scald in boiled water 2) Drain water and dish up beancurd, mix well with 1 tbsp of light soy 3) ...

b) "boil"

山东中粮粉丝杂豆进出口有限公司
(3) Fry broccoli and eggplant a while, remove; heat wok with oil, put in preserved beancurd and add water to boil, put in vermicelli, seasoning and...

永康外籍女佣联合服务网 - 专业申请外籍女佣、家庭看护工、家庭帮..
1.Cut beancurd into dices. 2.Boil white wine until reduced by half,pour chicken stock in to bring to boil beancurd for 3 minutes,starch and season...

英语婴幼儿食谱的儿童食谱
Add seasonings and beancurd and bring to boil.Add shrimps and simmer until done.Add thickenings and egg. Stir well and serve.Explanation: This dish is ...

3: BING:

a) "boiled"

KIS Early Year Program
Menu chicken soup hard-boiled bean curd japchae with bean sprout boiled dry cuttlefish radish cube kimchi
Recipes - Pork Stew With Bean Curd And Hard Boiled Eggs Recipe
Meat - Pork Stew With Bean Curd And Hard Boiled Eggs Recipe

Chinese Vegetarian Recipes - Bean Curd Rolls Recipe
A quick and easy dish where bean curd sheets are boiled in seasoned water and fried with roasted seaweed.


b) "boil"

Vegetables and Vegetarian Dishes - Water-boiled Beancurd
Bring it to a rolling boil, transfer beancurd into pan. When the stock boils again, thicken it with cornflour (diluted with water) and pour everything into a big soup bowl.

Tofu with Minced Pork | Delicious Asian Food
Once the water boils, place beancurd gently on top of the sauce / meat and cover the lid (in a way, you are steaming the beancurd whilst cooking the rest of the ingredients / gravy ...

Chinese Vegetarian Recipes - Bean Curd Rolls Recipe
Bring to a boil. Lower heat. Fold the bean curd sheets into 4 squares. Boil the folded sheets in the heated water for 1 minute, using chopsticks to move the sheets around the ...



Wednesday, January 27, 2010

TA office hours (as well as project part 1 preview)

The TA, Sushovan De, will hold office hours every Wed 2-3pm, starting today. He sits right across from my office in cubicle 557AD.

Project part 1 is ready for preview from the home page. We will formally kick it off next week, but you can look at the preview and see if you need to brush up on your Java skills etc.

Rao



Tuesday, January 26, 2010

Re: does google focus on generating diverse results...?

First of all, my characterization of relevance function in the class is a normative one--I am talking about how we should be doing ranking--not about how any specific IR program--let alone Google--does it.

As for Google, there is no claim/proof/statement on Google's part that they actually take result diversity into account. About the only research paper on Google is from 1998. Most of what Google does currently is not documented anywhere publicly.

That said, it does seem *empirically* like the Google ranking is taking result diversity into account somehow. The reason you see two links to "Recent papers from yochan" in the results is easy to explain.
The first one is presented once in the sub-menu--indented below the first result (or as what google calls "site links"), and once again right below. This is because I believe site-links are generated in a completely orthogonal process from the result ranking. Once a particular page is ranked at the top, if it has site links, they are just output.
Here is a link that explains how "site links" work.
http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=47334

rao
(who hopes this won't become a class about search engine optimization ;-)

ps: Here is a link to a research paper that talks about the issues involved in generating diverse results efficiently. We will probably read it/cover it at some point of time: http://www.wsdm2009.org/papers/p5-agrawal.pdf


On Tue, Jan 26, 2010 at 5:27 PM,  <dthiruma@asu.edu> wrote:
Hello Professor,

I have a question. As [you] discussed in class today, what I understood about the concept of relevance R(d | Q, U, {d1, d2, d3}) is that if you have already shown a relevant document d1 that is close to d, then d need not be displayed next (we can show some other document instead). This can be achieved by displaying documents that are maximally distinct, or by clustering.

My question is (or rather, I was wondering):

When I searched for Kambhampati in google today,

It displays "Recent search from yochan" twice in the first two results (pointing to the same link). Does that mean google does not check for closeness with the already displayed documents? I am confused.

Thanks,
Dananjayan Thirumalai

(blog comments welcome) Another way to view the approximation of R(d|Q,U,{d1...dk}) in traditional IR

[Sorry, the message accidentally got sent before it was completed... here is the full version]

Comments on the following are welcome

One possibly frustrating part of today's class is the impedance mismatch between the fact that the class is focused on how traditional IR is done and the fact that, while you may not know much about traditional IR, you do know about the ways in which it has been extended/used in search engines.

Here is a slightly alternative view to interpret how we get the traditional IR model from the "true" relevance model.

We agreed that Relevance R is a function of D, Q, U and the already shown docs {d1...dk}

The next question is how do we represent (or ignore representing) the various components:

Document D:
  -- can be represented in terms of "meaning" (too hard ;-)
  -- can be represented in terms of just the words it has
     -- can be seen as a "set" of keywords, a "bag" of keywords, or "shingles" of keywords

User U:
  -- can be represented in terms of the "properties" of the user
     -- interests of the user
     -- features (e.g. age, salary, domicile etc.) of the user
  -- can be "ignored"

Query Q:
  -- can be represented in terms of the full context of the query
  -- can be represented in terms of the keywords (i.e., a mini-document)
     and "context" properties (e.g. location of the query, current context of the query)
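
To make the three word-level document representations above concrete, here is a minimal sketch. The whitespace tokenizer, the function names, and the choice of two-word shingles are all assumptions made for illustration; real systems tokenize far more carefully.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def as_set(text):
    """'Set' of keywords: which words occur, ignoring counts and order."""
    return set(tokenize(text))


def as_bag(text):
    """'Bag' of keywords: words with their counts, ignoring order."""
    return Counter(tokenize(text))


def as_shingles(text, k=2):
    """'Shingles': the set of k-word windows, which keeps some local order."""
    words = tokenize(text)
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


doc = "boil the beancurd then boil the stock"
print(as_set(doc))
print(as_bag(doc))
print(as_shingles(doc))
```

The set forgets that "boil" occurs twice, the bag remembers the count but not the positions, and the shingles retain which words were adjacent.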

Traditional IR, as we shall see, uses a keyword representation for D and Q, and largely ignores U.

Given this representation, it makes perfect sense to think of relevance as the "similarity" between D and Q
(and to assess residual relevance in terms of dissimilarity between D and {d1...dk}).
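
One way to sketch this idea is shown below. It assumes a bag-of-words representation with cosine similarity, and it scores residual relevance in the style of maximal marginal relevance: similarity to Q minus similarity to the closest already-shown document. The `trade_off` weight and all function names are assumptions for illustration, not something fixed by the discussion above.

```python
import math
from collections import Counter


def bag(text):
    """Bag-of-words vector: term -> count, using naive whitespace tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bags of words."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def residual_relevance(doc, query, shown, trade_off=0.5):
    """Reward similarity to the query, penalize similarity to shown docs."""
    sim_to_query = cosine(bag(doc), bag(query))
    # With nothing shown yet there is no redundancy penalty.
    max_sim_to_shown = max(
        (cosine(bag(doc), bag(s)) for s in shown), default=0.0
    )
    return trade_off * sim_to_query - (1 - trade_off) * max_sim_to_shown


query = "boil beancurd"
shown = ["boil beancurd soup"]
duplicate = "boil beancurd soup"
novel = "beancurd recipe"
print(residual_relevance(duplicate, query, shown))
print(residual_relevance(novel, query, shown))
```

Before anything has been shown, the near-duplicate scores higher because it matches more of the query. Once an identical document is in {d1...dk}, the novel document overtakes it.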

Rao

ps: Regarding the "accuracy"/"trustworthiness" aspect, note that traditional IR assumes that the query is being run on a curated corpus where all documents are taken to be trustworthy, so this doesn't need to be modeled. Also, while there is a need to model it on the web, my own preference is to separate this from relevance. This is because trust/accuracy is a badge that the user has no way of verifying, while relevance is something they can judge locally.

Monday, January 25, 2010

An off-beat reading for tomorrow

You might check out the following short one-act play by Woody Allen--mostly because I keep thinking about it whenever I start teaching Information Retrieval (you will get an idea as to why once you read it...)

http://rakaposhi.eas.asu.edu/lincoln-query-woody-allen.pdf


Rao

Thursday, January 21, 2010

Average length of queries - increasing or decreasing?

http://www.submitexpress.com/news/shownews.php?article=1183

Average Search Query Length Increasing



According to Internet monitoring company Hitwise, the average length of search queries increased between January 2008 and January 2009. Search queries of five or more words saw a 10% year-over-year increase, versus the two percent YoY decline for search queries of only one to four words. Here's Hitwise's official data on the percentage of U.S. clicks by number of keywords:

* One-word queries fell from 20.96% in January 2008 to 20.29% in January 2009.
* Two-word queries fell from 24.91% in January 2008 to 23.65% in January 2009.
* Three-word queries fell from 22.03% in January 2008 to 21.92% in January 2009.
* Four-word queries increased from 14.54% in January 2008 to 14.89% in January 2009.
* Five-word queries increased from 8.20% in January 2008 to 8.68% in January 2009.
* Six-word queries increased from 4.32% in January 2008 to 4.65% in January 2009.
* Seven-word queries increased from 2.23% in January 2008 to 2.49% in January 2009.
* Queries of eight or more words saw the biggest jump, from 2.81% in January 2008 to 3.43% in January 2009.
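
The two aggregate figures quoted above can be re-derived from the per-length shares. The sketch below assumes the "10%" and "two percent" are relative changes in the combined share of each group (five or more words vs. one to four words); the article does not state how Hitwise computed them.

```python
# Hitwise shares of U.S. clicks by query length, in percent: (Jan 2008, Jan 2009).
# Key 8 stands for "eight or more words".
shares = {
    1: (20.96, 20.29),
    2: (24.91, 23.65),
    3: (22.03, 21.92),
    4: (14.54, 14.89),
    5: (8.20, 8.68),
    6: (4.32, 4.65),
    7: (2.23, 2.49),
    8: (2.81, 3.43),
}


def bucket_totals(lengths):
    """Sum the 2008 and 2009 shares over the given query lengths."""
    total_2008 = sum(shares[n][0] for n in lengths)
    total_2009 = sum(shares[n][1] for n in lengths)
    return total_2008, total_2009


def relative_change(before, after):
    """Year-over-year change as a percentage of the starting value."""
    return (after - before) / before * 100


short_change = relative_change(*bucket_totals(range(1, 5)))
long_change = relative_change(*bucket_totals(range(5, 9)))

print(f"1-4 words: {short_change:+.2f}%")
print(f"5+ words:  {long_change:+.2f}%")
```

Under that assumption the numbers line up with the summary: the combined one-to-four-word share drops by about 2%, and the five-or-more share rises by a little under 10%.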

Hitwise did not speculate about causes of longer search queries, but some industry experts have posited that advancing search capabilities among Internet users have proportionately increased the sophistication and length of their queries. The growing amount of content on the Internet may have also necessitated the use of longer queries. Whatever the impetus, expanded queries surely signal new opportunities for advertisers and webmasters who are seeking to capitalize on the growing Internet population, which currently tops 1.5 billion according to the latest statistics from Internet World Stats.

-Amanda Richter

Tuesday, January 19, 2010

[Thinking Cap Question]: Think of a world that never was and ask why not

[[[From time to time, I will send "thinking cap" questions on the class blog. The idea is that you
respond to the question with your thoughts on the blog (posted as a comment to the question). 
This will count towards "participation" credit, and also lets you share your class-related ideas with
other folks on the list.

As for how often you should feel compelled to respond vs. how deep your thoughts should be, I would use Woody 
Allen's philosophy on quantity vs. quality, expounded in the context of a
  slightly different situation (start at 3:30)]]]
 
Here is the first thinking cap. 
Post your answers to this "homework 1" question on the blog, so we can perhaps aggregate/discuss:


=========

Think of and list 3 queries (or activities) that you would like to do on the
Web that the current day search engines (e.g. Google) don't quite
support. 

A quote to get you inspired:

"Some people see things as they are and say why? I dream things that
never were and say why not?" 
-(Mis)attributed to Robert Kennedy
who paraphrased  Bernard Shaw
==========


post away
Rao


ps: David Bendit is the first to accept the invitation to the class blog. That kind of enthusiasm is usually grounds for either an A+ in the course or a piece of yummy candy (straight from the candy mountain).

[CSE494] Welcome to the class mailing list...

Dear all:

 If you are getting this email, then you know that

1. Class email list is working (Yeah!)
2. All the class emails are also archived at
     2.1 ("read-only") mail archive http://rakaposhi.eas.asu.edu/s10-cse494-mailarchive/threads.html
     2.2  ("read and write") class blog http://cse494-s10.blogspot.com/
3. You will be getting another mail "inviting" you to the class blog. Once you accept that, you can take part in the
     blog discussions (i.e., post)


All of these are accessible from the class page http://rakaposhi.eas.asu.edu/cse494

Also, please note that the class lecture notes as well as audio are available from the "lecture notes" part of the class web page.

regards
Rao