Tuesday, February 23, 2010

[Thinking cap] on anchor text/page importance etc. (with a carrot about homework 2 deadline extension)

Some of you have asked for an extension on the homework 2 due date. Since the requester set includes a non-zero number of the "rain-or-shine" brigade (i.e., those who actually show up to class regularly), I am willing to extend the deadline to next Tuesday (2nd March).

Here, however, is the catch. Since the NAACP says a mind is a terrible thing to waste, I want you to keep yours busy by thinking about (and *commenting on*) the following thinking-cap questions:

1. We said that the relative importance of the anchor text characterizations of a page P depends on how many other pages are pointing to P with that characterization. How should the "number of pages" be taken into account? Should the "type of pages" somehow matter? (i.e., does it matter *which* page is saying those things about page P?) If so, how do you propose it should be taken into consideration?


2. We had a slide about the page importance desiderata. Comment on
  2.1. To what extent is each of those desiderata actually subsumed by link-based analysis?
  2.2. In the old days, we used to put links to various pages because that was the easiest way to get back to them when you needed to. Now, with search engines getting more and more effective, there is not as much of a reason to put links to each other. How does this affect the utility of link-based analysis techniques for finding page importance?
  2.3. Can you give some examples of how current day search engines actually handle notions of importance that are not strictly subsumed by link-based analysis?


3. [The "I finally had an orgasm, but my doctor said it was the wrong kind" question]: At the start of IR discussion, we said what we are trying to compute is the "relevance" of a document d, given a user U and query Q. We then decided to approximate the relevance by a similarity computation between the document and the query (and spent the intervening weeks getting deeper and deeper into how best to compute this similarity). Now that we decided to throw in the notion of page importance, do you think this should be seen still as a part of relevance (just a more accurate computation of relevance..) or is it some other orthogonal dimension?  (extra credit: Why does the orgasm quote related to this question?).

4. [The "The woods will be silent indeed, if no birds sang except those that sing the best" flame war]:  Suppose you search google to find the exact quote and source of the bird quote by Henry Vandyke  you get frustratingly many minor variations of the quote (including one by yours truly, which attributes it to Henry David Thoreau! ). It looks as if letting the unwashed masses put up web pages is leading to all sorts of inaccurate information. Don't you think it would be simpler to go back to the  peace-and-quiet of the age of  poll-taxes and control web page creation?  (I know this is beginning to look suspiciously like a SAT essay prompt.... you can focus also on whether life will be more or less interesting for CS folks if the society were to go to this model.)

cheers
Rao

19 comments:

  1. 1. Weight based on the rank of the pages from which the anchors are coming.

  2. 1. The relative importance of a page should depend on the number of pages of a certain type that point to it. Only considering the number of pages linking to a page would not be very useful on the web, because it is easy to collect a large number of links from unrelated pages.

    2.1 Link analysis subsumes the identification of important pages that link to a page. Furthermore, the number of accesses a page gets can also be estimated to a certain extent, because hubs and authorities with high scores can be assumed to receive more accesses than pages with lower scores.
    2.2 This affects link-based analysis significantly, because it makes identification of authorities hard. Most ordinary pages have lower scores and link to hubs or authorities with higher scores; without those links, it would be hard to identify the real hubs and authorities in the network.
    2.3 Search engines can take advantage of the query logs generated from user searches. These can be used to identify pages relevant to a query by observing which queries were used to reach each page.


    3. Page importance attempts to bring the user into the picture by taking pages created by other users into account as a measure of importance. This is an orthogonal way of measuring relevance, because in this approach we use the information provided by other users about a webpage as the measure of its relevance. It is a way to find relevant documents for a user, but there is no guarantee that what others find useful will be useful to the user in question; hence the relation to the quote "I had an orgasm, but the doctor said it was the wrong kind".

    4. If the web were reverted to the old days, then the information on the web would be severely limited. The freedom every person enjoys in accessing/uploading information is a unique characteristic of all media associated with the web. Without this, the meaning of the web would be lost.

  3. 1. Depending solely on the number of pages that link to a page easily leads to activities such as Google bombing. It is simple to create a mock website with millions of pages that all use anchor text to manipulate a characterization of a web page. One way to account for the number of pages could be through a log of the number of links, with a hard cutoff. Another way could be to give each page on the internet an amount of "points" (which could be increased based on its score in the hub/authority model, or PageRank) which it then assigns to each of the links on the page. If the page has few links, then the links it does have would be more valuable. This would cut down on some link-spam sites, but it is also vulnerable to pages made with single links. However, when used with the hub/authority model, low-link, high-value web sites would have the most influence on the characterization. A rough sketch of this scheme follows.
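    A minimal sketch of the "points" idea above, assuming we already have some importance score (PageRank or a hub score) for each source page. The function, the triple format of anchor_links, and the log-plus-cutoff dampening are illustrative assumptions, not a prescribed formula:

        from collections import defaultdict
        from math import log

        # page_score: importance of each source page (e.g., PageRank or hub score).
        # anchor_links: (source_page, target_page, anchor_text) triples.
        def characterization_weights(page_score, anchor_links, max_votes=1000):
            out_degree = defaultdict(int)
            for src, _, _ in anchor_links:
                out_degree[src] += 1

            votes = defaultdict(int)       # raw count of pages making each claim
            weights = defaultdict(float)   # importance-weighted score
            for src, target, text in anchor_links:
                # Each source page splits its "points" evenly over its outlinks,
                # so one spam page with a million links contributes little per link.
                per_link = page_score.get(src, 0.0) / out_degree[src]
                weights[(target, text)] += per_link
                votes[(target, text)] += 1

            # Dampen the raw "number of pages" with a log and a hard cutoff.
            return {key: w * log(1 + min(votes[key], max_votes))
                    for key, w in weights.items()}

    Under this scheme a page with many outlinks dilutes its own vote, while a low-link, high-score page concentrates its influence, matching the intuition above.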

    2.1
    Link-based analysis is very sensitive to the link structure of the web, of course, but it has problems being sensitive to the access of the page, the query, and the user. As Shamanth said, the access of the page can be estimated with the hub/authority model from the number and quality of hubs that point to each page, as bigger and better hubs indicate more access to a page. However, this is not a definitive metric, as links do not necessarily mean visits.

    Since a page becomes more important in relation to the number of links pointing to it, small random queries will not destabilize the ranking.

    2.2
    Since there is not as much need to place links for navigation, this says to me that link-based analysis is becoming more important, as links to pages impart more information than simple navigation. If there is a specific link to a page (i.e., not just to the homepage of the site), then it means the linked page has information directly related to the current page. Unlike the old days, our links are inherently more useful, as each link was selected to serve a purpose.

    2.3
    By analyzing user search logs, the search engine can provide deeper results by discovering synonyms and replaceable words. For example, a user searching for "dog pictures" might subsequently search for "puppy pictures". Thus, the user has just taught the search engine that puppy and dog are synonymous. Multiply by a few billion searches, and there is quite a large search thesaurus. This is more useful than a simple English thesaurus, as it includes pretty much all possible replacements, including slang and other combinations that would not otherwise be included.
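    A minimal sketch of mining such a "search thesaurus" from session logs, assuming each session is just a list of consecutive query strings; the data format and the single-word-substitution heuristic are illustrative assumptions:

        from collections import defaultdict

        def reformulation_pairs(sessions):
            pair_counts = defaultdict(int)
            for session in sessions:
                for q1, q2 in zip(session, session[1:]):
                    w1, w2 = set(q1.split()), set(q2.split())
                    # If two consecutive queries differ by exactly one word,
                    # treat the swapped words as candidate synonyms.
                    if len(w1 - w2) == 1 and len(w2 - w1) == 1:
                        a, b = (w1 - w2).pop(), (w2 - w1).pop()
                        pair_counts[tuple(sorted((a, b)))] += 1
            return pair_counts

        # reformulation_pairs([["dog pictures", "puppy pictures"]])
        # -> {("dog", "puppy"): 1}; at web scale, high counts suggest synonymy.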

    3.
    The biggest effect of page rank I see is dealing with the adversarial aspect of the web. By pushing "bad" pages to the bottom of the ranking, hopefully the good pages can be found and the user is able to find the proper document. So, page importance is orthogonal to similarity, as changes in one are independent of the other; but when used together with weighting, they help the user find the best document.

  4. (Continued)
    4.
    It is always simpler to destroy information, but it is almost never good. The coming of the internet, with random people making pages on whatever they like, was long predicted to bring dishonesty and bad information. However, this did not happen, as we have automagically created ways to route around the "damage". While any random person can create information, they have trouble making anyone else see it, or at least not for very long. If the opposite were true, then the http://www.timecube.com/ guy would have us all convinced we live on a 4-day time cube. Good information on the internet comes from trust, and this trust comes from consistently good information. Any weirdo can become a blogger, but only bloggers with a history of good information and no bad information have large audiences.

    Also, consider Wikipedia. Anyone can edit it, yet the site manages to be amazingly effective and often has very reliable information. It is also built around a system of trust, as consistent editors who make good edits get moderation powers that a new user does not have. And if they abuse their powers or make bad edits, they are quickly found out, and all their past changes are treated as untrustworthy and reverted.

    However, there are some problems with this (which exist in the real world too, by the way) when someone trustworthy (Rao) says something untrustworthy. Whichever source has the most trust will ultimately win out, and its information will become the truth.

    So, yes, life would be simpler, but the internet is the way it is, and we have to deal with it. Just as the internet tends to route around damage and censorship, so too do the users route around bad data and untrustworthiness to reach the good sources.

  5. 1. Both the number of pages and the type of pages should be taken into account. The score could be computed from each page that contains a link with the same characterization, multiplied by that page's PageRank or authority value. All of the anchor text with the same characterization could then be added together, and the total compared with other anchor-text characterizations. The result would make more authoritative anchor text stand out.
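    A compact sketch of that aggregation, assuming we already have a PageRank (or authority) score per source page; the function name and the (source, target, characterization) triple format are illustrative assumptions:

        from collections import defaultdict

        def rank_characterizations(target, anchor_links, pagerank):
            totals = defaultdict(float)
            for src, tgt, characterization in anchor_links:
                if tgt == target:
                    # Each vote is weighted by the importance of the voter.
                    totals[characterization] += pagerank.get(src, 0.0)
            # The highest-scoring characterizations "stand out" as authoritative.
            return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)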
    2.
    2.1.
    The link structure of the web is obviously subsumed by link-based analysis. The number of accesses a page gets is related to it, in that when a page is linked to more by highly authoritative sites, or sites with a high PageRank, it is likely accessed more. Links do not necessarily imply accesses, though, and the number of accesses could be substantially higher than the links imply. Wikipedia is a good example: it is accessed a LOT, but in many cases it is not linked to in papers because it is not considered as reliable a source.
    2.2.
    This makes the utility of link-based analysis better. Because people are no longer putting up links just for ease of navigation, the links that do exist have more meaning. If a link were placed simply to make it easier to navigate to a page, it would have less correlation with the page it links to. With the way links are used today, if a page links to another one, the two more than likely contain similar information. This makes link-based analysis better.
    2.3.
    Current-day search engines can maintain a record of queries and of what each user found helpful. They can also keep track of information about the user and determine importance based on a user profile.

    3. I think that page importance should still be seen as part of relevance. The relevance of document d should be determined in part by the importance of document d. How relevant a document is to the user is determined in part by how authoritative it is. Because the internet is an adversarial place, people can create websites deliberately designed to draw people in even when they are not relevant. In order to determine the actual relevance of the document, the importance/authority must be part of the calculation.
    The orgasm quote is from the Woody Allen movie Annie Hall; his response is: “Did you have the wrong kind? I've never had the wrong kind. Ever. My worst one was right on the money.” In the same way, as long as the results of the query make the user happy and they get what they wanted, it doesn’t matter how the results are created.
    4. I do think it would be simpler in some ways to control web page creation; however, simpler is not always better. Search algorithms would not have to include authority or PageRank for websites, as all would be reputable, and there would be significantly fewer websites, so there would be fewer to search through. At the same time, it would make things less simple because less information would be available on the internet. When a student had a question on, say, calculating eigenvectors, they would have to find their old linear algebra book, or worse, go to the library to find the information. With the internet being uncontrolled, the student can instead simply Google the term “eigenvector” and read the answer online from multiple different sites. Also, from the perspective of a CS person, it would mean fewer jobs. If everything online were controlled, Google would probably not be hiring, because their job would be so easy. Instead, Google has to hire many people to combat the many people who make the internet more complex and adversarial.

  6. 1. People have been writing about weighting the documents based on their rank, and I think they're right. Another related way you could filter out the adversarial links would be to aggregate trusted domains. Educational sites or high profile sites come under fire when they put up inaccurate links, which means for the most part they're probably high quality.

    2. Link-based analysis covers the gateway of information and its transfer. As for the old days vs. now: because we don't need to link, we only link when we want to. The only time to link, then, is when we really want to make it as easy as possible for people to find what we're talking about. Therefore it would seem that links now are more important than before, since they're all created voluntarily and not out of necessity.

    3. As someone mentioned earlier, that Woody Allen quote is saying there's no such thing as a wrong orgasm. So as long as the user is getting what they want, it doesn't matter. But I think they're orthogonal in the sense that the page would be ranked independently of the search, and then, when computing relevance, that ranking would be taken into account to promote or demote supposedly relevant documents.

    4. That's just like saying children don't fall if they don't walk, so let's keep them crawling their whole lives. Or drivers don't get in accidents if they don't drive, so let's go back to walking. There are problems with the freedom the web affords, but they don't come near outweighing the benefits.

  7. 1. Yes, "types of pages" will matter. I think we can use the concept of Authorities/Hubs as explained in the class. If the page, which is referring the page P, has more number of hubs/Authorities count then it would add more importance to the page P.

    2.1 Authorities/hubs computation and eloquence/informativeness are subsumed by link-based analysis. However, the number of accesses to a page is subsumed only in the case of non-topical news. In the case of trustworthiness, link-based analysis may sometimes lead to incorrect conclusions.

    2.2 Why should we consider this as affecting link-based analysis? We may not want to access pages directly from Google search result pages every time. Linked pages also show the inter-relations among pages, whereas a search engine gives all results explicitly. I am not sure though :(

    2.3 If you type any query, Google suggests some options to modify the query based on its statistics. This is a way of handling importance in terms of the general tendency of users.

    3. I think the dimension space is still the same, not an orthogonal one; we are just trying to modify the weights which are used for measuring relevance. In traditional IR, there was no issue with the authenticity of the terms and documents. In IR on the web, we are trying to include factors which can contribute more towards such measures.
    Extra credit: this quote is trying to make us feel that whatever we did in the IR class before yesterday's lecture was somehow on the wrong track.

    4. This is an interesting question :) I would like to state one fact: more restrictions lead to less development. If we put such controls in place, then the statement 'NAACP says a mind is a terrible thing to waste' is applicable here also :P

  8. 1. For taking the "number of pages" into account, we can consider only those pages that have an authority score above a certain Threshold value..that is consider the reference of only those pages that themselves are important and are referred to by other sites. But this technique will be partial to the newly developed web pages that don't have any pages pointing to it. Also, the type of the page should also be taken into account! This can be done by considering the similarity of the two pages and considering the importance of the anchor text only if the similarity is above a certain value, which will ensure that only a related page could point to other pages.

  9. 3. I think that page importance should be seen as a part of relevance only, and not as another orthogonal dimension. As we already discussed in class, tf can now be seen as a global measure rather than a local one. So we can now think of documents as being indexed not only with their own contents, but also with the contents of the other pages describing them! The similarity value that we now get would therefore be a more accurate way of measuring the relevance of a document to the user query (assuming that the other pages have described this page accurately).
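    A minimal sketch of indexing a page with both its own text and the anchor text of pages pointing to it, as described above; the boost factor for anchor terms is an illustrative assumption:

        from collections import Counter

        def augmented_term_counts(page_text, incoming_anchor_texts, anchor_boost=2):
            counts = Counter(page_text.lower().split())
            for anchor in incoming_anchor_texts:
                for term in anchor.lower().split():
                    # Anchor terms act as "global" term frequency contributed
                    # by the other pages describing this one.
                    counts[term] += anchor_boost
            return counts

        # augmented_term_counts("welcome to the asu home page",
        #                       ["arizona state university", "asu"])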

    I guess the relation to the quote comes from the fact that what you may find relevant may not be relevant to others (the people who have put up links to this page).

  10. 1. We can take into account the diversity of the pages (or number of independent sources) that are pointing to page P. Domain names could be used for this purpose: something like one domain name corresponds to one independent source, and multiple links from the same domain get only a marginal/residual score. Further, the credibility of the pages should also be taken into account; hub scores could be one way of measuring the credibility of a page.
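    A minimal sketch of the domain-diversity idea: the first link from a domain counts fully, and further links from the same domain earn only a marginal, geometrically decaying credit. The decay factor is an illustrative choice, and the inputs are assumed to be full URLs:

        from collections import defaultdict
        from urllib.parse import urlparse

        def diversity_score(incoming_urls, decay=0.3):
            per_domain = defaultdict(int)
            score = 0.0
            for url in incoming_urls:
                domain = urlparse(url).netloc
                score += decay ** per_domain[domain]   # 1, 0.3, 0.09, ... per domain
                per_domain[domain] += 1
            return score

    With these numbers, 100 links from one spam domain score about 1.43, while 10 links from 10 different domains score 10, which captures the "independent sources" intuition.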

  11. 1. Types definitely count. One possible measure is the total time the webpage has been "known" and the amount of time its links have been active. This wouldn't help with Google bombing, but it would provide a way to assess the legitimacy of a site: the longer it and its links have been around, the more authority it is given.

    4. While there is plenty of inaccurate information on the web, we can't forget about the bad info that permeates our everyday lives. There are no centralized standards (that I know of) for credible or authoritative information in a conversation. But if we are curious about a topic that is brought up, we investigate it further, checking books, the news, or in all likelihood Wikipedia. It would be much simpler to teach the population how to evaluate information than it would be to centralize and censor information. But maybe I'm overestimating the average user...

  12. 1. The number of pages which give a page some characterization surely determines how much trust we should place in that characterization. The more pages, the more confidence we should have. Also, we should not place the same amount of confidence in all pages; each site should have a "trustworthiness" score. The simplest way to evaluate this measure is to count how often what a particular site says coincides with what the majority of other sites say.

    2.1. The link structure of the web is exactly what link-based analysis (authorities/hubs, PageRank) captures. The amount of access is somewhat related: for example, if a page has a really high PageRank value, it will likely be accessed frequently. The query and the user can also be incorporated into link-based analysis; for example, in PageRank we can modify the transition matrix in a query/user-adaptive way (see the sketch after 2.2).
    2.2. Link-based analysis such as PageRank and A/H relies largely on "who points to it; who does it point to". So removing these links will certainly affect the PageRank or A/H values.
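    A minimal sketch of the query/user-adaptive PageRank mentioned in 2.1: instead of teleporting uniformly, the random surfer teleports only to pages relevant to the query (or preferred by the user). The adjacency matrix, the relevance vector, and the damping factor are illustrative assumptions:

        import numpy as np

        def biased_pagerank(adjacency, relevant, alpha=0.85, iters=100):
            n = adjacency.shape[0]
            out_deg = adjacency.sum(axis=1)
            # Row-stochastic transition matrix; dangling pages jump to relevant pages.
            M = np.where(out_deg[:, None] > 0,
                         adjacency / np.maximum(out_deg, 1)[:, None],
                         relevant / relevant.sum())
            reset = relevant / relevant.sum()   # query/user-biased reset vector
            rank = np.full(n, 1.0 / n)
            for _ in range(iters):
                rank = alpha * rank @ M + (1 - alpha) * reset
            return rank

        # adjacency = np.array([[0, 1, 1], [1, 0, 0], [0, 1, 0]], float)
        # relevant  = np.array([1.0, 0.0, 1.0])   # pages 0 and 2 match the query
        # biased_pagerank(adjacency, relevant)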

    3. The relationship of "relevance" and "importance" is not that intuitive than the one with "similarity". But given a number of pages with identical "similarity", a good choice will be trying to see where others normally go. This gives the motivation for "importance" evaluation. For the "orgasm" quote, I guess feeling "high" enough does not mean feeling "right", just like the importance of a page, a important page is certainly not guaranteed to be the right page.

    4. I remember there is a saying about "throwing out the baby with the bathwater". The web has so much useful information precisely because we do not control it that much. If one is uncertain whether a piece of information is correct, just check multiple copies of it; the majority will usually be more trustworthy. Also, there are still many sites in which we can place a lot of confidence.

  13. 1. More than the "number of pages", it's the "type of pages" pointing to page P that should matter when determining the importance of anchor-text characterizations. As already mentioned, domain-based techniques can be used as a measure to identify the trustworthiness of a page. Domains like educational institutions and government organizations host curated pages, and there is a high likelihood that the information is accurate.

    2.2. The absence of links will lead to isolation of hubs and authorities and a breakdown of the link-chain structure. Knowledge of good hubs, or the trustworthiness of the hubs, will not propagate to their corresponding authorities (and vice versa), making it difficult to evaluate the importance of those authorities.

    2.3. In the absence of link-based analysis, query logs are a good way of judging page importance. As most sites are accessed through search engines, data gathered during a user session can be used to determine the importance of pages. Access patterns of various users with similar queries can be used to determine the pages that are most likely to be accessed for a given query.

    3. For web IR, page importance should definitely be considered part of relevance. Traditional IR techniques for computing relevance are insufficient to counter the problems posed by the adversarial/uncurated nature of the web. Even though page importance somewhat fails to account for newly added pages, it is useful in differentiating between existing genuine and spurious pages. But relevance itself is difficult to define, as it depends on each user; we can only approximate it so that it satisfies the majority of users. Hence the reference to the quote.

    4. The web was primarily created as a means of sharing information. It provides each individual a medium to share their data/views, implement their ideas, and thus contribute towards the growth of the web. It is this uncontrolled nature of the web which makes it unpredictable. The failures/challenges web search engines face in retrieving the required documents should not be a reason to impose controls on the web. Instead, the uncontrolled nature of the web provides new challenges which search engines can see as opportunities to continuously evolve.

  14. 1. I think it should be based more on the type of pages pointing to the page than on the number of pages. As people have said before, trusted domains such as educational sites and other big sites that get a lot of attention should be weighted more: they are viewed the most and criticized the most, and therefore will be more accurate and trustworthy. Domain trustworthiness can be computed in a number of ways, from valid certificates, PageRank, and the number of days the domain has been hosted.
    3. I believe it is some other orthogonal dimension, because information provided by other users about the webpage is being used as the measure of its relevance. Other users who have typed the exact same query and clicked on certain results give more information about its relevance.
    4. I believe if it were reverted to the old days, the internet would crash again as it has before. The benefits of the freedom users have on the internet today outweigh the positives of a controlled-web environment. It is up to the user to use their common sense and pick out the useful and accurate information from the garbage. Also, with sites such as Wikipedia, I truly believe that the more eyes on some information, the better.

  15. 1. In my opinion, it should be a weighted combination of query relevance, the number of pages, and the type of pages, with far more importance given to the type of pages that are pointing and very little to the number of pages.
    2. I think in the old times when people used to put up links, there was not as much uncurated data on the internet as there is today; the pointing sites and the pointed-to sites were legitimate, which is not the case nowadays.
    With the increasing effectiveness of current-day search engines, link-based analysis might not be as useful as it was before.
    2.3 I believe search engines use query logs to find the importance of a page.
    3. I think we can still see page importance as a part of computing the relevance, but 'just a more accurate computation' is not guaranteed. I think that is how the quote is related to this topic: every time we think we have found the right formula, we end up realizing it might not be the best one. Actually, I like the continuous-improvement philosophy so much that I just love Heisenberg's uncertainty principle, which says that by the time we get to the truth, it has changed by the very act of our finding it.
    4. I would like to see the glass as full in this case. Along with the masses of 'inaccurate' data present on the web, there are still a lot of sites like Wikipedia which mostly have the right information even if it is not validated. I think it is the user's responsibility to get to the 'right' information. It is because of the openness of the web that we at least have an answer to our query, even if it is a wrong one; it is indeed better to have some birds who don't sing well than to have none.

  16. 1. The type of pages pointing to page P should definitely be taken into account. Following are a few ways in which this can be addressed (see the sketch after this list):
    i. The PageRank of the page which contains the anchor text should be taken into account. This can be helpful in controlling spam anchor text (because the PageRank of spam pages would generally be very low).
    ii. The context in which the anchor text appears can also be taken into account. For example, a news site that mentions itself many times on its own pages should be given less weight than other web pages referring to it.
    iii. The popularity of an anchor text among the masses should be taken into account. For example, social news websites like www.digg.com could be given priority, as the links posted there are put up by the people who use the web, and the importance of an anchor can be decided by the popularity of a link or news item.
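    A rough sketch combining points (i) and (ii) above: each anchor is weighted by the PageRank of the page carrying it, and anchors whose source is in the same domain as the target (self-promotion) are discounted. The discount factor is an illustrative assumption, and pages are assumed to be identified by URLs:

        from collections import defaultdict
        from urllib.parse import urlparse

        def anchor_importance(target, anchor_links, pagerank, self_discount=0.1):
            def domain(url):
                return urlparse(url).netloc

            scores = defaultdict(float)
            for src, tgt, text in anchor_links:
                if tgt != target:
                    continue
                weight = pagerank.get(src, 0.0)   # point (i): rank of the source
                if domain(src) == domain(target):
                    weight *= self_discount       # point (ii): a site citing itself
                scores[text] += weight
            return scores

    Point (iii) could be folded in by adding a popularity term (e.g., vote counts from a social news site) to the weight, though that data is outside this sketch.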

    2.2 Finding page importance would have been easier and better if people still put up links (bookmarks) to various pages in order to get back to them later. When people bookmark a page, it means that page is important to that particular person; so many people bookmarking the same page signifies that the page is very popular and should be given more importance. It would also be helpful to cluster the important pages for a particular community of people according to the kind of bookmarks they make.

    2.3 As many people have mentioned, query logs (which links users click on) can certainly be one of the methods. The query logs can also be segmented per user and used for personalization.

    3. I feel page importance is a more accurate computation of relevance. We always need to consider the relevance of a query against the available documents, but among the relevant pages, the one that is more important to the masses (quoted by many, etc.) is likely to be the one a common user is also looking for. So by adding page importance, we have made the results more relevant to the user.
    If we were calculating relevance by only looking at the similarity between the query and the pages, and not the importance of a page among the masses, we would be giving the user results, but not the ones that are popular, or considered important, among the masses. The orgasm quote is relevant in this context.

  17. Continued...
    4. I feel the simplicity of the implementation is not that important if we are trying to fulfill a common user's requirement. When people try to quote someone or something, the general tendency is to get (or mention) only the essence of the quote and not the exact wording; very rarely do people remember the exact wording. So when searching for something similar to a quote, a user might not be interested in getting the exact version but only an approximate version that he/she can use.
    We are trying to make search engines and the web work more like the real world. Controlling web page creation would kill that idea. Making the web closer to how the real world works, and giving users the freedom to participate equally in creating web pages, encourages democracy and equal participation. It is of course more challenging and interesting for us to come up with better and more relevant results in spite of the web being uncontrolled.

  18. 1. Yes, I think the "type of pages" is necessary to be taken into account when charecterizing a page P. As trustworthiness of a page is captured by link-based analysis to certain extent, more weightage can be given to anchor text from pages of high importance.

    Intuitively, the right way to describe a page link using anchor text would be to use words about the page itself. For example, an anchor text for a link to the "asu" website would be "university", "asu", "arizona", etc. So it would be better to check whether the words in the anchor text actually are among the top keywords (in terms of tf) of the page or not. If not, then the anchor text is describing the page in a way that is not authorized by the page itself, and in such cases we may want to take extra care to verify the trustworthiness of the linking pages.

    2.1 Eloquence and informativeness will be captured by link-based importance. The number of page accesses and link-based importance together can help capture novelty and trustworthiness.
    2.2 Yes, if everybody stops using anchor text completely, then the link-based analysis methods that treat anchor text as links may be affected. But instead of considering anchor text as links, if one starts considering keywords as links to a webpage (each webpage tagged with a keyword that is often used to reach or refer to it, for example "university in arizona" referring to the ASU website), then the same underlying techniques of link-based analysis can still be used. However, I cannot imagine a situation in which everyone stops referring to webpage URLs. With the advent of Web 2.0 facilitating interactive web content, it is not a hard task to add a link to a webpage just by typing the URL.
    2.3 Search engines may measure the number of times a page has been accessed in the past, recent/sudden increases in accesses to a page due to current news or popularity, location-based evaluation of page accesses, etc., to determine the importance of a page.


    3. I think the importance of a page should be considered as part of the relevance computation itself. The measure of a page's importance takes care of determining whether the page is trustworthy and informative, and whether it has been accessed often by other users and referred to by other pages. A result is thus more relevant, and deserves its spot in the top-k results, only if it is authentic and trustworthy, apart from just having a good tf-idf score.

    4. I think allowing only a selected few to put up webpages would make the internet less informative. A model similar to Wikipedia's could be adopted in this context, to periodically check the authenticity of the information uploaded by users and remove it if necessary. Along similar lines, we could allow other users to report a webpage as "fake" when they find the information in it to be wrong, and block the website if it receives more than a threshold number of fake reports. But then, the woods will be silent indeed, if no birds sang except those that sing the best!!

  19. 1. PageRank being one of the main ways to decide on a page's importance, we need to be clear about which pages link to a site, and there are ways to control which pages can link to another site. Considering just the number of pages pointing to a site may result in incorrect ranking, as some people create "link farms"; some sites might have better content on a particular topic and yet not have a better ranking. Generally, when a site has multiple links to another page, not all of the links should be counted (Google, for instance, considers only one anchor text per URL; it is not clear whether other search engines do the same, and if all the links to the same page are counted, the ranking may not be correct). Likewise, a site's links to itself should not be counted for page ranking. The importance of the page that casts the vote should be considered by weighting each vote by the rank of the page containing the link.
    2.1. Link-based analysis is sensitive to the link structure of the web. The importance of a page should not be based merely on the number of sites linking to it; it should depend on the weight (authority score) of the pages that link to it. Considering just the number of accesses a site gets from other pages would reduce the effectiveness of the ranking and thereby affect the quality and relevance of a page for a given query. Hence pages with high authority and hub scores should be taken into account: a higher authority score indicates that the page is more relevant to the query, while a higher hub score indicates that the page links to other sites that are more relevant to the query.
    2.2 Sometimes unimportant pages have many links to a page that is highly relevant to a particular query. With fewer links being placed now, the PageRank of such pages drops, and a site with more relevant content for a query can lose its rank for that query.
    2.3 The relevance of documents can be estimated using the historical preferences of users for a particular query, as stored in the query log.
    3. Page ranking ends up providing more relevant results for a query. Beyond just calculating the similarity value, page ranking makes it possible to rank pages for a query from a user’s perspective. It uses information about how relevant a page is to the topic, not only from the user but also from other websites. The quote implies that sometimes page ranking lets non-relevant pages make it through, and more digging is needed to find what exactly the user wants.
    4. Restricting people from creating web pages would reduce the amount of information we get for any query. In this Web 2.0 era, users are no longer interested in just retrieving information, but in building interactive applications. Wikipedia, social networking sites, and many other blogs are Web 2.0-based applications with major user involvement in creating and editing web pages. But since any user can edit the content, people cannot be sure that they are reading correct information.

