Showing posts with label Google Sandbox. Show all posts

Friday, February 12, 2010

What is the Google Dance?

Approximately once a month, Google updates its index by recalculating the PageRank of each of the web pages that it has crawled. The period during the update is known as the Google Dance.
Because of the nature of PageRank, the calculations need to be performed about 40 times and, because the index is so large, they take several days to complete. During this period, the search results fluctuate, sometimes minute by minute. It is because of these fluctuations that the term Google Dance was coined. The dance usually takes place sometime during the last third of each month.
Google has two other servers that can be used for searching. The search results on them also change during the monthly update and they are part of the Google dance.
For the rest of the month, fluctuations sometimes occur in the search results, but they should not be confused with the actual dance. They are due to Google's fresh crawl and to what is known as "Everflux".

Google has two other searchable servers apart from www.google.com. They are www2.google.com and www3.google.com. Most of the time, the results on all 3 servers are the same, but during the dance, they are different.
For most of the dance, the rankings that can be seen on www2 and www3 are the new rankings that will transfer to www when the dance is over. Even though the calculations are done about 40 times, the final rankings can be seen from very early on. This is because, during the first few iterations, the calculated figures converge toward their final values. You can see this with the Pagerank Calculator by checking the Data box (top left) and performing some calculations. After the first few iterations the search results on www2 and www3 may still change, but only slightly.
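The early convergence described above can be reproduced with a few lines of Python. This is only a minimal sketch, not Google's implementation: the four-page link graph is hypothetical, the damping factor of 0.85 is an assumed value, and the formula is the commonly published PR(A) = (1 - d) + d * sum(PR(T) / C(T)).

```python
def pagerank(links, damping=0.85, iterations=40):
    """Iterate the PageRank formula and return the values after every pass."""
    pages = sorted(links)
    ranks = {page: 1.0 for page in pages}
    history = [dict(ranks)]
    for _ in range(iterations):
        new_ranks = {}
        for page in pages:
            # Each linking page passes on its rank divided by its outbound link count.
            inbound = sum(
                ranks[src] / len(links[src])
                for src in pages
                if page in links[src]
            )
            new_ranks[page] = (1 - damping) + damping * inbound
        ranks = new_ranks
        history.append(dict(ranks))
    return history


# Hypothetical four-page site: each page maps to the pages it links to.
links = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home", "about"],
    "contact": ["home"],
}

history = pagerank(links)
final = history[-1]
for step in (1, 5, 10, 20):
    gap = max(abs(history[step][p] - final[p]) for p in final)
    print(f"after {step:2d} iterations, largest gap to the final value: {gap:.6f}")
```

The printed gaps shrink quickly, which is why the later iterations change the ordering of pages only slightly.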
During the dance, the results from www2 and www3 will sometimes show on the www server, but only briefly. Also, new results on www2 and www3 can disappear for short periods. At the end of the dance, the results on www will match those on www2 and www3.
This Google Dance Tool allows you to check your rankings on www, www2 and www3 and on all of the data centers simultaneously.

Google currently has 12 data centers, any one of which can provide the Toolbar PageRank of any page. As the dance progresses, these data centers are updated one by one. Before the dance begins, they all return the same, current PageRank value for a given page, but during the dance they are updated, one by one, to the new PageRank value. Checking each of the centers during the dance reveals the new PageRank values as they gradually spread through the centers. If the PageRank isn't going to change, the centers show the same values throughout, of course.
Querying the data centers
For this, it is necessary to have the Google Toolbar installed and the PageRank indicator on. Every time a page is received by the browser, the Toolbar requests its PageRank from one of Google's data centers. The information is returned as a one-line text file and stored in the Temporary Internet Files folder.
The Toolbar's request URL includes the URL of the page that it wants the PageRank for (the target page) and a checksum. The checksum must match the target page's URL, otherwise the request is rejected.
A fat URL for a typical Toolbar request (all in one line):-
http://216.239.33.102/search
?client=navclient-auto
&ch=5150615727
&features=Rank:FVN
&q=info:http%3A%2F%2Fwww%2Eexampledomain%2Ecom%2F

If you copy and paste that fat URL into your browser, you will get Google's "forbidden" page back. That's because the target page and checksum don't match - it's just an example of the request URL.
Notice that the target page is in escaped format - some of the characters are represented by hexadecimal codes (e.g. %2F).
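To make the structure of the fat URL concrete, here is a small Python sketch that assembles one from its parts. The function names are made up for illustration, and the checksum is simply passed in, because how the Toolbar computes it is not described here. The escaping encodes every character that is not a letter or digit, which reproduces the example above (including %2E for the dots).

```python
def escape_target(url):
    """Percent-encode every byte that is not an ASCII letter or digit."""
    out = []
    for byte in url.encode("utf-8"):
        ch = chr(byte)
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        else:
            out.append(f"%{byte:02X}")
    return "".join(out)


def build_fat_url(host, checksum, target):
    """Assemble a Toolbar-style request URL (the checksum must be known already)."""
    return (
        f"http://{host}/search"
        "?client=navclient-auto"
        f"&ch={checksum}"
        "&features=Rank:FVN"
        f"&q=info:{escape_target(target)}"
    )


# Values taken from the example above; the checksum does not match the page.
print(build_fat_url("216.239.33.102", "5150615727", "http://www.exampledomain.com/"))
```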
To get the new PageRank for a particular page, you need to make the same request that the Toolbar makes for it. I.e. you need the fat URL that the Toolbar uses. And you need to request the PageRank from all of Google's data centers. The method is a bit long-winded but it works. Here's how to do it:-

  • Use your browser to browse to the page. This makes sure that the page and the Toolbar's PageRank request are in your Temporary Internet Files folder. You only need to do this once, not every time.

  • Open the index.dat file from the Temporary Internet Files folder in a text editor and search it for the target page. You'll find the entire fat URL, similar to the one above, for the Toolbar's PageRank request. NOTE: Because the target page is escaped in the fat URL, search only for an unescaped part; e.g. "exampledomain".

  • When you've found the fat URL, copy and paste it into your browser's address box and press Return or click Go. If the page is in Google's directory, the returned line includes the directory path. The last element in the first part of the line is the Toolbar PageRank value for the target page. To see the page's new PageRank spread across the centers during the dance, use the same fat URL, but replace the IP address with that of each data center in turn. This is also a good way to see the progress of the dance in general.
    Data centers

    216.239.33.100 :: www-ex.google.com
    216.239.35.100 :: www-sj.google.com :: currently offline
    216.239.37.100 :: www-va.google.com
    216.239.39.100 :: www-dc.google.com
    216.239.41.100 :: www-fi.google.com
    216.239.51.100 :: www-ab.google.com
    216.239.53.100 :: www-in.google.com
    216.239.55.100 :: www-zu.google.com
    216.239.57.100 :: www-cw.google.com
    216.239.59.100 :: www-gv.google.com
    66.102.11.100 :: www-kr.google.com
    66.102.7.100 :: www-mc.google.com
    TIP: If you want to check the same pages during future dances, save the fat URLs into a text document so that you don't need to go through the process of finding them in the Temporary Internet Files folder each time.
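The steps above can be sketched in Python. This is an illustration under assumptions, not a tested tool: it assumes the cache file holds the fat URL as plain ASCII text, the helper names are hypothetical, and the sample bytes stand in for the contents of a real index.dat (which you would read with open(path, "rb").read()). It only builds the URLs for each data center and sends nothing.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Data centers from the list above (IP address -> domain).
DATA_CENTERS = {
    "216.239.33.100": "www-ex.google.com",
    "216.239.35.100": "www-sj.google.com",
    "216.239.37.100": "www-va.google.com",
    "216.239.39.100": "www-dc.google.com",
    "216.239.41.100": "www-fi.google.com",
    "216.239.51.100": "www-ab.google.com",
    "216.239.53.100": "www-in.google.com",
    "216.239.55.100": "www-zu.google.com",
    "216.239.57.100": "www-cw.google.com",
    "216.239.59.100": "www-gv.google.com",
    "66.102.11.100": "www-kr.google.com",
    "66.102.7.100": "www-mc.google.com",
}

# A Toolbar request: an IP host, /search, and the navclient-auto client marker,
# followed by any run of printable, non-space ASCII characters.
FAT_URL = re.compile(rb"http://[0-9.]+/search\?client=navclient-auto[\x21-\x7e]*")


def find_fat_urls(raw, keyword):
    """Return the Toolbar request URLs in raw bytes that mention the keyword."""
    needle = keyword.encode("ascii")
    return [m.decode("ascii") for m in FAT_URL.findall(raw) if needle in m]


def retarget(fat_url, ip):
    """Return the same request URL pointed at a different data center."""
    return urlunsplit(urlsplit(fat_url)._replace(netloc=ip))


# Stand-in for the bytes of an index.dat file.
sample = (
    b"\x00\x10URL \x02"
    b"http://216.239.33.102/search?client=navclient-auto&ch=5150615727"
    b"&features=Rank:FVN&q=info:http%3A%2F%2Fwww%2Eexampledomain%2Ecom%2F"
    b"\x00\x00HASH"
)

for fat_url in find_fat_urls(sample, "exampledomain"):
    for ip, name in DATA_CENTERS.items():
        print(name, retarget(fat_url, ip))
```

Saving the printed URLs to a text file gives you the list suggested in the tip above.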





  • Google Dance - The Index Update of the Google Search Engine

    The name "Google Dance" has often been used to describe the index update of the Google search engine. Google's index update occurred on average once per month. During an index update there was significant movement in search results and Google showed new backward links for pages. However, in mid-2003 Google started to update its index continuously. It appears that there still has to be an update of the complete index once in a while, and that new backward links are shown during this time. But, because of the continuous update, the effects on search results seem to be rather insignificant.
    We will keep this site up and running because it provides some information beyond the Google Dance. But there will no longer be any monitoring of updated data centers during a "Dance".
    The Technical Background of the Google Dance
    The Google search engine pulls its results from more than 10,000 servers, which are simple Linux PCs that Google uses for reasons of cost. Naturally, an index update cannot be carried out on all those servers at the same time. One server after the other has to be updated with the new index.
    Many webmasters think that, during the Google Dance, Google is in some way able to control whether a server with the new index or a server with the old index responds to a search query. But, since Google's index is an inverted index, this would be very complicated. As we will show below, there is no such control within the system. In fact, the reason for the Google Dance is Google's way of using the Domain Name System (DNS).
    Google Dance and DNS
    Not only is Google's index spread over more than 10,000 servers, but these servers are also, as of now, placed in 13 different data centers. These data centers are mainly located in the US (e.g. Santa Clara, California and Herndon, Virginia) and in Dublin, Ireland.
    In order to direct traffic to all these data centers, Google could theoretically receive all queries centrally and then forward them to the data centers. But this would obviously be inefficient. Instead, each data center has its own IP address (numerical address on the Internet), and the way these IP addresses are accessed is managed by the Domain Name System.
    Basically, the DNS works like this: On the Internet, data transfers always take place between IP addresses. The information about which domain resolves to which IP address is provided by the name servers of the DNS. When a user enters a domain into a browser, a locally configured name server obtains the IP address for that domain by contacting the name server which is responsible for that domain. (The DNS is structured hierarchically. Illustrating the whole process would go beyond the scope of this paper.) The IP address is then cached by the name server, so that it is not necessary to contact the responsible name server each time a connection to a domain is established.
    The records for a domain at the responsible name server specify how long the record may be cached by a caching name server. This is the Time To Live (TTL) of the record. As soon as the TTL expires, the caching name server has to fetch the record for the domain again from the responsible name server. Quite often, the TTL is set to one or more days. In contrast, the Time To Live of the domain www.google.com is only five minutes. So, a name server may only cache Google's IP address for five minutes and then has to look up the IP address again.
    Each time Google's name server is contacted, it sends back the IP address of only one data center. In this way, Google queries are always directed to different data centers by changing DNS records. On the one hand, the DNS records may be based on the load of the individual data centers. In this way, Google would conduct a simple form of load balancing by its use of the DNS. On the other hand, the geographical location of a caching name server may influence how often it receives the IP addresses of the individual data centers. So, the distance for data transmissions can be reduced.
    How data centers, DNS and the Google Dance are related is easily answered. During the Google Dance, the data centers do not receive the new index at the same time. In fact, the new index is transferred to one data center after the other. When a user queries Google during the Google Dance, he may get the results from a data center which still has the old index at one point in time and from a data center which has the new index a few minutes later. From the user's perspective, the index update took place within a few minutes. But of course, this procedure may reverse, so that Google seemingly switches between the old and the new index.
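The interplay of the five-minute TTL and the staggered update can be illustrated with a toy simulation in Python. Everything here is an assumption made for the sketch: the class names are invented, the responsible name server simply rotates through its addresses (the text only says the choice may depend on load or location), and which data center already has the new index is made up.

```python
from itertools import cycle

TTL_SECONDS = 300  # five minutes, as stated for www.google.com


class AuthoritativeServer:
    """Hands out one data center IP per lookup (round-robin is an assumption)."""

    def __init__(self, ips):
        self._ips = cycle(ips)

    def resolve(self):
        return next(self._ips)


class CachingResolver:
    """Caches the last answer until its TTL expires, then asks upstream again."""

    def __init__(self, upstream, ttl=TTL_SECONDS):
        self.upstream = upstream
        self.ttl = ttl
        self.cached_ip = None
        self.expires_at = 0

    def resolve(self, now):
        if self.cached_ip is None or now >= self.expires_at:
            self.cached_ip = self.upstream.resolve()
            self.expires_at = now + self.ttl
        return self.cached_ip


# Hypothetical state during a dance: only the first data center has the new index.
INDEX_VERSION = {
    "216.239.33.100": "new",
    "216.239.37.100": "old",
    "216.239.39.100": "old",
}

resolver = CachingResolver(AuthoritativeServer(list(INDEX_VERSION)))
seen = []
for minute in range(0, 20, 2):  # the user searches every two minutes
    ip = resolver.resolve(now=minute * 60)
    seen.append((minute, ip, INDEX_VERSION[ip]))
    print(f"minute {minute:2d}: {ip} serves the {INDEX_VERSION[ip]} index")
```

In this run the user sees the new index first, falls back to the old one once the cached record expires, and later gets the new index again, which is the back-and-forth described above.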
    Finally, it should be noted that Google did the DNS load balancing itself until September 2003. Since then, it has used the services and, hence, the name servers of Akamai Technologies, Inc.
    IP Addresses and Domains of Google's Data Centers
    The progression of a Google Dance could basically be watched by querying the IP addresses of Google's data centers. But queries on the IP addresses are normally redirected to www.google.com. However, Google has domains which resolve to the single data centers' IP addresses. These domains as well as their IP addresses are shown in the following list.
    Domain IP address
    www-ex.google.com 216.239.33.100
    www-sj.google.com 216.239.35.100
    www-va.google.com 216.239.37.100
    www-dc.google.com 216.239.39.100
    www-ab.google.com 216.239.51.100
    www-in.google.com 216.239.53.100
    www-zu.google.com 216.239.55.100
    www-cw.google.com 216.239.57.100
    www-fi.google.com 216.239.41.100
    www-gv.google.com 216.239.59.100
    www-kr.google.com 66.102.11.100
    www-mc.google.com 66.102.7.100
    www-lm.google.com 66.102.9.100
    Those that keep an eye on Google's index updates often think that the Google Dance is over, when they see the new index at www.google.com or when they don't see the old index at www.google.com for some time. In fact, the update is not finished until all the domains listed above provide results from the new index.
    The index update at each individual data center seems to happen at one point in time. As soon as a data center shows results from the new index, it won't switch back to the old index. This is most likely because the index is stored redundantly at each data center and, at first, only one part of the servers (possibly half of them) is updated. During this period, only the other half of the servers is active and provides search results. As soon as the update of the first half of the servers is finished, they become active and provide search results while the other half receives the new index. Thus, from the user's perspective, the update of one data center happens at one point in time.
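The two-halves procedure suspected above can be written down as a short Python sketch. It is purely illustrative: the function name and the data layout are invented, and the order of the steps follows the guess in the text rather than anything known about Google's systems.

```python
def rolling_update(new_index, old_index="old"):
    """Return the index versions visible to users after each update step."""
    halves = [
        {"index": old_index, "active": True},
        {"index": old_index, "active": True},
    ]

    def visible():
        return sorted({h["index"] for h in halves if h["active"]})

    timeline = [visible()]           # before the update: both halves serve the old index

    halves[0]["active"] = False      # take the first half out of service
    timeline.append(visible())

    halves[0]["index"] = new_index   # load the new index while that half is offline
    halves[0]["active"] = True       # switch over: first half in, second half out
    halves[1]["active"] = False
    timeline.append(visible())

    halves[1]["index"] = new_index   # update the second half while it is offline
    halves[1]["active"] = True
    timeline.append(visible())
    return timeline


print(rolling_update("new"))
```

At no step are both versions visible at once, which matches the observation that a data center never switches back to the old index.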
    Finally, it should be noted that access to the individual data centers is generally controlled by the DNS only, but sometimes queries are redirected. However, this is easy to detect: when, for a query at one of the domains listed above, the links to Google's cache do not match the IP address that belongs to the domain, the query has been redirected. If this happens, Google is blocking access to that data center, for whatever reason.
    The Google Dance Test Domains www2 and www3
    The beginning of a Google Dance can always be watched at the test domains www2.google.com and www3.google.com. Those domains normally have stable DNS records which make the domains resolve to only one (often the same) IP address. Before the Google Dance begins, at least one of the test domains is assigned the IP address of the data center that receives the new index first.
    Building up a completely new index once per month can cause quite some trouble. After all, Google has to spider several billion documents and then process many terabytes of data. Therefore, testing the new index is essential. Of course, the folks at Google don't need the test domains themselves. Most certainly, they have many options to check a new index internally, but they do not have a lot of time to conduct the tests.
    So, the reason for having www2 and www3 is rather to show the new index to webmasters who are interested in their upcoming rankings. Many of these webmasters discuss the new index at the Google forums out on the web. These discussions can be observed by Google employees. At that time, the general public cannot see the new index yet, because the DNS records for www.google.com normally do not point to the IP address of the data center that is updated first when the update begins.
    As soon as Google's test community of forum members has found no severe malfunctions caused by the new index, Google's DNS records are changed to make www.google.com resolve to the data center that is updated first. This is the time when the Google Dance begins. But if severe malfunctions become obvious during this test phase, there is still the possibility to cancel the update at the other data centers. The domain www.google.com would not resolve to the data center which has the flawed index, and the general public would not notice it. In this case, the index could be rebuilt or the web could be spidered again.
    So, the search results which are to be seen on www2.google.com and www3.google.com will always appear on www.google.com later on, as long as there is a regular index update. However, there may be minor fluctuations. On the one hand, the index at one data center never exactly equals the index at another data center. We can easily check this by watching the number of results for the same query at the data center domains listed above, which often differ from each other. On the other hand, it is often assumed that the iterative PageRank calculation is not yet finished when the Google Dance begins, so that preliminary values influence rankings at that point in time.
    The New PageRank Values during the Google Dance
    Most webmasters are interested in ranking changes for their website during the Google Dance. But, besides that, many also want to know about their new PageRank values. Normally, the Google Toolbar fetches the PageRank values from the data center whose IP address is given in the current DNS record for www.google.com. Hence, when the Google Dance begins, the Toolbar usually displays the old PageRank values.
    Google submits PageRank values in simple text files to the Toolbar. Formerly, this happened via XML. The switch to text files occurred in August 2002. The PageRank files can be requested directly from the domain www.google.com. Basically, the URLs for those files look as follows (without line breaks):
    http://www.google.com/search?client=navclient-auto&
    ch=0123456789&features=Rank&q=info:http://www.domain.com/
    There is only one line of text in the PageRank files. The last number in this line is the PageRank.
    The parameters in the URL shown above are required for the PageRank files to be displayed in a browser. The value "navclient-auto" for the parameter "client" identifies the Toolbar. The URL of the page is submitted via the parameter "q". The value "Rank" for the parameter "features" determines that the PageRank files are requested. If it is omitted, Google's servers still transmit XML files. The parameter "ch" transfers a checksum for the URL to Google. The checksum for a given URL can only change when Google updates the Toolbar version.
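As a quick illustration of these parameters, the following Python sketch takes the example URL from above apart with the standard library. The function name and the keys of the returned dictionary are made up for this example, and treating any "features" value that starts with "Rank" as a PageRank request is an assumption.

```python
from urllib.parse import parse_qs, urlsplit


def describe_request(url):
    """Split a Toolbar-style request URL into the parameters explained above."""
    parts = urlsplit(url)
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    query = params.get("q", "")
    return {
        "host": parts.netloc,
        "is_toolbar": params.get("client") == "navclient-auto",
        "checksum": params.get("ch"),
        "wants_pagerank_file": params.get("features", "").startswith("Rank"),
        "target": query[len("info:"):] if query.startswith("info:") else query,
    }


# The example URL from the text, joined into one line.
example = (
    "http://www.google.com/search?client=navclient-auto&"
    "ch=0123456789&features=Rank&q=info:http://www.domain.com/"
)

for name, value in describe_request(example).items():
    print(f"{name}: {value}")
```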
    The PageRank files that are requested by the Google Toolbar are cached by Internet Explorer. So, their URLs and checksums can simply be found by having a look at the Temporary Internet Files folder. Knowing the checksums of your URLs, you can view the PageRank files in your browser. Since the PageRank files are kept in the browser cache and, thus, are clearly visible, and as long as requests are not automated, watching the PageRank files in a browser should not be a violation of Google's Terms of Service. However, you should be cautious. The Toolbar submits its own User-Agent to Google. It is:
    Mozilla/4.0 (compatible; GoogleToolbar 1.1.60-deleon; OS SE 4.10)
    1.1.60-deleon is a Toolbar version which may of course change. OS is the operating system that you have installed. So, Google is able to identify requests by browsers, if they do not go out via a proxy and if the User-Agent is not modified accordingly.
    Now, let's see how we can get the new PageRank values. Taking a look at IE's cache, you will notice that the PageRank files are not requested from the domain www.google.com but from IP addresses like 216.239.33.102. Additionally, the PageRank files' URLs often contain a parameter "failedip" that is set to values like "216.239.35.102;1111" (Its function is not absolutely clear). However, it is pretty easy to get the new PageRank values. Simply modify the IP addresses in the URL so that the request goes to one of the data centers that already has the new index. The necessary information is given above.
     