Google today blogged about reaching 1,000,000,000,000 (one trillion) unique URLs indexed. So how do they crawl all these bazillions of pages every day? Google explains:
We start at a set of well-connected initial pages and follow each of their links to new pages. Then we follow the links on those new pages to even more pages and so on, until we have a huge list of links. In fact, we found even more than 1 trillion individual links, but not all of them lead to unique web pages. Many pages have multiple URLs with exactly the same content or URLs that are auto-generated copies of each other. Even after removing those exact duplicates, we saw a trillion unique URLs, and the number of individual web pages out there is growing by several billion pages per day.
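The process Google describes is essentially a breadth-first traversal of the web's link graph with a "seen" set to weed out duplicate URLs. Here is a minimal sketch of that idea; the `web` dict is a hypothetical in-memory stand-in for actually fetching pages, and all the names are illustrative, not Google's.

```python
from collections import deque

def crawl(seeds, get_links):
    """Breadth-first crawl: start from seed pages and follow links,
    keeping a 'seen' set so each unique URL is visited only once."""
    seen = set(seeds)              # dedup: a URL is enqueued at most once
    queue = deque(seeds)
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:   # skip URLs we already know about
                seen.add(link)
                queue.append(link)
    return order

# Hypothetical link graph standing in for the live web.
web = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com", "a.com"],   # repeated links are ignored
    "c.com": ["d.com"],
    "d.com": [],
}

if __name__ == "__main__":
    print(crawl(["a.com"], lambda u: web.get(u, [])))
    # visits each unique URL exactly once:
    # ['a.com', 'b.com', 'c.com', 'd.com']
```

At Google's scale the "seen" set obviously cannot live in one machine's memory, and the harder part they mention, collapsing different URLs that serve identical content, needs content fingerprinting on top of this; the sketch only shows the link-following and exact-URL dedup skeleton.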
In 2005, Yahoo! revealed that it had indexed about 20 billion pages. Weeks later, Google announced that its index was 3 times bigger than any other search engine's, an obvious reference to Yahoo's earlier claim. That works out to roughly 60 billion indexed pages.
Only 3 years later, which is to say now, Google has grown that figure more than 16-fold to 1 trillion. The problem, though, is that I think this type of approach might run into scaling problems one day. How many more data centers and hundreds of thousands of servers will Google spend on keeping the index growing while still making search lightning fast?