How big is the Web?
(and how much of it has been indexed?)
Short answer: nobody knows
Optimistic answer: maybe we can guess it
Slightly longer answer: read this page
Really long answer: visit our library, browse the web for info and study the problem for (at least) a couple of years, then prepare your own tests, reach some valid conclusions and let us know
Introduction
"To find out about the web ahead, ask those coming back"
How big is the Web?
How big is the Web? Hard to say. How do we count "pages"? What exactly is, nowadays, a "Host"? And what is a "Site"?
When a search engine reports -say- 10,000,000 results do you trust it? And when it reports 10,235,987 results?
And what if you don't trust such numbers?
Those results are not easy to check anyway, since engines do not allow any in-depth checking of their own
results beyond position ~1000. (See below "How to check results beyond the s.e. limit" for
a partial solution.)
Plenty of problems, and few real data we can build upon to start with.
So what can we do? C'mon, we are seekers. We'll find and squeeze the few existing data until they scream and confess!
Alas! Even the few existing data are rather unreliable, so what follows is the best estimate we could
infer, based on wild averages and extreme simplifications.
As the image at the top indicates, the indexed web is anyhow just a small part of the
real Internet, and even on that smaller turf, how much is really covered by the main search engines,
and how much their indexes overlap, is anyone's guess.
Few entities have sufficient resources
to tell us how many web pages there might be. The biggest search engines of course have some data:
Google and Yahoo, after all, spend their lives visiting, analyzing and indexing the billions
of web pages in the world. Yet neither company regularly publishes the
exact size of its index, and there have been no reliable data since August 2005 (for yahoo). They keep mum most
of the time, bursting out some data, or bragging, only when a new contender represents a menace or another
search engine breaks
this "gentlemen's agreement to keep mum".
Few & sparse data
In July 2008 Google
"was aware" of over a trillion pages, adding that
..."the number of individual web pages out there is growing by several billion pages per day".
However, many of these
pages just represent auto-generated content and have NOT been added to google's real index (see "links to unindexed"
in the image above). On http://www.google.com/intl/en/options/
google itself writes (in April 2009):
"Search over 8 billion web pages". As you can read below, that formulation
does not seem to dovetail at all with any really broad google query, though.
Obviously, if you really began indexing and
counting auto-generated pages the
amount would quickly reach infinity: a large chunk of this "trillion" google is aware of
will consist of moronic auto-page-generating sites
that return a page of junk whatever input URL you try (may the bastards that
pollute our searches all die pancaked by trucks together with their most beloved ones).
Many other "pages" will be just login pages, stats pages, etc. Thank god these google results are just a trillion 'discovered' URLs,
and not a trillion crap URLs actually indexed.
Spam sites (including huge amounts of parked-domain junk) are responsible for a big part of that
1 trillion figure. Sites that use non-obvious session IDs are nasty offenders too.
It's amazing that nobody thought of it when developing HTTP/URL specs:
session IDs should have never become part of URLs.
Yahoo published
some data in August 2005: they covered at that time 19.2 billion pages.
MSNlive claimed
to cover 20 billion pages in September 2005 ("We’ve quadrupled the size of our index, which
means we can return the right results for your searches"; elsewhere they mention the 20 billion pages figure).
Netcraft regularly publishes data about "sites".
(But there are more than 100,000,000 blogs alone, and they might count as just one site.)
So we have to calculate how many pages correspond on average to a Netcraft "Site".
isc.org regularly publishes data about "hosts".
Note that a single "host" can hide a multitude of sites: google, for instance, serves a large number of blogs under its blogspot.com domain.
So we have to calculate how many pages correspond on average to an isc.org "Host".
Also, Netcraft's and isc.org's data span quite different time slices, so we have limited our calculation to Jan 2009,
Jul 2008 and Jan 2008. The global trend of web growth, which would have been interesting per se, is alas quite
skewed by the sudden inclusion of huge Chinese sites & hosts (e.g. gzone.qq.com) during January and February 2009.
Our own estimates, as broad as they are, seem to indicate
a current indexable size of around 60 billion pages, which
would be after all consistent with netcraft's and isc.org's data, but
which does not dovetail with cuil's (unsubstantiated) claims of having a database of
120 billion pages.
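To make the kind of arithmetic behind such an estimate explicit, here is a minimal back-of-the-envelope sketch (Python). The two numbers in it are hypothetical placeholders, not our measured values; the whole point of the exercise above is precisely that the "average pages per site" figure is the hard one to pin down.

```python
# Rough structure of the estimate: indexable pages ~= sites * average pages per site.
# Both figures below are HYPOTHETICAL placeholders, for illustration only.
netcraft_style_sites = 185_000_000   # hypothetical "sites" count (Netcraft-style figure)
avg_pages_per_site = 300             # hypothetical average, the hard-to-pin-down number

estimated_pages = netcraft_style_sites * avg_pages_per_site
print(f"Estimated indexable pages: {estimated_pages / 1e9:.1f} billion")
```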
The following quick check of the four "big" search engines (using the vowel "a" method) broadly confirms
the engines' claims (apart from MSNlive) only IF YOU TRUST THEIR OWN ALLEGED RESULTS about
the number of pages they index... since there's no way to
really check these results in depth (see the yoyo searching approach): there's in fact no way
to exactly check all results beyond the engine's fixed limit (yet see below for a partial solution,
"How to check results beyond the s.e. limit").
So who knows if those numbers of indexed pages are true?
On a side note, it would be interesting to understand WHY search engines do not allow
checking all the results they claim to have found. Avoiding server overload? Or would this possibility
really allow us to easily reverse engineer their algos (and better understand the mechanisms of their "tides")?
Let's try the vowels
Time to check the search engines' claims.
The letter "a",
for instance, often used for this kind of "quick checking", gives in google (April 2009) "just"
16 billion pages
(or up to 18 billion if you try again a couple of hours later... search engines' results vary continuously, their "tides"
probably depend on your birth sign or on the position of the moon :-) And anyway
the "more tricky" a OR a
query
gives us in google slightly more:
19,330,000,000 pages.
So let's say 20 billion: consistent, q.e.d.
Now yahoo, which claims a bigger index, indeed gives for "a"
43,100,000,000 pages: consistent, q.e.d.
MSNlive, the third of the "big" engines, requires some nifty tricks to squeeze out the biggest page count from the
"a" vowel: 8,650,000,000 results, which does not seem
consistent with
their own claim of around 20 billion indexed pages. See graph.
If we now check
the recent ("cool") CUIL,
which claimed "the biggest index of them all", it indeed gives us
over 121 billion pages! Consistent, q.e.d.
So the "vowel a indexed pages finals" in April 2009 give the following results:
Cuil = 120 billions; Yahoo = 40 billions; Goggle = 20 billions and MSNLive = 9 billions.
This said -as all searchers know- the quality of the algos is king, while "index size" is -as such-
just an accessory. And google still seems to deliver quality results with a relatively small database of indexed pages.
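If you want to run this kind of "vowel check" yourself, a minimal sketch (Python, standard library only) follows. It assumes the engine still prints an "About N results" string somewhere in its HTML; engines change their markup (and may block non-browser requests) at any time, so treat the URL and the regex as placeholders to adapt, not as a stable interface.

```python
import re
import urllib.parse
import urllib.request

def estimated_results(query, engine_url="https://www.google.com/search?q="):
    """Fetch a result page and scrape the engine's own (estimated) result count.
    The regex below is a rough placeholder: markup changes frequently."""
    url = engine_url + urllib.parse.quote(query)
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    html = urllib.request.urlopen(req).read().decode("utf-8", errors="replace")
    # Look for something like "About 19,330,000,000 results"
    match = re.search(r"About\s+([\d.,]+)\s+results", html)
    if not match:
        return None
    return int(re.sub(r"[.,]", "", match.group(1)))

print(estimated_results("a"))
print(estimated_results("a OR a"))
```

Run it a few times, hours apart, and you will see the "tides" for yourself.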
Charts
"Round numbers are always false"
From Netcraft data...
(averaging with crossed fingers)
...through isc.org data...
...shaking well...
...to our own conclusions
(billions of indexable web pages, that is :-)
The almighty google monopoly...
...should not blind seekers into using only one engine
Search engines don't overlap that much:
main s.e. claimed index sizes & relative web coverage
Checking (sort of) indexed databases:
main s.e. **claimed** index sizes & vowel "a" search results
Checking their claims for real:
main s.e. refresh rate and size of indexes according to our data
How to check results beyond the s.e. limit
"Slice & splice suit a query nice"
-------- Intro --------
So so so...
the search engine you just
used claims an astronomical number of results for your query,
yet when really checking these results you can never go beyond (or "below" if you prefer) a specific limit, which can be
1000 results (MSNLive, alltheweb, altavista and Yahoo), slightly fewer than 1000 (google) or even fewer
(ask gives only ~200 viewable results).
Why do the search engines deny us the full set? Well, first of all it remains to be seen
whether engines can really produce such astronomic totals, even in-house:
search engineers say that such numbers are just an estimate,
which depends on how many "objects" related to the search query are in the engine's database. Hence they have -at most- a
comparative value among queries on the same search engine.
Still... why do the search engines deny us a fuller set?
Most searchers believe this happens because they do not want to
overload their server clusters, and surely it would be (exponentially!)
more taxing for an engine to order 1,000,000 results
-say- versus 1000; moreover any robot script could (and probably would) wreak bandwidth havoc if allowed
to delve into 1 billion or more links.
Yet I think there is also another reason: a more complete set of
results would make it relatively easier to reverse engineer some of the algos they use
for ranking, which is what the beastly SEO spammers toil and drool about all the time.
In fact "low positioning" in a set of results is often more telling that "best positioning".
That's incidentally the reason seekers -and spammers alike-
prefer to reverse engineer algos with queries that will offer a limited
number of results :-)
Back to our problem: only -say- 1000 results or fewer,
but we wish to investigate a bigger amount... what can
we do?
Well, of course we can simply narrow down our "too broad" queries and eliminate noise
until we fish purer signal inside a more manageable, much smaller set
of results. Yet in some cases we do WANT to know all the results for a broad query.
Besides, this kind of deep diving is interesting per se, as anyone who uses the yoyo searching approach
jolly well
knows :-)
So we basically want
to get past the search engines' imposed limits.
There are some ways to get more results (not the whole astronomic amount, but MUCH more than the limit
nevertheless): most of these approaches just
split that astronomic result amount into multiple result sets of (say) 1000+1000+1000+...
This is
usually done either by adding more words to the query (forcing or eliminating one or more terms, and
taking care to pick
accompanying terms that are more or less equally probable, in order to avoid skewing the final collated results)
or (a more typical approach) by just
manipulating the "advanced search" options to split all results
per time slice or per domain of provenance.
Of course afterwards, once you have obtained the multiple sets,
you'll have to "splice together" these multiple result pages and eliminate possible duplicates.
Any ad hoc script will help make the process almost automatic. Note that
the pages of results you'll get back won't map directly onto what would have been the "straight"
set of results; still, they will list all found pages in relevance order, so it will be relatively easy to
splice everything together.
Also note that if you simply tamper with the URL
(e.g. in google adding "&start=2000") you'll just get the message:
"Sorry, Google does not serve more than 1000 results for any query. (You asked for results starting from 2000)"
Imo the best method is always to use a temporal slicing approach:
get the "normal" set of 1000 results with your straight query.
Then get a somewhat different set with the same query, but specifying "only pages updated in the last year" (in google just add
the parameter
&as_qdr=y).
Rinse and repeat with pages updated in the last two years (in google:
&as_qdr=y2), then the last three... you get the idea...
Note however that the temporal slicing approach is prone to fish a lot of duplicates (for all
regularly updated sites), so you'll definitely need some good cleaning scripts for the splicing afterwards if
you use this approach.
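A minimal sketch of the temporal slicing trick, assuming google still honours the as_qdr parameter (y, y2, y3, ... meaning "updated within the last N years"); the splice() helper above then takes care of the duplicates:

```python
import urllib.parse

def temporal_slice_urls(query, years=5):
    """Build one query URL per time slice: the straight, unrestricted query first,
    then 'updated in the last year', 'last two years', and so on."""
    base = "https://www.google.com/search?q=" + urllib.parse.quote(query)
    urls = [base]                                  # the straight query
    for n in range(1, years + 1):
        urls.append(base + "&as_qdr=" + ("y" if n == 1 else f"y{n}"))
    return urls

for u in temporal_slice_urls('"web searching"'):
    print(u)
```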
Another alternative (for instance when the subject is somehow contingent and as such doesn't allow a temporal slicing
approach) is to go for a domain slicing approach and
use "[only] return results from the site or domain" and restrict yourself to -say- .com.
That will give you the top 1000 from .com domains alone.
Rinse and repeat for .org or .ru or .co.uk or .edu and whatnot (fellow seekers will know which domains
to exclude (or force)
according to the kind of fishing they are doing).
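The same mechanics, sketched for domain slicing and assuming the familiar site: operator; the list of top-level domains is just an example, pick the ones that fit your fishing grounds:

```python
import urllib.parse

def domain_slice_urls(query, tlds=("com", "org", "net", "edu", "ru", "co.uk")):
    """One query URL per top-level domain, each capped at the usual ~1000
    viewable results, to be spliced together afterwards."""
    return ["https://www.google.com/search?q=" + urllib.parse.quote(f"{query} site:{tld}")
            for tld in tlds]

for u in domain_slice_urls('"web searching"'):
    print(u)
```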
There are
other ways to get indirectly past the results limit using the advanced parameters that most search engines offer.
A filetype slicing approach can for instance be quite useful in some
bookish and document-oriented queries (excluding -say- .pdf files or .doc files
from the results). Fellow seekers will know which filetypes to exclude (or force)
according to the kind of fishing they are doing.
Another possibility is filesize slicing. Few engines allow you to do it
for web pages (though most engines allow filesize slicing when searching images).
Where the size parameters do exist, you might try preparing different "1000 pages" sets according
to the size of the targets.
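For completeness, the filetype variant of the same sketch, assuming the usual filetype: operator (negating a type works the same way, with a minus sign in front of the operator):

```python
import urllib.parse

def filetype_slice_urls(query, filetypes=("pdf", "doc", "ps", "rtf")):
    """One query URL per document type, handy for bookish, document-oriented fishing."""
    return ["https://www.google.com/search?q=" + urllib.parse.quote(f"{query} filetype:{ft}")
            for ft in filetypes]
```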
-------- Slice by hand --------
You can also slice results by hand, forcing the addition (or more often the subtraction) of a
specific term. You'll need to carefully choose terms that split the result mountain roughly in half.
For instance "web searching"
on MSNLive (179,000,000 results)
can be almost cut in half by subtracting the term "Internet":
"web searching" -Internet: 63,200,000 results.
Again, fellow seekers will
need some "Fingerspitzengefühl", some "nose", for this kind of term choice... ça va sans dire.
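If your "nose" needs a crutch, a small helper that ranks candidate splitter terms by how close the excluded set comes to half of the original count might look like this; the counts are whatever the engine reports (with all the caveats above), and the second candidate figure below is a hypothetical illustration, not a measured value:

```python
def best_splitters(total_count, excluded_counts):
    """excluded_counts maps a term to the engine's reported count for the query
    with that term excluded (e.g. '"web searching" -Internet').
    Returns the terms sorted by how close they cut the pile in half."""
    return sorted(excluded_counts,
                  key=lambda term: abs(excluded_counts[term] - total_count / 2))

# The MSNLive figures quoted above, plus one hypothetical candidate:
print(best_splitters(179_000_000, {"Internet": 63_200_000,
                                   "software": 95_000_000}))  # "software" count is made up
```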
-------- Google alone and you're never done --------
If you need more results than you can obtain with the approaches described above, just use more of the main
search engines! Combine (splice) the results of the same query from -say- Google, Yahoo, MSNLive, Ask, Altavista, Alltheweb and Cuil.
Remember that search engines' indexes do not overlap much.
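A sketch of this final combination step, in the same spirit as the splice() helper above but interleaving the per-engine lists round-robin so that no single engine dominates the top of the collated set:

```python
def combine_engines(per_engine_results):
    """per_engine_results: dict mapping engine name -> ordered list of result URLs.
    Interleaves the lists round-robin, deduplicating as it goes and keeping
    the first occurrence of each URL."""
    seen, combined = set(), []
    iterators = [iter(results) for results in per_engine_results.values()]
    while iterators:
        for it in list(iterators):
            url = next(it, None)
            if url is None:
                iterators.remove(it)
                continue
            key = url.rstrip("/").lower()
            if key not in seen:
                seen.add(key)
                combined.append(url)
    return combined

# Usage: combine_engines({"google": google_urls, "yahoo": yahoo_urls, "cuil": cuil_urls})
```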
Say, for instance, that google hands you a result like this one:
Organizing and searching the World Wide Web of facts -- step one: the one-million fact extraction challenge. In Proceedings of the 21st National Conference ... portal.acm.org/citation.cfm?id=1242572.1242587 - Similar pages - by M Paşca - 2007 - Cited by 14 - Related articles - All 10 versions
Ahi ahi! If you click on the link you'll be brought to an "ACM portal" ("the guide to computing literature", no less),
where some clown will ask you to "pay" in order to see your pdf target.
Well, you can pay, of course. Or you can simply search for the exact title:
Organizing and searching the world wide web of facts -- step one:
And ta-daa! As many copies of your target as you might need, all over the web.
But we are not finished yet. Where there's a "step one:" there must also be a "step two:", as the original google result itself hinted. Quaerite et invenietis:
Organizing and searching the world wide web of facts -- step two:
And ta-daa! Again as many copies as you might need, all over the web.
For instance: http://www2007.org/papers/paper560.pdf
Quod erat demonstrandum.