[image: Structure of the web: indexed vs unindexed]
~ How big is the Web? ~
by fravia+

http://www.searchlores.org
http://www.fravia.com


Version April 2009


[Introduction]
[Charts]
[How to check results beyond the s.e. limit]
["Hidden Document" retrieval, simple example]

How big is the Web?
(and how much of it has been indexed?)

Short answer: nobody knows
Optimistic answer: maybe we can guess it
Slightly longer answer: read this page
Really long answer: Visit our library, browse the web for info and study the problem for (at least) a couple of years, then prepare your own tests, reach some valid conclusions and let us know



Introduction 

"To find out about the web ahead, ask those coming back"

How big is the Web?

Hard to say.
How do we calculate "pages"? What exactly is a "host" nowadays? And what is a "site"? When a search engine reports -say- 10,000,000 results, do you trust it? And when it reports 10,235,987 results? And what if you don't trust such numbers?
Those results are not easy to check anyway, since engines do not allow any in-depth checking of their own results beyond position ~1000 (see "How to check results beyond the s.e. limit" below for a partial solution).
Plenty of problems, and few real data we can build upon to start with.

So what can we do? C'mon, we are seekers. We'll find and squeeze the few existing data until they scream and confess!
Alas! Even the few existing data are rather unreliable, so what follows is the best estimate we could infer, based on wild averages and extreme simplifications.
As the image at the top indicates, the indexed web is in any case just a small part of the real Internet, and even on that smaller turf, how much is really covered by the main search engines, and how much their indexes overlap, is anyone's guess.

Few entities have sufficient resources to tell us how many web pages there might be. The biggest search engines of course have some data: Google and Yahoo, after all, spend their lives visiting, analyzing and indexing the billions of web pages in the world. Yet neither company regularly publicizes the exact size of its index, and there have been no reliable figures since August 2005 (for Yahoo). They keep mum most of the time, bursting out some data, or bragging, only when a new contender represents a menace or when another search engine breaks this "gentlemen's agreement to keep mum".


Few & sparse data

Our own estimates, broad as they are, seem to indicate a current indexable size of around 60 billion pages, which would after all be consistent with Netcraft's and isc.org's data, but which does not dovetail with Cuil's (unsubstantiated) claim of a 120 billion page database.
The following quick check of the four "big" search engines (using the vowel "a" method) broadly confirms the engines' claims (apart from MSNlive), but only IF YOU TRUST THEIR OWN ALLEGED RESULTS about the number of pages they index... since there's no way to really check these results in depth (see the yoyo searching approach): there is in fact no way to exactly check all results beyond the engine's fixed limit (yet see "How to check results beyond the s.e. limit" below for a partial solution). So who knows if those numbers of indexed pages are true?

On a side note, it would be interesting to understand WHY search engines do not allow checking all the results they claim to have found. To avoid overloading their servers? Or because a fuller set would really allow us to easily reverse engineer their algos (and better understand the mechanisms of their "tides")?


Let's try the vowels
Time to check the search engines' claims.
The letter "a", for instance, often used for this kind of "quick checking", gives in google (April 2009) "just" 16 billion pages (or up to 18 billions if you try again a couple of hours later... search engines' results vary continuously, their "tides" probably depend from your birth sign or from the moon position :-)
And anyway the "more tricky" a OR a query gives us in google slightly more: 19,330,000,000 pages. So let's say 20 billions: consistent, q.e.d.

Now yahoo, which claims a bigger index, indeed gives for "a" 43,100,000,000 pages: consistent, q.e.d.
MSNlive, the third of the "big" engines, requires some nifty tricks to squeeze out the biggest page count from the "a" vowel: 8,650,000,000 results, which does not seem consistent with their own claim of around 20 billion indexed pages. See graph.
If we now check the recent ("cool") CUIL, which claimed "the biggest index of them all", it indeed gives us over 121 billion pages! Consistent, q.e.d.

So the "vowel a indexed pages finals" in April 2009 give the following results: Cuil = 120 billions; Yahoo = 40 billions; Goggle = 20 billions and MSNLive = 9 billions.

This said -as all searchers know- the quality of the algos is king, while "index size" is -as such- just an accessory. And google still seems to deliver quality results with a relatively small database of indexed pages.


Charts 

"Round numbers are always false"



From Netcraft data...
[chart: growth in sites]
(averaging with crossed fingers)

...through isc.org data...
[chart: growth in hosts]
...shaking well...

...to our own conclusions
[chart: growth in billions of pages]
(billions of indexable web pages, that is :-)



The almighty google monopoly...
[chart: search engine usage]
...should not blind seekers into using only one engine


Search engines don't overlap that much:
[chart: coverage]
main s.e. claimed index sizes & relative web coverage




Checking (sort of) indexed databases:
[chart: "bigger than yours"]
main s.e. **claimed** index sizes & vowel "a" search results



Checking their claims for real:
[chart: real life results according to fravia]
main s.e. refresh rate and size of indexes according to our data


How to check results beyond the s.e. limit 

"Slice & splice suit a query nice"


-------- Intro --------
So so so... the search engine you just used claims an astronomical number of results for your query, yet when really checking these results you can never go beyond (or "below" if you prefer) a specific limit, which can be 1000 results (MSNLive, alltheweb, altavista and Yahoo), slightly less than 1000 (google) or even smaller (ask gives only ~200 viewable results).

Why do the search engines deny us the full set? Well, first of all it remains to be seen whether the engines could really produce such astronomic totals, even in-house: search engineers say that such numbers are just an estimate, depending on how many "objects" in the engine's database relate to the search query. Hence the total has -at most- a comparative value among queries on the same search engine.
Still... why do the search engines deny us a fuller set?
Most searchers believe this happens because they do not want to overload their server clusters, and surely it would be (exponentially!) more taxing for an engine to order -say- 1,000,000 results versus 1000; moreover any robot script could and prolly would wreak bandwidth havoc if allowed to delve into 1 billion or more links. Yet I think there is also another reason: a more complete set of results would make it relatively easier to reverse engineer some of the algos they use for ranking, which is what the beastly SEO spammers toil and drool about all the time.
In fact "low positioning" in a set of results is often more telling than "best positioning".
That's incidentally the reason seekers -and spammers alike- prefer to reverse engineer algos with queries that will offer a limited number of results :-)

Back to our problem: only -say- 1000 results or less, but we would wish to investigate a bigger amount... what can we do?
Well, of course we can simply narrow down our "too broad" queries and eliminate noise until we fish a purer signal inside a more manageable, much smaller set of results. Yet in some cases we DO want to know all the results for a broad query. Besides, this kind of deep diving is interesting per se, as anyone who uses the yoyo searching approach jolly well knows :-)

So we basically want to get past the search engines' imposed limits.
There are some ways to get more results (not the whole astronomic amount, but MUCH more than the limit nevertheless): most of these approaches just split that astronomic result amount into multiple result sets of (say) 1000+1000+1000+...

This is usually done either by adding more words to the query (forcing or eliminating one or more terms, taking care to pick accompanying terms that are more or less equally probable, in order to avoid skewing the final collated results) or, more typically, by manipulating the "advanced search" options to split all results per timeslice or per domain of provenance.
Of course afterwards, once you have obtained the multiple sets, you'll have to "splice together" these multiple result pages and eliminate possible doubles.
Any ad hoc script will help in making the process almost automatic (a small sketch follows). Note that the result pages you'll get back won't map directly onto what would have been the "straight" set of results; still, they will list all found pages in relevant result order, so it will be relatively easy to splice everything together.
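As an example, here is a minimal splice-and-dedupe sketch in python. It assumes (my own assumption, adapt at will) that you have already saved each slice as a plain text file with one result URL per line, in ranking order; the names splice and normalise are mine.

import sys
from urllib.parse import urlsplit

def normalise(url):
    """Compare URLs ignoring scheme, trailing slash and fragment."""
    parts = urlsplit(url.strip())
    return (parts.netloc.lower(), parts.path.rstrip("/"), parts.query)

def splice(filenames):
    """Merge the slices in order, keeping only the first (best ranked) copy of each URL."""
    seen, merged = set(), []
    for name in filenames:
        with open(name, encoding="utf-8") as handle:
            for line in handle:
                url = line.strip()
                if not url:
                    continue
                key = normalise(url)
                if key not in seen:
                    seen.add(key)
                    merged.append(url)
    return merged

if __name__ == "__main__":
    # usage: python splice.py slice1.txt slice2.txt slice3.txt > spliced.txt
    for url in splice(sys.argv[1:]):
        print(url)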

Also note that if you tamper with the url (e.g. in google by adding "&start=2000") you'll get the message "Sorry, Google does not serve more than 1000 results for any query. (You asked for results starting from 2000)".

-------- Temporal slicing & Domain slicing --------
Imo the best method is always to use a temporal slicing approach: get the "normal" set of 1000 results with your straight query. Then get a somewhat different set with the same query, but specifying "only pages updated in the last year" (in google just add the parameter &as_qdr=y). Rinse and repeat with pages updated the year before (in google: &as_qdr=y2), and then the year before that... you get the idea...
Note however that the temporal slicing approach is prone to fishing up a lot of duplicates (for all regularly updated sites), so you'll definitely need some good cleaning scripts for the subsequent splicing if you use this approach. A small sketch of the slicing itself follows.
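Something along these lines, for instance: a sketch only, built around google's as_qdr parameter described above; the helper name temporal_slices is my own.

from urllib.parse import quote_plus

def temporal_slices(query, years=5, base="http://www.google.com/search?q="):
    """Yield one results URL per time slice, using google's as_qdr parameter."""
    plain = base + quote_plus(query)
    yield plain                                # the straight, unsliced query
    for n in range(1, years + 1):
        qdr = "y" if n == 1 else "y%d" % n     # y = last year, y2 = last two years, ...
        # note: the slices overlap (y2 contains y), hence the duplicates
        # mentioned above and the need for a cleanup script afterwards
        yield plain + "&as_qdr=" + qdr

if __name__ == "__main__":
    for url in temporal_slices("web searching"):
        print(url)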

Another alternative (for instance when the subject is somehow contingent and as such doesn't allow a temporal slicing approach) is to go for a domain slicing approach and use "[only] return results from the site or domain" and restrict yourself to -say- .com. That will give you the top 1000 from .com domains alone.
Rinse and repeat for .org or .ru or .co.uk or .edu and whatnot (fellow seekers will know which domains to exclude (or force) according to the kind of fishing they are doing). See the sketch below.
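A quick sketch of this domain slicing, using google's site: operator; the TLD list is only an example and domain_slices is my own hypothetical helper.

from urllib.parse import quote_plus

# adjust the list to the kind of fishing you are doing
TLDS = ["com", "org", "net", "edu", "ru", "co.uk"]

def domain_slices(query, tlds=TLDS, base="http://www.google.com/search?q="):
    """Yield one results URL per top-level domain, via the site: operator."""
    for tld in tlds:
        yield base + quote_plus("%s site:%s" % (query, tld))

if __name__ == "__main__":
    for url in domain_slices("web searching"):
        print(url)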

There are other ways to get indirectly past the results limit using the advanced parameters that most search engines offer.
A filetype slicing approach can for instance be quite useful in some bookish and document-oriented queries (excluding -say- .pdf files or .doc files from the results). Fellow seekers will know which filetypes to exclude (or force) according to the kind of fishing they are doing; a small sketch follows.
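Same pattern as before, this time around the filetype: operator. Again just a sketch: the helper name and the default list of types are my own assumptions.

from urllib.parse import quote_plus

def filetype_slices(query, types=("pdf", "doc", "ps", "rtf"), exclude=False,
                    base="http://www.google.com/search?q="):
    """Yield one results URL per document type, forced (filetype:) or excluded (-filetype:)."""
    op = "-filetype:" if exclude else "filetype:"
    for ftype in types:
        yield base + quote_plus("%s %s%s" % (query, op, ftype))

if __name__ == "__main__":
    for url in filetype_slices("searching the web"):
        print(url)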

Another possibility is filesize slicing. Few engines allow you to do it for web pages (though most engines allow filesize slicing when searching images). Where the size parameters do exist, you might try preparing different "1000 pages" sets according to the size of the targets.


-------- Slice per hand --------
You can also slice results per hand, forcing the addition (or more often the subtraction) of a specific term. You'll need to carefully choose terms that split the result mountain roughly in half.
For instance "web searching" on MSNLive (179,000,000 results) can be almost cut in half subtracting the term "Internet": "web searching" -Internet: 63,200,000 results.
Again, fellow seekers will need some "Fingerspitzengefühl", some "nose", for this kind of term choice... ça va sans dire. A small sketch follows.
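For completeness, a tiny sketch of this per hand slicing. I build google query URLs here merely as an illustration (the counts quoted above came from MSNLive), and hand_slices is a hypothetical helper of mine.

from urllib.parse import quote_plus

def hand_slices(query, term, base="http://www.google.com/search?q="):
    """Split one broad query into two complementary halves around a chosen term."""
    with_term = base + quote_plus('%s "%s"' % (query, term))
    without_term = base + quote_plus('%s -"%s"' % (query, term))
    return with_term, without_term

if __name__ == "__main__":
    # the example from above: "web searching" with and without "Internet"
    for url in hand_slices('"web searching"', "Internet"):
        print(url)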


-------- Google alone and you're never done --------
If you need more results than you can obtain with the approaches described above, just use more main search engines! Combine (splice) the results of the same query by -say- Google, Yahoo, MSNLive, Ask, Altavista, Alltheweb and Cuil. Remember that search engines' indexes do not overlap much.




"Hidden Document" retrieval, simple example 

"just pull the exact target name & search again"


Simple example 1:
"searching the world wide web"
Let's say you are interested in this specific google result:

Organizing and searching the world wide web of facts -- step two

Organizing and searching the World Wide Web of facts -- step one: the one- million
fact extraction challenge. In Proceedings of the 21st National Conference ...
portal.acm.org/citation.cfm?id=1242572.1242587 - Similar pages
by M Paşca - 2007 - Cited by 14 - Related articles - All 10 versions


Ahi ahi! If you click on the link you'll be brought to an "ACM portal" ("the guide to computing literature", no less), where some clown will ask you to "pay" in order to see your pdf target.
Well you can pay, of course. Or you can simply search for the exact title:
Organizing and searching the world wide web of facts -- step two:
And ta-daa! As many copies of your target as you might need, all over the web. For instance: http://www2007.org/papers/paper560.pdf
But we are not finished yet. Where there's a "step two" there must also be a "step one", as the google snippet itself pointed out.
Quaerite et invenietis: Organizing and searching the world wide web of facts -- step one:
And ta-daa again! As many copies of that target as you might need, all over the web.
Quod erat demonstrandum.
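If you do this kind of retrieval often, the trick boils down to a single line of url building. A trivial sketch, with exact_title_query as my own illustrative name:

from urllib.parse import quote_plus

def exact_title_query(title, base="http://www.google.com/search?q="):
    """Build a phrase search for the exact title of a paywalled target."""
    return base + quote_plus('"%s"' % title)

if __name__ == "__main__":
    print(exact_title_query(
        "Organizing and searching the world wide web of facts -- step one"))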





Back to basic
Back to advanced

© 1952-2032: [fravia+], all rights reserved, coupla wrongs reversed

Page optimised for Opera. Other browsers? Couldn't care less.