Mining the Deep Web
Written by Andy   
Monday, 07 July 2008

There is a revolution raging over access to scientific information, yet much of that information you will never find by using Google: it lurks in the Deep Web. Time to don the helmet and switch on the lamp...

Everybody is familiar with using Google. You type in a simple search string, or use more advanced criteria including ANDs and ORs, and Google comes back with pages of hits. Other things being equal, the pages at the top of the list are those with the highest GPR (Google PageRank), a "score" assigned to each page by some pretty complicated Google algorithms. The exact details of those algorithms are kept secret but, amongst other things, the GPR reflects how popular the page or website is, and how many links point to it from other sites.

But, of course, along with the nuggets which Google returns there is an awful lot of dross to sift through. Try Googling the word "atom" - a fairly basic scientific term - and in the first page of hits which Google returns, only a link to a Wikipedia article refers to "atom" in the scientific sense. Just one hit out of twenty-odd on the first page. It's in the company of computer terms, comic book sites and miscellaneous organisations, all of which are related to the word "atom". And incidentally, the adjective "atomic" doesn't produce one science-related hit in the first page of search results. Searching for other basic scientific terms often delivers one relevant hit in a page full of religious and pseudo-scientific dross. The reason for this is very simple: the type of scientific information you are searching for is not likely to win any competition for popularity. The page you want may therefore not appear in the first few dozen pages of Google search results, if at all.

Looking for the contents of an obscure scientific journal, or a scientific paper about chromosome x-crossover fluctuations in an alkaline environment? You could spend a long time searching Google. OK, I made that last example up, but it's probably quite typical of the type of scientific jargon a researcher may need to find. However, there is another, very good, reason why you might not find detailed scientific information in Google. It's just not there.

I thought Google had everything...

The history of internet search engines is one of competition and survival of the fittest. For reasons which nobody really knows, Google won, and became nearly everybody's search engine of choice. The Google search engine - never mind other Google add-ons, services and gadgets - certainly doesn't offer anything vastly different to other search engines like MSN or AltaVista - once the preferred search engine for academics - but Google handles far more internet search queries than all the other major search engines put together. In April 2007, the market-share figures were: Google 65.26%, Yahoo 20.73%, MSN 8.46% and Ask 3.69% (source: MarketingPilgrim.com). As for the number of pages Google has in its index, current estimates put the figure at anything between 4 and 8 billion, growing by a few million each day. Of course, no search engine would claim to have every single website in its index, because such a thing is impossible, but there is something at play here which is important for seekers of scientific information: the true number of internet pages may be five or six hundred times the number one could ever find by using a search engine like Google or MSN. Read that again: the web is possibly hundreds of times bigger than most people realise.

In 2000, a study performed by the University of California, Berkeley estimated that this "invisible internet" comprises 91,000 terabytes of data. The so-called "surface web" - that which is readily accessible by search engines such as Google and Yahoo! - was reckoned to comprise just 167 terabytes. There is another, more common, term which is now used to describe the "invisible internet". It is called the Deep Web.
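To put those two figures side by side, here is a quick back-of-the-envelope calculation. It uses only the terabyte estimates quoted above; the function and variable names are invented for illustration.

```python
# Estimates quoted in the article, in terabytes.
DEEP_WEB_TB = 91_000
SURFACE_WEB_TB = 167


def size_ratio(deep_tb: float, surface_tb: float) -> float:
    """Return how many times larger the deep web is than the surface web."""
    return deep_tb / surface_tb


ratio = size_ratio(DEEP_WEB_TB, SURFACE_WEB_TB)
# Share of the combined total which a surface-web search engine could reach.
surface_share = SURFACE_WEB_TB / (DEEP_WEB_TB + SURFACE_WEB_TB)

print(f"Deep web is roughly {ratio:.0f} times the size of the surface web")
print(f"Surface web is roughly {surface_share:.2%} of the combined total")
```

On these estimates the ratio comes out at several hundred to one, and the surface web is well under one per cent of the whole.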

What is the Deep Web?

Is there, then, some sort of technical deficiency with search engines which prevents them mining the Deep Web? Well, yes, although it's not a software error per se. The reason has to do with the way web pages are generated. When you do a search in Google, what Google searches is its index of web pages which are static. That is to say, the pages' content may change - and, indeed, regularly-changing content is one thing which can improve a site's GPR, for it marks the site as being alive rather than defunct - but each page exists as a stored HTML document. The page was there yesterday, and it will be there tomorrow when Google's robots crawl the web. It has a fixed URL which does not change.
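A rough way to picture what a robot does is a program which starts at a home page and follows every link it finds. The sketch below is not Google's crawler, just a minimal illustration: the miniature site, its URLs and the `crawl` function are all made up. It shows that a URL which no page links to, such as a results page which only exists after a form has been submitted, is never reached.

```python
from collections import deque

# Hypothetical miniature website: each fixed URL maps to the links on that page.
STATIC_SITE = {
    "/": ["/about", "/journals"],
    "/about": ["/"],
    "/journals": ["/journals/physics", "/search"],
    "/journals/physics": ["/journals"],
    "/search": [],  # a search form: results exist only after a query is submitted
}


def crawl(start: str, site: dict[str, list[str]]) -> set[str]:
    """Breadth-first crawl which discovers pages by following hyperlinks only."""
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        for link in site.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


indexed = crawl("/", STATIC_SITE)
# A results page generated on demand; nothing on the site links to it.
dynamic_url = "/search?q=chromosome+crossover"

print(sorted(indexed))
print(dynamic_url in indexed)
```

The robot finds the search form itself, because a page links to it, but it has no way of knowing which queries to type into it.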

However, the Deep Web consists largely of pages which are generated dynamically when their content is needed. One example which is often used is somebody searching for departure times of an airline flight. Input the date and time you wish to fly, and the airline's site will generate a page of matching results. However, you cannot search for this page in Google. You cannot, say, search Google for British Airways flights leaving from London, destination New York, on the 5th of April around 9am. The information is dynamic, not static: behind the scenes, the airline's server generates the information you are looking for by running software scripts which query a database and then create the web page "on the fly" - if you'll excuse the pun.
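In outline, such a script might look like the sketch below. It is only an illustration under stated assumptions: the table layout, the flight times, the year in the date and the function names are all invented, and a real airline system would be far more elaborate. The point is that the HTML page does not exist until somebody asks the question.

```python
import sqlite3
from html import escape


def build_database() -> sqlite3.Connection:
    """Create an in-memory table of made-up flights (illustrative data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE flights (origin TEXT, destination TEXT, day TEXT, departs TEXT)"
    )
    conn.executemany(
        "INSERT INTO flights VALUES (?, ?, ?, ?)",
        [
            ("London", "New York", "2009-04-05", "09:10"),
            ("London", "New York", "2009-04-05", "13:45"),
            ("London", "Paris", "2009-04-05", "09:30"),
        ],
    )
    return conn


def render_results(conn: sqlite3.Connection, origin: str, destination: str, day: str) -> str:
    """Query the database and assemble an HTML page from the matching rows."""
    rows = conn.execute(
        "SELECT departs FROM flights "
        "WHERE origin = ? AND destination = ? AND day = ? ORDER BY departs",
        (origin, destination, day),
    ).fetchall()
    items = "".join(f"<li>{escape(departs)}</li>" for (departs,) in rows)
    title = escape(f"{origin} to {destination} on {day}")
    return f"<html><body><h1>{title}</h1><ul>{items}</ul></body></html>"


conn = build_database()
page = render_results(conn, "London", "New York", "2009-04-05")
print(page)
```

Change any of the three search values and a different page comes back, which is why there is no single fixed page for a search engine to index.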

A vast number of web pages are generated and assembled dynamically by scripts. These make up the Deep Web, along with intranet pages, classified information and other web pages which search engines like Google will never find. For researchers of scientific information, the Deep Web is where all the action is, for it allows them to query huge scientific databases dynamically. But they cannot do it using "surface web" search engines. Google does not have access to online scientific databases: many are held on government or university servers with restricted access, and in any event the raw data would not make any sense to a search engine without the software scripts to query, collate and publish it. On its own, the data is pretty much unusable. Moreover, the number of queries one can construct to interrogate a database is virtually infinite, and companies like Google don't want millions upon millions of pages - each of which might only be generated once - cluttering up their indexes. Having said that, Google supports the Sitemap Protocol, which lets site owners list URLs that its robots would otherwise miss and so brings some Deep Web content into the index, but it is not as effective as, and produces fewer results than, a dedicated search engine. And Google will deliver that Deep Web content alongside that of the surface web: you don't know which is which. So again, you have the job of separating the wheat from the chaff.
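For the curious, a sitemap under the Sitemap Protocol is simply an XML file, published by the site owner, which lists URLs for search engine robots to visit. The sketch below builds a minimal one with Python's standard library. The example.org addresses and the `build_sitemap` function are hypothetical, and a real sitemap can carry optional details, such as last-modified dates, which are left out here.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the Sitemap Protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls: list[str]) -> str:
    """Return a minimal sitemap document listing each URL in a <loc> element."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical database-driven pages which no ordinary link points to.
sitemap_xml = build_sitemap(
    [
        "http://www.example.org/papers?id=1",
        "http://www.example.org/papers?id=2",
    ]
)
print(sitemap_xml)
```

Note that the site owner has to decide which of the virtually infinite possible query pages are worth listing; the robot still cannot invent queries of its own.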

Therefore, to find scientific papers, journals, citations or anything else in the Deep Web, we need specialist tools.


Last Updated ( Friday, 11 July 2008 )
 
