Mining the Deep Web

Written by Andy
Monday, 07 July 2008
There is a revolution raging over access to scientific information, yet much of that information you will never find by using Google: it lurks in the Deep Web. Time to don the helmet and switch on the lamp...
But, of course, along with the nuggets which Google returns there is an awful lot of dross to sift through. Try Googling the word "atom" - a fairly basic scientific term - and in the first page of hits which Google returns, only a link to a Wikipedia article refers to "atom" in the scientific sense. Just one hit out of twenty-odd on the first page. It's in the company of computer terms, comic book sites and miscellaneous organisations, all of which are related to the word "atom". And incidentally, the adjective "atomic" doesn't produce a single science-related hit in the first page of search results. Searching for other basic scientific terms often delivers one relevant hit in a page full of religious and pseudo-scientific dross.

The reason for this is very simple: the type of scientific information you are searching for is not likely to win any competition for popularity. Your search text may therefore not appear in the first few dozen pages of Google search results, if at all. Looking for the contents of an obscure scientific journal, or a scientific paper about chromosome x-crossover fluctuations in an alkaline environment? You could spend a long time searching Google. OK, I made that last example up, but it's probably quite typical of the type of scientific jargon a researcher may need to find.

However, there is another, very good, reason why you might not find detailed scientific information in Google. It's just not there.

I thought Google had everything...

The history of internet search engines is one of competition and survival of the fittest. For reasons nobody really knows, Google won, and became nearly everybody's search engine of choice.
Leaving aside Google's add-ons, services and gadgets, the search engine itself certainly doesn't offer anything vastly different to rivals like MSN or Alta Vista (once the preferred search engine for academics), yet Google handles far more internet search queries than all other major search engines put together. In April 2007, the figures were: Google's market share 65.26%, to Yahoo’s 20.73%, MSN’s 8.46% and Ask’s 3.69% (Source: MarketingPilgrim.com).

As for the number of pages Google has in its index, current estimates put the figure at anything between 4 and 8 billion, growing by a few million each day. Of course, no search engine would claim to have every single website in its index, because such a thing is impossible, but there is something at play here which is important for seekers of scientific information: the true number of internet pages may be six hundred times the number one could ever find by using a search engine like Google or MSN.

Read that again: the web is possibly six hundred times bigger than most people realise. In 2000, a study performed by the University of California, Berkeley estimated that this "invisible internet" comprises 91,000 terabytes of data. The so-called "surface web" - that which is readily accessible by search engines such as Google and Yahoo! - was reckoned to comprise just 167 terabytes. By volume of data, that is a ratio of roughly 545 to 1. There is another, more common, term which is now used to describe the "invisible internet". It is called the Deep Web.

What is the Deep Web?

Is there, then, some sort of technical deficiency with search engines which prevents them mining the Deep Web? Well, yes, although it's not a software error per se. The reason has to do with the way web pages are generated. If you do a search in Google, what Google looks for is web pages which are static.
That is to say, the pages' content may change - and, indeed, regularly changing content is one thing which can improve a site's GPR, for it marks the site as being alive rather than defunct - but the page is based on HTML content. The page was there yesterday, and it will be there tomorrow when Google's robots crawl the web. It has a fixed URL which does not change.

However, the Deep Web consists largely of pages which are generated dynamically when their content is needed. One example which is often used is somebody searching for departure times of an airline flight. Input the date and time you wish to fly, and the airline's site will generate a page of matching results. However, you cannot search for this page in Google. You cannot, say, search Google for British Airways flights leaving from London, destination New York, on the 5th of April around 9am. The information is dynamic, not static. The reason is that behind the scenes, the airline's server generates the information you are looking for by running software scripts which query a database and then create the web page "on the fly" - if you'll excuse the pun.

A vast number of web pages are generated and assembled dynamically by scripts. These make up the Deep Web, along with intranet pages, classified information and other web pages which search engines like Google will never find.

For researchers seeking scientific information, the Deep Web is where all the action is, for it allows them to query huge scientific databases dynamically. But they cannot do it using "surface web" search engines. Google does not have access to online scientific databases: many are held on government or university servers with restricted access, and in any event the information would not make any sense to a search engine without the software scripts to query, collate and publish that data. On its own, the data is pretty much unusable.
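To make the airline example concrete, here is a minimal sketch in Python of a page being built "on the fly" from a database query. Everything in it is invented for illustration: the `flights` table, the departure times, the year and the function names are assumptions, and no real airline's site is claimed to work this way. The point is simply that the HTML does not exist until somebody supplies the search values.

```python
import sqlite3
from html import escape


def build_demo_db():
    """Create a small in-memory database of made-up flights."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE flights (origin TEXT, destination TEXT, day TEXT, departs TEXT)"
    )
    conn.executemany(
        "INSERT INTO flights VALUES (?, ?, ?, ?)",
        [
            # Invented sample rows, not real timetable data.
            ("London", "New York", "2009-04-05", "09:10"),
            ("London", "New York", "2009-04-05", "13:45"),
            ("London", "Paris", "2009-04-05", "09:30"),
        ],
    )
    return conn


def render_departures(conn, origin, destination, day):
    """Query the database and assemble an HTML page for this one request."""
    rows = conn.execute(
        "SELECT departs FROM flights "
        "WHERE origin = ? AND destination = ? AND day = ? "
        "ORDER BY departs",
        (origin, destination, day),
    ).fetchall()
    items = "".join(f"<li>{escape(departs)}</li>" for (departs,) in rows)
    title = escape(f"{origin} to {destination} on {day}")
    return f"<html><body><h1>{title}</h1><ul>{items}</ul></body></html>"


if __name__ == "__main__":
    conn = build_demo_db()
    print(render_departures(conn, "London", "New York", "2009-04-05"))
```

A crawler which only follows links never fills in the origin, destination and date, so it never triggers the query and never sees the resulting page.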
Moreover, the permutations of queries one can construct to interrogate databases are virtually infinite, and companies like Google don't want millions upon millions of pages - which might only be generated once - cluttering up their indexes.

Having said that, Google has developed a tool called the Sitemap Protocol to trawl the Deep Web, but it is not as effective as, and produces fewer results than, a dedicated search engine. And it will deliver the Deep Web content alongside that of the surface web: you don't know which is which. So again, you have the job of separating the wheat from the chaff.

Therefore, to find scientific papers, journals, citations or anything else in the Deep Web, we need specialist tools.
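For the curious, the Sitemap Protocol mentioned above works from the site owner's side: the site publishes an XML file listing the URLs it would like crawled, which can include dynamically generated pages a crawler would not otherwise stumble upon. The following is a rough sketch in Python of building such a file. The `example.org` URLs and the `build_sitemap` function name are hypothetical, and only the required `loc` element is shown.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap document (as a string) listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as "&" in the URL for us.
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical dynamic result pages on an imaginary journal site.
    print(build_sitemap([
        "http://example.org/results?q=atom&page=1",
        "http://example.org/results?q=atom&page=2",
    ]))
```

Note that the site has to decide which of its near-infinite query permutations are worth listing, which is one reason this approach surfaces only a fraction of what a database holds.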
Last Updated: Friday, 11 July 2008
Everybody is familiar with using Google. You type in a simple search string, or use more advanced criteria including ANDs / ORs, and Google comes back with pages of hits. The first in the list will represent the page with the highest GPR (Google Page Ranking), which is a "score" assigned to that page or website by some pretty complicated Google algorithms. The exact content of those algorithms is kept secret, but amongst other things, the GPR indicates how popular the page or website is, and how many links point to it from other sites.
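Google's real ranking algorithms are secret, as noted above, but the general idea of scoring a page by the links pointing to it can be sketched. The Python below is a toy version of the publicly described PageRank approach, not Google's actual method: the four-page link graph is invented, and the damping value of 0.85 is the commonly quoted figure rather than anything confirmed for a live search engine.

```python
def rank_pages(links, damping=0.85, iterations=50):
    """Score pages by incoming links using simple power iteration.

    `links` maps each page name to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with every page equally important.
    scores = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline score regardless of links.
        new_scores = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page shares its score equally among the pages it links to.
                share = damping * scores[page] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * scores[page] / n
                for target in pages:
                    new_scores[target] += share
        scores = new_scores
    return scores


if __name__ == "__main__":
    # Invented link graph: three pages link to "c", and "c" links back to "a".
    toy_links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, score in sorted(rank_pages(toy_links).items()):
        print(page, round(score, 3))
```

In this toy graph the heavily linked page "c" comes out on top, and "a" outranks "b" and "d" because its one incoming link comes from a high-scoring page. That is the popularity contest an obscure scientific paper tends to lose.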