Dark Web: How Big Is It?

January 11, 2016

I read “Big Data and the Deep, Dark Web.” The write up raises an important point. I question the data, however.

First, there is the unpleasant task of dealing with terminology. A number of different phrases appear in the write up; for example:

  • Dark Web
  • Deep Web
  • Surface Web
  • World Wide Web

Getting hard data about the “number” of Web pages or Web sites is an interesting problem. I know that popular content gets indexed frequently. That makes sense in an ad-driven business model. I know that less frequently indexed content is often an unhappy consequence of resource availability. It takes time and money to index every possible link on each index cycle. I know that network latency can cause an indexing system to give up and move on to another, more responsive site. (A minimal sketch of that skip-on-timeout behavior appears below.) Then there is bad code, and there is intentional obfuscation, such as my posting content on Xenky.com for those who attend my LEA/Intelligence lectures sponsored by Telestrategies in information-friendly Virginia.
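Here is that sketch: a toy crawler that abandons a slow host and moves on to the next one. The site list and the five-second timeout are my own illustrative choices, not figures from any indexing vendor.

    import urllib.error
    import urllib.request

    # Illustrative values only; no crawler operator publishes its real settings.
    SITES = ["https://example.com", "https://example.org"]
    TIMEOUT_SECONDS = 5

    def fetch(url):
        """Fetch a page, giving up if the host is too slow to respond."""
        try:
            with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as response:
                return response.read()
        except (urllib.error.URLError, TimeoutError):
            # High latency or a broken server: skip the site on this
            # index cycle and move on to a more responsive one.
            return None

    for site in SITES:
        page = fetch(site)
        print(site, "indexed" if page is not None else "skipped this cycle")

Multiply that skip across millions of hosts and every index cycle, and the gap between “what exists” and “what gets counted” widens.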

Then there is the Surface Web, which I call the Clear Web. What do we make of a Wall Street Journal article I can reach when I click a link from one site but not from another? The Wall Street Journal requires a user name and password, sometimes. So what is this? A Clear Web site, or a visible but not accessible site?

The terminology is messy.

Bright Planet coined the Deep Web moniker back in 2000. The usage was precise: These are sites which are not static; for example, dynamically generated Web pages. An example would be the Southwest Airlines fare page. A user has to click in order to get the pricing options. Bright Planet also included password-protected sites. Examples range from a company’s Web page for employees to sites which require the user to pay money to gain access.
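To see why a spider misses such pages, consider a minimal sketch. The endpoint, field names, and airport codes below are invented for illustration; Southwest’s real fare search works differently.

    import urllib.parse
    import urllib.request

    # Hypothetical fare-search endpoint; real airline sites differ.
    SEARCH_URL = "https://airline.example/fares/search"

    # The fare page is generated on the fly from this form submission.
    form_data = urllib.parse.urlencode({
        "origin": "SDF",
        "destination": "BWI",
        "date": "2016-02-01",
    }).encode("ascii")

    # A crawler that only follows static links never issues this POST,
    # so the resulting page has no fixed URL for an index to record.
    request = urllib.request.Request(SEARCH_URL, data=form_data)
    print(request.get_method())  # POST, not the GET a link click produces
    # response = urllib.request.urlopen(request)  # would return the dynamic page

The pricing options exist only after the form is submitted, which is exactly why Bright Planet put such content below the surface.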

Then we have the semi-exciting Dark Web, which is also referred to as the Hidden Web.

Most folks writing about the number of Web sites or Web pages available in one of these collections are pretty much making up data.

Here’s an example of fanciful numerics. Note the disclaimers, which are a flashing yellow caution light for me:

Accurately determining the size of the deep web or the dark web is all but impossible. In 2001, it was estimated that the deep web contained 7,500 terabytes of information. The surface web, by comparison, contained only 19 terabytes of content at the time. What we do know is that the deep web has between 400 and 550 times more public information than the surface web. More than 200,000 deep web sites currently exist. Together, the 60 largest deep web sites contain around 750 terabytes of data, surpassing the size of the entire surface web by 40 times. Compared with the few billion individual documents on the surface web, 550 billion individual documents can be found on the deep web. A total of 95 percent of the deep web is publically accessible, meaning no fees or subscriptions.

Where do these numbers come from? The quoted figures do not even agree with one another: 7,500 terabytes against 19 terabytes is a ratio of roughly 395 to one, just shy of the “between 400 and 550 times” range asserted two sentences later. How many sites require Tor to access their data? I am working on my January Webinar for Telestrategies. Sorry. Attendance is limited to those active in LEA/Intelligence/Security. I queried one of the firms actively monitoring and indexing Dark Web content. That company, which you may want to pay attention to, is Terbium Labs. Visit them at www.terbiumlabs.com. As with most of the outfits involved in Dark Web analytics, certain information is not available. I was able to get some ballpark figures from one of the founders. (He is pretty good with counting since he is a sci-tech type with industrial-strength credentials in the math-oriented world of advanced physics.)

Here’s the information I obtained, which comes from Terbium Labs’ real-time monitoring of the Dark Web:

We [Terbium Labs] probably have the most complete picture of it [the Dark Web] compared to most anyone out there.  While we don’t comment publicly on our specific coverage, in our estimation, the Dark Web, as we loosely define it, consists of a few tens of thousands or hundreds of thousands of domains, including light web paste sites and carding forums, Tor hidden services, i2p sites, and others.  While the Dark Web is large enough that it is impossible to comprehensively consume by human analysts, compared with the billion or so light web domains, it is relatively compact.

My take is that the Dark Web is easy to talk about. It is more difficult to obtain informed analysis of the Dark Web: what is available, which sites are operated by law enforcement and government agencies, and which sites are actively engaged in Dark Web commerce, information exchange, publishing, and other tasks.

One final point: The Dark Web uses Web protocols; a short illustration follows this paragraph. In a sense, the Dark Web is little more than a suburb of the metropolis that Google indexes selectively. For more information about the Dark Web and its realities, check out my forthcoming Dark Web Notebook. If you want to reserve a copy, email benkent2020 at yahoo dot com. LEA, intel, and security professionals get a discount. Others pay $200 per copy.
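The illustration: fetching a Tor hidden service with ordinary HTTP. This assumes a local Tor client listening on its default SOCKS port (9050) and the requests library installed with SOCKS support; the .onion address is a placeholder, not a real site.

    import requests  # pip install requests[socks]

    # Assumes a Tor client is running locally on its default SOCKS port.
    proxies = {
        "http": "socks5h://127.0.0.1:9050",   # socks5h: let Tor resolve
        "https": "socks5h://127.0.0.1:9050",  # the .onion name itself
    }

    # Placeholder address; substitute a real hidden service to test.
    url = "http://exampleonionservice.onion/"

    # A plain HTTP GET, same as any Clear Web request; only the
    # transport (the Tor network) differs.
    response = requests.get(url, proxies=proxies, timeout=60)
    print(response.status_code, len(response.content), "bytes")

Same verbs, same headers, same HTML coming back. The suburb runs on the metropolis’s plumbing.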

Stephen E Arnold, January 11, 2016
