Collective Intelligence Anthology Available

May 14, 2008

The ArnoldIT.com mascot admires the new collection of essays edited by Mark Tovey. Collective Intelligence: Creating a Prosperous World at Peace, published by the Earth Intelligence Network in Oakton, Virginia (ISBN-13: 978-0-97-15661-6-3), contains more than 50 essays by analysts, consultants, and intelligence practitioners. You can obtain a copy from the publisher, Amazon, or your bookseller.

The ArnoldIT mascot completed reading the 600-page book with remarkable alacrity for a duck.

The collection of essays is likely to find many readers among those interested in the social phenomena of networks. Many of the essays, including the one I contributed, talk about information retrieval in our increasingly interconnected world.

This essay provides a synopsis of my contribution, “Search–Panacea or Play. Can Collective Intelligence Improve Findability”, which I wrote shortly before completing Beyond Search: What to Do When Your Search System Doesn’t Work. My essay begins on page 375.

Social Search

The dominance of Google forces other vendors to look for a way over, under, around, or through its grip on Web search. The vendor landscape now offers search and content processing systems that arguably do a better job of manipulating XML (Extensible Markup Language) content, figuring out who knows whom (the social graph initiative), and determining the “real” meaning of content (semantic search). More than 100 vendors have technology that offers, if one believes the marketing collateral and conference presentations, a way to squeeze more information from information.

Social search is the name given to an information retrieval system that incorporates one or more of these functions:

  1. Users can suggest useful sites. Examples: Delicious.com and StumbleUpon.com
  2. The system discovers relationships between and among processed documents and links. Examples: Powerset.com and Kartoo Visu
  3. The system analyzes information, extracts entities, and identifies individuals and their relationships. Examples: i2 Ltd (now part of ChoicePoint) and Cluuz.com
  4. The system monitors user behavior and uses the data to guide relevance, spidering, and other system functions. Examples: public Web indexing companies

There are other types of social functions, but these provide sufficient salt and pepper for this information side dish. The reason I say side dish is that social functions are not going to displace the traditional functions on which they are based. Social search has been in the mainstream from the moment i2 Ltd. introduced its workbench product to the intelligence community more than a decade ago. “Social” functions, then, are a recent add-on to the main diet in information retrieval.

Old Statistics and Cheap, Powerful Computers

What’s overlooked in the rush to find a Google “killer” is that the new companies are using some well-known technologies. For example, the inner workings of Autonomy’s “black box” are somewhat dependent on the work of a slightly unusual Englishman, Thomas Bayes. Mr. Bayes left the world a couple of centuries ago, but his math has been a staple in college statistics courses for many years. To deploy Bayesian techniques on a large scale is, therefore, not exactly a secret to the thousands of mathematicians who followed his proofs in pursuit of their baccalaureate.
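For readers who skipped that statistics course, here is a minimal sketch of Bayes' rule applied to a retrieval question: how likely is a document to be relevant, given that it contains a query term? This is textbook math only, not a description of Autonomy's proprietary system, and the function name and all of the probabilities are made up for illustration.

```python
def posterior(prior, likelihood, likelihood_if_not):
    """Bayes' rule: P(relevant | term appears).

    prior             -- P(relevant), before looking at the document
    likelihood        -- P(term appears | relevant)
    likelihood_if_not -- P(term appears | not relevant)
    All three inputs are hypothetical values chosen for this example.
    """
    # Total probability that the term appears in any document.
    evidence = likelihood * prior + likelihood_if_not * (1.0 - prior)
    return likelihood * prior / evidence


# Assume 10% of documents are relevant, the term shows up in 80% of the
# relevant ones and in 20% of the rest.
p_relevant = posterior(prior=0.1, likelihood=0.8, likelihood_if_not=0.2)
print(f"P(relevant | term) = {p_relevant:.3f}")
```

Seeing the term raises the estimate from the 10 percent prior to roughly 31 percent. The statistics are old; what is new is the cheap computing needed to repeat this update across millions of documents and terms.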

Read more

Sybase Jumps into the Content Processing Appliance Fray

May 13, 2008

Sybase announced on May 12, 2008, the rollout of its Sybase Analytic Appliance. The hardware is an IBM Power System preconfigured with Sybase IQ, Sybase PowerDesigner, and MicroStrategy 8. The idea is to eliminate the fiddly tasks associated with setting up a data and content processing system, so that a customer gets the benefits of a custom-built enterprise data warehouse in a ready-to-deploy device.

Sybase IQ is the column-oriented Sybase database engine. Column databases offer a performance boost over traditional relational databases. Sybase PowerDesigner is a model-driven tool intended to reduce the pain of building report requirements, models, and related tasks. MicroStrategy 8 is a business intelligence system.
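To see why a column layout can help with analytic queries, consider this rough sketch. It is not how Sybase IQ is implemented; the table, its field names, and the helper `to_columns` are hypothetical. The point is only that a query summing one field needs to read one list in the column layout, while the row layout touches every record.

```python
# A tiny "table" stored row by row, as a traditional engine might hold it.
rows = [
    {"region": "east", "units": 10, "price": 2.5},
    {"region": "west", "units": 4, "price": 3.0},
    {"region": "east", "units": 7, "price": 2.0},
]


def to_columns(records):
    """Pivot row records into one list per column (hypothetical helper)."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every record is visited even though only "units" is needed.
total_units_row = sum(record["units"] for record in rows)

# Column layout: the query reads the single "units" list and nothing else.
total_units_col = sum(columns["units"])

print(total_units_row, total_units_col)
```

Both layouts return the same answer. The difference is in how much data has to be read to get it, which matters when the table holds terabytes and a report needs only a few of its columns.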

The cost for the system is based on the data volume. The information I saw quoted an introductory price of $27,000 per terabyte of data. The design of the appliance allows “snap in” scaling. There are three versions of the appliance, and the prices rise as you move from the starter to standard to enterprise version. You can buy the device from Sybase, MicroStrategy, or mLogica (a systems integrator).

Appliances can be criticized for their limited functionality. Sybase has done a good job of providing a bundle that gives the licensee considerable freedom to configure the device and manipulate data. Compared to other industrial-strength appliances, Sybase has an attractive launch price point. You will need to determine your data volume and data change rate in order to determine which appliance version is appropriate for your organization.

Stephen Arnold, May 13, 2008

Commercial Intelligence: A Better Way to Do Competitive Intelligence

May 13, 2008

Business intelligence and competitive intelligence are “not really intelligence”, asserts Robert D. Steele, well-known advocate of open source information and managing director of OSS.Net. In an exclusive interview with Beyond Search, Mr. Steele–who is one of the strongest advocates for the use of open source information for intelligence–says that commercial business intelligence “systems are edging toward failure. The systems aren’t very good, useful, or usable.”

The fix to the problems of today’s software-based approaches to intelligence is a mixed approach. He says that a better approach:

…consists of requirements definition (understand the question in context, the desired outcome); collection management (know who knows), source discovery and validation (generally done by expert humans who have spent their life mastering the domain, at someone else’s expense); analysis, which can be aided by but does not necessarily require automated support; and compelling timely actionable presentation to the decision-maker.

You can read the full interview in the Interview section of the Beyond Search Web log site here.

Stephen Arnold, May 13, 2008

Former Clandestine Operative Says Automated Systems Not Good Enough

May 13, 2008

Editor’s Note: Robert Steele, former Marine Corps officer and intelligence operative, was one of the first, if not the first, intelligence professionals since World War II to question the relative value of secret sources and technologies in relation to open sources and technologies. Mr. Steele agreed to meet me near his office in suburban Washington, D.C. The full text of the interview appears below. After we spoke, Mr. Steele provided me with illustrations he referenced in our conversation. I have included these in the transcript at the point where Mr. Steele references them. You can read more about Mr. Steele at his Web site, OSS.Net.

How did you get interested in using information that’s readily available to anyone in a library, in newspapers, and online as a source of useful intelligence?

I went into the international spy program at CIA with a Master’s in International Relations, and knew quite a bit about citation analysis and primary research. What I was not expecting over the course of my clandestine career was the obsession with stealing secrets to the exclusion of all that could be known from open sources.

Robert D. Steele

The clandestine officers also refused to interact with the analysts—before leaving for my first overseas assignment, the Chief of Station took me to the analysis side of the house, and on my way there he said something along the lines of “these folks know nothing useful, and we tell them nothing.”

When the Marine Corps asked me to leave CIA to create the Marine Corps Intelligence Center in 1988, I promptly did what I thought the government wanted; that is, I spent $20 million on a codeword analysis center, including a Special Intelligence Communications (SPINTCOM) work station. I thought it would do everything except kill the terrorist.

Was I in for a shock. I had put a PC with Internet access in an isolated room, not connected to any government network. The PC had a modem. I was curious about online and bulletin board systems. In a short time, analysts were leaving their super charged workstations to stand in line to use the PC. These professionals were looking for information that was not in the government system and not known to our officers in the field (including diplomats and commercial or defense attaches).

What a wake up call.

That is when I learned that expensive systems are as good as their sources—narrow casting into the secret world made much of our multi-billion dollar technology virtually worthless. Analysts using the PC showed me that 80 to 90 percent of the information we needed could be obtained using the PC and public information to include direct calls to overt human experts. I also learned that useful information was available in 183 other languages no one in the US Government can speak or understand. Even today, a large number of Washington officials don’t understand the intelligence value of open sources of information including commercial imagery, foreign-language broadcasts that must be accessed locally, and gray literature, such as university yearbooks for a photo of a terrorist. Washington is completely out of touch with human experts that are not US citizens eligible for a secret clearance—the spies don’t want them unless they agree to commit treason, and the analysts are not allowed to talk to them by paranoid ignorant security officials.

Almost every vendor asserts that their systems can “do” business or competitive intelligence. In your experience is this accurate?

Look. BI and CI are not really intelligence.

BI or business intelligence is commonly used as a descriptor for what is nothing more than internal knowledge management, spiced up with a point-and-click graphics dashboard. Not only are most of these systems non-interoperable with everything else, they are as smart or as stupid as the digital data they can access.

The reality of information in most organizations is that most of what is really valuable is not digital. And, most CEOs have zero idea what intelligence (decision support) actually means.

CI or competitive intelligence focuses on competitors. What I practice, Commercial Intelligence, focuses on

  • External information
  • Collaborative work
  • Knowledge management
  • Organizational intelligence.

Commercial intelligence leverages what can be drawn from the human social networks interacting with an organization and the other sources of information. External information is not information about competitors. It includes such factors as “true cost” of goods and next-generation “cradle to cradle” opportunities. You have to factor in the art and science of retaining Organizational Intelligence. I will send you a diagram that shows my view of this commercial intelligence space.

four sectors

In my experience, today’s systems are edging toward failure. The systems aren’t very good, useful, or usable. As the Gartner Group recently said about Windows, it is untenable. I like Microsoft for its cash flow—they need to dump the legacy and launch an open source network with shared call centers and Blue Cube power processing.

Read more

Groping the Enterprise Search Elephant

May 12, 2008

In the 2000 to 2003 period, ArnoldIT.com delivered a number of tutorials about search. Some of these presentations were held in conjunction with conferences such as the Boston Search Engine Meeting, Gilbane’s conferences, and the Information Today line up of professional programs. Others were delivered to small groups at various financial institutions, search vendors, and government entities.

This is the search elephant. In a meeting, you will hear many people talk about search. Each person will have a specific meaning and assume that the others in the room will know exactly what’s meant when she uses the word search. If you take all these individual meanings of search and put them together, you have a better idea of what a search system is supposed to deliver.

In each case, I had to take more time than budgeted to define the different types of search encountered in enterprise behind-the-firewall deployments. This issue surfaced this weekend when I spoke with a colleague grousing about the different perceptions of search in a consulting firm in Europe.

The purpose of this essay is to provide an abbreviated and, I hope, useful look at the different meanings of search. You can learn more about this subject in Enterprise Search Report and the brand-new Beyond Search study that came out in April 2008. I wrote the first three editions of ESR and played a minor part in the current edition, and you will get more color on this topic in those for-fee analyses.

Everybody Knows about Search

The definition issue is skipped over because most people today believe they know about search. At dinner last night, people said, “I did a search for a cruise to Brazil”, “I looked up my health care benefits and found they were reduced”, “… and I’m not sure it’s worth seeing”, and “My boss had me find a proposal he thought he had lost when his laptop was stolen”. None of these people were information retrieval professionals or computer scientists. But each of them talked about search as if it were a routine activity like finding a parking space.

The need for a definition goes up when people assume others mean the same thing by search. Let’s look at the meanings for search in an enterprise.

Enterprise Search or Behind-the-Firewall Search

This is the buzzword of the moment. Companies know intuitively that if a worker can’t find information on the company’s own internal network, the worker is going to waste time looking for what’s needed. Even worse, the employee can’t find the accurate information and makes a boneheaded decision.

Enterprise search is a contradiction. No boss in the world wants “everything” indexed and searchable. Problems come from indexing “everything”. A few of the bombs in the enterprise search minefield are:

  • Email on topics that are or can be problematic
  • Information about company secrets like Coca Cola’s formula for the fizzy drink
  • Information about legal matters
  • Information an employee puts on a company server about non-company activities
  • Personal, salary, and medical information
  • Pricing information
  • Stolen software, information from a third-party provider without paying a license fee or obtaining a copyright permission, information about a competitor that was obtained via an email from a friend

Search works best when the domain of information to index is narrowly defined, reviewed, and subject to a formal approval and review policy. Ad hoc indexing of behind-the-firewall information can trigger big trouble fast.
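One way to picture a narrowly defined, approved indexing domain is a simple gate that every document must pass before the indexer sees it. The sketch below is illustrative only: the folder prefixes, the blocked terms, and the function `is_indexable` are all hypothetical, and a real policy would be reviewed by legal and HR rather than hard-coded.

```python
# Hypothetical policy: only these reviewed folders may be indexed...
APPROVED_PREFIXES = ("/shared/policies/", "/shared/products/")

# ...and anything that looks like one of the trouble spots is skipped.
BLOCKED_TERMS = ("salary", "legal", "medical")


def is_indexable(path):
    """Return True only for paths inside an approved folder with no blocked term."""
    lowered = path.lower()
    in_approved_area = lowered.startswith(APPROVED_PREFIXES)
    looks_sensitive = any(term in lowered for term in BLOCKED_TERMS)
    return in_approved_area and not looks_sensitive


candidates = [
    "/shared/policies/travel.pdf",
    "/shared/policies/salary-bands.xls",
    "/home/jdoe/fantasy-football.doc",
]
to_index = [path for path in candidates if is_indexable(path)]
print(to_index)
```

The design choice is to approve by default nothing: a document is indexed only if someone has put its location on the list, which is the opposite of the ad hoc "crawl everything" approach.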

Read more

Intelligenx Discloses Referrals Fuel Rapid Growth

May 12, 2008

In an exclusive interview, Iqbal and Zubair Talib, senior managers of Intelligenx, reveal that referrals have fueled the company’s rapid growth. Intelligenx has a leadership position in directory and “yellow page” search in South Africa, South America, and elsewhere. The company’s profile, despite its US headquarters in suburban Washington, DC, is modest.

The father-son team said:

It seems that our international clients are actively talking about our technology at international conferences. We can always do a better job of marketing, but we put our customers first. Sales occur because people come to us and say, “We want to license your system”… we maintained certain relationships among an elite group of scientists and engineers. We never signed up to give marketing talks at the marketing-oriented venues. Our success comes because certain people understand our technology and recognize that it delivers scale, speed, performance, data management today. Our technology is our marketing.

Unlike search and content processing firms that issue news releases when a Web site signs on to use a well-known search engine or when a vendor announces for the second or third time a reseller deal, Intelligenx keeps innovating and selling.

The company’s system offers almost all of the features associated with the best-known vendors in the search market sector. The Talibs said:

Intelligenx was first to market with technology that offered a true full-text search with what many people call faceted or assisted search results. To achieve this functionality, performance under heavy loads is the prevailing challenge and simply put, our Discovery Engine® solves the problem in what we think is a most elegant fashion. “Facets” or “guided navigation” are not just a “checkbox” on a feature matrix but an underlying central philosophy in our technology, the company, and in the development of our system.
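For readers unfamiliar with faceted or guided navigation, here is a toy sketch of the idea using a made-up yellow-pages data set. It says nothing about how the Discovery Engine works internally; the listings, field names, and functions are assumptions for illustration. A text query returns hits, the system counts the values of each facet among those hits, and the user narrows the result by picking a facet value.

```python
from collections import Counter

# Hypothetical directory listings.
LISTINGS = [
    {"name": "Ace Plumbing", "category": "plumber", "city": "Durban"},
    {"name": "Best Plumbing Supplies", "category": "hardware", "city": "Durban"},
    {"name": "City Plumbing", "category": "plumber", "city": "Cape Town"},
    {"name": "Delta Electric", "category": "electrician", "city": "Durban"},
]


def search(listings, term, **selected):
    """Substring match on the name, narrowed by any selected facet values."""
    term = term.lower()
    return [
        record
        for record in listings
        if term in record["name"].lower()
        and all(record[field] == value for field, value in selected.items())
    ]


def facet_counts(hits, fields=("category", "city")):
    """Count how many hits fall under each value of each facet field."""
    return {field: Counter(record[field] for record in hits) for field in fields}


hits = search(LISTINGS, "plumbing")
print(facet_counts(hits))

# The user clicks the "plumber" facet to narrow the same query.
narrowed = search(LISTINGS, "plumbing", category="plumber")
print([record["name"] for record in narrowed])
```

The counts are what make the navigation "guided": the user sees how many results sit behind each choice before clicking. Recomputing those counts for every query under heavy load is the performance challenge the Talibs describe.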

You can read about the company’s new stream processing of information, what the Talibs call “cluster flow”. In addition to near real time index updating, additional metadata are generated without adding latency to the system. Another interesting feature of the Intelligenx system is that a licensee can provide its sales people with a real time view of what advertisements are germane to a popular query. The sales person is able to show a prospective advertiser a live report of traffic and the payoff from an advertisement in a specific context.

The company’s technology offers an alternative to the better-known MarkLogic system and the specialist firm, Dieselpoint.

You can read the entire interview on the ArnoldIT.com Web site. The full text of the interview is part of the Search Wizards Speak feature. The exclusive interview is the 13th in this series of first-person accounts of the origin and functionality of important search and content processing systems. Click here to read the interview.

Powerset Available

May 12, 2008

Navigate to Powerset.com and try out the much-publicized Web search system. Using proprietary technology plus third-party components, Powerset is a semantic search system. The system differentiates itself with fact extraction (Factz, in Powerset jargon), direct links to definitions, and a summary / outline view. A big yellow sticky note says that Powerset is searching Wikipedia articles, but my test queries returned useful information in the results list in default mode; for example, the name of Tropes Zoom, a system I had heard about but never seen. A quick Google search allowed me to pinpoint Semantic Knowledge as a company with a technology of this name. I’m not sure Powerset envisioned my use of its system as a front end for Google, but that use jumped out at me. Check it out and let me know if you think it is better than Google, Hakia, or Exalead. These are systems that contain a dollop of semantic sauce. Hopefully the company will provide a larger content index either by spidering the Web or via a metasearch like Vivisimo’s.

Stephen Arnold, May 12, 2008

Kartoo’s Visu: Semantic Search Plus Themescape Visualization

May 11, 2008

In England in December 2007, I saw a brief demonstration of Kartoo.com’s “thematic map”, which was announced in 2005.

The company’s genesis lies in relationships with large publishing groups in 1997, when Mr. Baleydier was working to make CD-ROMs easily searchable. Kartoo was founded in 2001 by Laurent and Nicholas Baleydier to provide a more advanced search interface. You can find out more about the company at Kartoo.net. Kartoo S.A. offers a no-charge metasearch Web system at Kartoo.com.

The original Kartoo service was one of the first to use dynamic graphics for Web search. Over the last few years, the interface became more refined. But the system presented links in the form of dynamic maps. Important Web sites were spherical, and the spheres were connected by lines. Here’s an example of the basic Kartoo interface as it looked on May 11, 2008, for the query “semantic search” run against the default of English Web sites. (The company also offers Ujiko.com, which is worth a quick look. The interface is a bit too abstract for me. You can try it here.)

[Image: default Kartoo results, May 2008]

The dark blue “ink blots” connect related Web sites. The terms provide an indication of the type of relationship between or among Web sites. You can click on this interface to explore the result set and perform other functions. Experimenting with the interface is the best way to learn its features. Describing the mouse actions is not as effective as playing with the system.

Another company–Datops SA–was among the first to use interesting graphic representations of results. I recall someone telling me that the spheres that once characterized Groxis.com’s results had been influenced by a French wizard. Whether justified or not, when I saw spheres and ink blots, I said to myself, “Ah, another vendor influenced by French interface design”. In talking with people who use visualizations to help their users understand a “results space”, I’ve had mixed feedback. Some people love impressionistic representations of results; others don’t. Decades ago I played a small role in the design of the F-15 interface or heads-up display. The one lesson I learned from that work was that under pressure, interfaces that offer too many options can paralyze reaction time. In combat, that means the pilot could be killed trying to figure out what the graphics mean. In other situations, such as when a computational chemist is trying to make sense of 100,000 possible structures, a fine-grained visualization of the results may be appropriate.

Read more

Google: A Brace of Media Analyzer Inventions

May 11, 2008

On May 8, 2008, the USPTO, an outstanding organization with a stellar search system, published two Google patent applications. US2008/0107337 is “Methods and Systems for Analyzing Data in Media Material Having Layout” and US2008/0107338 is “Media Material Analysis of Continuing Article Portions”. You can download these here.

Both inventions, for which Google is the assignee, pertain to figuring out what’s important and what’s not on Web pages. Companies that scan hard copy and convert those images to machine-readable ASCII use some tricks but a great deal of brute force to figure out what’s information and what’s advertising or other dross.

The inventions’ systems and methods can also be applied to other types of images converted to a machine-readable form; for example, a PDF that consists of the PDF wrapper and the TIFF image in the wrapper. I know that commercial database publishers are on top of Google’s innovations in content processing, so this is old news to the wizards at ProQuest, Reed Elsevier, and Thomson Reuters. But others in the less rarified atmosphere may find these disclosures interesting. Two patent documents stumbling through the USPTO’s hallowed halls are not an accident of fate.

Stephen Arnold, May 11, 2008

Let’s Assume Microsoft Acquires Powerset

May 10, 2008

I read Dan Farber’s most intriguing post “Is Microsoft Stalking Powerset’s Search Technology?”

I have Saturday chores to do, and I was sweeping the garage with the Microsoft-Powerset tie up buzzing in my head. I dropped the broom and grabbed my notebook for this post. Please, navigate to the News.com site and snag this “Outside the Lines”, May 10, 2008, information.

Mr. Farber writes:

Powerset raises the bar on search based on a preview that I had of the service last month. Powerset differs from the Google in that it extracts and indexes concepts, relationships, and meaning, rather than keywords. It’s able to create connections and pivot in some cases in ways that elude Google’s proficient engine, which favors more of a statistical approach

I saw an interesting demonstration of the Powerset technology at the Bear Stearns (oh, the late, lamentable Bear Stearns) Internet Conference a year or two ago. I also received a link that allowed me to run some test queries on the system. Based on technology from Xerox PARC (Palo Alto Research Center), Powerset delivers some of the functionality I wrote about in my description of Cluuz.com here.

Quite a few companies are processing content, identifying relationships, and trying to move beyond key word search. I’m not going to revisit these points. My broom awaits, and I want to offer these ideas for comment:

  1. Assume Microsoft buys Powerset. Now the giant from Redmond has to figure out what to do with its various Live.com search functions, the Fast Search & Transfer Web search (which you can see here as AllTheWeb.com, branded as a Yahoo service but delivered using Fast Search & Transfer’s system), and the hybrid solution from Powerset (home-grown plus the third-party code from Xerox PARC).
  2. Powerset has undergone a lengthy gestation. I think the service is interesting, but Hakia, which beat Powerset to market, has a niche focus in health care and a growing appetite for enterprise deals. If I had to pick between Hakia and Powerset, I think I would lean toward the Hakia system for two reasons: [a] most, if not all, of the code is the product of the Hakia team, so there’s no pesky third party involved; and [b] the company, despite its hunger for capital, has pushed products out the door, not just demonstrated prototypes.
  3. Microsoft has to find a way to slow Googzilla, and I am not certain that buying search technologies is a way to throw some body punches at the mathematicians in Mountain View, California. For example, Google continues to build out a 21st-century version of the “pre-break up” AT&T infrastructure without much push back from anyone. Even IBM has a bad case of Google love. AT&T and Verizon along with Wall Street see Google as a one-trick pony, albeit a big, big pony. Loading up on search wizards is a good thing. Trying to integrate different search technologies into the existing Microsoft platform may be less good.

Okay, now I have to return to my garage clean up duty. A happy quack from the Beyond Search goose to Mr. Farber for his interesting article and the respite he gave my tired wings.

Stephen Arnold, May 10, 2008
