Data Harmony Update a Suite Release

May 16, 2008

Access Innovations Inc., a data management systems company, is releasing version 3.4 of its Data Harmony software suite, and it sounds like a sweet deal.

The five-component software is used to build and maintain taxonomies, thesauri, and indexing systems. Data Harmony focuses on accuracy, precision, and repeatability in its search results, an emphasis that receives a happy quack from the Arnold IT mascot.

The major updates include more than 30 new features and revised documentation (to keep you in tune). The company says current users will recognize the same look and feel of the program and appreciate “friendlier and more functional features.”

President and Chairman Marjorie M.K. Hlava said the upgrade comes courtesy of user requests and suggestions. It’s refreshing to find a tech company making such an effort to rework a good product and actually making it better. We find Ms. Hlava’s old-fashioned, hands-on, we-care approach most welcome at a time when software vendors do better PR than coding. The full list of the Data Harmony enhancements for 3.4 can be found here.

Jessica Bratcher, May 16, 2008

Content Transformation: A Challenge that Won’t Go Away

May 15, 2008

We live in a world of Web 2.0 and Web 3.0 goodness. At the Where 2.0 conference in Burlingame, California on May 14, 2008, I overheard this snippet of conversation:

We had everything working, but when we imported content, the system crashed. I reinstalled. I checked the config files. It still crashed. I have to open each file, resave it as an RTF, and import them one at a time. Grrrr.

Sound familiar?

I have heard this complaint many times before. In our content-savvy, XML-ized era, moving a source file into a content processing system should be trivial. The content processing system can extract entities. It can metatag. Some can slice, dice, and cook a chicken. But unless the system can intake content and transform it to something that the content processing subsystem understands, the system is dead in the water. Even worse, the text processing system only processes some of the source documents. In certain mission critical applications, kicking out documents is a no-no. Not only is the manual manipulation expensive, it’s time consuming. In those minutes or hours of fiddling, potentially significant data are not available to the analysts. What does missing information cost? Well, it depends on your work situation. In the Wall Street world, investment information can turn a win into a loss in a millisecond. In certain military applications, the information may mean the difference between health and harm.

square circle

Transforming a square into a circle or a circle into a square looks easy. With a triangle and a compass you can create two objects. It’s the intermediate steps that become tricky for an artist or a budding mathematician.

What is file or data transformation? In its simplest form, you have a file in Microsoft Word 2007 format, and you want to “transform” or change the file into a format recognized by another system’s import filter. So, one approach would be to open the file in Word 2007, click on File Save As, select RTF (Rich Text Format), and save the file. You can then allow your search or content processing system to suck the file into the conversion subsystem and turn the RTF into whatever target output format the filter generates. In a more sophisticated form, you take an unstructured document or a database table, and you transform it into some file type that your system can process. A more interesting task is to convert a file into a file with a comparable structure; for instance, take an SGML instance and convert it to HTML. Some search system vendors include filters and transformation tools with their system. Others provide an application programming interface. The idea is that you will write a script to perform whatever conversion you require, handle entities in an appropriate manner, and preserve the information and metadata (if available) throughout the process.
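To make the script-it-yourself idea concrete, here is a minimal Python sketch of one such conversion: HTML in, plain text plus metadata out, with entities decoded along the way. It is an illustration only, not any vendor's filter. The class name `TextAndMetaExtractor`, the helper `html_to_text`, the sample markup, and the decision to treat `<meta>` tags as the metadata worth keeping are all assumptions made for the example.

```python
from html.parser import HTMLParser


class TextAndMetaExtractor(HTMLParser):
    """Collect visible text and <meta name=... content=...> pairs."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before
        # the text reaches handle_data.
        super().__init__(convert_charrefs=True)
        self.text_parts = []
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            if "name" in attr and "content" in attr:
                self.metadata[attr["name"]] = attr["content"]

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


def html_to_text(source):
    """Hypothetical helper: return (plain_text, metadata) for one document."""
    parser = TextAndMetaExtractor()
    parser.feed(source)
    parser.close()
    return " ".join(parser.text_parts), parser.metadata


# Made-up sample document for illustration.
sample = (
    '<html><head><meta name="author" content="J. Smith">'
    "<title>Q1 Report</title></head>"
    "<body><p>Revenue &amp; costs rose.</p></body></html>"
)
text, meta = html_to_text(sample)
print(text)
print(meta)
```

A production filter would also have to cope with malformed markup, character encodings, and files that fail outright. Those failures are the ones that end up in the "open each file and resave it" pile.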

Let’s take a quick look at several transformation challenges and then step back to consider what steps you can follow to minimize these problems. Before jumping into the causes, keep in mind that as much as 30 percent of an information technology department’s budget is consumed by transformation costs. This astounding number surfaced in a presentation given by a Google engineer in 2007. If that number seems high, you can knock it down to a more acceptable 10 or 20 percent. The point is that fiddling with data when moving it from one system and format to another is a common task. Any transformation activity can go off the tracks.

Collective Intelligence Anthology Available

May 14, 2008

The Arnoldit.com mascot admires the new collection of essays edited by Mark Tovey. Collective Intelligence: Creating a Prosperous World at Peace, published by the Earth Intelligence Network in Oakton, Virginia (ISBN: 13: 978-0-97-15661-6-3) contains more than 50 essays by analysts, consultants, and intelligence practitioners. You can obtain a copy from the publisher, Amazon, or your bookseller.


The ArnoldIT mascot completed reading the 600-page book with remarkable alacrity for a duck.

The collection of essays is likely to find many readers among those interested in the social phenomena of networks. Many of the essays, including the one I contributed, talk about information retrieval in our increasingly interconnected world.

This essay will provide a synopsis of my contribution, “Search–Panacea or Play. Can Collective Intelligence Improve Findability”, which I wrote shortly before completing Beyond Search: What to Do When Your Search System Doesn’t Work. My essay begins on page 375.

Social Search

The dominance of Google forces other vendors to look for a way over, under, around, or through its grip on Web search. The vendor landscape now offers search and content processing systems that arguably do a better job of manipulating XML (Extensible Markup Language) content, figuring out who knows whom (the social graph initiative), and extracting the “real” meaning of content (semantic search). There are more than 100 vendors who have technology that offers, if one believes the marketing collateral and conference presentations, a way to squeeze more information from information.

Social search is the name given to an information retrieval system that incorporates one or more of these functions:

  1. Users can suggest useful sites. Examples: Delicious.com and StumbleUpon.com
  2. The system discovers relationships between and among processed documents and links. Examples: Powerset.com and Kartoo Visu
  3. The system analyzes information, extracts entities, and identifies individuals and their relationships. Examples: i2 Ltd (now part of ChoicePoint) and Cluuz.com
  4. The system monitors user behavior and uses the data to guide relevance, spidering, and other system functions. Examples: public Web indexing companies

There are other types of social functions, but these provide sufficient salt and pepper for this information side dish. The reason I say side dish is that social functions are not going to displace the traditional functions on which they are based. Social search has been in the mainstream from the moment i2 Ltd. introduced its workbench product to the intelligence community more than a decade ago. “Social” functions, then, are a recent add-on to the main diet in information retrieval.

Old Statistics and Cheap, Powerful Computers

What’s overlooked in the rush to find a Google “killer” is that the new companies are using some well-known technologies. For example, the inner workings of Autonomy’s “black box” are somewhat dependent on the work of a slightly unusual Englishman, Thomas Bayes. Mr. Bayes left the world a couple of centuries ago, but his math has been a staple in college statistics courses for many years. To deploy Bayesian techniques on a large scale is, therefore, not exactly a secret to the thousands of mathematicians who followed his proofs in pursuit of their baccalaureate.
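For readers who skipped that statistics course, the core of Mr. Bayes's math fits in a few lines. The Python sketch below applies Bayes' rule to a toy relevance question. It is not Autonomy's implementation, which is not public. The function name `posterior` and every number in the example are made up for illustration.

```python
def posterior(prior, likelihood, likelihood_if_not):
    """Bayes' rule: P(H | E) = P(E | H) * P(H) / P(E)."""
    evidence = likelihood * prior + likelihood_if_not * (1.0 - prior)
    return likelihood * prior / evidence


# Made-up numbers: 10 percent of documents are about search (H).
# The word "index" (E) appears in 60 percent of those documents
# and in 5 percent of all the others.
p = posterior(prior=0.10, likelihood=0.60, likelihood_if_not=0.05)
print(round(p, 3))
```

With these invented inputs, seeing the word "index" lifts the estimate that a document is about search from 10 percent to roughly 57 percent. The arithmetic is simple; running it over millions of documents is where the cheap, powerful computers come in.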


Sybase Jumps into the Content Processing Appliance Fray

May 13, 2008

Sybase announced on May 12, 2008, the roll out of its Sybase Analytic Appliance. The hardware is an IBM Power System preconfigured with Sybase IQ, Sybase PowerDesigner, and MicroStrategy 8. The idea is to eliminate the fiddly tasks associated with setting up a data and content processing system, so that a customer gets the benefits of a custom-built enterprise data warehouse in a ready-to-deploy device.

Sybase IQ is the column-oriented Sybase database engine. Column databases typically offer a performance boost over traditional row-oriented relational databases for analytic queries that scan a few columns across many rows. Sybase PowerDesigner is a model-driven tool intended to reduce the pain of building report requirements, models, and related tasks. MicroStrategy 8 is a business intelligence system.
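The difference between row and column storage is easier to see than to describe. The Python sketch below lays out the same made-up sales data both ways. It says nothing about how Sybase IQ is built internally, and plain Python lists cannot demonstrate the disk and cache effects that give column engines their edge; the point is only the shape of the data.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"region": "East", "product": "A", "sales": 120},
    {"region": "West", "product": "B", "sales": 80},
    {"region": "East", "product": "B", "sales": 200},
]

# Column-oriented layout: each field is stored as its own list.
columns = {key: [row[key] for row in rows] for key in rows[0]}


def total_sales_from_rows(records):
    # Every record is visited, including fields the query does not need.
    return sum(record["sales"] for record in records)


def total_sales_from_columns(table):
    # Only the one column the query needs is read.
    return sum(table["sales"])


print(total_sales_from_rows(rows), total_sales_from_columns(columns))
```

Both functions return the same total. In a real engine the column version reads far less data when a table has dozens of fields and a report asks about only two or three of them.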

The cost for the system is based on the data volume. The information I saw quoted an introductory price of $27,000 per terabyte of data. The design of the appliance allows “snap in” scaling. There are three versions of the appliance, and the prices rise as you move from the starter to standard to enterprise version. You can buy the device from Sybase, MicroStrategy, or mLogica (a systems integrator).

Appliances can be criticized for their limited functionality. Sybase has done a good job of providing a bundle that gives the licensee considerable freedom to configure the device and manipulate data. Compared to other industrial-strength appliances, Sybase has an attractive launch price point. You will need to determine your data volume and data change rate in order to determine which appliance version is appropriate for your organization.

Stephen Arnold, May 13, 2008

Intelligenx Discloses Referrals Fuel Rapid Growth

May 12, 2008

In an exclusive interview, Iqbal and Zubair Talib, senior managers of Intelligenx, reveal that referrals have fueled the company’s rapid growth. Intelligenx has a leadership position in directory and “yellow page” search in South Africa, South America, and elsewhere. The company’s profile, despite its US headquarters in suburban Washington, DC, is modest.

The father-son team said:

It seems that our international clients are actively talking about our technology at international conferences. We can always do a better job of marketing, but we put our customers first. Sales occur because people come to us and say, “We want to license your system”… we maintained certain relationships among an elite group of scientists and engineers. We never signed up to give marketing talks at the marketing-oriented venues. Our success comes because certain people understand our technology and recognize that it delivers scale, speed, performance, data management today. Our technology is our marketing.

Unlike search and content processing firms who issue news releases when a Web site signs on to use a well-known search engine or when a vendor announces for the second or third time a reseller deal, Intelligenx keeps innovating and selling.

The company’s system offers almost all of the features associated with the best-known vendors in the search market sector. The Talibs said:

Intelligenx was first to market with technology that offered a true full-text search with what many people call faceted or assisted search results. To achieve this functionality, performance under heavy loads is the prevailing challenge and simply put, our Discovery Engine® solves the problem in what we think is a most elegant fashion. “Facets” or “guided navigation” are not just a “checkbox” on a feature matrix but an underlying central philosophy in our technology, the company, and in the development of our system.
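If “facets” is a new term, the Python sketch below shows the basic mechanics: count the values of selected fields across a result set, then narrow the results when the user clicks one of those values. It is a toy, not Intelligenx's Discovery Engine. The function names `facet_counts` and `refine` and the directory listings are invented for illustration.

```python
from collections import Counter


def facet_counts(results, fields):
    """Count how often each value of each facet field appears in the results."""
    counts = {field: Counter() for field in fields}
    for doc in results:
        for field in fields:
            if field in doc:
                counts[field][doc[field]] += 1
    return counts


def refine(results, field, value):
    """Keep only the results whose facet field matches the chosen value."""
    return [doc for doc in results if doc.get(field) == value]


# Made-up "yellow page" listings returned by a full-text query.
listings = [
    {"name": "Acme Plumbing", "category": "Plumbers", "city": "Pretoria"},
    {"name": "Best Pipes", "category": "Plumbers", "city": "Durban"},
    {"name": "City Electric", "category": "Electricians", "city": "Pretoria"},
]

counts = facet_counts(listings, ["category", "city"])
print(dict(counts["category"]))
print([doc["name"] for doc in refine(listings, "city", "Pretoria")])
```

Counting is trivial for three listings. Doing it for millions of listings on every query, under load, is the performance challenge the Talibs describe.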

You can read about the company’s new stream processing of information, what the Talibs call “cluster flow”. In addition to near real time index updating, additional metadata are generated without adding latency to the system. Another interesting feature of the Intelligenx system is that a licensee can provide its sales people with a real time view of what advertisements are germane to a popular query. The sales person is able to show a prospective advertiser a live report of traffic and the payoff from an advertisement in a specific context.

The company’s technology offers an alternative to the better-known MarkLogic system and the specialist firm, Dieselpoint.

You can read the entire interview on the ArnoldIT.com Web site. The full text of the interview is part of the Search Wizards Speak feature. The exclusive interview is the 13th in this series of first-person accounts of the origin and functionality of important search and content processing systems. Click here to read the interview.

EasyAsk: Business Intelligence for End Users

May 7, 2008

Progress Software purchased EasyAsk in May 2005. Prior to the change in ownership, EasyAsk offered natural language search to a range of government and commercial clients. After the buyout, Progress narrowed the focus of EasyAsk, as I understand the transition, from broad search to eCommerce.

The initial positioning, according to information in my files, was:

The Progress EasyAsk Division provides natural language ad-hoc query solutions that empower non-technical users to quickly find and retrieve critical business information from multiple enterprise data sources. In addition, EasyAsk provides an integrated search, navigation and merchandising platform that optimizes the shopping experience on many of the world’s most successful eCommerce sites.

The value add that EasyAsk offered customers was a higher conversion rate than the conversion rate achieved by competitive software. In 2006, some of the company’s licensees–for example, Redcats USA and Lillian Vernon–reported conversion rates at least 15 percent higher than the rates from competitors’ software. You can try out the EasyAsk system yourself by navigating to Lillian Vernon or Lands’ End. EasyAsk’s commercial customers don’t make their system accessible to outsiders. If you get a chance to access Ceridian’s intranet, you can check out EasyAsk in a behind-the-firewall setting because EasyAsk is now pushing into the business intelligence market.

Now EasyAsk is expanding its scope and asserting that its system is a front end for the data mart or data warehouse. EasyAsk calls its approach “operational business intelligence”. EasyAsk describes its system as being “closer to the ground”; that is, it’s more accessible than traditional BI systems. Users require little or no training to create a custom report. Interaction is via a traditional search box or a point-and-click, assisted navigation interface. If a data warehouse is already built, EasyAsk can deploy its system in a matter of days.

In an interview on the Business Intelligence Network, Dr. Larry Harris, vice president and general manager of the EasyAsk division of Progress Software, said:

The inherent complexity of traditional BI tools prevents organizations from deploying these solutions company wide, and this inhibits individuals who might otherwise be able to act on the insight these tools provide from making better business decisions. EasyAsk for Operational BI provides employees at all levels of the organization with the ability to perform ad hoc business analysis as well as search for existing reports through the familiar search box interface, empowering them to make better business decisions more quickly.

A number of vendors are addressing the knowledge barrier that prevents industrial-strength business intelligence systems from broader use in an organization. If you know how to code and have a degree in statistics, the complexities of building queries and manipulating data cubes are trivial. For the average MBA, building a chopper from a stack of parts would be less difficult.

sample outputs

This graphic shows typical outputs from EasyAsk in response to a user’s natural language query. The user types a query; for example, “Crosstab sales by customer’s state and category” or “What account in the Bay area had the most orders in Q4, 2007?”

That’s the hurdle BI or business intelligence must leap over without tripping. EasyAsk’s trampoline is its NLP or natural language processing capability. The idea is that the user can type a “natural” question. The EasyAsk system “understands” the user’s query, converts it to a form understandable by the system, retrieves the needed information, and displays an answer.
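The “converts it to a form understandable by the system” step can be sketched with a deliberately tiny example. The Python below turns one narrow pattern of question into SQL. It bears no relation to EasyAsk's actual NLP, which must handle vocabulary, ambiguity, and schema mapping that a regular expression cannot. The table name `orders`, the column and measure lists, and the function `question_to_sql` are all hypothetical.

```python
import re

# Hypothetical schema: the only dimensions and measures this toy knows about.
KNOWN_COLUMNS = {"state", "category", "region"}
KNOWN_MEASURES = {"sales": "SUM(sales)", "orders": "COUNT(*)"}

PATTERN = re.compile(r"^\s*(?:total\s+)?(\w+)\s+by\s+(.+?)\s*\??\s*$")


def question_to_sql(question, table="orders"):
    """Translate questions shaped like '<measure> by <col> and <col>' into SQL."""
    match = PATTERN.match(question.lower())
    if not match:
        raise ValueError("question does not fit the supported pattern")
    measure, dims_text = match.groups()
    if measure not in KNOWN_MEASURES:
        raise ValueError(f"unknown measure: {measure}")
    dims = [d.strip() for d in re.split(r",|\band\b", dims_text) if d.strip()]
    unknown = [d for d in dims if d not in KNOWN_COLUMNS]
    if unknown:
        raise ValueError(f"unknown columns: {unknown}")
    cols = ", ".join(dims)
    return f"SELECT {cols}, {KNOWN_MEASURES[measure]} FROM {table} GROUP BY {cols}"


print(question_to_sql("Total sales by state and category"))
```

Anything outside the one supported pattern raises an error, which is exactly the gap a real natural language front end has to close.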


Indexing Dynamic Databased Content

April 20, 2008

In the last week, there’s been considerable discussion of what is now called “deep Web” content. The idea is that some content requires the user to enter a query. The system processes the query and generates a search result from a database. This function is easier to illustrate than explain in words.

Look at the screen shot below. I have navigated to the Southwest Airlines Web page and entered a query for flights from Louisville, Kentucky, to Baltimore, Maryland.

southwest form

Here’s what the system shows me:

southwest result

If you do a search on Google, Live.com, or Yahoo, you won’t see the specific listing of flights shown below:

southwest flight listing


Data Bunny Unmasked

April 16, 2008

Earlier today, a well-paid, somewhat insightful senior executive ripped the fur off a 27-year charade. The keen investigative mind of the anonymous investigator revealed that the data bunny has been Stephen E. Arnold.

The shocking discovery dismayed the two known fans of Mr. Arnold. One chagrined client said:

We had no idea that Mr. Arnold was the data bunny. When he lectured at our company, we did not notice the ears. The information he conveyed was more important than his appearance. I’m not sure what he was wearing during the briefing. But now that the truth is revealed, we will not listen to his analyses if he wears those ears. I hope we don’t confuse substance and appearance again. Proper dress is more important than real information.

When Mr. Arnold learned that his secret was out of the hutch, he blinked his pink eyes and said, according to Donald Anderson, an engineer who has worked with Mr. Arnold for more than 15 years: “Those bunny ears are not funny. Mr. Arnold doesn’t wear them all the time or I just don’t notice them anymore.”

According to Mr. Anderson, Mr. Arnold’s reaction was to stamp his paw and twitch his nose in frustration. Added Mr. Anderson, “I guess he thought the secret was safe. It’s sad. Almost like Lois Lane learning the identity of Superman. It’s sad, but the truth must come out.”

According to another member of the Beyond Search team, Mr. Arnold removed his bunny ears in disgust and slipped on his new Beyond Search rubber goose mask. A photograph of Mr. Arnold in his goose disguise is the basis of this Web log’s logo here.

Beyond Search will publish more details about this startling investigative discovery as they become available. Mr. Arnold’s attorney told Beyond Search, “Although the revelation is shocking, I have advised Mr. Arnold to not reveal the name of the genius who disclosed this 27 year old mystery.”

According to his attorney, Mr. Arnold’s final comment was, “Honk. Honk.”

Stephen Arnold, April 16, 2008

Google Forms: A Data Snout for a Bigger Creature

April 12, 2008

Navigate to Google’s Webmaster Central Blog. Scan the posting written by two wizards whom you probably don’t know, Alon Halevy (senior wizard) and Jayant Madhavan (slightly less senior wizard). Here’s what you will be told in well-chosen, Googley prose:

In the past few months we have been exploring some HTML forms to try to discover new web pages and URLs that we otherwise couldn’t find and index for users who search on Google. Specifically, when we encounter a <FORM> element on a high-quality site, we might choose to do a small number of queries using the form. For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made. If we ascertain that the Web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page.
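The last step in that description, generating URLs that correspond to possible queries, can be sketched in Python. This is not Google's crawler code. It assumes a form that submits with GET, and the URL `https://example.com/flights`, the field names, and the function `candidate_urls` are placeholders invented for the example.

```python
from itertools import product
from urllib.parse import urlencode


def candidate_urls(action, fields, limit=5):
    """Build up to `limit` GET URLs from combinations of form field values.

    `fields` maps each input name to the values offered for it, such as the
    options of a select menu or a few words chosen for a text box.
    """
    names = sorted(fields)
    urls = []
    for combo in product(*(fields[name] for name in names)):
        urls.append(action + "?" + urlencode(dict(zip(names, combo))))
        if len(urls) >= limit:
            break
    return urls


# Hypothetical flight-search form with two select menus.
form_fields = {"from": ["SDF", "BWI"], "to": ["BWI", "SDF"]}
urls = candidate_urls("https://example.com/flights", form_fields, limit=3)
for url in urls:
    print(url)
```

The `limit` argument reflects the “small number of queries” in the Google post: the combinations multiply quickly, so a polite crawler samples rather than exhausts them.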

The idea is that dynamic content does not usually appear in an index. On the public Internet, this type of content is useful to me. For example, when I want to take a Southwest flight, I have to fill in some annoying Southwest forms, fiddle with drop down boxes, and figure out exactly which fare is likely to let me sit in one of the “choice” seats by boarding first. Wouldn’t it be great to be able to run a query on Google, see the flights aggregated, and from that master list jump to the order form? Dynamic content is now becoming more common.

I heard from one wizard at a conference in London that dynamic content is now more than half of the content appearing on the Web. The shift from static to dynamic is, therefore, a fundamental change in the way Web plumbing works, from Web log content management systems to the sprawling craziness of Amazon.com.

pse

A diagram from Dr. Guha’s patent applications with the Context Server shown in relation to the other parts of the PSE. This is a figure from Google Version 2.0: The Calculating Predator, published by Infonortics, Ltd., Tetbury, Glou. in July 2007. Infonortics holds the copyright to this study and its contents.


Absolutes and Electronic Information

April 9, 2008

I find the research for my work fascinating. Periodically I root through some of the PDFs and PowerPoints used in my public talks.

Information in 2001

Today, while consolidating some information from a soon-to-be-retired NetFinity 5500, I came across a presentation I made to the legal information giant, Lexis Nexis, in 2001.

The presentation sure didn’t win me any buddies in this $1 billion a year unit of the Euro-giant Reed Information. Reed, like the Thomson Corporation, maintains a low profile. Most people are unaware of what these two professional publishing companies do for a living, and I am not going to tell you that. You will have to figure it out for yourself.

My talk was given at some golf resort, and I don’t golf. I sat on my tail feather and waited to deliver my talk, which I titled “Information Professionals and In-Phase Services”. The main idea behind the talk was that anyone who used information for a living (lawyers, consultants, intelligence officers, and financial analysts) wanted current information in the context of their work.

The idea of stopping one thing to go ferret out a missing piece of information is growing long in the tooth. No, “long in the tooth” is too gentle even seven years after I wrote this presentation. Stupid, ill-advised, crazy, dumb: these are much more appropriate words. In 2000, it was obvious, based on my research, that savvy users of information wanted information from one screen or dashboard. Furthermore, that information should be [a] comprehensive, [b] current or fresh, and [c] in a form that allowed it to be cut-and-pasted or recycled without annoying manual reformatting.

I used this quote from Emily Dickinson to catch the crowd’s attention: “The truth must dazzle gradually / Or every man be blind…” No one knew what the heck I was talking about. To help the audience along, I used this chart from Forbes Magazine, October 2, 2000:

absolutes

The point of this study is that humans–more than two thirds of them in 2000–want fixed points in their lives. The notions of change, flux, transformation made people uncomfortable. The chart did little to win my audience’s confidence in my talk because I then told the group, “Absolutes are rarely found when we talk about electronic information.”
