DBSight Search: Worth a Closer Look

April 30, 2008

Version 1.5.4 of DBSight is now available from DBSight.net. The system delivers full-text search on any relational database. It features a built-in database crawler that follows user-defined SQL, incremental indexing, user-controllable result ranking, an option for highlighting search terms in a result, and the display of results as categorized result counts. If you want to avoid writing Java code, a graphical interface is provided for almost all configuration settings. In this release, there is a UI for all operations, so no Java coding is necessary. The current version allows synchronization of deleted or updated records. The system also supports the Eden space strategy, which is one way to make Lucene more tractable.
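To make the crawler idea concrete, here is a minimal sketch of incremental indexing driven by user-defined SQL. It is not DBSight's implementation, which is built on Lucene. The table layout (docs, updated_at, deleted), the TinyIndex class, and the demo function are all hypothetical names I made up for illustration, and the toy inverted index stands in for a real search engine.

```python
import sqlite3
from collections import defaultdict

# Hypothetical user-defined SQL: fetch only rows changed since the last crawl.
CRAWL_SQL = (
    "SELECT id, body, deleted, updated_at FROM docs "
    "WHERE updated_at > ? ORDER BY updated_at"
)


class TinyIndex:
    """Toy in-memory inverted index; a stand-in for a real engine such as Lucene."""

    def __init__(self, crawl_sql=CRAWL_SQL):
        self.crawl_sql = crawl_sql
        self.postings = defaultdict(set)  # term -> set of record ids
        self.doc_terms = {}               # record id -> terms currently indexed
        self.last_crawl = 0               # high-water mark of updated_at

    def _remove(self, doc_id):
        # Drop every posting that points at this record.
        for term in self.doc_terms.pop(doc_id, set()):
            self.postings[term].discard(doc_id)

    def crawl(self, conn):
        """Apply inserts, updates, and deletions made since the previous crawl."""
        rows = conn.execute(self.crawl_sql, (self.last_crawl,)).fetchall()
        for doc_id, body, deleted, updated_at in rows:
            self._remove(doc_id)          # clear stale postings first
            if not deleted:
                terms = set(body.lower().split())
                self.doc_terms[doc_id] = terms
                for term in terms:
                    self.postings[term].add(doc_id)
            self.last_crawl = max(self.last_crawl, updated_at)
        return len(rows)

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT, "
        "deleted INTEGER, updated_at INTEGER)"
    )
    conn.executemany(
        "INSERT INTO docs VALUES (?, ?, ?, ?)",
        [(1, "full text search", 0, 1), (2, "database crawler", 0, 2)],
    )
    index = TinyIndex()
    index.crawl(conn)  # initial full crawl
    # Later: record 1 is flagged as deleted and record 2 is rewritten.
    conn.execute("UPDATE docs SET deleted = 1, updated_at = 3 WHERE id = 1")
    conn.execute(
        "UPDATE docs SET body = 'incremental database search', "
        "updated_at = 4 WHERE id = 2"
    )
    changed = index.crawl(conn)  # touches only the two changed rows
    return changed, index.search("search"), index.search("crawler")
```

The point of the high-water mark is that the second crawl reads only the changed rows, which is what makes deleted and updated records cheap to synchronize. A production system would also need to handle hard deletes, which leave no row behind to flag.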

One of the major changes is the reduction of the memory footprint for facet search for fields with multiple categories. This version refreshes the index if only a deletion occurs. Support has been added to permit searching of connected words; that is, words where spaces do not appear between the words. Spaces are now permitted in range queries. A text and XML MIME type has been added to XML search result queries. In this version, the index appears in Memory mode on the dashboard. A feature summary is available on the DBSight Web site here. Complete pricing, free from tricks and gotchas, is here. The J2EE search platform is available free for non-commercial use. A free download is here.

DBSight Inc was started in 2004. The product grew from the founders’ efforts to make search fast and scalable. You can contact the company at dbsight at gmail dot com. I strongly recommend a test drive.

Stephen Arnold, April 30, 2008

Tibco Adds Silverlight to Its Arsenal

April 30, 2008

Tibco (the information bus company) is not a PC or Mac user’s touchstone. Nevertheless, the company is a force in the enterprise computing sector. At its user conference in San Francisco, Tibco announced that it would support Microsoft Silverlight, according to Infoworld.

Microsoft Silverlight is a cross-browser, cross-platform, and cross-device plug-in for delivering the next generation of .NET based media experiences and rich interactive applications for the Web. You can read more about this technology here.

The Tibco technology is hugely popular among banks and financial services firms. Tibco’s “information bus” concept has influenced a number of other enterprise software companies, including search firms Autonomy and Fast Search & Transfer (now part of the Microsoft combine). The ground-breaking idea originated with Vivek Ranadivé, Tibco’s founder, in the 1980s.

The company’s decision is likely to have far-reaching consequences in the enterprise market. Rich media applications have been slow to make their appearance in some blue-chip corporations. With this move, Microsoft gains an important foothold in the “enterprise bus” that underpins numerous message and information functions. Furthermore, as Microsoft strives to expand its defensive wall against the slow seepage of Google into the enterprise, Tibco’s support is significant. If Microsoft continues to embrace Linux, the deal may also hearten John Lervik, the new head of Microsoft’s enterprise search initiative.

Stephen Arnold, April 29, 2008

Boston Search Engine Meeting, Day Two

April 30, 2008

The most important news on Day Two of Infonortics’ Boston Search Engine Meeting was the announcement of the “best paper awards” for 2008. The Evvie, named in honor of Ev Brenner, one of the leaders in online information systems and functions, was established after Mr. Brenner’s death in 2006. Mr. Brenner had served on the program committee for the Boston Search Engine Meeting since its inception almost 20 years ago. He had two characteristics that made his participation a signature feature of each year’s program. He was willing to tell a speaker or paper author to “add more content”, and after a presentation he would ask the presenter one or more penetrating questions that helped make a complex subject clearer.

Sponsored by ArnoldIT.com, the Evvie is an attempt to keep Mr. Brenner’s push for excellence squarely in the minds of the speakers and the conference attendees.

This year’s winners are:

  • Best paper: Charles Clarke, University of Waterloo. His paper “XML Retrieval: Problems and Potential” explained that XML (Extensible Markup Language) is no panacea. Properly used, XML systems create new ways to make search more useful to users. He received a cash prize and an engraved Evvie award.
  • Runner up: Richard Brath, Oculus, for his paper “Search, Sense-Making and Visual User Interfaces”. In this paper, Mr. Brath demonstrated that user interface becomes as important as the underlying content processing functions for search. He received an engraved Evvie award.

[Photo: Evvie 2008]

Left: Richard Brath (Oculus), center: Stephen E. Arnold (ArnoldIT.com), right: Charles Clarke (University of Waterloo).

This year’s judges were Dr. Liz Liddy, Syracuse University, Dr. David Evans, Just Systems (Tokyo), and Sue Feldman, IDC Content Technologies Group. Dr. Liddy heads the Center for Natural Language Processing. Dr. Evans, founder of Clairvoyance, is one of the foremost authorities on search. Ms. Feldman is one of the leading analysts in the search, content processing, and information access market sector. Congratulations to this year’s Evvie winners.


Boston Search Engine Meeting, Day One

April 29, 2008

The Infonortics’ meeting attracts technologists and senior managers involved in search, content processing and information access. For the full program and an overview of the topics, navigate to http://www.infonortics.com.

Summaries of the talks and versions of the PowerPoints will be available on the Infonortics’ Web site on or before May 2, 2008. I will post a news item when I have the specific link.

Background

This conference draws more PhDs per square foot than a Harvard coffee shop. Most of the presentations were delightful if you enjoy equations with your latte. In the last two years, talks about key word search have yielded to discussions about advanced text manipulation methods. What’s unique about this program is that the invited presenters talk with the same enthusiasm an undergraduate in math feels when she has been accepted into MIT’s PhD physics program.

The talks are often spiced with real world descriptions of products that anyone can use. A highlight was the ISYS Search Software presentation, which combined useful tips with a system that worked, no PhD required.

Several other observations are warranted:

  • Key word search and long lists of results are no longer enough. To be useful, a system has to provide suggestions, names of people, categories, and relevance thermometers.
  • Users have an increasing appetite for answers combined with a discovery function.
  • Systems must be usable by the people who need the system to perform a task or answer a question.

Chatter at the Breaks

Chatter at the breaks was enthusiastic. In the conversations to which I was party on Monday, three topics seemed to attract some attention.

First, the acquisition of Fast Search by Microsoft was the subject of considerable speculation. Comments about the reorganization of Microsoft search under the guidance of John Lervik, one of Fast Search’s founders, sparked this comment from one attendee: “Organizing search at Microsoft is going to be a very tough job.” One person in this informal group said, “I think some if not all of the coordination may be done from Fast Search’s offices in Massachusetts and Norway.” The rejoinder offered by one individual was, “That’s going to be really difficult.”

Second, the search leader Autonomy’s share price concerned one group of attendees. The question was related to the decline in Autonomy’s share price on the heels of a strong quarterly report. No one had any specific information, but I was asked about the distribution of Autonomy’s revenue; that is, how much from core search and how much from Autonomy’s high profile units. My analysis, based on a quick reading of the quarterly report press announcements, suggests that Autonomy has some strong growth from the Zantaz unit and in other sectors such as rich media. Autonomy search plays a supporting role in these fast-growth sectors. On that basis, Autonomy may be entering a phase where the bulk of its revenue may come from system sales where search is an inclusion, not the super charger.

Finally, there was much discussion about the need to move beyond key word search. Whether the adjustment is more sophistication “under the hood” with the user seeing suggestions or an interface solution with a range of graphic elements to provide a view of the information space, the people talking about interfaces underscored the need to [a] keep the interface simple and [b] make the information accessible. One attendee asked at the noon break, “Does anyone know if visualization can be converted to a return on investment?” No one had a case at hand although there was some anecdotal evidence about the payoffs from visualization.

Wrap Up

The second day’s speakers are now on the stage. Stay tuned for an update.

Stephen Arnold, April 29, 2008

LTU Releases LTU-Finder 3.0

April 28, 2008

One of the leaders in image recognition and analysis is a decade-old company, LTU Technologies. The firm released LTU-Finder v. 3.0, which it described as “a breakthrough tool for image and video recognition in the field of computer forensics”. However, LTU’s system suits a wide range of enterprise image and video applications in eDiscovery, copyright, and security.

Version 3.0 of LTU-Finder includes image and video content recognition technology that can increase the speed and scope of forensic and legal investigations as well as e-discovery. The new version includes enhanced image and video recognition capabilities and introduces text data identification tools that further automate large-scale file searches in the legal, e-discovery and law enforcement fields. You can use LTU’s products to find copyright infringement and digital fingerprints of images.

LTU-Finder also incorporates automatic document identification tools that separate relevant scanned documents, like e-faxes, from other content such as personal photos or Web graphics. Automating this process eliminates the need for a subject matter expert to click through image files one by one. The system reduces the amount of data that needs to be processed and stored during the e-discovery process.

You can get more information about the company’s image search and recognition technologies at LTU’s Web site here.

Stephen Arnold, April 29, 2008

Kroll’s Ontrack Enhances Other Vendors’ Search System

April 28, 2008

David Chaplin, founder of Engenium, runs Kroll’s search and content processing business. Kroll acquired Engenium in 2006 and has moved quickly to integrate the firm’s content processing technologies into its products and services. Kroll is a unit of Marsh & McLennan Companies, a diversified firm with interests ranging from insurance to risk assessment and professional services.

Mr. Chaplin told Beyond Search in an exclusive interview that the Kroll solution “can enhance search results from any search engine. If the desired search result is not on page one of the results we will bring all the results onto page one and provide a well organized and labeled folder structure to navigate to the best result.” He added, “We have two basic products: the query based conceptual keyword and parametric search and non-query based automatic information clustering.”

He also said:

I don’t believe that the volatility [in search] will decrease. I do believe there are not very many big moves to be made right now. I believe there are some big guys out there who want to make a move in this space. An underlying factor is that I do not believe corporate America believes that they are getting what they need from search and they are finding an increasing number of employees go to the Internet first before even checking their internal systems.

You can read the full interview on the ArnoldIT.com Web site. The interview is part of the Search Wizards Speak series. The interview is the 11th in this series.

Stephen Arnold, April 28, 2008

IBM’s Slow Moving Cloud

April 28, 2008

In late 2007, IBM announced its “blue cloud”. If you don’t recall the announcement, you can read it here.

The key points that jumped out at me last year when I learned about this initiative are:

  • The start of a shift from on-premises computing to cloud computing and Salesforce.com-type solutions for some of the IBM enterprise, government, and not for profit clients
  • A series of cloud computing offerings that include hardware, services, and systems
  • Distributed, globally accessible fabric of resources targeted for existing workloads and emerging massively scalable, data intensive workloads.

Last week, IBM revealed an additional blue cloud component. The firm’s iDataPlex hardware is designed for cloud computing, specifically for distributed data centers. Engineered to reduce power consumption and air conditioning load, the servers put the IBM “seal of approval” on network-centric or cloud computing solutions for business and large organizations. The zippy hardware can be managed with IBM’s Tivoli-based Blue Cloud software, which helps allay some organizations’ fears about “out in the cloud” solutions.

Infoworld’s story “Battle Brewing in the Cloud”, which you can read here, does a good job of summarizing similar initiatives from Amazon, Google, and EMC.

IBM’s push into cloud computing is interesting. The company says, “Cloud computing is an emerging approach to shared infrastructure in which large pools of systems are linked together to provide IT services…Blue Cloud will particularly focus on the breakthroughs required in IT management simplification to ensure security, privacy, reliability, as well as high utilization and efficiency.”

My take on IBM’s November 2007 announcement and last week’s iDataPlex and management software availability is that cloud computing is the next application platform. IBM’s verbiage says with authority what Webby companies have been arguing for several years. Large companies often pay little attention to innovations from upstarts like Amazon and Google. Industrial giants do notice when IBM gets behind an information technology trend.

Here’s the kicker. I don’t think cloud computing is going to be an overnight sensation. Large organizations are by their nature slow moving. IBM’s announcement certifies that cloud computing is a viable enterprise systems option.

The next IT struggle for dominance, mind share, and revenues is officially underway. It is just moving slowly, and for some organizations that pace won’t permit the behemoths to adapt quickly enough to avoid some consequences of the coming shift in enterprise computing.

Stephen Arnold, April 28, 2008

“Black Holes” in Enterprise Information

April 27, 2008

Yesterday–trapped once again in the nausea-inducing SEATAC Airport–I talked on the telephone with a caller concerned about problem areas in enterprise information. The issue, as I understood her comments, had to do with launching a search and retrieval system’s crawler or content acquisition “bot” and then running queries to see what was on publicly-accessible folders and servers within the organization.

My comment to her was, “You may want to perform a content inventory, do some testing of access controls, and do some very narrowly focused tests.”

Her response was one I hear frequently from 30-somethings, children of the approval culture: “Why?” These wonderful people have grown up with gold stars on lousy book reports, received “You’re a Champ” T shirts for miserable under-10 soccer efforts, and kisses upon graduating from university with a gentle person’s “C”.

I did what I could to flash the yellow caution signal, but this call, like so many others I get, was a “tell me what I want to hear” inquiry, not a “real world” information buzz. The caller wanted me to say, “What a great idea!” Sorry. I’m the wrong guy for that cheerleading.

A Partial List of Black Holes

Here is my preliminary list of enterprise information “black holes”. A black hole is not well understood. My weak intellect thinks that a black hole is a giant whirlpool with radiation, crushing gravity, and the destruction of chubby old atoms such as the ones that make me the doddering fool I am. To wit:

  • School, religious, bake sale, and Girl Scout information in email and any other file formats, including Excel, image files, and applications that send email blasts
  • MP3 and other rich media files that are copyrighted, pornographic, or in some way offensive to management, colleagues, or attorneys. This includes vacation photos of overweight relatives and spouses wearing funny hats.
  • Information in email or other formats pertaining to employee compensation, health, job performance, or behavior. Think discovery. Think deposition. Think trial.
  • Any data that is at variance with other information vetted and filed at a regulatory body; for example, marked up copies of departmental budgets, laboratory reports, clinical trial data, agreements between a vendor and a manager, and similar “working draft” information. Think how you and your colleagues would look on the six o’clock news in orange jump suits.
  • Software installed or copied to a hard drive that is hacked, borrowed, or acquired from an online source not known to be free from spyware, backdoors, keyloggers, and trojans. Think big, big fine.
  • Information about defeating firewall filters or other security workarounds needed to allow access to Web sites, information, or services that are not permitted by the firm’s security officer or by agreements between the firm and a law enforcement or intelligence entity. Think losing lucrative pork barrel goodies.
  • Information germane to a legal action that has not been provided to the firm’s legal counsel, regardless of the information holder’s role in the company or that person’s knowledge of the legal matter to which the information is pertinent. Think of hours with attorneys. Uggh. This makes me queasy typing the words.
  • Email threads discussing behaviors of employees and their dealings with vendors, co workers, business partners, and consultants in which non-work related topics are discussed. Think Tyco, Enron, and other business school case studies about corporate ethics.

Do you have examples of other “black holes”?

In the run up to the release of the index of the US Federal government’s public facing Web sites, I recall sitting in a meeting to discuss the test queries we were running in the summer of year 2000. My own queries surfaced some interesting information. I stumbled upon a document that when opened in an editor carried a clear statement that the document was not to be made public. The document was removed from the crawl and its index pointer deleted. My recollection is hazy, but the test queries surfaced a great deal of information that I certainly did not expect to be sitting on a publicly-accessible server.

To greater and lesser degrees, I’ve learned that test crawls that suck information into a search system’s crawler almost always yield some excitement. The young, hip, enthusiastic search engine managers don’t realize the potential downside of indiscriminate “test indexing”.

Tips on How to Avoid a Black Hole

Here are my suggestions for avoiding self-destruction in an information “black hole”:

  1. Do a thorough content inventory, define a narrow test crawl, and expand the crawl on a schedule that allows time to run test queries and to remove or change the access flag on problematic information.
  2. Coordinate with your firm’s security and legal professionals. If you don’t have these types of employees sitting in their offices eager to help you, hire a consultant to work with you.
  3. Run exhaustive test queries *before* you make the search system available to the users. An alpha test followed by a slightly more expansive beta test is a useful pre release tactic.
  4. Inform your co workers about the indexing or “indexation” process so they have time to expunge the grade school’s bake sale promotional literature, budget, and email list from the folder the spider will visit.
  5. To avoid surprises, inform management that if problematic information turns up, the search system may be offline while the problem is rectified.
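The first tip, the content inventory, is easy to rough out before any crawler runs. The sketch below walks a folder tree, counts files by extension, and flags names that match a few rules. The rule lists (RISKY_EXTENSIONS, RISKY_WORDS) and the inventory function are my own hypothetical examples, not a vetted policy; your security and legal people should supply the real ones. It looks only at file names, so it will not catch problems buried inside a document.

```python
import os
from collections import Counter

# Hypothetical rules for illustration; replace with your firm's own policy.
RISKY_EXTENSIONS = {".mp3", ".avi", ".pst", ".exe"}
RISKY_WORDS = ("salary", "confidential", "draft", "bake sale")


def inventory(root):
    """Count files by extension and list paths that match a risk rule."""
    counts = Counter()
    flagged = []
    for folder, _dirs, files in os.walk(root):
        for name in files:
            ext = os.path.splitext(name)[1].lower()
            counts[ext or "(none)"] += 1
            lowered = name.lower()
            if ext in RISKY_EXTENSIONS or any(w in lowered for w in RISKY_WORDS):
                flagged.append(os.path.join(folder, name))
    return counts, sorted(flagged)
```

Calling inventory on the share you plan to crawl returns the extension counts and the flagged paths. Review the flagged list with the folder owners, then point the test crawl only at what is left.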

I will let you know if she calls me back.

Stephen Arnold, April 27, 2008

Not Your Microsoft Social: It’s Enterprise “The Social”

April 27, 2008

Internet News reported that “the social”–an umbrella noun that includes blogs, wikis, podcasting, mashups, RSS, social networking and widgets–will generate either $707 million or $2.7 billion by 2011.

To be fair to Kenneth Corbin, the Internet News journalist, his story relies on data from two sharp-pencil outfits: Forrester and the Gartner Group. Please, read the story yourself in order to imbibe the magnitude of “the social” in enterprise software.

The key point for me appears in the final paragraph of the story, dated April 22, 2008: “Admitting that Web 2.0 features are still in their infancy, the Forrester researchers noted that the technologies are moving steadily toward the mainstream, as older users come to understand and embrace them, and major media firms ink deals with Web 2.0 vendors to soup up their online properties with more interactive features.”

Like most emerging trends, the excitement for Facebook-like and wiki-type functions will have to work within the regulatory net tossed over certain commercial enterprises such as the ever-innovative US financial sector, the slippery pharmaceutical companies with their interesting approach to clinical trials and compartmentalized data, and the reliable health care organizations.

Organizations need to move “beyond search” with regard to information. But what happens if that shift takes the enterprise into unexplored territory?

The role of the Internet as a method of communication is a tired subject. Uncertain and litigation-averse senior managers in commercial firms have to trade off Web 2.0 payoffs against the very real possibility that a misstep can sink their careers and possibly their company.

Stephen Arnold, April 25, 2008

Microsoft: Possible Server Security Woes for Search

April 26, 2008

The Washington Post’s Brian Krebs asserted on the newspaper’s security Web log that “hundreds of thousands of Microsoft Web Servers are hacked”. Security is a slippery fish, and the reports from security vendors leave me looking for additional corroboration. You can judge for yourself by reading the essay.

The attack, he writes, “is coming in waves, with the bad guys swapping in new malicious downloader sites every few days.”

You can keep up with the more highly ranked comments on this topic by clicking this link to run a query on Google News.

The flaw, according to VNUnet.com’s take on the problem, exists within the handling of code for IIS (Internet Information Services) and SQL Server, two widely used Microsoft products. Many enterprise search systems running on a Microsoft platform will have these two servers as well.

The vulnerability exists when IIS connects to the Internet. If your enterprise search system makes use of IIS, you may want to look for this in your Web pages: <script src=http://www.nihaorr1.com/1.js> as reported by Internet News here.
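If you want to automate that check, here is a rough sketch that scans HTML for script tags whose source points at the reported host. The function names (find_suspect_scripts, scan_tree), the host list, and the file suffixes are my own assumptions, and since the attackers reportedly swap in new downloader sites every few days, the host list will need updating. A hit means "look closer", not proof of compromise, and a clean result proves nothing.

```python
import re
from pathlib import Path

# Assumed watch list: only the host named in the reports quoted above.
SUSPECT_HOSTS = ("nihaorr1.com",)

# Captures the src value of a script tag, quoted or unquoted.
SCRIPT_SRC = re.compile(
    r"<script\b[^>]*\bsrc\s*=\s*[\"']?([^\"'\s>]+)", re.IGNORECASE
)


def find_suspect_scripts(html, hosts=SUSPECT_HOSTS):
    """Return script src values in the HTML that mention a watched host."""
    hits = []
    for match in SCRIPT_SRC.finditer(html):
        src = match.group(1)
        if any(host in src.lower() for host in hosts):
            hits.append(src)
    return hits


def scan_tree(root, suffixes=(".htm", ".html", ".asp", ".aspx")):
    """Scan page files under root; return {path: [suspect src, ...]}."""
    report = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in suffixes:
            hits = find_suspect_scripts(path.read_text(errors="ignore"))
            if hits:
                report[str(path)] = hits
    return report
```

If your pages are generated from a database, run find_suspect_scripts against the HTML the server actually returns to a browser instead of the files on disk, because the files themselves may look clean.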

With a big push in the works for SharePoint, which meshes with IIS and SQL Server, the Fast Search & Transfer team will have to ramp up quickly to hit the ground running with regard to search in a Windows world.

Stephen Arnold, April 26, 2008
