Open Source and Text Mining on a Collision Course
April 18, 2008
Open source intelligence, or OSI, is moving from the canvas barriers at US government facilities into the business mainstream. Open source is generally understood to mean “accessible via the Internet”, but it is a relatively low-profile discipline. Many of the experts in this type of research avoid the spotlight. The “father” of open source intelligence is Robert Steele.
JasperSoft–a company that describes itself as “the market leader in open source business intelligence”–announced a partnership with Microsoft to ensure that JasperSoft’s business intelligence (BI) solutions work well on Windows platforms, according to Internet News.
The ability to “suck” data into a widely-deployed system opens new opportunities for analysts and competitive intelligence practitioners. The JasperSoft
A key new initiative is JasperSoft’s Connect product. An analyst can use Windows products as a front end to JasperSoft’s data analysis server. The value of this approach is that the cost of open source analysis can be sharply reduced.
Consolidation in the for-fee content sector may put a restrictor plate on some open source initiatives. For example, the merger of Thomson with Reuters–a deal valued at $17 billion–means that data accessible via the Internet will almost certainly be placed under tighter access controls. If this occurs, commercial information and data that find their way to publicly accessible Web sites will require a fee to access. Improper use of information owned by such multinational professional publishing giants as Thomson, Reed Elsevier, Wolters Kluwer, and Springer-Verlag will lead to restrictions, fees, and possibly legal action.
As a result, wider use of open source intelligence with lower-cost tools such as those from JasperSoft will trigger greater restrictions on high-value information. Awareness of open source and consolidation in the professional publishing and news sectors may collide with unknown consequences.
Open source has disrupted traditional military intelligence methods, and it may pose a similar challenge to professional publishing companies. Open source has increased the pressure on commercial software companies, which have not been able to quell interest in community-supported products like Apache, a widely used Web server. Publishing companies, already threatened by sharply decreasing revenues and rising costs, are likely to respond more quickly and more aggressively than software vendors.
Stephen Arnold, April 18, 2008
The Text Mining Can of Worms
April 17, 2008
In October 2007, I participated in a half-day “text mining” tutorial held after the International Chemical Information Conference in that most appealing Spanish city, Barcelona. The attendees–I think there were about 24 people mostly from European companies–wanted to learn about advanced text mining systems–in theory. Reality often intrudes, however.
Fresh from the primary research for my Beyond Search: What to Do When Your Enterprise Search System Won’t Work, I had a significant amount of information about 50 vendors’ text mining systems and their technologies. The structure of the Barcelona tutorial was straightforward. After defining text mining and differentiating it from the better-known data mining, I walked through some case examples of text mining successes. The second part of the tutorial focused on the business issues of text mining. The end point of this segment tackled three key challenges to which I will return in a moment. The third segment of the tutorial took a look at what Google was disclosing through its engineering papers and speeches about its approach to text mining. This is a very interesting block of information, and I may at some point in the future describe a little of our findings. The tutorial wrap-up was a series of observations with time for the attendees to ask additional questions and share some of their experiences.
Cognition Upgrades Its Meaning-Based Search
April 17, 2008
Culver City, California’s Cognition Technologies, Inc. has released Semantic NLP. “NLP” is shorthand for “natural language processing”. The idea is that a search system understands a user’s query. No Boolean statements or formal search syntax is required to obtain an answer or a result list from the system.
The company told Beyond Search:
[Our] engineers have crafted a technology which is “the next evolution” in search. That remains to be proven, but Cognition, like a number of rich text processing companies, has jumped into advanced search with verve. However, compared to other search newcomers, Cognition has uniquely solved one of the biggest hurdles toward increased precision and recall and understanding the meaning of user queries and the searched content. Through this understanding it is able to resolve both the ambiguity and synonymy of the English language.
The company says that Semantic NLP understands words and phrases in enterprise and Web content. A demonstration of some of the Cognition system’s functions is available here. Registration is required.
The company is the subject of an in-depth profile in Beyond Search: What to Do When Your Enterprise Search System Won’t Work, published in April 2008 by the Gilbane Group. This study identified 24 vendors whose technology illustrates next-generation search and content processing features.
Additional information about the firm is available at its Web site or by writing learnmore at cognition.com.
Stephen Arnold, April 17, 2008
Leximancer: Divining Meaning from Words
April 17, 2008
In Australia last year, I met several information technology professionals who mentioned the Leximancer text and content processing system to me. Leximancer now has offices in three cities: Brisbane, Australia; London, England; and Boulder, Colorado. I updated my Leximancer files and made a mental note that the company had some nifty visualization technology. Based on comments made to me, analysts in police and intelligence as well as the academic community find the product of significant value. I heard that the company has more than 200 licensees and is growing at a brisk pace.
At the eContent conference in Phoenix, Arizona, one of the attendees was grilling me about text analytics. As the grill-ee, I was reluctant to provide too much information to the grill-er. Most of what the young, confident MBA wanted is in my new study Beyond Search: What to Do When Your Enterprise Search System Won’t Work. Furthermore, she was convinced after her text mining industry research, which included healthy bites of blue-chip consultancies’ pontifications, that no firm combined text analysis, discovery, and useful point-and-click visualizations of the topic and concept space of a collection.
Sigh. Like the Fortune 500 country clubbers, vendors are so darn inadequate. Maybe? Sometimes it’s the Fortune 500 Ivy leaguers who are missing a card or two in their deck, not the vendors. Just a thought.
This short essay is a partial response to her assertion, which was–by the way–100 percent incorrect. For some reason, her research overlooked high-profile tools from dozens of vendors as well as point specialists. On the flight back last night, I recalled the Leximancer system, and I thought I would provide some color about that firm’s approach for two reasons: [a] I find it useful to look at companies with interesting search-related technologies and [b] I want to underscore that her assertion and her research were woefully inadequate.
What’s a Leximancer?
Leximancer is text mining software that you can use to analyze the content of collections of textual documents. The system then displays the extracted information in a browser. Leximancer’s approach to visualization is to use a “concept map”. The idea is that a user can glance at the map, get an overview, and then explore the relationships that Leximancer discovers within the text.
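For readers who want a feel for what sits behind a concept map, here is a minimal sketch in Python. It is not Leximancer’s method, which is not described here. It assumes a deliberately crude stand-in: “concepts” are simply the terms that appear in the most sentences, and two concepts are linked when they appear in the same sentence. The function name `concept_map`, the stopword list, and the sample text are all hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical, deliberately tiny stopword list for the illustration.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on", "with"}


def concept_map(documents, top_n=5):
    """Return (concepts, links) for a list of document strings.

    concepts: the top_n terms ranked by the number of sentences they occur in.
    links: Counter mapping an alphabetically sorted concept pair to the
           number of sentences in which both concepts occur.
    """
    sentences = []
    for doc in documents:
        for sentence in re.split(r"[.!?]+", doc):
            # Keep each distinct non-stopword once per sentence.
            words = {w for w in re.findall(r"[a-z]+", sentence.lower())
                     if w not in STOPWORDS}
            if words:
                sentences.append(words)

    # Sentence frequency: in how many sentences does each term appear?
    term_counts = Counter(w for words in sentences for w in words)
    concepts = [term for term, _ in term_counts.most_common(top_n)]

    # Count sentence-level co-occurrence between the selected concepts.
    links = Counter()
    for words in sentences:
        present = sorted(w for w in words if w in concepts)
        for pair in combinations(present, 2):
            links[pair] += 1
    return concepts, links


docs = ["Police analysts review fraud reports. Fraud reports mention police. "
        "Analysts map fraud."]
concepts, links = concept_map(docs, top_n=4)
for (left, right), weight in links.most_common():
    print(f"{left} -- {right}: {weight}")
```

The link counts are the raw material for a map: a visualization layer could draw each concept as a node and each link as an edge whose thickness reflects its count. A commercial system would presumably do far more work to decide what counts as a concept.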
Oracle SES May Not Be
April 17, 2008
Security Pro News pointed to a notice in the Oracle Security Web log on April 16, 2008. The database giant has released an update that addresses more than a dozen security issues. You can read the details on the Oracle Web site.
If you are an Oracle Secure Enterprise Search licensee with an Oracle database back end, you will want to make certain you have the appropriate updates. Details of the patch and download links are available from the Oracle Technology Network.
Beyond Search recommends that you navigate to the Oracle site. Make sure your Secure Enterprise Search installation is “secure”. The key differentiator for this search and content processing system is its security engineering. As I noted in the first three editions of the Enterprise Search Report, search systems present a number of access and control issues. Oracle was one of the first companies to pivot a value proposition on security. Oracle’s approach requires some additional administrative effort and, in some cases, the licensing of additional Oracle components.
In the 1980s, Verity implemented a clever ticket system. Although largely forgotten, the Verity approach offered some technical advantages over the Oracle approach. But compared to the security measures taken by some vendors, Oracle’s approach is solid. Glitches are annoying. Update today.
Stephen Arnold, April 17, 2008
Data Bunny Unmasked
April 16, 2008
Earlier today, a well-paid, somewhat insightful senior executive ripped the fur off a 27-year charade. The keen investigative mind of the anonymous investigator revealed that the data bunny has been Stephen E. Arnold.
The shocking discovery dismayed the two known fans of Mr. Arnold. One chagrined client said:
We had no idea that Mr. Arnold was the data bunny. When he lectured at our company, we did not notice the ears. The information he conveyed was more important than his appearance. I’m not sure what he was wearing during the briefing. But now that the truth is revealed, we will not listen to his analyses if he wears those ears. I hope we don’t confuse substance and appearance again. Proper dress is more important than real information.
When Mr. Arnold learned that his secret was out of the hutch, he blinked his pink eyes and said, according to Donald Anderson, an engineer who has worked with Mr. Arnold for more than 15 years: “Those bunny ears are not funny. Mr. Arnold doesn’t wear them all the time or I just don’t notice them anymore.”
According to Mr. Anderson, Mr. Arnold’s reaction was to stamp his paw and twitch his nose in frustration. Added Mr. Anderson, “I guess he thought the secret was safe. It’s sad. Almost like Lois Lane learning the identity of Superman. It’s sad, but the truth must come out.”
According to another member of the Beyond Search team, Mr. Arnold removed his bunny ears in disgust and slipped on his new Beyond Search rubber goose mask. A photograph of Mr. Arnold in his goose disguise is the basis of this Web log’s logo here.
Beyond Search will publish more details about this startling investigative discovery as they become available. Mr. Arnold’s attorney told Beyond Search, “Although the revelation is shocking, I have advised Mr. Arnold to not reveal the name of the genius who disclosed this 27-year-old mystery.”
According to his attorney, Mr. Arnold’s final comment was, “Honk. Honk.”
Stephen Arnold, April 16, 2008
Pfizer Taps Linguamatics for Knowledge Discovery
April 16, 2008
The US pharmaceutical giant has licensed the low-profile Linguamatics I2E Version 3.x technology for natural language processing and text mining functions. Linguamatics describes the tie-up as an “expansion of strategic collaboration with Pfizer”.
Linguamatics is a Cambridge, England-based company specializing in ferreting meaning from text. The company has a low profile in the United States, but its approach makes it possible for a user to interact with the system via a dialog. This is essentially a question-and-answer approach with the system and the user exchanging information.
Beyond Search identified Linguamatics as one of 24 companies to watch in its new study of companies able to breathe new life into traditional search and retrieval systems.
Pfizer will use the I2E system as an information platform. One feature of I2E is its ability to perform text mining, the process of discovering and extracting key facts and relationships from internal and external literature sources to support decision making.
Pharmaceutical companies in general have been early adopters of information access technologies that can keep their competitive edges sharp. A single item of research data could have a significant financial impact. The compartmentalization of information within drug companies, once considered standard operating procedure, can be detrimental to some business goals. In the last five years, Pfizer has demonstrated an appetite for content processing, business intelligence, and text mining technologies. Furthermore, Pfizer has looked outside the US for information technologies that can reduce costs and increase the financial performance of the company. For example, Pfizer has tapped the French company Temis for other information processing systems.
Pfizer, whose shares are in the $20 range, is on track to meet its profit forecast. The company, like others in the pharmaceutical sector, faces increasing competition and cost control pressure in a harsh economic climate. Information technology appears to be a key part of the company’s broader business strategy.
Additional information about Linguamatics is available from the company’s Web site. The company is profiled in Beyond Search: What to Do When Your Enterprise Search System Won’t Work, now available from the Gilbane Group in Cambridge, Massachusetts. The analysis of Linguamatics is one of the few in-depth descriptions of the I2E technology now available, and it is positioned within the broader “market map” of vendors providing technology that addresses the problems of traditional key word search systems.
Stephen Arnold, April 16, 2008
Linguistic Agents: Smart Software from Israel
April 16, 2008
In my new study “Beyond Search”, I profile a number of non-US content processing companies. Several years ago I learned about Jerusalem-based Linguistic Agents. The company uses an interesting technique for its natural language processing system.
The firm’s founder is Sasson Margaliot. In 1999, Mr. Margaliot wanted to convert linguistic theories into practical technologies. The goal was to enable computers to understand human language and context. Like other innovators in content processing, Mr. Margaliot had expertise in theoretical linguistics and application software development. He studied Linguistics at UCLA and Computer Science at Hebrew University of Jerusalem.
The company’s chief scientist is Alexander Demidov. Mr. Demidov was responsible for the development of linguistic grammars for the company’s NanoSyntactic Parser, the precursor of today’s Streaming Logic engine. Previously, he worked for the Moscow Institute of Applied Mathematics and at Zehut, a company that developed advanced compression and protection algorithms for digital imaging.
Computerworld identified the company in the summer of 2007 as having one of the “cool cutting-edge technologies on the horizon”. Since that burst of publicity in the US, not much has been done to keep the company’s profile above the water line.
The company uses “nano syntax” to extract meaning from documents. On the surface, the approach seems to share some features with Attensity, the “deep extraction company” and the firm that I included in my new study as an exemplar of recursive analysis and linguistic processing for meaning.
The idea is that a series of parallelized processes converts a sentence into a representation that preserves its syntactical meaning. The technology can be applied to search as well as context-based advertising. The company asserts, “The technology can revolutionize how computers and people interact –computers will learn our language instead of vice versa.”
Coveo: Email Face Off in Canada
April 15, 2008
Coveo, a vendor of search and content processing technology, rolled out a limited release of Coveo G2B™ for Email.
Based on a preview glimpsed by ArnoldIT’s above-the-radar, spy goose, Coveo is perhaps the only vendor with an email search application able to deliver unified search and navigation across both live and archived email. In the world of email, the term “archived” means message stores running on such vendors’ repositories as Microsoft Exchange and Symantec Enterprise Vault, among others. The search function can be used from any connected desktop, or Windows Mobile or BlackBerry mobile device. Even Beyond Search’s spy goose queried email from his lowly Treo 650 with near zero latency.
“Beyond Search” was able to look at a new email and search for a referenced attachment in an email no longer on the Treo 650. The spy goose concluded that this functionality is long overdue and will be a welcome service for anyone who relies on email.
Coveo’s founder Laurent Simoneau said:
Businesses today cannot properly leverage their email content for critical decisions…and desktop search applications simply do not scale to meet the needs of today’s mobile workers. Coveo G2B for Email benefits businesses tremendously by enabling their employees to access instantly and easily the information they need most, whether they are in the office or on the road.
You can read an exclusive interview with Mr. Simoneau here.
Coveo’s system appears to challenge Waterloo, Ontario-based Research in Motion in email search and content processing. RIM’s been able to expand its revenues due to new handset margins. In the view of ArnoldIT.com, RIM’s software relies on an aging architecture which is likely to become increasingly problematic as RIM expands its consumer user base.
Coveo takes a fresh approach. Based on Beyond Search’s probe into the product, Coveo is now playing with a man advantage in this face off between Canadian technology rivals. You can find more information on the Coveo Web site.
Stephen Arnold, April 15, 2008
Arnold’s New Study “Beyond Search” Now Available
April 15, 2008
Stephen E. Arnold’s most recent study–Beyond Search: What to Do When Your Enterprise Search System Won’t Work–is now available from the Gilbane Group. The 270-page study contains practical information about fixing problems with an existing behind-the-firewall search system, a market analysis and vendor road map, profiles of 24 vendors of behind-the-firewall search and content processing systems, and a glossary.
The first of the three key findings from the year-long research behind the book is that user dissatisfaction with incumbent search systems is increasing, and the need to deploy a system that meets increasingly savvy users’ needs is rising sharply.
Mr. Arnold also says that Google’s dataspace technology–largely unknown by search vendors and not yet deployed by Google–could reshape enterprise search in a very short time if Google makes it available. Google is keeping quiet about the dataspace technology acquired when Google purchased Transformic, Inc. in 2006. He said, “Few outside of Google know about dataspaces, and the technology offers one way to deliver new types of query functionality so users can know how certain a particular result is to be accurate and to determine the lineage of a particular result.” He added, “The world is starting to think about BigTable, but dataspaces are a quantum leap beyond the functionality of BigTable, which is in itself a quantum leap beyond relational database technology. Google’s engineering and technical prowess are its chief competitive advantage.”
Finally, Mr. Arnold’s research reveals that remarkable new, extremely useful technologies are being developed outside the US. Mr. Arnold says, “There’s a perception that innovation in search only arises in the United States. That’s simply not true. Non-U.S. vendors like Exalead and ISYS Search Software are making strong thrusts into the North American market. Others are opening offices in the U.S. and will increase the competitive heat for many of the best-known search vendors.”
Martin White, noted British search and content expert, said about Beyond Search, “a fabulous job on the book and the industry, and the CIO fraternity, should be very grateful that Mr. Arnold found the energy to write it.”
You can order the study from the Gilbane Group. Selected quotations from the study appear on Mr. Arnold’s Web site, ArnoldIT.com. An abbreviated table of contents is available on that site as well.
Stuart Schram, April 15, 2008