Traditional Publishers: Patricians under Siege

April 19, 2008

This is an abbreviated version of Stephen Arnold’s keynote at the Buying and Selling eContent Conference on April 15, 2008. The full text of the remarks is here.

Roman generals like Caesar relied on towers spaced about 3000 feet apart. Torch signals allowed messages to be passed. Routine communications used a Roman version of the “pony express”, based on innovations in Persia centuries before Rome took to the battlefield.

Today, you rely on email and your mobile phone. Teens and tweens use Twitter and “instant” social messaging systems like those in Facebook and Google Mail. Try to imagine how difficult it would be for Caesar to understand the technology behind Twitter. But how many of you think Caesar would have hit upon a tactical use of this “faster than flares” technology?


Text Mining: No-Cost Resources

April 19, 2008

Engineers without Fears has a post by Matt Moore that contains four useful links. If you are looking for a way to get up to speed on this “beyond search” function, navigate to this post.

None is without some constraints; each is useful. First, you can read a six-page paper comparing four systems: Leximancer, Megaputer, SAS Institute, and SPSS. Keep in mind that each of these approaches text mining from a very different angle of attack. Leximancer is a useful system that can become difficult to navigate in visualization mode. Megaputer, developed by wizards from a university in Russia, is robust but can be complex to operate. SAS has licensed technology from Inxight Software (now owned by SAP’s Business Objects) and recently acquired the text processing specialist Teragram. Expect some changes in the SAS approach in the near future. SPSS, a company best known for data mining, acquired LexiQuest and uses that company’s technologies in its systems. Nevertheless, you can pick up some helpful information in “An Evaluation of Unstructured Text Mining Software”. The link appears on Engineers without Fears.

The link to the National Centre for Text Mining is particularly helpful. The information available on the site ranges from traditional society boilerplate to the more useful comments about tools and research. You may find it useful to spider the entire site. Information can appear and disappear, so an archive is helpful if you plan on extending your research over a period of years.

The link to a lecture by Dr. Marti Hearst is a must-read. Most vendors have sucked concepts, phrases, and data from Dr. Hearst’s work, often without giving her credit. This particular paper dates from late 2003, and a quick search of Google and the University of California, Berkeley Web site will point you to more current information. (You may want to narrow your query to computer science and allied disciplines. The site is sprawling, and it can be difficult to locate what you need. UC Berkeley obviously doesn’t pay much attention to Dr. Hearst’s expertise.)

The link to the 2003 New York Times article satisfies a researcher’s need to get the “gray lady’s” take on a technical topic. I don’t pay much attention to the information in newspapers, but you can decide for yourself. Engineering documents, patent applications, and technical articles often provide more useful information without the rhetorical overextension needed to convert an equation into a two-word phrase or a metaphor.

If you have a budget, you will want to look at the profiles of text mining companies in Beyond Search, a 300-page review of text mining and its component parts. The study also includes a discussion of approaches to content processing that “wrap” text mining in more usable applications. More information about this resource is located here.

Stephen Arnold, April 19, 2008

Open Source and Text Mining on a Collision Course

April 18, 2008

Open source intelligence, or OSI, is moving from the canvas barriers at US government facilities into the business mainstream. Open source is generally understood to mean “accessible via the Internet”, but it is a relatively low-profile discipline. Many of the experts in this type of research avoid the spotlight. The “father” of open source intelligence is Robert Steele.

JasperSoft–a company that describes itself as “the market leader in open source business intelligence”–announced a partnership with Microsoft to ensure that JasperSoft’s business intelligence (BI) solutions work well on Windows platforms, according to Internet News.

The ability to “suck” data into a widely deployed system opens new opportunities for analysts and competitive intelligence practitioners.

A key new initiative is JasperSoft’s Connect product. An analyst can use Windows products as a front end to JasperSoft’s data analysis server. The value of this approach is that the cost of open source analysis can be sharply reduced.

Consolidation in the for-fee content sector may put a restrictor plate on some open source initiatives. For example, the merger of Thomson with Reuters–a deal valued at $17 billion–means that data accessible via the Internet will almost certainly be placed under tighter access controls. If this occurs, commercial information and data that find their way to publicly accessible Web sites will require a fee to access. Improper use of information owned by such multinational professional publishing giants as Thomson, Reed Elsevier, Wolters Kluwer, and Springer-Verlag will lead to restrictions, fees, and possibly legal action.

As a result, wider use of open source intelligence with lower-cost tools such as those from JasperSoft will trigger greater restrictions on high-value information. Awareness of open source and consolidation in the professional publishing and news sectors may collide with unknown consequences.

Open source has disrupted traditional military intelligence methods, and it may pose a similar challenge to professional publishing companies. Open source has increased the pressure on commercial software companies, which have not been able to quell interest in community-supported products like Apache, a widely used Web server. Publishing companies, already threatened by sharply decreasing revenues and rising costs, are likely to respond more quickly and more aggressively than software vendors.

Stephen Arnold, April 18, 2008

The Text Mining Can of Worms

April 17, 2008

In October 2007, I participated in a half-day “text mining” tutorial held after the International Chemical Information Conference in that most appealing Spanish city, Barcelona. The attendees–I think there were about 24 people mostly from European companies–wanted to learn about advanced text mining systems–in theory. Reality often intrudes, however.


Fresh from the primary research for my Beyond Search: What to Do When Your Enterprise Search System Won’t Work, I had a significant amount of information about 50 vendors’ text mining systems and their technologies. The structure of the Barcelona tutorial was straightforward. After defining text mining and differentiating it from the better-known data mining, I walked through some case examples of text mining successes. The second part of the tutorial focused on the business issues of text mining. The end point of this segment tackled three key challenges, to which I will return in a moment. The third segment of the tutorial took a look at what Google was disclosing through its engineering papers and speeches about its approach to text mining. This is a very interesting block of information, and I may at some point in the future describe a little of our findings. The tutorial wrap-up was a series of observations with time for the attendees to ask additional questions and share some of their experiences.


Cognition Upgrades Its Meaning-Based Search

April 17, 2008

Culver City, California’s Cognition Technologies, Inc. has released Semantic NLP. “NLP” is shorthand for “natural language processing”. The idea is that the search system understands a user’s query. No Boolean statements or formal search syntax is required to obtain an answer or a result list from the system.

The company told Beyond Search:

[Our] engineers have crafted a technology which is “the next evolution” in search. That remains to be proven, but Cognition, like a number of rich text processing companies, has jumped into advanced search with verve. However, compared to other search newcomers, Cognition has uniquely solved one of the biggest hurdles to increased precision and recall: understanding the meaning of user queries and the searched content. Through this understanding it is able to resolve both the ambiguity and synonymy of the English language.

The company says that Semantic NLP understands words and phrases in enterprise and Web content. A demonstration of some of the Cognition system’s functions is available here. Registration is required.
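Cognition’s Semantic NLP is proprietary, but a toy sketch can illustrate one small piece of what “resolving synonymy” means in practice: expanding each query term into a set of equivalent words before matching. The lexicon below is invented for illustration and bears no relation to Cognition’s actual semantic map.

```python
# Toy synonym expansion: one small step toward meaning-based matching.
# The lexicon is invented for illustration; Cognition's Semantic NLP
# relies on a far larger semantic map of English.
SYNONYMS = {
    "car": {"car", "automobile", "vehicle"},
    "fast": {"fast", "quick", "rapid"},
}

def expand_query(query):
    """Replace each query term with the sorted set of its synonyms."""
    return [sorted(SYNONYMS.get(term, {term})) for term in query.lower().split()]

expanded = expand_query("fast car")
# A matcher can now treat "quick automobile" as equivalent to "fast car".
```

Resolving ambiguity (one word, several meanings) is the harder half of the problem and requires context, which is why simple lookup tables like this one fall far short of a real meaning-based engine.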

The company is the subject of an in-depth profile in Beyond Search: What to Do When Your Enterprise Search System Won’t Work, published in April 2008 by the Gilbane Group. This study identified 24 vendors whose technology illustrates next-generation search and content processing features.

Additional information about the firm is available at its Web site or by writing learnmore at cognition.com.

Stephen Arnold, April 17, 2008

Leximancer: Divining Meaning from Words

April 17, 2008

In Australia last year, I met several information technology professionals who mentioned the Leximancer text and content processing system to me. Leximancer now has offices in three cities: Brisbane, Australia; London, England; and Boulder, Colorado. I updated my Leximancer files and made a mental note that the company had some nifty visualization technology. Based on comments made to me, analysts in police and intelligence as well as the academic community find the product of significant value. I heard that the company has more than 200 licensees and is growing at a brisk pace.

At the eContent conference in Phoenix, Arizona, one of the attendees was grilling me about text analytics. As the grill-ee, I was reluctant to provide too much information to the grill-er. Most of what the young, confident MBA wanted is in my new study Beyond Search: What to Do When Your Enterprise Search System Won’t Work. Furthermore, she was convinced after her text mining industry research, which included healthy bites of blue-chip consultancies’ pontifications, that no firm combined text analysis, discovery, and useful point-and-click visualizations of the topic and concept space of a collection.

Sigh. Like the Fortune 500 country clubbers, vendors are so darn inadequate. Maybe? Sometimes it’s the Fortune 500 Ivy leaguers who are missing a card or two in their deck, not the vendors. Just a thought.

This short essay is a partial response to her assertion, which was–by the way–100 percent incorrect. For some reason, her research overlooked high-profile tools from dozens of vendors as well as point specialists. On the flight back last night, I recalled the Leximancer system, and I thought I would provide some color about that firm’s approach for two reasons: [a] I find it useful to look at companies with interesting search-related technologies and [b] I want to underscore that her assertion and her research were woefully inadequate.

What’s a Leximancer?

Leximancer is text mining software that you can use to analyze the content of collections of textual documents. The system then displays the extracted information in a browser. Leximancer’s approach to visualization is to use a “concept map”. The idea is that a user can glance at the map, get an overview, and then explore the relationships that Leximancer discovers within the text.
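Leximancer’s machine-learning approach is proprietary, but the general idea behind a concept map can be sketched in a few lines: find the terms that dominate a collection and count how often pairs of them appear in the same document. This is a deliberately naive illustration of term co-occurrence, not Leximancer’s algorithm.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_map(documents, top_n=5):
    """Build a toy concept map: the most frequent terms in a collection
    and how often each pair of them co-occurs in the same document."""
    term_counts = Counter()
    pair_counts = Counter()
    for doc in documents:
        terms = set(doc.lower().split())
        term_counts.update(terms)
        pair_counts.update(combinations(sorted(terms), 2))
    concepts = [t for t, _ in term_counts.most_common(top_n)]
    edges = {pair: n for pair, n in pair_counts.items()
             if pair[0] in concepts and pair[1] in concepts}
    return concepts, edges

docs = [
    "text mining finds patterns",
    "text mining extracts concepts",
    "concepts link related documents",
]
concepts, edges = cooccurrence_map(docs)
```

A real system would stem and weight terms, learn concepts as clusters of related words rather than single tokens, and lay the edges out spatially; the browser-based map the post describes is essentially a rendering of a graph like `edges`.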



Oracle SES May Not Be

April 17, 2008

Security Pro News pointed to a notice in the Oracle Security Web log on April 16, 2008. The database giant has released an update that addresses more than a dozen security issues. You can read the details on the Oracle Web site.

If you are an Oracle Secure Enterprise Search licensee with an Oracle database back end, you will want to make certain you have the appropriate updates. Details of the patch and download links may be accessed from the Oracle Technology Network.

Beyond Search recommends that you navigate to the Oracle site. Make sure your Secure Enterprise Search installation is “secure”. The key differentiator for this search and content processing system is its security engineering. As I noted in the first three editions of the Enterprise Search Report, search systems present a number of access and control issues. Oracle was one of the first companies to pivot a value proposition on security. Oracle’s approach requires some additional administrative effort and, in some cases, the licensing of additional Oracle components.

In the 1980s, Verity implemented a clever ticket system. Although largely forgotten, the Verity approach offered some technical advantages over the Oracle approach. But compared to the security measures taken by some vendors, Oracle’s approach is solid. Glitches are annoying. Update today.
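Oracle’s implementation and Verity’s ticket system differ in detail, but the core idea of secure enterprise search can be sketched generically: every indexed document carries an access control list, and the result list is trimmed against the user’s group memberships before display. The documents and groups below are hypothetical, and this is an illustration of the concept, not either vendor’s mechanism.

```python
def trim_results(results, user_groups):
    """Drop hits the user is not entitled to see. Each result carries
    an ACL: the list of groups allowed to view that document."""
    allowed = set(user_groups)
    return [r for r in results if allowed & set(r["acl"])]

# Hypothetical index hits with document-level ACLs.
hits = [
    {"doc": "q3-forecast.xls", "acl": ["finance"]},
    {"doc": "press-release.doc", "acl": ["everyone"]},
]
visible = trim_results(hits, ["everyone", "engineering"])
```

The administrative effort the post mentions comes from keeping those ACLs synchronized with the directory system as people change roles; late or post-query trimming is also where many search deployments leak documents.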

Stephen Arnold, April 17, 2008

Data Bunny Unmasked

April 16, 2008

Earlier today, a well-paid, somewhat insightful senior executive ripped the fur off a 27-year charade. The keen investigative mind of the anonymous investigator revealed that the data bunny has been Stephen E. Arnold.

The shocking discovery dismayed the two known fans of Mr. Arnold. One chagrined client said:

We had no idea that Mr. Arnold was the data bunny. When he lectured at our company, we did not notice the ears. The information he conveyed was more important than his appearance. I’m not sure what he was wearing during the briefing. But now that the truth is revealed, we will not listen to his analyses if he wears those ears. I hope we don’t confuse substance and appearance again. Proper dress is more important than real information.

When Mr. Arnold learned that his secret was out of the hutch, he blinked his pink eyes and said, according to Donald Anderson, an engineer who has worked with Mr. Arnold for more than 15 years: “Those bunny ears are not funny. Mr. Arnold doesn’t wear them all the time or I just don’t notice them anymore.”

According to Mr. Anderson, Mr. Arnold’s reaction was to stamp his paw and twitch his nose in frustration. Added Mr. Anderson, “I guess he thought the secret was safe. Almost like Lois Lane learning the identity of Superman. It’s sad, but the truth must come out.”

According to another member of the Beyond Search team, Mr. Arnold removed his bunny ears in disgust and slipped on his new Beyond Search rubber goose mask. A photograph of Mr. Arnold in his goose disguise is the basis of this Web log’s logo here.

Beyond Search will publish more details about this startling investigative discovery as they become available. Mr. Arnold’s attorney told Beyond Search, “Although the revelation is shocking, I have advised Mr. Arnold to not reveal the name of the genius who disclosed this 27-year-old mystery.”

According to his attorney, Mr. Arnold’s final comment was, “Honk. Honk.”

Stephen Arnold, April 16, 2008

Pfizer Taps Linguamatics for Knowledge Discovery

April 16, 2008

The US pharmaceutical giant Pfizer has licensed the low-profile Linguamatics I2E Version 3.x technology for natural language processing and text mining functions. Linguamatics describes the tie-up as an “expansion of strategic collaboration with Pfizer”.

Linguamatics is a Cambridge, England-based company specializing in ferreting meaning from text. The company has a low profile in the United States, but its approach makes it possible for a user to interact with the system via a dialog. This is essentially a question-and-answer approach with the system and the user exchanging information.

Beyond Search identified Linguamatics as one of 24 companies to watch in its new study of companies able to breathe new life into traditional search and retrieval systems.

Pfizer will use the I2E system as an information platform. One feature of I2E is its ability to perform text mining, the process of discovering and extracting key facts and relationships from internal and external literature sources to support decision making.
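I2E’s dialog-driven extraction rests on full linguistic analysis, but a regular-expression sketch can show the shape of what “extracting facts and relationships” produces: subject-relation-object triples pulled from running text. The verb list and entity names below are invented for illustration; this is not Linguamatics’ method.

```python
import re

# Toy pattern-based fact extractor. Real systems such as I2E use
# linguistic parsing and curated terminologies, not regular expressions.
RELATION = re.compile(r"(\w+)\s+(inhibits|activates|binds)\s+(\w+)", re.I)

def extract_relations(text):
    """Return (subject, relation, object) triples found in the text."""
    return [(m.group(1), m.group(2).lower(), m.group(3))
            for m in RELATION.finditer(text)]

# Hypothetical sentences from a literature source.
facts = extract_relations("DrugA inhibits EnzymeB. DrugA binds ProteinC.")
```

Triples like these are what make the results queryable: an analyst can ask “what inhibits EnzymeB?” against the extracted table instead of rereading the source documents.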

Pharmaceutical companies in general have been early adopters of information access technologies that can keep their competitive edges sharp. A single item of research data could have a significant financial impact. The compartmentalization of information within drug companies, once considered standard operating procedure, can be detrimental to some business goals. In the last five years, Pfizer has demonstrated an appetite for content processing, business intelligence, and text mining technologies. Furthermore, Pfizer has looked outside the US for information technologies that can reduce costs and increase the financial performance of the company. For example, Pfizer has tapped the French company Temis for other information processing systems.

Pfizer, whose shares are in the $20 range, is on track to meet its profit forecast. The company, like others in the pharmaceutical sector, faces increasing competition and cost control pressure in a harsh economic climate. Information technology appears to be a key part of the company’s broader business strategy.

Additional information about Linguamatics is available from the company’s Web site. The company is profiled in Beyond Search: What to Do When Your Enterprise Search System Won’t Work, now available from the Gilbane Group in Cambridge, Massachusetts. The analysis of Linguamatics is one of the few in-depth descriptions of the I2E technology now available, positioned within the broader “market map” of vendors providing technologies that address the problems of traditional key word search systems.

Stephen Arnold, April 16, 2008

Linguistic Agents: Smart Software from Israel

April 16, 2008

In my new study “Beyond Search”, I profile a number of non-US content processing companies. Several years ago I learned about Jerusalem-based Linguistic Agents, which uses an interesting technique for its natural language processing system.

The firm’s founder is Sasson Margaliot. In 1999, Mr. Margaliot wanted to convert linguistic theories into practical technologies. The goal was to enable computers to understand human language and context. Like other innovators in content processing, Mr. Margaliot had expertise in theoretical linguistics and application software development. He studied Linguistics at UCLA and Computer Science at Hebrew University of Jerusalem.

The company’s chief scientist is Alexander Demidov. Mr. Demidov was responsible for the development of linguistic grammars for the company’s NanoSyntactic Parser, the precursor of today’s Streaming Logic engine. Previously, he worked for the Moscow Institute of Applied Mathematics and at Zehut, a company that developed advanced compression and protection algorithms for digital imaging.

Computerworld identified the company in the summer of 2007 as having one of the “cool cutting-edge technologies on the horizon”. Since that burst of publicity in the US, not much has been done to keep the company’s profile above the water line.

The company uses “nano syntax” to extract meaning from documents. On the surface, the approach seems to share some features with Attensity, the “deep extraction company” and the firm that I included in my new study as an exemplar of recursive analysis and linguistic processing for meaning.

The idea is that a series of parallelized processes converts a sentence into a representation that preserves its syntactical meaning. The technology can be applied to search as well as context-based advertising. The company asserts, “The technology can revolutionize how computers and people interact – computers will learn our language instead of vice versa.”

