SAS Text Analytics and Teragram

May 28, 2010

I received a call about Teragram, the text processing company that SAS acquired a couple of years ago. I did a quick Overflight check and realized that I had not documented the absorption of Teragram into SAS. Teragram’s technology is alive and well, but the SAS positioning is for content processing to be a component of SAS Text Analytics. The product and solution have their own subsite within SAS.com. You can locate the details at http://www.sas.com/text-analytics/.

Another important point is that SAS Text Analytics includes four components. There is the SAS Enterprise Content Categorization function. The system parses content and identifies entities. Metadata are created along with category rules.

The second function is SAS Sentiment Analysis. A number of companies are competing in this sector. The SAS approach sucks in emails, tweets, and other documents. The system identifies various subjective shades in the source content.

SAS Text Miner now includes both text and data mining operations. The system is not one of those Web 2.0, “it is really easy” solutions. The system is easy to use, but to put “easy” in context, you will need programming and statistical savvy along with solid data set building skills.

The SAS Ontology Management solution provides a centralized method for keeping index terms and metatags consistent. Sounds easy, but this type of consistency is the difference between useful and useless information. SharePoint lacks this type of functionality. You have been given a gentle reminder about consistent tagging, dear SharePoint user.
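
To make the consistency point concrete, here is a minimal sketch of what a centralized list of preferred terms buys you. This is not how the SAS Ontology Management solution works internally; the vocabulary `PREFERRED_TERMS` and the function `normalize_tags` are hypothetical names invented for illustration.

```python
# Hypothetical controlled vocabulary: variant tag -> preferred index term.
# The terms are invented for illustration; they do not come from any SAS product.
PREFERRED_TERMS = {
    "text analytics": "text analytics",
    "text mining": "text analytics",
    "text-analytics": "text analytics",
    "sentiment": "sentiment analysis",
    "sentiment analysis": "sentiment analysis",
}


def normalize_tags(tags):
    """Map each raw tag to its preferred term, dropping duplicates.

    Tags missing from the vocabulary are returned separately so a human
    can decide whether to add them.
    """
    normalized = []
    unknown = []
    for tag in tags:
        key = tag.strip().lower()
        preferred = PREFERRED_TERMS.get(key)
        if preferred is None:
            if key not in unknown:
                unknown.append(key)
        elif preferred not in normalized:
            normalized.append(preferred)
    return normalized, unknown


if __name__ == "__main__":
    print(normalize_tags(["Text Mining", "text-analytics", "Sentiment", "PLM"]))
```

Three differently typed tags collapse to two index terms, and the one tag nobody has defined is flagged instead of silently becoming a new category.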

SAS has a blog focused on text analytics. You can read “The Text Frontier” but last time I checked, the blog’s most recent update was posted in March 2010.

Bottom line: Teragram is alive and well, just part of SAS Text Analytics.

Stephen E Arnold, May 28, 2010

Freebie

Sybase Touts Search Prowess

May 27, 2010

“Sybase IQ Update Strengthens Database Query, Search Features” is, I suppose, a response of sorts to my opinion that SAP has not made a commitment to search. With TREX and some Endeca support, SAP seems content to rely upon third parties to make content findable within a sprawling SAP construct.

The story trots out an azure chip consultant to dash around the circus ring. I love those azure chip “we” statements. Right. The core of the announcement is that Sybase IQ 15.2 can search unstructured content and perform such tricks as counting term frequency. Yep, that’s text analytics.
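
For readers who have not bumped into the term, term frequency is simply a count of how often each word appears in a document. The sketch below is plain Python, not Sybase IQ syntax; the function name `term_frequency` and the sample sentence are my own assumptions for illustration.

```python
import re
from collections import Counter


def term_frequency(text):
    """Return a Counter of lowercase word counts for one document."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)


if __name__ == "__main__":
    sample = "Search is hard. Enterprise search is harder."
    for term, count in term_frequency(sample).most_common(3):
        print(term, count)
```

Counting is the easy part. The work in a real system is deciding what counts as a term and doing it across millions of documents.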

The passage that caught my attention was this:

“Sybase IQ is best known for its extreme performance, allowing decision-makers to analyze business trends, predict outcomes, and revise strategies, often in a matter of seconds,” Joydeep Das, Director of Product Management, Data Warehousing and Analytics at Sybase, said in a statement. “With Sybase IQ 15.2, enterprises are now able to analyze previously untapped sources of information, such as web content and email, to deliver smarter answers across structured and unstructured data.”

What is Sybase IQ? I dug through my Overflight file on this product and jotted down these points:

  • This is a column-based database. The column approach stacks data vertically, not in the Excel-like horizontal row format. Arguments between row store and column dudes are esoteric. In a nutshell, certain types of processes are facilitated with the column approach. Keep in mind it is a relational database and some RDBMS jockeys can make row stores into columnar structures.
  • For certain types of data, reads are faster. Some data warehouse jockeys swear by the column method. Sun was a cheerleader, but we know what happened to Sun, so that may not be the endorsement it once was. Keep in mind that Sybase itself was acquired after compiling an interesting financial track record, but those offices are cool looking.
  • The company’s definition of data federation confused me. Does Sybase perform a Mark Logic type of function, creating a repository? Does Sybase work like the original Vivisimo federating method? What happens if I need to see the source? Sybase has its indexes within its system. I am not sure how IQ queries external databases, makes sure security is observed, and then returns results without getting confused about who provided what and which data item is visible to a particular user. But like many vendors, azure chip folks are happy to parrot “federated”, cash their check, and move on to the next client.
  • You can, if you hurry, download the Sybase description of IQ’s architectural strength at this link. I checked it on May 26, 2010, and it was valid. Wait too long and the PDF may be unavailable.
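
The row versus column distinction in the first two points can be sketched in a few lines. This is a toy illustration in Python, not a description of Sybase IQ internals; the records, field names, and function names are all hypothetical.

```python
# Three hypothetical sales records; the field names are invented.
ROWS = [
    {"region": "East", "product": "A", "revenue": 100},
    {"region": "West", "product": "B", "revenue": 250},
    {"region": "East", "product": "C", "revenue": 75},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_revenue_row_store(rows):
    """Row layout: every whole record is visited to read one field."""
    return sum(row["revenue"] for row in rows)


def total_revenue_column_store(columns):
    """Column layout: only the 'revenue' column is read."""
    return sum(columns["revenue"])


if __name__ == "__main__":
    cols = to_columns(ROWS)
    print(cols["revenue"])
    print(total_revenue_row_store(ROWS), total_revenue_column_store(cols))
```

Both layouts give the same total. The difference is that the column version touches one list and ignores the other fields, which is the sort of read pattern a data warehouse aggregate tends to have.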

With Sybase in a new home, a product in Version 15 will have an opportunity to show how it can grow and contribute to the SAP franchise. With SAP pursuing inorganic growth, the expectations and timeline will be key factors in my opinion. The database sector has some dominant players with upstarts like Google ready to enter the fray. Can Sybase challenge IBM, Microsoft, Oracle, and the NoSQL crowd? I hope so.

Stephen E Arnold, May 27, 2010

Freebie

IBM Search Technology

May 25, 2010

Before I headed West last week, I participated in a discussion about IBM search technology. No one at the lunch meeting worked at IBM, but, hey, IBM is a giant in software and services and each person had a viewpoint.

One surprising factoid emerged from chatter, and I wanted to snag it before it flew away like my first female goose friend. (She left for New York, abandoning the joys of rural Illinois for the bright lights in the big city. She probably ended up working at IBM in Armonk.)

IBM has a mini Web site embedded within its sprawling IBM digital Uzbekistan. The page is “Enterprise Search Technology.” The subtitle is “Innovation Matters”. You can navigate directly to this page by clicking this link. Finding the page took some work, but you are welcome to experience the thrill of the hunt via IBM.com if you have some spare time.

IBM describes Trevi, which is an Intranet search system. The system incorporates six technologies, illustrated in the diagram below:

[Image: IBM diagram of the six technologies in Trevi]

Source: IBM 2010.

The factoid: The page seems to be an island in time. The featured researcher – Marcus Fontoura – offers some comments about problems in searching. A click returns a 404 error.

Interesting.

Stephen E Arnold, May 25, 2010

Freebie.

Exalead and Dassault Tie Up, Users Benefit

May 24, 2010

A happy quack to the reader who alerted us to another win by Exalead.

Dassault Systèmes (DS) (Euronext Paris: #13065, DSY.PA), one of the world leaders in 3D and Product Lifecycle Management (PLM) solutions, announced an OEM agreement with Exalead, a global software provider in the enterprise and Web search market. As a result of this partnership, Dassault will deliver discovery and advanced PLM enterprise search capabilities within the Dassault ENOVIA V6 solutions.

The Exalead CloudView OEM edition is dedicated to ISVs and integrators who want to differentiate their solutions with high-performing and highly scalable embedded search capabilities. Built on an open, modular architecture, Exalead CloudView uses minimal hardware but provides high scalability, which helps reduce overall costs. Additionally, Exalead’s CloudView uses advanced semantic technologies to analyze, categorize, enhance and align data automatically. Users benefit from more accurate, precise and relevant search results.

This partnership with Exalead demonstrates the unique capabilities of ENOVIA’s V6 PLM solutions to serve as an open federation, indexing and data warehouse platform for process and user data, for customers across multiple industries. Dassault Systèmes PLM users will benefit from its Exalead-empowered ENOVIA V6 solutions to handle large data volumes thus enabling PLM enterprise data to be easily discovered, indexed and instantaneously available for real-time search and intelligent navigation. Non-experts will have the opportunity to access PLM know-how and knowledge with the simplicity and the performance of the Web in scalable online collaborative environments. Moreover, PLM creators and collaborators will be able to instantly find IP from any generic, business, product and social content and turn it into actionable intelligence.

Stephen E Arnold, May 22, 2010

Freebie.

Twitter Sentiments: A Search Variant?

May 13, 2010

Quite a suggestive write up from DNA India called “Twitter Sentiments May Soon Replace Public Opinion Polls.” According to the write up, “combing Twitter for data can be as good a way of researching opinions as conducting an actual poll.” Instead of working through a traditional survey process with sampling, instrument drafting, and instrument testing, just tweet. The notion of searching through data sets for a nugget gets replaced with an instant answer. For me the key point in the write up was:

Noah Smith, assistant professor of language technologies and machine learning in the School of Computer Science, said that the findings suggest that analyzing the text found in streams of tweets could become a cheap, rapid means of gauging public opinion on at least some subjects. He, however, warned that tools for extracting public opinion from social media text are still crude and social media remain in their infancy, so the extent to which these methods could replace or supplement traditional polling is still unknown.

What is the make up of a Twitter message sample? Noise. That’s an understatement. Nevertheless, the idea is interesting and shows how “informazation” is becoming a significant method.
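
To show how crude such a tool can be, here is a minimal word-counting sketch. The write up does not describe the researchers’ method, so treat this as my assumption of one simple approach: count words from a positive list and a negative list and take the ratio. The word lists and the function name `sentiment_ratio` are hypothetical.

```python
import re

# Tiny hypothetical word lists; a real lexicon would be far larger.
POSITIVE = {"good", "great", "love", "win"}
NEGATIVE = {"bad", "awful", "hate", "lose"}


def sentiment_ratio(tweets):
    """Return positive-word count divided by negative-word count.

    Returns None when no negative words are found, because the ratio
    is undefined in that case.
    """
    pos = neg = 0
    for tweet in tweets:
        for token in re.findall(r"[a-z']+", tweet.lower()):
            if token in POSITIVE:
                pos += 1
            elif token in NEGATIVE:
                neg += 1
    return pos / neg if neg else None


if __name__ == "__main__":
    print(sentiment_ratio(["I love this great plan", "awful idea", "lol"]))
```

Note that the third message contributes nothing. Sarcasm, slang, and spam contribute worse than nothing, which is where the noise problem bites.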

Stephen E Arnold, May 13, 2010

Freebie.

Google Gets Guha Patent

May 12, 2010

Short honk: I learned today that Google received a patent for Ramanathan Guha’s 2005 invention “Aggregating Context Data for Programmable Search Engines”, US 7,716,199. Will the other PSE inventions find their way out of the USPTO’s cave of winds? This “aggregation” invention is significant, so the fate of the other Guha inventions may not matter.

Stephen E Arnold, May 12, 2010

Freebie.

News Flash: Data Mining May Not Be an Information Cure All

May 7, 2010

Technology can work wonders. Technology is supposed to make it easier for downsized organizations to perform with agility and alacrity. I am “into” technology but I understand that the minimum wage workers at airline counters and financial institutions operate within systems assumed to work as intended. These systems, in my opinion, work neither at the level of answering a simple question like “Is the flight on time?” nor at the more sophisticated level of “Where did this wire transfer come from?”

Why is it a surprise that technology does not perform less familiar tasks without glitches or outright breakdowns? I was surprised to read “NY Plot Highlights Limitations of Data Mining.” There were three reasons:

  1. The writer for Network World expresses gentle surprise that predictive systems don’t work too well when applied to the actions of one person. Network World documents lots of system glitches, and the gentle surprise is not warranted.
  2. The story plants the seed that we have no choice but to rely on fancy content processing systems. Are there other options? None if you rely on this article’s analysis. In my experience there are indeed options, but these are conveniently nudged to the margins.
  3. The dancing around with data mining is specious. Text processing is one of those Rube Goldberg machines just built with software. Get the assumptions wrong, the inputs wrong, or the algorithms wrong to a slight degree and guess what? The outputs are likely to be wrong.

Here’s the passage I found interesting:

That fact is likely to provide more fodder for those who question the effectiveness of using data mining approaches to uncover and forecast terror plots. Since the terror attacks of Sept. 11, the federal government has spent tens of millions of dollars on data mining programs and behavioral surveillance technologies that are being used by several agencies to identify potential terrorists. The tools typically work by searching through mountains of data in large databases for unusual patterns of activity, which are then used to predict future behavior. The data is often culled from dozens of sources including commercial and government databases and meshed together to see what kind of patterns emerge.

In my experience, humans and text processing must work in an integrated way. Depend only on technology and the likelihood of getting actionable information that is immediately useful goes down. Even Google asks humans to improve on its machine translation outputs. Smart software may not be so smart.

Stephen E Arnold, May 7, 2010

Unsponsored post.

Milward from Linguamatics Wins 2010 Evvie Award

April 28, 2010

The Search Engine Meeting, held this year in Boston, is one of the few events that focuses on the substance of information retrieval, not the marketing hyperbole of the sector. Now entering its second decade, the conference features speakers who tackle challenging subjects. This year speakers addressed such topics as “Universal Composable Indexing” by Chris Biow, Mark Logic Corporation; “Innovations in Social Search” by Jeff Fried, Microsoft; “From Structured to Unstructured and Back Again: Database Offloading” by Gregory Grefenstette, Exalead; and a dozen other important topics.

From left to right: Sue Feldman, Vice President, IDC, Dr. David Milward, Liz Diamond, Stephen E. Arnold, and Eric Rogge, Exalead.

Each year, the best paper is recognized with the Evvie Award. The “Evvie” was created in honor of Ev Brenner, one of the pioneers in machine-readable content. After a distinguished career at the American Petroleum Institute, Ev served on the planning committee for the Search Engine Meeting and contributed his insights to many search and content processing companies. One of the questions I asked after each presentation was, “What did Ev think?” I valued Ev Brenner’s viewpoint as did many others in the field.

The winner of this year’s Evvie award is David R. Milward, Linguamatics, for his paper “From Document Search to Knowledge Discovery: Changing the Paradigm.” Dr. Milward said:

Business success is often dependent on making timely decisions based on the best information available. Typically, for text information, this has meant using document search. However, the process can be accelerated by using agile text mining to provide decision-makers directly with answers rather than sets of documents. This presentation will review the challenges faced in bringing together diverse and extensive information resources to answer business-critical R&D questions in the pharmaceutical domain. In particular, it will outline how an agile NLP-based approach for discovering facts and relationships from free text can be used to leverage scientific knowledge and move beyond search to automated profiling and hypothesis generation from millions of documents in real time.

Dr. Milward has 20 years’ experience of product development, consultancy and research in natural language processing. He is a co-founder of Linguamatics, and designed the I2E text mining system which uses a novel interactive approach to information extraction. He has been involved in applying text mining to applications in the life sciences for the last 10 years, initially as a Senior Computer Scientist at SRI International. David has a PhD from the University of Cambridge, and was a researcher and lecturer at the University of Edinburgh. He is widely published in the areas of information extraction, spoken dialogue, parsing, syntax and semantics.

Presenting this year’s award were Eric Rogge, Exalead, and Liz Diamond, niece of Ev Brenner. The award winner received a recognition award and a check for $500. A special thanks to Exalead for sponsoring this year’s Evvie.

The judges for the 2010 Evvie were Dr. David Evans (Evans Research), Sue Feldman (IDC), and Jill O’Neill (NFAIS).

Congratulations, Dr. Milward.

Stuart Schram IV, April 28, 2010

Sponsored post.

New Search and Old Boundaries

April 28, 2010

Yesterday in my talk at a conference I pointed out that for many people, the Facebook environment will cultivate new species of information retrieval. Understandably the audience listened politely and converted my observations into traditional information retrieval methods. Several of the people with whom I spoke pointed out that the Facebook information was findable only with a programmatic query via the Facebook application programming interfaces or by taking a Facebook feed and processing it. The idea that “search” now spans silos, includes structured and unstructured data, and delivers actionable results describes what some organizations want. There are challenges, of course. These include:

  • Mandated silos of information; for example, in certain situations, mash ups and desiloization are prohibited for legal or practical reasons
  • The costs of shifting from inefficient, expensive methods to more informed methods; for example, the costs of data transformation can be onerous. I have talked with individuals who point out that data transformation can consume significant sums of money and these expenditures are often inadequately budgeted. One result is a slowdown or cutback in the behind-the-scenes preparatory work.
  • Business processes have sometimes emerged based on convention, user behavior or because the system was refined over time. When “data” are meshed with such a business process, the marriage is a less-than-happy one. Data centric thinking can be blunted when juxtaposed to certain traditional business processes and methods.

In short, the new world can be envisioned, based on speculation, or assembled from fragmentary reports from the field. I can imagine the intrepid 16th century navigators understanding why innovators have to push forward into a new and unknown world. One reminder is the assertion that an estimated 358 million personal data records have been leaked since 2005.

The Guardian article “Facebook Privacy Hole ‘Lets You See Where Strangers Plan to Go’” provides an example of one challenge. The point of the write up is that the Facebook social network has a “privacy hole”. The Guardian says:

Some people report that they are able to see the public “events” that Facebook users have said they will attend – even if the person is not a “friend” on the social network… The implications of being able to find out the movements of any of the 400m people on Facebook are potentially wide-ranging – although the flaw does not seem to apply to every user, or every event. Yee says that the simplest way to prevent your name appearing in such lists is to put “not attending” against any event you are invited to.

As the Facebook approach to finding information captures users, the barriers between new types of information and the uses to which those information objects can be put come down. In a social space, the issue is personal privacy. In an organizational space, the issue is the security of information assets.

As young people enter the workforce, these folks bring a comfort level with Facebook-type systems markedly different from mine. I think organizations are largely unable to control effectively what some employees do with online services. Telework, mobile devices, and smart phones present a management and information challenge.

The lowering of information barriers and the efforts to dissolve silos further reduce an organization’s control of information and the knowledge of the uses to which that information may be put.

Let’s step back.

First, ineffective search and content processing systems exist, so organizations need ways to address the costs and inefficiencies of incumbent systems. Web services and fresh approaches to indexing content seem to be solutions to findability problems in some situations.

Second, employees—particularly those comfortable with pervasive connectivity and social methods of obtaining information—do what works for them. These methods are not necessarily controllable or known to some employers. An employee can use a personal smart phone to ask “friends” a question. After all, what are friends for?

Third, vendors want to describe their systems using words and phrases that connote ways to solve findability problems. Talking about merged data and collaboration may be what’s needed to close a deal.

When these three ingredients are mixed, the result is a security and information control challenge that is only partially understood.

Is it possible to deliver a next generation information experience and minimize the risks from such a system? Sure, but there will be surprises along the route. Whether it is Mr. Zuckerberg’s schedule or insights into the Web browsing habits of government employees, there will be unexpected and important insights about these systems. The ability to use a search interface to obtain reports is increasing. Are the privacy and security controls lagging behind?

Stephen E Arnold, April 28, 2010

Unsponsored post.

SAS and Social Media

April 28, 2010

The social media bandwagon rolls on. I read “SAS aims to Make a Splash in Social Media Analytics” and realized that even large firms cannot ignore Facebook’s impact. True, there are many social media companies, but Facebook has emerged as the go-to service, threatening to eclipse even Twitter. The story says:

SAS says its technology can identify influencers within social networks, quantify their impact and from that forecast the future volume of social media conversations. The ultimate aim is to predict what impact these conversations will have on a business so companies can allocate relevant resources, create “what-if” scenarios and correlate key marketing metrics like brand preference, web traffic, online campaign effectiveness and media mix.

IBM SPSS will be quick to respond. Statistics could become even more fun.

Stephen E Arnold, April 28, 2010

Unsponsored post.
