August 26, 2013
Despite enterprise companies moving away from SQL databases to more robust NoSQL systems, Oracle has updated its database to include new features, including an XQuery Full Text search. We found an article that examines how the new function will affect Oracle and where it seems to point. The article from the Amis Technology Blog, “Oracle Database 12c: XQuery Full Text,” explains that the XQuery Full Text search was built to handle unstructured XML content. It does so by extending the XQuery language in Oracle XML DB. This finally makes Oracle capable of working with all types of XML. The rest of the article focuses on the XQuery code.
When the new feature was tested on Wikipedia content stored as XML, the results were positive:
“During tests it proved very fast on English Wikipedia content (10++ Gb) and delivered the results within less than a second. But such a statement will only be picked up very efficiently if the new, introduced in 12c, corresponding Oracle XQuery Full-Text Index has been created.”
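Since the Amis post centers on XQuery code, here is a minimal sketch of what exercising the feature from Python might look like. It assumes the cx_Oracle driver, a hypothetical documents table with an XMLType column named doc, and the standard W3C XQuery Full Text “contains text” operator; none of these names come from the article.

```python
# Minimal sketch: an XQuery Full Text query against Oracle 12c from
# Python via cx_Oracle. Table, column, and connection details are
# illustrative placeholders, not taken from the Amis post.
import cx_Oracle

conn = cx_Oracle.connect("scott", "tiger", "dbhost/orcl12c")
cur = conn.cursor()

# The W3C XQuery Full Text "contains text" operator, embedded in SQL
# via XMLQuery: return titles of documents whose abstract mentions
# the word "oracle".
sql = """
SELECT XMLCast(
         XMLQuery('for $d in /document
                   where $d/abstract contains text "oracle"
                   return $d/title/text()'
                  PASSING doc RETURNING CONTENT) AS VARCHAR2(200))
FROM documents
"""

# Per the quote above, queries like this are only fast once the new
# 12c full-text index exists, e.g. something along the lines of:
#   CREATE SEARCH INDEX doc_ftidx ON documents (doc) FOR XML
for (title,) in cur.execute(sql):
    if title:
        print(title)
```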
Oracle is trying to improve its technology as more of its users switch over to NoSQL databases. Improving the search function along with other features keeps Oracle in the competition and proves that relational tables still have some kick in them. Interestingly enough, Oracle appears to be focusing its energies on MarkLogic’s technology to stay in the race.
Whitney Grace, August 26, 2013
August 23, 2013
By now most have heard that J.K. Rowling, famous for her astoundingly successful Harry Potter books, has been revealed as the author of the well-received crime novel “The Cuckoo’s Calling.” Time spoke to one of the analysts who discovered that author Robert Galbraith was actually Rowling and shares what it learned in “J.K. Rowling’s Secret: a Forensic Linguist Explains how He Figured it Out.”
It started with a tip. Richard Brooks, editor of the British “Sunday Times,” received a mysterious tweet claiming that “Robert Galbraith” was a pen name for Rowling. Before taking the claim to the book’s publisher, Brooks called on Patrick Juola of Duquesne University to linguistically compare “The Cuckoo’s Calling” with the Potter books. Juola has had years of experience with forensic linguistics, specifically authorship attribution. Journalist Lily Rothman writes:
“The science is more frequently applied in legal cases, such as with wills of questionable origin, but it works with literature too. (Another school of forensic linguistics puts an emphasis on impressions and style, but Juola says he’s always worried that people using that approach will just find whatever they’re looking for.)
“But couldn’t an author trying to disguise herself just use different words? It’s not so easy, Juola explains. Word length, for example, is something the author might think to change — sure, some people are more prone to ‘utilize sesquipedalian lexical items,’ he jokes, but that can change with their audiences. What the author won’t think to change are the short words, the articles and prepositions. Juola asked me where a fork goes relative to a plate; I answered ‘on the left’ and wouldn’t ever think to change that, but another person might say ‘to the left’ or ‘on the left side.’”
One tool Juola uses is the free Java Graphical Authorship Attribution Program. After taking out rare words, names, and plot points, the software calculates the hundred most-used words from an author under consideration. Though a correlation does not conclusively prove that two authors are the same person, it can certainly help make the case. “Sunday Times” reporters took their findings to Galbraith’s/Rowling’s publisher, who confirmed the connection. Though Rowling has said that using the pen name was liberating, she (and her favorite charities) may be happy with the over 500,000 percent increase in “Cuckoo’s Calling” sales since her identity was uncovered.
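JGAAP bundles many attribution methods, so what follows is only a minimal sketch of the general idea described above: build a profile of an author’s most-used words and compare profiles. It is not JGAAP’s actual code, and the file names are placeholders.

```python
# Minimal sketch of frequency-profile authorship comparison, in the
# spirit of the approach described above (not JGAAP's actual algorithm).
import math
import re
from collections import Counter

def profile(text, top_n=100):
    """Relative frequencies of the text's top_n most common words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.most_common(top_n)}

def similarity(p1, p2):
    """Cosine similarity between two word-frequency profiles."""
    words = set(p1) | set(p2)
    dot = sum(p1.get(w, 0.0) * p2.get(w, 0.0) for w in words)
    norm = math.sqrt(sum(v * v for v in p1.values())) * \
           math.sqrt(sum(v * v for v in p2.values()))
    return dot / norm if norm else 0.0

# Toy usage: a high score suggests, but never proves, common authorship.
known = profile(open("rowling_sample.txt").read())
disputed = profile(open("galbraith_sample.txt").read())
print(f"similarity: {similarity(known, disputed):.3f}")
```

In practice an analyst would strip rare words, names, and plot terms first, exactly as the article notes, so the comparison leans on articles and prepositions rather than subject matter.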
The article notes that, though folks have been statistically analyzing text since the 1800s, our turn to e-books may make for a sharp increase in such revelations. Before that development, the process was slow even with computers, since textual analysis had to be preceded by the manual entry of texts via keyboard. Now, though, importing an entire tome is a snap. Rowling may just be the last famous author to enjoy the anonymity of a pen name, even for a few months.
Cynthia Murrell, August 23, 2013
August 3, 2013
I read “How Can I Pass the String ‘Null’ through WSDL (SOAP)…” My hunch is that only a handful of folks will dig into this issue. Most senior managers buy the baloney generated by search and content processing vendors. Yesterday, for one of the outfits publishing my “real” (for-fee) columns, I reviewed a slide deck stuffed full of “all’s” and “every’s.” The message was that this particular modern system, which boasts a hefty price tag, could do just about anything one wanted with flows of content.
Happily overlooked was the problem of a person with a wonky name. Case in point: “Null.” The link from Hacker News to the Stack Overflow item gathered a couple of hundred comments. You can find these here. If you are involved in one of the next-generation, super-wonderful content processing systems, you may find a few minutes with the comments interesting and possibly helpful.
My scan of the comments plus the code in the “How Can I” post underscored the disconnect between what people believe a system can do and what a here-and-now system can actually do. Marketers say one thing, buyers believe another, and the installed software does something completely different.
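Here is a contrived sketch of the failure mode the Stack Overflow thread circles around: a pipeline which naively maps the string “null” to a missing value. The function and record names are mine, not from the post.

```python
# Contrived sketch of the "Null" failure mode: an ingestion step that
# treats the string "null" as a missing value makes Ms. Null unfindable.
def naive_clean(value):
    # A common, and wrong, shortcut in ingestion code.
    if value is None or value.strip().lower() == "null":
        return None
    return value

records = [
    {"first": "Jennifer", "last": "Null"},
    {"first": "Robert", "last": "Galbraith"},
]

index = {}
for rec in records:
    last = naive_clean(rec["last"])
    if last is None:
        continue  # the record is silently dropped from the index
    index.setdefault(last.lower(), []).append(rec)

print(index.get("null"))       # None: Jennifer Null cannot be found
print(index.get("galbraith"))  # [{'first': 'Robert', 'last': 'Galbraith'}]
```

With that failure mode in mind, consider three broader implications: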
- A person’s name—in this case ‘Null’—cannot be located in a search system. With all the hoo-hah about Fancy Dan systems, is this issue with a named entity important? I think it is because it means that certain entities may not be findable without expensive, time-consuming human curation and indexing. Oh, oh.
- Non-English names pose additional problems. Migrating a name in one language into a string that a native speaker of a different language can understand introduces errors. Instead of finding one person, the system finds multiple people (see the sketch after this list). Looking for a batch of 50 people, each incorrectly identified during processing, generates a lot of names, which guarantees more work for expensive humans or many, many false drops. Operate this type of entity extraction system a number of times and one generates so much work that there is not enough money or people to figure out what’s what. Oh, oh.
- Validating named entities requires considerable work. Knowledgebases today are built automatically and on the fly. Rules are no longer created by humans. Systems, like some of Google’s “janitor” technology, figure out the rules themselves, and then “workers” modify those rules on the fly. So what happens when errors are introduced via “rules”? The system keeps on truckin’. Anyone who has worked through fixing up the known tags from a smart system like Autonomy IDOL knows that degradation can set in when the training set does not represent the actual content flow. Any wonder why precision and recall scores have not improved much in the last 20 years? Oh, oh.
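On the non-English name point, here is a hedged, toy illustration of how loose variant matching turns one query into several candidates. The variant table and directory entries are invented for the example.

```python
# Toy sketch: transliteration variants make one query match many people,
# generating false drops that a human must then sort out.
VARIANTS = {
    "muhammad": {"muhammad", "mohammed", "mohamed", "mohammad"},
}

def candidates(query, directory):
    q = query.lower()
    # Expand the query to every known romanization, then match broadly.
    expanded = next((v for v in VARIANTS.values() if q in v), {q})
    return [name for name in directory
            if name.split()[0].lower() in expanded]

directory = ["Mohammed Ali", "Muhammad Ali", "Mohamed Al-Fayed"]
print(candidates("Muhammad", directory))
# All three names come back; every hit now needs human verification.
```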
I think this item about “Null” highlights the very real and important problems with assumptions about automated content processing, whether the corpus is a telephone directory with a handful of names or the mind-boggling flows which stream from various content channels.
Buying a system does not solve long-standing, complicated problems in text processing. Fast talk like that which appears in some of the Search Wizards Speak interviews does not change the false drop problem.
So what does this mean for vendors of Fancy Dan systems? Ignorance on the part of buyers is one reason why deals may close. What does this mean for users of systems which generate false drops and dependent reports which are off base? Ignorance on the part of users makes it easy to use “good enough” information to make important decisions.
Stephen E Arnold, August 3, 2013
Sponsored by Xenky
July 17, 2013
We have found a new resource: an aggregation tailored to data analysts. The Text Analytics Pros Daily is now available at the content curation site Paper.li. The page pulls in analytics-related news arranged under topics like Business, Technology, Education, and the Environment. Our question: what are their criteria for “pros”?
For those unfamiliar with Paper.li, it is a platform that allows users to create their own aggregations by choosing their sources and customizing their page. Their description specifies:
“The key to a great newspaper is a great newsroom. The Paper.li platform gives you access to an ever-expanding universe of articles, blog posts, and rich media content. Paper.li automatically processes more than 250 million social media posts per day, extracting & analyzing over 25 million articles. Only Paper.li lets you tap into this powerful media flow to find exactly what you need, and publish it easily on your own online newspaper.”
As I peruse the page, I see many articles from a wide variety of sources: Marketwired, the Conference Board, Marketing Cloud, and assorted tech bloggers. There is a lot of information here; it is worth checking out for any content analytics pros (or hobbyists).
Cynthia Murrell, July 17, 2013
June 1, 2013
From a game show win to being inundated with “Watson pitches”, IBM is doing its best to make Watson more successful than Sherlock Holmes. I read “IBM Inundated with Watson Pitches as It Prepares to Offer Service to Developers.” The headline certainly suggests that for search and content processing, Watson is going like gangbusters.
At the last two search and content processing shows I attended, I heard nothing about Watson. I suppose that specialist conferences are not the place for IBM, which has larger designs on the market. However, there were some developers on the programs at these conferences, and I don’t recall hearing a direct reference to IBM. I think I mentioned Hewlett Packard once.
The write up seeks to set me straight on the powerful pull IBM Watson is exerting on those involved in building search related applications:
IBM is receiving hundreds of ideas from developers wanting to use its Watson supercomputing technology, which will be made available to anyone wanting to build applications on top of its capabilities.
The information comes from John Gordon, who is IBM’s vice president for Watson Solutions. I associate him with the phrase “data is the new oil,” but I have mixed up which “expert” drew the word picture in my mind. A biographical profile of Mr. Gordon is available on Yatedo. According to an item appearing on the University of Texas’ Web site here, he “is the director of Strategy and Product Management for IBM’s new Watson Solutions Division. He is responsible for developing end-to-end business models for transforming the innovations created by IBM Watson into a strategic set of industry solutions.” Another University of Texas Web page here pointed out:
Prior to this [Watson] role Gordon held a number of executive strategy, market management, and business development positions within IBM. He joined the IT industry more than 17 years ago and has consistently helped global clients enhance their performance and results by leveraging innovative technology. John holds undergraduate degrees in Philosophy and Computer Applications from the University of Notre Dame and has an M.B.A. from The University of Texas at Austin. Additionally, John is a certified SOA architect with a foundational certification in the IT Infrastructure Library (ITIL) standards, and is a co-author of a Harvard Business School case on value-in-use solution pricing.
I think I know the magnitude of the developer stampede. Programmers are really into public relations and MBA analyses.
Stephen E Arnold, June 1, 2013
Sponsored by Xenky
May 29, 2013
We came across a 4,600-word news release about the language translation software market. The study runs more than 400 pages and covers a wide range of topics, including mobile phone translation systems. We worked on the Topeka Capital Markets’ Google voice report, so we are biased: Google seems to have a significant technology and resource edge. As we worked through the news release, we did see a list of the firms which WinterGreen discusses.
A notable translation helper, the Rosetta Stone. A happy quack to the British Museum at www.britishmuseum.org.
I want to snag the list because it had some surprises as well as both familiar and unfamiliar firms in the inventory. Here’s what I noticed in the news release:
ABBYY Lingvo (http://www.lingvo-online.ru/en)
Alchemy CATALYST (http://www.alchemysoftware.com/)
AppTek HMT (now a unit of SAIC. http://www.saic.com)
Cognition Technologies (www.cognition.com)
Duolingo (more of a learning system. http://duolingo.com/)
Google (ah, the GOOG)
Hewlett Packard (maybe www.autonomy.com)
IBM WebSphere Translation Server (try http://goo.gl/hGS2R)
Kilgray Translation Technologies (http://kilgray.com/)
Language Engineering (http://www.lec.com)
Language Weaver (Now part of SDL. See http://goo.gl/IH3mg)
Lingo24 (An agency. See http://www.lingo24.com/)
Lionbridge (crowdsourcing and integrator at http://www.lionbridge.com/)
Mission Essential Personnel (humans for rent)
Plunet BusinessManager (A management system at http://www.plunet.com/us/)
Proz.com (humans for rent at http://www.proz.com)
RWS Legal Translation (http://www.rws.com/EN/)
Reverso (Free. See http://www.reverso.net/text_translation.aspx?lang=EN)
SDL Trados (Part of SDL. See http://www.trados.com/en/)
Sail Labs (http://www.sail-labs.com/)
Softissimo (Services and software. http://www.softissimo.com/softissimo.asp?lang=IT)
Symbio Software (http://www.symbio.com/)
Translations.com (Services and software. http://www.translations.com/)
Translators without Borders (Humans for rent. http://translatorswithoutborders.org/)
Veveo (More semantics than translation. http://corporate.veveo.net/)
Vignette (Open Text. http://www.opentext.com)
Word Magic Technology (I could not locate.)
WorldLingo (Rent a human. http://goo.gl/dhiu)
Of these 30 or so companies, there were some which struck me as a surprise. Hewlett Packard, for example, owns Autonomy. I suppose that other units of Hewlett Packard have translation capabilities, but were these licensed or homegrown? Also, the inclusion of Vignette is interesting. I must admit that I don’t hear much about Vignette as a translation system. The list makes translation look robust, yet the key players boil down to a handful of companies. I did not spot firms in the translation services or software business in China, India, Japan, or Russia, but I may have missed these firms in the WinterGreen news release describing the report.
If you want to buy a copy of the report, which I assume has paragraphs unlike the news release, point your browser at http://goo.gl/97e2s and have your credit card ready. The report is about US$7,500.
Stephen E Arnold, May 29, 2013
Sponsored by Augmentext
May 29, 2013
Identifying user sentiment has become one of the most powerful analytic tools provided by text processing companies, and Bitext’s integrative software approach is making sentiment analysis available to companies seeking to capitalize on its benefits while avoiding burdensome implementation costs. A few years ago, Lexalytics merged with Infonics. Since that time, Lexalytics has been marketing aggressively to position itself as one of the leaders in sentiment analysis. Exalead also offered sentiment analysis functionality several years ago. I recall a demonstration which generated a report about a restaurant, showing how those writing reviews expressed their satisfaction.
Today, vendors of enterprise search systems have added “sentiment analysis” as one of the features of their systems. The phrase “sentiment analysis” usually appears cheek-by-jowl with “customer relationship management,” “predictive analytics,” and “business intelligence.” My view is that the early text analysis vendors, such as TREC participants in the early 2000s, recognized that key word indexing was not useful for certain types of information retrieval tasks. Go back and look at the early suggestions for the benefit of sentiment functions within natural language processing, and you will see that the idea is a good one, but it has taken a decade or more to become a buzzword. (See, for example, Y. Wilks and M. Stevenson, “The Grammar of Sense: Using Part-of-Speech Tags as a First Step in Semantic Disambiguation,” Journal of Natural Language Engineering, 1998, Number 4, pages 135–144.)
One of the hurdles to sentiment analysis has been the need to bolt yet another complex function, with a significant computational cost, onto existing systems. In an uncertain economic environment, additional expenses get extra scrutiny. Not surprisingly, organizations which understand the value of sentiment analysis and want to be in step with the data implications of the shift to mobile devices want a solution which works well and is affordable.
Fortunately, Bitext has stepped forward with a semantic analysis program that focuses on complementing and enriching existing systems rather than replacing them. This is bad news for some of the traditional text analysis vendors and for enterprise search vendors whose programs often require a complete overhaul or replacement of existing enterprise applications.
I recently saw a demonstration of Bitext’s local sentiment system that highlights some of the integrative features of the application. The demonstration walked me through an online service which delivered an opinion and sentiment snap-in, together with topic categorization. The “snap-in,” or cloud-based, approach eliminates much of the resource burden imposed by other companies’ approaches, and this information can be easily integrated with any local app or review site.
The Bitext system, however, goes beyond what I call basic sentiment. The company’s approach processes content from user generated reviews as well as more traditional data, such as information in a CRM solution or a database of agent notes, as it does with the Salesforce Marketing Cloud. One important step forward for Bitext’s system is its inclusion of trends analysis. Another is its “local sentiment” function, coupled with categorization. Local sentiment means that when I am in a city looking for a restaurant, I can display the locations and consumers’ assessments of nearby dining establishments. While a standard review consists of 10 or 20 lines of text and an overall star score, Bitext can add precisely which topics are touched on in the review, with associated sentiments. For a simple review like “the food was excellent but the service was not that good,” Bitext will return two topics and two valuations (food, positive +3; service, negative -1).
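To make that food-and-service example concrete, here is a toy, lexicon-based sketch of aspect-level scoring. It is illustrative only, not Bitext’s engine; the aspect list, lexicon, and crude negation flip are invented for the demonstration.

```python
# Toy sketch of aspect-level sentiment scoring in the spirit of the
# food/service example above. Illustrative only; not Bitext's engine.
ASPECTS = {"food": "food", "service": "service"}
LEXICON = {"excellent": 3, "good": 1, "bad": -2, "terrible": -3}
NEGATORS = {"not", "never"}

def score_clause(clause):
    """Return (aspect, score) for one clause, or None if no aspect."""
    words = clause.lower().replace(",", " ").split()
    aspect = next((a for w, a in ASPECTS.items() if w in words), None)
    if aspect is None:
        return None
    score = sum(LEXICON.get(w, 0) for w in words)
    if any(n in words for n in NEGATORS):
        score = -score  # crude flip: "not that good" reads as negative
    return aspect, score

review = "the food was excellent but the service was not that good"
for clause in review.split(" but "):
    result = score_clause(clause)
    if result:
        print(result)
# Prints ('food', 3) then ('service', -1), matching the example above.
```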
A tap displays a detailed list of opinions, positive and negative. This list is automatically generated on the fly. The Bitext addition includes a “local sentiment score” for each restaurant identified on the map. The screenshot below shows how location-based data and publicly accessible reviews are presented.
Bitext’s system can be used to provide deep insight into consumer opinions and developing trends over a range of consumer activities. The system can aggregate ratings and complex opinions on shopping experiences, events, restaurants, or any other local issue. Bitext’s system can enrich reviews from such sources as Yelp, TripAdvisor, Epinions, and others in a multilingual environment.
Bitext boasts social media savvy. The system can process content from Twitter, Google+ Local, FourSquare, Bing Maps, and Yahoo! Local, among others, and easily integrates with any of these applications.
The system can also rate products, customer service representatives, and other organizational concerns. Data processed by the Bitext system includes enterprise data sources, such as contact center transcripts or customer surveys, as well as web content.
In my view, the Bitext approach goes well beyond the three stars or two dollar signs approach of some systems. Bitext can evaluate topics or “aspects.” The system can generate opinions for each topic or facet in the content stream. Furthermore, Bitext’s use of natural language processing provides qualitative information and insight about each topic, revealing a more accurate understanding of specific consumer needs than purely quantitative rating systems can offer. Unlike other systems I have reviewed, Bitext presents an easy-to-understand, easy-to-use way to get a sense of what users really have to say, and in multiple languages, not just English!
For those interested in analytics, the Bitext system can identify trending “places” and topics with a click.
Stephen E Arnold, May 29, 2013
Sponsored by Augmentext
April 29, 2013
I read “Connotate Announces Record Quarter, Driven by Market Demand for More Rapid Delivery of Web Content-Based Products and Services.” The main point is that Connotate is growing revenues and that growth is a result of “market demand.”
Market demand, according to wiseGEEK, means:
the total amount of purchases of a product or family of products within a specified demographic. The demographic may be based on factors such as age or gender, or involve the total amount of sales that are generated in a particular geographic location.
The market for the Connotate solution is broadening as the cost of deployment continues to decrease. “We’ve mastered the entire workflow stack from manual processes all the way to high-end automation,” said Mulholland. “This has resulted in significant growth for us in the SMB market, while we continue to increase the value that our large enterprise customers are receiving. For example, a major UK-based job board publisher was tasked with achieving a 40 percent increase in the number of listings in a very short period of time. Connotate’s automated solution is helping them meet this aggressive goal.”
Does this mean that other companies in Connotate’s competitive sphere will experience similar upsides? The anecdotal information available to me suggests that some of the companies competing directly with Connotate are having trouble closing commercial deals which generate a profit for the vendor. Examples range from specialists in pure analytics to providers of business information visualization systems and metatagging outfits.
My hunch is that if Connotate is experiencing the financial gains attributed to this privately held company, other factors must be in play. What are those factors? Why are so many of Connotate’s competitors struggling to hit their numbers? Why are some investors slowing down their commitment to back some of Connotate’s rivals?
Perhaps market demand does not float the boats in this particular body of water? Worth monitoring: the actions of big-name investors with a fondness for this sector, like Silver Lake; some of the companies which continue to receive injections of cash in the hopes of hitting the big time; and the peregrinations of executives who jump from one content processing outfit to another.
I am assuming that the financial data referenced in the write up are accurate.
Stephen E Arnold, April 29, 2013
Sponsored by Augmentext
April 29, 2013
I met David Bean years ago. He was explaining “deep extraction” to me at a now defunct search engine conference. I recall that he had a number of US government clients. I noted in my analysis of the company that the firm wanted to break into non-government markets.
I made sure that one of my team captured news releases about Attensity. When I checked my files to update my Attensity profile, I noted that the company had merged with a couple of German outfits, was pushing into sentiment analysis, and was beating the text analytics drum.
In one sense, Attensity was following the same path as Stratify, which, as you probably know, was once Purple Yogi. Hewlett Packard now owns Stratify, and I don’t hear too much about how its journey from government work to the wide world of non-government work has turned out. Stratify, now part of Hewlett Packard Autonomy, is doing legal stuff … I think. If I understand the write up by a high-intellect consultant expert, Attensity is speedboating into customer support.
Can market niches like customer support, eDiscovery, and business intelligence keep some vendors afloat?
Two different markets but one common goal: diversify in order to generate big revenues.
I read “Attensity Uses Social Media Technology for Smarter Customer Engagement.” On the surface, the story is a good one and it is earnestly told:
Its product Respond uses natural language-based analysis to derive insights from any form of text-based data and among other results can produce analyses of customer sentiment, hot issues, trends and key metrics. The product supports what Attensity calls LARA – listen, analyze, relate, act – which is a form of closed-loop performance management. It begins by extracting data from multiple sources of text-based data, (listening), analyzing the content of the data (analyze), linking this data with other sources of customer data, and producing alerts, workflows and reports to encourage action to be taken based on the insights (act).
Familiar stuff. Text processing, outputs, and payoffs for the licensees.
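As a thought experiment, the LARA loop in the quoted description can be sketched as four tiny functions. Everything here, names included, is hypothetical; this is not Attensity’s API.

```python
# Skeletal sketch of a LARA-style (listen, analyze, relate, act) loop
# as the quoted description lays it out. All names are hypothetical.
def listen(sources):
    """Pull raw text items from multiple feeds (tweets, email, notes)."""
    for source in sources:
        yield from source()

def analyze(item):
    """Crude stand-in for NLP: flag obviously unhappy text."""
    hot = any(w in item["text"].lower() for w in ("broken", "refund", "angry"))
    return {**item, "hot_issue": hot}

def relate(item, crm):
    """Link the item to an existing customer record, if any."""
    return {**item, "customer": crm.get(item["user"])}

def act(item):
    """Close the loop: alert a human when a known customer is unhappy."""
    if item["hot_issue"] and item["customer"]:
        print(f"ALERT: {item['customer']} -> {item['text']}")

crm = {"@jsmith": "Jane Smith, premier account"}
tweets = lambda: [{"user": "@jsmith", "text": "My unit arrived broken."}]

for raw in listen([tweets]):
    act(relate(analyze(raw), crm))
```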
Attensity, founded in 2000 (that’s 13 years ago), is no spring chicken. I learned from the write up:
Attensity has also made some technical improvements to the product. The architecture now supports multitenancy and automatic load balancing, which are especially useful in handling very large volumes of tweets. Reporting has been enhanced to include more visualization options, trend analysis, emerging hot issues, and process and performance analysis.
My thought is that many firms which flourished with the once generous assistance of the US government now have to find a way to generate top line revenue, sustainable growth, and profits.
In the present financial environment, text processing companies are flocking to specific problem areas in organizations. Customer support (a bit of an oxymoron in my opinion), eDiscovery, and business intelligence (not as amusing as military intelligence in my opinion) are now well-served sectors.
The companies looking for software and systems to make sense of data, cut costs, gain a competitive advantage, or achieve some other benefit much favored by MBAs have not found a magic carpet ride.
The noise from vendors is increasing. The time required to find and close a deal is increasing. Some customers are looking high and low for a solution which is “good enough”. Management turnover, frequent repositionings, and familiar marketing lingo by themselves may not be enough to keep the many firms competing in these “hot niches” afloat.
Stephen E Arnold, April 29, 2013