November 25, 2013
With Google becoming more difficult to use, many professionals need a way to locate, filter, and obtain high-value information. Silobreaker is an online service and system that delivers actionable information.
The co-founder of Silobreaker, in an exclusive interview for Search Wizards Speak, says:
I learned that in most organizations, information was locked in separate silos. The information in those silos was usually kept under close control by the silo manager. My insight was that if software could make available to employees the information in different silos, the organization would reap an enormous gain in productivity. So the idea was to “break” down the information and knowledge silos that exist within companies, organizations and mindsets.
And knock down barriers the system does. Silobreaker’s popularity is surging. The most enthusiastic supporters of the system come from the intelligence community, law enforcement, analysts, and business intelligence professionals. A user’s query retrieves up-to-the-minute information from Web sources, commercial services, and open source content. The results are available as a series of summaries, full-text documents, relationship maps among entities, and other report formats. The user does not have to figure out which item is an advertisement. The Silobreaker system delivers muscle, not fatty tissue.
Mr. Bjore, a former intelligence officer, adds:
Silobreaker is an Internet and a technology company that offers products and services which aggregate, analyze, contextualize and bring meaning to the ever-increasing amount of digital information.
Underscoring the difference between Silobreaker and other online systems, Mr. Bjore points out:
What sets us apart is not only the Silobreaker technology and our commitment to constant innovation. Silobreaker embodies the long term and active experience of having a team of users and developers who can understand the end user environment and challenges. Also, I want to emphasize that our technology is one integrated technology that combines access, content, and actionable outputs.
The ArnoldIT team uses Silobreaker in our intelligence-related work. We include a profile of the system in our lectures about next-generation information gathering and processing systems.
Stephen E Arnold, November 25, 2013
November 2, 2013
The Linguamatics Blog recently reported on the outcome of the 2013 Text Mining Summit in the post “Pharma and Healthcare Come Together to See the Future of Text Mining.”
According to the article, this year’s event drew a record crowd of over 85 attendees who had the opportunity to listen to industry experts from the pharma and healthcare sector.
The article summarizes a few event highlights:
“Delegates were provided with an excellent opportunity to explore trends in text mining and analytics, natural language processing and knowledge discovery. Delegates discovered how I2E is delivering valuable intelligence from text in a range of applications, including the mining of scientific literature, news feeds, Electronic Health Records (EHRs), clinical trial data, FDA drug labels and more. Customer presentations demonstrated how I2E helps workers in knowledge driven organizations meet the challenge of information overload, maximize the value of their information assets and increase speed to insight.”
Events like the Text Mining Summit are excellent opportunities for members of the data analytics community to gather and share their insights and new advances in the industry.
Jasmine Ashton, November 02, 2013
October 28, 2013
Semantria is a company focused on providing text and sentiment analysis to anyone. The company’s approach streamlines content analysis so that, in less than three minutes and for a nominal $1,000, the power of content processing can help answer tough business questions.
The firm’s founder is Oleg Rogynskyy, who has worked at Nstein (now part of Open Text) and Lexalytics. The idea for Semantria blossomed from Mr. Rogynskyy’s insight that text analytics technology was mature enough to be useful to almost any organization or business professional.
I interviewed Mr. Rogynskyy on October 24, 2013. He told me:
At Semantria, we want to simplify and democratize access to text analytics technology. We want people to be able to get up and running in no time, with a small budget, and actually derive value from our technology. The classic story is you buy a system worth $100k and don’t deploy it.
Semantria focuses on a class of problems that a few years ago would have been outside the reach of many firms. He said:
We make it simple for our clients to solve the following problems: First, some organizations have too much text to read. For example, a Twitter stream or surveys with many responses. Also, there is the need to move quickly and reduce the time to get to market. Many survey results come with an expiry date before they’re irrelevant. Then there is reporting the information. Anyone can use their Excel smarts to build simple/interesting reports and visuals out of unstructured data. But that can take some time, and Semantria accelerates this step. Finally, users need to analyze text with the same impartiality each time. A human might see a glass as half full or half empty, but Semantria will always see a glass with water.
One of the most interesting aspects of Semantria is that the company delivers its solution as a cloud service. Mr. Rogynskyy observed:
We are happily in the cloud, and in the cloud we trust. We have android and iOS software development kits in the works, so whoever wants to talk to our API from mobile devices will be doing it with ease very soon.
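The cloud-first approach means integration is a matter of calling a Web API rather than installing software. As a rough illustration only (the endpoint, credential, and response fields below are hypothetical placeholders, not Semantria's documented interface), a minimal Python client might look like this:

import requests

# Hypothetical endpoint and credential; the real service's API details differ.
API_URL = "https://api.example-text-analytics.com/v1/sentiment"
API_KEY = "your-api-key-here"

def analyze_text(text):
    """Send one document to the cloud service and return its analysis."""
    response = requests.post(
        API_URL,
        headers={"Authorization": "Bearer " + API_KEY},
        json={"text": text},
        timeout=10,
    )
    response.raise_for_status()
    # Assumed response shape, e.g. {"label": "positive", "score": 0.72}
    return response.json()

if __name__ == "__main__":
    print(analyze_text("Setup took minutes, and the results answered our question."))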
You can get more information about Semantria at https://semantria.com.
This interview is one of more than 60 full-text interviews with individuals who are deeply involved in search, content processing, and analytics. You can find the full series at www.arnoldit.com/search-wizards-speak.
Stephen E Arnold, October 28, 2013
September 11, 2013
Author J.K. Rowling recently learned firsthand how sophisticated analytics software has become. It was a linguistic analysis of the text in The Cuckoo’s Calling which unmasked her as the popular crime novel’s author “Robert Galbraith.” (These tools were originally devised to combat plagiarism.) Now, I Programmer tells us in “Anonymouth Hides Identity,” open-source software is being crafted to foil such tools and give writers “stylometric anonymity.”
Whether a wordsmith just wants to enjoy a long-lost sense of anonymity, as the wildly successful author of the Harry Potter series attempted to do, or has more high-stakes reasons to hide behind a pen name, a team from Drexel University has the answer. The students from the school’s Privacy, Security, and Automation Lab (PSAL) just captured the Andreas Pfitzmann Best Student Paper Award at this year’s Privacy Enhancing Technologies Symposium for their paper on the subject. The article reveals:
The idea behind Anonymouth is that stylometry can be a threat in situations where individuals want to ensure their privacy while continuing to interact with others over the Internet. A presentation about the program cites two hypothetical scenarios:
*Alice the Anonymous Blogger vs. Bob the Abusive Employer
*Anonymous Forum vs. Oppressive Government…
The JStylo-Anonymouth (JSAN) framework is a work in progress at PSAL under the supervision of assistant professor of computer science Dr. Rachel Greenstadt. It consists of two parts:
*JStylo – an authorship attribution framework used for the underlying feature extraction, employing a set of linguistic features (see the sketch after this list)
*Anonymouth – authorship evasion (anonymization) framework, which suggests changes that need to be made.
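To give a sense of what feature extraction from linguistic features involves, here is a minimal Python sketch. The feature set (average word length plus function-word frequencies) is a toy illustration of the general idea, not JStylo's actual configuration, which is richer and implemented in Java:

import re
from collections import Counter

# A handful of common function words; real frameworks use far larger feature sets.
FUNCTION_WORDS = ["the", "a", "an", "of", "to", "in", "on", "and", "that", "it"]

def stylometric_features(text):
    """Return a toy feature vector: average word length plus relative
    frequencies of common function words, which authors rarely think
    to disguise."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1
    features = {"avg_word_length": sum(len(w) for w in words) / total}
    counts = Counter(words)
    for fw in FUNCTION_WORDS:
        features["freq_" + fw] = counts[fw] / total
    return features

print(stylometric_features("The fork goes on the left of the plate."))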
The admittedly very small study discussed in the paper found that 80 percent of participants were able to produce anonymous documents “to a limited extent.” It also found certain constraints: it was more difficult to anonymize existing documents than new creations, for example. Still, this is an interesting development, and I am sure we will see more efforts in this direction.
Cynthia Murrell, September 11, 2013
August 27, 2013
Often, we look at specific apps and applications that address search needs. Sometimes, it is refreshing to find articles that take a step back and look at the overall paradigm shifts guiding the feature updates and new technology releases flooding the media. Forbes reports on the big picture in “NetAppVoice: How The Semantic Web Changes Everything. Again!”
Evolving out of the last big buzzword, big data, the semantic Web is now ubiquitous. Starting at the beginning, the article explains what semantic search allows people to do. A user can search for terms that retrieve results that go beyond keywords: through metadata and other semantic technologies, associations between related concepts are created.
According to the article, hyperconnectivity is the goal; it is what allows semantic search to deliver the promised meaningful insights:
For example, if we could somehow acquire all of the world’s knowledge, it wouldn’t make us smarter. It would just make us more knowledgeable. That’s exactly how search worked before semantics came along. In order for us to become smarter, we somehow need to understand the meaning of information. To do that we need to be able to forge connections in all this data, to see how each piece of knowledge relates to every other. In the semantic Web, we users provide the connections, through our social media activity. The patterns that emerge, the sentiment in the interactions—comments, shares, tweets, Likes, etc.—allow a very precise, detailed picture to emerge.
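To make the keyword-versus-meaning distinction concrete, consider a toy sketch in Python. The hand-built concept map below is an illustrative assumption on my part; real semantic systems derive such associations from ontologies, knowledge graphs, and the social signals the article describes:

import re

# Hand-built associations; real systems learn these rather than hard-coding them.
CONCEPT_MAP = {
    "car": {"automobile", "vehicle", "sedan"},
    "doctor": {"physician", "clinician", "gp"},
}

def expand_query(terms):
    """Grow a keyword query with related concepts."""
    expanded = set(terms)
    for term in terms:
        expanded |= CONCEPT_MAP.get(term, set())
    return expanded

def semantic_match(query_terms, document_text):
    """True if the document mentions the query terms or any related concept."""
    doc_words = set(re.findall(r"[a-z]+", document_text.lower()))
    return bool(expand_query(query_terms) & doc_words)

# A literal keyword engine misses this document; concept expansion finds it.
print(semantic_match({"car", "repair"}, "Our vehicle repair shop opens at nine."))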
Enterprise organizations are in a unique position to achieve this hyperconnectivity, and they also have a growing list of technological solutions to help break down silos and promote safe and secure data access for appropriate users. For example, the text analytics and semantic processing behind the Cogito Intelligence API enhance the ability to decipher meaning and insights from a multitude of content sources, including social media and unstructured corporate data.
Megan Feil, August 27, 2013
August 23, 2013
By now most have heard that J.K. Rowling, famous for her astoundingly successful Harry Potter books, has been revealed as the author of the well-received crime novel “The Cuckoo’s Calling.” Time spoke to one of the analysts who discovered that author Robert Galbraith was actually Rowling, and shares what they learned in, “J.K. Rowling’s Secret: a Forensic Linguist Explains how He Figured it Out.”
It started with a tip. Richard Brooks, editor of the British “Sunday Times,” received a mysterious tweet claiming that “Robert Galbraith” was a pen name for Rowling. Before taking the claim to the book’s publisher, Brooks called on Patrick Juola of Duquesne University to linguistically compare “The Cuckoo’s Calling” with the Potter books. Juola has had years of experience with forensic linguistics, specifically authorship attribution. Journalist Lily Rothman writes:
“The science is more frequently applied in legal cases, such as with wills of questionable origin, but it works with literature too. (Another school of forensic linguistics puts an emphasis on impressions and style, but Juola says he’s always worried that people using that approach will just find whatever they’re looking for.)
“But couldn’t an author trying to disguise herself just use different words? It’s not so easy, Juola explains. Word length, for example, is something the author might think to change — sure, some people are more prone to ‘utilize sesquipedalian lexical items,’ he jokes, but that can change with their audiences. What the author won’t think to change are the short words, the articles and prepositions. Juola asked me where a fork goes relative to a plate; I answered ‘on the left’ and wouldn’t ever think to change that, but another person might say ‘to the left’ or ‘on the left side.’”
One tool Juola uses is the free Java Graphical Authorship Attribution Program. After taking out rare words, names, and plot points, the software calculates the hundred most-used words from an author under consideration. Though a correlation does not conclusively prove that two authors are the same person, it can certainly help make the case. “Sunday Times” reporters took their findings to Galbraith’s/Rowling’s publisher, who confirmed the connection. Though Rowling has said that using the pen name was liberating, she (and her favorite charities) may be happy with the over 500,000 percent increase in “Cuckoo’s Calling” sales since her identity was uncovered.
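The underlying arithmetic is straightforward: build a frequency profile of each text's most-used words, then measure how close the profiles are. Here is a minimal Python sketch of that idea; JGAAP's actual methods are more sophisticated, and the cosine measure below is my illustrative choice, not necessarily Juola's:

import math
import re
from collections import Counter

def word_profile(text, top_n=100):
    """Relative frequencies of the text's most-used words."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1
    return {w: c / total for w, c in Counter(words).most_common(top_n)}

def similarity(p, q):
    """Cosine similarity between two frequency profiles."""
    dot = sum(p[w] * q[w] for w in set(p) & set(q))
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

# Placeholder strings stand in for full manuscripts.
known = "The fork goes on the left of the plate, as it always does."
disputed = "She set the fork on the left of the plate and sat down."
# A higher score suggests, but never proves, a shared author.
print(similarity(word_profile(known), word_profile(disputed)))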
The article notes that, though folks have been statistically analyzing text since the 1800s, our turn to e-books may make for a sharp increase in such revelations. Before that development, the process was slow even with computers, since textual analysis had to be preceded by the manual entry of texts via keyboard. Now, though, importing an entire tome is a snap. Rowling may just be the last famous author to enjoy the anonymity of a pen name, even for just a few months.
Cynthia Murrell, August 23, 2013
August 3, 2013
I read “How Can I Pass the String ‘Null’ through WSDL (SOAP)…” My hunch is that only a handful of folks will dig into this issue. Most senior managers buy the baloney generated by search and content processing vendors. Yesterday I reviewed, for one of the outfits publishing my “real” (for fee) columns, a slide deck stuffed full of “all’s” and “every’s.” The message was that this particular modern system, which boasted a hefty price tag, could do just about anything one wanted with flows of content.
Happily overlooked was the problem of a person with a wonky name. Case in point: “Null”. The link from Hacker News to the Stackoverflow item gathered a couple of hundred comments. You can find these here. If you are involved in one of the next-generation, super-wonderful content processing systems, you may find a few minutes with the comments interesting and possibly helpful.
My scan of the comments plus the code in the “How Can I” post underscored the disconnect between what people believe a system can do and what a here-and-now system can actually do. Marketers say one thing, buyers believe another, and the installed software does something completely different.
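To make the disconnect concrete, here is a minimal Python sketch of the anti-pattern at the heart of the “Null” problem. The code is illustrative and not taken from any particular SOAP library: any layer that round-trips values through text and then maps the literal string “null” to a missing value will lose Ms. Null’s record.

def naive_deserialize(raw):
    """BUG: conflates the literal string 'Null' with a missing value,
    the anti-pattern behind the Stackoverflow thread."""
    if raw is None or raw.strip().lower() == "null":
        return None
    return raw

def find_person(last_name):
    if last_name is None:
        raise ValueError("no last name supplied")
    return "searching index for surname " + repr(last_name)

print(find_person(naive_deserialize("Smith")))  # works as expected
# print(find_person(naive_deserialize("Null")))  # raises ValueError: Ms. Null vanishes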
- A person’s name—in this case “Null”—cannot be located in a search system. With all the hoo-hah about Fancy Dan systems, is this issue with a named entity important? I think it is, because it means that certain entities may not be findable without expensive, time-consuming human curation and indexing. Oh, oh.
- Non-English names pose additional problems. Migrating a name in one language into a string that a native speaker of a different language can understand introduces problems. Instead of finding one person, the system finds multiple people. Looking for a batch of 50 people, each incorrectly identified during processing, generates a lot of names, which guarantees more work for expensive humans or many, many false drops. Operate this type of entity extraction system a number of times, and one generates so much work that there is not enough money or people to figure out what’s what. Oh, oh.
- Validating named entities requires considerable work. Knowledgebases today are built automatically and on the fly. Rules are no longer created by humans. Software, like some of Google’s “janitor” technology, figures out the rules itself, and then “workers” modify those rules on the fly. So what happens when errors are introduced via “rules”? The system keeps on truckin’. Anyone who has worked through fixing up the known tags from a smart system like Autonomy IDOL knows that degradation can set in when the training set does not represent the actual content flow. Any wonder why precision and recall scores have not improved much in the last 20 years? Oh, oh.
I think this item about “Null” highlights the very real and important problems with assumptions about automated content processing, whether the corpus is a telephone directory with a handful of names or the mind-boggling flows which stream from various content channels.
Buying a system does not solve long-standing, complicated problems in text processing. Fast talk, like that which appears in some of the Search Wizards Speak interviews, does not change the false drop problem.
So what does this mean for vendors of Fancy Dan systems? Ignorance on the part of buyers is one reason why deals may close. What does this mean for users of systems which generate false drops and dependent reports which are off base? Ignorance on the part of users makes it easy to use “good enough” information to make important decisions.
Stephen E Arnold, August 3, 2013
Sponsored by Xenky
July 17, 2013
We have found a new resource: an aggregation tailored to data analysts, The Text Analytics Pros Daily, is now available at the content curation site Paper.li. The page pulls in analytics-related news arranged under topics like Business, Technology, Education, and the Environment. Our question: what are their criteria for “pros”?
For those unfamiliar with Paper.li, it is a platform that allows users to create their own aggregations by choosing their sources and customizing their page. Their description specifies:
“The key to a great newspaper is a great newsroom. The Paper.li platform gives you access to an ever-expanding universe of articles, blog posts, and rich media content. Paper.li automatically processes more than 250 million social media posts per day, extracting & analyzing over 25 million articles. Only Paper.li lets you tap into this powerful media flow to find exactly what you need, and publish it easily on your own online newspaper.”
As I peruse the page, I see many articles from a wide variety of sources: Marketwired, the Conference Board, Marketing Cloud, and assorted tech bloggers. There is a lot of information here; it is worth checking out for any content analytics pros (or hobbyists).
Cynthia Murrell, July 17, 2013
May 29, 2013
Identifying user sentiment has become one of the most powerful analytic tools provided by text processing companies, and Bitext’s integrative software approach is making sentiment analysis available to companies seeking to capitalize on its benefits while avoiding burdensome implementation costs. A few years ago, Lexalytics merged with Infonics. Since that time, Lexalytics has been marketing aggressively to position the company as one of the leaders in sentiment analysis. Exalead also offered sentiment analysis functionality several years ago. I recall a demonstration which generated a report showing how those writing reviews of a restaurant expressed their satisfaction.
Today vendors of enterprise search systems have added “sentiment analysis” as one of the features of their systems. The phrase “sentiment analysis” usually appears cheek-by-jowl with “customer relationship management,” “predictive analytics,” and “business intelligence.” My view is that the early text analysis vendors, such as the TREC participants in the early 2000s, recognized that keyword indexing was not useful for certain types of information retrieval tasks. Go back and look at the suggestions for the benefit of sentiment functions within natural language processing, and you will see that the idea is a good one, but it has taken a decade or more to become a buzzword. (See, for example, Y. Wilks and M. Stevenson, “The Grammar of Sense: Using Part-of-Speech Tags as a First Step in Semantic Disambiguation,” Journal of Natural Language Engineering, 1998, Number 4, pages 135–144.)
One of the hurdles to sentiment analysis has been the need to add yet another complex function, with a significant computational cost, to existing systems. In an uncertain economic environment, additional expenses are scrutinized. Not surprisingly, organizations which understand the value of sentiment analysis and want to be in step with the data implications of the shift to mobile devices need a solution which works well and is affordable.
Fortunately, Bitext has stepped forward with a semantic analysis program that focuses on complementing and enriching existing systems rather than replacing them. This is bad news for some of the traditional text analysis vendors and for enterprise search vendors whose programs often require a complete overhaul or replacement of existing enterprise applications.
I recently saw a demonstration of Bitext’s local sentiment system that highlights some of the integrative features of the application. The demonstration walked me through an online service which delivered an opinion and sentiment snap-in, together with topic categorization. The “snap-in,” or cloud-based, approach eliminates much of the resource burden imposed by other companies’ approaches, and this information can be easily integrated with any local app or review site.
The Bitext system, however, goes beyond what I call basic sentiment. The company’s approach processes content from user-generated reviews as well as more traditional data such as information in a CRM solution or a database of agent notes, as it does with the Salesforce Marketing Cloud. One important step forward for Bitext’s system is its inclusion of trends analysis. Another is its “local sentiment” function, coupled with categorization. Local sentiment means that when I am in a city looking for a restaurant, I can display the locations and consumers’ assessments of nearby dining establishments. While a standard review consists of 10 or 20 lines of text and an overall star scoring, Bitext can add to that precisely which topics are touched on in the review, with associated sentiments. For a simple review like “the food was excellent but the service was not that good,” Bitext will return two topics and two valuations (food, positive +3; service, negative -1).
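For readers who want to picture that output, here is a toy Python sketch of aspect-level scoring of the kind described above. The keyword lists, phrase scores, and clause splitting are illustrative assumptions on my part; they are not Bitext’s technology.

import re

# Illustrative keyword lists and scores; not Bitext's lexicons.
ASPECT_KEYWORDS = {
    "food": {"food", "meal", "dish"},
    "service": {"service", "waiter", "staff"},
}
SENTIMENT_PHRASES = {"not that good": -1, "excellent": 3, "good": 1, "awful": -3}

def aspect_sentiment(review):
    """Split a review into clauses and score each aspect it mentions."""
    clauses = re.split(r"\bbut\b|\band\b", review.lower())
    scores = {}
    for clause in clauses:
        for aspect, keywords in ASPECT_KEYWORDS.items():
            if any(k in clause for k in keywords):
                # Match longer phrases first so "not that good" wins over "good".
                for phrase, value in sorted(SENTIMENT_PHRASES.items(),
                                            key=lambda kv: -len(kv[0])):
                    if phrase in clause:
                        scores[aspect] = value
                        break
    return scores

print(aspect_sentiment("The food was excellent but the service was not that good"))
# -> {'food': 3, 'service': -1}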
A tap displays a detailed list of opinions, positive and negative. This list is automatically generated on the fly. The Bitext addition includes a “local sentiment score” for each restaurant identified on the map. The screenshot below shows how location-based data and publicly accessible reviews are presented.
Bitext’s system can be used to provide deep insight into consumer opinions and developing trends over a range of consumer activities. The system can aggregate ratings and complex opinions on shopping experiences, events, restaurants, or any other local issue. Bitext’s system can enrich reviews from such sources as Yelp, TripAdvisor, Epinions, and others in a multilingual environment.
Bitext boasts social media savvy. The system can process content from Twitter, Google+ Local, FourSquare, Bing Maps, and Yahoo! Local, among others, and easily integrates with any of these applications.
The system can also rate products, customer service representatives, and other organizational concerns. Data processed by the Bitext system includes enterprise data sources, such as contact center transcripts or customer surveys, as well as web content.
In my view, the Bitext approach goes well beyond the three stars or two dollar signs approach of some systems. Bitext can evaluate topics or “aspects.” The system can generate opinions for each topic or facet in the content stream. Furthermore, Bitext’s use of natural language provides qualitative information and insight about each topic, revealing a more accurate understanding of specific consumer needs that purely quantitative rating systems lack. Unlike other systems I have reviewed, Bitext presents an easy-to-understand and easy-to-use way to get a sense of what users really have to say, and in multiple languages, not just English!
For those interested in analytics, the Bitext system can identify trending “places” and topics with a click.
Stephen E Arnold, May 29, 2013
Sponsored by Augmentext
April 29, 2013
I read “Connotate Announces Record Quarter, Driven by Market Demand for More Rapid Delivery of Web Content-Based Products and Services.” The main point is that Connotate is growing revenues and that growth is a result of “market demand.”
Market demand, according to wiseGEEK, means:
the total amount of purchases of a product or family of products within a specified demographic. The demographic may be based on factors such as age or gender, or involve the total amount of sales that are generated in a particular geographic location.
The market for the Connotate solution is broadening as the cost of deployment continues to decrease. “We’ve mastered the entire workflow stack from manual processes all the way to high-end automation,” said Mulholland. “This has resulted in significant growth for us in the SMB market, while we continue to increase the value that our large enterprise customers are receiving. For example, a major UK-based job board publisher was tasked with achieving a 40 percent increase in the number of listings in a very short period of time. Connotate’s automated solution is helping them meet this aggressive goal.”
Does this mean that other companies in Connotate’s competitive sphere will experience similar upsides? The anecdotal information available to me suggests that some of the companies competing directly with Connotate are having trouble closing commercial deals which generate a profit for the vendor. Examples range from specialists in pure analytics to providers of business information visualization systems and metatagging outfits.
My hunch is that if Connotate is experiencing the financial gains attributed to this privately held company, other factors must be in play. What are those factors? Why are so many of Connotate’s competitors struggling to hit their numbers? Why are some investors slowing down their commitment to back some of Connotate’s rivals?
Perhaps market demand does not float the boats in this particular body of water? Worth monitoring: the actions of big name investors with a fondness for this sector like Silver Lake, some of the companies which continue to receive injections of cash in the hopes of hitting the big time, and the peregrinations of executives who jump from one content processing outfit to another.
I am assuming that the financial data referenced in the write up are accurate.
Stephen E Arnold, April 29, 2013
Sponsored by Augmentext