Anonymizing Writing Style
September 11, 2013
Author J.K. Rowling recently learned firsthand how sophisticated analytics software has become. It was a linguistic analysis of the text of The Cuckoo’s Calling that unmasked her as the popular crime novel’s author, “Robert Galbraith.” (These tools were originally devised to combat plagiarism.) Now, I Programmer tells us in “Anonymouth Hides Identity,” open-source software is being crafted to foil such tools and give writers “stylometric anonymity.”
Whether a wordsmith just wants to enjoy a long-lost sense of anonymity, as the wildly successful author of the Harry Potter series attempted to do, or has more high-stakes reasons to hide behind a pen name, a team from Drexel University has the answer. The students from the school’s Privacy, Security, and Automation Lab (PSAL) just captured the Andreas Pfitzmann Best Student Paper Award at this year’s Privacy Enhancing Technologies Symposium for their paper on the subject. The article reveals:
The idea behind Anonymouth is that stylometry can be a threat in situations where individuals want to ensure their privacy while continuing to interact with others over the Internet. A presentation about the program cites two hypothetical scenarios:
*Alice the Anonymous Blogger vs. Bob the Abusive Employer
*Anonymous Forum vs. Oppressive Government. . . .
The JStylo-Anonymouth (JSAN) framework is work in progress at PSAL under the supervision of assistant professor of computer science, Dr. Rachel Greenstadt. It consists of two parts:
*JStylo – authorship attribution framework, used as the underlying feature extraction employing a set of linguistic features
*Anonymouth – authorship evasion (anonymization) framework, which suggests changes that need to be made.
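To make the two-part split concrete, here is a minimal sketch in Python. It is not the JStylo-Anonymouth code or its algorithm: the three features, the function names `style_features` and `suggest_changes`, and the idea of nudging a document toward the average of a "crowd" of other writers are all assumptions chosen for illustration.

```python
import re


def style_features(text):
    """Compute three simple style features (an illustrative, assumed set)."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words) or 1  # avoid division by zero on empty input
    return {
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "avg_sentence_length": len(words) / (len(sentences) or 1),
        "commas_per_word": text.count(",") / n_words,
    }


def suggest_changes(document, crowd_texts):
    """Suggest a direction per feature that moves the document toward
    the crowd average. crowd_texts must contain at least one text."""
    doc = style_features(document)
    crowd = [style_features(t) for t in crowd_texts]
    suggestions = {}
    for name, value in doc.items():
        target = sum(c[name] for c in crowd) / len(crowd)
        if value > target:
            suggestions[name] = "decrease"
        elif value < target:
            suggestions[name] = "increase"
        else:
            suggestions[name] = "keep"
    return suggestions
```

The first function plays the feature-extraction role and the second the suggestion role. A real framework would use a far richer feature set and a trained classifier to judge whether the changes actually worked.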
The admittedly very small study discussed in the paper found that 80 percent of participants were able to produce anonymous documents “to a limited extent.” It also found certain constraints: it was more difficult to anonymize existing documents than new creations, for example. Still, this is an interesting development, and I am sure we will see more efforts in this direction.
Cynthia Murrell, September 11, 2013
Sponsored by ArnoldIT.com, developer of Augmentext
Text Analytics and Semantic Processing Fuel New Web Paradigm
August 27, 2013
Often, we look more specifically at various apps and applications that address search needs. Sometimes, it is refreshing to find articles that take a step back and look at the overall paradigm shifts guiding the feature updates and new technology releases flooding the media. Forbes reports on the big picture in “NetAppVoice: How The Semantic Web Changes Everything. Again!”
Evolving out of the last big buzzword, big data, the semantic Web is now ubiquitous. Starting at the beginning, the article explains what semantic search allows people to do. A user can search for terms and retrieve results that go beyond keywords: through metadata and other semantic technologies, associations between related concepts are created.
According to the article, hyperconnectivity is the goal if semantic search is to deliver the meaningful insights it promises:
For example, if we could somehow acquire all of the world’s knowledge, it wouldn’t make us smarter. It would just make us more knowledgeable. That’s exactly how search worked before semantics came along. In order for us to become smarter, we somehow need to understand the meaning of information. To do that we need to be able to forge connections in all this data, to see how each piece of knowledge relates to every other. In the semantic Web, we users provide the connections, through our social media activity. The patterns that emerge, the sentiment in the interactions—comments, shares, tweets, Likes, etc.—allow a very precise, detailed picture to emerge.
Enterprise organizations are in a unique position to achieve this hyperconnectivity, and they also have a growing list of technological solutions to help break down silos and promote safe and secure data access to appropriate users. For example, the text analytics and semantic processing in the Cogito Intelligence API enhance the ability to decipher meaning and insights from a multitude of content sources, including social media and unstructured corporate data.
Megan Feil, August 27, 2013
Sponsored by ArnoldIT.com, developer of Beyond Search
How Forensic Linguistics Helped Unmask Rowling
August 23, 2013
By now most have heard that J.K. Rowling, famous for her astoundingly successful Harry Potter books, has been revealed as the author of the well-received crime novel “The Cuckoo’s Calling.” Time spoke to one of the analysts who discovered that author Robert Galbraith was actually Rowling, and shares what it learned in “J.K. Rowling’s Secret: a Forensic Linguist Explains how He Figured it Out.”
It started with a tip. Richard Brooks, editor of the British “Sunday Times,” received a mysterious tweet claiming that “Robert Galbraith” was a pen name for Rowling. Before taking the claim to the book’s publisher, Brooks called on Patrick Juola of Duquesne University to linguistically compare “The Cuckoo’s Calling” with the Potter books. Juola has had years of experience with forensic linguistics, specifically authorship attribution. Journalist Lily Rothman writes:
“The science is more frequently applied in legal cases, such as with wills of questionable origin, but it works with literature too. (Another school of forensic linguistics puts an emphasis on impressions and style, but Juola says he’s always worried that people using that approach will just find whatever they’re looking for.)
“But couldn’t an author trying to disguise herself just use different words? It’s not so easy, Juola explains. Word length, for example, is something the author might think to change — sure, some people are more prone to ‘utilize sesquipedalian lexical items,’ he jokes, but that can change with their audiences. What the author won’t think to change are the short words, the articles and prepositions. Juola asked me where a fork goes relative to a plate; I answered ‘on the left’ and wouldn’t ever think to change that, but another person might say ‘to the left’ or ‘on the left side.'”
One tool Juola uses is the free Java Graphical Authorship Attribution Program. After taking out rare words, names, and plot points, the software calculates the hundred most-used words from an author under consideration. Though a correlation does not conclusively prove that two authors are the same person, it can certainly help make the case. “Sunday Times” reporters took their findings to Galbraith’s (that is, Rowling’s) publisher, who confirmed the connection. Though Rowling has said that using the pen name was liberating, she (and her favorite charities) may be happy with the over 500,000 percent increase in “Cuckoo’s Calling” sales since her identity was uncovered.
The article notes that, though folks have been statistically analyzing text since the 1800s, our turn to e-books may make for a sharp increase in such revelations. Before that development, the process was slow even with computers, since textual analysis had to be preceded by the manual entry of texts via keyboard. Now, though, importing an entire tome is a snap. Rowling may just be the last famous author to enjoy the anonymity of a pen name, even for just a few months.
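For the curious, the function-word idea can be sketched in a few lines of Python. This is not how the Java Graphical Authorship Attribution Program is implemented. The ten-word list, the cosine measure, and the function names are assumptions picked to keep the illustration small; a real analysis would use far longer word lists and much more text.

```python
import math
import re
from collections import Counter

# A tiny, hand-picked list of short function words (an assumption;
# real tools use far longer lists).
FUNCTION_WORDS = ["the", "a", "an", "of", "to", "on", "in", "and", "but", "that"]


def function_word_profile(text):
    """Return the relative frequency of each function word in text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar_author(unknown_text, known_texts):
    """Pick the candidate whose function-word profile is closest."""
    target = function_word_profile(unknown_text)
    scores = {
        author: cosine_similarity(target, function_word_profile(sample))
        for author, sample in known_texts.items()
    }
    return max(scores, key=scores.get), scores
```

As the article cautions, a high score is a correlation, not proof. It narrows the field; it does not close the case.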
Cynthia Murrell, August 23, 2013
Search and Null: Not Good News for Some
August 3, 2013
I read “How Can I Pass the String ‘Null’ through WSDL (SOAP)…” My hunch is that only a handful of folks will dig into this issue. Most senior managers buy the baloney generated by search and content processing. Yesterday I reviewed for one of the outfits publishing my “real” (for fee) columns a slide deck stuffed full of “all’s” and “every’s”. The message was that this particular modern system, which boasted a hefty price tag, could do just about anything one wanted with flows of content.
Happily overlooked was the problem of a person with a wonky name. Case in point: “Null”. The link from Hacker News to the Stackoverflow item gathered a couple of hundred comments. You can find these here. If you are involved in one of the next-generation, super-wonderful content processing systems, you may find a few minutes with the comments interesting and possibly helpful.
My scan of the comments plus the code in the “How Can I” post underscored the disconnect between what people believe a system can do and what a here-and-now system can actually do. Marketers say one thing, buyers believe another, and the installed software does something completely different.
Examples:
- A person’s name—in this case ‘Null’—cannot be located in a search system. With all the hoo-hah about Fancy Dan systems, is this issue with a named entity important? I think it is because it means that certain entities may not be findable without expensive, time-consuming human curation and indexing. Oh, oh.
- Non English names pose additional problems. Migrating a name in one language into a string that a native speaker of a different language can understand introduces some problems. Instead of finding one person, the system finds multiple people. Looking for a batch of 50 people each incorrectly identified during processing generates a lot of names which guarantees more work for expensive humans or many, many false drops. Operate this type of entity extraction system a number of times and one generates so much work there is not enough money or people to figure out what’s what. Oh, oh.
- Validating named entities requires considerable work. Knowledgebases today are “built automatically and on-the-fly.” Rules are no longer created by humans. Systems, like some of Google’s “janitor” technology, figure out the rules themselves, and then “workers” modify those rules on-the-fly. So what happens when errors are introduced via “rules”? The system keeps on truckin’. Anyone who has worked through fixing up the known tags from a smart system like Autonomy IDOL knows that degradation can set in when the training set does not represent the actual content flow. Any wonder why precision and recall scores have not improved too much in the last 20 years? Oh, oh.
I think this item about “Null” highlights the very real and important problems with assumptions about automated content processing, whether the corpus is a telephone directory with a handful of names or the mind-boggling flows which stream from various content channels.
Buying a system does not solve long-standing, complicated problems in text processing. Fast talk like that which appears in some of the Search Wizards Speak interviews does not change the false drop problem.
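The first example in the list is easy to reproduce. Here is a minimal Python sketch, not the code from the “How Can I” post: the function names and the three sample records are hypothetical, and the bug shown (treating the text “null” as a missing value) is one assumed way a pipeline can lose a person named Null.

```python
def naive_parse(value):
    """Buggy: treats any spelling of the text 'null' as a missing value."""
    if value is None or value.strip().lower() in ("", "null"):
        return None
    return value


def careful_parse(value):
    """Only a true None means 'missing'; the text 'Null' is kept as data."""
    return value  # no sentinel strings: absence is signalled by None alone


def build_index(records, parse):
    """Map each parsed surname (lowercased) to the record ids that hold it."""
    index = {}
    for record_id, surname in records:
        key = parse(surname)
        if key is None:
            continue  # the record is silently left out of the index
        index.setdefault(key.lower(), []).append(record_id)
    return index


# Hypothetical sample data: record 2 belongs to a person named Null,
# record 3 genuinely has no surname.
records = [(1, "Smith"), (2, "Null"), (3, None)]
naive_index = build_index(records, naive_parse)
careful_index = build_index(records, careful_parse)
```

With the naive parser, a search for “null” finds nothing and no error is raised. With the careful one, record 2 is findable. The silent part is what should worry buyers.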
So what does this mean for vendors of Fancy Dan systems? Ignorance on the part of buyers is one reason why deals may close. What does this mean for users of systems which generate false drops and dependent reports which are off base? Ignorance on the part of users makes it easy to use “good enough” information to make important decisions.
Interesting, Null?
Stephen E Arnold, August 3, 2013
Sponsored by Xenky
Text Analytics Pros Daily
July 17, 2013
We have found a new resource: an aggregation tailored to data analysts, The Text Analytics Pros Daily, is now available at the content curation site Paper.li. The page pulls in analytics-related news arranged under topics like Business, Technology, Education, and the Environment. Our question: what are their criteria for “pros”?
For those unfamiliar with Paper.li, it is a platform that allows users to create their own aggregations by choosing their sources and customizing their page. Their description specifies:
“The key to a great newspaper is a great newsroom. The Paper.li platform gives you access to an ever-expanding universe of articles, blog posts, and rich media content. Paper.li automatically processes more than 250 million social media posts per day, extracting & analyzing over 25 million articles. Only Paper.li lets you tap into this powerful media flow to find exactly what you need, and publish it easily on your own online newspaper.”
As I peruse the page, I see many articles from a wide variety of sources: Marketwired, the Conference Board, Marketing Cloud, and assorted tech bloggers. There is a lot of information here; it is worth checking out for any content analytics pros (or hobbyists).
Cynthia Murrell, July 17, 2013
Bitext Delivers a Breakthrough in Localized Sentiment Analysis
May 29, 2013
Identifying user sentiment has become one of the most powerful analytic tools provided by text processing companies, and Bitext’s integrative software approach is making sentiment analysis available to companies seeking to capitalize on its benefits while avoiding burdensome implementation costs. A few years ago, Lexalytics merged with Infonics. Since that time, Lexalytics has been marketing aggressively to position the company as one of the leaders in sentiment analysis. Exalead also offered sentiment analysis functionality several years ago. I recall a demonstration which generated a report showing how those writing reviews of a restaurant expressed their satisfaction.
Today vendors of enterprise search systems have added “sentiment analysis” as one of the features of their systems. The phrase “sentiment analysis” usually appears cheek-by-jowl with “customer relationship management,” “predictive analytics,” and “business intelligence.” My view is that the early text analysis vendors such as TREC participants in the early 2000s recognized that key word indexing was not useful for certain types of information retrieval tasks. Go back and look at the suggestions for the benefit of sentiment functions within natural language processing, and you will see that the idea is a good one but it has taken a decade or more to become a buzzword. (See, for example, Y. Wilks and M. Stevenson, “The Grammar of Sense: Using Part-of-Speech Tags as a First Step in Semantic Disambiguation,” Journal of Natural Language Engineering, 1998, Number 4, pages 135–144.)
One of the hurdles to sentiment analysis has been the need to add yet another complex function which has a significant computational cost to existing systems. In an uncertain economic environment, additional expenses are looked at with scrutiny. Not surprisingly, organizations which understand the value of sentiment analysis and want to be in step with the data implications of the shift to mobile devices want a solution which works well and is affordable.
Fortunately Bitext has stepped forward with a semantic analysis program that focuses on complementing and enriching systems, rather than replacing them. This is bad news for some of the traditional text analysis vendors and for enterprise search vendors whose programs often require a complete overhaul or replacement of existing enterprise applications.
I recently saw a demonstration of Bitext’s local sentiment system that highlights some of the integrative features of the application. The demonstration walked me through an online service which delivered an opinion and sentiment snap in, together with topic categorization. The “snap in” or cloud based approach eliminates much of the resource burden imposed by other companies’ approaches, and this information can be easily integrated with any local app or review site.
The Bitext system, however, goes beyond what I call basic sentiment. The company’s approach processes content from user generated reviews as well as more traditional data such as information in a CRM solution or a database of agent notes, as they do with the Salesforce marketing cloud. One important step forward for Bitext’s system is its inclusion of trends analysis. Another is its “local sentiment” function, coupled with categorization. Local sentiment means that when I am in a city looking for a restaurant, I can display the locations and consumers’ assessments of nearby dining establishments. While a standard review consists of 10 or 20 lines of text and an overall star score, Bitext can add to that precisely which topics the review touches on, with the associated sentiments. For a simple review like “the food was excellent but the service was not that good,” Bitext will return two topics and two valuations (food, positive +3; service, negative -1).
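To show what topic-level scoring looks like in practice, here is a toy Python sketch. It is not Bitext’s method, which relies on proper linguistic analysis. The word lists, the weights, and the rule that a negated clause flips sign and halves strength are assumptions chosen so the sample review above comes out the same way.

```python
import re

# Toy lexicons; the words and weights are illustrative assumptions.
ASPECTS = {"food", "service", "price"}
SENTIMENT = {"excellent": 3, "good": 2, "bad": -2, "terrible": -3}
NEGATORS = {"not", "never", "no"}


def score_clause(words):
    """Sum sentiment weights; a negated clause flips sign and halves strength."""
    score = sum(SENTIMENT.get(w, 0) for w in words)
    if any(w in NEGATORS for w in words):
        score = -score / 2
    return score


def aspect_sentiment(review):
    """Return {aspect: score} for each clause that mentions a known aspect."""
    results = {}
    # Split into clauses on "but", "and", and common punctuation.
    for clause in re.split(r"\bbut\b|\band\b|[,.;!?]", review.lower()):
        words = re.findall(r"[a-z']+", clause)
        for aspect in ASPECTS.intersection(words):
            results[aspect] = score_clause(words)
    return results
```

Calling `aspect_sentiment("the food was excellent but the service was not that good")` scores food at 3 and service at -1.0. A lexicon this small breaks on sarcasm, long-range negation, and anything multilingual, which is where a commercial system has to earn its keep.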
A tap displays a detailed list of opinions, positive and negative. This list is automatically generated on the fly. The Bitext addition includes a “local sentiment score” for each restaurant identified on the map. The screenshot below shows how location-based data and publicly accessible reviews are presented.
Bitext’s system can be used to provide deep insight into consumer opinions and developing trends over a range of consumer activities. The system can aggregate ratings and complex opinions on shopping experiences, events, restaurants, or any other local issue. Bitext’s system can enrich reviews from such sources as Yelp, TripAdvisor, Epinions, and others in a multilingual environment.
Bitext boasts social media savvy. The system can process content from Twitter, Google+ Local, FourSquare, Bing Maps, and Yahoo! Local, among others, and easily integrates with any of these applications.
The system can also rate products, customer service representatives, and other organizational concerns. Data processed by the Bitext system includes enterprise data sources, such as contact center transcripts or customer surveys, as well as web content.
In my view, the Bitext approach goes well beyond the three stars or two dollar signs approach of some systems. Bitext can evaluate topics or “aspects”. The system can generate opinions for each topic or facet in the content stream. Furthermore, Bitext’s use of natural language provides qualitative information and insight about each topic, revealing a more accurate understanding of specific consumer needs that purely quantitative rating systems lack. Unlike other systems I have reviewed, Bitext presents an easy to understand and easy to use way to get a sense of what users really have to say, and in multiple languages, not just English!
For those interested in analytics, the Bitext system can identify trending “places” and topics with a click.
Stephen E Arnold, May 29, 2013
Sponsored by Augmentext
Connotate: Private Company Toot Toot
April 29, 2013
I read “Connotate Announces Record Quarter, Driven by Market Demand for More Rapid Delivery of Web Content-Based Products and Services.” The main point is that Connotate is growing revenues and that growth is a result of “market demand.”
Market demand, according to wiseGeek, means:
the total amount of purchases of a product or family of products within a specified demographic. The demographic may be based on factors such as age or gender, or involve the total amount of sales that are generated in a particular geographic location.
I learned:
The market for the Connotate solution is broadening as the cost of deployment continues to decrease. “We’ve mastered the entire workflow stack from manual processes all the way to high-end automation,” said Mulholland. “This has resulted in significant growth for us in the SMB market, while we continue to increase the value that our large enterprise customers are receiving. For example, a major UK-based job board publisher was tasked with achieving a 40 percent increase in the number of listings in a very short period of time. Connotate’s automated solution is helping them meet this aggressive goal.”
Does this mean that other companies in Connotate’s competitive sphere will experience similar upsides? The anecdotal information available to me suggests that some of the companies competing directly with Connotate are having trouble closing commercial deals which generate a profit for the vendor. Examples include specialists in pure analytics, providers of business information visualization systems, and metatagging outfits.
My hunch is that if Connotate is experiencing the financial gains attributed to this privately held company, other factors must be in play. What are those factors? Why are so many of Connotate’s competitors struggling to hit their numbers? Why are some investors slowing down their commitment to back some of Connotate’s rivals?
Perhaps market demand does not float the boats in this particular body of water? Worth monitoring: the actions of big name investors with a fondness for this sector like Silver Lake, some of the companies which continue to receive injections of cash in the hopes of hitting the big time, and the peregrinations of executives who jump from one content processing outfit to another.
I am assuming that the financial data referenced in the write up are accurate.
Stephen E Arnold, April 29, 2013
Attensity: Evolving and Repositioning Again
April 29, 2013
I met David Bean years ago. He was explaining “deep extraction” to me at a now defunct search engine conference. I recall that he had a number of US government clients. I noted in my analysis of the company that the firm wanted to break into non government markets.
I made sure that one of my team captured news releases about Attensity. When I checked my files to update my Attensity profile, I noted that the company had done a merger with a couple of German outfits, was pushing into sentiment analysis, and was beating the text analytics drum.
In one sense, Attensity was following the same path as Stratify, which as you probably know was Purple Yogi. Hewlett Packard now owns Stratify, and I don’t hear too much about how its journey from government work to the wide world of non government work has worked out. Purple Yogi, now Hewlett Packard Autonomy Stratify, is doing legal stuff … I think. If I understand the write up by a high intellect consultant expert, Attensity is speedboating into customer support.
Can market niches like customer support, eDiscovery, and business intelligence keep some vendors afloat?
Two different markets but one common goal: Diversify in order to generate big revenues.
I read “Attensity Uses Social Media Technology for Smarter Customer Engagement.” On the surface, the story is a good one and it is earnestly told:
Its product Respond uses natural language-based analysis to derive insights from any form of text-based data and among other results can produce analyses of customer sentiment, hot issues, trends and key metrics. The product supports what Attensity calls LARA – listen, analyze, relate, act – which is a form of closed-loop performance management. It begins by extracting data from multiple sources of text-based data, (listening), analyzing the content of the data (analyze), linking this data with other sources of customer data, and producing alerts, workflows and reports to encourage action to be taken based on the insights (act).
Familiar stuff. Text processing, outputs, and payoffs for the licensees.
Attensity, founded in 2000, that’s 13 years ago, is no spring chicken. I learned from the write up:
Attensity has also made some technical improvements to the product. The architecture now supports multitenancy and automatic load balancing, which are especially useful in handling very large volumes of tweets. Reporting has been enhanced to include more visualization options, trend analysis, emerging hot issues, and process and performance analysis.
My thought is that many firms which flourished with the once generous assistance of the US government now have to find a way to generate top line revenue, sustainable growth, and profits.
In the present financial environment, text processing companies are flocking to specific problem areas in organizations. Customer support (a bit of an oxymoron in my opinion), eDiscovery, and business intelligence (not as amusing as military intelligence in my opinion) now are well served sectors.
The companies looking for software and systems to make sense of data, cut costs, gain a competitive advantage, or some other benefit much favored by MBAs have not found a magic carpet ride.
The noise from vendors is increasing. The time required to find and close a deal is increasing. Some customers are looking high and low for a solution which is “good enough”. Management turnover, frequent repositionings, and familiar marketing lingo by themselves may not be enough to keep the many firms competing in these “hot niches” afloat.
Stephen E Arnold, April 29, 2013
New Tool Integrates with Text Analytics
March 21, 2013
Language and analytics are starting a new trend by coming together. According to the DestinationCRM.com article “New SDL Machine Translation Tool Integrates with Text Analytics,” SDL has announced that its machine translation tool can now be integrated to work with text analytics solutions. SDL BeGlobal can translate both structured and unstructured information across more than 80 different language combinations. The information is then analyzed using text analytics solutions. This gives users the ability to access global customer insights as well as important business trends. Jean-Francois Damais, Deputy Managing Director of loyalty global clients solutions at Ipsos, had the following to say regarding SDL BeGlobal:
“With the growth in global business and the accessibility of online information, we now have a much greater need to access and analyze data from multiple languages. As a company focused on innovation and dedicated to our clients’ successes, we deployed SDL BeGlobal machine translation to further improve our research insights and bring new value to our customers.”
SDL BeGlobal has already caught on with several companies in the text analytics industry, and several well known companies have jumped on the bandwagon. Raytheon BBN Technologies currently uses the technology for broadcast and Web content monitoring, and Expert Systems uses it for semantic intelligence. Language and analytics are two things that are not normally thought of together, but it seems like SDL BeGlobal has a good thing going. Only time will tell if the new friendship between language and analytics will stand the test of time.
April Holmes, March 21, 2013
Semantria Adds Value to Unstructured Data With Sentiment Analysis
March 19, 2013
We are constantly on the lookout for movers and shakers in the area of text analysis and sentiment analysis. So, I was intrigued when I recently came across the Web site of Semantria, a company claiming that its software makes text and sentiment analysis fast and easy. With its claims of simplified costs and high-value capture, I had to research further.
The company was founded in 2011 as a software-as-a-service and services company, specializing in cloud-based text and sentiment analysis. The team boasts a foundation from text analytics provider Lexalytics, software development firm Postindustria, and demand generation consultancy DemandGen.
The company page explains how its software can give insight into unstructured content:
“Semantria’s API helps organizations to extract meaning from large amounts of unstructured text. The value of the content can only be accessed if you see the trends and sentiments that are hidden within. Add sophisticated text analytics and sentiment analysis to your application: turn your unstructured content into actionable data.”
The Semantria API is powered by the Lexalytics Salience 5 analytics engine and is fully REST compliant. A processing demo is available at https://semantria.com/demo. We think it is well worth a look.
Andrea Hayden, March 19, 2013