Text Processing Made Simple

September 17, 2013

Nothing involving text seems simple: lines of words go on for miles, often with improper punctuation or none at all. All that text needs to be cataloged, organized, and tagged, but no one really wants the task. That is why “TextBlob: Simplified Text Processing” was born. What exactly is TextBlob? Here is the description straight from TextBlob’s homepage:

“TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, translation, and more.”

TextBlob is available as a free download and has its own GitHub following. When it comes to installing the library, be aware that it relies on NLTK and pattern.en. Its many features include part-of-speech tagging, JSON serialization, word and phrase frequencies, n-grams, word inflection, tokenization, language translation and detection, noun phrase extraction, and sentiment analysis.
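For a sense of how little code the library demands, here is a short sketch based on TextBlob’s documented quick start (install with pip install textblob and download its corpora first; the sample text is our own):

    from textblob import TextBlob

    blob = TextBlob("TextBlob makes text processing simple. "
                    "The library handles tagging and sentiment analysis.")

    print(blob.tags)          # part-of-speech tags, e.g. ('TextBlob', 'NNP')
    print(blob.noun_phrases)  # extracted noun phrases
    print(blob.sentiment)     # Sentiment(polarity=..., subjectivity=...)

    # Per-sentence sentiment polarity, from -1.0 (negative) to 1.0 (positive)
    for sentence in blob.sentences:
        print(sentence.sentiment.polarity)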

After downloading TextBlob, users will find that the Web site offers a comprehensive quick start guide explaining how to implement the library and get the best use out of it. Free libraries are what make the open source community go around and improve ease of use for all users. If you use TextBlob, be sure to share any libraries of your own.

Whitney Grace, September 17, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

SharePoint Search: An Open Source Widget

September 15, 2013

If you have SharePoint responsibilities, you know how fabulous Microsoft’s Swiss Army knife solution is. Let me explain. The “fabulousness” applies to consultants, integrators, and “experts” who can make the rusty blade cut better than it does once the system is installed.

I learned about “SharePoint 2013 Search Query Tool” from one of the ArnoldIT SharePoint experts. You can download the tool to test and debug search queries against the SharePoint 2013 REST API; a sketch of such a query appears after the list below. The tool does not help improve either the system or the user queries, but I find this software interesting for three reasons:

  1. After years of Microsoft innovation, there are still issues with getting relevant results. Ergo the open source tool.
  2. SharePoint does not provide a native administrative function to perform this type of testing.
  3. Open source may be edging toward SharePoint. If the baby steps mature, will an open source snap-in pop into being to replace the wild and crazy Fast Search & Transfer technology?
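For the curious, here is a minimal sketch of the kind of REST query the tool exercises. The site URL and credentials are hypothetical, and the sketch assumes the requests and requests_ntlm Python packages for Windows authentication; it is an illustration, not the tool’s own code.

    import requests
    from requests_ntlm import HttpNtlmAuth

    site = "http://sharepoint.example.com/sites/team"  # hypothetical site

    # SharePoint 2013 exposes search at /_api/search/query; the query
    # text must be wrapped in single quotes.
    resp = requests.get(
        f"{site}/_api/search/query",
        params={"querytext": "'annual report'"},
        headers={"Accept": "application/json;odata=verbose"},
        auth=HttpNtlmAuth("DOMAIN\\user", "password"),
    )
    resp.raise_for_status()

    # Walk the result table the search service returns.
    rows = (resp.json()["d"]["query"]["PrimaryQueryResult"]
            ["RelevantResults"]["Table"]["Rows"]["results"])
    for row in rows:
        cells = {c["Key"]: c["Value"] for c in row["Cells"]["results"]}
        print(cells.get("Title"), "->", cells.get("Path"))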

Stephen E Arnold, one of the world’s leading experts in information retrieval, said:

Fast Search is on a technical par with SharePoint. The idea that two flawed systems can cope with changing user needs, Big Data, and unexpected system interactions is what makes SharePoint software which boosts costs. Change may be forced on Microsoft, and without warning.

Worth thinking about and checking out the free widget.

Stuart Schram, September 15, 2013

Anonymizing Writing Style

September 11, 2013

Author J.K. Rowling recently learned firsthand how sophisticated analytics software has become. It was a linguistic analysis of the text in The Cuckoo’s Calling that unmasked her as the popular crime novel’s author, “Robert Galbraith.” (These tools were originally devised to combat plagiarism.) Now, I Programmer tells us in “Anonymouth Hides Identity,” open-source software is being crafted to foil such tools and give writers “stylometric anonymity.”

Whether a wordsmith just wants to enjoy a long-lost sense of anonymity, as the wildly successful author of the Harry Potter series attempted to do, or has more high-stakes reasons to hide behind a pen name, a team from Drexel University has the answer. The students from the school’s Privacy, Security, and Automation Lab (PSAL) just captured the Andreas Pfitzmann Best Student Paper Award at this year’s Privacy Enhancing Technologies Symposium for their paper on the subject. The article reveals:

The idea behind Anonymouth is that stylometry can be a threat in situations where individuals want to ensure their privacy while continuing to interact with others over the Internet. A presentation about the program cites two hypothetical scenarios:

*Alice the Anonymous Blogger vs. Bob the Abusive Employer

*Anonymous Forum vs. Oppressive Government…

The JStylo-Anonymouth (JSAN) framework is a work in progress at PSAL under the supervision of assistant professor of computer science Dr. Rachel Greenstadt. It consists of two parts:

*JStylo – an authorship attribution framework, used as the underlying feature extractor employing a set of linguistic features

*Anonymouth – an authorship evasion (anonymization) framework, which suggests the changes that need to be made.
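To make “feature extraction” concrete, here is a toy sketch of the kind of measurements an authorship attribution tool collects. This is illustrative only; JStylo’s real feature set is far richer than average word length and a handful of function-word rates.

    import re

    FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is"]

    def style_features(text):
        # Crude stylometric fingerprint: average word length plus the
        # relative frequency of a few common function words.
        words = re.findall(r"[a-z']+", text.lower())
        total = len(words) or 1
        features = {"avg_word_length": sum(len(w) for w in words) / total}
        for fw in FUNCTION_WORDS:
            features["freq_" + fw] = words.count(fw) / total
        return features

    print(style_features("The owl delivered a letter to the boy in the cupboard."))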

The admittedly very small study discussed in the paper found that 80 percent of participants were able to produce anonymous documents “to a limited extent.” It also found certain constraints; it was more difficult to anonymize existing documents than new creations, for example. Still, this is an interesting development, and I am sure we will see more efforts in this direction.

Cynthia Murrell, September 11, 2013

Sponsored by ArnoldIT.com, developer of Augmentext

Text Analytics and Semantic Processing Fuel New Web Paradigm

August 27, 2013

Often, we look more specifically at various apps and applications that address search needs. Sometimes, it is refreshing to find articles that take a step back and look at the overall paradigm shifts guiding the feature updates and new technology releases flooding the media. Forbes reports on the big picture in “NetAppVoice: How The Semantic Web Changes Everything. Again!”

Evolving out of the last big buzzword, big data, the semantic Web is now ubiquitous. Starting at the beginning, the article explains what semantic search allows people to do. A user can search for terms that retrieve results that go beyond keywords: through metadata and other semantic technologies, associations between related concepts are created.

According to the article, hyperconnectivity is the goal; it is what will let semantic search deliver the promised meaningful insights:

For example, if we could somehow acquire all of the world’s knowledge, it wouldn’t make us smarter. It would just make us more knowledgeable. That’s exactly how search worked before semantics came along. In order for us to become smarter, we somehow need to understand the meaning of information. To do that we need to be able to forge connections in all this data, to see how each piece of knowledge relates to every other. In the semantic Web, we users provide the connections, through our social media activity. The patterns that emerge, the sentiment in the interactions—comments, shares, tweets, Likes, etc.—allow a very precise, detailed picture to emerge.

Enterprise organizations are in a unique position to achieve this hyperconnectivity, and they also have a growing list of technological solutions to help break down silos and promote safe and secure data access for appropriate users. For example, the text analytics and semantic processing behind the Cogito Intelligence API enhance the ability to decipher meaning and insights from a multitude of content sources, including social media and unstructured corporate data.

Megan Feil, August 27, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Oracle Focuses On New Full Text Query

August 26, 2013

Despite enterprise companies moving away from SQL databases toward the more robust NoSQL options, Oracle has updated its database with new features, including an XQuery Full Text search. We found an article that examines how the new function will affect Oracle and where it seems to point. The article from the Amis Technology Blog, “Oracle Database 12c: XQuery Full Text,” explains that XQuery Full Text search was made to handle unstructured XML content. It does so by extending the XQuery XMLDB language, finally making Oracle capable of working with all types of XML. The rest of the article focuses on the XQuery code.
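As a rough sketch of what such a query looks like from Python, consider the following. The table and column layout, connection string, and cx_Oracle driver are our assumptions, not the Amis authors’ code; only the XQuery “contains text” clause itself is the new 12c feature.

    import cx_Oracle

    conn = cx_Oracle.connect("scott", "tiger", "localhost/orcl")
    cur = conn.cursor()

    # "contains text" is the XQuery Full Text extension; per the article,
    # it is only fast once the matching XQuery Full-Text Index exists.
    cur.execute("""
        SELECT XMLCast(
                 XMLQuery('for $p in /page[revision/text contains text "hockey"]
                           return $p/title'
                          PASSING d.object_value RETURNING CONTENT)
                 AS VARCHAR2(400))
        FROM wiki_pages d
    """)
    for (title,) in cur:
        if title:
            print(title)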

When the new feature was tested on Wikipedia content stored as XML, the results were positive:

“During tests it proved very fast on English Wikipedia content (10++ Gb) and delivered the results within less than a second. But such a statement will only be picked up very efficiently if the new, introduced in 12c, corresponding Oracle XQuery Full-Text Index has been created.”

Oracle is trying to improve its technology as more of its users switch over to NoSQL databases. Improving the search function along with other features keeps Oracle in the competition and proves that relational tables still have some kick in them. Interestingly enough, Oracle appears to be focusing its energies on MarkLogic’s technology to keep in the race.

Whitney Grace, August 26, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

How Forensic Linguistics Helped Unmask Rowling

August 23, 2013

By now most have heard that J.K. Rowling, famous for her astoundingly successful Harry Potter books, has been revealed as the author of the well-received crime novel “The Cuckoo’s Calling.” Time spoke to one of the analysts who discovered that author Robert Galbraith was actually Rowling and shares what it learned in “J.K. Rowling’s Secret: a Forensic Linguist Explains how He Figured it Out.”

It started with a tip. Richard Brooks, editor of the British “Sunday Times,” received a mysterious tweet claiming that “Robert Galbraith” was a pen name for Rowling. Before taking the claim to the book’s publisher, Brooks called on Patrick Juola of Duquesne University to linguistically compare “The Cuckoo’s Calling” with the Potter books. Juola has had years of experience with forensic linguistics, specifically authorship attribution. Journalist Lily Rothman writes:

“The science is more frequently applied in legal cases, such as with wills of questionable origin, but it works with literature too. (Another school of forensic linguistics puts an emphasis on impressions and style, but Juola says he’s always worried that people using that approach will just find whatever they’re looking for.)

“But couldn’t an author trying to disguise herself just use different words? It’s not so easy, Juola explains. Word length, for example, is something the author might think to change — sure, some people are more prone to ‘utilize sesquipedalian lexical items,’ he jokes, but that can change with their audiences. What the author won’t think to change are the short words, the articles and prepositions. Juola asked me where a fork goes relative to a plate; I answered ‘on the left’ and wouldn’t ever think to change that, but another person might say ‘to the left’ or ‘on the left side.'”

One tool Juola uses is the free Java Graphical Authorship Attribution Program. After taking out rare words, names, and plot points, the software calculates the hundred most-used words from an author under consideration. Though a correlation does not conclusively prove that two authors are the same person, it can certainly help make the case. “Sunday Times” reporters took their findings to Galbraith’s/Rowling’s publisher, who confirmed the connection. Though Rowling has said that using the pen name was liberating, she (and her favorite charities) may be happy with the over 500,000 percent increase in “Cuckoo’s Calling” sales since her identity was uncovered.
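As a rough, hypothetical illustration of the technique (our own toy sketch, not Juola’s software; the file names are made up), the comparison boils down to ranking each text’s most common words and measuring how similar the two frequency distributions are:

    from collections import Counter
    import math
    import re

    def top_word_freqs(text, n=100):
        # Relative frequencies of the n most common words in a text.
        words = re.findall(r"[a-z']+", text.lower())
        total = len(words) or 1
        return {w: c / total for w, c in Counter(words).most_common(n)}

    def cosine_similarity(a, b):
        # Cosine similarity between two word-frequency vectors.
        dot = sum(a[w] * b.get(w, 0.0) for w in a)
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b)

    known = open("rowling_sample.txt").read()       # known Rowling text
    disputed = open("galbraith_sample.txt").read()  # disputed text
    score = cosine_similarity(top_word_freqs(known), top_word_freqs(disputed))
    print(f"similarity: {score:.3f}")  # closer to 1.0 means more alike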

The article notes that, though folks have been statistically analyzing text since the 1800s, our turn to e-books may make for a sharp increase in such revelations. Before that development, the process was slow even with computers, since textual analysis had to be preceded by the manual entry of texts via keyboard. Now, though, importing an entire tome is a snap. Rowling may just be the last famous author to enjoy the anonymity of a pen name, even for just a few months.

Cynthia Murrell, August 23, 2013

Sponsored by ArnoldIT.com, developer of Augmentext

Search and Null: Not Good News for Some

August 3, 2013

I read “How Can I Pass the String ‘Null’ through WSDL (SOAP)…” My hunch is that only a handful of folks will dig into this issue. Most senior managers buy the baloney generated by search and content processing vendors. Yesterday I reviewed, for one of the outfits publishing my “real” (for fee) columns, a slide deck stuffed full of “all’s” and “every’s”. The message was that this particular modern system, which boasted a hefty price tag, could do just about anything one wanted with flows of content.

Happily overlooked was the problem of a person with a wonky name. Case in point: “Null”. The link from Hacker News to the Stack Overflow item gathered a couple of hundred comments. You can find these here. If you are involved in one of the next-generation, super-wonderful content processing systems, you may find a few minutes with the comments interesting and possibly helpful.

My scan of the comments plus the code in the “How Can I” post underscored the disconnect between what people believe a system can do and what a here-and-now system can actually do. Marketers say one thing, buyers believe another, and the installed software does something completely different.

Examples:

  1. A person’s name—in this case “Null”—cannot be located in a search system. (A sketch of how this happens appears after this list.) With all the hoo-hah about Fancy Dan systems, is this issue with a named entity important? I think it is, because it means that certain entities may not be findable without expensive, time-consuming human curation and indexing. Oh, oh.
  2. Non-English names pose additional problems. Migrating a name in one language into a string that a native speaker of a different language can understand introduces errors. Instead of finding one person, the system finds multiple people. Looking for a batch of 50 people, each incorrectly identified during processing, generates a lot of names, which guarantees more work for expensive humans or many, many false drops. Operate this type of entity extraction system a number of times and one generates so much work there is not enough money or people to figure out what’s what. Oh, oh.
  3. Validating named entities requires considerable work. Knowledgebases today are built automatically and on the fly. Rules are no longer created by humans. Systems, like some of Google’s “janitor” technology, figure out the rules themselves, and then “workers” modify those rules on the fly. So what happens when errors are introduced via “rules”? The system keeps on truckin’. Anyone who has worked through fixing up the known tags from a smart system like Autonomy IDOL knows that degradation can set in when the training set does not represent the actual content flow. Any wonder why precision and recall scores have not improved much in the last 20 years? Oh, oh.
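To make the first point concrete, here is a minimal, hypothetical illustration of how a legitimate surname of “Null” can vanish before it ever reaches an index. This is not the code from the Stack Overflow thread; it simply shows the sentinel-value shortcut that causes the trouble.

    import json

    def normalize(value):
        # A common but dangerous shortcut: treat the *string* "null"
        # as a missing value during content processing.
        if value is None or str(value).strip().lower() == "null":
            return None
        return value

    record = {"first_name": "Jennifer", "last_name": "Null"}
    cleaned = {k: normalize(v) for k, v in record.items()}
    print(json.dumps(cleaned))
    # {"first_name": "Jennifer", "last_name": null}
    # The surname is gone; the indexer never sees it, and Ms. Null
    # becomes unfindable in the search system.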

I think this item about “Null” highlights the very real and important problems with assumptions about automated content processing, whether the corpus is a telephone directory with a handful of names or the mind-boggling flows which stream from various content channels.

Buying a system does not solve long-standing, complicated problems in text processing. Fast talk like that which appears in some of the Search Wizards Speak interviews does not change the false drop problem.

So what does this mean for vendors of Fancy Dan systems? Ignorance on the part of buyers is one reason why deals may close. What does this mean for users of systems which generate false drops and, in turn, reports which are off base? Ignorance on the part of users makes it easy to use “good enough” information to make important decisions.

Interesting, Null?

Stephen E Arnold, August 3, 2013

Sponsored by Xenky

Text Analytics Pros Daily

July 17, 2013

We have found a new resource: The Text Analytics Pros Daily, an aggregation tailored to data analysts, is now available at the content curation site Paper.li. The page pulls in analytics-related news arranged under topics like Business, Technology, Education, and the Environment. Our question: what are the criteria for “pros”?

For those unfamiliar with Paper.li, it is a platform that allows users to create their own aggregations by choosing their sources and customizing their page. Their description specifies:

“The key to a great newspaper is a great newsroom. The Paper.li platform gives you access to an ever-expanding universe of articles, blog posts, and rich media content. Paper.li automatically processes more than 250 million social media posts per day, extracting & analyzing over 25 million articles. Only Paper.li lets you tap into this powerful media flow to find exactly what you need, and publish it easily on your own online newspaper.”

As I peruse the page, I see many articles from a wide variety of sources: Marketwired, the Conference Board, Marketing Cloud, and assorted tech bloggers. There is a lot of information here; it is worth checking out for any content analytics pros (or hobbyists).

Cynthia Murrell, July 17, 2013

Sponsored by ArnoldIT.com, developer of Augmentext

01Business and Search

July 4, 2013

Take a look at the article about Sinequa. Just run a query in the next few days at www.01net.com. The story presents some interesting information.

Stephen E Arnold, July 4, 2013

Sponsored by Xenky, the portal to ArnoldIT where you can find the world’s largest collection of first-person explanations of enterprise search

Watson Draws Attention

June 1, 2013

From a game show win to being inundated with “Watson pitches”, IBM is doing its best to make Watson more successful than Sherlock Holmes. I read “IBM Inundated with Watson Pitches as It Prepares to Offer Service to Developers.” The headline certainly suggests that for search and content processing, Watson is going like gangbusters.

At the last two search and content processing shows I attended, I heard nothing about Watson. I suppose that specialist conferences are not the place for IBM, which has larger designs on the market. However, there were some developers on the programs at these conferences, and I don’t recall hearing a direct reference to IBM. I think I mentioned Hewlett Packard once.

The write up seeks to set me straight on the powerful pull IBM Watson is exerting on those involved in building search related applications:

IBM is receiving hundreds of ideas from developers wanting to use its Watson supercomputing technology, which will be made available to anyone wanting to build applications on top of its capabilities.

The information comes from John Gordon, who is IBM’s vice president for Watson Solutions. I associate him with the phrase “data is the new oil,” but I have mixed up which “expert” drew the word picture in my mind. A biographical profile of Mr. Gordon is available on Yatedo. According to an item appearing on the University of Texas’ Web site here, he “is the director of Strategy and Product Management for IBM’s new Watson Solutions Division. He is responsible for developing end-to-end business models for transforming the innovations created by IBM Watson into a strategic set of industry solutions.” Another University of Texas Web page here pointed out:

Prior to this [Watson] role Gordon held a number of executive strategy, market management, and business development positions within IBM. He joined the IT industry more than 17 years ago and has consistently helped global clients enhance their performance and results by leveraging innovative technology.  John holds undergraduate degrees in Philosophy and Computer Applications from the University of Notre Dame and has an M.B.A. from The University of Texas at Austin.  Additionally, John is a certified SOA architect with a foundational certification in the IT Infrastructure Library (ITIL) standards, and is a co-author of a Harvard Business School case on value-in-use solution pricing.

I think I know the magnitude of the developer stampede. Programmers are really into public relations and MBA analyses.

Stephen E Arnold, June 1, 2013

Sponsored by Xenky
