The IHS Invention Machine: US 8,666,730
July 31, 2014
I am not an attorney. I consider this a positive. I am not a PhD with credentials as impressive as those of Vladimir Igorevich Arnold, my distant relative. He worked with Andrey Kolmogorov, who was able to hike in some bare essentials AND do math at the same time. Kolmogorov and Arnold—both interesting, if idiosyncratic, guys. Hiking in the wilderness with some students, anyone?
Now to the matter at hand. Last night I sat down with a copy of US 8,666,730 B2 (hereinafter “730”), filed in an early form in 2009, long before Information Handling Services (IHS) wrote a check to the owners of The Invention Machine.
The title of the system and method is “Question Answering System and Method Based on Semantic Labeling of Text Documents and User Questions.” You can get your very own copy at www.uspto.gov. (Be sure to check out the search tips; otherwise, you might get a migraine dealing with the search system. I heard that technology was provided by a Canadian vendor, which seems oddly appropriate if true. The US government moves in elegant, sophisticated ways.)
Well, 730 contains some interesting information. If you want to ferret out more details, I suggest you track down a friendly patent attorney and work through the 23-page document word by word.
My analysis is that of a curious old person residing in rural Kentucky. My advisors are the old fellows who hang out at the local bistro, Chez Mine Drainage. You will want to keep this in mind as I comment on this patent by James Todhunter (Framingham, Mass.), Igor Sovpel (Minsk, Belarus), and Dzianis Pastanohau (Minsk, Belarus). Mr. Todhunter is described as “a seasoned innovator and inventor.” He was the Executive Vice President and Chief Technology Officer for Invention Machine. See http://bit.ly/1o8fmiJ, LinkedIn at (if you are lucky) http://linkd.in/1ACEhR0, and this YouTube video at http://bit.ly/1k94RMy. Igor Sovpel, co-inventor of 730, has racked up some interesting inventions. See http://bit.ly/1qrTvkL. Mr. Pastanohau was on the 730 team, and he also helped invent US 8,583,422 B2, “System and Method for Automatic Semantic Labeling of Natural Language Texts.”
The question answering invention is explained this way:
A question-answering system for searching exact answers in text documents provided in the electronic or digital form to questions formulated by user in the natural language is based on automatic semantic labeling of text documents and user questions. The system performs semantic labeling with the help of markers in terms of basic knowledge types, their components and attributes, in terms of question types from the predefined classifier for target words, and in terms of components of possible answers. A matching procedure makes use of mentioned types of semantic labels to determine exact answers to questions and present them to the user in the form of fragments of sentences or a newly synthesized phrase in the natural language. Users can independently add new types of questions to the system classifier and develop required linguistic patterns for the system linguistic knowledge base.
The idea, as I understand it, is that I can craft a question without worrying about special operators like AND or field labels like CC=. Presumably I can submit this type of question to a search system based on 730 and its related inventions like the automatic indexing in 422.
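To make the mechanism concrete, here is a minimal sketch, in Python, of how I read the claims. The class, the toy corpus, and the matcher are my inventions for illustration; the patent describes a far richer linguistic knowledge base and matching procedure.

```python
# A minimal sketch of the 730 idea as I read it: label document sentences
# and user questions with Subject-Action-Object (SAO) triples, then match
# labels instead of keywords. All names here are mine, not the inventors'.
from dataclasses import dataclass

@dataclass(frozen=True)
class SAO:
    subject: str
    action: str
    obj: str

# Pretend index: each sentence is stored with its SAO label.
labeled_corpus = [
    ("The pump moves coolant through the reactor loop.",
     SAO("pump", "move", "coolant")),
    ("A valve regulates pressure in the supply line.",
     SAO("valve", "regulate", "pressure")),
]

def answer(question: SAO) -> list[str]:
    """Return sentence fragments whose labels fill the question's unknown slot."""
    return [
        sentence for sentence, label in labeled_corpus
        if label.action == question.action and label.obj == question.obj
    ]

# "What moves coolant?" The subject is the unknown slot.
print(answer(SAO(subject="?", action="move", obj="coolant")))
```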
The references cited for this 2009-or-earlier invention are impressive. I recognized Mr. Todhunter’s name, that of a person from Carnegie Mellon, and one of the wizards behind the tagging system in use at SAS, the statistics outfit loved by graduate students everywhere. There were also a number of references to Dr. Liz Liddy, Syracuse University. I associated her with the mid-to-late 1990s system marketed then as DR LINK (Document Retrieval Linguistic Knowledge). I have never been comfortable with the notion of “knowledge” because it seems to require that subject matter experts and other specialists update, edit, and perform various processes to keep the “knowledge” from degrading into a ball of statistical fuzz. When someone complains that a search system using Bayesian methods returns off-point results, I look for the humans who are supposed to perform “training,” updates, remapping, and other synonyms for “fixing up the dictionaries.” You may have other experiences, which I assume are positive and have garnered you rapid promotion for your search system competence. For me, maintaining knowledge bases usually leads to lots of hard work, unanticipated expenses, and the customary termination of a scapegoat responsible for the search system.
I am never sure how to interpret extensive listings of prior art. Since I am not qualified to figure out if a citation is germane, I will leave it to you to wade through the full page of US patents, foreign patent documents, and other publications. Who wants to question the work of the primary examiner and the Faegre Baker Daniels “attorney, agent, or firm” tackling 730?
On to the claims. The patent lists 28 claims. Many of them refer to operations within the world of what the inventors call expanded Subject-Action-Object, or eSAO. The idea is that the system figures out parts of speech, looks up stuff in various knowledge bases and automatically generated indexes, and presents the answer to the user’s question. The lingo of the patent is sufficiently broad to allow the system to accommodate an automated query in a way that reminded me of Ramanathan Guha’s massive semantic system. I cover some of Dr. Guha’s work in my now out-of-print monograph, Google Version 2.0, published by one of the specialist publishers that perform Schubmehl-like maneuvers.
My first pass through 730’s claims left me with a sense of déjà vu, which is obviously not correct. The invention has been awarded the status of a “patent”; therefore, the invention is novel. Nevertheless, these concepts pecked away at me with the repetitiveness of the woodpecker outside my window this morning:
- Automatic semantic labeling, which I interpreted as automatic indexing
- Natural language processing, which I understand to suggest that the user takes the time to write a question that is neither too broad nor too narrow. Like the children’s story, the query is “just right.”
- Assembly of bits and chunks of indexed documents into an answer. For me, the idea is that the system does not generate a list of hits that are probably germane to the query; it delivers an answer. That is the Holy Grail of search: handing the often lazy, busy, or clueless user an answer. Google does this for mobile users by looking at a particular user’s behavior and the clusters to which the user belongs in the eyes of Google math, and just displaying the location of the pizza joint or the fact that a parking garage at the airport has an empty space.
- The system figures out parts of speech, various relationships, and who-does-what-to-whom. (A toy sketch of this extraction follows this list.) Parts-of-speech tagging has been around for a while, and it works as long as the text processed is not in the argot of a specialist group plotting some activity in a favela in Rio.
- The system performs the “e” function. I interpreted the “e” to mean a variant of synonym expansion. DR LINK, for example, was able in 1998 to process the phrase “white house” and display content relevant to presidential activities. I don’t recall how this expansion worked its way from bound phrase to presidential to Clinton. I do recall that DR LINK had what might be characterized as a healthy appetite for computing resources to perform its expansions during indexing and during query processing. This stuff is symmetrical: what happens to source content has to happen during query processing in some way.
- Relevance ranking takes place. Various methods are in use by search and content processing vendors. Some are based on statistical methods. Others are based on numerical recipes that the developer knows can be computed within the limits of the computer systems available today. No P=NP proofs, please. This is search.
- There are linguistic patterns. When I read about linguistic patterns, I recall the wild and crazy linguistic methods of Delphes, for example. Linguistic expertise is in demand today, and specialist vendors like Bitext in Madrid, Spain, are busy. English, Chinese, and Russian are widely used languages, but darned useful information is available in other languages, many of which are kept fresh via neologisms and slang. I often asked my intelligence community audiences, “What does teddy bear mean?” The answer is NOT a child’s toy. The clue is the price tag suggested on sites like eBay.
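As flagged in the list, here is a toy illustration, in Python, of two of those steps: crude part-of-speech-driven extraction and query-time synonym expansion. The NOUN-VERB-NOUN pattern and the synonym table are stand-ins I invented; the 730 system relies on linguistic patterns from its own knowledge base.

```python
# Toy versions of two steps from the list above: POS-driven SAO extraction
# and query-time synonym expansion. Patterns and synonyms are my stand-ins.
SYNONYMS = {"white house": ["president", "executive branch"]}

def expand(query_terms):
    """Add synonyms for any bound phrase we recognize in the query."""
    expanded = list(query_terms)
    phrase = " ".join(query_terms).lower()
    expanded.extend(SYNONYMS.get(phrase, []))
    return expanded

def extract_sao(tagged_sentence):
    """Pull a crude (subject, action, object) from (word, POS-tag) pairs.

    Assumes a simple NOUN .. VERB .. NOUN order; real systems do far more.
    """
    nouns = [w for w, tag in tagged_sentence if tag == "NOUN"]
    verbs = [w for w, tag in tagged_sentence if tag == "VERB"]
    if len(nouns) >= 2 and verbs:
        return (nouns[0], verbs[0], nouns[1])
    return None

tagged = [("valve", "NOUN"), ("regulates", "VERB"), ("pressure", "NOUN")]
print(extract_sao(tagged))         # ('valve', 'regulates', 'pressure')
print(expand(["white", "house"]))  # ['white', 'house', 'president', 'executive branch']
```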
The interesting angle in 730 is the causal relationship. When applied to processes in the knowledge bases, I can see how a group of patents can be searched for a process. The result list could display ways to accomplish a task. NOTting out patents for which a royalty is required leaves the searcher with systems and methods that can be used, ideally without any hassles from attorneys or licensing agents.
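A hedged sketch of that NOTting maneuver, with a record layout and royalty flag I invented for illustration:

```python
# Filter a result list down to royalty-free patents ("NOTting out" the rest).
# The records and the royalty_required flag are hypothetical.
results = [
    {"patent": "US 8,666,730", "royalty_required": True},
    {"patent": "US 0,000,001", "royalty_required": False},  # placeholder number
]

usable = [r for r in results if not r["royalty_required"]]
for r in usable:
    print(r["patent"])  # systems and methods the searcher could use freely
```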
Several questions popped into my mind as I reviewed the claims. Let me highlight three of these:
First, consider the computational load when large numbers of new documents and changed content have to be processed. The indexes have to be updated. For small domains of content, like 50,000 technical reports created by an engineering company, I think the system will zip along like a 2014 Volkswagen Golf.
(Block diagram source: US 8,666,730, Figure 1)
When terabytes of content arrive every minute, the functions set forth in the block diagram for 730 have to be appropriately resourced. (For me, “appropriately resourced” means lots of bandwidth, storage, and computational horsepower.)
Second, the knowledge base, as I thought about it when I first read the patent, has to be kept in tip-top shape. For scientific, technical, and medical content, this is a more manageable task. However, when processing intercepts in slang-filled Pashto, there is a bit more work required. In general, high volumes of non-technical lingo become a bottleneck. The bottleneck can be resolved, but none of the solutions are likely to make a budget-conscious senior manager enjoy his lunch. In fact, the problem of processing large flows of textual content is acute. Shortcuts are put in place, and few of those in the know understand the impact of trimming on the results of a query. Don’t ask. Don’t tell. Good advice when digging into certain types of content processing systems.
Third, the reference to databases begs this question: “What is the amount of storage required to reduce index latency to less than 10 seconds for new and changed content?” Another question: “What gap exists, for a user asking a mission-critical question, between new and changed content and the indexes against which the mission-critical query is passed?” This is not system response time, which as I recall for DR LINK-era systems was measured in minutes. The user sends a query to the system. The new or changed information is not yet in the index. The user makes a decision (big or small, significant or insignificant) based on incomplete, incorrect, or stale information. No big problem if one is researching a competitor’s new product. Big problem when trying to figure out what missile capability exists now in a region of conflict.
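To put a number on that gap, here is a back-of-the-envelope sketch. Both rates are my assumptions, not figures from the patent or from IHS:

```python
# If content arrives faster than the indexer can label it, the pool of
# invisible (unindexed) documents grows linearly. Both rates are assumed.
arrival_docs_per_sec = 8_000   # assumed inflow of new and changed content
index_docs_per_sec = 5_000     # assumed semantic-labeling throughput
backlog_per_sec = arrival_docs_per_sec - index_docs_per_sec

hours = 1
invisible = backlog_per_sec * 3_600 * hours
print(f"After {hours} hour(s), {invisible:,} documents cannot match any query")
```

At these made-up rates, more than ten million documents would be invisible to a mission-critical query after a single hour.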
My interest is enterprise search. IHS, a professional publishing company that is in the business of licensing access to its for-fee data, seems to be moving into the enterprise search market. (See http://bit.ly/1o4FyL3.) My researchers (an unreliable bunch of goslings) and I will be monitoring the success of IHS. Questions of interest to me include:
- What is the fully loaded first-year cost of the IHS enterprise search solution? For on-premises installations? For cloud-based deployment? For content acquisition? For optimization? For training?
- How will the IHS system handle flows of real-time content into its content processing system? What is the load time for 100 terabytes of text content with an average document size of 50 Kb? (A back-of-the-envelope sketch follows this list.) What happens to attachments, images, engineering drawings, and videos embedded in the stream as native files or as links to external servers?
- What is the response time for a user’s query? How does the user modify a query in a manner so that result sets are brought more in line with what the user thought he was requesting?
- How do answers make use of visual outputs which are becoming increasingly popular in search systems from Palantir, Recorded Future, and similar providers?
- How easy is it to scale content processing and index refreshing to keep pace with the doubling of content every six to eight weeks that is becoming increasingly commonplace for industrial-strength enterprise search systems? How much reengineering is required for log-scale jumps in content flows and user queries?
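As promised in the list, the load-time arithmetic for the 100-terabyte bullet. The throughput is my assumption, not an IHS specification, and I read the bullet’s “50 Kb” as 50 kilobytes:

```python
# Back-of-the-envelope load time for the figures in the list above:
# 100 TB corpus, 50 KB average document. Throughput is assumed.
corpus_bytes = 100 * 10**12            # 100 TB
avg_doc_bytes = 50 * 10**3             # 50 KB
docs = corpus_bytes // avg_doc_bytes   # 2,000,000,000 documents

throughput = 5_000                     # assumed docs/sec through the pipeline
days = docs / throughput / 86_400
print(f"{docs:,} documents, about {days:.1f} days at {throughput:,} docs/sec")
# -> 2,000,000,000 documents, about 4.6 days at 5,000 docs/sec
```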
Take a look at 730 and others in the Invention Machine (IHS) patent family. My hunch is that if IHS is looking for a big-bucks return from enterprise search sales, IHS may find that its narrow margins will be subjected to increased stress. Enterprise search never has been, and is not now, a license to print money. When a search system does pump out hundreds of millions in revenue, it seems that some folks are skeptical. Autonomy and Fast Search & Transfer are companies with some useful lessons for those who want a digital Klondike.
Microsoft and Its Don Quixotesque Deep Learning
July 15, 2014
I know that artificial intelligence, predictive analytics, digital brains, and the other next-generation technologies will revolutionize search and then everything else. I will be pushing up daisies when the future arrives.
I just read “Microsoft Research Reveals Project Adam, A New Deep-Learning System That Trumps Google Brain.” Maybe the future is here. Even though I have a Windows Lumia 925, I don’t have the Windows 8.1 update. I can’t control the colors of the nifty yet obscure icons. I can’t reliably access my email. Yep, the future of Project Adam awaits me.
Wired Magazine, on the other hand, is excited:
Microsoft even brought dogs on to the stage and demoed the system in which the mobile camera recognized the dog breed when pointed at the dog.
I need to do this with my mobile phone.
Here’s another passage I liked:
Lee believes Adam could be part of what he calls an “ultimate machine intelligence,” something that could function in ways that are closer to how we humans handle different types of modalities—like speech, vision, and text—all at once. The road to that kind of technology is long—people have been working towards it since the 50s—but we’re certainly getting closer.
Yep, closer. I am looking forward to news releases and demonstrations from Microsoft’s many competitors. For me, I will be happy with a 925 upgrade that improves the rocket science of getting an email message.
Stephen E Arnold, July 15, 2014
Striking a Balance Between Human and Machine
June 19, 2013
Math is on the march, as computer algorithms are poised to take over everything from driving cars to granting parole to inmates. Aeon takes a thoughtful look at the phenomenon in “Slaves to the Algorithm.” The article asks whether the trajectory will leave a place for human judgment. Writer Steven Poole ponders:
“What lies behind our current rush to automate everything we can imagine? Perhaps it is an idea that has leaked out into the general culture from cognitive science and psychology over the past half-century — that our brains are imperfect computers. If so, surely replacing them with actual computers can have nothing but benefits.”
I recommend checking out the article. It is full of developments that have already taken place (some startling), as well as strong indicators of where we are headed with our increasingly algorithm-managed world. The central issue is the degree to which we will allow the math to take over. Example questions include: How would, and should, an automated driver react to an out-of-control school bus full of kids? Or to what extent do automatic content filters hamper free speech? Will news aggregators continue to narrow our personal windows onto the world? Issues that stand in the way of transparency in these decisions are also examined.
There is hope. Unsurprisingly, money is the motive for the first evidence Poole sees of a check to the algorithm’s power grab. Now that high-frequency trading algorithms have shown (more than once) that they have the power to bring the market to its knees, regulators are beginning to impose checks on the use of these tools. The article observes:
“Here, then, are the first ‘algorithmic auditors’. Perhaps their example will prompt similar developments in other fields — culture, education, and crime — that are considerably more difficult to quantify, even when there is no immediate cash peril.”
Perhaps, but I suspect it will take more to spur such “auditors” outside of finance. Meanwhile, now is the time to have a discussion about what we want our future to look like, and how much control we humans wish to retain.
Cynthia Murrell, June 19, 2013
Sponsored by ArnoldIT.com, developer of Augmentext