Autumn Approaches: Time for Realism about Search

September 1, 2014

Last week I had a conversation with a publisher who has a keen interest in software that “knows” what content means. Armed with that knowledge, a system can then answer questions.

The conversation was interesting. I mentioned my presentations for law enforcement and intelligence professionals about the limitations of modern and computationally expensive systems.

Several points crystallized in my mind. One of these is addressed, in part, in the estimator selection diagram from the scikit-learn project. Here's the diagram:

[Image: scikit-learn flowchart for choosing an estimator]

The diagram is designed to help a developer select from different methods of performing estimation operations. The author states:

Often the hardest part of solving a machine learning problem can be finding the right estimator for the job. Different estimators are better suited for different types of data and different problems. The flowchart below is designed to give users a bit of a rough guide on how to approach problems with regard to which estimators to try on your data.

First, notice that there is a selection process for choosing a particular numerical recipe. Now who determines which recipe is the right one? The answer is the coding chef. A human exercises judgment about the particular sequence of operations that will be used to fuel machine learning. Is that sequence of actions the best one, the expedient one, or the one that seems to work for the test data? The answer to these questions determines a key threshold for the resulting “learning system.” Stated another way, “Does the person licensing the system know if the numerical recipe is the most appropriate for the licensee’s data?” Nah. Does a mid-tier consulting firm like Gartner, IDC, or Forrester dig into this plumbing? Nah. Does it matter? Oh, yeah. As I point out in my lectures, the “accuracy” of a system’s output depends on this type of plumbing decision. Unlike a backed-up drain, flaws in smart systems may never be discerned. For certain operational decisions, financial shortfalls or the loss of an operations team in a war theater can be attributed to any one of many variables. As decision makers chase the Silver Bullet of smart, thinking software, who really questions the output in a slick graphic? In my experience, darned few people. That includes cheerleaders for smart software, azure chip consultants, and former middle school teachers looking for a job as a search consultant.
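To make the point concrete, here is a minimal sketch, entirely my own and not from any vendor's system, using the scikit-learn library the diagram comes from. The same data goes into two different numerical recipes, and two different accuracy figures come out. Nothing in the printout tells the licensee which recipe was the appropriate one.

    # A minimal sketch (my own illustration): two estimators, one dataset.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic stand-in for "the licensee's data."
    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Two different "numerical recipes" from the flowchart's menu.
    for estimator in (LogisticRegression(max_iter=1000), DecisionTreeClassifier()):
        estimator.fit(X_train, y_train)
        print(type(estimator).__name__, estimator.score(X_test, y_test))

The two scores will differ. Which recipe is “right” remains a judgment call made by the coding chef, not by the flowchart.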

Second, notice the reference to a “rough guide.” The real guide is an understanding of how specific numerical recipes work on a set of data that allegedly represents what the system will process when operational. Furthermore, there are plenty of mathematical methods available. The problem is that some of the more interesting procedures lead to increased computational cost. In the worst case, the more interesting procedures cannot be computed on available resources. Some developers know about P = NP and Big O. Others know to use the same nine or ten mathematical procedures taught in computer science classes. After all, why worry about math based on mereology if the machine resources cannot handle the computations within time and budget parameters? This means that most modern systems are based on a set of procedures that are computationally affordable, familiar, and convenient. Does this similarity of procedures matter? Yep. The generally squirrely outputs from many very popular systems are perceived as completely reliable. Unfortunately, the systems are performing within a narrow range of statistical confidence. Stated more harshly, the outputs are just not particularly helpful.
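The cost point can be illustrated with a toy example of my own devising. An exact all-pairs comparison of documents is O(n^2): double the corpus, quadruple the work. That arithmetic, not mathematical taste, is what pushes vendors toward the affordable recipes.

    # A toy O(n^2) procedure (my own sketch): compare every document vector
    # with every other one and keep the best dot product.
    import random
    import time

    def exact_all_pairs(vectors):
        best = 0.0
        for i in range(len(vectors)):
            for j in range(i + 1, len(vectors)):
                best = max(best, sum(a * b for a, b in zip(vectors[i], vectors[j])))
        return best

    for n in (400, 800):
        docs = [[random.random() for _ in range(50)] for _ in range(n)]
        start = time.perf_counter()
        exact_all_pairs(docs)
        print(f"n={n}: {time.perf_counter() - start:.2f}s")

    # Doubling n roughly quadruples the runtime. At Web scale, the
    # "interesting" procedure never finishes on time or on budget.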

In my conversation with the publisher, I asked several questions:

  1. Is there a smart system like Watson that you would rely upon to treat your teenaged daughter’s cancer? Or, would you prefer the human specialist at the Mayo Clinic or comparable institution?
  2. Is there a smart system that you want directing your only son in an operational mission in a conflict in a city under ISIS control? Or, would you prefer the human-guided decision near the theater about the mission?
  3. Is there a smart system you want managing your retirement funds in today’s uncertain economy? Or, would you prefer the recommendations of a certified financial planner relying on a variety of inputs, including analyses from specialists in whom your analyst has confidence?

When I asked these questions, the publisher looked uncomfortable. The reason is that the massive hyperbole and marketing craziness about fancy new systems creates what I call the Star Trek phenomenon. People watch Captain Kirk talking to devices, transporting himself from danger, and traveling between far-flung galaxies. Because a mobile phone performs some of the functions of the fictional communicator, it sure seems as if many other flashy sci-fi services should be available.

Well, this Star Trek phenomenon does help direct some research. But in terms of products that can be used in high risk environments, the sci-fi remains a fiction.

Believing and expecting are different from working with products that are limited by computational resources, expertise, and informed understanding of key factors.

Humans, particularly those who need money to pay the mortgage, ignore reality. The objective is to close a deal. When it comes to information retrieval and content processing, today’s systems are marginally better than those available five or ten years ago. In some cases, today’s systems are less useful.


I2E Semantic Enrichment Unveiled by Linguamatics

July 21, 2014

The article titled “Text Analytics Company Linguamatics Boosts Enterprise Search with Semantic Enrichment” on MarketWatch discusses the launch of I2E Semantic Enrichment from Linguamatics. The new release allows for the mining of a variety of texts, from scientific literature to patents to social media. It promises faster, more relevant search for users. The article states,

“Enterprise search engines consume this enriched metadata to provide a faster, more effective search for users. I2E uses natural language processing (NLP) technology to find concepts in the right context, combined with a range of other strategies including application of ontologies, taxonomies, thesauri, rule-based pattern matching and disambiguation based on context. This allows enterprise search engines to gain a better understanding of documents in order to provide a richer search experience and increase findability, which enables users to spend less time on search.”
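The general mechanism is easy to sketch. The toy code below is my own illustration, not Linguamatics’ pipeline: a small hypothetical ontology maps surface terms and their synonyms to concept identifiers, and the resulting metadata rides along with the document so a search engine can match concepts rather than literal strings.

    # A minimal sketch of ontology-based enrichment (illustrative only).
    import re

    # Hypothetical toy ontology: surface form -> canonical concept identifier.
    ONTOLOGY = {
        "aspirin": "CHEBI:15365",
        "acetylsalicylic acid": "CHEBI:15365",  # synonym, same concept
        "myocardial infarction": "MESH:D009203",
        "heart attack": "MESH:D009203",
    }

    def enrich(text):
        concepts = set()
        for term, concept_id in ONTOLOGY.items():
            if re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE):
                concepts.add(concept_id)
        return {"text": text, "concepts": sorted(concepts)}

    doc = enrich("Aspirin reduces the risk of a heart attack in some patients.")
    print(doc["concepts"])  # ['CHEBI:15365', 'MESH:D009203']

A query for “myocardial infarction” can now match this document through the shared concept identifier even though the phrase never appears in the text; that is the “increase findability” the quote promises.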

Whether it is semantics spun for search or search spun for semantics, Linguamatics has made its technology available to tens of thousands of enterprise search users. Linguamatics representative John M. Brimacombe was straightforward in his comments about the disappointment surrounding enterprise search, but optimistic about I2E. It is currently being used by many top organizations, as well as the Food and Drug Administration.

Chelsea Kerwin, July 21, 2014

Sponsored by ArnoldIT.com, developer of Augmentext

Information Manipulation: Accountability Pipe Dream

July 5, 2014

I read an article with what I think is the original title: “What does the Facebook Experiment Teach us? Growing Anxiety About Data Manipulation.” I noted that the title presented on Techmeme was “We Need to Hold All Companies Accountable, Not Just Facebook, for How They Manipulate People.” In my view, this mismatch of titles is a great illustration of information manipulation. I doubt that the writer of the improved headline is aware of the irony.

Information manipulation extends far beyond Facebook twirling the dials of its often breathless users. Navigate to Google and run this query:

cloud word processing

Note anything interesting in the results list displayed for me on my desktop computer:

[Image: Google results page for the query “cloud word processing”]

The number one ad is for Google. In the first page of results, Google’s cloud word processing system is listed three more times. I did not spot Microsoft Office in the cloud except in item eight: “Is Google Docs Making Microsoft Word Redundant?”

For most Google search users, the results are objective. No distortion evident.

Here’s what Yandex displays for the same query:

[Image: Yandex results page for the same query]

No Google word processing and no Microsoft word processing, whether in the cloud or elsewhere.

When it comes to searching for information, the notion that a Web indexing outfit is displaying objective results is silly. The Web indexing companies are in the forefront of distorting information and manipulating users.

Flash back to the first year of the Bush administration, when Richard Cheney was vice president. I was in a meeting where a request was considered: make sure the vice president’s office Web site appeared prominently in FirstGov.gov results. This, gentle reader, is a request that calls for hit boosting. The idea is to write a script or configure the indexing plumbing to make darned sure a specific URL or series of documents appears when and where required. No problem, of course. We created a stored query for the Fast Search & Transfer search system and delivered what the vice president wanted.
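For the curious, hit boosting can be reduced to a few lines. What follows is my own schematic, not the actual mechanism we used; the FirstGov.gov fix was a stored query in the Fast Search & Transfer system. The principle is the same: compute relevance scores honestly, then quietly add weight to the favored URL before the results page is rendered.

    # A schematic of hit boosting (my own illustration; URLs are hypothetical).
    BOOSTS = {"whitehouse.gov/vicepresident": 1000.0}  # the favored property

    def rerank(results):
        """results: list of (url, score) pairs from the 'objective' ranker."""
        boosted = [(url, score + BOOSTS.get(url, 0.0)) for url, score in results]
        return sorted(boosted, key=lambda pair: pair[1], reverse=True)

    hits = [("example.gov/page", 8.2), ("whitehouse.gov/vicepresident", 1.3)]
    print(rerank(hits)[0][0])  # the favored URL now tops the list

The user sees a clean results list. The plumbing that produced it is invisible.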

This type of results manipulation is more common than most people accept. Fiddling Web search, like shaping the flow of content along a particular semantic vector, is trivial. Search engine optimization is a fool’s game compared with the tried-and-true methods of weighting results or just buying real estate on a search results page from a “real” company.

The notion that disinformation, reformation, and misinformation will be identifiable, rectified, and used to hold companies accountable is not just impossible. The notion itself reveals how little awareness there is of how the actual methods of digital content injection work.

How much of the content on Facebook, Twitter, and other widely used social networks is generated by intelligence professionals, public relations “professionals,” and folks who want to be perceived as intellectual luminaries? Whatever your answer, what data do you have to back up your number? At a recent intelligence conference in Dubai, one specialist estimated that half of the traffic on social networks is shaped or generated by law enforcement and intelligence entities. Do you believe that? Probably not. So good for you.

Amusing, but as someone once told me, “Ignorance is bliss.” So, hello, happy idealists. The job is identifying, interpreting, and filtering. Tough, time consuming work. Most of the experts prefer to follow the path of least resistance and express shock that Facebook would toy with its users. Be outraged. Call for action. Invent an algorithm to detect information manipulation. Let me know how that works out when you look for a restaurant and it is not findable from your mobile device.

Stephen E Arnold, July 5, 2014

Elasticsearch: Bulldozing Content Processing

June 7, 2014

When I left the intelligence conference in Prague, there were a number of companies in my graphic about open source search. When I got off the airplane, I edited my slide. Looks to me as if Elasticsearch has just bulldozed the commercialized open source group in the search and content processing sector. I would not want to be the CEO of LucidWorks, Ikanow, or any other open sourcey search and content processing company this weekend.

I read “Elasticsearch Scores $70 Million to Help Sites Crunch Tons of Data Fast.” Forget the fact that Elasticsearch is built on Lucene and some home grown code. Ignore the grammar in “data fast.” Skip over the sports analogy “scores.” Dismiss the somewhat narrow definition of what Elasticsearch ELK can really deliver.

What’s important is the $70 million committed to Elasticsearch. Add that to the $30 or $40 million the outfit had obtained before, and we are looking at a roughly $100 million bet on an open source search based business. Compare this to the trifling $40 million the proprietary vendor Coveo had gathered or the $30 million put on LucidWorks to get into the derby.

I have been pointing out that Elasticsearch has demonstrated several advantages over its open source competitors; namely, developers, developers, and developers.

Now I want to point out that it has another angle of attack: money, money, and money.

With the silliness of the search and content processing vendors’ marketing over the last two years, I think we have the emergence of a centralizing company.

No, it’s not HP’s new cloudy Autonomy. No, it’s not the wonky Watson game and recipe code from IBM. No, it’s not the Google Search Appliance, although I do love the little yellow boxes.

I will be telling those who attend my lectures to go with Elasticsearch. That’s where the developers and the money are.

Stephen E Arnold, June 7, 2014

Watson: The Most Gifted Digital Chef Using Butternut Squash

May 30, 2014

Does silicon have taste buds? Do algorithms sniff the essence of Kentucky barbecue?

I read a darned amazing article called “I Tasted BBQ Sauce Made By IBM’s Watson, And Loved It.” The write up reports that IBM and partner Co.Design used open source software, home grown code, and a massive database to whip up a recipe for grilling. IBM is going whole hog with the billion dollar baby Watson, which is supposed to be one of IBM’s revenue fountains any day now.

According to the write up, which may or may not have the ingredients of a “real” news story:

Most BBQ sauces start with ingredients like vinegar, tomatoes, or even water, but IBM’s stands out from the get go. Ingredient one: White wine. Ingredient two: Butternut squash. The list contains more Eastern influences, such as rice vinegar, dates, cilantro, tamarind (a sour fruit you may know best from Pad Thai), cardamom (a floral seed integral to South Asian cuisine) and turmeric (the yellow powder that stained the skull-laden sets of True Detective) alongside American BBQ sauce mainstays molasses, garlic, and mustard.

And most important for the grillin’ fans in Harrod’s Creek, the author used the Watson concoction on tofu. I am not sure that the folks in Harrod’s Creek know what tofu is. I do know that the idea of creating a barbecue sauce without bourbon in it is a culinary faux pas. Splash tamarind on a couple of dead squirrels parked above the coals, and the friends of Daniel Boone may skin the offender and think about grillin’ something larger than a squirrel.

The author, who is scoring the tofu and broccoli treat, reports:

I test it again and again. Finally I just slather my plate in the stuff. It’s delicious–the best way I can describe it is as a Thai mustard sauce, or maybe the middle point between a BBQ sauce and a curry. Does that sound gross? I assure you that it isn’t…But as I mop my plate of the last drips of Bengali Butternut BBQ Sauce, contemplating the difference between a future in which computers addict us to the next Lean Cuisine and one where they attempt to eradicate us with Terminators, Napoleon’s old adage comes to mind: An army marches on its stomach. He–or that–who controls our stomachs controls it all.

Yes. From game show win to a tofu topping, IBM Watson is redefining search, corporate strategy, and the vocabulary of cuisine for tofu and broccoli lovers. Freshly killed, skinned, and grilled Kentucky squirrel may not benefit.

Anyone who suggests that vendors of information retrieval technology have lost their keen marketing edge is not in touch with butternut squash and reality. Should the digital chefs put Kentucky bourbon in Bengali Butternut BBQ Sauce? Myron Mixon, the winningest man in barbecue, may say, “That’s what I am talkin’ for my whole hog.” Could IBM sponsor the barbecue cook off program? Mr. Mixon may be a lover of tamarind and tofu too.

Stephen E Arnold, May 30, 2014

Watson on the Move: Cognea

May 20, 2014

I wanted to associate Cognos with Cognea. Two different things. IBM’s Watson unit, according to “IBM Watson Acquires Artificial Intelligence Startup Cognea,” is beefing up its artificial intelligence capabilities. Facebook, Google, and other outfits are embracing the dreams of artificial intelligence like it is 1981, when Marvin Weinberger was giving talks about AI’s revolutionizing information processing. I have lost track of Marvin, although I recall his impassioned polemics 30 years after hearing him lecture. Unfortunately I remain skeptical about “artificial intelligence” because Watson, as I understood the pitch after Jeopardy, was already super smart. I suppose Cognea can add some marketing credibility to Watson. That system is curing disease and performing wonders for the insurance industry, if I embrace the IBM public relations flow.

In my lectures about the Big O problem, I point out that many of today’s smartest systems (for example, Search2, to name one) implement clever methods to make well known numerical recipes run like a teenager who just gulped three cans of Jolt Cola followed by a Red Bull energy drink.

The reality is that there are more sophisticated mathematical tools available. The problem is that the systems available cannot exploit these algorithmic methods. I am pretty confident that Cognea tells a great story. I am even more confident that IBM will do the “Vivisimo” thing with whatever technology Cognea actually has. Without a concrete demo, benchmarks, and independent evaluations, I will remain skeptical about “a cognitive computing and conversational artificial intelligence platform.”

I am far more interested in the Cybertap technology that IBM acquired and seems to be keeping under wraps. Cybertap works. Artificial intelligence, well, it depends on how one defines “artificial” and “intelligence,” doesn’t it?

Stephen E Arnold, May 20, 2014

Trifles in Enterprise Search History

May 6, 2014

Search conferences are, in my experience, context free. The history of enterprise search is interesting and contains useful examples pertaining to findability. Stephen E Arnold’s new video is “Trifles from Enterprise Search History.” The eight-minute video reviews developments from the late 1970s and early 1980s. These mini snapshots provide information about where some of the hottest concepts today originated. Do you think MarkLogic invented an XML data management system that could do search and analytics? The correct answer may be Titan Search. What about “inventing” an open source search business model? Do you think Lucid Imagination, now LucidWorks, cooked up the concept of challenging proprietary systems with community created software? The correct answer may be Fulcrum Technologies’ early concoction of home brew code with the WAIS server. What about the invention of the jargon that permeates discussions of content processing? A good example is a “parametric cube.” Is this the conjuring of Spotfire and Palantir? Verity is, in Mr. Arnold’s view, the undisputed leader in this type of lingo in its attempts to sell search without using the word “search.” Grab some SkinnyPop and check out Trifles.

Kenneth Toth, May 6, 2014

SAS Text Miner Gets An Upgrade

May 5, 2014

SAS is a well-recognized player in the IT game as a purveyor of data, security, and analytics software. In modern terms, SAS is a big player in big data, and to beef up its offerings the company has updated its Text Miner. SAS Text Miner is advertised as a way for users to harness information not only in legacy data, but also in Web sites, databases, and other text sources. The process can be used to discover new ideas and improve decision-making.

SAS Text Miner offers a variety of benefits that make it different from the standard open source download. Not only do users receive the license and tech support, but Text Miner also offers the ability to process and analyze knowledge in minutes, an interactive user interface, and predictive and data mining modeling techniques. The GUI is what will draw in developers:

“Interactive GUIs make it easy to identify relevance, modify algorithms, document assignments and group materials into meaningful aggregations. So you can guide machine-learning results with human insights. Extend text mining efforts beyond basic start-and-stop lists using custom entities and term trend discovery to refine automatically generated rules.”
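The start-and-stop-list idea in the quote is simple to sketch. The toy code below is my own illustration, not SAS Text Miner: a stop list filters noise terms, and a custom entity list promotes domain phrases a naive tokenizer would split apart.

    # A rough sketch of stop lists plus custom entities (illustrative only).
    import re
    from collections import Counter

    STOP_LIST = {"the", "a", "of", "and", "in", "to", "is"}
    CUSTOM_ENTITIES = ["big data", "text miner"]  # hypothetical domain phrases

    def mine_terms(text):
        text = text.lower()
        counts = Counter()
        for phrase in CUSTOM_ENTITIES:        # count multi-word entities first
            counts[phrase] = len(re.findall(re.escape(phrase), text))
            text = text.replace(phrase, " ")  # avoid double-counting the words
        for token in re.findall(r"[a-z]+", text):
            if token not in STOP_LIST:
                counts[token] += 1
        return counts

    print(mine_terms("Big data is the fuel of the text miner era.").most_common(3))

Real products wrap this sort of thing in a GUI and add term trend discovery, but the moving parts are recognizably the same.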

Being able to modify proprietary software is a deal maker these days. With multiple options for text mining software, being able to make it unique is what will sell it.

Whitney Grace, May 5, 2014
Sponsored by ArnoldIT.com, developer of Augmentext

Meme Attention Deficit

April 27, 2014

I read “Algorithm Distinguishes Memes from Ordinary Information.” The article reports that algorithms can pick out memes. A “meme”, according to Google, is “an element of a culture or system of behavior that may be considered to be passed from one individual to another by nongenetic means, especially imitation.” The passage that caught my attention is:

Having found the most important memes, Kuhn and co studied how they have evolved in the last hundred years or so. They say most seem to rise and fall in popularity very quickly. “As new scientific paradigms emerge, the old ones seem to quickly lose their appeal, and only a few memes manage to top the rankings over extended periods of time,” they say.

The factoid that reminded me how far smart software has yet to travel is:

To test whether these phrases are indeed interesting topics in physics, Kuhn and co asked a number of experts to pick out those that were interesting. The only ones they did not choose were: 12. Rashba, 14. ‘strange nonchaotic’ and 15. ‘in NbSe3’. Kuhn and co also checked Wikipedia, finding that about 40 per cent of these words and phrases have their own corresponding entries. Together this provides compelling evidence that the new method is indeed finding interesting and important ideas.

Systems produce outputs that are not yet spot on. I concluded that scientists, like marketers, like whizzy new phrases and ideas. Jargon, it seems, is an important part of specialist life.

Stephen E Arnold, April 27, 2014

Small Analytics Firms Reaping the Benefit of Investment Cycle

April 23, 2014

Small-time analytics isn’t really as startup-y as people may think anymore. These companies are in high demand and are pulling in some serious cash. We discovered just how much and how serious from a recent Cambridge Science Park article, “Cambridge Text Analytics Linguamatics Hits $10m in Sales.”

According to the story:

Linguamatics’ sales showed strong growth and exceeded ten million dollars in 2013, it was announced today – outperforming the company’s targeted growth and expected sales figures.  The increased sales came from a boost in new customers and increased software licenses to existing customers in the pharmaceutical and healthcare sectors. This included 130 per cent growth in healthcare sales plus increased sales in professional services.

This earning potential has clearly grabbed the attention of investors. This is feeding a cycle of growth, which is why the Linguamaticses of the world can rake in impressive numbers. Just the other day, for example, Tech Circle reported on a microscopic Mumbai big data company that landed $3m in investments. They say it takes money to make money, and right now the world of big data analytics has that cycle down pat. It won’t last forever, but it’s fun to watch as it does.

Patrick Roland, April 23, 2014

Sponsored by ArnoldIT.com, developer of Augmentext

