LucidWorks: Mom, Do My Three Cs Add Up to an A?

February 19, 2020

Search firm Lucidworks has put out a white paper explaining its new 3 Cs of enterprise search, we learn from the InsideBigData write-up, “Understanding Intention: Using Content, Context, and the Crowd to Build Better Search Applications.” Registration is required to download and read the paper, but the company has also put out a more simply titled PDF, “Understanding Intention,” that gives us its perspective.

In the 3 Cs section of that document, they note that enterprise search pretty much has content wrapped up. With tools like Hadoop, Solr, and NoSQL, we can now access unstructured as well as structured data. Context means, in part, understanding how different pieces of content relate to each other. It also means analyzing which pieces of information will be relevant to each searcher—and this is the exciting part for Lucidworks. The document explains:

“When a search app knows more about you, it can create a relevant search experience that helps you get personal, actionable search results on a consistent basis. Search apps have solved that problem with signal processing. A signal is any bit of information that tells the app more about who you are. Signals can include your job title, business unit, location, device, and search history, as well as past actions within the search app like clickstream, purchasing behavior, direct reports, upcoming meetings or events, and more.”

Interesting. As for the crowd portion, it has to do with matching searchers with content found by similar entities that have searched before. We’re told:

“When a search app uses the crowd, it goes beyond documents and data, past your specific user profile and relationship, and examines how other users are interacting with the data and information. A search app knows the behavioral information of thousands — sometimes millions — of other users. By keeping track of every user, search apps can bubble up what you will find important and relevant and what other users like you will want, too. The tech uses its knowledge of your office, role, and demographic to match to the same in other users and make intelligent judgments about what will help you the most.”
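The content, context, and crowd pitch boils down to re-ranking. Here is a minimal Python sketch of how such signals might be combined at query time; the weights, field names, and scoring formula are illustrative assumptions, not Lucidworks’ actual implementation:

```python
# Hypothetical sketch of signal-based re-ranking: combine a base relevance
# score with a "context" signal (user profile match) and a "crowd" signal
# (click popularity). Weights and field names are invented for illustration.

def rerank(results, user, click_counts, w_context=0.3, w_crowd=0.2):
    """Re-score a result list using context and crowd signals."""
    max_clicks = max(click_counts.values(), default=0) or 1
    scored = []
    for doc in results:
        # Context: does the document match the user's business unit?
        context = 1.0 if user["business_unit"] in doc.get("tags", []) else 0.0
        # Crowd: click popularity normalized to [0, 1] across the set.
        crowd = click_counts.get(doc["id"], 0) / max_clicks
        score = doc["score"] + w_context * context + w_crowd * crowd
        scored.append((score, doc))
    return [doc for _, doc in sorted(scored, key=lambda p: p[0], reverse=True)]

results = [
    {"id": "a", "score": 1.0, "tags": ["sales"]},
    {"id": "b", "score": 0.9, "tags": ["engineering"]},
]
user = {"business_unit": "engineering"}
clicks = {"a": 2, "b": 10}
print([d["id"] for d in rerank(results, user, clicks)])  # -> ['b', 'a']
```

Note how the crowd and context boosts flip the order: the nominally lower-scoring document wins once the behavioral signals are folded in, which is exactly the behavior the white paper is selling.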

But how good is the tech, really, at identifying what information one truly needs, and how would we know? Do three Cs add up to an A in search? Not yet, Willy.

Cynthia Murrell, February 19, 2020

A Fanciful Explanation of the Expensive Failure of IBM Watson

February 19, 2020

I love the idea of revisionist history. I associate the method with an individual named Ioseb Besarionis dze Jughashvili.

Alleged Stalin quote: It is not heroes that make history, but history that makes heroes.

You may know this allegedly competent leader as Joseph Stalin. Changing history is one way to make sure the present comes out in a way that is more satisfying — at least to some people.

I read “IBM Watson And The Value Of Open.” I thought of Jughashvili in the terms my former history professor (Dr. Philip Miller Crane) explained the revisionist thing.

My analysis of IBM Watson included information I obtained when I was researching my various and sundry books about search and retrieval. I did not include IBM as a “recommended” solution for three reasons:

  1. Watson was a marketing confection which conflated a range of technologies: some developed by IBM and others obtained via an open source download or by paying money; for example, Vivisimo, a metasearch and clustering system.
  2. Training “Watson” required programmers to interview subject matter experts, create specific content domains, test, do more interviews, retrain, and test again. Once the content domain was in hand, Watson would crunch away to locate an answer. Many companies follow a similarly expensive process. IBM was unique in making Watson seem something other than what other vendors offered. By sweeping the time and cost of training under the digital rug, Watson was cut loose from reality.
  3. Question answering systems work when certain conditions are met; for example, a bounded content domain, known expected responses, and handcrafted rules that mostly work. Toss the system questions based on new content, and the responses are going to be interesting, if not off base, a certain percentage of the time.
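Point three can be illustrated with a toy rules-based answering loop; the patterns and lookup table below are invented for illustration and have nothing to do with IBM’s actual code:

```python
# A toy illustration (not IBM's code) of why rule-based question answering
# is brittle: patterns written for a known content domain answer well,
# while any question outside the rules falls through.

import re

AUTHORS = {"moby-dick": "Herman Melville"}  # hand-built content domain

RULES = [
    (re.compile(r"who (wrote|authored) (.+?)\??$", re.I),
     lambda m: AUTHORS.get(m.group(2).strip().lower(), "unknown")),
]

def answer(question):
    for pattern, handler in RULES:
        m = pattern.match(question)
        if m:
            return handler(m)
    # New content, no rule: the "interesting" responses mentioned above.
    return "no rule matched"

print(answer("Who wrote moby-dick?"))         # Herman Melville
print(answer("When was moby-dick written?"))  # no rule matched
```

Every new question type means another handcrafted rule, another interview cycle, another test pass, which is where the time and cost hide.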

To sum up, the cost and unreliability of Watson were wildly out of step with the marketing of cognitive computing. IBM’s billions made it possible for search and retrieval carpetbaggers to describe their systems as “cognitive”; that is, infused with artificial intelligence, predictive linguistics, and my favorite bit of jargon, natural language processing.

The article’s explanation of the failure of IBM’s billion dollar bet, the office near NYU, and the absolutely bonkers ad in the New York Times for Watson as a collection of digital molecules is at odds with my assessment.

That’s okay. Let’s look at a couple of the “revelations” in this Forbes’ article.

The Texas Fold

The write up explains the outright failure of Watson as a useful medical tool for cancer doctors says:

But with the passage of more time, it must be said that IBM Watson has not delivered the results that IBM expected. One particular moment was the decision of MD Anderson’s Cancer Center to withdraw from its partnership with IBM in 2017. An internal audit by the University of Texas found that the university had spent over $62 million dollars (not counting internal staff time) and did not meet its goals. Other health partners soon followed.

Yep, to summarize. Watson did not work. In fact, I heard from a reliable source that cancer doctors in New York City refused to answer endless programmer questions. The message for me was, “Cancer doctors don’t want to teach programmers how to be cancer doctors.” Hasta la vista to Texas.

The Wrong Explanation: Vertical Integration

Why did IBM Watson succumb to its self-generated cancer? Here’s what the Forbes write-up asserts:

Being vertically integrated gave IBM complete end-to-end control over Watson. But it condemned Watson to being applied in only a few areas. IBM essentially had to guess where this powerful technology could best be applied. Even within health care, some likely areas for Watson like radiology were not pursued in its early years. Because of the limited number of areas IBM was able to explore for using Watson, we will never know whether there were other areas where Watson might have performed beautifully.

Okay, this means in my opinion that IBM engineers and scientists wanted to run the show. There was, therefore, one throat to choke. That throat was IBM Watson’s. The fallout continues: a new CEO, hoots of laughter when I tell people about IBM’s Watson ads, and the loss of shareholder value. I would roll the weird layoffs in as a somewhat desperate way to slash costs too.

Alleged Stalin quote: Death is the solution to all problems. No man – no problem.

Forget vertical integration. The reason for failure was that the system and method did not work.

The Reality

Mr. Jughashvili would be proud of this analysis. It rewrites history. But like Mr. Jughashvili’s, Watson’s actions live on. Changing the words does not alter the reality.

Don’t believe me? Just ask IBM Watson. Is IBM “open”?

Stephen E Arnold, February 19, 2020

NoSQL DBMS: A Surprising Inclusion

February 12, 2020

“Top Databases Used in Machine Learning Project” is a listicle. The information in the write up is similar to the lists of “best” products whipped up by Silicon Valley type publications, mid tier consulting firms (a shade off the blue chip outfits like McKinsey, Booz, and BCG), and 20 somethings fresh from university.

The interesting inclusion in the list of DBMS is?

If you said, Elasticsearch you would be correct. Elasticsearch is an open source play doing business as Elastic. The open source version is at its core a search and retrieval system. (Does this mean the index is the data and the database?)

DarkCyber is not going to get into a discussion of whether an enterprise search system can be a database management system. Both sides in the battle are less interested in resolving the fuzzy language than making sales.

Maybe Elasticsearch is just doing what other enterprise search systems have done since the 1980s? Vendors describe search and retrieval as the solution to the world’s data management Wu Flu.

Net net: Without boundaries, why make distinctions? Just close the deal. Distinctions are irrelevant for some business tasks.

Stephen E Arnold, February 12, 2020

Founder of Autonomy: Extradition Action

February 5, 2020

DarkCyber noted this CBR Online story: “Mike Lynch Submits Himself for Arrest.” The write up states:

Former Autonomy CEO Dr Mike Lynch has submitted himself for arrest this morning, in what his legal team described as a formality required as part of an extradition process initiated by the US Department of Justice.  Lynch is still contesting extradition.

The story about the founder of Autonomy and DarkTrace continues. A free profile about Autonomy is available at this link. (Note: this document is a rough draft prepared for a client before the Hewlett Packard purchase of the company. Also, Autonomy was a client of mine before I retired in 2013.)

Stephen E Arnold, February 5, 2020

Paris Museums: More Art Online. Search Means Old Fashioned Hunting Around

February 5, 2020

Oh, boy—it is a collection of art from the many Paris Museums available online at Paris Musées Collections. This artist’s daughter is delighted!

Unfortunately, the site’s search functionality disappoints. Unless your goal is either to find a specific work or to aimlessly browse the 150,213 public domain images, it is another almost unusable collection. I suppose trusting to serendipity has its place, but most of us are looking for something a bit more specific, even if we don’t have a particular title or artist in mind.

There is a section titled “Thematic Discovering,” which might be useful to some. They have put together 11 preconfigured themes that span museums, like “Sport, Jeux Olympiques et Paris” (Sports, Olympic Games, and Paris) or “Elements: Air, Terre, Feu, Eau” (Elements: Air, Earth, Fire, Water). They do make for interesting guided tours. There are also a highlighted Virtual Exhibition and a few suggested works at the bottom of the page.

I was excited to find this resource—it really is a valuable collection to have at our fingertips. If only it were easier to navigate. Check it out if you feel persistent.

And for those who think search is really great: none of the visual art collections features a search that delivers what most users seek.

Cynthia Murrell, February 5, 2020

Search Vendor Comparison

January 30, 2020

The Finland-based AddSearch published a comparison of its search and retrieval service with two competitors: Algolia and Swiftype. Each of these is a for-fee solution. The write up appeared in 2019, but DarkCyber wanted to call attention to the article because it does a good job of outlining some of the main characteristics of commercial search solutions. You can locate the article by Anna Pogrebniak in the AddSearch Blog.

Kenny Toth, January 30, 2020

Verizon Tries Its Hand at… Search

January 29, 2020

Verizon is hopping on the privacy bandwagon, but how do we know we can trust it? ArsTechnica reports, “Verizon Offers No-Tracking Search Engine, Promises to Protect Your Privacy.” The company’s new platform is called OneSearch. It gets its results from Bing, but imposes privacy features on them. It still runs contextual advertising, of course, but the ads rely on keywords, not tracking. This search service is in addition to, not replacing, the existing Yahoo search engine. (Verizon bought Yahoo in 2017.) Yahoo search also gets its results from Bing, but makes no promises about privacy.

Here are the promises we get from OneSearch: no cookie tracking, retargeting, or personal profiling; no sharing of personal data with advertisers; and no storing search histories. When users click on the “Advanced Privacy Mode” button, the platform encrypts their search terms and URL and sets the results link to expire in an hour. However, writer Jon Brodkin reports:

“The Verizon search engine homepage says, ‘OneSearch doesn’t use cookies. Period.’ Chrome detected that OneSearch did set one cookie on my computer, so that statement seems to be exaggerated. The EFF’s Privacy Badger detected a potential tracker that’s tied to the u.yimg.com domain, indicating a connection between OneSearch and Yahoo’s image service. What Verizon apparently means is that it doesn’t use cookies to build ad-targeting profiles. Verizon uses your IP address to determine your ‘general location,’ helping it deliver location-specific search results. Verizon said that ‘We only ever infer location data up to the city level of specificity for search localization purposes.’”
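What an encrypted, hour-limited results link might look like under the hood can be sketched with a timestamped HMAC token. This is a guess at the general technique, not Verizon’s implementation; the secret, field names, and TTL are invented:

```python
# Hypothetical sketch of an expiring results link like the one OneSearch
# describes: the query rides in an opaque token carrying a timestamp and
# an HMAC signature, and the server refuses tokens older than one hour.

import base64
import hashlib
import hmac
import json
import time

SECRET = b"server-side-secret"  # illustrative only
TTL_SECONDS = 3600              # "set to expire in an hour"

def make_token(query, now=None):
    payload = json.dumps({"q": query, "ts": now or int(time.time())}).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(payload).decode() + "." + sig

def read_token(token, now=None):
    encoded, sig = token.rsplit(".", 1)
    payload = base64.urlsafe_b64decode(encoded)
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None  # tampered token
    data = json.loads(payload)
    if (now or int(time.time())) - data["ts"] > TTL_SECONDS:
        return None  # expired link
    return data["q"]

token = make_token("private query", now=1000)
print(read_token(token, now=2000))  # within the hour -> private query
print(read_token(token, now=9999))  # more than an hour later -> None
```

The design choice worth noting: the server never needs to store the search history to enforce expiry, which is consistent with the “no storing search histories” promise.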

The write-up also lists in more detail the steps OneSearch performs for each query. We are also reminded of the less-than-stellar performance of Verizon’s media division thus far. See the article for those details.

So, can we trust Verizon to deliver on its privacy vows? Brodkin notes several events that may give us pause: In 2016, the company paid a $1.35 million fine and agreed to change its ways over “supercookies”; it, along with T-Mobile, Sprint, and AT&T, was caught selling mobile customers’ location data to third-party brokers; and Verizon regularly opposes government regulations that would require carriers to protect customer privacy. The write-up suggests DuckDuckGo and Startpage as alternatives for anyone hesitant to take Verizon at its word.

Cynthia Murrell, January 29, 2020

Calling Out Search: Too Little, Too Late

January 20, 2020

The write up’s title is going to be censored in DarkCyber. We are not shrinking violets, but we think that stop word lists do exist. Problem? Buzz your favorite ad supported search vendor and voice your complaints.

The write-up is “How Is Search So #%&! Bad? A ‘Case Study’.” The author appears to be frustrated with the outputs of ad supported and probably other types of seemingly “free” search systems providing links to Web content. This is what some people call “open source intelligence online”. There are other information resources available, but most of the consumer oriented, eyeball hungry vendors ignore i2p, forums with minimal traffic, what some experts call the Dark Web, and even some government information services. How many people pay any attention to the US National Archives? Be honest in your assessment.

Here’s a passage we noted:

Google Search is ridiculously, utterly bad.

This seems clear.

The write up provides some examples, but I anticipate that some other people have found that the connection between a user’s query and the Google search outputs is tenuous at best. One criticism DarkCyber has of the write up is that it mentions Google, shifts to Reddit, and then to metadata. The key point for us was the focus on time.

Now time is an interesting issue in indexing. Years ago I did a research project on the “meaning” of “real time” in online services. I think my research team identified five or six different types of time. I will skip the nuances we identified and focus only on the date or freshness of an item in a results list.

Let’s be sympathetic to the indexing company. Here’s why:

First, many documents do not provide an explicit date in the text of the article. In Beyond Search and DarkCyber, you will notice that we provide the author’s name and the date on which the article was posted. Many write ups on the open Web don’t bother. In fact, there is often no easy way to determine from the content displayed in a browser when the author posted the story. Don’t you love news releases which do not include a date, time, and time zone?

Second, many write ups include dates and times in the text of an article. For example, the reference to Day 2 of the recent CES trade show may include the explicit date January 8, 2020, for a product announcement. The approach is similar to using CES without spelling out “Consumer Electronics Show.” But, hey, these folks are busy, and everyone in the know understands the what and when, right?

Third, auto-assigned dates by operating systems may be “correct” when a file or content object is created. But what happens when a file or drive is restored? The original dates and metadata may be replaced with the time stamp of the restore. What about date last accessed or date last changed? Too much detail. Yada yada.

Fourth, time sorting is possible. Google invested in Recorded Future (now part of Insight). I had heard that someone at the GOOG thought Recorded Future’s time functions were nifty. Guess not. Google did not implement more sophisticated time functions in any service other than those related to advertising. For the great unwashed masses who don’t work at Google, tough luck, I suppose.

Fifth, when was the content first indexed? More significantly, when was it last updated? Important? Maybe, gentle reader. Maybe.
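The first three problems can be made concrete with a toy date guesser that falls back from in-text dates to file timestamps; the heuristics below are illustrative assumptions, not any indexer’s production logic:

```python
# A minimal sketch of the date-guessing problem: a page with no explicit
# publication date forces an indexer to fall back on in-text dates or file
# timestamps, either of which can mislead. Illustrative only.

import re
from datetime import datetime, timezone

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
# Matches dates like "January 8, 2020" anywhere in the text.
DATE_RE = re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b")

def guess_publication_date(text, file_mtime_utc):
    """Return (date_string, source), flagging how unreliable the source is."""
    m = DATE_RE.search(text)
    if m:
        # An in-text date may be an event date ("Day 2 of CES, January 8,
        # 2020"), not the posting date: exactly the ambiguity noted above.
        return m.group(0), "in-text (ambiguous)"
    # Fallback: mtime, which a restore from backup may have overwritten.
    ts = datetime.fromtimestamp(file_mtime_utc, tz=timezone.utc)
    return ts.strftime("%B %d, %Y"), "filesystem (unreliable after restore)"

print(guess_publication_date("Day 2 of CES, January 8, 2020: a gadget.", 0))
print(guess_publication_date("A press release with no date at all.", 1579046400))
```

Neither branch yields a date an indexer can actually trust, which is why “sort by date” on a Web engine is guesswork dressed up as a feature.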

There are several other conditions as well. For the purposes of a blog post, I want to make clear: The person who is annoyed with search should have been annoyed decades ago. These time problems are not new, and they are persistent.

The author with a penchant for tardy profanity stated:

Part of the issue in this specific case is that they’ve started ignoring settings for displaying results from specific time periods. It’s definitely not the whole issue though, and not something new or specific to phone searches. Now, I’ve always been biased towards the new – books, tech, everything, but I can’t help but feel that a lot of things which were done pretty well before are done worse today. We do have better technology, yet we somehow build inferior solutions with it all too often. Further, if they had the same bias of showing me only recent results I’ll understand it better, but that’s not even the case. And yes, I get that the incentives of users and providers don’t align perfectly, that Google isn’t your friend, etc. But what is DDG’s excuse? As for the Case Study part, and me saying this isn’t simply a rant – I lied, hence the quotation marks in the title. Don’t trust everything you read, especially the goddamn dates on your search results.

The write up omits a few other minor problems with modern search and retrieval systems. Yep, this includes Reddit, LinkedIn, and a bunch of others. Let me provide a few dot points:

  • Poorly implemented Boolean search
  • Zero information about what’s in an index
  • Zero information about what’s excluded from an index and why
  • Minimal auto linking to information about an “author” or the “source” of the content
  • No data to make a precision or recall calculation possible and reproducible
  • No data to make it possible to determine overlap among Web indexes. Analyses must be brute forced. Due to the volatility, latency, and editorial vagaries of ad supported Web search systems, data are mostly suggestive.
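The precision and recall point is easy to make concrete: both numbers require a labeled set of relevant documents, which no Web engine publishes. A toy calculation with a hand-labeled ground truth looks like this:

```python
# Precision = relevant results retrieved / all results retrieved.
# Recall    = relevant results retrieved / all relevant results that exist.
# The second denominator is exactly what Web engines never disclose.

def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Ten results returned; six appear in our labeled relevant set of eight.
retrieved = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
relevant = ["d1", "d2", "d3", "d4", "d5", "d6", "d11", "d12"]
p, r = precision_recall(retrieved, relevant)
print(p, r)  # -> 0.6 0.75
```

Without knowing what is in the index (or excluded from it), the “relevant” set cannot be constructed, so the calculation stays impossible and unreproducible.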

Why? Why are none of these dot points operative?

Answer: Too expensive, too hard, not appropriate for our customers, and “What are you talking about? We never heard of half these issues you identified.”

Net net: Years ago I wrote an article for Searcher Magazine, edited at the time by Barbara Quint, a bit of an expert in online information retrieval. She worked at RAND for a number of years as an information expert. She asked, “Do you really want me to use the title ‘Search Sucks’ on your article?” I told her, use whatever title you want. But if you agree with me, go with “sucks.” She used “sucks.” Let’s see, that was a couple of decades ago.

Did anyone care? Nope. Does anyone care today? Nope. There you go.

Stephen E Arnold, January 20, 2020

DuckDuckGo Lands for European Search Users

January 14, 2020

I read “DuckDuckGo Beats Microsoft Bing In Google’s New Android Search Engine Ballot.” There have been numerous reports about this decision.

Digital Information World offers a representative write up in today’s world of Google EU analysis. DarkCyber noted:

The introduction of this “choice screen” seems to be a clear response to the antitrust ruling from the European Union during last March and how Google was fined $5 billion by EU regulators. According to them, Google was playing illegally in tying up the search engine to its browser for mobile OS.

Okay. But how does a search engine get listed? We learned:

you can expect Google to not show search engines which are popular but the ones whose providers are willing to pay well.

The write up includes a rundown of what search options will be displayed in each EU country. The ones we spotted are:

  • DuckDuckGo
  • GMX
  • Info.com
  • Privacy Wall
  • Qwant
  • Yandex.

Bing is a no show, as are Giburu, iSeek, Mojeek, Yippy, and others. It is worth noting that some of these outfits are metasearch engines. This means that the systems send queries to Bing, Google, and other services and aggregate the results. Dogpile and Vivisimo were metasearch engines. DuckDuckGo and Ixquick (StartPage) are metasearch engines. The reason metasearch is available boils down to cost. It is very expensive to index the public Web.
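The metasearch pattern is simple to sketch: fan the query out, merge the ranked lists, and deduplicate by URL. The backend functions below are stand-ins, not any real engine’s API:

```python
# A sketch of the metasearch pattern: send the query to several backends,
# then round-robin merge the ranked lists, keeping the first hit per URL.
# The backends are fakes invented for illustration.

def fake_bing(query):
    return [{"url": "https://a.example", "rank": 1},
            {"url": "https://b.example", "rank": 2}]

def fake_other(query):
    return [{"url": "https://b.example", "rank": 1},
            {"url": "https://c.example", "rank": 2}]

def metasearch(query, backends):
    """Interleave backend result lists, deduplicating by URL."""
    result_lists = [backend(query) for backend in backends]
    seen, merged = set(), []
    # Round-robin by rank position so each backend gets a voice.
    for i in range(max(len(r) for r in result_lists)):
        for results in result_lists:
            if i < len(results) and results[i]["url"] not in seen:
                seen.add(results[i]["url"])
                merged.append(results[i]["url"])
    return merged

print(metasearch("privacy", [fake_bing, fake_other]))
# -> ['https://a.example', 'https://b.example', 'https://c.example']
```

The aggregator writes no crawler and pays no indexing bill, which is the whole economic point: merging someone else’s index is cheap; building your own is not.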

The DarkCyber team formulated a few hypotheses about the auction, the limitations on default search engines, and the dominance of Google search in Europe; for example, Google accounts for more than 95 percent of the search traffic in Denmark. The same situation exists in Germany and other EU countries.

Will these choices make any difference? Sure, for small outfits like DuckDuckGo any increase in traffic is good news. But will the choices alter Google’s lock on search queries from Europe?

Not a chance.

Does anyone in the EU government know? Probably not. Do these people care? Not too much.

Remember one of my Laws of Information: Online generates natural monopolies. Here’s another Law: User behavior is almost impossible to change once mental memory locks in.

So Google gets paid and keeps on trucking.

Stephen E Arnold, January 14, 2020

Enterprise Search and the AI Autumn

January 13, 2020

DarkCyber noted this BBC write up: “Researchers: Are We on the Cusp of an AI Winter?” Our interpretation of the Beeb story can be summarized this way:

“Yikes. Maybe this stuff doesn’t work very well?”

The Beeb explains in Queen’s English based on quotes of experts:

Gary Marcus, an AI researcher at New York University, said: “By the end of the decade there was a growing realization that current techniques can only carry us so far.”

He thinks the industry needs some “real innovation” to go further. “There is a general feeling of plateau,” said Verena Rieser, a professor in conversational AI at Edinburgh’s Heriot-Watt University. One AI researcher who wishes to remain anonymous said we are entering a period of heightened skepticism about AGI.

Well, maybe.

But the enterprise search cheerleaders have not gotten the memo. The current crop of “tap your existing unstructured information” companies assert that artificial intelligence infuses their often decades old systems with zip.

Venture outfits believe the story. The search for the next big thing is leading to making sense of unstructured text. After all, the world is awash in unstructured text. Companies have to solve this problem, or red ink and extinction are just around the corner.

Net net: AI is a collection of tools, some useful, some not too useful. Enterprise search vendors are looking for a way to make sales to executives who don’t know or don’t care about past failures to index unstructured text on a company wide basis with a single system.

Stephen E Arnold, January 13, 2020
