Enterprise Search: Still Crazy after All These Years

November 20, 2020

This is not old wine in new bottles. This is wine in those weird clay jars with the nifty moniker “amphora” filled with Oak Leaf Vineyards Sauvignon Blanc White Wine. Cough, cough.

CMS Wire gets it correct when it declares, “Scanning and Selecting Enterprise Search Results: Not as Easy as it Looks.” The article doesn’t even approach the formation of a query—finding the right wording then tweaking filters and facets to produce a manageable list. Here we are only looking at the next step. Though the task seems simple on its surface—scan a list of results and select the most relevant ones—writer Martin White explains why it is not so straightforward.

First is scanning results. Users’ perceptual speed differs, so for some folks (like those who are dyslexic, for example) the process can be so tedious as to make searching pointless. White tells us that inconvenient fact is often overlooked in the discussion of search functionality. Also under-considered is the issue of snippet length. A bit of research has been performed, but it involved web pages, which are themselves more easily scanned and assessed than content found in enterprise databases. Those documents are often several hundred pages long, so ranking algorithms often have trouble picking out a helpful snippet. Some platforms serve up a text sequence that contains the query term, others create computer-generated summaries of documents, and others reproduce the first few lines of each document. Each of these approaches is imperfect. Still others produce a thumbnail of a whole page that contains the search term, and that probably helps many users. However, there are accessibility problems with that method.
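Two of the snippet strategies mentioned above are simple enough to sketch. The Python below is a minimal illustration, not how any particular search platform builds snippets; the function names `kwic_snippet` and `leading_snippet` and the `width` and `max_lines` parameters are made up for this example. It shows why neither approach is a sure thing: the keyword window depends on where the term happens to land, and the leading lines of a long document may say nothing about the query.

```python
import re


def kwic_snippet(text, query, width=60):
    """Return a window of text around the first occurrence of the query term.

    Hypothetical helper: 'width' is the number of characters kept on each
    side of the match. Returns None when the term does not occur.
    """
    match = re.search(re.escape(query), text, flags=re.IGNORECASE)
    if match is None:
        return None
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end].strip() + suffix


def leading_snippet(text, max_lines=2):
    """Return the first few non-empty lines of the document, joined by spaces."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return " ".join(lines[:max_lines])


if __name__ == "__main__":
    # A stand-in for a long enterprise document (assumed content).
    document = (
        "Annual report\nPrepared by finance\n"
        + "filler " * 50
        + "The retention policy requires review."
    )
    print(kwic_snippet(document, "retention policy", width=20))
    print(leading_snippet(document))
```

For a several-hundred-page document, the leading-lines snippet is usually a title page, which is one reason the query-term window tends to be more useful despite its own flaws.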

White concludes:

“We know from recent research that people may make different decisions from the information they perceive initially as relevant based on their expertise. Equally, most search metrics are based around the notional relevance of the results being presented in response to a query. If the true value of relevance cannot be well judged from the snippet, that calls any metrics associated with query performance (especially precision) into question.

“There are no easy solutions to the issues raised in this column. In the quest for achieving an acceptable user experience the points to consider are:

  • Are the techniques used by the search application to create snippets appropriate to the types of content being searched?
  • Can the format of snippets be customized by the user?
  • How easy is it to scan and assess results from a federated search?

“In the final analysis, it doesn’t matter how sophisticated the search technology is (in terms of semantic analysis, etc.). What matters is if the user can make an informed judgment of which piece of content in the results serves their information requirement, reinforces their trust in the application and maintains the highest possible level of overall search satisfaction.”

Sigh. It seems the more developers work on enterprise search, the more complicated it is to effectively operate. The field has been at it for 50 years, and is still trying to deliver something useful. Still crazy after all these years too.

PS. Our esteemed check writer (Stephen E Arnold) wrote a book about enterprise search with the author of the source document. No wonder this essay seemed weirdly familiar. I had to proofread what turned out to be prose that made the Oak Leaf stuff welcome at the end of an editing day. Cough, cough, eeep. 

Cynthia Murrell, November 20, 2020

Survey Says Data Governance Is Important. But What Is Data Governance?

November 20, 2020

Here’s what the Google says governance means: The action or manner of governing. Okay, but what exactly is governing? Google says: Having authority to conduct the policy, actions, and affairs of a state, organization, or people.

Okay, now let’s add the magic word “data,” which is a plural, not a single thing. (That’s what datum means, right?)

Google says: Facts and statistics collected together for reference or analysis.

Let’s put the information together, shall we?

An organization uses authority to conduct policy, actions and affairs to deal with facts and statistics for reference or analysis.

Why care? The answer is found in “Businesses Positive about Data Governance but Still Struggle with Privacy Concerns.”

Okay, now we have linked dealing with information and privacy. This is getting interesting or is it? I go with the “not interesting,” but let’s plod forward in the write up.

A vendor of search and retrieval software sponsored a research project conducted by Standard & Poor 451 Research. Note: That report is titled “Pathfinder Report Market Intelligence: Information Driven Compliance and Insight. Two Sides of the Same Coin.” I am not sure about the “coin” metaphor, compliance, insight, and pathfinding. But no one ever accused me of understanding mid-tier consulting firms, sponsored research, and 18-year-old vendors of proprietary search and retrieval software.

The 451 outfit tapped its pool of “survey responders” and discovered:

72 percent of enterprises believe data governance is an enabler of business value rather than a cost center.

Okay, that’s a lot of enterprises, assuming the sample was statistically valid, the questions not shaped, and the data analysis of the survey responses was performed on the up and up. But sponsored research is different from the often wonky academic research churned out by professors and work-from-home students. That’s better, right? 

I learned:

  • One in four organizations have more than 50 distinct data silos.
  • 37 percent of respondents say having relevant information automatically displayed, when the team needs it, would benefit them the most in the pursuit of automation.
  • Budget, privacy issues, and expertise are barriers.

How does one deal with data silos, which I assume is “governance”? How does one deal with security? Privacy? How does an enterprise search company cope with the assorted sixes and sevens of data in an organization; for example, tweets, encrypted messages, images, geospatial data, videos, and information which must be kept isolated from the grubby “let’s federate information” crowd? (Why must some data be isolated? Find an attorney. Ask her what happens if information in a legal matter is out of her span of control.)

What’s the net net of the mid-tier consulting outfit’s report? Here it is:

Success requires alignment of business objectives by looking for common-denominator requirements across business units.

Let me be clear: Enterprise search is not the solution to problems with an “authority to conduct policy, actions and affairs to deal with facts and statistics for reference or analysis.”

Enterprise search is information retrieval, not data governance, no matter how much a marketer wishes it were. Enterprise search vendors have been struggling for relevance because Lucene/Solr are good enough and users want information to address right-now business issues. Library-style lists of stuff to read or look up may not ring the chimes of a thumb-typing user.

Want the full report? Go here. Please, keep marketing and governance separate. Statistics 101 offered some useful guidelines. Some, however, did not pay attention. You will have to register. Marketing is still marketing.

Stephen E Arnold, November 20, 2020

Expert System Has Embraced the AI Revolution

November 19, 2020

It’s official. Expert System S.p.A. (Italy) is now Expert.ai. I know because the firm’s Web site displays this message:

image

Expert System has moved along a business path like one of those Amalfi coast cliff side roads: Breathtaking turns, chilling confrontations with other vehicles, and a lack of guard rails.

image

Repositioning a big rig is a thrill for sure.

The company’s tag line is:

It’s time to make all data actionable.

Yep, “all.” Even video, encrypted messages among employees, and confidential compensation data? Sure, “all.”

Plus, the firm has tweaked its description of its focus to assert:

Expert.ai is the premier artificial intelligence platform for language understanding. Its unique hybrid approach to NL combines symbolic human-like comprehension and machine learning to transform language-intensive processes into practical knowledge, providing the insight required to improve decision making throughout organizations.

Vendors of search and content processing widgets are responding to today’s business environment with marketing. Expert System was founded in 1989 in Modena, Italy.

Premier too.

Stephen E Arnold, November 19, 2020

Palantir Technologies: Once Secretive Company Explains What It Is Not

November 17, 2020

I enjoy once-secretive companies explaining what they are not. A good example of this type of re-formation is “Palantir Is Not a Data Company (Palantir Explained, #1).” The headline makes it clear to me that there will be additional “we are not” essays coming down the intelware pike. The first installment corrects what the once-stealthy company, it seems, communicated incorrectly:

Palantir is not a data company and not a data aggregator.

The write up wants to differentiate Palantir from companies like Dataminr, Oracle BlueKai, and similar firms. These outfits suck up information and then sell access to those data.

Palantir Technologies is not in that “data” business. The company processes the data its clients have, license, or link to in an appropriate manner.

The essay makes clear that Palantir is a “software company.” That’s true. Much of the software is open source or crafted to perform specific functions which customers pay Palantir to effectuate. (There are partners and integrators who perform other work for Palantir licensees. Most of these companies keep a low profile and do not advertise their Palantir goodness.)

Several observations:

  1. Palantir is a hybrid outfit; that is, it combines open source software, custom code, and consulting to generate revenue.
  2. Partners and integrators contribute expertise and software shims to allow a licensee to obtain a desired output from the Palantir system.
  3. Much of Palantir “runs” on cloud services; for example, Amazon Web Services.

Now that Palantir is a publicly traded company, the once stealthy firm which operated as a start up for more than a decade has to demonstrate that it is avoiding some of the public relations pitfalls for intelware and policeware vendors in the public eye.

How difficult is this task? Quite challenging in my opinion.

I am looking forward to the second installment of explaining Palantir.

Stephen E Arnold, November 17, 2020

Comments about Web Search: Prompted by a Hacker News Thread

November 13, 2020

I spotted a Web search related thread on Hacker News. You can locate the comments at this link. Several observations:

  1. Metasearch. Confusion seems to exist between dedicated Web search systems like Bing, Google, and Yandex and metasearch systems like DuckDuckGo and Startpage. Dedicated Web search systems require considerable effort, but there is little appreciation for the depth of the crawl, the index updating cycle, and similar factors.
  2. Competitors to Google. The comments present a list of search systems which are relatively well known. Omitted are some other services; for example, iSeek, Swisscows, and 50kft.
  3. Bias. The comments do not highlight some of the biases of Web search systems; for example, when pages are reindexed and which pages are on a slow or never-update cycle, blacklisted, or processed against a stop word list.

So what?

  1. Many profess to be experts at finding information online. The comments suggest that perception is different from reality.
  2. Locating content on publicly accessible Web sites is more difficult than at any other time in my professional career in the online information sector.
  3. Locating relevant information is increasingly time consuming because predictive, personalized, and wisdom of crowd results don’t work; for example, run this query on any of the search engines:

Voyager search

Did your results point to the Voyager Labs system, the UK HR company’s search engine, a venture capital firm, or a Lucene repackager in Orange County? What about Voyager patents? What about Voyager customers?

How can one disambiguate when the index scope is unknown, entity extraction is almost non existent, and deduplication almost laughable? Real time? Ho ho ho.

One can do this work manually. Who wants to volunteer for that? The most innovative specialized search vendors try to automate the process. Some of these systems are helpful; most are not.

Is search getting better? Rerun that Voyager search. See for yourself.

Without field codes, Boolean, and a mechanism to search across publicly accessible content domains, Web search reveals its shortcomings to those who care to look.

Not many look, including professionals at some of the better known Web search outfits.

Stephen E Arnold, November 13, 2020

Voyager Search Tapped for USDA Search and Discovery Project

November 4, 2020

Low-profile enterprise search company Voyager Search just made an important deal with a high-profile government agency. AIThority announces, “New Light Technologies and Voyager Search Team Win New Contracts with the U.S. Department of Agriculture to Implement Data Search and Discovery Solutions.” Voyager’s partner in the project, New Light Technologies (NLT), is a consulting firm working in the areas of cloud tech, cybersecurity, software development, data analytics, geospatial tech, and scientific R&D. The write-up reports:

“Access to accurate information is crucial to the department’s mission to support sustainable agriculture production and protection of natural resources. Both NLT and Voyager Search bring many years of experience developing award-winning federal data integration and dissemination platforms and will build federated data search solutions to index and link disparate cloud-based and on-prem data sources, including large repositories of imagery and geospatial data files that are used for a variety of analytical reporting and data dissemination systems, such as the Global Agricultural Information Network, Global Agricultural & Disaster Assessment System, Crop Explorer, and the Geospatial Data Gateway. Leveraging NLT and Voyager Search’s Professional Services Department and Vose technology which provides robust spatial search capabilities, the team’s solution will enable users to search for data, content, and documents by who, what, when, and where. Together, the team is providing the technology and services to advance a modern data architecture for the department that will support improved information flow, security, and analysis as well as power the Artificial Intelligence (AI) and Machine Learning (ML) of the future.”

“Voyager” is a popular name for a business, so do not confuse Voyager Search with other enterprises like digital innovation firm Voyager, manufacturer Voyager Industries, or even the Voyager Company that pioneered CD-ROM production back in the day. Vose is the name of Voyager Search’s platform that will be used for the USDA project, but the company also offers Server, essentially Vose for larger implementations, and ODN (Open Data Network), a searchable global-content catalog. Both products build on Vose’s “smart spatial search” technology. Based in Redlands, California, Voyager Search was founded in 2008.

Cynthia Murrell, November 4, 2020

Google Reveals Its Aspiration: Everything

October 30, 2020

An online publication called Gadgets360 published “Google Renames the Chromebook Search Button to the Everything Button.” The lowly capitalization lock key has been identified as expendable. By repurposing a way to create CAPS, Google has performed two vital services:

  1. Easier access to search
  2. A way to reveal its aspiration: To be “everything” to a human user.

The article states:

Google is renaming a button on Chrome OS PC keyboards to ‘Everything Button.’ … Google said that the new name for the Launcher button was chosen to reflect user feedback; the search giant hoped that the inclusion of the new name for the button will help highlight that Chromebook laptops have a dedicated button on their keyboards. Clicking on the Everything Button will open up a search bar through which you can search for things on Google, as well as for apps and files on the Chrome OS machine.

Interesting. What about confusion with the freeware application called Everything? David Carpenter at Voidtools.com has offered his useful information retrieval software for several years. Google is indeed innovative and proving that it is “everything” a me-too outfit would want to be.

Stephen E Arnold, October 30, 2020

Newspaper Search: Another Findability Challenge

October 13, 2020

Here is an interesting project any American-history enthusiast could get lost in for hours: Newspaper Navigator. I watched the home page’s 15-minute video, which gives both an explanation of the search tool’s development and a demo. Then I played around with the tool for a bit. Here’s what I learned.

Created by Ben Lee, the Library of Congress’s 2020 Innovator in Residence, Newspaper Navigator is built on the Library of Congress’s Chronicling America, a search portal that allows one to perform keyword searches on 16 million pages of historical US newspapers using optical character recognition. That is a great resource, but how does one go about an image search for such a collection? That’s where Newspaper Navigator comes in.

Lee used thousands of annotations of the collection’s visual content, created by volunteers in the Library’s Beyond Words crowdsourcing initiative of 2017, to train a machine learning model to recognize visual content. (He released the dataset, which can be found here. He also created hundreds of prepackaged downloadable datasets organized by year and type, like maps, photos, cartoons, et cetera.) The Newspaper Navigator search interface allows users to plumb 1.5 million high-confidence, public-domain photos from newspapers published between 1900 and 1963. The app allows for standard search, but the juicy bit is the ability to search by visual similarity using machine learning.

Lee walks us through two demo searches: one that begins with the keyword “baseball” and one that begins with “sailboat.” One can filter by location and time frame, then hover over results to get more info on the image itself and the paper in which it appeared. Select images to build a Collection, then tap into the AI prowess via the “Train my AI Navigators” button. The AI uses the selected images to generate a page of similar images, each with a clickable + or – button. Clicking these tells the tool which images are more and which are less like what is desired. Click “Train my AI Navigators” again to generate a more refined page, and repeat until only (or almost only) the desired type of image appears. When that happens, clicking the Save button creates a URL to take one right back to those results later.
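The plus-and-minus loop described above is a form of relevance feedback. The Python below is a minimal sketch of one classic way to do it, a Rocchio-style update over feature vectors. This is a swapped-in textbook technique, not a description of how Newspaper Navigator works internally; the two-dimensional vectors, the image names, and the `alpha`, `beta`, and `gamma` weights are all assumptions made for the illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean(vectors, dim):
    """Component-wise mean of a list of vectors; a zero vector if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def refine_query(query, liked, disliked, alpha=1.0, beta=0.75, gamma=0.25):
    """Rocchio-style update: pull the query toward '+' items, push it away from '-' items."""
    dim = len(query)
    pos = mean(liked, dim)
    neg = mean(disliked, dim)
    return [alpha * q + beta * p - gamma * n for q, p, n in zip(query, pos, neg)]


def rank(query, collection):
    """Return item names ordered from most to least similar to the query vector."""
    return sorted(collection, key=lambda name: cosine(query, collection[name]), reverse=True)


if __name__ == "__main__":
    # Toy feature vectors standing in for image embeddings (assumed values).
    images = {
        "sailboat_a": [1.0, 0.0],
        "sailboat_b": [0.9, 0.1],
        "steamship": [0.1, 1.0],
    }
    query = [0.5, 0.5]
    # One round of feedback: '+' on sailboat_a, '-' on steamship.
    refined = refine_query(query, [images["sailboat_a"]], [images["steamship"]])
    print(rank(refined, images))
```

Each click on “Train my AI Navigators” would correspond to another call to `refine_query` with the accumulated plus and minus selections. Lee's caveat in the next paragraph applies to a scheme like this too: a few misleading selections can drag the query in the wrong direction, and starting over is sometimes the quickest repair.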

Lee notes that machine learning is not perfect, and some searches lend themselves to refinement better than others. He suggests starting again and retraining if results start refining themselves in the wrong direction.

The video acknowledges the potential marginalization issues in any machine learning project. Click on the Data Archaeology tab to read about Lee’s investigation of the Navigator dataset and app from the perspective of bias.

I suggest curious readers play around with the search app for themselves. Lee closes by inviting users to share their experiences through LC-Labs@loc.gov or on Twitter @LC_Labs, #NewspaperNavigator.

Cynthia Murrell, October 13, 2020

Does Search Breed Fraud?

October 11, 2020

The question “Does search breed fraud?” is an interesting one. As far as I know, none of the big time MBA case studies address the topic. If any academic discipline knows about fraud, I believe it is those very same big time MBA programs.

“South Korean Search Giant Fined US $23 Million for Manipulating Results” reveals that Naver has channeled outfits with a penchant for results fiddling. The write up states:

The Korea Fair Trade Commission, the country’s antitrust regulator, ruled Naver altered algorithms on multiple occasions between 2012 and 2015 to raise its own items’ rankings above those of competitors.

Naver responded, according to the write up, with this statement:

“The core value of search service is presenting an outcome that matches the intentions of users,” it said in a statement, adding: “Naver has been chosen by many users thanks to our focus on this essential task.”

The pressure to generate revenue is significant. Engineers, who may be managed loosely or steered by the precepts of high school science club thought processes, can make tiny changes with significant impact. As a result, the manipulation can arise from a desire to get promoted, be cool, or land a bonus.

The implications can be profound. Google may be less evil because fiddling is an emergent behavior.

Stephen E Arnold, October 11, 2020

An Oath from the Past: Yahoo Web Scale Semantic Search

October 9, 2020

I spotted a link to “Yahoo: Web Scale Semantic Search.” You remember Yahoo, don’t you? This is the outfit with the data breaches, the clueless business model, and the sale to the Baby Bell Verizon. The executives too are memorable: Marissa, Alex, Terry, and the Peanut Butter memo man.

The link displayed a presentation by Edgar Meij, a laborer in Yahoo Labs. The topic was an X ray view from Mt. Olympus intended to reveal Web scale semantic search.

The slide deck requires 62 clicks to traverse. There are many riches in the presentation. I want to highlight three of these, and invite you to make your own determination of these insights.

First, there is a “text” accompanying the deck. It contains a riot of jargon and buzzwords. In fact, I have saved the text, despite a portion being truncated, as a glossary of Web search jive talk; for example “s a sequence of terms s 2 s drawn from the set S, s ? Multinomial(?s) e a set of entities e 2 e.” (I knew you would experience the same thrill I did when I read this line.) True to Slideshare’s attention to detail, the text for slides 32 to 62 has been removed. Great loss indeed.

Second, Yahoo cares about knowledge. Consider this diagram:

image

The idea is that there are three stages: knowledge acquisition (I assume this means scraping and indexing Web site content), knowledge integration (creating a big index), and knowledge consumption (maybe finding something when a user or system sends a query to the search subsystem). The key point is “knowledge” is important. How about that? Yahoo search was focusing on knowledge? Is that why Yahoo floundered in search for many, many years before bowing to failure?

Third, Yahoo’s approach to semantic search requires humans. Here’s proof:

image

When Yahoo announced Vin Diesel was dead, he was alive. So much for smart software.

Why am I mentioning this blast from the past?

Knowledge was talked about in my interview/discussion with Dr. Stavros Macrakis. We tackled the difference between Web search and enterprise search. This Yahoo deck illustrates that talk about knowledge is one thing. Delivering useful results to a user is quite another.

Jargon in search and retrieval has made more progress than search technology itself. That’s why the Yahoo deck could have been crafted yesterday by one of the search vendors still chasing a huge market in the era of Lucene/Solr and “good enough” information access.

Stephen E Arnold, October 9, 2020
