Glean: Another Enterprise Search Solution

October 12, 2021

Enterprise search is interesting, but users tend to accept its shortcomings, like unfindable content and sluggish indexing, as unavoidable tech problems. A former Google engineering director recognized the problem when he started his own startup, and the Forbes article “Glean Emerges from Stealth With $55 Million To Bring Search To The Enterprise” tells the story.

Arvind Jain cofounded the cloud data management company Rubrik and always had problems locating information. Rubrik is now worth $3.7 billion, but Jain left and formed the new startup Glean with Google veterans Piyush Prahladka, Tony Gentilcore, and T.R. Vishwanath. The team has developed a robust enterprise search application that reaches across a company’s many applications. Glean has raised $55 million in funding.

Other companies like Algolia and Elastic have addressed the same enterprise search problem, but they focused on search boxes for consumer-facing Web sites rather than tools for employees. With more enterprise systems shifting to the cloud and SaaS, Glean’s search product is an invaluable tool. Innovations in deep learning also make Glean’s search product more intuitive and customizable for each user:

“On the user side, Glean’s software analyzes the wording of a search query—for example, it understands that “quarterly goals” or “Q1 areas of focus” are asking the same thing—and shows all the results that correspond to it, whether they are located in Salesforce, Slack or another of the many applications that a company uses. The results are personalized based on the user’s job. Using deep learning, Glean can differentiate personas, such as a salesperson from an engineer, and tailor recommendations based on the colleagues that a user interacts with most frequently.”
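Glean has not published its internals, but the “quarterly goals” versus “Q1 areas of focus” example suggests embedding-based query matching. A minimal sketch, assuming the open source sentence-transformers package and its public MiniLM model (an assumption, not Glean’s stack):

```python
# A hedged sketch, not Glean's actual code: sentence embeddings score
# differently worded queries as near-equivalent. Assumes the
# sentence-transformers package and its public MiniLM model.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

def similarity(a: str, b: str) -> float:
    # Encode both strings, then compare with cosine similarity.
    va, vb = model.encode([a, b])
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

# A high score tells the engine both queries want the same documents.
print(round(similarity("quarterly goals", "Q1 areas of focus"), 2))
```

The persona and colleague-graph personalization the quote describes would sit on top of a retrieval layer like this one.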

Will Glean crack the enterprise search code? Interesting question to which the answer is not yet known.

Whitney Grace, October 12, 2021

SEO Relevance Destroyers and Semantic Search

August 18, 2021

Search Engine Journal describes to SEO professionals how the game has changed since the early days, when it was all about keywords and backlinks, in “Semantic Search: What it Is & Why it Matters.” Writer Aleh Barysevich emphasizes:

“Now, you need to understand what those keywords mean, provide rich information that contextualizes those keywords, and firmly understand user intent. These things are vital for SEO in an age of semantic search, where machine learning and natural language processing are helping search engines understand context and consumers better. In this piece, you’ll learn what semantic search is, why it’s essential for SEO, and how to optimize your content for it.”

Semantic search strives to comprehend each searcher’s intent, a query’s context, and the relationships between words. The increased use of voice search adds another level of complexity. Barysevich traces Google’s semantic search evolution from 2012’s Knowledge Graph to 2019’s BERT. SEO advice follows, including tips like these: focus on topics instead of keywords, optimize site structure, and continue to offer authoritative backlinks. The write-up concludes:

“Understanding how Google understands intent in intelligent ways is essential to SEO. Semantic search should be top of mind when creating content. In conjunction, do not forget about how this works with Google E-A-T principles. Mediocre content offerings and old-school SEO tricks simply won’t cut it anymore, especially as search engines get better at understanding context, the relationships between concepts, and user intent. Content should be relevant and high-quality, but it should also zero in on searcher intent and be technically optimized for indexing and ranking. If you manage to strike that balance, then you’re on the right track.”

Or one could simply purchase Google ads. That’s where traffic really comes from, right?

Cynthia Murrell, August 18, 2021

Algolia and Its View of the History of Search: Everyone Has an Opinion

August 11, 2021

Search is similar to love, patriotism, and ethical behavior. Everyone has a different view of the nuances of meaning in a specific utterance. Agree? Let’s assume you cannot define one of these words in a way that satisfies a professor from a mid-tier university teaching a class of 20 college sophomores who signed up for something to do with Western philosophy: Post Existentialism. Imagine your definition. I took such a class, and I truly did not care. I wrote down the craziness the brown-clad PhD provided, got my A, and never gave that stuff a thought. And you, gentle reader, are you prepared to figure out what an icon in an ibabyrainbow chat stream “means”? We captured a stream for one of my lectures to law enforcement in which she says, “I love you.” Yeah, right.

Now we come to “Evolution of Search Engines Architecture – Algolia New Search Architecture Part 1.” The write up explains finding information and its methods through the lens of Algolia, a publicly traded firm. Search, which is never defined, characterizes the level of discourse about findability. The write up describes an early method which permitted a user to query by key words. This worked like a champ as long as the person doing the search knew what words to use, like “nuclear effects modeling.”
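For readers who did not live through that era, a toy inverted index shows why the approach works only when the searcher knows the magic words:

```python
# A toy inverted index of the early key word era: each term maps to the
# documents containing it, and a query is a set intersection.
from collections import defaultdict

docs = {
    1: "nuclear effects modeling for test scenarios",
    2: "effects of modeling clay on classroom morale",
    3: "nuclear reactor cooling systems",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query: str) -> set:
    # AND semantics: every query term must appear in the document.
    term_sets = [index[t] for t in query.lower().split()]
    return set.intersection(*term_sets) if term_sets else set()

print(search("nuclear effects modeling"))  # {1}: a champ, if you know the words
```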

The big leap was faster computers and clever post-Verity methods of getting distributed indexes to mostly work. I want to mention that Exalead (which may have had an informing role to play in Algolia’s technical trajectory) was a benchmark system. But, alas, key words are not enough. The Endeca facets were needed. Because humans had to do the facet identification, the race was on to get smart software to do a “good enough” job so old school commercial database methods could be consigned to a small room in the back of a real search engine outfit.
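Roughly speaking, the facet idea Endeca popularized amounts to counting metadata values over a result set so the user can filter without guessing key words. A toy illustration, with invented data:

```python
# A toy facet count in the spirit of Endeca-style guided navigation:
# tally the values of a metadata field across the hits so the interface
# can offer filters. Someone (or some software) still must assign facets.
from collections import Counter

hits = [
    {"title": "Report A", "type": "pdf", "year": 2020},
    {"title": "Report B", "type": "pdf", "year": 2021},
    {"title": "Memo C",   "type": "doc", "year": 2021},
]

def facet(hits, field):
    return Counter(h[field] for h in hits)

print(facet(hits, "type"))  # Counter({'pdf': 2, 'doc': 1})
print(facet(hits, "year"))  # Counter({2021: 2, 2020: 1})
```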

Algolia includes a diagram of the post-Alta Vista, post-Google world. The next big leap was scaling that world. What’s interesting is that, in my experience, most search problems involve processing smaller collections of information containing disparate content types. What’s this mean? When were you able to use a free Web search system or an enterprise search system like Elastic or Yext to retrieve text, audio, video, engineering drawings and their associated parts data, metadata from surveilled employee E2EE messages, and TikTok video résumés or the wildly entertaining puff stuff on LinkedIn? The answer is, and probably will be for the foreseeable future, “No.” And what about real time data, the content on a salesperson’s laptop with the changed product features and customer-specific pricing? Oh, right. Some people forget about that. Remember: I am talking about a “small” content set, not the wild and crazy Internet indexes. Where are those changed files on the Department of Energy Web site? Hmmm.

The fourth part of the “evolution” leaps to keeping cloud-centric, third-party hosted systems chugging along. Have you noticed the latency when using the OpenText cloud system? What about the display of thumbnails on YouTube? What about retrieving a document from a content management system before lunch, only to find that the system reports, “Document not found”? Yeah, but. Okay, yeah but nothing.

The final section of the write up struck me as a knee slapper. Algolia addresses the “current challenges of search.” Okay, and what are these from the Algolia point of view? The main points have to do with using a cloud system to keep the service up and running without trashing response time. That’s okay, but without a definition of search, fixes like separating search and indexing may not be the architectural solution. One example is processing streams of heterogeneous data in real time. This is a big thing in some circles, and highly specialized systems are needed to “make sense” of what’s rushing into a system. Now means now, not a latency-centric method which has remained largely unchanged for, what, maybe 50 years.

What is my view of “search”? (If you are a believer that today’s search systems work, stop reading.) Here you go:

  1. One must define search; for example, chemical structure search, code search, HTML content search, video search, and so on. Without a definition, explanations are without context and chock full of generalizations.
  2. Search works when the content domain is “small” and clearly defined. A one size fits all content is pretty much craziness, regardless of how much money an IPO’ed or SPAC’ed outfit generates.
  3. The characteristic of the search engines my team and I have tested over the last — what is it now, 40 or 45 years? — is that whatever system one uses is “good enough.” The academic calculations mean zero when an employee cannot locate the specific item of information needed to deal with a business issue or a student wants to locate a source for a statement about voter fraud. Good enough is state of the art.
  4. The technology of search is like a 1962 Corvette. It is nice to look at but terrible to drive.

Net net: Everyone is a search expert now. Yeah, right. Remember: The name of the game is sustainable revenue, not precision and recall, high value results, or the wild and crazy promise that Google made for “universal search”. Ho ho ho.

Stephen E Arnold, August 11, 2021

NSO Group: The Rip in the Fabric of Intelware

July 22, 2021

A contentious relationship with the “real news” organizations can be risky. I have worked at a major newspaper and a major publisher. The tenacity of some of my former colleagues is comparable to the grit one associates with an Army Ranger or Navy Seal, just with a slightly more sensitive wrapper. Journalists favored semi with-it clothes, not bushy beards. The editorial team was more comfortable with laptops than an FN SCAR.

Communications associated with NSO Group — the headline magnet among the dozens of Israel-based specialized software companies (a very close-in group, by the way) — may have torn the fabric shrouding the relationships among former colleagues in the military, government agencies, their customers, and their targets.

Who’s to blame? The media? Maybe. I don’t have a dog in this particular season of fights. The action promises to be interesting and potentially devastating to some comfortable business models. NSO Group is just one of many firms working to capture the money associated with cyber intelligence and cyber security. The spat between the likes of the journalists at the Guardian and the Washington Post and NSO Group appears to be diffusing like spilled ink on a camouflage jacket.

I noted “Pegasus Spyware Seller: Blame Our Customers Not Us for Hacking.” The main point seems to be that NSO Group allegedly suggests that those entities licensing the NSO Group specialized software are responsible for their use of the software. The write up reports:

But a company spokesman told BBC News: “Firstly, we don’t have servers in Cyprus.

“And secondly, we don’t have any data of our customers in our possession.

“And more than that, the customers are not related to each other, as each customer is separate.

“So there should not be a list like this at all anywhere.”

And the number of potential targets did not reflect the way Pegasus worked.

“It’s an insane number,” the spokesman said.

“Our customers have an average of 100 targets a year.

“Since the beginning of the company, we didn’t have 50,000 targets total.”

For me, the question becomes, “What controls exist within the Pegasus system to manage the usage of the surveillance system?” If there are controls, why are these not monitored by an appropriate entity; for example, an oversight agency within Israel? If there are no controls, has Pegasus become an “on premises” install set up so that a licensee has a locked down, air tight version of the NSO Group tools?

The second item I noticed was “NSO Says ‘Enough Is Enough,’ Will No Longer Talk to the Press About Damning Reports.” At first glance, I assumed that an inquiry was made by the online news service and the call was not returned. That happens to me several times a day. I am an advocate of my version of cancel culture. I just never call the entity again and move on. I am too old to fiddle with the ego of a younger person who believes that a divine entity has given that individual special privileges. Nope, delete.

But not NSO Group. According to the write up:

“Enough is enough!” a company spokesperson wrote in a statement emailed to news organizations. “In light of the recent planned and well-orchestrated media campaign lead by Forbidden Stories and pushed by special interest groups, and due to the complete disregard of the facts, NSO is announcing it will no longer be responding to media inquiries on this matter and it will not play along with the vicious and slanderous campaign.” NSO has not responded to Motherboard’s repeated requests for comment and for an interview.

Okay, the enough is enough message is allegedly in “writing.” That’s better than a fake message disseminated via TikTok. However, the “real journalists” are likely to become more persistent. Despite a lack of familiarity with the specialized software sector, a large number of history majors and liberal arts grads can do what “real” intelligence analysts do. Believe me, there’s quite a bit of open source information about the cozy relationship within and among Israel’s specialized software sector, the interaction of these firms with certain government entities, and public messages parked in unlikely open source Web sites to keep the “real” journalists learning, writing, and probing.

In my opinion, allowing specialized software services to become public; that is, actually talking about the capabilities of surveillance and intercept systems, was a very, very bad idea. But money is money and sales are sales. Incentive schemes for the owners of specialized software companies guarantee that I could spend eight hours a day watching free webinars that explain the ins and outs of specialized software systems. I won’t, but some of the now ignited flames of “real” journalism will. They will learn almost exactly what is presented in classified settings. Why? Capabilities explained in public and in secret forums use almost the same slide decks, the same words, and the same case examples, which vary only in the level of detail presented. This is how marketing works in my opinion.

Observations:

1. A PR disaster is, it appears, becoming a significant political issue. This may pose some interesting challenges within the Israel centric specialized software sector. NSO Group’s system ran on cloud services like Amazon’s until AWS allegedly pushed Pegasus out of the Bezos stable.

2. A breaker of the specialized software business model of selling to governments and companies. The cost of developing, enhancing, and operating most specialized software systems keeps companies on the knife edge of solvency. The push into commercial use of the tools by companies or consumerizing the reports means government contracts will become more important if the non-governmental work is cut off. Does the world need several dozen Dark Web indexing outfits and smart time line and entity tools? Nope.

3. A boost to bad actors. The reporting in the last week or so has provided a detailed road map to bad actors in some countries about [a] what can be done, [b] how systems like Pegasus operate, [c] the inherent lack of security in systems and devices charmingly labeled “insecure by design” by a certain big software company, and [d] specific pointers to the existence of zero day opportunities in blast door protected devices. That’s a hoot at the “Console.”

Net net: The NSO Group “matter” is a very significant milestone in the journey of specialized software companies. The reports from the front lines will be fascinating. I anticipate excitement in Belgium, France, Germany, Israel, the United Kingdom, and a number of other countries. Maybe a specialized software Covid Delta?

Stephen E Arnold, July 22, 2021

A Xoogler Wants to Do Search: Channeling Exalead and Smoking InfinitySearch?

July 16, 2021

Remember Exalead? This was a search engine created by a person who was asked to join the Google. The system was very good: 64-bit architecture, timely indexing of new and previously indexed sites, and novel features like searching via text for a specific point in an Exalead-processed video. The system is now part of Dassault Systèmes, whose senior management grew frustrated with one of the aggressively marketed “smart systems” available in the mid 2000s.

Now a Xoogler realizes that Google search is just an artifact of the Backrub search and retrieval system. What was “clever” in 1998 is now generally a version of MySpace.com. Maybe anigifs are in the fridge waiting to become the next big thing at the GOOG.

Now there’s a new Google called Neeva, a subscription-based, allegedly non-tracking, ad-free alternative to Google. Plus, Neeva is out of beta—let the marketing begin! Fast Company explores the new search engine and its developers in depth in, “Inside Neeva, the Ad-Free, Privacy-First Search Engine from Ex-Googlers.” (Keep in mind that InfinitySearch.co is a new search engine with an almost identical subscription business model. Haven’t heard of InfinitySearch? Hmmm. What about Okeano? Oh, not that system either? Hmmm.)

Co-founders Sridhar Ramaswamy and Vivek Raghunathan, who both used to work at Google, had front-row seats to the dominant search engine’s evolution. They were unhappy to see advertising become more and more intrusive over the years. They are betting many users are ready to pay $4.95 a month to access what Google could have been if it were not in hot pursuit of the almighty ad dollar. Anyone who has been googling for years has watched ads migrate from a relatively unobtrusive position on the right of the page to the top of search results. For a while after that shift they were delineated by a shaded box, but now they blend suspiciously into the organic results. Google also started pushing links to its own services to the top, even when a competitor might better serve the searcher’s needs. The Fast Company write up states:

“Then there’s the fact that Google builds profiles of its users based on their online activity, the better to precisely target them with advertising not only at its own sites but all the other ones across the web whose ads are powered by Google. With no ads to serve up, Neeva shouldn’t leave privacy-conscious types feeling like they’re being monitored for ulterior purposes. (By default, Neeva does hold onto your searches for 90 days to improve the quality of features such as autosuggestions, but you can erase this log or tell the service you don’t want it to keep it in the first place.) In another break from search-engine tradition, Neeva says that it will turn at least 20 percent of its top-line revenue over to publishing partners, including the first two it’s announced, Quora and Medium. Though the details of where this could lead remain vague, it’s another attempt to set Neeva apart from Google, which has often been accused of benefiting from media outlets’ content without adequate compensation, a long-simmering dispute that has led to lawsuits and legislation.”

The founders hired several other ex-Googlers. The team worked to create a platform that is close enough to their former employer’s to feel familiar while nixing all the advertising misery. To do this, Neeva blends its own indexing with results from Apple, Bing, Yelp, Intrinio, Weather.com, Xignite, and even Google Maps. McCracken reports the platform performs well for most tasks, falling short only on local searches. There is also the small inconvenience that, as of this writing, Chrome is the only browser that lets one set Neeva as the default search platform. Is this an acquisition-friendly move? See the Fast Company article for more on Neeva’s features as well as details on Ramaswamy’s and Raghunathan’s experiences that led them down the path to this adventure.
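Neeva’s pipeline is not public, but the blending Fast Company describes follows a familiar metasearch pattern: query several sources concurrently, then merge by score. A hedged sketch, with hypothetical fetch functions standing in for the real upstream connectors:

```python
# A sketch of result blending; the fetch_* functions are hypothetical
# stand-ins for Neeva's own index and its partner APIs.
import asyncio

async def fetch_own_index(query: str) -> list:
    return [{"url": "https://a.example", "score": 0.9, "source": "own"}]

async def fetch_partner_api(query: str) -> list:
    return [{"url": "https://b.example", "score": 0.7, "source": "partner"}]

async def blended_search(query: str) -> list:
    # Query all sources concurrently, then merge into one ranked list.
    result_lists = await asyncio.gather(
        fetch_own_index(query), fetch_partner_api(query)
    )
    merged = [r for results in result_lists for r in results]
    return sorted(merged, key=lambda r: r["score"], reverse=True)

print(asyncio.run(blended_search("garage floor epoxy louisville")))
```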

And you can check out Exalead search at this link. Yep, still online. May I suggest the Web, video, and forums search be expanded and enhanced. As I said, it was quite good.

Cynthia Murrell, July 16, 2021

Google and Unreliable Results: Like the Jack Benny One Liner, I Am Thinking, I Am Thinking

June 25, 2021

I read a “real” news story called “Google Is Starting to Warn Users When It Doesn’t Have a Reliable Answer.” (No, I will not ask what “reliable” means.)

Here’s the statement which snagged my attention in the write up:

“When anybody does a search on Google, we’re trying to show you the most relevant, reliable information we can,” said Danny Sullivan, a public liaison for Google Search. “But we get a lot of things that are entirely new.” Sullivan said the notice isn’t saying that what you’re seeing in search results is right or wrong — but that it’s a changing situation, and more information may come out later.

I think Mr. Sullivan, a former search engine optimization guru and conference organizer, is the “new” Matt Cutts, a Google professional helping to point the way to the digital future at the US government. Is key word packing the path to more patents than China?

I loved this statement, which I know is pretty Tasmanian devil like: “Most relevant, reliable information we can.” I did a query for garage floor epoxy coating in Louisville. I gathered about 20 businesses displayed on the first two pages of Google search results. Two companies were actually in this business. Others were out of business. One “company” called me back and said, “My loser son has been gone for two years.”

I have other examples as well of search either being out of date, spoofed, or just weird.

Let’s look at some of the reasons why Google made a statement about “reliable answers.”

First, I think the difficulty of providing real-time indexing is beyond the Google’s capabilities: Outfits with real time content won’t play ball with Google unless Google pays up and works out a mechanism to move the content to a Google indexing queue. (Yep, queue as in long line at the McDonald’s drive through.)

Second, Google is not set up to do real time. I think the notion of having a short list of “must ping frequently” sites may be a holdover from the distant past. The reason? As the costs of indexing, updating, and making the Google indexes “consistent” climb, some of the practices no longer fit the current iteration of “relevant” and “reliable.” Google is not Twitter, and it is not Facebook. Therefore, the pipelines for real time content simply don’t exist. Googlers tried but seemed to be better at selling ads than dealing with new content types.

Third, hot info appears in non text form on Instagram, TikTok, and even places like DailyMotion and Vimeo sometimes days before the content plops into YouTube. Ever try to locate a video using the creator-assigned index terms? That’s an exercise in futility. Ads, gentle reader, not relevant and reliable information.

From my vantage point on the porch overlooking a mine drainage pond, I have some hypotheses:

  1. Google is under financial pressure, competitive pressure from Amazon and Facebook, and legal pressure. Almost any nation state with an appetite to drag the Google into court is in gear.
  2. Google is just not able to handle the real time flows of content, either textual or imagery. Too bad, but that’s the excitement of Hegel’s thesis, antithesis, synthesis, which “real” Googlers learn along with search engine optimization marketing methods.
  3. Google’s propagandistic and jingoistic assurances that it returns relevant and reliable results are more and more widely seen as key word spam.
  4. Google’s management methods are not tuned for the current business environment. I may be alone in noticing that high school science club thinking and management from assumed superiority are out of favor. (If Sergey Brin were to ride a Russian rocket into space, would he attract more signatures than Jeff Bezos? The quasi referendum did not want Mr. Bezos to return to earth. Mr. Brin’s ride did not materialize, so I won’t know who “won” the most votes.)

Net net: Relevant and reliable. That’s a line worthy of Jack Benny when he is asked about Fred Allen. I give up: what does “reliable” mean, Googlers? My suggestion: marketing hoo haa with metatags.

Stephen E Arnold, June 25, 2021

Surveillance: Looking Forward

May 28, 2021

I read “The Future of Communication Surveillance: Moving Beyond Lexicons.” The article explains that word lists and indexing are not enough. (There’s no mention of non text objects and icons with specific meanings upon which bad actors agree before including them in a text message.)

I noted this passage:

Advanced technology such as artificial intelligence (AI), machine learning (ML) and pre-trained models can better detect misconduct and pinpoint the types of risk that a business cares about. AI and ML should work alongside metadata filtering and lexicon alerting to remove irrelevant data and classify communications.

This sounds like cheerleading. The Snowden dump of classified material makes clear that smart software was on the radar of the individuals creating the information released to journalists. Subsequent announcements from policeware and intelware vendors have included references to artificial intelligence and its progeny as a routine component. It has been years since the assertions in the Snowden documents became known, and yet shipping cyber security solutions are not delivering.
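For reference, the lexicon-plus-metadata pipeline the article touts is not exotic. A toy version, with illustrative terms and rules (not any vendor’s actual configuration), looks like this:

```python
# A toy compliance filter: metadata rules remove irrelevant traffic,
# then a lexicon flags messages for review. The terms and the domain
# rule are illustrative assumptions, not a vendor's configuration.
LEXICON = {"guarantee", "off the books", "delete this"}
INTERNAL = "ourfirm.example"

def needs_review(message: dict) -> bool:
    # Metadata filter: skip purely internal chatter (illustrative rule).
    if message["to"].endswith(INTERNAL) and message["from"].endswith(INTERNAL):
        return False
    # Lexicon alert: flag if any risky phrase appears in the body.
    body = message["body"].lower()
    return any(phrase in body for phrase in LEXICON)

msg = {"from": "trader@ourfirm.example", "to": "client@other.example",
       "body": "I can guarantee this return."}
print(needs_review(msg))  # True
```

The hard part, as the SolarWinds list below suggests, is what such systems miss, not what they flag.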

The article includes this statement about AI:

Automatically learn over time by taking input from the team’s review of prior alerts

And what about this one? AI can

Adapt quickly to changing language to identify phrases you didn’t know you needed to look for

What the SolarWinds misstep revealed was:

  1. None of the smart cyber security systems noticed the incursion
  2. None of the smart real time monitoring systems detected repeated code changes and downstream malware within the compromised system
  3. None of the threat alert services sent a warning to users of compromised systems.

Yet we get this write up about the future of surveillance?

Incredible and disconnected from the real life performance of cyber security vendors’ systems.

Stephen E Arnold, May 28, 2021

Another Way to Inject Ads into Semi-Relevant Content?

May 25, 2021

It looks like better search is just around the corner. Again. MIT Technology Review proclaims, “Language Models Like GPT-3 Could Herald a New Type of Search Engine.” Google’s PageRank has reigned over online search for over two decades. Even today’s AI search tech works as a complement to that system, used to rank results or better interpret queries. Now Googley researchers suggest a way to replace the ranking system altogether with an AI language model. This new technology would serve up direct answers to user queries instead of supplying a list of sources. Writer Will Douglas Heaven explains:

“The problem is that even the best search engines today still respond with a list of documents that include the information asked for, not with the information itself. Search engines are also not good at responding to queries that require answers drawn from multiple sources. It’s as if you asked your doctor for advice and received a list of articles to read instead of a straight answer. Metzler and his colleagues are interested in a search engine that behaves like a human expert. It should produce answers in natural language, synthesized from more than one document, and back up its answers with references to supporting evidence, as Wikipedia articles aim to do. Large language models get us part of the way there. Trained on most of the web and hundreds of books, GPT-3 draws information from multiple sources to answer questions in natural language. The problem is that it does not keep track of those sources and cannot provide evidence for its answers. There’s no way to tell if GPT-3 is parroting trustworthy information or disinformation—or simply spewing nonsense of its own making.”

The next step, then, is to train the AI to keep track of its sources when it formulates answers. We are told no models are yet able to do this, but it should be possible to develop that capability. The researchers also note the thorny problem of AI bias will have to be addressed for this approach to be viable. Furthermore, as search expert Ziqi Zhang at the University of Sheffield points out, technical and specialist topics often stump language models because there is far less relevant text on which to train them. His example—there is much more data online about e-commerce than quantum mechanics.
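The pattern the researchers describe, retrieve supporting passages first and then generate an answer that cites them, can be sketched in a few lines. This is a toy, with a placeholder generate() function rather than a real model API:

```python
# A hedged sketch of retrieve-then-answer with citations. generate() is
# a stand-in for a language model call; the corpus is toy data.
def retrieve(query: str, corpus: list, k: int = 2) -> list:
    # Toy relevance: count words shared between query and passage.
    q = set(query.lower().split())
    return sorted(corpus,
                  key=lambda p: len(q & set(p["text"].lower().split())),
                  reverse=True)[:k]

def generate(query: str, context: str) -> str:
    # Placeholder for a model conditioned on retrieved evidence.
    return f"Answer synthesized from: {context}"

def answer_with_citations(query: str, corpus: list) -> dict:
    passages = retrieve(query, corpus)
    answer = generate(query, " ".join(p["text"] for p in passages))
    # The step GPT-3 lacks: hand back the sources with the answer.
    return {"answer": answer, "sources": [p["url"] for p in passages]}

corpus = [
    {"url": "https://a.example", "text": "GPT-3 is a large language model"},
    {"url": "https://b.example", "text": "PageRank ranks pages by inbound links"},
]
print(answer_with_citations("what is a language model", corpus))
```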

Then there are the physical limitations. Natural-language researcher Hanna Hajishirzi at the University of Washington warns the shift to such large language models would gobble up vast amounts of memory and computational resources. For this reason, she believes a language model will not be able to supplant indexing. Which researchers are correct? We will find out eventually. That is okay; we are used to getting ever less relevant search results.

Cynthia Murrell, May 25, 2021

Web Search: In Flux

May 17, 2021

I listened to an interview conducted by the host of the Big Technology podcast and Sridhar Ramaswamy, the Xoogler who was in charge of Google Advertising for a number of years. Mr. Ramaswamy’s new venture is a subscription Web search engine. The interview was interesting, but I somehow missed the definition of what will be the “Web” content the system would index. I brought up this “missing question” at lunch today because the “Web” can mean different things to different searchers. Does the system search dynamic sites like those built on Shopify? Does it index forums and public discussion groups? Does it index password-protected but no-cost sites like Nextdoor.com? You get the idea without my tossing in videos, audio, and tabular data on government Web sites.

What the interview did not touch upon was the Infinity search system. You can get information about this $5.00 US per month service at this link. The system seems to be a combination of metasearch and proprietary indexing. Our tests, prior to its becoming a subscription service, were mixed. Overall, the results were not as useful as those retrieved from Swisscows.com, for example. The value proposition of the Xoogler’s subscription search service and Infinity seemed similar.

I want to mention that Yippy, the Web search component of Vivisimo, seems to have gone offline. I thought the Vivisimo service was interesting even though the company focused on selling itself to IBM and becoming a cog in the IBM Big Data Watson world. The on-the-fly clustering was as good as, if not better than, the original version of Northern Light clustering. As I listened to the explanation of why the time is right for subscription search of the Web (whatever that means), I wondered why Yippy did not push aggressively for subscription revenues. Perhaps subscription services make sense when plugging assumptions into an Excel model? In real life, subscriptions are difficult.

The reality of Web (whatever that means) search is that costs go up. The brutal fact is that once content is indexed, that content must be revisited and changes discerned. Indexing changed content keeps the information in the index for those sites fresh. Also, the flows of new content mean that wonky new sites like those tallied by Product Hunt have to be identified, indexed, and then passed to the update queue. Users are often indifferent to indexing update cycles. Web search engines have to allocate their resources among a number of different demands; for example, which sites get updated in near real time? Which sites get indexed every six months like the US government Railroad Retirement Board site? Which sites get a look every couple of months?
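The allocation problem reduces to a re-crawl schedule. A hedged sketch of the core mechanism, with made-up sites and intervals:

```python
# A toy re-crawl scheduler: each site carries a revisit interval, and a
# priority queue decides what the crawler fetches next. Real crawlers
# are far more elaborate; the sites and intervals are illustrative.
import heapq, time

# Entries: (next_due_timestamp, url, revisit_interval_in_seconds)
schedule = [
    (time.time(),         "https://news.example", 60),           # near real time
    (time.time() + 3600,  "https://blog.example", 86400),        # daily
    (time.time() + 86400, "https://rrb.gov",      180 * 86400),  # twice a year
]
heapq.heapify(schedule)

def next_fetch() -> str:
    # Pop the most overdue site, fetch it, and put it back on the queue.
    due, url, interval = heapq.heappop(schedule)
    # ... fetch and re-index url here ...
    heapq.heappush(schedule, (due + interval, url, interval))
    return url

print(next_fetch())  # the freshest-needed site comes out first
```

Every site added to the queue raises the recurring cost; that is why costs only go up.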

And what about the rich media? The discussion groups? The Web sites which change their method of presenting content so that a crawler just skips the site? How deep does the crawler go? What happens to images? What about sites which require users to do something to get access; for example, a user name, a password, and then authentication on a smartphone?

Net net: The world of Web search is in flux. It is more difficult than at any time in my professional life to locate specific information. Maybe subscription services will do the trick? My hunch is that the lessons of the DataStars and Dialcoms and Lycoses will be helpful to today’s innovators.

What? You don’t remember DataStar? That’s one of the issues experts in search and retrieval face: learning from yesterday’s innovators.

Stephen E Arnold, May 17, 2021

Why Metadata? The Answer: Easy and Good Enough

April 30, 2021

I read “We Were Promised Strong AI, But Instead We Got Metadata Analysis.” The essay is thoughtful and provides a good summary of indexing’s virtues. The angle of attack is that artificial intelligence has not delivered the zip a couple of bottles of Red Bull provides. Instead, metadata is more like four ounces of Sunny D tangy original.

The write up states:

The phenomenon of metadata replacing AI isn’t just limited to web search. Manually attached metadata trumps machine learning in many fields once they mature – especially in fields where progress is faster than it is in internet search engines. When your elected government snoops on you, they famously prefer the metadata of who you emailed, phoned or chatted to the content of the messages themselves. It seems to be much more tractable to flag people of interest to the security services based on who their friends are and what websites they visit than to do clever AI on the messages they send. Once they’re flagged, a human can always read their email anyway.

This is an accurate statement.
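Why is the metadata route so tractable? Because contact graphs need no language understanding at all. A toy illustration, with invented names and a one-hop rule of my own choosing:

```python
# A toy version of metadata flagging: mark accounts by who they talk
# to, ignoring message content entirely. The names and the one-hop
# rule are illustrative.
from collections import defaultdict

emails = [("alice", "bob"), ("alice", "eve"), ("carol", "eve"), ("dan", "bob")]
watchlist = {"eve"}

contacts = defaultdict(set)
for sender, recipient in emails:
    contacts[sender].add(recipient)
    contacts[recipient].add(sender)

# Anyone within one hop of a watchlisted account gets flagged.
flagged = {person for person, peers in contacts.items() if peers & watchlist}
print(flagged - watchlist)  # {'alice', 'carol'}
```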

The write up does not address a question I think is important in the AI versus metadata discussion. That question is, “Why?”

Here are some of the reasons I have documented in my books and writings over the years:

  1. Metadata is cheaper to process than paying to get smart software to work in a reliable way
  2. Metadata is good enough; that is, key insights can be derived with maths taught in most undergraduate mathematics programs. (I lectured about the 10 algorithms which everyone uses. Why? These are good enough.)
  3. Machines can do pretty good indexing; that is, key word and bound phrase extraction and mapping, clustering, and graphs of wide paths among nodes, people, etc. (a small sketch follows this list)
  4. Humans have been induced to add their own – often wonky – index terms, or hash tags as the thumbtypers characterize them
  5. Index analysis (Eugene Garfield’s citation analysis) provides reasonably useful indications of what’s important even if one knows zero about a topic, entity, etc.
  6. Packaging indexing – sorry, metadata – as smart software and its ilk converts VCs from skeptics into fantasists. Money flows even though Google’s DeepMind technology is not delivering dump trucks of money to the Alphabet front door. Maybe soon? Who knows?
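To make point 3 concrete, here is a small sketch of “pretty good” machine indexing with off-the-shelf scikit-learn: TF-IDF terms stand in for assigned index terms, and clustering groups documents without human facet work. The documents are toy data:

```python
# A small illustration of machine-assigned index terms and clustering
# using scikit-learn on toy documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "nuclear reactor cooling and safety",
    "reactor safety inspection schedules",
    "quarterly sales targets for the region",
    "regional sales team quarterly goals",
]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)

# The top-weighted terms per document serve as machine index terms.
terms = vec.get_feature_names_out()
for i, row in enumerate(X.toarray()):
    top = [terms[j] for j in row.argsort()[-2:]]
    print(f"doc {i}: {top}")

# Clustering groups the documents by topic: good enough, most days.
print(KMeans(n_clusters=2, n_init=10).fit_predict(X))
```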

Net net: The strongest supporters of artificial intelligence have specific needs: Money, vindication of an idea gestated among classmates at a bar, or a desire to become famous.

Who agrees with me? Probably not too many people. As the professionals who founded commercial database products in the late 1970s and early 1980s die off, any chance of getting the straight scoop on the importance of indexing decreases. For AI professionals, that’s probably good news. For those individuals who understand indexing in today’s context, good luck with your mission.

Stephen E Arnold, April 30, 2021
