Listen Up! Direct from a Former Verity Executive: Google Search Lags

July 13, 2022

If anyone knows about falling behind in search and retrieval, it is probably a former Verity executive. Verity provided a decent security token to limit content access and created one of the most sluggish index updating methods I have ever encountered. When was this? The late 1990s. Verity ended up as a contributor to the estimable Autonomy “search” offering. Therefore, experience in moving users to content is a core competency of former Verity executives.

I spotted a Googler who was a former Verity executive. The individual identified how a search and retrieval system does not meet the needs of the here and now user. The information is contained in what I think is a somewhat askew discussion of the Google finding system. The information appears in “Google Exec Suggests Instagram and TikTok Are Eating into Google’s Core Products, Search and Maps.” The write up includes some interesting observations. These comments reveal Google’s apparently slow realization that it is making money as it loses the hearts and minds of a couple of important customer segments. It also colors in the outlines of Google’s hesitancy to identify one of its most difficult search problems: Amazon.com.

I noted these statements in the article:

he [the former Verity executive] somewhat offhandedly noted that younger users were now often turning to apps like Instagram and TikTok instead of Google Search or Maps for discovery purposes. “We keep learning, over and over again, that new internet users don’t have the expectations and the mindset that we have become accustomed to,” Raghavan said, adding, “the queries they ask are completely different.”

Experience matters. Verity went nowhere and ended up a footnote in Autonomy’s quest for customers, not technology and cutting edge functionality. “Been there, seen that” could be one of the triggers for this moment of candor.

Here’s another:

“In our studies, something like almost 40% of young people, when they’re looking for a place for lunch, they don’t go to Google Maps or Search,” he continued. “They go to TikTok or Instagram.” The figure sounds a bit shocking, we have to admit. Google confirmed to us his comments were based on internal research that involved a survey of U.S. users, ages 18 to 24. The data has not yet been made public, we’re told, but may later be added to Google’s competition site, alongside other stats — like how 55% of product searches now begin on Amazon, for example.

Flash back to the Verity era. New systems were becoming available. The wild and crazy Fast Search & Transfer folks were demonstrating a different almost “webby” approach to finding enterprise information. There was a sporty system from ISYS Search which provided a graphical interface, which — believe it or not — is still in the commercial market. There were quite fascinating folder oriented systems like Folio and Lextek. There were rumblings about semantics from Purple Yogi, later renamed Stratify, and also still available sort of from a records management company. Verity was lagging in the race to search domination.

So is Google. And a former Verity wizard identifies three companies which pose a bit of a challenge to a company which lacks focus, urgency, and hunger.

Add to this mea culpa the allegedly accurate statements reported in “Read the Memo Google’s CEO Sent Employees about a Hiring Slowdown.” The main idea in my opinion is that the mighty Googzilla is wandering in the wilderness with billions from online advertising. The problem is that developers are putting up trailer parks, slumurbia housing, and giant digital K-Marts. Googzilla is confused. Where’s the Moses to snap a leash on the beastie and pull the multi-ton monster to a valley filled with prey?

The trajectory for Alphabet Google YouTube DeepMind and the solving death folks seems to be discernible. Peak Google, yep. Now gravity. (No, I won’t quote from the endlessly readable Gravity’s Rainbow. Sorry, I lied. How about this line from the page turner?)

You think you’d rather hear about what you call life: the growing, organic Kartell. But it’s only another illusion. A very clever robot. The more dynamic it seems to you, the more deep and dead, in reality, it grows.

Verity, mostly dead. The Google? Well, gravity. No pot of gold at the end of this digital rainbow I surmise.

Stephen E Arnold, July 13, 2022

Search the Web: Maybe Find a Nugget or Two for Intrepid Researchers?

June 21, 2022

“A Look at Search Engines with Their Own Indexes” has been updated. The article provides a rundown of systems and services which offer Web search services.

Some of the factoids in the article are ones often overlooked by many of the “search experts” generating information about how to find information via open sources. Here are a few which deserve more attention from students of search:

  1. Bing is the most promiscuous supporter of metasearch
  2. YaCy is included in the “unusable” category; however, it is not. YaCy has some interesting properties of interest to cyber sleuths
  3. Neeva’s index is exposed as a mix of some original crawl content with Bing results. (Where’s the Google love for a former Googler’s search system?)
  4. Qwant is exposed for using Bing data
  5. Exalead, arguably better than Pertimm which influenced Qwant, takes some bullets. But Dassault is into other, more lucrative businesses than “search”
  6. Kagi is a for-fee service which uses its own index and, like other metasearch systems, taps results from Bing and Google. (Is Google excited yet?)
  7. The Thunderstone service is noted. (How long has Thunderstone been around? Answer: A long time.)
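The metasearch pattern several of the listed services rely on, blending a home-grown index with Bing or Google results, is easy to sketch. Below is a minimal reciprocal rank fusion, one common way to merge ranked lists from several sources; the source names and document identifiers are invented for illustration.

```python
# Minimal reciprocal rank fusion (RRF), a common way metasearch
# engines blend ranked lists from several sources. The result
# lists and document names below are hypothetical.

def rrf_merge(ranked_lists, k=60):
    """Merge ranked result lists; each list is ordered best-first."""
    scores = {}
    for results in ranked_lists:
        for rank, doc in enumerate(results, start=1):
            # Each source contributes 1 / (k + rank) to the doc's score.
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

own_index = ["doc_a", "doc_b", "doc_c"]   # hypothetical home crawl
bing_feed = ["doc_b", "doc_d", "doc_a"]   # hypothetical licensed feed

merged = rrf_merge([own_index, bing_feed])
```

Documents that appear near the top of more than one list float upward, which is why a "mix of some original crawl content with Bing results" can look seamless to the user.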

Worth noting the links. Perhaps someone will create a list of the services indexing content for specialized software applications and government agencies. There are hundreds of “data aggregators” but how does one search them for useful results?

I addressed findability issues in my recent OSINT lecture for the National Cyber Crime Conference attendees and in a follow-up session for the Mass. Assoc. of Crime Analysts.

Stephen E Arnold, June 21, 2022

Decentralized Presearch Moves from Testnet to Mainnet

June 15, 2022

Yet another new platform hopes to rival the king of the search-engine hill. We think this is one to watch, though, for its approach to privacy, performance, and scope of indexing. PC Magazine asks, “The Next Google? Decentralized Search Engine ‘Presearch’ Exits Testing Phase.” The switch from its Testnet at Presearch.org to the Mainnet at Presearch.com means the platform’s network of some 64,000 volunteer nodes will be handling many more queries. They expect to process more than five million searches a day at first but are prepared to scale to hundreds of millions. Writer Michael Kan tells us:

“Presearch is trying to rival Google by creating a search engine free of user data collection. To pull this off, the search engine is using volunteer-run computers, known as ‘nodes,’ to aggregate the search results for each query. The nodes then get rewarded with a blockchain-based token for processing the search results. The result is a decentralized, community-run search engine, which is also designed to strip out the user’s private information with each search request. Anyone can also volunteer to turn their home computer or virtual server into a node. In a blog post, Presearch said the transition to the Mainnet promises to make the search engine run more smoothly by tapping more computing power from its volunteer nodes. ‘We now have the ability for node operators to contribute computing resources, be rewarded for their contributions, and have the network automatically distribute those resources to the locations and tasks that require processing,’ the company said.”

The blog post referenced above compares this decentralized approach to traditional search-engine infrastructure. An interesting Presearch feature is the row of alternative search options. One can perform a straightforward search in the familiar query box or click a button to directly search sources like DuckDuckGo, YouTube, Twitter, and, yes, Google. Reflecting its blockchain connection, the page also supplies buttons to search Etherscan, CoinGecko, and CoinMarketCap for related topics. Presearch gained 3.8 million registered users between its Testnet launch in October 2020 and the shift to its Mainnet. We are curious to see how fast it will grow from here.
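The query-to-node-to-token loop the article describes can be pictured in a few lines. To be clear, this is a toy illustration of the idea, not Presearch's actual protocol; the node names, the round-robin dispatch, and the reward amount are all invented.

```python
import itertools

# Toy model of a decentralized search network: queries are spread
# across volunteer nodes, and each node operator earns a token
# credit per query served. This illustrates the concept only; it
# is NOT Presearch's protocol. Names and amounts are invented.

class NodePool:
    def __init__(self, node_ids, reward_per_query=0.25):
        self.rewards = {n: 0.0 for n in node_ids}
        self._cycle = itertools.cycle(node_ids)  # simple round-robin dispatch
        self.reward_per_query = reward_per_query

    def dispatch(self, query):
        node = next(self._cycle)
        self.rewards[node] += self.reward_per_query  # credit the operator
        return node, f"results for {query!r} via {node}"

pool = NodePool(["node-1", "node-2", "node-3"])
for q in ["privacy", "blockchain", "maps", "lunch"]:
    pool.dispatch(q)
```

The appeal is that no central server sees the full query stream; the open question, as always, is whether volunteer capacity scales to "hundreds of millions" of searches.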

Cynthia Murrell, June 15, 2022

Cheerleading: The PicRights’ Method

May 30, 2022

I read what appears to be a news release designed to promote an outfit with an interesting business model. Navigate to “PicRights Sponsors Upcoming CEPIC Congress in Spain.” The write up explains:

For the fifth consecutive year, PicRights will also sponsor the annual Digital Media Licensing Organization (DMLA) Conference, to be held later this year. Last year’s conference offered sessions with Adobe, Google, Microsoft and Getty, and discussed NFTs, AI, synthetic content, remote production, and other issues shaping today’s creator economy. PicRights was a sponsor of the conference from 2018 through 2021, and was previously a speaker at the 2020 conference.

The news release points out:

Last month, PicRights was a supporter of the 32nd annual MINDS Conference held in Helsinki. The theme of the conference was “Stronger Together – Collaboration and Sharing for Success” and discussed successful partnerships within MINDS and beyond, collaboration with major platforms, newsroom evolution, and the power of diversity and inclusion.

Several questions arose as I thought about this somewhat rah rah-type news story:

  1. What is the false positive rate for the software used by this organization to identify copyright missteps? When was it developed? By whom?
  2. What financial deals are in place for largely reactive and technologically sluggish publishing companies whose intellectual property is the subject of legal interactions?
  3. Why are images protected by assorted copyright regulations appearing in a free Web search system like Google-type image search?

I don’t have answers to these questions. It seems to me that some odd synchronized vibration is buzzing among the image indexing outfits, the PicRights-type operations, and the copyright holders.

Is the solution to use “smart software” to block inclusion of any image which requires a fee for use, or to insert a message that clearly identifies an image as one which requires a fee to be paid should someone like a veterans’ group, a college newspaper, or a one-person Medium blogger want to use it?

I find this harmonic vibration among the rights enforcement folks, the Google-type search systems, and the entity “owning” the rights to a particular image fascinating.

The business model is clever but it appears that additional publicity is needed to make the excellence of the approach more visible.  Rah rah rah.

Stephen E Arnold, May 30, 2022

Controlled Term Lists Morph into Data Catalogs That Are Better, Faster, and Cheaper to Generate

May 24, 2022

Indexing and classifying content is boring. A human subject matter expert asked to extract index terms and assign classification codes works great. But the humanoid SME gets tired and begins assigning general terms from memory. Plus humanoids want health care, retirement benefits, and time to go fishing in the Ozarks. (Yes, the beautiful sunny Ozarks!)

With off-the-shelf smart software available on GitHub or at a bargain price from the ever-secure Microsoft or the warehouse-subleasing Amazon, innovators can use machines to handle the indexing. In order to make the basic into a glam task, slap on a new bit of jargon, and you are ready to create a data catalog.

“16 Top Data Catalog Software Tools to Consider Using in 2022” is a listing of automated indexing and classifying products and services. No humanoids or not too many humanoids needed. The software delivers lower costs and none of the humanoid deterioration after a few hours of indexing. Those software systems are really something: No vacations, no benefits, no health care, and no breaks during which unionization can be discussed.

What’s interesting about the list is that it includes the allegedly quasi monopolistic outfits like Amazon, Google, IBM, Informatica, and Oracle. The write up does not answer the question, “Are the terms and other metadata the trade secret of the customer?” The reason I am curious is that rolling up terms from numerous organizations and indexing each term as originating at a particular company provides a useful data set to analyze for trends, entities, and date and time on the document from which the terms were derived. But no alleged monopoly would look at a cloud customer’s data? Inconceivable.
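The roll-up scenario sketched above is simple to picture in code. A few lines suffice to aggregate metadata terms across tenants while recording which company supplied each one; the company names and terms below are invented for illustration.

```python
from collections import Counter, defaultdict

# Sketch of the roll-up concern: metadata terms from several cloud
# tenants, each tagged with its source company. Aggregating them
# yields cross-company trend data. Companies and terms are invented.

catalog_feeds = {
    "acme_corp": ["merger", "lithium", "q3 forecast"],
    "globex":    ["lithium", "battery", "recall"],
    "initech":   ["merger", "lithium", "layoffs"],
}

term_counts = Counter()
term_sources = defaultdict(set)
for company, terms in catalog_feeds.items():
    for term in terms:
        term_counts[term] += 1           # overall trend signal
        term_sources[term].add(company)  # who is talking about it

# "lithium" shows up at all three tenants, a trend no single
# customer intended to disclose to its catalog vendor.
```

Nothing in the sketch is exotic, which is exactly the point: the value of the aggregate is a side effect of ordinary counting once terms carry a source tag.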

The list of vendors also includes some names which are not yet among the titans of content processing; for example:

  • Alation
  • Alex
  • Ataccama
  • Atlan
  • Boomi
  • Collibra
  • Data.world
  • Erwin
  • Lumada

There are some other vendors in the indexing business. You can identify these players by joining NFAIS, now the National Federation of Advanced Information Services. The outfit discarded the now out of favor terminology of abstracting and indexing. My hunch is that some NFAIS members can point out some of the potential downsides of using smart software to process business and customer information. New terms and jazzy company names can cause digital consternation. But smart software just gets smarter even as it mis-labels, mis-indexes, and mis-understands. No problem: Cheaper, faster, and better. A trifecta. Who needs SMEs to look at an exception file, correct errors, and tune the system? No one!

Stephen E Arnold, May 24, 2022

An Analyst Wrestles with the Palantir Realities

May 23, 2022

Palantir Technologies in my world view is a services and software company positioned as a provider of intelware. Intelware means software and services which allow users to extract high-value information from text, numeric, and possibly image and video data.

Palantir, founded in 2003, has been influenced from its inception by precursor software like the original i2 Ltd. Analyst Notebook and BAE Systems Detica. Both of these systems allowed users to ingest “content”, enter the names of people or things, and display the outputs so that the higher-value facts were presented in a useful way; for example, a chart or a relationship graph.

The US government works to learn about new and potentially useful software and systems. Not surprisingly, a government agency showed interest in Palantir’s software when the entrepreneurs involved in the company started describing the Palantir features and functions. Appreciate that in its early years almost two decades ago, the presentations and demonstrations captured what I call “to be” systems; that is, at some point in the future, Palantir’s system and software would be everything that Analyst Notebook, Detica, and the other intelware vendors could offer. The pitch is compelling.

Palantir, now almost two decades old, is a publicly traded company, and it is working overtime to move beyond sales to governments in the US and elsewhere. One of the characteristics of selling intelware to non-governmental organizations is that the capabilities of the system and its use by government clients are often disconcerting to a financial institution, a big hospital chain, or consulting firm focused on real estate.

Furthermore, intelware systems require data. Some data can be easily imported into a system like Palantir’s; for example, plain ASCII text and Excel spreadsheets. Other data are in a format which must be transformed so that Palantir can import the information. Other data present challenges like converting an image with a date and time stamp into an indexed content object. That indexing, to be helpful and to reduce the likelihood of errors, has to be accurate. Some non-text data must be enriched. French content processing experts refer to this enrichment as “fertilization.”
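The transform-and-enrich step is the grunt work. A sketch of what that "fertilization" amounts to in practice, normalizing heterogeneous records into one indexable shape, might look like the following; the field names and record formats are hypothetical, not Palantir's actual schema.

```python
from datetime import datetime, timezone

# Sketch of intelware-style ingestion: heterogeneous inputs are
# normalized into one indexable "content object" with a clean
# ISO-8601 timestamp. Field names and formats are hypothetical;
# this is not any vendor's actual import pipeline.

def to_content_object(record):
    if record["kind"] == "spreadsheet_row":
        text = " ".join(str(v) for v in record["cells"])
        stamp = record["exported_at"]        # already ISO-8601
    elif record["kind"] == "image":
        text = record.get("ocr_text", "")    # enrichment: OCR output
        # Enrichment ("fertilization"): parse the camera's date/time
        # stamp into a normalized, sortable form.
        stamp = datetime.strptime(
            record["camera_stamp"], "%m/%d/%Y %H:%M"
        ).replace(tzinfo=timezone.utc).isoformat()
    else:
        raise ValueError(f"no transform for {record['kind']}")
    return {"text": text, "timestamp": stamp, "source": record["kind"]}

obj = to_content_object(
    {"kind": "image", "camera_stamp": "05/23/2022 14:05",
     "ocr_text": "plate XYZ-123"}
)
```

Every new input format means another branch like these, which is one reason intelware deployments consume services dollars long after the license is signed.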

The write up “Palantir: Complete Disaster” includes this statement:

We think there are three possible courses of action in the disaster that has been Palantir, all of which are correct.

Here are the three “courses of action”:

  1. Don’t buy shares in Palantir.
  2. Buy shares, maybe short the stock.
  3. Buy shares and ride out the downturn.

Each of these options ignores two issues. The first is why Palantir is not closing deals and showing a profit. The second is why an intelware company is not able to amp up its sales to government agencies in the US, Western Europe, and selected government agencies elsewhere.

My view is that Palantir is a tough sell for these reasons:

  1. To land a deal, the prospect has to know what the payoff from using the Gotham / Foundry system is. “Intelligence” is a hot concept, but it is a tough sell unless there is a “champion” inside the prospect’s organization to grease the skids.
  2. Competitors offer comparable products for as little as $5,000 per month and some of these competitors bundle third party data which can be fused with the licensee’s data with minimal fiddling with filters and file conversions.
  3. Newer systems are easier to use and include automated workflows which speed the work of analysts, investigators, and researchers.

The slow sales of Palantir follow the same type of curve that sales of Autonomy, Fast Search & Transfer, and many other “information” or “intelligence” focused products have. The initial sales are from government agencies which want better mouse traps. When the intelware does not deliver markedly significant payoffs, the licensees keep looking for better, faster, and cheaper options.

Will Palantir be able to generate a profit and deliver organic growth?

If the trajectory of precursor companies is the path Palantir is on, the answer is, “No.”

Stephen E Arnold, May 23, 2022

Google, Smart Software, and Prime Mover for Hyperbole

May 17, 2022

In my experience, the cost of training smart software is a very big problem. The bigness does not become evident until the licensee of a smart system realizes that training the smart software must take place on a regular schedule. Why is this a big problem? The reason is the effort required to assemble valid training sets is significant. Language, data types, and info peculiarities change over time; for example, new content is fed into a smart system, and the system cannot cope with the differences between the training set that was used and the info flowing into the system now. A gap grows, and the fix is to assemble new training data, reindex the content, and get ready to do it again. A failure to keep the smart software in sync with what is processed is a tiny bit of knowledge not explained in sales pitches.
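The gap between training data and incoming content can be watched numerically. One crude but serviceable check, offered here as an illustrative sketch rather than any vendor's method, compares the vocabulary of the original training set with the vocabulary of newly arriving content and flags retraining when overlap drops.

```python
# Crude drift check: compare the token vocabulary the model was
# trained on with the vocabulary of newly arriving content. When
# Jaccard overlap falls below a threshold, schedule retraining.
# An illustrative sketch only, not any vendor's actual method.

def vocab(texts):
    return {tok for t in texts for tok in t.lower().split()}

def needs_retraining(training_texts, incoming_texts, threshold=0.5):
    a, b = vocab(training_texts), vocab(incoming_texts)
    jaccard = len(a & b) / len(a | b)  # shared terms / all terms
    return jaccard < threshold, round(jaccard, 3)

train = ["quarterly revenue report", "revenue forecast model"]
fresh = ["tiktok creator payouts", "influencer revenue stream"]

drifted, score = needs_retraining(train, fresh)
```

Real systems compare distributions, not just vocabularies, but even this toy makes the cost structure visible: someone has to run the check, assemble fresh training data, and reindex, over and over.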

Accountants figure out that money must be spent on a cost not included in the original pricing. Search systems return increasingly lousy results. Intelligence software outputs data which make zero sense to a person working out a surveillance plan. An art history major working on a PowerPoint presentation cannot locate the version used by the president of the company for last week’s pitch to potential investors.

The accountant wants to understand overruns associated with smart software, looks into the invoices and time sheets, and discovers something new: charges for smart software subject matter experts, indexing professionals, and interns buying third-party content from an online vendor called Elsevier. These are not line items CPAs confront unless there are smart software systems chugging along.

The big problem is handled in this way: Those selling the system don’t talk too much about how training is a recurring cost which increases over time. Yep, reindexing is a greedy pig and those training sets have to be tested to see if the smart software gets smarter.

The fix? Do PR about super duper even smarter methods of training. Think Snorkel. Think synthetic data. Think PowerPoint decks filled with jargon that causes clueless MBAs to do high fives because the approach is a slam dunk. Yes! Winner!

I read “DeepMind’s Astounding New ‘Gato’ AI Makes Me Fear Humans Will Never Achieve AGI” and realized that the cloud of unknowing has not yet yielded to blue skies. The article states:

Just like it took some time between the discovery of fire and the invention of the internal combustion engine, figuring out how to go from deep learning to AGI won’t happen overnight.

No kidding. There are gotchas beyond training, however. I have a presentation in hand which I delivered in 1997 at an online conference. Training cost is one dot point; there are five others. Can you name them? Here’s a hint for another big issue: An output that kills a patient. The accountant understands the costs of litigation when that smart AI produces a close-enough-for-horseshoes output for a harried medical professional. Yeah, go CAT scan, go.

Stephen E Arnold, May 17, 2022

Issues with the Zuckbook Smart Software: Imagine That

May 10, 2022

I was neither surprised by nor interested in “Facebook’s New AI System Has a ‘High Propensity’ for Racism and Bias.” The marketing hype encapsulated in PowerPoint decks and weaponized PDF files on Arxiv paint fantastical pictures of today’s marvel-making machine learning systems. Those who have been around smart software and really stupid software for a number of years understand two things: PR and marketing are easier than delivering high-value, high-utility systems, and smart software works best when tailored and tuned to quite specific tasks. Generalized systems are not yet without a few flaws. Addressing these will take time, innovation, and money. Innovation is scarce in many high-technology companies. The time and money factors dictate that “good enough” and “close enough for horseshoes” systems and methods are pushed into products and services. “Good enough” works for search because no one knows what is in the index. Comparative evaluations of search and retrieval are tough when users (addicts) operate within a cloud of unknowing. The “close enough for horseshoes” approach produces applications which are sort of correct. Perfect for ad matching and suggesting what Facebook pages or Tweets would engage a person interested in tattoos or fad diets.

The cited article explains:

Facebook and its parent company, Meta, recently released a new tool that can be used to quickly develop state-of-the-art AI. But according to the company’s researchers, the system has the same problem as its predecessors: It’s extremely bad at avoiding results that reinforce racist and sexist stereotypes.

My recollection is that the Google has terminated some of its wizards and transformed these professionals into Xooglers in the blink of an eye. Why? Exposing some of the issues that continue to plague smart software.

Those interns, former college professors, and start up engineers rely on techniques used for decades. These are connected together, fed synthetic data, and bolted to an application. The outputs reflect the inherent oddities of the methods; for example, feed the system images spidered from Web sites and the system “learns” what is on the Web sites. The system then generalizes from the Web site images and produces synthetic data. The whole process zooms along and costs less. The outputs, however, have minimal information about that which is not on a Web site; for example, positive images of a family in a township outside of Cape Town.

The write up states:

Meta researchers write that the model “has a high propensity to generate toxic language and reinforce harmful stereotypes, even when provided with a relatively innocuous prompt.” This means it’s easy to get biased and harmful results even when you’re not trying. The system is also vulnerable to “adversarial prompts,” where small, trivial changes in phrasing can be used to evade the system’s safeguards and produce toxic content.

What’s new? These issues surfaced in the automated content processing in the early versions of the Autonomy Neuro Linguistic Programming approach. The fix was to retrain the system and tune the outputs. Few licensees had the appetite to spend the money needed to perform the retraining and reindexing of the processed content when the search results drifted into weirdness.

Since the mid 1990s, have developers solved this problem?

Nope.

Has the email with this information reached the PR professionals and the art history majors with a minor in graphic design who produce PowerPoints? What about the former college professors and a bunch of interns and recent graduates?

Nope.

What’s this mean? Here’s my view:

  1. Narrow applications of smart software can work and be quite useful; for example, the Preligens system for aircraft identification. Broad applications have to be viewed as demonstrations or works in progress.
  2. The MBA craziness which wants to create world-dominating methods to control markets must be recognized and managed. I know that running wild for 25 years creates some habits which are going to be difficult to break. But change is needed. Craziness is not a viable business model in my opinion.
  3. The over-the-top hyperbole must be identified. This means that PowerPoint presentations should carry a warning label: Science fiction inside. The quasi-scientific papers with loads of authors who work at one firm should carry a disclaimer: Results are going to be difficult to verify.

Without some common sense, the flood of semi-functional smart software will increase. Not good. Why? The impact of erroneous outputs will cause more harm than users of the systems expect. Screwing up content filtering for a political rally is one thing; outputting an incorrect medical action is another.

Stephen E Arnold, May 10, 2022

Kyndi: Advanced Search Technology with Quanton Methods. Yes, Quanton

April 29, 2022

One of my newsfeeds spit out this story: “Kyndi Unveils the Kyndi Natural Language Search Solution – Enables Enterprises to Discover and Deliver the Most Relevant and Precise Contextual Business Information at Unprecedented Speed.” The Kyndi founders appear to be business oriented, not engineering focused. The use of jargon like natural language understanding, contextual information, artificial intelligence, software robots, explainable artificial intelligence, and others is now almost automatic as if generated by smart software, not people who have struggled to make content processing and information retrieval work for users.

The firm’s Web site does not provide much detail about the technical plumbing for the company’s search and retrieval system. I took a quick look at the firm’s patents and noted these. I have added bold face to highlight some of the interesting words in these documents.

  • A method using Birkhoff polytopes and Landau numbers. See US11205135 “Quanton [sic] Representation for Emulating Quantum-like Computation on Classical Processors,” granted December 21, 2021. Inventor: Arun Majumdar, possibly in Alexandria, Virginia.
  • A method employing combinatorial hyper maps. See US10985775 “System and Method of Combinatorial Hypermap Based Data Representations and Operations,” granted April 20, 2021. Inventor: Arun Majumdar, possibly in Alexandria, Virginia. (As a point of interest the document includes the word bijectively.)
  • A method making use of Q-Medoids and Q-Hashing. See US10747740 “Cognitive Memory Graph Indexing, Storage and Retrieval,” granted August 18, 2020. Inventor: Arun Majumdar, possibly in San Mateo, California.
  • A method using Semantic Boundary Indices and a variant of the VivoMind* Analogy Engine. See US10387784 “Technical and Semantic Signal Processing in Large, Unstructured Data Fields,” granted August 20, 2019. Inventor: Arun Majumdar, possibly in Alexandria, Virginia. *VivoMind was a company started by Arun Majumdar prior to his relationship with Kyndi.
  • A method using Rvachev functions and transfinite interpolations. See US10372724 “Relativistic Concept Measuring System for Data Clustering,” granted August 6, 2019. Inventor: Arun Majumdar, possibly in Alexandria, Virginia.
  • A method using Clifford algebra. See US10120933 “Weighted Subsymbolic Data Encoding,” granted November 6, 2018. Inventor: Arun Majumdar, possibly in Alexandria, Virginia.

The inventor is not listed on the firm’s Web site. Mr. Majumdar’s contributions are significant. The chief technology officer is Dan Gartung, who is a programmer and entrepreneur. However, there does not seem to be an observable link among the founders, the current CTO, and Mr. Majumdar.

The company will have to work hard to capture mindshare from companies like Algolia (now working to reinvent enterprise search), Mindbreeze, Yext, and X1 (morphing into an eDiscovery system it seems), among others. Kyndi has absorbed more than $20 million in venture funding, but a competitor like Lucidworks has captured in the neighborhood of $200 million.

It is worth noting that one facet of the firm’s marketing is to hire the whiz kids from a couple of mid tier consulting firms to explain the firm’s approach to search. It might be a good idea for the analysts from these firms to read the Kyndi patents and determine how the VivoMind methods have been updated and applied to the Kyndi product. A bit of benchmarking might be helpful. For example, my team uses a collection of Google patents and indexes them, runs test queries, and analyzes the result sets. Almost incomprehensible specialist terminology is one thing, but solid, methodical analysis of a system’s real life performance is another. Precision and recall scores remain helpful, particularly for certain content; for example, pharma research, engineered materials, and nuclear physics.
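The benchmarking routine just described reduces to two familiar ratios per test query. A minimal version, with invented patent identifiers standing in for real retrieved documents and relevance judgments:

```python
# Precision and recall for one test query, the core of the
# benchmarking routine described above. The retrieved and relevant
# document identifiers are invented for illustration.

def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0  # how much of the output is good
    recall = hits / len(relevant) if relevant else 0.0       # how much of the good stuff was found
    return precision, recall

# Hypothetical query; judges marked three patents relevant,
# and the engine returned four documents.
retrieved = ["US111", "US222", "US333", "US444"]
relevant  = ["US222", "US333", "US555"]

p, r = precision_recall(retrieved, relevant)
```

Run the same calculation over a few dozen queries against a known corpus and a vendor's "unprecedented speed" claims acquire some context.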

Stephen E Arnold, April 29, 2022

Enterprise Search Vendor Buzzword Bonanza!

April 25, 2022

Enterprise search vendors are similar to those two Red Bull-sponsored wizards who wanted to change aircraft—whilst in flight. How did that work out? The pilots survived. That aircraft? Yeah, Liberty, Liberty Mutual as the YouTube ads intone.

Enterprise search vendors want to become something different. Typical repositionings include customer support, which entails typing in a word and scanning for matches, and business intelligence, which often means indexing content, matching words and phrases on a list, and generating alerts. There are other variations which include analyzing content and creating a report which tallies text messages from outraged customers.
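Mechanically, the "business intelligence" repositioning often amounts to little more than the following sketch; the watch list phrases and customer messages are invented for illustration.

```python
# What the "business intelligence" repositioning often boils down
# to: scan incoming text for words and phrases on a watch list and
# emit an alert per match. Phrases and messages are invented.

WATCH_LIST = ["refund", "cancel my account", "lawyer"]

def generate_alerts(messages):
    alerts = []
    for msg in messages:
        low = msg.lower()
        for phrase in WATCH_LIST:
            if phrase in low:
                alerts.append({"phrase": phrase, "message": msg})
    return alerts

inbox = [
    "I want a refund now",
    "Great product, thanks!",
    "Cancel my account or I call a lawyer",
]
alerts = generate_alerts(inbox)
```

Wrap this in a dashboard, call it predictive, and the marketing collateral writes itself.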

Let’s check out reality. “Enterprise search” means finding information. Words and phrases are helpful. Users want these systems to know what is needed and then output it without asking the user to do anything. The challenge becomes assigning a jazzy marketing hook to make enterprise search into something more vital, more compelling, and more zippy.

Navigate to “What Should We Remember?” A bonanza. The diagram is a remarkable array of categories and concepts tailor-made for search marketers. Here’s an example of some of the zingy concepts:

  • Zero-risk bias
  • Social comparison
  • Fundamental attribution
  • Barnum effect — Who? The circus person?

Now mix in natural language processing, semantic analysis, entity extraction, artificial intelligence, and — my fave — predictive analytics.

How quickly will outfits in the enterprise search sector gravitate to these more impactful notions? Desperation is a motivating factor. Maybe weeks or months?

Stephen E Arnold, April 25, 2022
