Jargon Changes More Rapidly Than Search And Retrieval
July 22, 2022
Oh boy! There is a new term in the search and retrieval lexicon: neural search. While the term sounds like a search engine for telepaths or something a cyborg and/or android would use, Martech Series explained that it is something completely different: “Sinequa Adds Industry-Leading Neural Search Capabilities To Its Search Cloud Platform.”
Sinequa is an enterprise search leader and it recently announced the addition of advanced neural search capabilities to its Search Cloud Platform. The upgrade promises to provide unprecedented relevance, accuracy, etc. Sinequa is the first company to offer neural search in four deep learning language models commercially. The models are pre-trained with a combination of Sinequa’s trademark NLP and semantic search.
Search engines used neural search models for years, but they were not cost-effective for enterprise systems:
“Neural search models have been used in internet searches by Google and Bing since 2019, but computing requirements rendered them too costly and slow for most enterprises, especially at production scale. Sinequa optimized the models and collaborated with the Microsoft Azure and NVIDIA AI/ML teams to deliver a high performance, cost-efficient infrastructure to support intensive Neural Search workloads without a huge carbon footprint. Neural Search is optimized for Microsoft Azure and the latest NVIDIA A10 or A100 Tensor Core GPUs to efficiently process large amounts of unstructured data as well as user queries.”
Wonderful for Sinequa! Search and retrieval, especially in foreign languages, are some of the biggest time wasters in productivity. Hopefully, Sinequa actually delivers an industry-changing product; otherwise, the company has simply added more jargon to the tech glossary.
Whitney Grace, July 22, 2022
Commercializing Cyber Crime with Search and Retrieval
July 14, 2022
I read “Ransomware Gangs Offer Ability to Search Stolen Data.” The write up reports:
Bleeping Computer reported today that the ALPHV/BlackCat ransomware gang was the first to offer the feature, announcing that they have created a searchable database with leaks from nonpaying victims. The hackers said that their stolen data had been fully indexed and that the search feature included support for finding information by filename or by content available in documents and images. The BlackCat ransomware gang claims it is offering the search service to make it easier for cybercriminals to find passwords or other confidential information.
Other alleged bad actors are offering a search function as well. These are Lockbit and Karakurt.
Several observations:
- Commercialization of cyber crime has been a characteristic of some of the more forward-leaning bad actors
- The availability of open source search makes it easy to add functionality
- More productization is inevitable; for example, subscriptions to Crime as a Service.
Net net: The focus of crime analysts and investigators may have to embrace enablers like Internet Service Providers, cloud services, and open source code repositories.
Stephen E Arnold, July 14, 2022
Another Plea for Web Search That Sort of Works: Andrew Carnegie, Where Are You?
July 11, 2022
I am not going to do any history. Oh, well. Not really. Does anyone on TikTok know about Andrew Carnegie? Okay, let’s try another angle. How about a semi-rapacious dude with roots in Scotland who wanted to do good. Please, ignore the Carnegie era Monongahela River. The cheerful Mr. Carnegie came up with the idea of a free public library. Looking up information was a useful thing for poor folks and monopolistic steel barons alike. One person sort of fixed the “problem” of information access.
Flash forward to Backrub. Two bright young sprouts realized that a person had a tough time finding relevant information on Lycos and the other search engines available at the “dawn” of the Internet. The fix? Take a little bit of Kleinberg, add a pinch of technology, use available computing resources whether others at Stanford University knew or cared, and mix in continuous feedback to a bundle of mostly automatic rules. More links in, good. Not many links in, meh. Then advertising. Yeah, that worked great for some. For others, ho ho ho.
The result is the weaponized findability environment of good old 2022.
What’s the fix? “Why the World Needs a Non-Profit Search Engine” explains that donors contribute money, and an objective Web search system will return relevant results. The write up states:
Sometimes I forget why I’ve taken on this crazy, huge task. Why am I building a search engine? Will it really be better than Google one day? Will people support it? Will people even use it? And then I read something like The Bullshit Web and I remember, that, yes, there is a point. Even if I make the web better for one person, it’s worth it. Because the way things are is just wrong. Search engines are in a unique position to fix the situation. Not only do we create a view on the world’s knowledge, we influence it too. If we promote bullshit-free sites, then people will create more bullshit-free sites. More importantly, search engines are a filter on the world’s knowledge. Do you really want your filter to be “whatever makes $SEARCH_ENGINE more money”, particularly when that means, “show ads instead of search results, and prioritize search results that also make us more money”? We can and should do better.
I want to point out that what may be required is an Andrew Carnegie type who already has money and a guilty conscience. It is a modern perception that if one can get lots and lots of people to contribute money, one can fund anything.
Nice idea. My response? “Where’s the Andrew Carnegie?”
Why?
Traffic means monetization. Do-gooding is walking on the information highway. One has to speed, and speed is infinitely expensive. Ergo: Monetization lies over the horizon.
Stephen E Arnold, July 11, 2022
Akn Unfindable Search Utillity: Wild Spelling and Naming Idea
July 7, 2022
I like to check out new Web search systems. Most are little more than recycled versions of Dogpile.com, one of the most Abe Lincoln metasearch systems. A metasearch system uses hits from other search systems, possibly adds a bit of Vivisimo-type special sauce, and outputs results and rather crazy marketing materials.
The write up “This Badass Tool Makes Advanced YouTube Searches a Breeze” states:
This tool also allows you to perform advanced search on Google, DuckDuckGo, Twitter, and Reddit.
But the article is over the moon about the utility of the system when searching for content in Newton Minow’s nightmare, YouTube. I learned:
I [the author of the article] think this cool tool is better suited to YouTube.
Let’s try to find the system using its name, ä1. Try plugging the ä1 into Google, and what do you get? I received hits for services wildly unrelated to search and retrieval:
What about Bing, the Microsofties’ wonderful, but small, search system:
Yep, childhood disease.
What about Yandex? No joy.
Let’s search for the ä1 site on the ä1 site. What do we get? Google results and no ä1 search overlay or service.
Net net: Innovators, use names which can be searched. (Not everyone knows how to put the a with acne into a search box. Besides, most search systems discard such silliness as dots, checks, and circumflexes. Intellectual niceties are not part of the plan.) Pain in the a$$, not bad a$$ in my opinion.
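For the curious, here is a minimal sketch of how a search system might discard the dots over the a. It assumes the system folds diacritics with Unicode decomposition; the function name fold_diacritics is hypothetical, and real engines may handle accented characters quite differently.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Return text with combining marks (dots, checks, circumflexes) removed.

    Illustrative only: this is one common normalization approach,
    not the documented behavior of any particular search engine.
    """
    # NFKD splits a character like "ä" into "a" plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_diacritics("ä1"))
```

If an index and its query parser both apply a step like this, a query for the decorated name and a query for the plain a1 become the same string, which is one plausible reason the name is so hard to find.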
If you want to try out the all-in-one “system” yourself, here’s the url: https://ä1.com.
Tip: How about a findable name?
Stephen E Arnold, July 7, 2022
Search the Web: Maybe Find a Nugget or Two for Intrepid Researchers?
June 21, 2022
“A Look at Search Engines with Their Own Indexes” has been updated. The article provides a run down of systems and services which offer Web search services.
Some of the factoids in the article are ones often overlooked by many of the “search experts” generating information about how to find information via open sources. Here are a few which deserve more attention from students of search:
- Bing is the most promiscuous supporter of metasearch
- YaCy is included in the “unusable” category; however, it is not unusable. YaCy has some properties of interest to cyber sleuths
- Neeva’s index is exposed as a mix of some original crawl content with Bing results. (Where’s the Google love for a former Googler’s search system?)
- Qwant is exposed for using Bing data
- Exalead, arguably better than Pertimm which influenced Qwant, takes some bullets. But Dassault is into other, more lucrative businesses than “search”
- Kagi is a for fee service which uses its own index and, like other metasearch systems, taps results from Bing and Google. (Is Google excited yet?)
- The Thunderstone service is noted. (How long has Thunderstone been around? Answer: A long time.)
Worth noting the links. Perhaps someone will create a list of the services indexing content for specialized software applications and government agencies. There are hundreds of “data aggregators” but how does one search them for useful results?
I addressed the findability issue in my recent OSINT lecture for the National Cyber Crime Conference attendees and in a follow up session for the Mass. Asso. of Crime Analysts.
Stephen E Arnold, June 21, 2022
Decentralized Presearch Moves from Testnet to Mainnet
June 15, 2022
Yet another new platform hopes to rival the king of the search-engine hill. We think this is one to watch, though, for its approach to privacy, performance, and scope of indexing. PCMagazine asks, “The Next Google? Decentralized Search Engine ‘Presearch’ Exits Testing Phase.” The switch from its Testnet at Presearch.org to the Mainnet at Presearch.com means the platform’s network of some 64,000 volunteer nodes will be handling many more queries. They expect to process more than five million searches a day at first but are prepared to scale to hundreds of millions. Writer Michael Kan tells us:
“Presearch is trying to rival Google by creating a search engine free of user data collection. To pull this off, the search engine is using volunteer-run computers, known as ‘nodes,’ to aggregate the search results for each query. The nodes then get rewarded with a blockchain-based token for processing the search results. The result is a decentralized, community-run search engine, which is also designed to strip out the user’s private information with each search request. Anyone can also volunteer to turn their home computer or virtual server into a node. In a blog post, Presearch said the transition to the Mainnet promises to make the search engine run more smoothly by tapping more computing power from its volunteer nodes. ‘We now have the ability for node operators to contribute computing resources, be rewarded for their contributions, and have the network automatically distribute those resources to the locations and tasks that require processing,’ the company said.”
The blog post referenced above compares this decentralized approach to traditional search-engine infrastructure. An interesting Presearch feature is the row of alternative search options. One can perform a straightforward search in the familiar query box or click a button to directly search sources like DuckDuckGo, YouTube, Twitter, and, yes, Google. Reflecting its blockchain connection, the page also supplies buttons to search Etherscan, CoinGecko, and CoinMarketCap for related topics. Presearch gained 3.8 million registered users between its Testnet launch in October 2020 and the shift to its Mainnet. We are curious to see how fast it will grow from here.
Cynthia Murrell, June 15, 2022
Are There Google Wolves in Stealth Privacy Clothing?
June 8, 2022
A growing number of search engines are cropping up that purport to protect one’s privacy. Lukol is one of these. A brief entry at The New Leaf Journal questions that site’s privacy promises in, “Lukol Search Engine Shows Up in Logs.” New Leaf editor Nicholas A Ferrell noticed a paradox: though Lukol bills itself as an “anonymous search engine,” it is also “powered by Google Search.” Further investigation revealed this paragraph in the site’s privacy policy:
“We use cookies to personalise content and ads, and to analyse our traffic. We also share information about your use of our site with our advertising and analytics partners who may combine it with other information you’ve provided to them or they’ve collected from your use of their services. If you wish to opt out of Google cookies you may do so by visiting the Google privacy policy page.”
It seems the word “privacy” does not mean what Lukol thinks it means. Ferrell comments:
“So this anonymous search engine stores cookies on your computer to serve you with personalized ‘content and ads’ and it shares information about your use of the site with ‘advertising and analytics partners.’ It then directs you to Google’s privacy policy page for information about how to opt out of Google cookies. While I struggle to see how Lukol is privacy-friendly (much less anonymous), it is a great example for why it is important to look behind catchy promises about privacy and anonymity.”
Agreed. Lukol is basically Google Search with some added manipulations. None of which appear to protect user privacy. Let the searcher beware.
Cynthia Murrell, June 8, 2022
DuckDuckGo: A Duck May Be Plucked
May 25, 2022
Metasearch engines are not understood by most Internet users. Here’s my simplified take: A company thinks it can add value to the results output from an ad-supported search engine. Maybe the search engine is a for-fee outfit? Either way, the metasearch system gets the okay to send queries and get results. The results stream back to the metasearch outfit and the value-adding takes place.
One of the better metasearch systems was the pre-IBM Vivisimo. This outfit sent out queries to an ad-supported search engine, accepted the results, and then clustered them. The results appeared to the Vivisimo user as a results list with some folders in a panel. The idea was that the user could scan the folders and the results list. The user could decide to click on a folder and see what results it contained or just click on a link. The magic, as I understood it, was that the clustering took place in near real time. Plus, the query on the original Vivisimo pre-IBM system could send the user’s query to multiple Web search engines. The results from each search system would be de-duplicated. An interesting factoid from the 2000s is that search systems returned overlapping results 70 percent or more of the time. Dumping the duplicates was helpful. There were other interesting metasearch systems as well, but I am just using Vivisimo as an example of a pretty good one.
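The de-duplication step can be sketched in a few lines. This is a toy illustration, not Vivisimo’s method: the function names, the URL normalization rules (ignore the scheme, a leading www., and a trailing slash), and the sample URLs are all assumptions made for the example.

```python
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Build a comparison key so trivially different URLs match (assumed rules)."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    key = host + parts.path.rstrip("/")
    if parts.query:
        key += "?" + parts.query
    return key


def merge_results(result_lists):
    """Concatenate result lists from several engines, keeping the first copy of each URL."""
    seen = set()
    merged = []
    for results in result_lists:
        for url in results:
            key = normalize_url(url)
            if key not in seen:
                seen.add(key)
                merged.append(url)
    return merged


def overlap_ratio(result_lists):
    """Fraction of all returned hits that were duplicates of an earlier hit."""
    total = sum(len(results) for results in result_lists)
    if total == 0:
        return 0.0
    return (total - len(merge_results(result_lists))) / total


# Hypothetical result lists from two engines.
engine_a = ["https://www.example.com/page/", "http://example.org/a"]
engine_b = ["http://example.com/page", "https://example.net/b"]
print(merge_results([engine_a, engine_b]))
print(overlap_ratio([engine_a, engine_b]))
```

A production system would also need to rank the merged list and cluster it into folders; the sketch stops at dumping the duplicates.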
Privacy, like security, is a tricky concept to explain.
Using privacy to sell a free Web search system raises a number of questions; for example:
- What does privacy mean in the specific context of the metasearch engine?
- Where is the money coming from to keep the lights on at the metasearch outfit?
- What about log files?
- What about legal orders to reveal data about users?
- What’s the quid pro quo with the search engine or engines whose results the metasearch system uses?
- What part of the search chain captures data, inserts trackers, bugs, cookies, etc. into the user’s query?
None of these questions catches the attention of the real news folks, nor do most users know what answering the questions requires. The metasearch engines typically do not become chatty Cathies when someone like me shows up to gather information about metasearch systems. I recall the nervousness of the New York City wizard who cooked up Ixquick and the evasiveness of the owner of the Millionshort services.
Now we come to the notion that a duck can be plucked. My hunch is that plucking a duck is a messy affair for both duck and duck plucker.
“DuckDuckGo Browser Allows Microsoft Trackers Due to Search Agreement” presents information which appears to suggest that the “privacy” oriented DuckDuckGo metasearch system is not so private as some believed. The cited article states:
The privacy-focused DuckDuckGo browser purposely allows Microsoft trackers on third-party sites due to an agreement in their syndicated search content contract between the two companies.
You can read the cited article to get more insight into the assertion that DuckDuck has been pluck plucked in the feathered hole of privacy.
Am I surprised? No. Search is without a doubt one of the most remarkable business segments for soft fraud. How do I know? My partners and I created The Point in 1994, and even though you don’t remember it, I sure remember what I learned about finding information online. Lycos (CMGI) bought our curated search business, and I wrote several books about search. You know what? No one wants to think about search and soft fraud. Maybe more people should?
Net net: Free comes at a cost. One does not know what one does not know.
Stephen E Arnold, May 25, 2022
Does Google Have Search Fear?
May 16, 2022
I can hear the Googlers at a search engine optimization conference saying this:
Our recent investments in search are designed to provide a better experience for our users. Our engineers are always seeking interesting, new, and useful ways to make the world’s information more accessible.
What these code words mean to me is:
Yep, the ancient Larry and Sergey thing. Not working. Oh, my goodness. What are we going to do? Buy Neeva, Kagi, Seekr, and Wecript? Let’s let Alphabet invest and we can learn and maybe earn before more people figure out our results are not as good as Bing and DuckDuckGo’s.
Even Slashdot is running items which make clear that Google and search do not warrant the title of “search giant.”
Source: Slashdot at https://bit.ly/3PkBOGt
I crafted this imaginary dialog when I read “This Germany-based AI Startup is Developing the Next Enterprise Search Engine Fueled by NLP and Open-Source.” That write up said:
Deepset, a German startup, is working to add to Natural Language Processing by integrating a language awareness layer into the business tech stack, allowing users to access and interact with data using language. Its flagship product, Haystack, is an open-source NLP framework that enables developers to create pipelines for a variety of search use-cases.
But here’s the snappy part of the article:
The Haystack-based NLP is typically implemented over a text database like Elasticsearch or Amazon’s OpenSearch branch and then connects directly with the end-user application through a REST API. It already has thousands of users and over 100 contributors. It uses transformer models to let developers create a variety of applications, such as production-ready question answering (QA), semantic document search, and summarization. The company has also introduced Deepset Cloud, an end-to-end platform for integrating customized and high-performing NLP-powered search systems into your application.
In theory, this is an open source, cloud centric super app, a meta play, a roll up of what’s needed to make finding information sort of work.
The kicker in the story is this statement:
The Berlin-based company has raised $14M in Series A funding led by GV, Alphabet’s venture capital arm.
Yep, the Google is investing. Why? Check that which applies:
( ) Its own innovation engines are the equivalent of a Ford Pinto racing a Tesla Model S Plaid?
( ) Google search is no longer the world’s largest Web site?
( ) Amazon gets more product searches than Google does?
( ) Users are starting to complain about how Google ignores what users key in the search box?
( ) Large sites are not being spidered in a comprehensive or timely manner?
( ) All of the above.
Stephen E Arnold, May 16, 2022
Kyndi: Advanced Search Technology with Quanton Methods. Yes, Quanton
April 29, 2022
One of my newsfeeds spit out this story: “Kyndi Unveils the Kyndi Natural Language Search Solution – Enables Enterprises to Discover and Deliver the Most Relevant and Precise Contextual Business Information at Unprecedented Speed.” The Kyndi founders appear to be business oriented, not engineering focused. The use of jargon like natural language understanding, contextual information, artificial intelligence, software robots, explainable artificial intelligence, and others is now almost automatic as if generated by smart software, not people who have struggled to make content processing and information retrieval work for users.
The firm’s Web site does not provide much detail about the technical plumbing for the company’s search and retrieval system. I took a quick look at the firm’s patents and noted these. I have added bold face to highlight some of the interesting words in these documents.
- A method using Birkhoff polytopes and Landau numbers. See US11205135 “Quanton [sic] Representation for Emulating Quantum-like Computation on Classical Processors,” granted December 21, 2021. Inventor: Arun Majumdar, possibly in Alexandria, Virginia.
- A method employing combinatorial hyper maps. See US10985775 “System and Method of Combinatorial Hypermap Based Data Representations and Operations,” granted April 20, 2021. Inventor: Arun Majumdar, possibly in Alexandria, Virginia. (As a point of interest the document includes the word bijectively.)
- A method making use of Q-Medoids and Q-Hashing. See US10747740 “Cognitive Memory Graph Indexing, Storage and Retrieval,” granted August 18, 2020. Inventor: Arun Majumdar, possibly in San Mateo, California.
- A method using Semantic Boundary Indices and a variant of the VivoMind* Analogy Engine. See US10387784 “Technical and Semantic Signal Processing in Large, Unstructured Data Fields,” granted August 20, 2019. Inventor: Arun Majumdar, possibly in Alexandria, Virginia. *VivoMind was a company started by Arun Majumdar prior to his relationship with Kyndi.
- A method using Rvachev functions and transfinite interpolations. See US10372724 “Relativistic Concept Measuring System for Data Clustering,” granted August 6, 2019. Inventor: Arun Majumdar, possibly in Alexandria, Virginia.
- A method using Clifford algebra. See US10120933 “Weighted Subsymbolic Data Encoding,” granted November 6, 2018. Inventor: Arun Majumdar, possibly in Alexandria, Virginia.
The inventor is not listed on the firm’s Web site. Mr. Majumdar’s contributions are significant. The chief technology officer is Dan Gartung, who is a programmer and entrepreneur. However, there does not seem to be an observable link among the founders, the current CTO, and Mr. Majumdar.
The company will have to work hard to capture mindshare from companies like Algolia (now working to reinvent enterprise search), Mindbreeze, Yext, and X1 (morphing into an eDiscovery system it seems), among others. Kyndi has absorbed more than $20 million in venture funding, but a competitor like Lucidworks has captured in the neighborhood of $200 million.
It is worth noting that one facet of the firm’s marketing is to hire the whiz kids from a couple of mid tier consulting firms to explain the firm’s approach to search. It might be a good idea for the analysts from these firms to read the Kyndi patents and determine how the VivoMind methods have been updated and applied to the Kyndi product. A bit of benchmarking might be helpful. For example, my team uses a collection of Google patents and indexes them, runs test queries, and analyzes the result sets. Almost incomprehensible specialist terminology is one thing, but solid, methodical analysis of a system’s real life performance is another. Precision and recall scores remain helpful, particularly for certain content; for example, pharma research, engineered materials, and nuclear physics.
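For readers who have not scored a result set, here is a bare-bones sketch of the precision and recall arithmetic. It is not my team’s test harness; the function name and the document identifiers are made up, and it assumes someone has already judged which documents are relevant to the test query.

```python
def precision_recall(retrieved, relevant):
    """Score one query: retrieved is the system's result set, relevant is the judged set."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Precision: what share of the returned documents were relevant.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    # Recall: what share of the relevant documents were returned.
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical patent identifiers for a single test query.
retrieved = ["p1", "p2", "p3", "p4"]
relevant = ["p2", "p4", "p7"]
precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up case two of the four returned documents are relevant, and two of the three relevant documents were found. Running the same judged queries against competing systems is what turns marketing claims into numbers one can compare.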
Stephen E Arnold, April 29, 2022