Search the Web: Maybe Find a Nugget or Two for Intrepid Researchers?

June 21, 2022

“A Look at Search Engines with Their Own Indexes” has been updated. The article provides a rundown of systems and services which offer Web search.

Some of the factoids in the article are ones often overlooked by many of the “search experts” generating information about how to find information via open sources. Here are a few which deserve more attention from students of search:

  1. Bing is the most promiscuous supporter of metasearch
  2. YaCy is included in the “unusable” category; however, it is not unusable. YaCy has some properties of interest to cyber sleuths
  3. Neeva’s index is exposed as a mix of some original crawl content with Bing results. (Where’s the Google love for a former Googler’s search system?)
  4. Qwant is exposed for using Bing data
  5. Exalead, arguably better than Pertimm which influenced Qwant, takes some bullets. But Dassault is into other, more lucrative businesses than “search”
  6. Kagi is a for fee service which uses its own index and, like other metasearch systems, taps results from Bing and Google. (Is Google excited yet?)
  7. The Thunderstone service is noted. (How long has Thunderstone been around? Answer: A long time.)

Worth noting the links. Perhaps someone will create a list of the services indexing content for specialized software applications and government agencies. There are hundreds of “data aggregators” but how does one search them for useful results?

I addressed the findability issue in my recent OSINT lecture for the National Cyber Crime Conference attendees and in a follow up session for the Mass. Asso. of Crime Analysts.

Stephen E Arnold, June 21, 2022

Decentralized Presearch Moves from Testnet to Mainnet

June 15, 2022

Yet another new platform hopes to rival the king of the search-engine hill. We think this is one to watch, though, for its approach to privacy, performance, and scope of indexing. PCMagazine asks, “The Next Google? Decentralized Search Engine ‘Presearch’ Exits Testing Phase.” The switch from its Testnet at Presearch.org to the Mainnet at Presearch.com means the platform’s network of some 64,000 volunteer nodes will be handling many more queries. They expect to process more than five million searches a day at first but are prepared to scale to hundreds of millions. Writer Michael Kan tells us:

“Presearch is trying to rival Google by creating a search engine free of user data collection. To pull this off, the search engine is using volunteer-run computers, known as ‘nodes,’ to aggregate the search results for each query. The nodes then get rewarded with a blockchain-based token for processing the search results. The result is a decentralized, community-run search engine, which is also designed to strip out the user’s private information with each search request. Anyone can also volunteer to turn their home computer or virtual server into a node. In a blog post, Presearch said the transition to the Mainnet promises to make the search engine run more smoothly by tapping more computing power from its volunteer nodes. ‘We now have the ability for node operators to contribute computing resources, be rewarded for their contributions, and have the network automatically distribute those resources to the locations and tasks that require processing,’ the company said.”

The blog post referenced above compares this decentralized approach to traditional search-engine infrastructure. An interesting Presearch feature is the row of alternative search options. One can perform a straightforward search in the familiar query box or click a button to directly search sources like DuckDuckGo, YouTube, Twitter, and, yes, Google. Reflecting its blockchain connection, the page also supplies buttons to search Etherscan, CoinGecko, and CoinMarketCap for related topics. Presearch gained 3.8 million registered users between its Testnet launch in October 2020 and the shift to its Mainnet. We are curious to see how fast it will grow from here.

Cynthia Murrell, June 15, 2022

Are There Google Wolves in Stealth Privacy Clothing?

June 8, 2022

A growing number of search engines are cropping up that purport to protect one’s privacy. Lukol is one of these. A brief entry at The New Leaf Journal questions that site’s privacy promises in, “Lukol Search Engine Shows Up in Logs.” New Leaf editor Nicholas A Ferrell noticed a paradox: though Lukol bills itself as an “anonymous search engine,” it is also “powered by Google Search.” Further investigation revealed this paragraph in the site’s privacy policy:

“We use cookies to personalise content and ads, and to analyse our traffic. We also share information about your use of our site with our advertising and analytics partners who may combine it with other information you’ve provided to them or they’ve collected from your use of their services. If you wish to opt out of Google cookies you may do so by visiting the Google privacy policy page.”

It seems the word “privacy” does not mean what Lukol thinks it means. Ferrell comments:

“So this anonymous search engine stores cookies on your computer to serve you with personalized ‘content and ads’ and it shares information about your use of the site with ‘advertising and analytics partners.’ It then directs you to Google’s privacy policy page for information about how to opt out of Google cookies. While I struggle to see how Lukol is privacy-friendly (much less anonymous), it is a great example for why it is important to look behind catchy promises about privacy and anonymity.”

Agreed. Lukol is basically Google Search with some added manipulations. None of which appear to protect user privacy. Let the searcher beware.

Cynthia Murrell, June 8, 2022

DuckDuckGo: A Duck May Be Plucked

May 25, 2022

Metasearch engines are not understood by most Internet users. Here’s my simplified take: A company thinks it can add value to the results output from an ad-supported search engine. Maybe the search engine is a for-fee outfit? Either way, the metasearch system gets the okay to send queries and get results. The results stream back to the metasearch outfit and the value-adding takes place.

One of the better metasearch systems was the pre-IBM Vivisimo. This outfit sent out queries to an ad-supported search engine, accepted the results, and then clustered them. The results appeared to the Vivisimo user as a results list with some folders in a panel. The idea was that the user could scan the folders and the results list. The user could decide to click on a folder and see what results it contained or just click on a link. The magic, as I understood it, was that the clustering took place in near real time. Plus, the original pre-IBM Vivisimo system could send the user’s query to multiple Web search engines. The results from each search system would be de-duplicated. An interesting factoid from the 2000s is that search systems returned overlapping results 70 percent or more of the time. Dumping the duplicates was helpful. There were other interesting metasearch systems as well, but I am just using Vivisimo as an example of a pretty good one.
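For readers who want to see the de-duplication idea in concrete terms, here is a minimal Python sketch. It is not Vivisimo's method, which was never public in this level of detail. The function names, the URL normalization rule, and the round-robin merge are all assumptions made for illustration, and the sample URLs are invented.

```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Build a comparison key (assumed rule): ignore scheme, 'www.', and trailing slash."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    key = host + parts.path.rstrip("/")
    if parts.query:
        key += "?" + parts.query
    return key


def merge_results(result_lists):
    """Interleave ranked result lists round-robin, keeping the first copy of each URL."""
    seen = set()
    merged = []
    longest = max((len(results) for results in result_lists), default=0)
    for rank in range(longest):
        for results in result_lists:
            if rank < len(results):
                key = normalize_url(results[rank])
                if key not in seen:
                    seen.add(key)
                    merged.append(results[rank])
    return merged


def overlap_ratio(first, second):
    """Share of results in `first` that also appear in `second`."""
    if not first:
        return 0.0
    second_keys = {normalize_url(url) for url in second}
    hits = sum(1 for url in first if normalize_url(url) in second_keys)
    return hits / len(first)


# Hypothetical result lists from two engines for the same query.
engine_a = ["https://www.example.com/a/", "https://example.org/b", "https://example.net/c"]
engine_b = ["http://example.com/a", "https://example.org/b", "https://example.edu/d"]

merged = merge_results([engine_a, engine_b])
overlap = overlap_ratio(engine_a, engine_b)
```

With these made-up lists, two of the three results from the first engine also turn up in the second, so the merged list carries four links instead of six. A real system would need a more careful notion of "same document" than a URL key, and the clustering step is a separate problem not shown here.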

Privacy, like security, is a tricky concept to explain.

Using privacy to sell a free Web search system raises a number of questions; for example:

  1. What does privacy mean in the specific context of the metasearch engine?
  2. Where is the money coming from to keep the lights on at the metasearch outfit?
  3. What about log files?
  4. What about legal orders to reveal data about users?
  5. What’s the quid pro quo with the search engine or engines whose results the metasearch system uses?
  6. What part of the search chain captures data, inserts trackers, bugs, cookies, etc. into the user’s query?

None of these questions catches the attention of the real news folks, nor do most users know what answering the questions requires. The metasearch engines typically do not become chatty Cathies when someone like me shows up to gather information about metasearch systems. I recall the nervousness of the New York City wizard who cooked up Ixquick and the evasiveness of the owner of the Millionshort services.

Now we come to the notion that a duck can be plucked. My hunch is that plucking a duck is a messy affair for both duck and duck plucker.

“DuckDuckGo Browser Allows Microsoft Trackers Due to Search Agreement” presents information which appears to suggest that the “privacy” oriented DuckDuckGo metasearch system is not so private as some believed. The cited article states:

The privacy-focused DuckDuckGo browser purposely allows Microsoft trackers on third-party sites due to an agreement in their syndicated search content contract between the two companies.

You can read the cited article to get more insight into the assertion that DuckDuckGo has been plucked in the feathered hole of privacy.

Am I surprised? No. Search is without a doubt one of the most remarkable business segments for soft fraud. How do I know? My partners and I created The Point in 1994, and even though you don’t remember it, I sure remember what I learned about finding information online. Lycos (CMGI) bought our curated search business, and I wrote several books about search. You know what? No one wants to think about search and soft fraud. Maybe more people should?

Net net: Free comes at a cost. One does not know what one does not know.

Stephen E Arnold, May 25, 2022

Does Google Have Search Fear?

May 16, 2022

I can hear the Googlers at a search engine optimization conference saying this:

Our recent investments in search are designed to provide a better experience for our users. Our engineers are always seeking interesting, new, and useful ways to make the world’s information more accessible.

What these code words mean to me is:

Yep, the ancient Larry and Sergey thing. Not working. Oh, my goodness. What are we going to do? Buy Neeva, Kagi, Seekr, and Wecript? Let’s let Alphabet invest and we can learn and maybe earn before more people figure out our results are not as good as Bing and DuckDuckGo’s.

Even Slashdot is running items which make clear that Google and search do not warrant the title of “search giant.”

image

Source: Slashdot at https://bit.ly/3PkBOGt

I crafted this imaginary dialog when I read “This Germany-based AI Startup is Developing the Next Enterprise Search Engine Fueled by NLP and Open-Source.” That write up said:

Deepset, a German startup, is working to add to Natural Language Processing by integrating a language awareness layer into the business tech stack, allowing users to access and interact with data using language. Its flagship product, Haystack, is an open-source NLP framework that enables developers to create pipelines for a variety of search use-cases.

But here’s the snappy part of the article:

The Haystack-based NLP is typically implemented over a text database like Elasticsearch or Amazon’s OpenSearch branch and then connects directly with the end-user application through a REST API. It already has thousands of users and over 100 contributors. It uses transformer models to let developers create a variety of applications, such as production-ready question answering (QA), semantic document search, and summarization. The company has also introduced Deepset Cloud, an end-to-end platform for integrating customized and high-performing NLP-powered search systems into your application.

In theory, this is an open source, cloud centric super app, a meta play, a roll up of what’s needed to make finding information sort of work.

The kicker in the story is this statement:

The Berlin-based company has raised $14M in Series A funding led by GV, Alphabet’s venture capital arm.

Yep, the Google is investing. Why? Check that which applies:

( ) Its own innovation engines are the equivalent of a Ford Pinto racing a Tesla Model S Plaid? Google search is no longer the world’s largest Web site?

( ) Amazon gets more product searches than Google does?

( ) Users are starting to complain about how Google ignores what users key in the search box?

( ) Large sites are not being spidered in a comprehensive or timely manner?

( ) All of the above.

Stephen E Arnold, May 16, 2022

Kyndi: Advanced Search Technology with Quanton Methods. Yes, Quanton

April 29, 2022

One of my newsfeeds spit out this story: “Kyndi Unveils the Kyndi Natural Language Search Solution – Enables Enterprises to Discover and Deliver the Most Relevant and Precise Contextual Business Information at Unprecedented Speed.” The Kyndi founders appear to be business oriented, not engineering focused. The use of jargon like natural language understanding, contextual information, artificial intelligence, software robots, explainable artificial intelligence, and others is now almost automatic as if generated by smart software, not people who have struggled to make content processing and information retrieval work for users.

The firm’s Web site does not provide much detail about the technical plumbing for the company’s search and retrieval system. I took a quick look at the firm’s patents and noted these. I have added bold face to highlight some of the interesting words in these documents.

  • A method using Birkhoff polytopes and Landau numbers. See US11205135 “Quanton [sic] Representation for Emulating Quantum-like Computation on Classical Processors,” granted December 21, 2021. Inventor: Arun Majumdar, possibly in Alexandria, Virginia.
  • A method employing combinatorial hyper maps. See US10985775 “System and Method of Combinatorial Hypermap Based Data Representations and Operations,” granted April 20, 2021. Inventor: Arun Majumdar, possibly in Alexandria, Virginia. (As a point of interest the document includes the word bijectively.)
  • A method making use of Q-Medoids and Q-Hashing. See US10747740 “Cognitive Memory Graph Indexing, Storage and Retrieval,” granted August 18, 2020. Inventor: Arun Majumdar, possibly in San Mateo, California.
  • A method using Semantic Boundary Indices and a variant of the VivoMind* Analogy Engine. See US10387784 “Technical and Semantic Signal Processing in Large, Unstructured Data Fields,” granted August 20, 2019. Inventor: Arun Majumdar, possibly in Alexandria, Virginia. *VivoMind was a company started by Arun Majumdar prior to his relationship with Kyndi.
  • A method using Rvachev functions and transfinite interpolations. See US10372724 “Relativistic Concept Measuring System for Data Clustering,” granted August 6, 2019. Inventor: Arun Majumdar, possibly in Alexandria, Virginia.
  • A method using Clifford algebra. See US10120933 “Weighted Subsymbolic Data Encoding,” granted November 6, 2018. Inventor: Arun Majumdar, possibly in Alexandria, Virginia.

The inventor is not listed on the firm’s Web site. Mr. Majumdar’s contributions are significant. The chief technology officer is Dan Gartung, who is a programmer and entrepreneur. However, there does not seem to be an observable link among the founders, the current CTO, and Mr. Majumdar.

The company will have to work hard to capture mindshare from companies like Algolia (now working to reinvent enterprise search), Mindbreeze, Yext, and X1 (morphing into an eDiscovery system it seems), among others. Kyndi has absorbed more than $20 million in venture funding, but a competitor like Lucidworks has captured in the neighborhood of $200 million.

It is worth noting that one facet of the firm’s marketing is to hire the whiz kids from a couple of mid tier consulting firms to explain the firm’s approach to search. It might be a good idea for the analysts from these firms to read the Kyndi patents and determine how the VivoMind methods have been updated and applied to the Kyndi product. A bit of benchmarking might be helpful. For example, my team uses a collection of Google patents and indexes them, runs test queries, and analyzes the result sets. Almost incomprehensible specialist terminology is one thing, but solid, methodical analysis of a system’s real life performance is another. Precision and recall scores remain helpful, particularly for certain content; for example, pharma research, engineered materials, and nuclear physics.

Stephen E Arnold, April 29, 2022

Web Search Alternatives Compete with Gusto

April 22, 2022

Search and information blog DKB shares a roundup of interesting search systems in, “The Next Google.” Are we confident any of these will be the next Google? Nope. But there are several our readers might find useful. While relatively popular Google alternatives like DuckDuckGo and Bing are based on the Google model, the apps on this list take their own paths. The write-up tells us:

“The next Google can’t just be an input box that spits out links. We need new thinking to create something much better than what came before. In the last few years, different groups of people came to the same conclusion, and started working on the next generation of search engines. For this new generation, privacy is necessary, and invasive ads are not an option. But that’s where the commonalities end. Beyond that, they’ve all taken the idea of a search engine in very different directions. … This new wave of search engines is only just getting started. Many of them have only recently launched. Even if they aren’t perfect yet, the paths they’re exploring can lead to promising new innovation in the stagnant search space.”

First is Kagi, which emphasizes customization. Users decide how they want information presented and can refine the sources the search taps into. Then there is Neeva, which takes searches beyond the web and into one’s personal resources, like email and a wide array of online file storage systems. You.com tries to match each query with the source most relevant to the type of question, while Andi takes a little time to pinpoint the best answer and deliver it with the feel of a real conversation. Finally, Brave Search boasts its own independent index that does not rely on Google or Bing for results, an unusual achievement indeed. See the write-up for more information on each of these systems. No, Google is not going to be replaced across the Web any time soon. But some readers may find an option here that could replace it in their own browsers, at least some of the time.

Cynthia Murrell, April 22, 2022

Nuclia: The Solution to the Enterprise Search Problem?

April 21, 2022

I read an interesting article called “Spanish Startup Nuclia Gets $5.4M to Advance Unstructured Data Search.” The article includes an illustration, presumably provided by Nuclia, which depicts search as a super app accessed via APIs.

image

Source: Silicon Angle and possibly Nuclia.com. Consult the linked story to see the red lines zip around without bottlenecks. (What? Bottlenecks in content processing, index updating, and query processing. Who ever heard of such a thing?)

Here are some of the highlights — assertions is probably a better word — about the Nuclia technology:

  • The system is “AI powered.”
  • Nuclia can “connect to any data source and automatically index its content regardless of what format or even language it is in.”
  • The system can “discover semantic results, specific paragraphs in text and relationships between data. These capabilities can be integrated in any application with ease.”
  • Nuclia can “detect images within unstructured datasets.”
  • The cloud-based service can “say one video is X% similar to another one, and so on.”

What makes the Nuclia approach tick? There are two main components:

  • The Nuclia vector database which is available via GitHub
  • The application programming interface.

The news hook for the search story is that investors have input $5.4 million in seed funding to the company.

Algolia wants to reinvent search. Maybe Nuclia has? Google is search, but it may be intrigued with the assertions about vector embeddings and finding similarities which may be otherwise overlooked. The idea is that the ad for Liberty Mutual might be displayed in YouTube videos about seized yachts by business wizards on one or more lists of interesting individuals. Elastic may want to poke around Nuclia in a quest for adding some new functionality to its search system.
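What might “X% similar” mean when vector embeddings are involved? One common reading is cosine similarity between two embedding vectors, scaled to a percentage. The sketch below assumes that reading. Nuclia's own scoring is not described in the cited story, so the function names, the clamping of negative scores to zero, and the toy three-number vectors are illustrative assumptions only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def percent_similar(a, b):
    """Express cosine similarity as a percentage; negative scores are clamped to 0 (assumed convention)."""
    return round(100 * max(0.0, cosine_similarity(a, b)), 1)


# Toy "embeddings" for three videos. Real embeddings have hundreds of dimensions.
video_1 = [0.9, 0.1, 0.3]
video_2 = [0.8, 0.2, 0.4]
video_3 = [0.0, 1.0, 0.0]

score_close = percent_similar(video_1, video_2)
score_far = percent_similar(video_1, video_3)
```

The arithmetic is the easy part. The quality of the percentage depends entirely on the model that produced the vectors, which is where the marketing assertions would need testing.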

Enterprise search seems to be slightly less dormant than it has been.

Stephen E Arnold, April 21, 2022

Google Web Search Quality

April 20, 2022

The cat is out of the bag. The Reddit thread “Does Anyone Else Think Google Search Quality Has Gone Downhill Fast?” provides an interesting series of comments about “quality.”

The notion of “search quality” in the good old days involved gathering a corpus of text. The text was indexed using a system; for example, Smart or maybe Personal Bibliographic software. Test queries would be created in order to determine how the system displayed search results. The research minded person would then examine the corpus and determine if the result set returned the best matches. There are tricks those skilled in the art could use to make the test queries perform. One would calculate precision and recall. Bingo metrics. Now here’s the good part. Another search system would be used to index the content; for example, something interesting like the “old” Sagemaker, the mainframe fave IBM STAIRS III, or Excalibur. The performance of the second system would be compared to the first system. One would do this over time and generate precision and recall scores which could be compared. We used to use a corpus of Google patents, and I remember that Perfect Search (remember that one, gentle reader) outperformed a number of higher profile and allegedly more advanced systems.
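The old-school procedure above boils down to simple arithmetic. Here is a minimal Python sketch, assuming someone has already judged which documents in the corpus are relevant for each test query. The document IDs, the system names, and the choice to average scores across queries are made up for illustration; they are not the scores or setup from the tests mentioned above.

```python
def precision_recall(retrieved, relevant):
    """Precision: share of retrieved docs that are relevant. Recall: share of relevant docs retrieved."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


def compare_systems(judgments, runs):
    """Average precision and recall per system over all judged test queries.

    judgments: query -> set of relevant document IDs
    runs: system name -> query -> list of retrieved document IDs
    """
    scores = {}
    for system, results in runs.items():
        pairs = [precision_recall(results.get(query, []), relevant)
                 for query, relevant in judgments.items()]
        if pairs:
            mean_precision = sum(p for p, _ in pairs) / len(pairs)
            mean_recall = sum(r for _, r in pairs) / len(pairs)
        else:
            mean_precision = mean_recall = 0.0
        scores[system] = (mean_precision, mean_recall)
    return scores


# Hypothetical relevance judgments and result sets for two systems.
judgments = {"q1": {"d1", "d2", "d3"}, "q2": {"d4"}}
runs = {
    "system_a": {"q1": ["d1", "d2", "d9"], "q2": ["d4", "d5"]},
    "system_b": {"q1": ["d1"], "q2": ["d7"]},
}

scores = compare_systems(judgments, runs)
```

Run the same judged queries against each system over time, and the pairs of numbers can be compared. The hard work is the relevance judging, not the math.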

I am not sure Reddit posts are into precision and recall, but the responses to the question about degradation of Google search quality are fascinating. Those posting are not too happy with what Google delivers and how the present day Googley search and retrieval system works. Thank you, Prabhakar Raghavan, former search wizard executive at Verity (wow, that was outstanding) and the individual who argued with a Bear Stearns managing director and me about how much better Yahoo’s semantic technology was than Google’s. (Raghavan was at Yahooooo then and we know how wonderful Yahoo search was!)

Here’s a rundown of some of the issues identified in the Reddit thread:

  • From PizzaInteraction: “always laugh when I enter like 4 search terms and all the results focus on just one of the terms.”
  • Healthy-Contest-1605: “Every algorithm is being gamed to have their trash come out in top.”
  • Cl0udSurfer: “the usual tricks like adding quotes around required words, or putting a dash in front of words that should be excluded don’t work anymore.”
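The “usual tricks” mentioned in the last comment have a simple expected behavior: a quoted phrase must appear, and a word with a dash in front must not. The Python sketch below shows that strict interpretation against a plain text string. It is an illustration of what users expect, not a description of how Google parses queries; the function names and the sample query are assumptions.

```python
import re


def parse_query(query):
    """Split a query into required phrases ("..."), excluded words (-word), and other terms."""
    required = [phrase.lower() for phrase in re.findall(r'"([^"]+)"', query)]
    remainder = re.sub(r'"[^"]*"', " ", query)
    excluded = []
    optional = []
    for token in remainder.split():
        if token.startswith("-") and len(token) > 1:
            excluded.append(token[1:].lower())
        else:
            optional.append(token.lower())
    return required, excluded, optional


def matches(text, query):
    """True only if every quoted phrase is present and no excluded word appears."""
    required, excluded, _ = parse_query(query)
    lowered = text.lower()
    words = set(re.findall(r"\w+", lowered))
    if any(phrase not in lowered for phrase in required):
        return False
    if any(term in words for term in excluded):
        return False
    return True


# Hypothetical query: the phrase is mandatory, the word "car" is forbidden.
query = 'jaguar "top speed" -car'
documents = [
    "The jaguar has a top speed of about 80 km/h",
    "Jaguar car top speed review",
    "Jaguar habitat and diet",
]
kept = [doc for doc in documents if matches(doc, query)]
```

With this strict reading only the first document survives. The complaint in the thread is that the results no longer behave this way.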

Net net: This is the Verity-Yahoo trajectory. Precision and recall? Ho ho ho. What about disclosing when a source was indexed and updated? What about Boolean operators? What about making as much money as possible so one can go to a high school reunion and explain the wonderfulness of one’s cleverness? What happened to Louis Monier, Sanjay Ghemawat, and the Backrub crowd?

Stephen E Arnold, April 20, 2022

Google Responds to Amazon Product Search Growth

April 20, 2022

Here is a new feature from Google, dubbed Lens, that we suspect was designed to win back product-search share from Amazon. TechCrunch reveals, “Google’s New ‘Multisearch’ Feature Lets You Search Using Text and Images at the Same Time.” The mobile-app feature, now running as a beta in the US, is available on Android and iOS. As one would expect, it allows one to ask questions or refine search results for a photo or other image. Writer Aisha Malik reports:

“Google told TechCrunch that the new feature currently has the best results for shopping searches, with more use cases to come in the future. With this initial beta launch, you can also do things beyond shopping, but it won’t be perfect for every search. In practice, this is how the new feature could work. Say you found a dress that you like but aren’t a fan of the color it’s available in. You could pull up a photo of the dress and then add the text ‘green’ in your search query to find it in your desired color. In another example, you’re looking for new furniture, but want to make sure it complements your current furniture. You can take a photo of your dining set and add the text ‘coffee table’ in your search query to find a matching table. Or, say you got a new plant and aren’t sure how to properly take care of it. You could take a picture of the plant and add the text ‘care instructions’ in your search to learn more about it.”

Malik notes this feature is great for times when neither an image nor words by themselves produce great Google results—a problem the platform has wrestled with. Lens employs the company’s latest ready-for-prime-time AI tech, but the developers hope to go further and incorporate their budding Multitask Unified Model (MUM). See the write up for more information, including a few screenshots of Lens at work.

Cynthia Murrell, April 20, 2022
