ZincSearch: An Alternative to Elasticsearch

December 16, 2022

The recently launched ZincSearch is an Elasticsearch alternative worth looking into, even though several features are not yet fully formed. The nascent enterprise search engine promises lower complexity and lower costs. The About Us page describes its edge search and an experimental stateless server that can be scaled horizontally. The home page emphasizes:

“ZincSearch is built for Full Text Search: ZincSearch is a search engine that can be used for any kind of text data. It can be used for logs, metrics, events, and more. It allows you to do full text search among other things. e.g. Send server logs to ZincSearch for them or you can push your application data and provide full text search or you can build a search bar in your application using ZincSearch.

    • Easy to Setup & Operate: ZincSearch provides the easiest way to get started with log capture, search and analysis. It has simple APIs to interact and integrates with leading log forwarders allowing you to get operational in minutes.
    • Low resource requirements: It uses far less CPU and RAM compared to alternatives allowing for lower cost to run. Developers can even run it on their laptops without ever noticing its resource utilization. …
    • Schemaless Indexes: No need to work hard to define schema ahead of time. ZincSearch automatically discovers schema so you can focus on search and analysis.
    • Aggregations: Do faceted search and analyze your data.”

ZincSearch would not attract many conversions if it made migration difficult, so of course it is compatible with the Elasticsearch API. To a point, anyway: the developers are still working on an Elasticsearch-compatible query API. ZincSearch can store data in S3 and MinIO, though that capability is currently experimental. It sounds promising; we look forward to seeing where ZincSearch stands a year or so from now.
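
For readers who want to kick the tires, the sketch below indexes one log record and runs a match query over ZincSearch’s native HTTP API. It is a minimal sketch based on the project’s quick-start documentation; the endpoint paths, payload fields, index name, and default credentials are assumptions to verify against the current docs.

```python
import requests

ZINC = "http://localhost:4080"          # default port in the ZincSearch quick-start
AUTH = ("admin", "Complexpass#123")     # assumed quick-start credentials; change in practice

# Index a single log record (endpoint path assumed from the documented document API).
doc = {"level": "error", "service": "billing", "message": "disk quota exceeded"}
requests.post(f"{ZINC}/api/app-logs/_doc", json=doc, auth=AUTH).raise_for_status()

# Run a match query with the native search API (payload shape assumed from the docs).
query = {
    "search_type": "match",
    "query": {"term": "disk quota", "field": "_all"},
    "from": 0,
    "max_results": 10,
}
hits = requests.post(f"{ZINC}/api/app-logs/_search", json=query, auth=AUTH).json()
print(hits)
```

Nothing more than plain HTTP and JSON is involved, which is part of the low-complexity pitch.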

A blog post by ZincSearch creator Prabhat Sharma not only discusses his reasons for making his solution but also gives a useful summary of enterprise search in general. The startup is based in San Francisco.

Cynthia Murrell, December 16, 2022

Open Source Desktop Search Tool Recoll

December 13, 2022

Anyone searching for an alternative desktop search option might consider Recoll, an open source tool based on the Xapian search engine library. The latest version, 1.33.3, was released just recently. The landing page specifies:

“Recoll finds documents based on their contents as well as their file names.

The software is free on Linux, open source, and licensed under the GPL. Detailed features and application requirements for supported document types.”

Recoll began as a tool to augment the search functionality of the Linux desktop environment, a familiar pain point for users of that open source OS. It has since expanded to Windows and Mac, so users across the OS spectrum can try Recoll. Check it out, dear reader, if you crave a different desktop search solution.

Cynthia Murrell, December 13, 2022

Sonic: an Open Source Elasticsearch Alternative for Lighter Backends

November 10, 2022

When business messaging platform Crisp launched in 2015, it did so knowing its search functionality was lacking. Unfortunate, but at least the company never pretended otherwise. The team found Elasticsearch was not scalable for its needs, and the SQL database it tried proved ponderous. Finally, in 2019, a solution emerged. Cofounder Valerian Saliou laid out the specifics in his blog post, “Announcing Sonic: A Super-Light Alternative to Elasticsearch.” He wrote:

“This was enough to justify the need for a tailor-made search solution. The Sonic project was born.

What is Sonic? Sonic can be found on GitHub as Sonic, a Fast, lightweight & schema-less search backend. Quoting what Sonic is from the GitHub page of the project: ‘Sonic is a fast, lightweight and schema-less search backend. It ingests search texts and identifier tuples, that can then be queried against in microseconds time. Sonic can be used as a simple alternative to super-heavy and full-featured search backends such as Elasticsearch in some use-cases. Sonic is an identifier index, rather than a document index; when queried, it returns IDs that can then be used to refer to the matched documents in an external database.’ Sonic is built in Rust, which ensures performance and stability. You can host it on a server of yours, and connect your apps to it over a LAN via Sonic Channel, a specialized protocol. You’ll then be able to issue search queries and push new index data from your apps — whichever programming language you work with. Sonic was designed to be fast and lightweight on resources.”

Not only do Crisp users get the benefits of this tool, but it is also available as open-source software. A few features of note include auto-complete and typo correction, Unicode compatibility, and user-friendly libraries. See the detailed write-up for the developers’ approach to Sonic, its benefits and limitations, and implementation notes.
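
For a sense of what Sonic Channel looks like in practice, here is a minimal sketch of an ingest-then-search exchange over a raw socket. The command grammar is paraphrased from the project’s protocol documentation; the default port, password, collection, bucket, and object names are hypothetical, so treat this as an illustration rather than a reference client.

```python
import socket

def sonic_channel(mode, password, host="localhost", port=1491):
    """Open a Sonic Channel session in 'ingest' or 'search' mode (protocol details assumed)."""
    sock = socket.create_connection((host, port))
    chan = sock.makefile("rw", encoding="utf-8", newline="\n")
    chan.readline()                                   # CONNECTED banner from the server
    chan.write(f"START {mode} {password}\n")
    chan.flush()
    chan.readline()                                   # STARTED acknowledgement
    return sock, chan

# Ingest mode: map a text snippet to an identifier the application understands.
sock, chan = sonic_channel("ingest", "SecretPassword")
chan.write('PUSH messages default msg:42 "delivery receipt for order 9000"\n')
chan.flush()
print(chan.readline().strip())                        # expect OK
chan.write("QUIT\n"); chan.flush(); sock.close()

# Search mode: Sonic returns matching identifiers, not documents.
sock, chan = sonic_channel("search", "SecretPassword")
chan.write('QUERY messages default "delivery receipt" LIMIT(10)\n')
chan.flush()
print(chan.readline().strip())                        # PENDING <marker>
print(chan.readline().strip())                        # EVENT QUERY <marker> msg:42
chan.write("QUIT\n"); chan.flush(); sock.close()
```

The pattern matters more than the syntax: the application keeps its documents in its own database and uses Sonic only to turn query terms into identifiers.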

Cynthia Murrell, November 10, 2022

Sepana: A Web 3 Search System

November 8, 2022

Decentralized search is the most recent trend my team and I have been watching. We noted “Decentralized Search Startup Sepana Raises $10 Million.” The write up reports:

Sepana seeks to make web3 content such as DAOs and NFTs more discoverable through its search tooling.

What’s the technical angle? The article points out:

One way it’s doing this is via a forthcoming web3 search API that aims to enable any decentralized application (dapp) to integrate with its search infrastructure. It claims that millions of search queries on blockchains and dapps like Lens and Mirror are powered by its tooling.

With search vendors working overtime to close deals and keep stakeholders from emulating Vlad the Impaler, some vendors are making deals with extremely interesting companies. Here’s a question for you: “What company is Elastic’s new best friend?” Elasticsearch has been a favorite of many companies. However, Amazon nosed into the Elastic space. Furthermore, Amazon appears to be interested in creating a walled garden protected by a moat around its search technologies.

One area for innovation is the notion of avoiding centralization. Unfortunately, online means that centralization becomes an emergent property. That’s one of my pesky Arnold’s Laws of Online. But why rain on the decentralized systems parade?

Sepana’s approach is interesting. You can get more information at https://sepana.io. Also you can check out Sepana’s social play at https://lens.sepana.io/.

Stephen E Arnold, November 8, 2022

Vectara: Another Run Along a Search Vector

November 4, 2022

Is this the enterprise search innovation we have been waiting for? A team of ex-Googlers have used what they learned about large language models (LLMs), natural language processing (NLP), and transformer techniques to launch a new startup. We learn about their approach in VentureBeat‘s article, “Vectara’s AI-Based Neural Search-as-a-Service Challenges Keyword-Based Searches.” The platform combines LLMs, NLP, data integration pipelines, and vector techniques into a neural network. The approach can be used for various purposes, we learn, but the company is leading with search. Journalist Sean Michael Kerner writes:

“[Cofounder Amr] Awadallah explained that when a user issues a query, Vectara uses its neural network to convert that query from the language space, meaning the vocabulary and the grammar, into the vector space, which is numbers and math. Vectara indexes all the data that an organization wants to search in a vector database, which will find the vector that has closest proximity to a user query. Feeding the vector database is a large data pipeline that ingests different data types. For example, the data pipeline knows how to handle standard Word documents, as well as PDF files, and is able to understand the structure. The Vectara platform also provides results with an approach known as cross-attentional ranking that takes into account both the meaning of the query and the returned results to get even better results.”
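
The mechanics Awadallah describes, encoding text into vectors and ranking by proximity, can be sketched generically. The snippet below is not Vectara’s code or API; it is a toy illustration in which hand-made three-dimensional vectors stand in for the output of a trained encoder, showing how nearest-vector ranking works.

```python
import numpy as np

# Toy document vectors standing in for encoder output (a real system would use a
# trained language model to produce high-dimensional embeddings).
doc_ids = ["sales-report", "onboarding-guide", "rate-limit-policy"]
doc_vecs = np.array([
    [0.90, 0.10, 0.00],   # finance-flavored
    [0.10, 0.90, 0.10],   # HR-flavored
    [0.00, 0.20, 0.95],   # API-flavored
])
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)

# A query like "how many requests per minute are allowed" would land near the API region.
query_vec = np.array([0.05, 0.10, 0.90])
query_vec /= np.linalg.norm(query_vec)

scores = doc_vecs @ query_vec                 # cosine similarity (all vectors unit-normalized)
print(doc_ids[int(np.argmax(scores))])        # -> rate-limit-policy
```

The keyword never has to appear in the document; proximity in the vector space does the matching, which is the point of the neural pitch.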

We are reminded a transformer puts each word into context for studious algorithms, relating it to other words in the surrounding text. But what about things like chemical structures, engineering diagrams, embedded strings in images? It seems we must wait longer for a way to easily search for such non-linguistic, non-keyword items. Perhaps Vectara will find a way to deliver that someday, but next it plans to work on a recommendation engine and a tool to discover related topics. The startup, based in Silicon Valley, launched in 2020 under the “stealth” name Zir AI. Recent seed funding of $20 million has enabled the firm to put on its public face and put out this inaugural product. There is a free plan, but one must contact the company for any further pricing details.

Cynthia Murrell, November 4, 2022

Wonderful Statement about Baked In Search Bias

October 12, 2022

I was scanning the comments on the Hacker News post about this article: “Google’s Million’s of Search Results Are Not Being Served in the Later Pages Search Results.”

Sailfast made this comment at this link:

Yeah – as someone that has run production search clusters before on technologies like Elastic / open search, deep pagination is rarely used and an extremely annoying edge case that takes your cluster memory to zero. I found it best to optimize for whatever is a reasonable but useful for users while also preventing any really seriously resource intensive but low value queries (mostly bots / folks trying to mess with your site) to some number that will work with your server main node memory limits.

The comment outlines a facet of search which is not often discussed.
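
In concrete terms, the guardrail Sailfast describes usually takes two forms in Elasticsearch or OpenSearch: capping how far from/size pagination can reach, and steering legitimate deep scrolling toward search_after. The index and field names below are hypothetical; the setting and query parameters are standard parts of both engines.

```python
import requests

ES = "http://localhost:9200"

# Cap from/size pagination (the default ceiling is already 10,000 hits per query).
requests.put(
    f"{ES}/articles/_settings",
    json={"index": {"max_result_window": 1000}},
).raise_for_status()

# For legitimate deep scrolling, search_after walks the result set page by page
# without the per-shard memory cost of a large "from" offset.
base_query = {
    "size": 100,
    "query": {"match": {"body": "zinc"}},
    "sort": [{"published_at": "desc"}, {"article_id": "asc"}],  # unique tiebreaker field
}
page = requests.post(f"{ES}/articles/_search", json=base_query).json()
last_sort_values = page["hits"]["hits"][-1]["sort"]

next_query = dict(base_query, search_after=last_sort_values)
next_page = requests.post(f"{ES}/articles/_search", json=next_query).json()
```

Users who ask for page 1,000 simply get an error, which is exactly the kind of engineering decision discussed below.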

First, the search plumbing imposes certain constraints. The idea of “all” information is one that many carry around like a trusted portmanteau. What are the constraints of the actual search system available or in use?

Second, optimization is a fancy word that translates to one or more engineers deciding what to do; for example, change a Bayesian prior assumption, trim content based on server latency, filter results by domain, etc.

Third, manipulation of the search system itself by software scripts or “bots” forces engineers to figure out which signals are okay and which are not. It is possible to inject poisoned numerical strings or phrases into a content stream and manipulate the search system. (Hey, thank you, search engine optimization researchers and information warfare professionals. Great work.)

When I meet a younger person who says, “I am a search expert,” I just shake my head. Even open source intelligence experts reveal that they live in a cloud of unknowing about search. Most of these professionals are unaware that their “research” comes from Google search and maps.

Net net: Search and retrieval systems manifest bias from the engineers, from the content itself, from the algorithms, and from the user interfaces themselves. That’s why I say in my lectures, “Life is easier if one just believes everything one encounters online.” Thinking in a different way is difficult; it requires specialist knowledge and a willingness to verify… everything.

Stephen E Arnold, October 12, 2022

Elastic: Bouncing Along

October 12, 2022

It seems like open-source search is under pressure. We learn from SiliconAngle that “Elastic Delivers Strong Revenue Growth and Beats Expectations, but Its Stock is Down.” For anyone unfamiliar with Elastic, writer Mike Wheatley describes the company’s integral relationship with open-source software:

“The company sells a commercial version of the popular open-source Elasticsearch platform. Elasticsearch is used by enterprises to store, search and analyze massive volumes of structured and unstructured data. It allows them to do this very quickly, in close to real time. The platform serves as the underlying engine for millions of applications that have complex search features and requirements. In addition to Elasticsearch, Elastic also sells application observability tools that help companies to track network performance, as well as threat detection software.”

Could it be that recent concerns about open-source security issues are more important to investors than fiscal success? The write-up shares some details from the company’s press release:

“The company reported a loss before certain costs such as stock compensation of 15 cents per share, coming in ahead of Wall Street analysts’ consensus estimate of a 17-cent-per-share loss. Meanwhile, Elastic’s revenue grew by 30% year-over-year, to $250.1 million, beating the consensus estimate of $246.2 million. On a constant currency basis, Elastic’s revenue rose 34%. Altogether, Elastic posted a net loss of $69.6 million, more than double the $34.4 million loss it reported in the year-ago period.”

Elastic emphatically accentuates the positive—like the dramatic growth of its cloud-based business and its flourishing subscription base. See the source article or the press release for more details. We are curious to see whether the company’s new chief product officer Ken Exner can find a way to circumvent open-source’s inherent weaknesses. Exner used to work at Amazon overseeing AWS Developer Tools. Founded in 2012, Elastic is based in Mountain View, California.

Cynthia Murrell, October 12, 2022

Waking Up to a Basic Fact of Online: Search and Retrieval Is Terrible

October 10, 2022

I read “Why Search Sucks.” The metadata for the article is, and I quote:

search-web-email-google-streaming-online-shopping-broken-2022-4

I spotted the article in a newsfeed, and I noticed it was published in April 2022 maybe? Who knows. Running a query on Bing, Google, and Yandex for “Insider why search sucks” yielded links to the original paywalled story. The search worked. The reason has more to do with search engine optimization, Google’s prioritization of search-related information, and the Sillycon Valley source.

Why was there no “$” to indicate a paywall? Why was the date of publication not spelled out in the results? I have no idea. Why did one result identify Savanna Durr as the author when the article itself said Adam Rogers was the author?

So why, for this one query and for billions of users, do free, ad-supported Web search engines work so darned well? Free and good enough are the reasons I mention. (Would you believe that some Web search engines have a list of “popular” queries, bots that look at Google results, and workers who tweak the non-Google systems to work sort of like Google? No. Hey, that’s okay with me.)

The cited article “Why Search Sucks” takes the position that search and retrieval is terrible. Believe me. The idea is not a new one. I have been writing about information access for decades. You can check out some of this work on the Information Today Web site or in the assorted monographs about search that I have written. A good example is the three editions of the “Enterprise Search Report.” I have been consistent in my criticism of search. Frankly not much has changed since the days of STAIRS III and the Smart System. Over the decades, bells and whistles have been added, but to find what one wants online requires consistent indexing, individuals familiar with sources and their provenance, systems which allow the user to formulate a precise query, and online systems which do not fiddle the results. None of these characteristics is common today unless you delve into chemical structure search and even that is under siege.

The author of the “Why Search Sucks” article focuses on some use cases. These are:

  • Email search
  • Social media search (Yep, the Zuckbook properties and the soon-to-be Tesla fail whale)
  • Product search (Hello, Amazon, are you there?)
  • Streaming search

The write up provides the author’s or authors’ musings about Google and those who search. The comments are interesting, but none moves the needle.

Stepping back from the write up, I formulated several observations about how it handles search and its suckiness.

First, search is not a single thing. Specific information retrieval systems and methods are needed for certain topics and specific types of content. I referenced chemical structures intentionally because the retrieval systems must accept visual input, numerical input, words, and controlled term names. A quite specific search architecture and user training are required to make certain queries return useful results. Give Inconel a whirl if you have access to a structured search system. The idea that there is a “universal search” is marketing and just simple-minded. Believe it or not, one of today’s Googlers complained vociferously on a conference call with a major investment bank about my characterization of Google and the then almost useless Yahoo search.

Second, the pursuit of “good enough” is endemic among researchers and engineers in academic institutions and search-centric vendors. Good enough means that the limits of user capability, system capacity, budget, and time are balanced. Why not fudge how many relevant results exist for a user looking for a way to convert a link into a dot point on a slide in a super smart and busy executive’s PowerPoint for a luncheon talk tomorrow? Trying to deliver something that works and meets measurable standards of precision and recall (illustrated in the short sketch after these observations) is laughable to some in the information retrieval “space” today.

Third, the hope that “search startups” will deliver non-sucking search is amusing. Smart people have been trying to develop software which delivers on-point results with near real-time information for more than 50 years. The cost and engineering required to implement this type of system are enormous, and the handful of organizations capable of putting up the money, assembling the technical team, and getting the plumbing working is shrinking. Start ups? Baloney.
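
For readers who have not bumped into precision and recall, the two measures reduce to simple set arithmetic over a single query. The document identifiers below are toy data, purely illustrative.

```python
# What the engine returned for one query, and what a human judge marked relevant.
retrieved = {"doc1", "doc2", "doc3", "doc4"}
relevant = {"doc2", "doc4", "doc7"}

true_positives = retrieved & relevant

precision = len(true_positives) / len(retrieved)   # 2/4 = 0.50: half of the hits were useful
recall = len(true_positives) / len(relevant)       # 2/3 ≈ 0.67: one relevant document was missed

print(f"precision={precision:.2f} recall={recall:.2f}")
```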

Net net: I find it interesting that more articles express dismay and surprise that today’s search and retrieval systems suck. After more than half a century of effort, that’s where we are. Fascinating it is that so many self-proclaimed search experts are realizing that their self-positioning might be off by a country mile.

Stephen E Arnold, October 10, 2022

Looria: Promising Content Processing Method Applied to a Reddit Corpus

September 14, 2022

I have seen a number of me-too product search systems. I burned out on product search after a demonstration of the Endeca wine selector and the SLI Systems product search. I thought Google’s Froogle had promise; the GOOG’s Catalog Search was interesting but — well — the PDF thing. There was flirting with other systems, including the Amazon product search. (Someone told me that this service is A9. Yeah, that’s super, but just buy ads and find something vaguely related to what one wants. The margins on ads are slightly better than Kroger’s selling somewhat bland cookies for $4.99 when Walgreen’s (stocked by Kroger) sells the same cookie for $1.00. Nice, right?)

I want to point you to Looria, which provides what appears to be a free and maybe demonstration of its technology. The system ingests some Reddit content. The content is parsed, processed, and presented in an interface which combines some Endeca-like categories, text extraction, some analytics, and stuff like a statement about whether a Reddit comment is positive or negative.

There are about a dozen categories in this system (checked today, September 9, 2022). Categories include Pets, Travel, and other “popular” things about which to comment on Reddit without straying into perilous waters or portals of fascination for teenaged youth.
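
Looria does not publish its pipeline, so the following is only a generic sketch of the kind of processing described above: bucket comments into categories and attach a crude polarity label. The word lists and the Reddit-style records are made up for illustration; a production system would use a trained classifier rather than keyword counting.

```python
import re
from collections import Counter

# Toy Reddit-style records; a real pipeline would pull these from an API or data dump.
comments = [
    {"subreddit": "travel", "text": "The hotel was great and the staff were helpful."},
    {"subreddit": "pets", "text": "This harness is terrible; it broke in a week."},
    {"subreddit": "pets", "text": "Our cat loves the new fountain."},
]

POSITIVE = {"great", "helpful", "love", "loves", "excellent"}
NEGATIVE = {"terrible", "broke", "awful", "hate"}

def naive_sentiment(text: str) -> str:
    """Keyword-count polarity: a stand-in for whatever model a service like Looria runs."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

facet_counts = Counter(c["subreddit"] for c in comments)   # Endeca-style category counts
for c in comments:
    print(c["subreddit"], naive_sentiment(c["text"]))
print(facet_counts)
```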

This is worth checking out; the Looria approach has a number of non-Reddit use cases, and the service looks quite interesting.

Stephen E Arnold, September 14, 2022
