Elastic: Making Improvements

August 27, 2020

Elasticsearch is one of the most popular open-source enterprise search platforms. While Elasticsearch is free for developers to download, Elastic offers subscriptions for customer support and enhanced software. Now the company offers some new capabilities and features, HostReview reveals in, “Elastic Announces a Single, Unified Agent and New Integrations to Bring Speed, Scale, and Simplicity to Users Everywhere.” The press release tells us:

“With this launch, portions of Elastic Workplace Search, part of the Elastic Enterprise Search solution, have been made available as part of the free Basic distribution tier, enabling organizations to build an intuitive internal search experience without impacting their bottom line. Customers can access additional enterprise features, such as single sign-on capabilities and enhanced support, through a paid subscription tier, or can deploy as a managed service on Elastic Cloud. This launch also marks the first major beta milestone for Elastic in delivering comprehensive endpoint security fully integrated into the Elastic Stack, under a unified agent. This includes malware prevention that is provided under the free distribution tier. Elastic users gain third-party validated malware prevention on-premises or in the cloud, on Windows and macOS systems, centrally managed and enabled with one click.”

The upgrades are available across the company’s enterprise search, observability, and security solutions as well as Elastic Stack and Elastic Cloud. (We noted Elastic’s welcome new emphasis on security last year.) See the write-up for the specific updates and features in each area. Elasticsearch underpins operations in thousands of organizations around the world, including the likes of Microsoft, the Mayo Clinic, NASA, and Wikipedia. Founded in 2012, Elastic is based in Silicon Valley. They also happen to be hiring for many locations as of this writing, with quite a few remote (“distributed”) positions available.

Cynthia Murrell, August 27, 2020

Zero Search Results = Useful Information

August 26, 2020

I saw a notice for a conference called “Activate.” Zippy title. What caught my attention was the title of a talk; specifically, “Implementing a Deep Learning Search Engine.” The technology appears to be the open source Solr search system. As you know, dig into Solr and what do you find? Lucene. The heyday of enterprise search has gone. Perhaps another harvest will come? But after the implosion of the promises made by Fulcrum, Verity, Autonomy, Fast, Convera, and Entopia, I am not sure search has credibility.

Don’t get me wrong. Search is a major part of companies; for example, Salesforce bought Diffeo, which was an interesting search system. Elastic is, of course, the commercial firm selling support for the open source Elasticsearch system. There are unusual systems as well; for example, the quirky Qwant, which has some Pertimm inside.

But consider this description of the talk for the Activate conference delivered by two wizards (well, maybe apprentice wizards) from the Lucidworks outfit:

Recent advances in Deep Learning brings us the possibility to get improvements in almost any domain. Search Engines aren’t an exception. Semantic search, visual search, “zero results” queries, recommendations, chatbots etc. – this is just a shortlist of topics that can benefit from Deep Learning based algorithms. But more powerful methods are also more expensive, so they require addressing the variety of scalability challenges. In this talk, we will go through details of how we implement Deep Learning Search Engine at Lucidworks: what kind of techniques we use to train robust and efficient models as well as how we tackle scalability difficulties to get the best query time performance. We will also demo several use-cases of how we leverage semantic search capabilities to tackle such challenges as visual search and “zero results” queries in eCommerce.

Four points:

  1. Deep learning is one of those buzzwords that recyclers of open source technology slap on a utility function like search. What search vendor does not include smart software, semantics, and more Gartner-infused techno babble? Not many.
  2. Short cuts for training smart software for machine learning are indeed important. However, the approach which strikes me as interesting is the one taken by the ever-pragmatic AWS system pushed along by the Bezos bulldozer. AWS wants to make training a matter of buying commodity solutions of data off the shelf. Presumably the approach works like one of those consumer soap tablets I have seen in our local grocery store. Buy, rip, and wash. Bingo! Clean ML. Grubbing in data is time consuming, expensive, and oh-so-easy to get wrong.
  3. The goal of “zero results” in eCommerce or any other domain is not exactly a challenge. Zero results deliver data. I know that an objective system displays only the objects matching my query. Not any longer. Synonym expansion, predictive analytics, clustering, and other numerical processes are going to show me something. Too bad that the “something” is usually not what I want.
  4. For special cases like ecommerce, instead of a list of crazy options, why not ask the user, “Do you want to see what products other people purchased when searching for X?” Choice is sometimes helpful.
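Points 3 and 4 can be sketched in a few lines. This is a minimal illustration, not anyone's shipping product: the function names, the sample catalog, and the `also_purchased` lookup are all hypothetical. The idea is that a strict match returns nothing when nothing matches, and the system offers the user a choice instead of silently expanding the query.

```python
def strict_search(query_terms, documents):
    """Return only documents that contain every query term. No expansion."""
    terms = [t.lower() for t in query_terms]
    return [
        doc for doc in documents
        if all(t in doc.lower().split() for t in terms)
    ]


def search_with_choice(query_terms, documents, also_purchased):
    """Run a strict search; on zero results, offer an opt-in prompt.

    `also_purchased` is a hypothetical lookup from a query string to
    products other shoppers bought after running that query.
    """
    hits = strict_search(query_terms, documents)
    if hits:
        return {"results": hits, "prompt": None}

    key = " ".join(t.lower() for t in query_terms)
    prompt = None
    if key in also_purchased:
        prompt = (
            "Do you want to see what products other people purchased "
            f'when searching for "{key}"?'
        )
    # Zero results are reported as zero results; the user decides what next.
    return {"results": [], "prompt": prompt}


if __name__ == "__main__":
    catalog = ["red garden hose", "blue rain boots"]
    history = {"purple hose": ["red garden hose"]}
    print(search_with_choice(["purple", "hose"], catalog, history))
```

The empty list is the information: nothing in the catalog matches. What to show next is the user's call.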

Is this important? To me, yes. To most others, no.

The problem with making information easy is everywhere today, from individuals who disbelieve verifiable information, like the fact that the earth is spheroid, to the wisdom of demanding no law enforcement. Yeah, that will work.

Some quick facts to put this Lucidworks assertion in perspective. The company has ingested more than $209 million since 2007. I did some advice giving to the first president of Lucidworks, then called Lucid Imagination. I did some advice giving for another semi-lucid president. None of that advice resonated because recycling jargon does not generate sustainable revenues.

The point is that jazzy words and crazy ideas like “zero results are bad” are part of the problem search vendors face. Today’s search systems have drifted from displaying results which match a user’s query to dumping baloney on the display.

It is easier to yip yap with buzzwords than deal with some of the painful realities of information retrieval. Deep learning? Yeah, that will help the person locate that PowerPoint… not.

Stephen E Arnold, August 26, 2020

A Librarian Looks at Google Dorking

August 24, 2020

In order to find solutions for their jobs, many people simply conduct a Google search. Google searching for solutions is practiced by everyone from teachers to executives to software developers. Software developers spend an inordinate amount of their time searching for code libraries and language tutorials. One developer named Alec had the brilliant idea to create “dorking.” What is dorking?

“Use advanced Google Search to find any webpage, emails, info, or secrets

cost: $0

time: 2 minutes

Software engineers have long joked about how much of their job is simply Googling things

Now you can do the same, but for free”

Dorking is free! That is great! How does it work? Dorking is a tip guide using Boolean operators and other Google advanced search options to locate information. Dorking, however, does need a bit of coding knowledge to understand how it works.

Most of these tips can be plugged into a Google search box, such as finding similar sites and finding specific pages that must include a phrase in the title text. Others need that coding knowledge to make them work. For example, finding every email on a Web page requires this:

image

Yep, dorking for everyone.

After a few practice trials, these dorking tips are sure to work for even the most novice of Googlers. It will also make anyone, not just software developers, appear like experts. As a librarian, I ask: why not assign field types and codes, return Boolean logic, and respect existing Google operators? Putting a word in quotes and then getting a result without the word is, how should I frame it? I know: dorky.
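For readers who want to see what the simpler tips amount to, here is a minimal sketch that assembles a query string from a few long-standing Google operators: quoted phrases, `intitle:`, `site:`, and `filetype:`. The helper names `build_query` and `search_url` are made up for illustration, and how Google interprets any operator is Google's decision and may change.

```python
from urllib.parse import quote_plus


def build_query(words=(), exact=None, intitle=None, site=None, filetype=None):
    """Assemble a search string from plain words and a few operators."""
    parts = list(words)
    if exact:
        parts.append(f'"{exact}"')             # exact phrase
    if intitle:
        parts.append(f'intitle:"{intitle}"')   # phrase must appear in the title
    if site:
        parts.append(f"site:{site}")           # restrict to one domain
    if filetype:
        parts.append(f"filetype:{filetype}")   # restrict to a file extension
    return " ".join(parts)


def search_url(query):
    """URL-encode the query for pasting into a browser address bar."""
    return "https://www.google.com/search?q=" + quote_plus(query)


if __name__ == "__main__":
    q = build_query(words=["enterprise", "search"], filetype="ppt")
    print(q)
    print(search_url(q))
```

Nothing here goes beyond typing the operators into the search box by hand; the code only makes the field-like structure of the query explicit.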

Whitney Grace, MLS, August 24, 2020

Surprising Google Data

August 20, 2020

DarkCyber is not sure if these data are accurate. We have had some interesting interactions with NordVPN, and we are skeptical about this outfit. Nevertheless, let’s look beyond a dicey transaction with the NordVPN outfit and focus on the data in “When Looking for a VPN, Chinese Citizens Search for Google.”

The article asserts:

New research by NordVPN reveals that when looking for VPN services on Baidu, the local equivalent of Google, the Chinese are mostly trying to get access to Google – in fact, 40,35% of all VPN service-related searches have to do with Google. YouTube comes second on the list, accounting for 31,58% of all searches. Other research by NordVPN has shown that YouTube holds the most desired restricted content, with 82,7% of Internet users worldwide searching for how to unblock this video sharing platform.

If valid, these data suggest that Google’s market magnetism is powerful. Perhaps a type of quantum search entanglement?

Stephen E Arnold, August 20, 2020

SlideShare: Some Work to Do

August 12, 2020

DarkCyber noted “Scribd Acquires Presentation Sharing Service SlideShare from LinkedIn.” In 2004, one could locate presentations on Google by searching for the extension ppt and its variants. In 2006, SlideShare became available. Then something happened. PowerPoints became more difficult to locate. When an online search pointed to a PowerPoint deck, the content was:

  1. Marketing fluff
  2. Incorrectly rendered with weird typography and wonky graphics
  3. Corrupted files.

What about today? DarkCyber’s most recent foray into the slide deck content wilderness produced zero; for example, SlideShare search produced identical pages of search results. The query retrieved slide decks on unrelated topics. Even worse, a query would result in SlideShare’s sending email upon email pointing to other slide decks. The one characteristic of these related slide decks was/is that they were unrelated to the information we sought.

There are online presentation services. There are open source presentation tools like SoftMaker’s. There is the venerable Keynote which never quite converts a PowerPoint file correctly.

Is there a future in a searchable collection of slide decks? In theory, yes. In reality, the cost of finding, indexing, and making searchable presentations faces some big hurdles; for example:

  1. Many organizations — for example, DARPA — standardize on PDF file formats. These are okay, but indexing these can be an interesting challenge
  2. Some presenters put their talks in the cloud, hoping that an Internet connection will allow their slides to display
  3. The Zoom world puts PowerPoints and other presentation materials on the user’s computer, never to make it into a more publicly accessible repository.

Like the dream of collecting conferences, presentations, and poster sessions, some content remains beyond the reach of researchers and analysts. The desire to get anyone looking for a slide deck to subscribe to a service gives operators of this service a chance to engage in spreadsheet fever. Here’s how this works: If there are X researchers and we get Y percent of them, we can charge Z per year. By substituting guesstimates for the variables, the service becomes a winner.
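Spreadsheet fever is easy to reproduce. The sketch below multiplies the three variables from the paragraph above; the function name and every number in it are invented for illustration and describe no real market.

```python
def projected_revenue(researchers, capture_percent, price_per_year):
    """Annual revenue = X researchers * Y percent captured * Z dollars each."""
    return researchers * capture_percent * price_per_year / 100


if __name__ == "__main__":
    X = 1_000_000   # guesstimate: number of researchers
    Z = 50          # guesstimate: subscription price per year
    # Nudge the capture rate and watch the "winner" appear or vanish.
    for Y in (0.5, 2, 10):
        revenue = projected_revenue(X, Y, Z)
        print(f"{Y}% capture -> ${revenue:,.0f} per year")
```

The arithmetic is sound. The inputs are the problem: a small change in the guessed capture rate moves the answer by an order of magnitude.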

The reality is that finding information in slide decks is more difficult today than it was in 2004. Access to information is becoming more difficult. DarkCyber would like to experience a SlideShare with useful content, more effective search and retrieval, and far less one page duplicates of ads for books.

Someday. Maybe?

Stephen E Arnold, August 12, 2020

NetDocuments Employs BA Insight Tech for Enterprise Search

August 10, 2020

For a secure, cloud-based data solution, many law firms, legal departments, and compliance teams turn to NetDocuments. Now the platform has adopted technology from a familiar name to simplify its clients’ access to information. A post at PRWeb reveals, “NetDocuments Introduces NetKnowledge Enterprise Search Powered by BA Insight.” We find it interesting that the 16-year-old BA Insight is licensing its askable-knowledge system to create the new tool, NetKnowledge. The press release describes the system’s advantages:

“Eliminate Downloading and Indexing Data for Search: No longer does content within NetDocuments need to be downloaded and indexed to be part of an organization’s enterprise search. Simply search within the NetDocuments platform, and NetKnowledge will find relevant data–along with information from other sources —and present it to users.

“Enforce Access Controls on Sensitive Information: Sensitive information may need to be restricted to certain individuals, but that data also needs to be available to others via enterprise search. NetKnowledge respects data restriction policies at the source and will only present data to individuals with proper access rights.

“Manage Large and Disparate Data Sets Across the Organization: NetKnowledge helps organizations bring all its data together to form a single source of truth, so users do not have to perform multiple searches in different places to get the information they need.”

Founded in 2004, BA Insight is based in Boston, Massachusetts. The company is dedicated to making information easier to find for organizations of all stripes. NetDocuments is headquartered in Lehi, Utah. The company was founded in 1999 and acquired by Clearlake Capital Group in 2017.

Cynthia Murrell, August 10, 2020

Search Engines: Plumbing Becomes a Thing Again

August 10, 2020

Two search related items.

The first is Hndex. If you want to locate articles posted to HackerNews, a tech-oriented headline aggregation site, you have an option. This is an example of what might be labeled a “site specific search” solution: One site, search it. Navigate to https://hndex.org and plug in a search term. We entered a query for “enterprise search” and retrieved on point results. The comments are available; however, these are not indexed. Click the “cached” button, and you can view the original article. Click the “comments” button and you can view the comments. HackerNews provides its own search service, which is weirdly located at the bottom of the page. DarkCyber will reserve further comments until we have experimented with the system for a few days.

The second is Infinity Search, another metasearch engine positioned as a free Web search system. DarkCyber finds metasearch engines interesting, but these often pretend to be running their own crawlers. To Infinity Search’s credit the company states:

When you search for something on our site, we take the results from other search engines and our own indexes, organize it, and display it directly to you without logging any information about you.

Metasearch systems have to deduplicate results lists and find a way to remain in the good graces of companies running primary Web crawlers. Disclaimer: My son worked for Vivisimo (now the heart and soul of one of IBM’s marketing confections). He has moved to other adventures, but I remember our talks about the issues metasearch presents. For example, latency, screwed up query interpolation, and wonky deduplication which deduplicates useful results out of the results list. I think Vivisimo lives on in Yippy.com, but I am not a fan of metasearch systems which recycle others’ indexes and remain vulnerable to partners who pull out of deals, thus putting a dent in results.
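The deduplication step can be sketched as follows. This is a toy, not a description of how Infinity Search, Vivisimo, or any other vendor does it: the `normalize` rules and the "keep the best rank" merge policy are assumptions chosen for illustration.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a URL to a comparison key: lowercase host without 'www.',
    path without a trailing slash, and the query string kept as-is."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    query = "?" + parts.query if parts.query else ""
    return host + path + query


def merge_results(result_lists):
    """Merge ranked (url, title) lists from several engines.

    For each normalized URL, keep the entry with the best (lowest) rank,
    then order the survivors by that rank.
    """
    best = {}
    for results in result_lists:
        for rank, (url, title) in enumerate(results):
            key = normalize(url)
            if key not in best or rank < best[key][0]:
                best[key] = (rank, url, title)
    ordered = sorted(best.values(), key=lambda entry: entry[0])
    return [(url, title) for _, url, title in ordered]


if __name__ == "__main__":
    engine_a = [("https://www.example.com/a/", "A"), ("http://example.org/b", "B")]
    engine_b = [("http://example.com/a", "A again"), ("https://example.net/c", "C")]
    for url, title in merge_results([engine_a, engine_b]):
        print(title, url)
```

The query string is kept in the key on purpose. Stripping it would merge pages that differ only by a parameter, which is one way useful results get deduplicated out of the list.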

Stephen E Arnold, August 10, 2020

Why Enterprise Search Remains a Problem

August 8, 2020

I read “Let’s Build a Full-Text Search Engine.” The write up does a reasonable job of walking through the basics of building a search engine. The focus is full text search, but I think in terms of an organization and its content. As a result, the system summarized will not handle video, images, and other types of content. The code examples are clear, and I liked the straightforward approach.

However, there is a potential bump in the information superhighway. Here’s a Venn diagram from the article. Notice the work you have to do to find documents with small, wild cat?

image

If I search for “smith”, “order”, “tile” — I want only the documents in which the Boolean AND is applied by default. I want Smith’s orders for tile. I have to call the person. I don’t want to go on a scavenger hunt. (There are other minor nits too, but the AND’ing thing is huge to me.)
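What AND-by-default means in practice can be shown with a toy inverted index. This is a minimal sketch under my own assumptions, not the code from the write up: the tokenizer is a bare regular expression, there is no stemming or stop word handling, and the sample documents are invented.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search_and(index, query):
    """Return only the documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


if __name__ == "__main__":
    docs = {
        1: "Smith order for tile",
        2: "Smith invoice",
        3: "Tile order from Jones",
    }
    index = build_index(docs)
    print(search_and(index, "smith order tile"))
```

With no stemming, “orders” will not match “order” in this sketch. A missing term yields an empty set, which is the answer I want: no scavenger hunt.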

Stephen E Arnold, August 6, 2020

Do Not Gamble. Own the Casino. The Google Way?

August 3, 2020

I read “Google’s Top Search Result?” What a surprise? No, not the fact that Google presents Google-centric results at the top of mobile search results. The surprise is that until July 28, 2020, no one knew that Google’s magical algorithmic, math-is-objective, super duper relevance scooper got more Google goodies than any other “content producer.” Amazing.

In the good old days of big desktop anchor computers and monitors, there was screen real estate. Google filled the screen with objective results and, of course, some advertisements.

That was then; this is now. Mobile screens are mostly squint-generators. In order to be seen and generate clicks, the Google has to work overtime.

The challenges include:

  • Traffic, eyeballs, and individuals who will go ga-ga over that which is Googley.
  • Sizzle that will burn the greedy fingertips of competitors who want to be placed front and center.
  • Useful information for consumers. Yep, what Google displays eliminates the need to think. Advertisers who want to be listed on a Google Map. Something can be worked out.

A number of organizations have groused about Google’s magical algorithmic, math-is-objective, super duper relevance scooper.

What’s fascinating is that it has taken two decades for some people to understand the wisdom embedded in the observation, “Own the casino.”

Pretty good advice and someone at the GOOG took it.

Stephen E Arnold, August 3, 2020

Search and Predicting Behavior

August 3, 2020

DarkCyber is interested in predictive analytics. Bayesian and other “statistical methods” are a go-to technique, and they find their way into many of the smart software systems. Developers rarely explain that systems share many features and functions. Marketers, usually kept in the dark like mushrooms, are free to formulate an interesting assertion or two.

I read “Google Searches During Pandemic Hint at Future Increase in Suicide,” and I was not sure about the methodology. Nevertheless, the write up provides some insight into what can be wiggled from Google search data.

Specifically, Columbia University experts have concluded that financial distress is “strongly linked to suicide.”

Okay.

I learned:

The researchers used an algorithm to analyze Google trends data from March 3, 2019, to April 18, 2020, and identify proportional changes over time in searches for 18 terms related to suicide and known suicide risk factors.

What algorithm?

The method is described this way:

The proportion of queries related to depression was slightly higher than the pre-pandemic period, and moderately higher for panic attack.

Perhaps the researchers looked at the number of searches and noted the increase? So comparing raw numbers? Tenure tracks and grants await! Because that leap between search and future behavior…
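The gap between raw numbers and proportions is worth a quick sketch. The function names and all of the figures below are invented for illustration; they are not the researchers' data or their algorithm, which the write up does not describe.

```python
def raw_change(before_count, after_count):
    """Difference in the raw number of searches for a term."""
    return after_count - before_count


def proportional_change(before_count, before_total, after_count, after_total):
    """Relative change in a term's share of all searches between two periods."""
    before_share = before_count / before_total
    after_share = after_count / after_total
    return (after_share - before_share) / before_share


if __name__ == "__main__":
    # Made-up numbers: the term is searched more often, but overall
    # search volume grew faster, so the term's share actually fell.
    print(raw_change(1_000, 1_200))
    print(proportional_change(1_000, 100_000, 1_200, 150_000))
```

In this made-up case the raw count goes up while the proportion goes down. Which of the two a study reports matters a great deal, and either one is still a long way from predicting behavior.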

Stephen E Arnold, August 3, 2020
