Search System Bayard

November 1, 2019

Looking for an open source search and retrieval tool? Bayard is written in Rust and built on top of Tantivy (Lucene?). Point your browser to GitHub and grab the files. The README file highlights these features:

  • Full-text search/indexing
  • Index replication
  • Bringing up a cluster
  • Command line interface

DarkCyber has not tested it, but a journalist contacted us on October 31, 2019, and was interested in the future of search. I pointed out that there are free and open source options.

What people want to buy, however, is something that does not alienate two thirds of the search system’s users the first day the software is deployed.

Surprised? You may not know what you don’t know, but, gentle reader, you are an exception.

Stephen E Arnold, November 1, 2019

Metasearch Engine Changes Hands

October 28, 2019

In 1998, a Wall Street professional founded Ixquick. As I recall, the developer was David Bodnick. As with other search developers, selling was better than pumping ads and trying to compete in the world of the digital library card catalog. Ixquick’s buyer was Surfboard Holding BV.

Metasearch engines like DuckDuckGo send queries to other search engines and present a list of semi-deduplicated results. Dogpile and Vivisimo were other metasearch engines. The Ixquick twist was privacy. I don’t want to go into the notion of privacy in an ad supported search system in this item.
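
For readers who want the mechanics, here is a minimal sketch of the metasearch idea: fan a query out to several engines, then merge the results and drop duplicates by normalized URL. The two engine functions and their canned results are hypothetical stand-ins, not real APIs, and production systems rank and de-duplicate in more elaborate ways than this.

```python
from urllib.parse import urlsplit

# Hypothetical stand-ins for upstream engines; a real metasearch system
# would call each engine over the network instead of returning canned data.
def engine_a(query):
    return [
        {"title": "Ixquick history", "url": "https://www.example.com/ixquick/"},
        {"title": "Metasearch basics", "url": "http://example.org/metasearch"},
    ]

def engine_b(query):
    return [
        {"title": "Metasearch basics (mirror)", "url": "https://example.org/metasearch/"},
        {"title": "Privacy and search", "url": "https://example.net/privacy"},
    ]

def normalize(url):
    """Reduce a URL to host + path so trivial variants compare equal."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")

def metasearch(query, engines):
    """Send the query to every engine and keep the first hit per normalized URL."""
    seen = set()
    merged = []
    for engine in engines:
        for hit in engine(query):
            key = normalize(hit["url"])
            if key not in seen:
                seen.add(key)
                merged.append(hit)
    return merged

results = metasearch("metasearch privacy", [engine_a, engine_b])
for hit in results:
    print(hit["title"], "-", hit["url"])
```

The normalization ignores the URL scheme and a trailing slash, which is why the two "Metasearch basics" hits collapse into one. Near-duplicates with different paths slip through, hence "semi-deduplicated."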

DarkCyber noted a Reddit post that reveals System1 (Privacy One Group) now owns the service. Note the word privacy. As I said, I am not going to explain for the umpteenth time why free Web search or free services of any type may have a different notion of privacy than someone in Harrod’s Creek, Kentucky.

Should I explain the issues related to metasearch systems? Nope. Just like the privacy thing. No one understands and no one cares.

Stephen E Arnold, October 28, 2019

Google NLP Search: Fortune Loves It. Simple Queries Reveal Shortcomings

October 25, 2019

I read “Google Says Its Latest Tech Tweak Provides Better Search Results. Here’s How.” DarkCyber enjoys Fortune Magazine’s how-to explanations. They are just. So. Wonderful.

We learned:

Google’s goal is to make it easier for users, who often don’t know how to enter queries for the information they want. Since its search engine debuted in 1997, Google has focused on getting its technology to better understand natural language to produce relevant results even in cases where users enter a misspelled word or a query that is off target. With the latest change, Google will also now consider the sequential order in which words are placed in a search, instead of returning results based on a “mixed bag” of keywords.

Yes, but what about tuning search to advertising? What about ignoring bound phrases? What about Boolean logic? What about words like “terminal” which have different, often difficult to disambiguate meanings?

Fortune jumps over these questions.

Try this query on the “new” Google:

What companies compete with Subsentio?

What about this one?

Amazon law enforcement products

Not what I had in mind. I was thinking about QLDB and digital currency deanonymization.

Sorry, Google. Not yet. Personalization does not work either, by the way. (You know. Examine the search history, etc. etc.)

Fortune, check out where Google’s ad revenue comes from. Just a small clue to put Google search in its context.

Stephen E Arnold, October 25, 2019

Dumais on Search: Bell Labs Roots Are Thriving

October 23, 2019

We just love a genuine Search guru, and Dr. Susan Dumais is one of the best. The illustrious Dr. Dumais is now a Microsoft Technical Fellow and Deputy Lab Director of MSR AI. If you wanted to know the history of information retrieval, she would be the one to tell it, and now you can hear her do so, courtesy of the Microsoft Research Podcast. Both the 38-minute podcast itself and a transcript are posted at “HCI, IR and the Search for Better Search with Dr. Susan Dumais.” The good doctor describes what motivates her in her work:

“I think there are two commonalities and themes in my work. One is topical. So, as you said, I’m really interested in understanding problems from a very user-centric point-of-view. I care a lot about people, their motivations, the problems they have. I also care about solving those problems with new algorithms, new techniques and so on. So, a lot of my work involves this intersection of people and technology, thinking about how work practices co-evolve with new technological developments. And so thematically, that’s an area that I really like. I like this ability to go back and forth between understanding people, how they think, how they reason, how they learn, how they find information, and finding solutions that work for them. In the end, if something doesn’t work for people, it doesn’t work. In addition to topically, I approach problems in a way that is motivated, oftentimes, by things that I find frustrating. We may talk a little bit later about my work in latent semantic indexing, but that grew out of a frustration with trying to learn the Unix operating system. Work I’ve done on email spam, grew out of a frustration in mitigating the vast amount of junk that I was getting. So, I tend to be motivated by problems that I have now, or that I anticipate that our customers, and people will have in general, given the emerging technology trends.”

She and host Gretchen Huizinga go on to discuss the evolution of search technology over the last twenty years, beginning with the first HTML page crawlers, which served but a couple thousand queries per day. They also cover Dumais’ work over the years to build bridges, provide context in search, and bring changing content into the equation. We hope you will check out the intriguing and informative interview for yourself, dear reader.

Cynthia Murrell, October 23, 2019

Algolia: Cash Funding Hits $184 Million

October 15, 2019

Exalead was sucked into Dassault Systèmes. Then former Exaleaders abandoned ship. Algolia benefited from some Exalead experience. But unlike Exalead, Algolia embraced venture funding with cash provided by Accel, Point Nine Capital, Storm Ventures, and Y Combinator, among others.

DarkCyber noted “Algolia Finds $110M from Accel and Salesforce for Its Search-As-a-Service, Used by Slack, Twitch and 8K Others.” The write up reports that the company has “closed a Series C of $110 million, money that it plans to invest in R&D around its search technology, including doubling down on voice, and further global expansion in Europe, North America and Asia Pacific.”

The write up adds:

Having Salesforce as a strategic backer in this round is notable: the CRM giant currently does not have a native search product in its wide range of cloud-based services for enterprises, instead opting for endorsed integrations with third parties, such as Algolia competitor Coveo. The plan will be to further integrate with Salesforce although no products to speak of as of yet.

The challenge will be to go where few search and retrieval systems have gone before.

Some people have forgotten the disappointments and questionable financial tricks promising search vendors delivered to stakeholders and customers.

With venture firms looking for winners, returns of 20 percent will not deliver what the sources of the funds expect. The good old days of a 17X return may have cooled, but generating an 8X or 12X return may be a challenge.

Why?

In the course of researching and writing the enterprise search report from 2003 to 2006, and in our subsequent work, several “themes” or “learnings” surfaced:

  1. Good enough search is now the order of the day; that is, an organization-wide search system does not meet the needs of many operating units. Examples range from the legal department to research and development to engineering and the drawings plus data embedded in product manufacturing systems to information under security umbrellas with real time data and video content objects. Therefore, the “one solution” approach dissipates like morning fog.
  2. Utility search from outfits like Amazon is “good enough.” This means that a developer using Amazon blockchain services and workflow tools may use the search functions available from Amazon. Maybe Amazon will buy Algolia, but for the foreseeable future, search is a tag-along function, not a driver of the big money apps which Amazon is aiming toward.
  3. Search, regardless of vendor, must spend significant sums to enrich the functions of the system. Natural language processing, predictive analytics, entity extraction, and other desired functions are moving targets. Adding and tuning these capabilities becomes expensive. And if the experiences of Autonomy and Fast Search & Transfer are representative, the costs become difficult to control.

DarkCyber hopes that Algolia can adapt to these research factoids. If not, search and retrieval may be rushing toward a disconnect between revenues, sustainable profits, and investor expectations.

The wheel of fortune is spinning. Where will it stop? On a winner or a loser? This is a difficult question to answer, and one which Attivio, BA-Insight, Coveo, Elastic, IBM Watson, Lucidworks, Microsoft, Sinequa, Voyager Search, and others have been trying to answer with millions of dollars, thousands of engineering hours, and massive investments in marketing. I am not including the search vendors positioned as policeware and intelware; for example, BAE NetReveal, Diffeo, LookingGlass, Palantir Technologies, and Shadowdragon, among others.

Worth monitoring the trajectory of Algolia.

Stephen E Arnold, October 15, 2019

Amazon: Elasticsearch Bounced and Squished

October 14, 2019

DarkCyber noted “AWS Elasticsearch: A Fundamentally-Flawed Offering.” The write up criticizes Amazon’s implementation of Elasticsearch. Amazon hired some folks from Lucidworks a few years ago. But under the covers, Lucene thrums along within Amazon and a large number of other search-and-retrieval companies, including those which present themselves as policeware. There are many reasons: [a] good enough, [b] no one company fixes the bugs, [c] good enough, [d] comparatively cheap, [e] good enough. Oh, one other point: Not under the control of one company, unlike good, old-fashioned solutions such as STAIRS III, Fulcrum (remember that?), or Delphes (the francophone folks).

This particular write up is unlikely to earn a gold star from Amazon’s internal team. The Spun.io essay states:

I’m currently working on a large logging project that was initially implemented using AWS Elasticsearch. Having worked with large-scale mainline Elasticsearch clusters for several years, I’m absolutely stunned at how poor Amazon’s implementation is and I can’t fathom why they’re unable to fix or at least improve it.

I think the tip off is the phrase “how poor Amazon’s implementation is…”

The section Amazon Elasticsearch Operation provides some color to make vivid the author’s viewpoint; for example:

On Amazon, if a single node in your Elasticsearch cluster runs out of space, the entire cluster stops ingesting data, full stop. Amazon’s solution to this is to have users go through a nightmare process of periodically changing the shard counts in their index templates and then reindexing their existing data into new indices, deleting the previous indices, and then reindexing the data again to the previous index name if necessary. This should be wholly unnecessary, is computationally expensive, and requires that a raw copy of the ingested data be stored along with the parsed record because the raw copy will need to be parsed again to be reindexed. Of course, this also doubles the storage required for “normal” operation on AWS. [Emphasis in the original essay.]
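
For readers who have not lived through it, the routine the essay describes looks roughly like the sketch below. It only builds the sequence of REST requests and sends nothing: create a new index with a different shard count, copy the data with Elasticsearch’s `_reindex` endpoint, delete the old index, and optionally copy everything back under the old name. The index names and shard count are made up for illustration. Setting the shard count at index creation rather than in an index template is a simplification, and the exact steps on AWS may differ from this outline.

```python
import json

def reshard_plan(index, new_shards, keep_name=True):
    """Return the (method, path, body) requests for a reshard-by-reindex cycle."""
    tmp = f"{index}-resharded"  # hypothetical temporary index name

    def create(name):
        # Shard count is fixed at creation time, hence the need for a new index.
        return ("PUT", f"/{name}", {"settings": {"index": {"number_of_shards": new_shards}}})

    def reindex(src, dest):
        return ("POST", "/_reindex", {"source": {"index": src}, "dest": {"index": dest}})

    plan = [create(tmp), reindex(index, tmp), ("DELETE", f"/{index}", None)]
    if keep_name:
        # Copy the data a second time so it lives under the original name again.
        plan += [create(index), reindex(tmp, index), ("DELETE", f"/{tmp}", None)]
    return plan

plan = reshard_plan("logs-2019.10", new_shards=6)
for method, path, body in plan:
    print(method, path, json.dumps(body) if body else "")
```

Six requests and two full copies of the data just to keep the same index name: that is the computational expense the essay complains about.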

The wrap up for the essay is clear from this passage:

I cannot fathom how Amazon decided to ship something so broken, and how they haven’t been able to improve the situation after over two years.

DarkCyber’s team formulated several observations. Let’s look at these in the form of questions and trust that some young sprites will answer them:

  1. Will Amazon make its version of Elasticsearch proprietary?
  2. Are these changes designed to “pull” developers deeper into the AWS platform, making departure more difficult or impossible for some implementations?
  3. Are the components the author of the essay finds objectionable designed to generate more revenue for Amazon?

Stephen E Arnold, October 14, 2019

Real Life Q and A for Information Access Allegedly Arrives

October 14, 2019

DarkCyber noted “Promethium Tool Taps Natural Language Processing for Analytics.” The write up, which may be marketing oriented, asserts:

software, called Data Navigation System, was designed to enable non-technical users to make complex SQL requests using plain human language and ease the delivery of data.

The company developing the system, Promethium, founded in 2018, may have delivered what users have long wanted: Ask the computer a question and get a usable, actionable answer. If the write up is accurate, Promethium has achieved with $2.5 million in funding a function that many firms have pursued.

The article reports:

After users ask a question, Promethium locates the data, demonstrates how it should be assembled, automatically generates the SQL statement to get the correct data and executes the query. The queries run across all databases, data lakes and warehouses to draw actionable knowledge from multiple data sources. Simultaneously, Promethium ensures that data is complete while identifying duplications and providing lineage to confirm insights. Data Navigation System is offered as SaaS in the public cloud, in the customer’s virtual private cloud or as an on-premises option.
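
The write up does not say how the system works under the hood, so the following is only a toy illustration of the general question-to-SQL idea, not Promethium’s method. It assumes a hypothetical `orders` table and a single hard-coded question pattern; the table, the columns, and the pattern are invented for the example.

```python
import re
import sqlite3

# Hypothetical data: a tiny in-memory table standing in for a data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 100.0), ("west", 250.0), ("east", 50.0)],
)

# One hard-coded pattern: "total <column> by <column>".
PATTERN = re.compile(r"total (\w+) by (\w+)", re.IGNORECASE)
ALLOWED_COLUMNS = {"region", "amount"}

def question_to_sql(question):
    """Translate a narrow class of plain-language questions into SQL."""
    match = PATTERN.search(question)
    if not match:
        raise ValueError("question not understood")
    measure, group = match.group(1).lower(), match.group(2).lower()
    # Only whitelisted column names are interpolated into the statement.
    if measure not in ALLOWED_COLUMNS or group not in ALLOWED_COLUMNS:
        raise ValueError("unknown column")
    return f"SELECT {group}, SUM({measure}) FROM orders GROUP BY {group} ORDER BY {group}"

sql = question_to_sql("What is the total amount by region?")
rows = conn.execute(sql).fetchall()
print(sql)
print(rows)
```

The hard part is everything this toy skips: finding the right tables across many sources, handling questions that fit no pattern, and confirming that the answer is complete.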

More information is available at the firm’s Web site.

Stephen E Arnold, October 14, 2019

A List of Enterprise Search Vendors

October 7, 2019

DarkCyber does not follow the enterprise search sector. In fact, two of the flagships from the 2000s found themselves caught in embarrassing financial missteps. Why? It certainly suggests that making big bucks from a search and retrieval service is difficult.

We came across a Web site called Trust Radius. This site has a section devoted to enterprise search. What we found interesting is that the site lists what seem to be the key players in the sector today. With most LE and intel policeware platforms relying on open source search like Lucene, DarkCyber was quite surprised with the line up of systems and the information provided by Trust Radius.

Here’s the list of vendors in alphabetical order, a method of presenting information which is not in favor with some whiz kids:

3RDi Search

Aderant Handshake (knowledge management for law firms)

Agree Ya Site Administrator

Algolia

Amazon Cloud Search (Lucene)

Apache Lucene

Apache Solr

Expert Systems Cogito Discover

Constructor.io Search

Coveo

Customer Matrix (customer support)

Dassault Systems Exalead (Exalead)

Dieselpoint

Elasticsearch (Elastic)

Fabasoft Mindbreeze

Fabasoft Mindbreeze Inspire

Google Search Appliance (discontinued)

IBM Watson (once Omnifind)

IBM Watson Discovery for Salesforce

IBM Watson Explorer

IManage Insight (Interwoven, Autonomy, HP, now a standalone)

Inbenta Enterprise Search

Lookeen Desktop Search (listed as Enterprise Search however)

Lucidworks Fusion ($100 million in funding)

Maana

Microfocus IDOL (Autonomy to HP to HPE to Microfocus)

Microsoft Azure (Fast Search & Transfer)

Microsoft Bing Search

Perceptive Search (ISYS Search Software to Lexmark to Hyland)

Rocket NXT Enterprise Search (Aerotext)

Rockset

Searchify

Search Spring (product search)

Search Tap

Search Unify

Sinequa

SLI Systems (e commerce)

Swiftype

Synacor Video Search & Discovery

TeraText Searchable Archive for Files and Email (SAIC)

Zakta

What DarkCyber finds interesting is the omission of outfits like Oracle Endeca, Antidot, and Blossom. Also, in this listing of 41 “search systems,” there are multiple enterprise search products from single companies like IBM and Microsoft. There are also e-commerce search systems and systems which do not handle enterprise content because the service supports desktops. There are two “no longer around” products and a weird blend of search utilities with text processing features. In short, this list is illustrative of the chaos, confusion, and craziness that leads some information technology professionals to buy a solution that just delivers keyword search and some optional features.

DarkCyber believes that Amazon’s approach is likely to gain traction. That’s bad news for most of the companies on this list, particularly search vendors who manage to confuse individuals or the smart software used to create this list at Trust Radius.

It seems that the message from this list is that search is a bit of a dog’s breakfast—just as it has been for decades.

Stephen E Arnold, October 7, 2019

Today in Subjective Search: What Are You Not Allowed to Know

October 2, 2019

When you review information, is that information comprehensive, complete, and objectively displayed?

No.

No.

No.

Let’s look at three examples.

First, Boris Johnson allegedly uses certain words to skew search results. This is the allegation of Remoaning Myrtle. You can find the assertion at this link. Does this mean that wordsmithing now fiddles search results on Bing, Google, and Yandex? Interesting question about an interesting person’s ability to use language as a weapon.

Second, Twitter has introduced new filters. “Twitter Rolls Out Filter for Potentially Offensive DMs” reports:

Twitter is quickly acting on plans to filter potentially offensive direct messages. It’s rolling out the filter to all users on Android, iOS and the web. As during the test, there isn’t much mystery to how this works. If a message contains questionable language or is likely spam, it’ll be tucked away in an “additional messages” folder.

Third, “YouTube Moderation Bots Punish Videos Tagged as ‘Gay’ or ‘Lesbian,’ Study Finds” bluntly asserts:

A new investigation from a coalition of YouTube creators and researchers is accusing YouTube of relying on a system of “bigoted bots” to determine whether certain content should be demonetized, specifically LGBTQ videos.

DarkCyber finds it interesting that shaping or alleged shaping of search results is now garnering attention. Researchers looking for historical information may discover that “old” information is either unindexed or not online. Investigators and analysts looking for facts like Cisco’s acquisition of certain firms must manually review SEC documents. Individuals looking for information about CMS contractors conducting medical fraud may find that these data are very, very difficult to locate.

Why?

Reasons vary.

It is important to note that those who assert that “my team consists of expert online researchers” may be fooling themselves.

Stephen E Arnold, October 2, 2019

Dumbing Down Search and Making More Money?

September 27, 2019

Google makes changes that benefit Google. Forbes Magazine, the capitalist tool, however, does not understand this simple fact about the world’s largest online advertising outfit.

“Google Makes It More Difficult To Find Old Images” points out that the ad giant made it more difficult for 99 percent of Google Image search users to locate “old” images. Most of Google’s advanced search features don’t get much click love.

As a result, why make the feature available? The benefit of dumbing down Google Image Search is related to several tangential factors; my thoughts are:

  • Legal hassles related to making images findable
  • Cost reduction. If content is not searched, why spend money verifying links and storing pointers?
  • Ads. Clicking a Web page for an image can display a current ad. Clicking an old picture like the one below is unlikely to provide an ad payout for the GOOG.

image

There are some options:

  • Use Google search operators like those on this list
  • Include a date in the image search string; for example, IBM mainframe 1964
  • Use the Google advanced image search form which is at this link.
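
The second option is easy to script. The sketch below just appends a year to the search terms and URL-encodes the result. The helper name is made up, and the `tbm=isch` parameter is an assumption based on how Google image search URLs have commonly looked, so it may change.

```python
from urllib.parse import urlencode

def dated_image_query(terms, year):
    """Build an image search URL with a year appended to the query terms."""
    query = f"{terms} {year}"
    # tbm=isch is assumed here to select image search; treat it as illustrative.
    return "https://www.google.com/search?" + urlencode({"q": query, "tbm": "isch"})

url = dated_image_query("IBM mainframe", 1964)
print(url)
```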

What’s Forbes’ take?

I reached out to Google for comment on this story. I have yet to hear back and will update this article if I do.

Yep, the capitalist tool.

Stephen E Arnold, September 27, 2019
