Search Vendor Comparison
January 30, 2020
The Finland-based AddSearch published a comparison of its search and retrieval service with two competitors: Algolia and Swiftype. Each of these is a for-fee solution. The write up appeared in 2019, but DarkCyber wanted to call attention to the article because it does a good job of outlining some of the main characteristics of commercial search solutions. You can locate the article by Anna Pogrebniak in the AddSearch Blog.
Kenny Toth, January 30, 2020
Verizon Tries Its Hand at… Search
January 29, 2020
Verizon is hopping on the privacy bandwagon, but how do we know we can trust it? ArsTechnica reports, “Verizon Offers No-Tracking Search Engine, Promises to Protect Your Privacy.” The company’s new platform is called OneSearch. It gets its results from Bing, but imposes privacy features on them. It still runs contextual ads, of course, but ones that rely on keywords, not tracking. This search service is in addition to, not replacing, the existing Yahoo search engine. (Verizon bought Yahoo in 2017.) Yahoo search also gets its results from Bing, but makes no promises about privacy.
Here are the promises we get from OneSearch: no cookie tracking, retargeting, or personal profiling; no sharing of personal data with advertisers; and no storing search histories. When users click on the “Advanced Privacy Mode” button, the platform encrypts their search terms and URL and sets the results link to expire in an hour. However, writer Jon Brodkin reports:
“The Verizon search engine homepage says, ‘OneSearch doesn’t use cookies. Period.’ Chrome detected that OneSearch did set one cookie on my computer, so that statement seems to be exaggerated. The EFF’s Privacy Badger detected a potential tracker that’s tied to the u.yimg.com domain, indicating a connection between OneSearch and Yahoo’s image service. What Verizon apparently means is that it doesn’t use cookies to build ad-targeting profiles. Verizon uses your IP address to determine your ‘general location,’ helping it deliver location-specific search results. Verizon said that ‘We only ever infer location data up to the city level of specificity for search localization purposes.’”
The write-up also lists in more detail the steps OneSearch performs for each query. We are also reminded of the less-than-stellar performance of Verizon’s media division thus far. See the article for those details.
So, can we trust Verizon to deliver on its privacy vows? Brodkin notes several events that may give us pause: In 2016, the company paid a $1.35 million fine and agreed to change its ways over “supercookies”; it, along with T-Mobile, Sprint, and AT&T, was caught selling mobile customers’ location data to third-party brokers; and Verizon regularly opposes government regulations that would require carriers to protect customer privacy. The write-up suggests DuckDuckGo and Startpage as alternatives for anyone hesitant to take Verizon at its word.
Cynthia Murrell, January 29, 2020
Calling Out Search: Too Little, Too Late
January 20, 2020
The write up’s title is going to be censored in DarkCyber. We are not shrinking violets, but we think that stop word lists do exist. Problem? Buzz your favorite ad supported search vendor and voice your complaints.
The write up is “How Is Search So #%&! Bad? A ‘Case Study’.” The author appears to be frustrated with the outputs of ad supported and probably other types of seemingly “free” search systems providing links to Web content. This is what some people call “open source intelligence online”. There are other information resources available, but most of the consumer oriented, eyeball hungry vendors ignore i2p, forums with minimal traffic, what some experts call the Dark Web, and even some government information services. How many people pay any attention to the US National Archives? Be honest in your assessment.
Here’s a passage we noted:
Google Search is ridiculously, utterly bad.
This seems clear.
The write up provides some examples, but I anticipate that some other people have found that the connection between a user’s query and the Google search outputs is tenuous at best. One criticism DarkCyber has of the write up is that it mentions Google, shifts to Reddit, and then to metadata. The key point for us was the focus on time.
Now time is an interesting issue in indexing. Years ago I did a research project on the “meaning” of “real time” in online services. I think my research team identified five or six different types of time. I will skip the nuances we identified and focus only on the date or freshness of an item in a results list.
Let’s be sympathetic to the indexing company. Here’s why:
First, many documents do not provide an explicit date in the text of the article. In Beyond Search and DarkCyber, you will notice that we provide the author’s name and the day and date on which the article was posted. Many write ups on the open Web don’t bother. In fact, there will be no easy way to determine when the author posted the story from the content displayed in a browser. Don’t you love news releases which do not include a date, time, and time zone?
Second, many write ups include dates and times in the text of an article. For example, the reference to Day 2 of the recent CES trade show may include the explicit date January 8, 2020, for a product announcement. The approach is similar to using CES without spelling out “Consumer Electronics Show.” But, hey, these folks are busy, and everyone in the know understands the what and when, right?
Third, auto-assigned dates by operating systems may be “correct” when a file or content object is created. But what happens when a file or drive is restored? The original dates and metadata may be replaced with the time stamp of the restore. What about date last accessed or date last changed? Too much detail. Yada yada.
Fourth, time sorting is possible. Google invested in Recorded Future (now part of Insight). I had heard that someone at the GOOG thought Recorded Future’s time functions were nifty. Guess not. Google did not implement more sophisticated time functions in any service other than those related to advertising. For the great unwashed masses of those who don’t work at Google, tough luck I suppose.
Fifth, when was the content first indexed? More significantly, when was the content last updated? Important? Maybe, gentle reader. Maybe.
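To make the dating problem concrete, here is a minimal sketch, assuming a plain-text document and an optional file system timestamp. The function name `guess_date`, the `DatedGuess` record, and the source labels are hypothetical illustrations, not how any search vendor actually assigns dates. The point is that every candidate date carries a caveat about where it came from.

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple, Optional

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Matches dates written out like "January 8, 2020".
_DATE_RE = re.compile(r"\b(" + "|".join(MONTHS) + r") (\d{1,2}), (\d{4})\b")


class DatedGuess(NamedTuple):
    when: Optional[datetime]  # best available date, or None
    source: str               # where the date came from, with its caveat


def guess_date(text: str, file_mtime: Optional[float] = None) -> DatedGuess:
    """Return the first usable date signal plus a note on its reliability."""
    for match in _DATE_RE.finditer(text):
        month = MONTHS.index(match.group(1)) + 1
        try:
            when = datetime(int(match.group(3)), month, int(match.group(2)))
        except ValueError:
            continue  # e.g. "February 31, 2020" is not a real date
        return DatedGuess(when, "date mentioned in text (may be an event date, "
                                "not the posting date)")
    if file_mtime is not None:
        when = datetime.fromtimestamp(file_mtime, tz=timezone.utc)
        return DatedGuess(when, "file system timestamp (may reflect a restore)")
    return DatedGuess(None, "unknown")


if __name__ == "__main__":
    print(guess_date("Announced on Day 2, January 8, 2020, at CES."))
    print(guess_date("A news release with no date at all."))
```

Even this toy version cannot say when the item was first indexed or last updated. That information lives with the indexing system, not in the document.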
There are several other conditions as well. For the purposes of a blog post, I want to make clear: The person who is annoyed with search should have been annoyed decades ago. These time problems are not new, and they are persistent.
The author with a penchant for tardy profanity stated:
Part of the issue in this specific case is that they’ve started ignoring settings for displaying results from specific time periods. It’s definitely not the whole issue though, and not something new or specific to phone searches. Now, I’ve always been biased towards the new – books, tech, everything, but I can’t help but feel that a lot of things which were done pretty well before are done worse today. We do have better technology, yet we somehow build inferior solutions with it all too often. Further, if they had the same bias of showing me only recent results I’ll understand it better, but that’s not even the case. And yes, I get that the incentives of users and providers don’t align perfectly, that Google isn’t your friend, etc. But what is DDG’s excuse? As for the Case Study part, and me saying this isn’t simply a rant – I lied, hence the quotation marks in the title. Don’t trust everything you read, especially the goddamn dates on your search results.
The write up omits a few other minor problems with modern search and retrieval systems. Yep, this includes Reddit, LinkedIn, and a bunch of others. Let me provide a few dot points:
- Poorly implemented Boolean search
- Zero information about what’s in an index
- Zero information about what’s excluded from an index and why
- Minimal auto linking to information about an “author” or the “source” of the content
- No data to make a precision or recall calculation possible and reproducible
- No data to make it possible to determine overlap among Web indexes. Analyses must be brute forced. Due to the volatility, latency, and editorial vagaries of ad supported Web search systems, data are mostly suggestive.
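For readers who have not run these numbers, the last two dot points reduce to simple set arithmetic once the data exist. Here is a minimal sketch, assuming one has a list of returned URLs and a hand-built set of relevant URLs; the function names and the sample URLs are made up for illustration. Overlap is computed here as the Jaccard coefficient, which is one common choice and not the only one.

```python
from typing import Iterable, Set


def precision(retrieved: Iterable[str], relevant: Set[str]) -> float:
    """Share of returned items that are relevant."""
    got = set(retrieved)
    return len(got & relevant) / len(got) if got else 0.0


def recall(retrieved: Iterable[str], relevant: Set[str]) -> float:
    """Share of relevant items that were returned."""
    got = set(retrieved)
    return len(got & relevant) / len(relevant) if relevant else 0.0


def overlap(results_a: Iterable[str], results_b: Iterable[str]) -> float:
    """Jaccard overlap between two result sets: shared items / all items."""
    a, b = set(results_a), set(results_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    # Hypothetical data: what one engine returned vs. what a judge marked relevant.
    returned = ["a.example", "b.example", "c.example", "d.example"]
    judged_relevant = {"a.example", "c.example", "e.example"}
    print(precision(returned, judged_relevant))  # 2 of 4 returned are relevant
    print(recall(returned, judged_relevant))     # 2 of 3 relevant were returned
    print(overlap(["a.example", "b.example"], ["b.example", "c.example"]))
```

The arithmetic is trivial. The hard part is the one the dot points name: no vendor publishes the relevant set or the index contents, so the inputs must be brute forced.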
Why? Why are none of these dot points operative?
Answer: Too expensive, too hard, not appropriate for our customers, and “What are you talking about? We never heard of half these issues you identified.”
Net net: Years ago I wrote an article for Searcher Magazine, edited at the time by Barbara Quint, a bit of an expert in online information retrieval. She worked at RAND for a number of years as an information expert. She said, “Do you really want me to use the title ‘Search Sucks’ on your article?” I told her, use whatever title you want. But if you agree with me, go with “sucks.” She used “sucks”. Let’s see, that was a couple of decades ago.
Did anyone care? Nope. Does anyone care today? Nope. There you go.
Stephen E Arnold, January 20, 2020
DuckDuckGo Lands for European Search Users
January 14, 2020
I read “DuckDuckGo Beats Microsoft Bing In Google’s New Android Search Engine Ballot.” There have been numerous reports about this decision.
Digital Information World is a representative write up in today’s world of Google EU analysis. DarkCyber noted:
The introduction of this “choice screen” seems to be a clear response to the antitrust ruling from the European Union during last March and how Google was fined $5 billion by EU regulators. According to them, Google was playing illegally in tying up the search engine to its browser for mobile OS.
Okay. But how does a search engine get listed? We learned:
you can expect Google to not show search engines which are popular but the ones whose providers are willing to pay well.
The write up includes a run down of what search options will be displayed in each EU country. The ones we spotted are:
- DuckDuckGo
- GMX
- Info.com
- Privacy Wall
- Qwant
- Yandex.
Bing is a no show as are Gibiru, iSeek, Mojeek, Yippy, and others. It is worth noting that some of these outfits are metasearch engines. This means that the systems send queries to Bing, Google, and other services and aggregate the results. Dogpile and Vivisimo were metasearch engines. DuckDuckGo and Ixquick (StartPage) are metasearch engines. The reason metasearch is available boils down to cost. It is very expensive to index the public Web.
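The aggregation step in metasearch can be sketched in a few lines. The example below merges ranked lists with reciprocal rank fusion, a widely used rank-merging method chosen here for illustration; nothing in the write up says which method any of the named services uses. The engine names and result lists are hypothetical, and no live service is queried.

```python
from collections import defaultdict
from typing import Dict, List


def fuse(result_lists: Dict[str, List[str]], k: int = 60) -> List[str]:
    """Merge ranked lists: each URL scores 1 / (k + rank) per list it appears in."""
    scores: Dict[str, float] = defaultdict(float)
    for urls in result_lists.values():
        for rank, url in enumerate(urls, start=1):
            scores[url] += 1.0 / (k + rank)
    # Highest combined score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda url: (-scores[url], url))


if __name__ == "__main__":
    # Hypothetical responses from two upstream engines for the same query.
    upstream = {
        "engine_a": ["x.example", "y.example", "z.example"],
        "engine_b": ["x.example", "w.example", "y.example"],
    }
    for url in fuse(upstream):
        print(url)
```

The metasearch operator pays for queries and a merge step like this one, not for crawling and indexing, which is the cost argument in a nutshell.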
The DarkCyber team formulated a few hypotheses about the auction, the limitations on default search engines, and the dominance of Google search in Europe; for example, Google accounts for more than 95 percent of the search traffic in Denmark. The same situation exists in Germany and other EU countries.
Will these choices make any difference? Sure, for small outfits like DuckDuckGo any increase in traffic is good news. But will the choices alter Google’s lock on search queries from Europe?
Not a chance.
Does anyone in the EU government know? Probably not. Do these people care? Not too much.
Remember one of my Laws of Information: Online generates natural monopolies. Here’s another Law: User behavior is almost impossible to change once mental memory locks in.
So Google gets paid and keeps on trucking.
Stephen E Arnold, January 14, 2020
Enterprise Search and the AI Autumn
January 13, 2020
DarkCyber noted this BBC write up: “Researchers: Are We on the Cusp of an AI Winter?” Our interpretation of the Beeb story can be summarized this way:
“Yikes. Maybe this stuff doesn’t work very well?”
The Beeb explains in Queen’s English based on quotes of experts:
Gary Marcus, an AI researcher at New York University, said: “By the end of the decade there was a growing realization that current techniques can only carry us so far.”
He [Gary Marcus, an AI wizard at NYU] thinks the industry needs some “real innovation” to go further. “There is a general feeling of plateau,” said Verena Rieser, a professor in conversational AI at Edinburgh’s Heriot-Watt University. One AI researcher who wishes to remain anonymous said we’re entering a period where we are especially skeptical about AGI.
Well, maybe.
But the enterprise search cheerleaders have not gotten the memo. The current crop of “tap your existing unstructured information” companies assert that artificial intelligence infuses their often decades old systems with zip.
The story is being believed by venture outfits. The search for the next big thing is leading to making sense of unstructured text. After all, the world is awash in unstructured text. Companies have to solve this problem or red ink and extinction are just around the corner.
Net net: AI is a collection of tools, some useful, some not too useful. Enterprise search vendors are looking for a way to make sales to executives who don’t know or don’t care about past failures to index unstructured text on a company wide basis with a single system.
Stephen E Arnold, January 13, 2020
Search Your Computer
January 13, 2020
On January 10, 2020, one of the DarkCyber team needed to locate a file on a Windows 10 machine. Windows 10 search was okay, but it generated false drops and took too long.
DarkCyber tried to get its copy of ISYS Desktop Search 8 to work, but that was a non starter. We had given up on Copernic a couple of versions ago. The dtSearch trial had expired, as had a couple of New Age search systems that vendors had provided to us to test; for example, X1, Vound, and Perfect Search, among others. Elastic was overkill. Yikes.
We then checked our files for “desktop search” and located links to these articles:
- Microsoft Windows 10 Search Indexer Diagnostics
- WizFile Is an Ultra-Fast Windows Search Tool
- UltraSearch, Fast Windows File Finder
- Everything Desktop Search
- FileSearchy Is a Fast Windows Search Alternative
- Windows Search Replacement Fileseek
- SearchMyFiles, A Versatile Desktop Search for Windows
We found a couple of these programs useful. In fact, the Everything software, version 1.4, did the trick for us.
We wanted to thank Martin Brinkmann for his articles which provided useful links and helpful information to us. Good job!
Stephen E Arnold, January 13, 2020
Lucidworks: Beyond Search for Sure
January 9, 2020
Lucid Imagination experienced what DarkCyber recalls as a bit of turmoil. From the git go, there was tension in the open sourcey ranks. One of the founders was unceremoniously given an opportunity to find his future elsewhere. Then there was the game of Revolving Door Presidents. Next was the defection of some lucid thinkers to Amazon, not in Seattle but just up the 101 to some nondescript buildings. Like a law of nature, another round of presidential revolving doors followed. Along the way, more investors wrote checks for what was an open source play based on Lucene/Solr. (I know that writing the two “names” together does not capture the grandiosity of the conception of community supported search and the privately held company’s efforts to create a huge, billion dollar information access business. Sigh.)
Now Lucidworks (which I automatically interpret as the phrase “Lucidworks. Really?”) has acquired an eCommerce vendor. Hello, what’s happening, Magento, Mercado, Shopify, and Amazon? Yep, Amazon. But doesn’t Amazon have search too? Trivial point. Lucidworks is going to turn the $200 million in investment capital, an interface scripting engine, open source software, and Cirrus10 (an ecommerce service provider) into billions. Yes, billions!
The announcement “Lucidworks Acquires Cirrus10, Global Ecommerce Service Provider, to Deepen Domain Expertise and Become a Leader in Digital Commerce Solutions” states:
Lucidworks, leader in AI-powered search, acquires Cirrus10, ecommerce solutions expert with more than 100 ecommerce customers. Lucidworks and Cirrus10 have worked together as partners for the past two years and now combine their domain expertise to provide more targeted solutions for different domains in the fast-moving ecommerce market.
The Yahoo news story points out that Lucidworks’ secret sauce is a system:
produces relevant results, recommends products that meet customer goals, and predicts shopper intent to create a more engaging experience.
And don’t forget artificial intelligence. AI! Obviously.
But whose AI? The answer appears to be AI from Cirrus10. DarkCyber noted this statement from a co founder of the ecommerce service provider:
“Fusion is the world’s only platform for extensible AI-driven search. Fusion elevated our service offerings by giving us a framework for exploring machine learning with our customers, and using it, we can build personalized and scalable relevancy models without a black box or army of data scientists. By combining Lucidworks search and AI expertise with our deep experience in the ecommerce space we can cement our role as digital commerce solution leaders.”—Peter Curran, Cirrus10
What appears to be the business strategy for Lucidworks is to get something that generates sustainable revenue, allows the company to upsell Cirrus10’s customers, and differentiate Lucidworks from the competitors in plain old search.
There are competitors; for example, outfits with venture capital backers demanding results (Algolia, Coveo). Also, open sourcey solutions (Drupal Commerce, Magento Community Edition) and small, feisty outfits (SLI Systems and EasyAsk). Note: This is a partial list. I almost forgot companies like Amazon, eBay, and Google.
DarkCyber interprets the “beyond search” phrase as an attempt to make a 12 year old company into a revenue and profit machine.
DarkCyber, which is an annex to our blog Beyond Search, wishes the clear thinkers a great 2020. The question “Lucidworks. Really?” could be answered as long as AI, NLP, machine learning, open source, and synergy produce a winner, not a horse designed by a committee.
Stephen E Arnold, January 9, 2020
Are Media Worthless? Matt Taibbi Says Yes
January 3, 2020
Robert Steele, a former US spy whom I know, and also the top reviewer for non-fiction books in English, has published Review: Hate Inc. Why Today’s Media Makes Us Despise One Another by Matt Taibbi and given the book five stars, calling it “totally brilliant.”
I was drawn to this statement in Steele’s review:
There will come a time, guaranteed, when Americans pine for a powerful neither-party-aligned news network, to help make sense of things.
Steele’s review appears to provide a concise summary of the book that those who worry about accuracy, data integrity, ethics, and the concept of social value should find interesting. Steele concludes the review by noting:
The same is true of the intelligence community, and the academy, of non-profits and governments. Keep the money moving, never mind the facts.
Facts? Are facts irrelevant? Steele and Taibbi appear to agree that facts remain important. Dissenters: Possibly the “media?”
Stephen E Arnold, January 3, 2020
Cognitive Search: A Silver Bullet?
January 2, 2020
Search is a basic function that requires tinkering to make it an intuitive and useful tool for enterprise systems. In the past, most out of the box search solutions stank and required augmentations from the IT department to work. Enterprise search, however, has dramatically improved and that makes a slow news day for search experts. Most headlines about enterprise search include the latest buzz topics, like “Significance Of Cognitive Search In The Enterprise,” posted on Analytics Insight.
Cognitive search is apparently the newest thing. It is basically enterprise search injected with machine learning/artificial intelligence steroids. An undeniable truth is that enterprise systems are pulling their data across many systems, on site and in the cloud. A good search tool will crawl each dataset and return the most accurate results. Cognitive search uses AI to make search smarter aka “more cognitive,” which basically means the search tool learns from search queries, makes search suggestions, and offers predictions. The official jargon sounds smarter:
“Cognitive search is associated with the concept of machine learning, where a computer system processes new insights and convert the way it reacts based on the newly gained data. By using the form of AI, it provides more in-depth search outcomes based on local information, previous search history and other variables. It also brings more specific results to an end-user as the cognitive system learns how an individual or system acts these searches.
This makes the cognitive search method a variable implementation into an enterprise’s network search capability.”
In other words, based on the latest technology craze, enterprise search is going to become smarter and more intuitive for users! Blah, blah, semantic search, blah, blah, search engines, blah, blah, algorithms. It is the same “new and improved” spiel that comes every year.
Whitney Grace, January 2, 2020
PubMed: Some Tweaks
December 27, 2019
PubMed.gov is an old school online information service. The user types in one or more terms, and the system generates a list of results. Controlled terms work better than “free text” guesses.
According to “Announcing the New PubMed”:
The National Library of Medicine (NLM) is replacing the current version of the PubMed database with a newly re-designed version. The new version is now live and can be found at https://pubmed.ncbi.nlm.nih.gov/.
The appearance of the site has been updated. To one of the DarkCyber team members, the logo was influenced by PayPal’s design motif. Clicking for pages of results has been supplanted by the infinite scroll. Personally, I prefer to know how many pages of results have been found for a particular query. But, just tell me, “Hey, boomer, you are stupid.” I get it.
The write up does not comment upon backlog, changes in editorial policy, and cleaning citations to weed out those which are essentially marketing write ups or articles with non reproducible results, wonky statistics, or findings unrelated to the main job of medicine. But you can use the service on a mobile phone.
Stephen E Arnold, December 27, 2019