Making Search Fair: An Interesting Idea
September 11, 2020
Search rankings on Google, Bing, and various other search engines have not been fair for years. SEO tricks fall flatter than a pancake and the best way to get to the top of Google search results is with ads. The only thing Google does that is somewhat decent is that it marks paid ads in search results. EurekAlert! shares that there is a “New Tool Improves Fairness Of Online Search Rankings.”
Cornell University researchers developed a new tool, FairCo, that improves the fairness of online rankings without sacrificing relevance or usefulness. The idea behind the tool is that users only look at the first page of search results and miss other relevant results. Left uncorrected, this creates bias in the results. FairCo works similarly to making a decision when you have all the facts:
“‘If you could examine all your choices equally and then decide what to pick, that may be considered ideal. But since we can’t do that, rankings become a crucial interface to navigate these choices,’ said computer science doctoral student Ashudeep Singh, co-first author of ‘Controlling Fairness and Bias in Dynamic Learning-to-Rank,’ which won the Best Paper Award at the Association for Computing Machinery SIGIR Conference on Research and Development in Information Retrieval. ‘For example, many YouTubers will post videos of the same recipe, but some of them get seen way more than others, even though they might be very similar,’ Singh said. ‘And this happens because of the way search results are presented to us. We generally go down the ranking linearly and our attention drops off fast.’”
FairCo is supposed to give comparable results comparable exposure and to ignore preferential treatment. The aim is to reduce the unfairness in current search algorithms, which are notorious for being biased.
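To make the exposure idea concrete, here is a minimal sketch in Python. It is not the Cornell code. It assumes each item's relevance is already known, assumes attention drops off as 1/log2(rank + 1), and uses made-up items in two groups. The names `simulate`, `disparity`, and the weight `lam` are my own. The score for each item adds a bonus to whichever group has so far received less exposure per unit of merit, which is loosely modeled on the controller idea the paper title points to.

```python
import math

# Hypothetical items: (name, group, relevance). Group A is slightly more relevant.
ITEMS = [("a1", "A", 0.8), ("a2", "A", 0.8), ("b1", "B", 0.7), ("b2", "B", 0.7)]


def exposure_at(rank):
    """Assumed position-bias model: attention decays with 1-based rank."""
    return 1.0 / math.log2(rank + 1)


def simulate(items, rounds, lam):
    """Rank the items `rounds` times and return, per group, the cumulative
    average exposure per item divided by the group's average relevance."""
    groups = sorted({group for _, group, _ in items})
    size = {g: sum(1 for _, gg, _ in items if gg == g) for g in groups}
    merit = {g: sum(r for _, gg, r in items if gg == g) / size[g] for g in groups}
    cum_exposure = {g: 0.0 for g in groups}

    for _ in range(rounds):
        # Exposure received so far per unit of merit, for each group.
        ratio = {g: cum_exposure[g] / merit[g] for g in groups}
        best = max(ratio.values())

        def score(item):
            _, group, relevance = item
            # Bonus grows with how far this item's group lags the leading group.
            return relevance + lam * (best - ratio[group])

        ranking = sorted(items, key=score, reverse=True)
        for rank, (_, group, _) in enumerate(ranking, start=1):
            cum_exposure[group] += exposure_at(rank) / size[group]

    return {g: cum_exposure[g] / merit[g] for g in groups}


def disparity(ratios):
    """Gap between the best- and worst-treated groups."""
    return max(ratios.values()) - min(ratios.values())


if __name__ == "__main__":
    print("relevance only:", disparity(simulate(ITEMS, 200, lam=0.0)))
    print("with controller:", disparity(simulate(ITEMS, 200, lam=1.0)))
```

With `lam=0.0` the slightly more relevant group sits on top in every round and the gap keeps growing. With a positive `lam` the two groups take turns at the top and the gap stays small.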
Given the number of biased media outlets and the amount of disinformation spreading across the Internet and social media platforms, FairCo could help. The hard part would be getting large companies like Google and Facebook to adopt the tool, but if Cornell researchers received an injection of Google or Facebook money to expand FairCo, it might work. However, paid ads always trump search results.
Whitney Grace, September 11, 2020
A Push for ISYS Search. Sorry, Lexmark. Oh, Right, Hyland
September 9, 2020
Those 1980s and enterprise search were a combo. Ian Davies’ search and retrieval system was very good. In fact, a long time ago I visited the old Crow’s Nest offices and sold a small job. After all, how many people from rural Kentucky end up in Sydney wanting to talk about search? Answer: Not too many. I wrangled an invitation because I complained about how the system displayed PDF files in a results list.
Flash forward and ISYS Search moved some operations to the US. Eventually the excitement waned and ISYS Search became a property of Lexmark. Lexington, Kentucky, had spawned a weird enterprise wide content management system which fetched a pretty price, and I assumed that Lexmark wanted its own content-centric technology. The wheels of time turned like a grindstone and Lexmark was caught between the business ends of the grist wheel. ISYS Search was now getting long in the tooth, and the company was sold to Hyland, also in the content management business. At this time, Lexmark had looped the loop from IBM to Chinese ownership.
Hence I was surprised to read “Why a Government Agency Needs Enterprise Search in the Modern World.” This was a message ISYS in the late 1980s was emitting as it marketed its system to law enforcement agencies with reasonable success. The write up by Hyland’s Australia manager states:
Enterprise Search is becoming an essential ‘uber tool’ for content organization and discovery. More than just adding yet another layer of applications to the department’s arsenal of tools, enterprise search allows for the organized creation, indexing and retrieval of data – both structured and unstructured – through one simple interface.
In an interview with me in 2008, Ian Davies said:
What defined search back then was the significance of the need — users were after information that truly was mission critical. Now, juxtapose that with today, where search has expanded to address usability and the need to leverage corporate knowledge. What we have is a keen demand for mission critical search and retrieval, content processing, and analysis. In addition, there are large numbers of organizations that are trying to make the best use of the information in digital form. Mission-critical search manipulates information to identify a criminal, which may be a matter of public safety, or extract the key fact from information related to a legal matter. Essential search helps employees find the answer or the information needed to do their work today. Both drive the growth of ISYS. I don’t see either need diminishing going forward.
Similar? Yep.
Observations:
- Enterprise search is a challenge and shall remain so
- The lingo used to explain enterprise search is almost timeless
- The technology and its “promises” have persisted at ISYS for more than two decades.
Why hasn’t ISYS generated greater traction? Why has the core plumbing remained the same for decades? Those are important questions because they reveal much about the enterprise search sector which seems like an easy way to generate oodles of cash.
One issue is that enterprise search, like most policeware and intelware, sits in a very difficult market sector. One of the most popular enterprise search systems is, for instance, open source and free of license fees. That’s new. The sales pitch and arguments for paying for search are not.
Stephen E Arnold, September 9, 2020
Is YouTube Search Broken? Does Anyone Care?
September 5, 2020
“The Ultimate List of YouTube Channels to Boost your Web Development and Programming Skills” illustrates one of the ways in which YouTube search does not work. The write up is a compendium of YouTube presenters with information of interest to Web developers. The list consists of more than 80 YouTube “channels” with useful information. The list is curated; that is, one or more individuals dug through the digital swamp to locate content on the topic and of value to the list compilers.
Can this list or an approximation of it be produced using the YouTube search system?
The answer is, “No.”
Before offering some observations, let me offer an illustration the DarkCyber team encountered about eight days ago. One of my group wanted to do a review of free video editing software. She ran into a “dark pattern” problem and ended up trying to contact the company’s technical support, obtain information about fixing the issue, and move forward with the review. The company (an outfit called FXHome) finally roused itself and refunded the money. No explanation about the problem was offered.
As part of that interaction, another team member went looking for information about FXHome on Web search engines. One interesting source was YouTube. We quickly learned that the YouTube search engine cannot display a comprehensive list of results with date and time stamps. My researcher reported, “YouTube doesn’t work.”
You may have had your own experiences with YouTube, but I think most people just take what the recommendation system offers.
Observations:
- For a company in the search business, YouTube search seems flawed
- Locating videos on a topic or by a company is next to impossible
- When lists are displayed, vital information about date, time, and running time is not presented.
YouTube generates a ton of money for Alphabet Google. It also opens the door to curated lists like the one cited above. The YouTube search function does generate frustration.
Stephen E Arnold, September 5, 2020
Autonomy: One Chapter Closes but the Saga Continues
August 27, 2020
Just a quick pointer to Reuters, the trusted source (that’s what the Thomson Reuters outfit says, believe me) story “Ex-Autonomy CFO’s Conviction for Hewlett-Packard Fraud Is Upheld by U.S. Appeals Court” about an Autonomy executive. The news report states that Autonomy’s CFO is in deeper legal hot water. Sushovan Hussain was convicted in April 2018 on a number of charges, including wire and securities fraud. DarkCyber still marvels that Hewlett Packard, the Board of Directors, auditors, and third party advisors applied “warp speed,” to use a popular phrase, to buy the search and content processing company for $11.1 billion. One fact is unchallengeable: This legal process is moving along at turtle speed. Is the HP Autonomy saga well suited for a Quibi video?
Stephen E Arnold, August 27, 2020
Elastic: Making Improvements
August 27, 2020
Elasticsearch is one of the most popular open-source enterprise search platforms. While Elasticsearch is free for developers to download, Elastic offers subscriptions for customer support and enhanced software. Now the company offers some new capabilities and features, HostReview reveals in, “Elastic Announces a Single, Unified Agent and New Integrations to Bring Speed, Scale, and Simplicity to Users Everywhere.” The press release tells us:
“With this launch, portions of Elastic Workplace Search, part of the Elastic Enterprise Search solution, have been made available as part of the free Basic distribution tier, enabling organizations to build an intuitive internal search experience without impacting their bottom line. Customers can access additional enterprise features, such as single sign-on capabilities and enhanced support, through a paid subscription tier, or can deploy as a managed service on Elastic Cloud. This launch also marks the first major beta milestone for Elastic in delivering comprehensive endpoint security fully integrated into the Elastic Stack, under a unified agent. This includes malware prevention that is provided under the free distribution tier. Elastic users gain third-party validated malware prevention on-premises or in the cloud, on Windows and macOS systems, centrally managed and enabled with one click.”
The upgrades are available across the company’s enterprise search, observability, and security solutions as well as Elastic Stack and Elastic Cloud. (We noted Elastic’s welcome new emphasis on security last year.) See the write-up for the specific updates and features in each area. Elasticsearch underpins operations in thousands of organizations around the world, including the likes of Microsoft, the Mayo Clinic, NASA, and Wikipedia. Founded in 2012, Elastic is based in Silicon Valley. They also happen to be hiring for many locations as of this writing, with quite a few remote (“distributed”) positions available.
Cynthia Murrell, August 27, 2020
Zero Search Results = Useful Information
August 26, 2020
I saw a notice for a conference called “Activate.” Zippy title. What caught my attention was the title of a talk; specifically, “Implementing a Deep Learning Search Engine.” The technology appears to be the open source Solr search system. As you know, dig into Solr and what do you find? Lucene. The hay day of enterprise search has gone. Perhaps another harvest will come? But after the implosion of the promises made by Fulcrum, Verity, Autonomy, Fast, Convera, and Entopia, I am not sure search has credibility.
Don’t get me wrong. Search is a major part of companies; for example, Salesforce bought Diffeo, which was an interesting search system. Elastic is, of course, the commercial firm selling support for the open source Elasticsearch system. There are unusual systems as well; for example, the quirky Qwant, which has some Pertimm inside.
But consider this description of the talk for the Activate conference delivered by two wizards (well, maybe apprentice wizards) from the Lucidworks outfit:
Recent advances in Deep Learning brings us the possibility to get improvements in almost any domain. Search Engines aren’t an exception. Semantic search, visual search, “zero results” queries, recommendations, chatbots etc. – this is just a shortlist of topics that can benefit from Deep Learning based algorithms. But more powerful methods are also more expensive, so they require addressing the variety of scalability challenges. In this talk, we will go through details of how we implement Deep Learning Search Engine at Lucidworks: what kind of techniques we use to train robust and efficient models as well as how we tackle scalability difficulties to get the best query time performance. We will also demo several use-cases of how we leverage semantic search capabilities to tackle such challenges as visual search and “zero results” queries in eCommerce.
Four points:
- Deep learning is one of those buzzwords that recyclers of open source technology slap on a utility function like search. What search vendor does not include smart software, semantics, and more Gartner-infused techno babble? Not many.
- Short cuts for training smart software for machine learning are indeed important. However, the approach which strikes me as interesting is the one taken by the ever-pragmatic AWS system pushed along by the Bezos bulldozer. AWS wants to make training a matter of buying commodity solutions of data off the shelf. Presumably the approach works like one of those consumer soap tablets I have seen in our local grocery store. Buy, rip, and wash. Bingo! Clean ML. Grubbing in data is time consuming, expensive, and oh-so-easy to get wrong.
- The goal of “zero results” in eCommerce or any other domain is not exactly a challenge. Zero results deliver data. I know that an objective system displays only the objects matching my query. Not any longer. Synonym expansion, predictive analytics, clustering, and other numerical processes are going to show me something. Too bad that the “something” is usually not what I want.
- For special cases like ecommerce, instead of a list of crazy options, why not ask the user, “Do you want to see what products other people purchased when searching for X?” Choice is sometimes helpful.
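The difference between an objective system and one that always shows “something” can be sketched in a few lines of Python. This is an illustration only, not any vendor’s method. The catalog, the co-purchase table, and the function names `strict_search` and `search_with_choice` are all hypothetical. The strict search returns an empty list when nothing matches, and the wrapper offers the user a choice instead of silently expanding the query.

```python
# Hypothetical product catalog and co-purchase data, for illustration only.
CATALOG = ["red wool scarf", "blue cotton shirt", "green wool hat"]
ALSO_BOUGHT = {"silk scarf": ["red wool scarf"]}


def strict_search(query, catalog):
    """Return only the items that contain every query term as a whole word.
    An empty list is a valid, informative answer."""
    terms = query.lower().split()
    return [item for item in catalog
            if all(term in item.lower().split() for term in terms)]


def search_with_choice(query, catalog, also_bought):
    """Run the strict search; on zero results, offer an option rather than
    substituting loosely related items."""
    hits = strict_search(query, catalog)
    if hits:
        return {"results": hits, "offer": None}
    offer = None
    if query.lower() in also_bought:
        offer = (f'No matches for "{query}". Do you want to see what products '
                 f"other people purchased when searching for it?")
    return {"results": [], "offer": offer}


if __name__ == "__main__":
    print(search_with_choice("wool scarf", CATALOG, ALSO_BOUGHT))
    print(search_with_choice("silk scarf", CATALOG, ALSO_BOUGHT))
```

The design choice is that zero results stays zero results. Anything extra is shown only if the user asks for it.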
Is this important? To me, yes. To most others, no.
The problem with making information easy is everywhere today. From individuals who disbelieve verifiable information like the earth is spheroid to the wisdom of demanding no law enforcement. Yeah, that will work.
Some quick facts to put this Lucidworks assertion in perspective. The company has ingested more than $209 million since 2007. I did some advice giving to the first president of Lucidworks, then called Lucid Imagination. I did some advice giving for another semi-lucid president. None of that advice resonated because recycling jargon does not generate sustainable revenues.
The point is that jazzy words and crazy ideas, like the notion that “zero results” are bad, are part of the problem search vendors face. Today’s search systems have drifted from displaying results which match a user’s query to dumping baloney on the display.
It is easier to yip yap with buzzwords than deal with some of the painful realities of information retrieval. Deep learning? Yeah, that will help the person locate that PowerPoint… not.
Stephen E Arnold, August 26, 2020
A Librarian Looks at Google Dorking
August 24, 2020
In order to find solutions for their jobs, many people simply conduct a Google search. Everyone from teachers to executives to software developers searches Google for solutions. Software developers spend an inordinate amount of their time searching for code libraries and language tutorials. One developer named Alec had the brilliant idea to create “dorking.” What is dorking?
“Use advanced Google Search to find any webpage, emails, info, or secrets
cost: $0
time: 2 minutes
Software engineers have long joked about how much of their job is simply Googling things
Now you can do the same, but for free”
Dorking is free! That is great! How does it work? Dorking is a tip guide using Boolean operators and other Google advanced search options to locate information. Dorking, however, does need a bit of coding knowledge to understand how it works.
Most of these tips can be plugged into a Google search box, such as finding similar sites and finding specific pages that must include a phrase in the Title text. Others need that coding knowledge to make them work. For example, finding every email on a Web page requires this:
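The tip guide’s own snippet is not reproduced in this post. As a rough stand-in, here is a minimal Python sketch of the general technique: run a regular expression over a page’s text and collect the strings that look like email addresses. The pattern and the function name `find_emails` are my assumptions, not the guide’s code, and the pattern is deliberately simple, so it will miss some valid addresses.

```python
import re

# A simple, assumed pattern for email-like strings; not a full address validator.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(page_text):
    """Return unique email-like strings in order of first appearance."""
    found = []
    for match in EMAIL_PATTERN.findall(page_text):
        if match not in found:
            found.append(match)
    return found


if __name__ == "__main__":
    sample = ("Contact alice@example.com or bob.smith@mail.example.org; "
              "alice@example.com again.")
    print(find_emails(sample))
```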
Yep, dorking for everyone.
After a few practice trials, these dorking tips are sure to work for even the most novice of Googlers. They will also make anyone, not just software developers, look like an expert. As a librarian, I have to ask: why not assign field types and codes, bring back Boolean logic, and respect existing Google operators? Putting a word in quotes and then getting a result without the word is, how should I frame it? I know: dorky.
Whitney Grace, MLS, August 24, 2020
Surprising Google Data
August 20, 2020
DarkCyber is not sure if these data are accurate. We have had some interesting interactions with NordVPN, and we are skeptical about this outfit. Nevertheless, let’s look beyond a dicey transaction with the NordVPN outfit and focus on the data in “When Looking for a VPN, Chinese Citizens Search for Google.”
The article asserts:
New research by NordVPN reveals that when looking for VPN services on Baidu, the local equivalent of Google, the Chinese are mostly trying to get access to Google – in fact, 40,35% of all VPN service-related searches have to do with Google. YouTube comes second on the list, accounting for 31,58% of all searches. Other research by NordVPN has shown that YouTube holds the most desired restricted content, with 82,7% of Internet users worldwide searching for how to unblock this video sharing platform.
If valid, these data suggest that Google’s market magnetism is powerful. Perhaps a type of quantum search entanglement?
Stephen E Arnold, August 20, 2020
SlideShare: Some Work to Do
August 12, 2020
DarkCyber noted “Scribd Acquires Presentation Sharing Service SlideShare from LinkedIn.” In 2004, one could locate presentations on Google by searching for the extension ppt and its variants. In 2006, SlideShare became available. Then something happened. PowerPoints became more difficult to locate. When an online search pointed to a PowerPoint deck, the content was:
- Marketing fluff
- Incorrectly rendered with weird typography and wonky graphics
- Corrupted files.
What about today? DarkCyber’s most recent foray into the slide deck content wilderness produced zero; for example, SlideShare search produced identical pages of search results. The query retrieved slide decks on unrelated topics. Even worse, a query would result in SlideShare’s sending email upon email pointing to other slide decks. The one characteristic of these related slide decks was, and is, that they were unrelated to the information we sought.
There are online presentation services. There are open source presentation tools like SoftMaker’s. There is the venerable Keynote which never quite converts a PowerPoint file correctly.
Is there a future in a searchable collection of slide decks? In theory, yes. In reality, the cost of finding, indexing, and making searchable presentations faces some big hurdles; for example:
- Many organizations — for example, DARPA — standardize on PDF file formats. These are okay, but indexing these can be an interesting challenge
- Some presenters put their talks in the cloud, hoping that an Internet connection will allow their slides to display
- The Zoom world puts PowerPoints and other presentation materials on the user’s computer, never to make it into a more publicly accessible repository.
Like the dream of collecting conferences, presentations, and poster sessions, some content remains beyond the reach of researchers and analysts. The desire to get anyone looking for a slide deck to subscribe to a service gives operators of this service a chance to engage in spreadsheet fever. Here’s how this works: If there are X researchers, and we get Y percent of them, we can charge Z per year. By substituting guesstimates for the variables, the service becomes a winner.
The reality is that finding information in slide decks is more difficult today than it was in 2004. Access to information is becoming more difficult. DarkCyber would like to experience a SlideShare with useful content, more effective search and retrieval, and far fewer one-page duplicates of ads for books.
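Spreadsheet fever is easy to reproduce. The Python sketch below multiplies X researchers by Y percent by Z dollars per year. Every number in it is invented for illustration; none comes from SlideShare, Scribd, or anyone else. The point is how much the answer swings when one guesstimate changes.

```python
def projected_revenue(researchers, percent_captured, price_per_year):
    """X researchers, Y percent of them subscribing, Z dollars per year each."""
    return researchers * percent_captured / 100 * price_per_year


if __name__ == "__main__":
    # Hypothetical guesstimates: the same market and price, two capture rates.
    optimistic = projected_revenue(1_000_000, 5, 100)
    sober = projected_revenue(1_000_000, 0.5, 100)
    print(f"optimistic: ${optimistic:,.0f} per year")
    print(f"sober: ${sober:,.0f} per year")
```

Move Y from 5 percent to half a percent and the winner shrinks by a factor of ten.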
Someday. Maybe?
Stephen E Arnold, August 12, 2020
NetDocuments Employs BA Insight Tech for Enterprise Search
August 10, 2020
For a secure, cloud-based data solution, many law firms, legal departments, and compliance teams turn to NetDocuments. Now the platform has adopted technology from a familiar name to simplify its clients’ access to information. A post at PRWeb reveals, “NetDocuments Introduces NetKnowledge Enterprise Search Powered by BA Insight.” We find it interesting that the 16-year-old BA Insight is licensing its askable-knowledge system to create the new tool, NetKnowledge. The press release describes the system’s advantages:
“Eliminate Downloading and Indexing Data for Search: No longer does content within NetDocuments need to be downloaded and indexed to be part of an organization’s enterprise search. Simply search within the NetDocuments platform, and NetKnowledge will find relevant data–along with information from other sources —and present it to users.
“Enforce Access Controls on Sensitive Information: Sensitive information may need to be restricted to certain individuals, but that data also needs to be available to others via enterprise search. NetKnowledge respects data restriction policies at the source and will only present data to individuals with proper access rights.
“Manage Large and Disparate Data Sets Across the Organization: NetKnowledge helps organizations bring all its data together to form a single source of truth, so users do not have to perform multiple searches in different places to get the information they need.”
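The access-control point in the press release is usually called security trimming: a result appears only if the person searching is allowed to see the source document. Here is a minimal Python sketch of the idea. It is not NetKnowledge or BA Insight code. The `Document` class, the group names, and the `search` function are hypothetical, and real systems check permissions against the source repository rather than a field on the document.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    title: str
    text: str
    # Groups permitted to see this document (a hypothetical stand-in for a source ACL).
    allowed_groups: set = field(default_factory=set)


def search(query, documents, user_groups):
    """Return titles of documents that match the query and that at least one
    of the user's groups is allowed to see."""
    needle = query.lower()
    groups = set(user_groups)
    return [doc.title for doc in documents
            if needle in doc.text.lower() and doc.allowed_groups & groups]


if __name__ == "__main__":
    docs = [
        Document("Merger memo", "Draft merger terms", {"partners"}),
        Document("Merger FAQ", "Public merger questions", {"partners", "staff"}),
    ]
    print(search("merger", docs, ["staff"]))
    print(search("merger", docs, ["partners"]))
```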
Founded in 2004, BA Insight is based in Boston, Massachusetts. The company is dedicated to making information easier to find for organizations of all stripes. NetDocuments is headquartered in Lehi, Utah. The company was founded in 1999 and acquired by Clearlake Capital Group in 2017.
Cynthia Murrell, August 10, 2020