Google and Search Results: A Stay at Home Mother Explains

October 1, 2020

DarkCyber has a sneaking suspicion that Google wants to deliver the answers to users’ queries in a manner which:

  • Prevents a user from obtaining non-Google “approved” information
  • Requires zero latency between presenting an answer to a query and a click on an advertiser’s message
  • Appeals to a statistically significant percentage of users who accept the precept “Google makes one’s research easy”.

Other people do not agree with DarkCyber; for example, Google executives testifying before Congress or Googlers who are paid to explain how wonderful Google really, really is.

“Google Wants to Eliminate Search Engine. Introducing Semantic Search” is an interesting and possibly disconcerting write up. One of the DarkCyber researchers noted this passage for me:

The experts at Google want to eliminate the one thing that Google does best – searching.

Since Google is perceived as search, what’s up? What’s up is that Google wants to deliver the “correct” answer directly to a thumb typing user or an impressionable child using a Chromebook and Google approved information to learn.

The write up explains in cheery stay-at-home mom panache:

With semantic searching, the algorithm working behind the search engine will understand the meaning of the search term and hence provide meaningful results, saving users a lot of hassle and a lot of time. In short, the new search is going to allow users to smart search for everything on the web.

Yep, smart search. Everything. The Web.

Sounds perfect, particularly for Google and its ad-centric approach to services.

Plus, users benefit because search engine optimization will no longer force the ever-smart Google search system to display irrelevant results:

Google is just preventing website owners to dig out the most-searched for keywords and then bulk them on to their websites.

DarkCyber finds the “just” an interesting word. Google just wants to make users better informed. How thoughtful. Research becomes little more than accepting what Google determines is optimal. Why read? Why compare? Why analyze? Google knows best: Best in terms of controlling access to information, shaping perceptions, and selling ads. Yes, that “best” may mean that an advertiser paid to get the click.

The DarkCyber researcher put an exclamation mark next to this passage:

In order to calm website owners down, Google has provided that the new algorithm is going to consist of an improved form of the same algorithm which will provide an opportunity to work towards legitimate optimization instead of spamming.

Yes, be calm. Accept what is delivered.

Stephen E Arnold, October 1, 2020

Microsoft Bing: Assertions Versus Actual Search Results

September 25, 2020

DarkCyber read “Introducing the Next Wave of AI at Scale innovations in Bing.” The write up explains a number of innovations. These enhancements will make finding information via Bing easier, better, faster, and generally more wonderful.

The main assertions DarkCyber noted are:

Smarter suggestions. The idea is that one does not know how to create a search query. Bing will know what the user wants.

More ideas. Bing will display questions other people (presumably just like me) ask. Bing keeps track and shows the popular questions. Yep, popular.

Translations. Send a query with mixed languages, and Bing will answer in your language. No more of that copying and pasting into Google Translate or Freetranslations.org.

Highlighting. This is Bing’s yellow marker. The system will highlight what you need to read. The method? “A zero-shot fashion.” No, DarkCyber does not know what this means. But one can ask Bing, right?

Let’s give Bing a whirl and run the same query against Googzilla.

Here’s a DarkCyber Bing query related to research we are now doing:

Black Sage open source

And here’s the result:

[Image: Bing results for the query]

Black Sage is an integrator engaged in the development of counter unmanned aerial systems. The firm’s marketing collateral emphasizes that its platform is open. DarkCyber wants to know if the system uses open source methods for compromising a targeted UAS (drone). Bing focuses on a publishing company.

Now Google:

[Image: Google results for the query]

The first result from the Google is a pointer to the company. The remainder of the results are crazy and wacky like the sneakers Mr. Brin wore to Washington about a decade ago to meet elected officials. Crazy? Nope, Sillycon Valley.

DarkCyber uses both Bing and Google. Why did Google produce something sort of related to our query and Bing missed the corn hole entirely?

The answer is that Bing does not process a user’s search history as effectively as the Google. All the fancy words from Microsoft cannot alter a search result. DarkCyber is amused by Google and Microsoft. We are skeptical of each system.

Key points:

  • Microsoft is chasing technology instead of looking for efficient ways to tailor results to a user.
  • Microsoft wants to prove that its approach is more knowledge-centric. Google just wants to sell ads. Giving people something they have already seen is fine with Mother Google.
  • Microsoft, like Google, has lost sight of the utility of providing “stupid mode” and “sophisticated mode” for users. Let users select how a query should be matched to the content in the index.

To sum up, Google has a global share of Web search in the 85 percent range. Bing is an also participated player. Perhaps a less academic approach, deeper index, and functional user controls would be helpful?

Stephen E Arnold, September 25, 2020

Web Scraping: Better Than a Library for Thumbtypers

September 22, 2020

Modern research. The thumbtyper way.

Nature explains the embrace of a technology that, when misused, causes concern in the post, “How We Learnt to Stop Worrying and Love Web Scraping.” The efficiency and repeatability of automation are a boon to researchers Nicholas J. DeVito, Georgia C. Richards, and Peter Inglesby, who write:

“You will end up with a sharable and reproducible method for data collection that can be verified, used and expanded on by others — in other words, a computationally reproducible data-collection workflow. In a current project, we are analyzing coroners’ reports to help to prevent future deaths. It has required downloading more than 3,000 PDFs to search for opioid-related deaths, a huge data-collection task. In discussion with the larger team, we decided that this task was a good candidate for automation. With a few days of work, we were able to write a computer program that could quickly, efficiently and reproducibly collect all the PDFs and create a spreadsheet that documented each case. … [Previously,] we could manually screen and save about 25 case reports every hour. Now, our program can save more than 1,000 cases per hour while we work on other things, a 40-fold time saving. It also opens opportunities for collaboration, because we can share the resulting database. And we can keep that database up to date by re-running our program as new PDFs are posted.”

The authors explain how scraping works to extract data from web pages’ HTML and describe how to get started. One could adopt a pre-made browser extension like webscraper.io or write a customized scraper—a challenging task but one that gives users more control. See the post for details on that process.

With either option, we are warned, there are several considerations to keep in mind. For some projects, those who possess the data have created an easier way to reach it, so scraping would be a waste of time and effort. Conversely, other websites hold their data so tightly it is not available directly in the HTML or has protections built in, like captchas. Those considering scraping should also take care to avoid making requests of a web server so rapidly that it crashes (an accidental DoS attack) or running afoul of scraping rules or licensing and copyright restrictions. The researchers conclude by encouraging others to adopt the technique and share any custom code with the community.
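The considerations above can be sketched in a few lines of Python. This is a minimal illustration, not the Nature authors' program: the sample HTML, the base URL `https://example.org/reports/`, and the names `PdfLinkParser`, `extract_pdf_links`, and `polite_fetch` are assumptions made for the example. It uses only the standard library, pulls PDF links out of a page's HTML, and pauses between requests so the server is not hammered.

```python
import time
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class PdfLinkParser(HTMLParser):
    """Collect href values of <a> tags that point at PDF files."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".pdf"):
            self.links.append(href)


def extract_pdf_links(html, base_url):
    """Return absolute URLs for every PDF link found in the HTML text."""
    parser = PdfLinkParser()
    parser.feed(html)
    return [urljoin(base_url, link) for link in parser.links]


def polite_fetch(urls, delay_seconds=2.0, fetch=None, sleep=time.sleep):
    """Download each URL in turn, sleeping between requests.

    The delay is the simplest guard against an accidental DoS; `fetch`
    and `sleep` can be swapped out for testing.
    """
    if fetch is None:
        fetch = lambda url: urlopen(url).read()
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # wait before every request after the first
        results[url] = fetch(url)
    return results


# Hypothetical page content; a real run would download this first.
SAMPLE_HTML = """
<ul>
  <li><a href="case-001.pdf">Case 1</a></li>
  <li><a href="/reports/case-002.PDF">Case 2</a></li>
  <li><a href="about.html">About</a></li>
</ul>
"""

if __name__ == "__main__":
    for url in extract_pdf_links(SAMPLE_HTML, "https://example.org/reports/"):
        print(url)
```

A fuller version would also consult the site's robots.txt (the standard library's `urllib.robotparser` can do this) and the site's licensing terms before downloading anything.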

Cynthia Murrell, September 22, 2020

Podcast Search: Illuminating the Rich Media Darkness

September 22, 2020

Search for podcasts is broken. We learn of a possible first step toward a fix from Podnews in the brief write-up, “The Podfather Launches a New, Open Podcast Directory.” James Cridland writes:

“‘The digital ad space is watching as the bottom falls out of their data collection methods. But how exactly does Apple’s Age of Privacy impact podcasting?’ – in today’s Sounds Profitable, our new adtech newsletter, with Podsights.

“Adam Curry has launched a new, open podcast directory for app developers, working with developer Dave Jones. Speaking on a new podcast, Podcasting 2.0, Curry and Jones worry that ‘Apple is starting to tinker with their directory’, and say that the company is ‘a very centralized private entity that is controlling pretty much what everybody considers the default yellow pages for podcasting.’ His alternative, The Podcast Index, promises that the ‘core, categorized index will always be available for free, for any use’. You can sign up to be a developer on their developer portal. We support this initiative. As of today, Podnews uses The Podcast Index for our main podcast search.”

The index is a simple type-and-search format. It seems to work acceptably well on Podnews’ database, though it could use a little relevance refinement. Will the open directory attract developers and reach the larger segment? We hope this or another solution is implemented soon.

Cynthia Murrell, September 22, 2020

Making Search Fair: An Interesting Idea

September 11, 2020

Search rankings on Google, Bing, and various other search engines have not been fair for years. SEO tricks fall flatter than a pancake and the best way to get to the top of Google search results is with ads. The only thing Google does that is somewhat decent is that it marks paid ads in search results. EurekAlert! shares that there is a “New Tool Improves Fairness Of Online Search Rankings.”

Cornell University researchers developed a new tool, FairCo, that improves the fairness of online rankings without sacrificing relevance or usefulness. The idea behind the tool is that users only look at the first page of search results and miss other relevant results, which creates bias in what gets seen. FairCo works similarly to making a decision when you have all the facts:

“‘If you could examine all your choices equally and then decide what to pick, that may be considered ideal. But since we can’t do that, rankings become a crucial interface to navigate these choices,’ said computer science doctoral student Ashudeep Singh, co-first author of ‘Controlling Fairness and Bias in Dynamic Learning-to-Rank,’ which won the Best Paper Award at the Association for Computing Machinery SIGIR Conference on Research and Development in Information Retrieval. ‘For example, many YouTubers will post videos of the same recipe, but some of them get seen way more than others, even though they might be very similar,’ Singh said. ‘And this happens because of the way search results are presented to us. We generally go down the ranking linearly and our attention drops off fast.’”

FairCo is supposed to give the same exposure to all results from a search and ignore preferential treatment. This eliminates the unfairness in current search algorithms, which are notorious for being biased.
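To make the exposure idea concrete, here is a minimal sketch in Python. It is not the FairCo algorithm from the Cornell paper; it only illustrates the underlying problem under an assumed position-bias model in which attention at rank k falls off as 1/log2(k + 1). The item names, the relevance scores, and the helper names are hypothetical.

```python
import math


def position_weight(rank):
    """Assumed attention at a 1-based rank: 1 / log2(rank + 1)."""
    return 1.0 / math.log2(rank + 1)


def exposure_by_group(ranking, group_of):
    """Sum the position weights each group receives in one ranking."""
    totals = {}
    for rank, item in enumerate(ranking, start=1):
        group = group_of[item]
        totals[group] = totals.get(group, 0.0) + position_weight(rank)
    return totals


# Hypothetical recipe videos from two creators with near-identical relevance.
relevance = {"a1": 0.91, "a2": 0.90, "b1": 0.89, "b2": 0.88}
group_of = {"a1": "creator_a", "a2": "creator_a",
            "b1": "creator_b", "b2": "creator_b"}

# Ranking purely by relevance puts both of creator_a's videos on top.
by_relevance = sorted(relevance, key=relevance.get, reverse=True)
# Alternating the creators costs almost nothing in relevance.
interleaved = ["a1", "b1", "a2", "b2"]

if __name__ == "__main__":
    for name, ranking in [("by relevance", by_relevance),
                          ("interleaved", interleaved)]:
        exposure = exposure_by_group(ranking, group_of)
        print(name, {g: round(v, 3) for g, v in exposure.items()})
```

In this toy case the relevance-only ranking gives creator_a a far larger share of attention than the tiny difference in relevance would justify, and interleaving narrows the gap. The sketch only shows what is being measured; how FairCo balances exposure against relevance is described in the paper itself.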

With the number of biased media outlets and the disinformation spreading across the Internet and social media platforms, FairCo could help eliminate this problem. The difficulty would be getting large companies like Google and Facebook to adopt the tool, but if Cornell researchers received an injection of Google or Facebook money to expand FairCo it might work. However, paid ads always trump search results.

Whitney Grace, September 11, 2020

A Push for ISYS Search. Sorry, Lexmark. Oh, Right, Hyland

September 9, 2020

Those 1980s and enterprise search were a combo. Ian Davies’ search and retrieval system was very good. In fact, a long time ago I visited the old Crow’s Nest offices and sold a small job. After all, how many people from rural Kentucky end up in Sydney wanting to talk about search? Answer: Not too many. I wrangled an invitation because I complained about how the system displayed PDF files in a results list.

Flash forward and ISYS Search moved some operations to the US. Eventually the excitement waned and ISYS Search became a property of Lexmark. Lexington, Kentucky, had spawned a weird enterprise wide content management system which fetched a pretty price, and I assumed that Lexmark wanted its own content-centric technology. The wheels of time turned like a grindstone and Lexmark was caught between the business ends of the grist wheel. ISYS Search was now getting long in the tooth, and the company was sold to Hyland, also in the content management business. At this time, Lexmark had looped the loop from IBM to Chinese ownership.

Hence I was surprised to read “Why a Government Agency Needs Enterprise Search in the Modern World.” This was a message ISYS in the late 1980s was emitting as it marketed its system to law enforcement agencies with reasonable success. The write up by Hyland’s Australia manager states:

Enterprise Search is becoming an essential ‘uber tool’ for content organization and discovery. More than just adding yet another layer of applications to the department’s arsenal of tools, enterprise search allows for the organized creation, indexing and retrieval of data – both structured and unstructured – through one simple interface.

In an interview with me in 2008, Ian Davies said:

What defined search back then was the significance of the need — users were after information that truly was mission critical. Now, juxtapose that with today, where search has expanded to address usability and the need to leverage corporate knowledge. What we have is a keen demand for mission critical search and retrieval, content processing, and analysis. In addition, there are large numbers of organizations that are trying to make the best use of the information in digital form. Mission-critical search manipulates information to identify a criminal, which may be a matter of public safety, or extract the key fact from information related to a legal matter. Essential search helps employees find the answer or the information needed to do their work today. Both drive the growth of ISYS. I don’t see either need diminishing going forward.

Similar? Yep.

Observations:

  • Enterprise search is a challenge and shall remain so
  • The lingo used to explain enterprise search is almost timeless
  • The technology and its “promises” have persisted at ISYS for more than two decades.

Why hasn’t ISYS generated greater traction? Why has the core plumbing remained the same for decades? Those are important questions because they reveal much about the enterprise search sector which seems like an easy way to generate oodles of cash.

One issue is that enterprise search, like most policeware and intelware systems, operates in a very difficult market sector. One of the most popular enterprise search systems is, for instance, open source and free of license fees. That’s new. The sales pitch and arguments for paying for search are not.

Stephen E Arnold, September 9, 2020

Is YouTube Search Broken? Does Anyone Care?

September 5, 2020

“The Ultimate List of YouTube Channels to Boost your Web Development and Programming Skills” illustrates one of the ways in which YouTube search does not work. The write up is a compendium of YouTube presenters with information of interest to Web developers. The list consists of more than 80 YouTube “channels” with useful information. The list is curated; that is, one or more individuals dug through the digital swamp to locate content on the topic and of value to the list compilers.

Can this list or an approximation of it be produced using the YouTube search system?

The answer is, “No.”

Before offering some observations, let me offer an illustration the DarkCyber team encountered about eight days ago. One of my group wanted to do a review of free video editing software. She ran into a “dark pattern” problem and ended up trying to contact the company’s technical support, obtain information about fixing the issue, and move forward with the review. The company (an outfit called FXHome) finally roused itself and refunded the money. No explanation about the problem was offered.

As part of that interaction, another team member went looking for information about FXHome on Web search engines. One interesting source was YouTube. We quickly learned that the YouTube search engine cannot display a comprehensive list of results with date and time stamps. My researcher reported, “YouTube doesn’t work.”

You may have had your own experiences with YouTube, but I think most people just take what the recommendation system offers.

Observations:

  1. For a company in the search business, YouTube search seems flawed
  2. Locating videos on a topic or by a company is next to impossible
  3. When lists are displayed, vital information about date, time, and running time is not presented.

YouTube generates a ton of money for Alphabet Google. It also opens the door to curated lists like the one cited above. The YouTube search function does generate frustration.

Stephen E Arnold, September 5, 2020

Autonomy: One Chapter Closes but the Saga Continues

August 27, 2020

Just a quick pointer to Reuters, the trusted source (that’s what the Thomson Reuters outfit says, believe me) story “Ex-Autonomy CFO’s Conviction for Hewlett-Packard Fraud Is Upheld by U.S. Appeals Court” about an Autonomy executive. The news report states that Autonomy’s CFO is in deeper legal hot water. Sushovan Hussain was convicted in April 2018 on a number of charges, including wire and securities fraud. DarkCyber still marvels that Hewlett Packard, the Board of Directors, auditors, and third party advisors applied “warp speed,” to use a popular phrase, to buy the search and content processing company for $11.1 billion. One fact is unchallengeable: This legal process is moving along at turtle speed. Is the HP Autonomy saga well suited for a Quibi video?

Stephen E Arnold, August 27, 2020

Elastic: Making Improvements

August 27, 2020

Elasticsearch is one of the most popular open-source enterprise search platforms. While Elasticsearch is free for developers to download, Elastic offers subscriptions for customer support and enhanced software. Now the company offers some new capabilities and features, HostReview reveals in, “Elastic Announces a Single, Unified Agent and New Integrations to Bring Speed, Scale, and Simplicity to Users Everywhere.” The press release tells us:

“With this launch, portions of Elastic Workplace Search, part of the Elastic Enterprise Search solution, have been made available as part of the free Basic distribution tier, enabling organizations to build an intuitive internal search experience without impacting their bottom line. Customers can access additional enterprise features, such as single sign-on capabilities and enhanced support, through a paid subscription tier, or can deploy as a managed service on Elastic Cloud. This launch also marks the first major beta milestone for Elastic in delivering comprehensive endpoint security fully integrated into the Elastic Stack, under a unified agent. This includes malware prevention that is provided under the free distribution tier. Elastic users gain third-party validated malware prevention on-premises or in the cloud, on Windows and macOS systems, centrally managed and enabled with one click.”

The upgrades are available across the company’s enterprise search, observability, and security solutions as well as Elastic Stack and Elastic Cloud. (We noted Elastic’s welcome new emphasis on security last year.) See the write-up for the specific updates and features in each area. Elasticsearch underpins operations in thousands of organizations around the world, including the likes of Microsoft, the Mayo Clinic, NASA, and Wikipedia. Founded in 2012, Elastic is based in Silicon Valley. They also happen to be hiring for many locations as of this writing, with quite a few remote (“distributed”) positions available.

Cynthia Murrell, August 27, 2020

Zero Search Results = Useful Information

August 26, 2020

I saw a notice for a conference called “Activate.” Zippy title. What caught my attention was the title of a talk; specifically, “Implementing a Deep Learning Search Engine.” The technology appears to be the open source Solr search system. As you know, dig into Solr and what do you find? Lucene. The hay day of enterprise search has gone. Perhaps another harvest will come? But after the implosion of the promises made by Fulcrum, Verity, Autonomy, Fast, Convera, and Entopia, I am not sure search has credibility.

Don’t get me wrong. Search is a major part of companies; for example, Salesforce bought Diffeo, which was an interesting search system. Elastic is, of course, the commercial firm selling support for the open source Elasticsearch system. There are unusual systems as well; for example, the quirky Qwant, which has some Pertimm inside.

But consider this description of the talk for the Activate conference delivered by two wizards (well, maybe apprentice wizards) from the Lucidworks outfit:

Recent advances in Deep Learning brings us the possibility to get improvements in almost any domain. Search Engines aren’t an exception. Semantic search, visual search, “zero results” queries, recommendations, chatbots etc. – this is just a shortlist of topics that can benefit from Deep Learning based algorithms. But more powerful methods are also more expensive, so they require addressing the variety of scalability challenges. In this talk, we will go through details of how we implement Deep Learning Search Engine at Lucidworks: what kind of techniques we use to train robust and efficient models as well as how we tackle scalability difficulties to get the best query time performance. We will also demo several use-cases of how we leverage semantic search capabilities to tackle such challenges as visual search and “zero results” queries in eCommerce.

Four points:

  1. Deep learning is one of those buzzwords that recyclers of open source technology slap on a utility function like search. What search vendor does not include smart software, semantics, and more Gartner-infused techno babble? Not many.
  2. Short cuts for training smart software for machine learning are indeed important. However, the approach which strikes me as interesting is the one taken by the ever-pragmatic AWS system pushed along by the Bezos bulldozer. AWS wants to make training a matter of buying commodity solutions of data off the shelf. Presumably the approach works like one of those consumer soap tablets I have seen in our local grocery store. Buy, rip, and wash. Bingo! Clean ML. Grubbing in data is time consuming, expensive, and oh-so-easy to get wrong.
  3. The goal of “zero results” in eCommerce or any other domain is not exactly a challenge. Zero results deliver data. I know that an objective system displays only the objects matching my query. Not any longer. Synonym expansion, predictive analytics, clustering, and other numerical processes are going to show me something. Too bad that the “something” is usually not what I want.
  4. For special cases like ecommerce, instead of a list of crazy options, why not ask the user, “Do you want to see what products other people purchased when searching for X?” Choice is sometimes helpful.
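The behavior argued for in the last two points can be sketched as follows. This is an illustration in Python, not any vendor's implementation: the tiny catalog, the synonym table, and the function names are all made up for the example. Strict matching returns only items containing every query term, an empty list is reported as such, and broader matching happens only when the user asks for it.

```python
def tokenize(text):
    """Lowercase the text and split on whitespace."""
    return text.lower().split()


def strict_search(query, catalog):
    """Return only the items whose text contains every query term."""
    terms = set(tokenize(query))
    return [item for item in catalog if terms <= set(tokenize(item))]


def expanded_search(query, catalog, synonyms):
    """Return items matching any query term or any listed synonym."""
    terms = set(tokenize(query))
    for term in list(terms):
        terms.update(synonyms.get(term, []))
    return [item for item in catalog if terms & set(tokenize(item))]


def search(query, catalog, synonyms, allow_expansion=False):
    """Run a strict search; broaden it only if the user opted in."""
    hits = strict_search(query, catalog)
    if hits:
        return {"mode": "strict", "results": hits}
    if allow_expansion:
        return {"mode": "expanded",
                "results": expanded_search(query, catalog, synonyms)}
    # Zero results is itself an answer: nothing in the index matches.
    return {"mode": "strict", "results": [],
            "prompt": "No exact matches. Show related items instead?"}


# Hypothetical catalog and synonym table.
CATALOG = ["red running shoes", "blue trail sneakers", "leather office chair"]
SYNONYMS = {"shoes": ["sneakers"], "sofa": ["chair"]}

if __name__ == "__main__":
    print(search("green shoes", CATALOG, SYNONYMS))
    print(search("green shoes", CATALOG, SYNONYMS, allow_expansion=True))
```

The design choice is that expansion is a flag the caller sets, not a default: the empty result list is passed through unchanged, along with a question the interface can put to the user.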

Is this important? To me, yes. To most others, no.

The problem with making information easy is everywhere today. From individuals who disbelieve verifiable information like the earth is spheroid to the wisdom of demanding no law enforcement. Yeah, that will work.

Some quick facts to put this Lucidworks assertion in perspective. The company has ingested more than $209 million since 2007. I did some advice giving to the first president of Lucidworks, then called Lucid Imagination. I did some advice giving for another semi-lucid president. None of that advice resonated because recycling jargon does not generate sustainable revenues.

The point is that jazzy words and crazy ideas, such as the notion that “zero results” are bad, are part of the problem search vendors face. Today’s search systems have drifted from displaying results which match a user’s query to dumping baloney on the display.

It is easier to yip yap with buzzwords than deal with some of the painful realities of information retrieval. Deep learning? Yeah, that will help the person locate that PowerPoint… not.

Stephen E Arnold, August 26, 2020
