Google Avoids a Prince Andrew-Like Interview

December 5, 2019

Yep, prime time. After a football game. A hard-hitting interview with a PR message. You can read and watch some of the talk in the self-referential, Google-indexing-friendly news story “How Does YouTube Handle the Site’s Misinformation, Conspiracy Theories and Hate? YouTube’s Mission Is to Give Everyone a Voice, But the Site’s Open Platform Has Opened the Door to Hate. YouTube CEO Susan Wojcicki Tells Lesley Stahl What the Company’s Doing about It.”

Now that’s a headline.

A couple of words in this 43-word headline caught my attention: first the word “mission,” then the phrase “everyone a voice,” then “tells” and “company’s doing.” Interview? More like a sales pitch, perhaps?

Very PR-like. Almost Prince Andrewish.

The main point: the sheer quantity of video is both the problem and the excuse. The numbers sure sound impressive.

What’s the fix? Limit the video uploads. Presto. No more cost challenges. Editorial guidelines. Responsibility. Conformance to the laws of nation-states. (Think how silly Apple looks changing the maps to make the Russian government happy. Crimea? Just upload a YouTube video and make your voice heard, right?)

Users are, after all is said and done, the point of the service.

Here’s a telling comment in response to a question about providing a YouTube video of murders in New Zealand:

Susan Wojcicki: This event was unique because it was really a made-for-Internet type of crisis. Every second there was a new upload. And so our teams around the world were working on this to remove this content. We had just never seen such a huge volume.

Yep, unique, plus the fact that Google/YouTube was obviously unprepared. Like this sign:

[Image: a “Plan Ahead” sign]

Prince Andrewish or not? You decide:

  1. Does Google sustain a lack of awareness of the situation?
  2. Is there a “certain blindness” when it comes to examining the content findable by individuals who know where to look: unlock codes for stolen software, images of interest to bad actors, and content designed to promote activities which can harm a person?
  3. Is a slow waltz required to mute the perception that advertising revenue is more important than Austrian concertmaster virtues?
  4. Why not explain that the goal of YouTube is engagement: to keep children and the young at heart clicking, viewing, and sticking? The more engagement, the greater the real estate for ads. (See item 3 above, please.)

Prince Andrew’s train-wreck interview underscored his interesting behavior and caused the Queen to take away his flag. (Yep, he had his own flag!) Google still has its flag for the sovereign state of Google; it just has a new and beloved leader.

Did the CBS news team get a Google mouse pad before leaving the Google office? Probably, but the big numbers about the YouTube videos may have left the team addled. Big is good. PR is gooder. Google advertising? The absolute goodest.

This interview was not Andrewesque; it was Googley.

Stephen E Arnold, December 5, 2019

Curious about Semantic Search the SEO Way?

November 12, 2019

DarkCyber is frequently curious about search: Semantic, enterprise, meta, multi-lingual, Boolean, and the laundry list of buzzwords marshaled to allow a person to find an answer.

If you want to get a Zithromax Z-PAK of semantic search talk, navigate to “Semantic Search Guide.” One has to look closely at the URL to discern that this “objective” write up is about search engine optimization or SEO. DarkCyber affectionately describes SEO as the “relevance” killer, but that’s just our old-fashioned self refusing to adapt to the whizzy new world.

The link will point to a page with a number of links. These include:

  • Target audience and contributions
  • The knowledge graph explained
  • The evolution of search
  • Using Google’s entity search tool
  • Getting a Wikipedia listing

DarkCyber took a look at the “Evolution of Search” segment. We found it quirky but interesting. For example, we noted this passage:

Now we turn to the heart of full-text search. SEOs tend to dwell on the indexing part of search or the retrieval part of the search, called the Search Engine Results Pages (SERPs, for short). I believe they do this because they can see these parts of the search. They can tell if their pages have been crawled, or if they appear. What they tend to ignore is the black box in the middle. The part where a search engine takes all those gazillion words and puts them in an index in a way that allows for instant retrieval. At the same time, they are able to blend text results with videos, images and other types of data in a process known as “Universal Search”. This is the heart of the matter and whilst this book will not attempt to cover all of this complex subject, we will go into a number of the algorithms that search engines use. I hope these explanations of sometimes complex, but mostly iterative algorithms appeal to the marketer inside you and do not challenge your maths skills too much. If you would like to take these ideas in in video form, I highly recommend a video by Peter Norvig from Google in 2011: https://www.youtube.com/watch?v=yvDCzhbjYWs

Oh, well. This is one way to look at universal search. But Google has silos of indexes. After 20-plus years, the system does not federate results across those indexes. Semantic search? Yeah, right. Search each index, scan results, cut and paste, and then try to figure out the dates and times. Semantic search does not do time particularly well.
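To make the complaint concrete, here is a minimal sketch, with invented silo names, documents, and dates, of what federating results across separate indexes would involve: a toy inverted index per silo and a merge step that only works if every silo agrees on what a timestamp means.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical illustration: one tiny inverted index per "silo"
# (e.g., web, news, video), plus a crude federation step.

class Silo:
    def __init__(self, name):
        self.name = name
        self.docs = {}                      # doc_id -> (text, timestamp)
        self.index = defaultdict(set)       # term -> set of doc_ids

    def add(self, doc_id, text, timestamp):
        self.docs[doc_id] = (text, timestamp)
        for term in text.lower().split():
            self.index[term].add(doc_id)

    def search(self, term):
        return [(self.name, doc_id, self.docs[doc_id][1])
                for doc_id in self.index.get(term.lower(), set())]

def federate(silos, term):
    """Merge hits from every silo and sort them on a shared timeline."""
    hits = []
    for silo in silos:
        hits.extend(silo.search(term))
    # Federation only helps if every silo normalizes "time" the same way.
    return sorted(hits, key=lambda hit: hit[2], reverse=True)

web = Silo("web")
news = Silo("news")
web.add("w1", "semantic search explained", datetime(2019, 3, 1))
news.add("n1", "search engine semantic update", datetime(2019, 11, 1))

print(federate([web, news], "semantic"))
```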

Important. Not to the SEO. Search babble may be more compelling.

If this approach is your cup of tea, inLinks has the hot water you need to understand why finding information is not what it seems.

Stephen E Arnold, November 12, 2019

IBM Watson to the Rescue of Truth: Facts? Not Necessary

November 7, 2019

“Could IBM Watson Fix Facebook’s ‘Truth Problem’?” stopped me in my daily quest for truth, justice, and the American way of technology. The write up dangles some clickbait in front of the Web indexing crawlers. Once stopped by IBM Watson, Facebook, and Truth, the indexers indexed, but I read the story.

I printed it out and grabbed my trusty yellow highlighter. I like yellow because it reminds me of an approach which combines some sensational hooks with a bit of American marketing.

For instance this passage warranted a small checkmark:

Facebook is between a rock and a hard place because “the truth” is often subjective, where what is true to one party is equally false to the other.

I like the word subjective, and I marveled at the turn of phrase in this fresh wordsmithing: “between a rock and a hard place.” Okay, a dilemma or a situation created when a company does what it can to generate revenue while fending off those who would probe into its ethical depths.

This statement warranted a yellow rectangle:

Since Facebook itself is perceived as being biased (or perhaps the news sources it hosts are), a solution from them would be suspect regardless of whether it was AI-based or, assuming such a thing was financially viable (which I doubt it is), human-driven.  But IBM may have a solution that could work here.

Yes, a hypothetical: IBM Watson, a somewhat disappointing display of the once-proud giant’s Big Blueness, is a collection of software, methods, training processes, and unfulfilled promises by avid IBM marketers. I grant that a bright person, or perhaps a legion of wizards laboring under the pressures of an academic overlord or a government COTR, possibly, maybe, ought to be able to build a system to recognize content which is “false.” Defining the truth certainly seems possible with time, money, and the “right” people. But can IBM Watson or any of today’s smart software and wizards pull off this modest task? If the solution were available, wouldn’t it be in demand, deployed, and detailed? TV programs, streaming video, tweets, and other information objects could be identified, classified, and filtered. Easy, right?

I then used my yellow marker to underline words, place a rectangle around the following text, and I added an exclamation point for good measure. Here’s the passage:

IBM also has the most advanced, scalable, deployable AI in the market with Watson. They recognized the opportunity to have an enterprise-class AI long before anyone else, and they have demonstrated human-like competence both with Jeopardy and with a debate against a live professional debater a few years ago.  I attended that debate and was impressed that Watson not only was better with the facts, it was better with humor. It lost the debate, but it was arguably the audience’s favorite.

Yes, assertions without facts, no data, no outputs, no nothing. Just “has the most advanced, scalable, deployable AI in the market.” The only hitch in this somewhat over-the-top generalization is, “It [Watson] lost the debate.”

But what warranted the exclamation mark was “it [IBM Watson] was better with humor.” Yep, smart software has a sense of humor at IBM.

This write up raises several questions. I will bring these up with my team at lunch today:

  1. Why are publications like Datamation running ads in the form of text? Perhaps, like Google Ads, a tiny label could be affixed so I can avoid blatant PR.
  2. Why is IBM insisting it has technology that “could” do something? I had a grade school teacher named Miss Bray who repeated endlessly, “Avoid woulda, coulda, shoulda.” What IBM could do is irrelevant. What IBM is doing is more important. Talking about technology is not the same as applying it and generating revenue growth, sustainable revenue, and customers who cannot stop yammering about how wonderful a product or service is. For example, I hear a great deal about Amazon. I don’t hear much about IBM.
  3. What is the “truth” in this write up? IBM Watson won Jeopardy. (TV shows do post-production.) I am not convinced that the investment IBM made in setting up Watson to “win” returned more than plain old-fashioned advertising. The reality is that the “truth” of this write up is very Facebook-like.

To sum up, clicks and PR are more important than data, verifiable case examples, and financial reports. IBM, are you listening? Right, IBM is busy in court and working to put lipstick on its financials. IBM marketers, are you listening? Right, you don’t listen, but you send invoices I assume. Datamation, are there real stories you will cover which are not recycled collateral and unsupported assertions? Right, you don’t care either, it seems. You ran this story, which darn near exhausted my yellow marker’s ink.

Stephen E Arnold, November 7, 2019

Coveo: A 15-Year-Old $1 Billion Start-Up Unicorn in Canada!

November 6, 2019

I read “Coveo Raises US$172M at $1B+ Valuation for AI-Based Enterprise Search and Personalization.” The write up states:

Search and personalization services continue to be a major area of investment among enterprises, both to make their products and services more discoverable (and used) by customers, and to help their own workers get their jobs done, with the market estimated to be worth some $100 billion annually. Today, one of the big startups building services in this area raised a large round of growth funding to continue tapping that opportunity.

Like Elastic, Algolia, and LucidWorks, Coveo is going to have to generate sufficient revenues to pay back its investors. Perhaps the early supporters have cashed out, but the new money is betting on the future.

Coveo was founded in Quebec City more than a decade ago. The desktop search company Copernic spun off Coveo in 2004. The original president was Laurent Simoneau. Mr. Tetu is an investor with great confidence in enterprise software, and he has become the “founder,” according to the write up. In April 2018, Coveo obtained about $100 million from Evergreen Coast Capital.

DarkCyber recalls that Coveo has moved from Microsoft-centric search to search as a service to customer experience and now personalization.

In 2005, I wrote this about the upsides of the Coveo approach in the Enterprise Search Report I compiled for an outfit lost to memory:

Coveo is a reasonably-priced, stable product. Any organization with Microsoft search will improve access to information with a system like Coveo’s. Microsoft SharePoint customers will want to do head-to-head comparisons with other “solutions” to Microsoft’s native search solution. Coveo has a number of features that make it a worthy contender. Other benefits of the Coveo approach include:

  • Web-based administration tool allows straightforward configuration and monitoring of the system.
  • Automatic indexing of new and updated documents in near real-time.
  • Includes linguistic and statistical technologies that can identify the key concepts and the key sentences of indexed documents. Provides automated document summaries for faster reading and filtering.
  • Groups information sources into collections for field-specific searches.
  • The product is attractively priced.
  • Tightly integrated with other Microsoft products and Windows-based security regimes.
  • Customer base has grown comparatively quite rapidly and customers tend to speak well of the product.

I noted these considerations:

The software is Windows-centric – both in terms of its own software as well as document security settings it tracks – which may be an issue with certain types of organizations. You will have to assign permissions to index to allow the ASP.NET worker process user to access the index. The task is simplified, but it can be overlooked. Administrative controls are presented without calling attention to actions that require particular attention. Coveo is still however able to search content on any operating system, application, or server. Other drawbacks of the Coveo search system include:

  • There is limited software development support to allow customization or extensions of the core technology to other applications, although the company is expanding the product’s reach through Dot Net-based APIs.
  • When the system is installed and its defaults accepted, the “Everyone” group is enabled. Administrators will want to customize this setting. A wizard would be a useful option for organizations new to enterprise search.
  • No native taxonomy support, except through partner Entrieva.
  • Achieving scalability beyond hundreds of millions of documents requires appropriate resources.

My final take on the company was:

Coveo Enterprise Search meets many distinct needs of the small and medium-sized business that has standardized on the Microsoft platform, while still providing a few critical advanced search capabilities. Perhaps more importantly, CES minimizes search training, system maintenance, and other cost “magnets” that typically accompany an enterprise search deployment.
Like a handful of other products in this report, you can test Coveo out first, via a free download of a document-limited version.

The challenge for Coveo is to make enough sales and to generate robust, sustainable income. This is the uphill run that Algolia, Elastic, LucidWorks, and probably a number of other enterprise search vendors face. Perhaps an outfit like Xerox will buy the company, which would be one way to get the investors their money back.

DarkCyber wishes Coveo the best. But a start-up unicorn? No, that is not exactly correct for a 15-year-old outfit. This push to make the investors smile is not for the faint-hearted or those who have a solid grasp of the formidable enterprise search options available today. Plus there are outfits like Diffeo and other next-generation information access systems available for free (Elasticsearch) or bundled with other sophisticated information management tools (Amazon: search, managed blockchain, workflows, and a clever approach to vendor lock-in).

One tip: Don’t visit Quebec City in February during a snow storm.

Stephen E Arnold, November 6, 2019


Search System Bayard

November 1, 2019

Looking for an open source search and retrieval tool written in Rust and built on top of Tantivy (a Lucene-like library)? Point your browser to GitHub and grab the files. The readme file highlights these features:

  • Full-text search/indexing
  • Index replication
  • Bringing up a cluster
  • Command line interface

DarkCyber has not tested it, but a journalist contacted us on October 31, 2019, and was interested in the future of search. I pointed out that there are free and open source options.

What people want to buy, however, is something that does not alienate two-thirds of the search system’s users the first day the software is deployed.

Surprised? You may not know what you don’t know, but, gentle reader, you are an exception.

Stephen E Arnold, November 1, 2019

Dumais on Search: Bell Labs Roots Are Thriving

October 23, 2019

We just love a genuine search guru, and Dr. Susan Dumais is one of the best. The illustrious Dr. Dumais is now a Microsoft Technical Fellow and Deputy Lab Director of MSR AI. If you wanted to know the history of information retrieval, she would be the one to ask, and now you can hear her tell it, courtesy of the Microsoft Research Podcast. Both the 38-minute podcast itself and a transcript are posted at “HCI, IR and the Search for Better Search with Dr. Susan Dumais.” The good doctor describes what motivates her in her work:

“I think there are two commonalities and themes in my work. One is topical. So, as you said, I’m really interested in understanding problems from a very user-centric point-of-view. I care a lot about people, their motivations, the problems they have. I also care about solving those problems with new algorithms, new techniques and so on. So, a lot of my work involves this intersection of people and technology, thinking about how work practices co-evolve with new technological developments. And so thematically, that’s an area that I really like. I like this ability to go back and forth between understanding people, how they think, how they reason, how they learn, how they find information, and finding solutions that work for them. In the end, if something doesn’t work for people, it doesn’t work. In addition to topically, I approach problems in a way that is motivated, oftentimes, by things that I find frustrating. We may talk a little bit later about my work in latent semantic indexing, but that grew out of a frustration with trying to learn the Unix operating system. Work I’ve done on email spam, grew out of a frustration in mitigating the vast amount of junk that I was getting. So, I tend to be motivated by problems that I have now, or that I anticipate that our customers, and people will have in general, given the emerging technology trends.”
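Since Dr. Dumais mentions her latent semantic indexing work, a bare-bones sketch of the core idea may help: terms and documents are mapped into a low-rank “concept” space via a truncated SVD, and queries are compared there rather than term by term. The term-document matrix, the rank, and the query below are invented for illustration, not drawn from her papers.

```python
import numpy as np

# Toy latent semantic indexing (LSI): rows are terms, columns are documents.
# The counts and the rank k are illustrative, not from any real corpus.
terms = ["unix", "shell", "kernel", "spam", "email"]
A = np.array([
    [2, 1, 0],   # unix
    [1, 1, 0],   # shell
    [1, 0, 0],   # kernel
    [0, 0, 3],   # spam
    [0, 1, 2],   # email
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                    # keep the two strongest concepts
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T   # documents in concept space

# Fold a query into the same space, then rank documents by cosine similarity.
q = np.array([1, 0, 0, 0, 1], dtype=float)   # "unix email"
q_vec = q @ U[:, :k] / s[:k]
scores = doc_vecs @ q_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
print(scores)   # higher score = closer in latent "concept" space
```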

She and host Gretchen Huizinga go on to discuss the evolution of search technology over the last twenty years, beginning with the first HTML page crawlers, when engines handled but a couple thousand queries per day. They also cover Dumais’ work over the years to build bridges, provide context in search, and bring changing content into the equation. We hope you will check out the intriguing and informative interview for yourself, dear reader.

Cynthia Murrell, October 23, 2019

Amazon: Elasticsearch Bounced and Squished

October 14, 2019

DarkCyber noted “AWS Elasticsearch: A Fundamentally-Flawed Offering.” The write up criticizes Amazon’s implementation of Elasticsearch. Amazon hired some folks from Lucidworks a few years ago. But under the covers, Lucene thrums along within Amazon and a large number of other search-and-retrieval companies, including those which present themselves as policeware. There are many reasons: [a] good enough, [b] no one company fixes the bugs, [c] good enough, [d] comparatively cheap, [e] good enough. Oh, one other point: not under the control of one company like those good, old-fashioned solutions such as STAIRS III, Fulcrum (remember that?), or Delphes (the francophone folks).

This particular write up is unlikely to earn a gold star from Amazon’s internal team. The Spun.io essay states:

I’m currently working on a large logging project that was initially implemented using AWS Elasticsearch. Having worked with large-scale mainline Elasticsearch clusters for several years, I’m absolutely stunned at how poor Amazon’s implementation is and I can’t fathom why they’re unable to fix or at least improve it.

I think the tip-off is the phrase “how poor Amazon’s implementation is…”

The section Amazon Elasticsearch Operation provides some color to make vivid the author’s viewpoint; for example:

On Amazon, if a single node in your Elasticsearch cluster runs out of space, the entire cluster stops ingesting data, full stop. Amazon’s solution to this is to have users go through a nightmare process of periodically changing the shard counts in their index templates and then reindexing their existing data into new indices, deleting the previous indices, and then reindexing the data again to the previous index name if necessary. This should be wholly unnecessary, is computationally expensive, and requires that a raw copy of the ingested data be stored along with the parsed record because the raw copy will need to be parsed again to be reindexed. Of course, this also doubles the storage required for “normal” operation on AWS. [Emphasis in the original essay.]
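For readers who have not lived through it, the workflow the author describes looks roughly like the sketch below, written against the 7.x-style Python Elasticsearch client; the index names, shard counts, and endpoint are hypothetical, and the alias step is one common way to dodge the second reindex back to the original name.

```python
from elasticsearch import Elasticsearch

# Rough sketch of the "reindex dance" the essay describes. Index names,
# shard counts, and the endpoint are invented for illustration.
es = Elasticsearch(hosts=["http://localhost:9200"])

# 1. Create a replacement index with a revised shard count.
es.indices.create(index="logs-v2",
                  body={"settings": {"number_of_shards": 6,
                                     "number_of_replicas": 1}})

# 2. Copy (and re-parse, re-analyze) every document from the old index.
es.reindex(body={"source": {"index": "logs-v1"},
                 "dest": {"index": "logs-v2"}},
           wait_for_completion=True)

# 3. Drop the old index to reclaim the doubled storage.
es.indices.delete(index="logs-v1")

# 4. Point an alias at the new index so queries keep using the old name,
#    which avoids reindexing the data a second time.
es.indices.put_alias(index="logs-v2", name="logs-v1")
```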

The wrap up for the essay is clear from this passage:

I cannot fathom how Amazon decided to ship something so broken, and how they haven’t been able to improve the situation after over two years.

DarkCyber’s team formulated several observations. Let’s look at these in the form of questions and trust that some young sprites will answer them:

  1. Will Amazon make its version of Elasticsearch proprietary?
  2. Are these changes designed to “pull” developers deeper into the AWS platform, making departure more difficult or impossible for some implementations?
  3. Are the components the author of the essay finds objectionable designed to generate more revenue for Amazon?

Stephen E Arnold, October 14, 2019

Amazon Policeware: One Possible Output

October 1, 2019

Investigations focus on entities and timelines. The context includes the legal wrapper, procedures, impressions, and similar information usually resident in the heads of investigators and their colleagues.

Why gather data unless there is a payoff? The payoff from data, in terms of Amazon’s policeware, includes these upsides:

  • Data which informs new products and services, especially signals of latent demand
  • Raw material for analytical processes such as those performed by superordinate Amazon Web Services
  • Outputs which have market magnetism; that is, the product is desirable and LE and intel customers want to buy it.

This illustration, which I have taken from my October 2, 2019, TechnoSecurity lecture and from my Amazon policeware webinar, makes three points:

First, raw data are acquired by Amazon. The sources are diverse and some are unique to Amazon; for example, individual and enterprise purchasing data.

Second, the AWS policeware platform performs normalization, indexing, and analysis on historic and real-time data flows; for example, what books an individual purchased and when.

Third, the system produces an output in the form of a profile or report about a person of interest.
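A purely illustrative sketch of that three-stage flow, with every source name, record, and person invented for the example, might look like this:

```python
from dataclasses import dataclass
from datetime import date

# Invented three-stage flow matching the diagram below:
# acquire raw records, normalize and index them, then emit a profile.

@dataclass
class RawRecord:
    source: str          # e.g., "retail", "streaming", "device"
    person: str
    item: str
    when: date

def normalize(records):
    """Stage 2: reduce diverse feeds to one entity-item-timestamp shape."""
    return [(r.person.strip().lower(), r.item, r.when) for r in records]

def build_profile(person, normalized):
    """Stage 3: a timeline-style report for one person of interest."""
    events = sorted((w, i) for p, i, w in normalized if p == person)
    return {"person": person,
            "first_seen": events[0][0] if events else None,
            "timeline": events}

feed = [RawRecord("retail", "Jane Roe", "book: chemistry primer", date(2019, 5, 2)),
        RawRecord("device", "Jane Roe", "voice query: storage units", date(2019, 6, 9))]

print(build_profile("jane roe", normalize(feed)))
```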

[Diagram: Amazon policeware data flow. © Stephen E Arnold 2019]

I know the image is difficult to read. There are two ways to address this issue. You can attend my lectures at the San Antonio conference or you can sign up for my Amazon policeware webinar.

No Epstein supporters, fans, or acquaintances should express interest in my research. Sorry. I am old-fashioned.

Stephen E Arnold, October 1, 2019

Google Search Index: Losing Relevance

September 25, 2019

Google’s search and indexing algorithms work twenty-four hours a day, seven days a week. Even as the world’s most popular and, arguably, most powerful search engine, Google encounters hiccups. Your Story shares how Google’s index went down in April 2019 and how the company addressed it in the article “Google’s Loss of Parts Of The Search Index.”

In April, Google temporarily lost parts of its search index; then, the following month, new content was not being indexed. More problems occurred in August, but Google repaired the issue. The glitches arose when Google was implementing updates that resulted in losing pieces of the provisioning systems. When the issue was reported, Google quickly fixed it again.

More problems are still popping up:

“While there were problems with the Search Index, Search Console was also affected. Because some data comes from the search index. As soon as Google had to return to a previous version of the Search Index, it also stopped updating the Search Console data foundation. That was the reason for the plateaus in the reports of some users. Thus, some users were initially confused; The reason was that Google had to postpone the Search Console update by a few days.

Other bugs on Google have sometimes been independent of Search Index issues. For example, problems with the indexing of new News Content. In addition, some URLs began to direct Googlebot to pages that were not directly related. But even these inconveniences could be resolved quickly.”

Does anyone else see the pattern here?

Whitney Grace, September 25, 2019

Questionable Journals Fake Legitimacy

September 13, 2019

The problem of shoddy or fraudulent research being published as quality work continues to grow, and it is becoming harder to tell the good from the bad. Research Stash describes “How Fake Scientific Journals Are Bypassing Detection Filters.” In recent years, regulators and the media have insisted scientific journals follow certain standards. Instead of complying, however, some of these “predatory” journals have made changes that just make them look like they have mended their ways. The write-up cites a study out of the Gandhinagar Institute of Technology in India performed by Naman Jain, a student of Professor Mayank Singh. Writer Dinesh C Sharma reports:

“The researchers took a set of journals published by Omics, which has been accused of publishing predatory journals, with those published by BMC Publishing Group. Both publish hundreds of open access journals across several disciplines. Using data-driven analysis, researchers compared parameters like impact factors, journal name, indexing in digital directories, contact information, submission process, editorial boards, gender, and geographical data, editor-author commonality, etc. Analysis of this data and comparison between the two publishers showed that Omics is slowly evolving. Of the 35 criteria listed in the Beall’s list and which could be verified using the information available online, 22 criteria are common between Omics and BMC. Five criteria are satisfied by both the publishers, while 13 are satisfied by Omics but not by BMC. The predatory publishers are changing some of their processes. For example, Omics has started its online submission portal similar to well-known publishers. Earlier, it used to accept manuscripts through email. Omics dodges most of the Beall’s criteria to emerge as a reputed publisher.”
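The checklist comparison the study describes amounts to simple set arithmetic; a toy sketch, with invented criteria and publisher assignments rather than the actual Beall’s list items or the study’s counts, looks like this:

```python
# Toy illustration of the checklist comparison described above: given which
# red-flag criteria each publisher satisfies, compute overlaps and differences.
# The criteria and assignments are invented for the example.
CRITERIA = {
    "no_online_submission_portal",
    "editor_is_also_frequent_author",
    "fake_impact_factor",
    "no_physical_address",
    "rapid_acceptance_promised",
}

publisher_a = {"editor_is_also_frequent_author", "fake_impact_factor",
               "rapid_acceptance_promised"}
publisher_b = {"fake_impact_factor"}

both = publisher_a & publisher_b          # criteria satisfied by both
only_a = publisher_a - publisher_b        # satisfied by A but not B
neither = CRITERIA - (publisher_a | publisher_b)

print(f"satisfied by both: {len(both)}, only by A: {len(only_a)}, "
      f"by neither: {len(neither)}")
```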

Jain suggests we update the criteria for identifying quality research and use more data analytics to identify false and misleading articles. He offers his findings as a starting point, and we are told he plans to present his research at a conference in November.

Cynthia Murrell, September 13, 2019
