Search: Useless Results Finally Recognized?

August 22, 2019

I cannot remember how many years ago I wrote “Search Sucks” for Barbara Quint, the late editor of Searcher. I recall her comment to me, “Finally, someone in the industry speaks out.”

Flash forward a decade. I can now repeat her comment to me with some minor updating: “Finally someone recognized by the capitalist tool, Forbes Magazine, recognizes that search sucks.”

The death of search was precipitated by several factors. Mentioning these after a decade of ignoring Web search still makes me angry. The failure of assorted commercial search vendors, the glacial movement of key trade associations, and the ineffectuality of search “experts” still make me angry.

There are other factors contributing to the sorry state of Web search today. Note: I am narrowing my focus to the “free” Web search systems. If I have the energy, I may focus on the remarkable performance of “enterprise search.” But not today.

Here are the reasons Web search fell to laughable levels of utility:

  1. Google adopted the GoTo / Overture / Yahoo approach to determining relevance. This is the pay-to-play model.
  2. Search engine optimization “experts” figured out that Google allowed some fiddling with how it determined “relevance.” Google and other ad supported search systems then suggested that those listings might decay. The fix? Buy ads.
  3. Users who were born with mobile phones and flexible fingers styled themselves “search experts” along with any other individual who obtains information by looking for “answers” in a “free” Web search system.
  4. The willful abandonment of editorial policies, yardsticks like precision and recall, and human indexing guaranteed that smart software would put the nails in the coffin of relevance. Note: artificial intelligence and super-duper automated indexing systems are right about 80 percent of the time when hammering scientific, technical, and engineering information. Toss in blog posts, tweets, and Web content created by people who skipped high school English and the accuracy plummets. Way down, folks. Just like facial recognition systems.
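For readers who have forgotten the yardsticks mentioned in item 4, here is a minimal Python sketch of precision and recall for a single query. The document ids and relevance judgments are hypothetical, made up for illustration only.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: list of document ids the search system returned
    relevant:  set of document ids a human judged relevant
    """
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    # Precision: share of returned documents that are relevant.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    # Recall: share of relevant documents that were returned.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result list and human relevance judgments.
results = ["d1", "d2", "d3", "d4"]
judged_relevant = {"d1", "d3", "d7", "d9", "d10"}
p, r = precision_recall(results, judged_relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Both numbers depend on a human deciding what counts as relevant, which is exactly the editorial step the ad supported systems walked away from.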

The information presented in “As Search Engines Increasingly Turn To AI They Are Harming Search” is astounding. Not because it is new, but because it is a reflection of what I call the Web search mentality.

Here’s an example:

Yet over the past few years, search engines of all kinds have increasingly turned to deep learning-powered categorization and recommendation algorithms to augment and slowly replace the traditional keyword search. Behavioral and interest-based personalization has further eroded the impact of keyword searches, meaning that if ten people all search for the same thing, they may all get different results. As search engines depreciate traditional raw “search” in favor of AI-assisted navigation, the concept of informational access is being harmed and our digital world is being redefined by the limitations of today’s AI.

The problem is not artificial intelligence.

Read more

Trovicor: A Slogan as an Equation

August 2, 2019

We spotted this slogan on the Trovicor Web site:

The Trovicor formula: Actionable Intelligence = f (data generation; fusion; analysis; visualization)

The function consists of four buzzwords used by vendors of policeware and intelware:

  • Data generation (which suggests metadata assigned to intercepted, scraped, or provided content objects)
  • Fusion (which means in DarkCyber’s world a single index to disparate data)
  • Analysis (numerical recipes to identify patterns or other interesting data)
  • Virtualization (use of technology to replace old school methods like 1950s’ style physical wire taps, software defined components, and software centric widgets).

The buzzwords make it easy to identify other companies providing somewhat similar services.

Trovicor maintains a low profile. But obtaining open source information about the company may be a helpful activity.

Stephen E Arnold, August 2, 2019

Need a Summary of a Web Page?

July 28, 2019

DarkCyber prefers to read articles. There are people like MBAs, engineers, and accountants who have a need for getting information fast. Minimal words, maximum optimization.


No poetry for these specialists. If you find yourself pressed for eyeball time, navigate to the Hacker Yogi. Paste your URL or text in the appropriate box and the free service will spit out a usable abstract. We tried it on some of our DarkCyber posts. Worked well. We plugged in a paywalled WSJ article, and the Hacker Yogi refused to summarize the content.

Useful for those who want a summary.

Stephen E Arnold, July 28, 2019


Need a Machine Learning Algorithm?

July 17, 2019

The Web site published “101 Machine Learning Algorithms for Data Science with Cheat Sheets.” The write up recycles information from DataScienceDojo, and some of the information looks familiar. But lists of algorithms are not original. They are useful. What sets this list apart is the inclusion of “cheat sheets.”

What’s a cheat sheet?

In this particular collection, a cheat sheet looks like this:

[Image: sample entry from the list]

You can see the entry for the algorithm: Bernoulli Naive Bayes with a definition. The “cheat sheet” is a link to a Python example. In this case, the example is a link to an explanation on the Chris Albon blog.
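To give a sense of what such an example covers, here is a minimal sketch of Bernoulli Naive Bayes in plain Python. It is not the code from the linked blog. The function names, the Laplace smoothing default, and the tiny spam/ham data set are assumptions made for illustration.

```python
import math
from collections import defaultdict


def train_bernoulli_nb(samples, labels, alpha=1.0):
    """Fit class priors and per-feature Bernoulli probabilities.

    samples: list of equal-length lists of 0/1 features
    labels:  list of class labels, one per sample
    alpha:   Laplace smoothing constant (assumed default of 1.0)
    """
    n_features = len(samples[0])
    class_counts = defaultdict(int)
    feature_on = defaultdict(lambda: [0] * n_features)
    for x, y in zip(samples, labels):
        class_counts[y] += 1
        for j, value in enumerate(x):
            feature_on[y][j] += value
    model = {}
    for y, n in class_counts.items():
        prior = n / len(samples)
        # Smoothed estimate of P(feature j = 1 | class y).
        probs = [(feature_on[y][j] + alpha) / (n + 2 * alpha)
                 for j in range(n_features)]
        model[y] = (prior, probs)
    return model


def predict_bernoulli_nb(model, x):
    """Return the class with the highest log posterior for binary vector x."""
    best_label, best_score = None, -math.inf
    for y, (prior, probs) in model.items():
        score = math.log(prior)
        for value, p in zip(x, probs):
            # Absent features count too: they contribute log(1 - p).
            score += math.log(p if value else 1.0 - p)
        if score > best_score:
            best_label, best_score = y, score
    return best_label


# Hypothetical features: [contains "free", contains "meeting"].
X = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 1]]
y = ["spam", "spam", "ham", "ham", "spam"]
model = train_bernoulli_nb(X, y)
print(predict_bernoulli_nb(model, [1, 0]))
print(predict_bernoulli_nb(model, [0, 1]))
```

The point of the Bernoulli variant is that a feature being absent is evidence too, which is why the loop scores zeros as well as ones.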

What’s interesting is that the 101 algorithms are grouped under 18 categories. Of these 18, Bayes and derivative methods total five.

No big deal, but in my lectures about widely used algorithms I highlight 10, mostly because it is a nice round number. The point is that most of the analytics vendors use the same basic algorithms. Variations among products built on these algorithms are significant.

As analytics systems become more modular (that is, like Lego blocks), it seems that the trajectory of development will be to select, preconfigure thresholds, and streamline processes in a black box.

Is this good or bad?

It depends on whether one’s black box is a dominant solution or platform.

Will users know that this almost inevitable narrowing has upsides and downsides?


Stephen E Arnold, July 17, 2019

New Jargon: Consultants, Start Your Engines

July 13, 2019

I read “What Is ‘Cognitive Linguistics’?” The article appeared in Psychology Today. Disclaimer: I did some work for this outfit a long time ago. Anybody remember Charles Tillinghast, “CRM” when it referred to people, not a baloney discipline for a Rolodex filled with sales leads, and the use of Psychology Today as a text in a couple of universities? Yeah, I thought not. The Ziff connection is probably lost in the smudges of thumb typing too.

Onward: The write up explains a new spin on psychology, linguistics, and digital interaction. The jargon for this discipline or practice, if you will, is:

Cognitive Linguistics

I must assume that the editorial processes at today’s Psychology Today are genetically linked to the procedures in use in — what was it, 1972? — but who knows.

Here’s the definition:

The cognitive linguistics enterprise is characterized by two key commitments. These are:
i) the Generalization Commitment: a commitment to the characterization of general principles that are responsible for all aspects of human language, and
ii) the Cognitive Commitment: a commitment to providing a characterization of general principles for language that accords with what is known about the mind and brain from other disciplines. As these commitments are what imbue cognitive linguistics with its distinctive character, and differentiate it from formal linguistics.

If you are into psychology and figuring out how to manipulate people or a Google ranking, perhaps this is the intellectual gold worth more than stolen treasure from Montezuma.

Several observations:

  1. I eagerly await an estimate from IDC for the size of the cognitive linguistics market, and I am panting with anticipation for a Gartner magic quadrant which positions companies as leaders, followers, outfits which did not pay for coverage, and names found with a Google search at Starbucks south of the old PanAm Building. Cognitive linguistics will have to wait until the two giants of expertise figure out how to define “personal computer market”, however.
  2. A series of posts from Dave Amerland and assorted wizards at SEO blogs which explain how to use the magic of cognitive linguistics to make a blog page — regardless of content, value, and coherence — number one for a Google query.
  3. A how to book from Wiley publishing called “Cognitive Linguistics for Dummies” with online reference material which may or may not actually be available via the link in the printed book.
  4. A series of conferences run by assorted “instant conference” organizers with titles like “The Cognitive Linguistics Summit” or “Cognitive Linguistics: Global Impact”.

So many opportunities. Be still, my heart.

Cognitive linguistics: its time has come. Not a minute too soon for a couple of floundering enterprise search vendors to snag the buzzword and pivot to implementing cognitive linguistics for solving “all your information needs.” Which search company will embrace this technology: Coveo, IBM Watson, Sinequa?

DarkCyber is excited.

Stephen E Arnold, July 13, 2019

Sentiment Analysis: Can a Monkey Do It?

June 27, 2019

Sentiment analysis is a machine learning tool companies are employing to understand how their customers feel about their services and products. It is mainly deployed on social media platforms, including Facebook, Instagram, and Twitter. The Monkey Learn blog details how sentiment analysis is specifically being used on Twitter in the post, “Sentiment Analysis Of Twitter.”

Using sentiment analysis is not a new phenomenon, but there are still individuals unaware of the possible power at their fingertips. Monkey Learn specializes in customer machine learning solutions that include intent, keywords, and, of course, sentiment analysis. The post is a guide on the basics of sentiment analysis: what it is, how it works, and real life examples. Monkey Learn defines sentiment analysis as:

“Sentiment analysis (a.k.a opinion mining) is the automated process of identifying and extracting the subjective information that underlies a text. This can be either an opinion, a judgment, or a feeling about a particular topic or subject. The most common type of sentiment analysis is called ‘polarity detection’ and consists in classifying a statement as ‘positive’, ‘negative’ or ‘neutral’.”

It also relies on natural language processing (NLP) to understand the information’s context.
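Polarity detection can be illustrated with a toy word-counting classifier in Python. This is a deliberately simple stand-in, not Monkey Learn's method, which relies on trained machine learning models. The word lists and the function name are hypothetical.

```python
import re

# Hypothetical mini-lexicons; a real system would learn weights from labeled data.
POSITIVE = {"love", "great", "good", "happy", "excellent"}
NEGATIVE = {"hate", "terrible", "bad", "angry", "awful"}


def polarity(text):
    """Classify text as 'positive', 'negative', or 'neutral' by word counts."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for tweet in ["I love this great phone",
              "Terrible service, I hate it",
              "The package arrived on Tuesday"]:
    print(tweet, "->", polarity(tweet))
```

A counter like this has no notion of context, sarcasm, or negation ("not good" scores as positive), which is where the NLP mentioned above is supposed to earn its keep.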

Monkey Learn explains that sentiment analysis is important because most of the world’s digital data is unstructured. Machine learning with NLP’s assistance can quickly sort large data sets and detect their polarity. Monkey Learn promises with their sentiment analysis to bring their customers scalability, consistent criteria, and real-time analysis. Many companies are using Twitter sentiment analysis for customer service, brand monitoring, market research, and political campaigns.

The article is basically a promotional piece for Monkey Learn, but it does work as a starting guide for sentiment analysis.

Whitney Grace, June 27, 2019

How Smart Software Goes Off the Rails

June 23, 2019

Navigate to “How Feature Extraction Can Be Improved With Denoising.” The write up seems like a straightforward analytics explanation. Lots of jargon, buzzwords, and hippy dippy references to length squared sampling in matrices. The concept is not defined in the article. And if you remember statistics 101, you know that there are five types of sampling: convenience, cluster, random, systematic, and stratified. Each has its strengths and weaknesses. How does one avoid the issues? Use length squared sampling obviously: Just sample rows with probability proportional to the square of their Euclidean norms. Got it?

However, the math is not the problem. Math is a method. The glitch is in defining “noise.” Like love, noise can be defined in many ways. The write up points out:
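For those who did not get it, length squared sampling fits in a few lines of Python. This is a minimal sketch using only the standard library. The function name and the example matrix are hypothetical and do not come from the article.

```python
import random


def length_squared_sample(matrix, k, seed=None):
    """Sample k row indices with probability proportional to squared Euclidean norm.

    Returns (indices, probs), where probs[i] is the chance of drawing row i
    on any single draw. Sampling is with replacement.
    """
    rng = random.Random(seed)
    # Squared Euclidean norm of each row.
    weights = [sum(v * v for v in row) for row in matrix]
    total = sum(weights)
    if total == 0:
        raise ValueError("all rows are zero; sampling probabilities are undefined")
    probs = [w / total for w in weights]
    indices = rng.choices(range(len(matrix)), weights=weights, k=k)
    return indices, probs


# Hypothetical matrix: one long row, one zero row, one short row.
A = [[3, 4], [0, 0], [1, 0]]
indices, probs = length_squared_sample(A, k=10, seed=42)
print(probs)
print(indices)
```

Rows with large norms dominate the sample and zero rows never appear, which is the whole idea: the rows that carry most of the matrix's "energy" get picked most often.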

Autoencoders with more hidden layers than inputs run the risk of learning the identity function – where the output simply equals the input – thereby becoming useless. In order to overcome this, Denoising Autoencoders(DAE) was developed. In this technique, the input is randomly induced by noise. This will force the autoencoder to reconstruct the input or denoise. Denoising is recommended as a training criterion for learning to extract useful features that will constitute a better higher level representation.
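The training setup the passage describes can be sketched in Python. Only the corruption step and the reconstruction target are shown; the encoder and decoder network is omitted. Masking noise (zeroing random components) is one common corruption choice and is an assumption here, as are the function names and the sample data.

```python
import random


def corrupt(x, drop_prob, rng):
    """Masking noise: zero each component independently with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else v for v in x]


def denoising_pairs(data, drop_prob=0.3, seed=0):
    """Build (noisy_input, clean_target) pairs for a denoising objective."""
    rng = random.Random(seed)
    return [(corrupt(x, drop_prob, rng), list(x)) for x in data]


def reconstruction_error(output, target):
    """Mean squared error against the clean target, not the noisy input."""
    return sum((o - t) ** 2 for o, t in zip(output, target)) / len(target)


# Hypothetical training vectors.
data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
for noisy, clean in denoising_pairs(data):
    # A network would map `noisy` to an output scored against `clean`.
    print(noisy, "->", clean)
```

Because the network sees the noisy vector but is graded against the clean one, copying the input no longer minimizes the error, which is how the identity function gets ruled out. Note that the choice of noise model is baked into everything the system learns.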

Can you spot the flaw in the approach? Consider what happens if the training set is skewed for some reason. The system will learn based on the inputs smoothed by statistical sanding. When the system encounters real world data, the system will, by golly, convert the “real” inputs in terms of the flawed denoising method. As one wit observed, “So s?c^2 p gives us a better estimation than the zero matrix.” Yep.

To sum up, the system just generates “drifting” outputs. The fix? Retraining. This is expensive and time consuming. Not good when the method is applied to real time flows of data.

In a more colloquial turn of phrase, the denoiser may not be denoising correctly.

As more complex numerical recipes are embedded in “smart” systems, there will be some interesting consequences. Does the phrase “chain of failure” ring a bell? What about “good enough”?

Stephen E Arnold, June 23, 2019

Owlin Pivots, Attracts Funding

June 21, 2019

Financial-tech startup Owlin is bound to be celebrating—TechCrunch announces, “Owlin, the Text and News Analytics Platform for Financial Institutions, Raises $3.5M Series A.” This is especially good news, considering the company lost ground when its original backer went bankrupt; that twist cost the company two founders, we’re told. Now, though, Velocity Capital is leading this round of funding. Writer Steve O’Hear reports:

“The fundraise follows the fintech company’s pivot from a real-time news alert service to a more comprehensive ‘AI-based’ text and news analytics platform to help financial institutions assess risk. … This is seeing Owlin enable 15,000 counter-party risk managers worldwide to track risk events that are not captured by traditional credit risk metrics. ‘We are adding news and unstructured data to their risk monitoring. In the end, our clients don’t just gain insights, they also gain time,’ adds the Owlin CEO.”

Apparently, the platform is unusually successful at augmenting certain types of data, making for more accurate risk models. Regulators love that, we’re reminded. Founded in 2012, Owlin is based in Amsterdam. Some of the company’s global clients are Deutsche Bank, ING, Fitch Ratings, Adyen, and KPMG.

Cynthia Murrell, June 21, 2019

Grammar Rules Help Algorithms Grasp Language

June 20, 2019

Researchers at several universities have teamed up with IBM to teach algorithms some subtleties of language. VentureBeat reports, “IBM, MIT, and Harvard’s AI Uses Grammar Rules to Catch Linguistic Nuances of U.S. English.” Writer Kyle Wiggers links to the two resulting research papers, noting the research was to be presented at the recent North American Chapter of the Association for Computational Linguistics conference. We learn:

“The IBM team, along with scientists from MIT, Harvard, the University of California, Carnegie Mellon University, and Kyoto University, devised a tool set to suss out grammar-aware AI models’ linguistic prowess. As the coauthors explain, one model in question was trained on a sentence structure called recurrent neural network grammars, or RNNGs, that imbued it with basic grammar knowledge. The RNNG model and similar models with little-to-no grammar training were fed sentences with good, bad, or ambiguous syntax. The AI systems assigned probabilities to each word, such that in grammatically ‘off’ sentences, low-probability words appeared in the place of high-probability words. These were used to measure surprisal[sic]. The coauthors found that the RNNG system consistently performed better than systems trained on little-to-no grammar using a fraction of the data, and that it could comprehend ‘fairly sophisticated’ rules.”
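Surprisal, the measure named in the quotation, is the negative logarithm of the probability a model assigns to a word. Here is a minimal Python sketch. The probabilities are hypothetical values chosen for illustration, not figures from the research papers.

```python
import math


def surprisal_bits(probability):
    """Surprisal of an event with the given probability, in bits."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


# Hypothetical probabilities a language model might assign to the next word.
expected_word = surprisal_bits(0.5)      # a word the model half expects
unexpected_word = surprisal_bits(0.125)  # a word that breaks the grammar
print(expected_word, unexpected_word)
```

The less probable the word, the higher the surprisal, so a spike at the point where a sentence goes grammatically “off” suggests the model has picked up the rule being violated.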

See the write-up for a few details about those rules, or check out the research papers for more information (links above). This is but a start for their model, the team cautions, for the work must be validated on larger data sets. Still, they believe, this project represents a noteworthy milestone.

Cynthia Murrell, June 20, 2019

Firefox Translation Add In

May 17, 2019

The DarkCyber team encounters information in a number of languages. For years, we relied on Google Translate, but the limits on document size proved an annoyance. has been more useful. We have an older installation of some Systran modules.

DarkCyber learned that Firefox has returned to the “translate now” territory with Translate Man. You can get an overview of the functionality of the add in in “Translate anything instantly in Firefox with Translate Man.” Translate Man uses Google’s API.

We haven’t tested the functionality of the add in in an extensive way. It did translate words and short passages in a helpful way.

The write up identifies useful features that the add in delivers. Two are a translate on hover feature and a pronunciation function so you can “hear” the word or passage.

In our experience, some text requires a native speaker of the language to translate with accuracy.

Google has introduced its wonderfully named Translatotron. You can read about that innovation in “Google Unveils Translatotron, Its Speech-to-Speech Translation System.”

Now about these systems’ ability to translate the argot of insiders involved in “interesting” work in North Korea or Iran? What about making sense of emojis in clear text messages?

Someday perhaps.

Stephen E Arnold, May 17, 2019
