Gender Bias in Old Books. Rewrite Them?

October 9, 2019

Here is an interesting use of machine learning. Salon tells us “What Reading 3.5 Million Books Tells Us About Gender Stereotypes.” Researchers led by University of Copenhagen’s Dr. Isabelle Augenstein analyzed 11 billion English words in literature published between 1900 and 2008. Not surprisingly, the results show that adjectives about appearance were most often applied to women (“beautiful” and “sexy” top the list), while men were more likely to be described by character traits (“righteous,” “rational,” and “brave” were most frequent). Writer Nicole Karlis describes how the team approached the analysis:

“Using machine learning, the researchers extracted adjectives and verbs connected to gender-specific nouns, like ‘daughter.’ Then the researchers analyzed whether the words had a positive, negative or neutral point of view. The analysis determined that negative verbs associated with appearance are used five times more for women than men. Likewise, positive and neutral adjectives relating to one’s body appearance occur twice as often in descriptions of women. The adjectives used to describe men in literature are more frequently ones that describe behavior and personal qualities.

“Researchers noted that, despite the fact that many of the analyzed books were published decades ago, they still play an active role in fomenting gender discrimination, particularly when it comes to machine learning sorting in a professional setting. ‘The algorithms work to identify patterns, and whenever one is observed, it is perceived that something is “true.” If any of these patterns refer to biased language, the result will also be biased,’ Augenstein said. ‘The systems adopt, so to speak, the language that we people use, and thus, our gender stereotypes and prejudices.’ Augenstein explained this can be problematic if, for example, machine learning is used to sift through employee recommendations for a promotion.”
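The extraction-and-scoring pipeline Karlis describes can be sketched in a few lines. A minimal sketch, assuming tiny invented lexicons and a toy sentence; the actual study used dependency parsing and trained sentiment models over 11 billion words, not bigram matching:

```python
# Hypothetical sketch: pair adjectives with the gender-specific nouns
# they precede, then tally polarity per gender. The lexicons and the
# sample text are invented for illustration only.
import re
from collections import Counter

FEMALE_NOUNS = {"daughter", "woman", "mother", "she"}
MALE_NOUNS = {"son", "man", "father", "he"}
POLARITY = {"beautiful": "positive", "sexy": "positive",
            "brave": "positive", "rational": "positive",
            "ugly": "negative", "weak": "negative"}

def extract_pairs(text):
    """Return (adjective, gender) pairs for 'ADJ noun' bigrams."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pairs = []
    for adj, noun in zip(tokens, tokens[1:]):
        if adj in POLARITY:
            if noun in FEMALE_NOUNS:
                pairs.append((adj, "female"))
            elif noun in MALE_NOUNS:
                pairs.append((adj, "male"))
    return pairs

def polarity_counts(text):
    """Tally (gender, polarity) combinations found in the text."""
    counts = Counter()
    for adj, gender in extract_pairs(text):
        counts[(gender, POLARITY[adj])] += 1
    return counts

corpus = "The beautiful daughter met a brave man. A rational father spoke."
print(polarity_counts(corpus))
```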

Karlis does list some caveats to the study—it does not factor in who wrote the passages, what genre they were pulled from, or how much gender bias permeated society at the time. The research does affirm previous results, like the 2011 study that found 57% of central characters in children’s books are male.

Dr. Augenstein hopes her team’s analysis will raise awareness about the impact of gendered language and stereotypes on machine learning. If they choose, developers can train their algorithms on less biased materials or program them to either ignore or correct for biased language.

Cynthia Murrell, October 9, 2019

Memos: Mac Search Tool for Images

October 3, 2019

You need a Mac. You need photos with text. You need Memos. (Apple account may be required to snag the software.) The software identifies text in images and extracts it. Enter a query and the software displays the source image.

You can take a pix of text and Memos will OCR it. You can search across images for in-photo text.
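How might that search work under the hood? A minimal sketch, assuming an inverted index from OCR output to image files. The file names and OCR results are hard-coded stand-ins; Memos’ actual implementation is unknown to us, and a real app would call an OCR engine per photo:

```python
# Hypothetical sketch: map words extracted by OCR back to the images
# they came from, then answer queries by intersecting word postings.
from collections import defaultdict

def build_index(ocr_results):
    """ocr_results: {image_path: extracted_text} -> {word: set of paths}"""
    index = defaultdict(set)
    for path, text in ocr_results.items():
        for word in text.lower().split():
            index[word].add(path)
    return index

def search(index, query):
    """Return the images containing every word of the query."""
    words = query.lower().split()
    hits = [index.get(w, set()) for w in words]
    return set.intersection(*hits) if hits else set()

# Invented sample data standing in for real OCR output.
photos = {"IMG_001.jpg": "Meeting notes budget Q3",
          "IMG_002.jpg": "Grocery list milk eggs",
          "IMG_003.jpg": "Budget review notes"}
idx = build_index(photos)
print(search(idx, "budget notes"))
```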

Cost? About $5.

On the App Store. Use the link on the Memos Web site. When we searched the App Store directly, Memos was not findable.

Does it work? Still some bugs if the user comments are on point.

Stephen E Arnold, October 3, 2019

Palantir Technologies: Fund Raising Signal

September 6, 2019

Palantir Technologies offers products and services which serve analysts and investigators. The company was founded in 2003, and it gained some traction in a number of US government agencies. The last time I checked for Palantir’s total funding, my recollection is that the firm has ingested about $2 billion from a couple dozen funding rounds. If you subscribe to Crunchbase, you can view that service’s funding roundup. An outfit known as Growjo reports that Palantir has 2,262 employees. That works out to a cash intake of $884,173 per employee. Palantir is a secretive outfit, so who knows about the funding, the revenue, the profits or losses, and the number of full time equivalents, contractors, etc. But Palantir is one of the highest profile companies in the law enforcement, regulatory, and intelligence sectors.

I read “Palantir to Seek Funding on Private Market, Delay IPO” and noted this statement:

The company has never turned an annual profit.

Bloomberg points out that customization of the system is expensive. Automation is a priority. Sales cycles are lengthy. And some stakeholders and investors are critical of the company.

Understandable. After 16 years and allegedly zero profits, annoyance is likely to surface in the NYAC after an intense game of squash.

But I am not interested in Palantir. The information about Palantir strikes me as germane to the dozens upon dozens of Palantir competitors. Consider these questions:

  1. Intelligence, like enterprise search, requires software and services that meet the needs of users who have quite particular work processes. Why pay lots of money to customize something that will have to be changed when a surprise event tips over established procedures? Roll your own? Look for the lowest cost solution?
  2. With so many competitors, how will government agencies be able to invest in a wide range of solutions? Why not seek a single source solution and find ways to escape from the costs of procuring, acquiring, tuning, training, and changing systems? If Palantir were the home run, why haven’t Palantir customers convinced their peers and superiors to back one solution? That hasn’t happened, which makes an interesting statement in itself. Why isn’t Palantir the US government-wide solution the way Oracle was a few years ago?
  3. Are the systems outputting useful, actionable information? Users of these systems who give talks at LE and intel conferences are generally quite positive. But the reality is that cyber problems remain and have not been inhibited by Palantir and similar tools or the raft of cyber intelligence innovations from companies in the UK, Germany, Israel, and China. What’s the problem? Staff turnover, complexity, training cost, reliability of outputs?

Net net: Palantir’s needing money is an interesting signal. Stealth, secrecy, good customer support, and impressive visuals of networks of bad actors — important. But maybe — just maybe — the systems are ultimately not working as advertised. Sustainable revenues, eager investors, and a home run product equivalent to Facebook or Netflix — nowhere to be found. Yellow lights are flashing in DarkCyber’s office for some intelware vendors.

Stephen E Arnold, September 6, 2019

Can a Well Worn Compass Help Enterprise Search Thrive?

September 4, 2019

In the early 1990s, Scotland Yard (which never existed although there is a New Scotland Yard) wanted a way to make sense of the data available to investigators in the law enforcement sector.

A start up in Cambridge, England, landed a contract. To cut a multi-year story short, i2 Ltd. created Analyst’s Notebook. The product is now more than a quarter century old, and the Analyst’s Notebook is owned by IBM. In the span of five or six years, specialist vendors reacted to the Analyst’s Notebook functionalities. Even though the early versions were clunky, the software performed some functions that may be familiar to anyone who has tried to locate, analyze, and make sense of data within an organization. I am using “organization” in a broad sense, not just UK law enforcement, regulatory enforcement, and intelligence entities.

What were some of the key functions of Analyst’s Notebook, a product which most people in the search game know little about? Let me highlight a handful, and then flash forward to what enterprise search vendors are trying to pull off in an environment which is very different from what the i2 experts tackled 25 years ago. Hint: Focus was the key to Analyst’s Notebook’s success and to the me-too products which are widely available to LE and intel professionals. Enterprise search lacks this singular advantage, and, as a result, is likely to flounder as it has for decades.

The Analyst’s Notebook delivered:

  • Machine assistance to investigators implemented in software which generally followed established UK police procedures. Forget the AI stuff. The investigator or a team of investigators focused on a case provided most of the brain power.
  • Software which could identify entities. An entity is a person, place, thing, phone number, credit card, event, or similar indexable item.
  • Once identified, the software — influenced by the Cambridge curriculum in physics — could display a relationship “map” or what today looks like a social graph.
  • Visual cues allowed investigators to see that two people who exchanged many phone calls were connected. To make the relationship explicit, a heavy dark line connected the two phone callers.
  • Ability to print these relationship maps, and other items of interest, on a big sheet of paper: items either identified by an investigator or surfaced using maths which could flag an entity within a cluster or an anomaly along with its date and time.
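The call-volume idea above reduces to a small amount of code. A hedged sketch, with invented call records and an arbitrary threshold standing in for i2’s heavy dark line:

```python
# Hypothetical sketch of a link chart: count calls between phone
# numbers and flag heavily connected pairs. The records and the
# threshold below are illustrative, not real investigative data.
from collections import Counter

def build_link_chart(call_records):
    """call_records: list of (caller, callee) -> Counter of edges."""
    edges = Counter()
    for a, b in call_records:
        edges[tuple(sorted((a, b)))] += 1  # undirected edge
    return edges

def heavy_links(edges, threshold=3):
    """Pairs whose call volume meets the threshold (the 'dark line')."""
    return {pair: n for pair, n in edges.items() if n >= threshold}

calls = [("555-0001", "555-0002")] * 4 + [("555-0001", "555-0003")]
chart = build_link_chart(calls)
print(heavy_links(chart))
```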

Over the years, other functions were added. Today’s version offers a range of advanced functions that make it easy to share data, collaborate, acquire and add to the investigative teams’ content store (on premises, hybrid, or in the cloud), automate some functions using IBM technology (no, I won’t use the Watson word), and manage workflow. Imagery is supported. Drill down makes it easy to see “where the data came from.” An auditor can retrace an investigator’s actions in order to verify a process. If you want more about i2, just run a Bing, Google, or Yandex query.

Why am I writing about decades old software?

The reason is that I read an item from my files as my team was updating my comments about Amazon’s policeware for the October TechnoSecurity & Digital Forensics Conference. The item I viewed is titled “Thomson Reuters Partners with Squirro to Combine Artificial Intelligence Technology and Data to Unlock Customer Intelligence.” I had written about Squirro in “Will Cognitive Search (Whatever That Is) Change Because of Squirro?”

I took a look at the current Squirro Web site and learned that the company is the leader in “context intelligence.” That seemed similar to what i2 delivered in the 1990s version of Analyst’s Notebook. The software was designed to fit the context of a specific country’s principal police investigators. No marketing functions, no legal information, no engineering product data — just case related information like telephone records, credit card receipts, officer reports, arrest data, etc.

Squirro, founded in 2012 or 2013 (there are conflicting dates online), states that the software delivers

a personalized, real-time contextual stream from the sea of information directly to your workplace. It’s based on Squirro’s digital fingerprint technology connecting personal interests and workflows while learning and refining as user interactions increase.

I also noted this statement:

Squirro combines all the different tools you need to work with unstructured data and enables you to curate a self-learning 360° context radar natural to use in any enterprise system. ‘So What?’ Achieving this reduces searching time by 90%, significantly cutting costs and allows for better, more effective decision-making. The highly skilled Swiss team of search experts has been working together for over 10 years to create a precise context intelligence solution. Squirro: Your Data in Context.

Well, 2013 to the present is six years, seven if I accept the 2012 date.

The company states that it offers “A.I.-driven actionable Insights,” adding:

Squirro is a leading AI-platform – a self-learning system keeping you in the know and recommending what’s next.

I’m okay with marketing lingo. But to my way of thinking, Squirro is edging toward the i2 Analyst’s Notebook type of functionality. The difference is that Squirro wants to serve the enterprise. Yep, enterprise search with wrappers for smart software, reports, etc.

I don’t want to make a big deal of this similarity, but there is one important point to keep in mind. Delivering an enterprise solution to a commercial outfit means that different sectors of the business will have different needs. The different needs manifest themselves in workflows and data particular to their roles in the organization. Furthermore, most commercial employees are not trained like police and intelligence operatives; that is, employees looking for information have diverse backgrounds and different educational experiences. For better or worse, law enforcement intelligence professionals go to some type of training. In the US, the job is handled by numerous entities, but a touchstone is FLETC. Each country has its equivalent. Therefore, there is a shared base of information, a shared context if you will.

Modern companies are a bit like snowflakes. There’s a difference, however: the snowflakes may no longer work together in person. In fact, interactions are intermediated in numerous ways. This is not a negative, but it is somewhat different from how a team of investigators worked on a case in London in the 1990s.

What is the “search” inside the Squirro information retrieval system? The answer is open source search. The features are implemented via software add-ons, wrappers, and microservices, plus other 2019 methods.

This is neither good nor bad. Using open source reduces some costs. On the other hand, the resulting system will have a number of moving parts. As complexity grows with new features, some unexpected events will occur. These have to be chased down and fixed.

New features and functions can be snapped in. The trajectory of this modern approach is to create a system which offers many marketing hooks and opportunities to make a sale to an organization looking for a solution to the ever present “information problem.”

My hypothesis is that i2 Analyst’s Notebook succeeded as an information access, analysis, and reporting system because it focused on solving a rather specific use case. A modern system, such as a search and retrieval solution that tries to solve multiple problems, is likely to hit a wall.

The digital wall is the same one that pushed Fast Search & Transfer and many other enterprise search systems to the sidelines or the scrap heap.

Net net: Focus, not jargon, may be valuable, not just for Squirro, but for other enterprise search vendors trying to attain sustainable revenues and a way to keep their sources of funding, their customers, their employees, and their stakeholders happy.

Stephen E Arnold, September 4, 2019

Search: Useless Results Finally Recognized?

August 22, 2019

I cannot remember how many years ago it was since I wrote “Search Sucks” for Barbara Quint, the late editor of Searcher. I recall her comment to me, “Finally, someone in the industry speaks out.”

Flash forward a decade. I can now repeat her comment to me with some minor updating: “Finally someone recognized by the capitalist tool, Forbes Magazine, recognizes that search sucks.”

The death of search was precipitated by several factors: the failure of assorted commercial search vendors, the glacial movement of key trade associations, and the ineffectuality of search “experts.” Mentioning these after a decade of ignoring Web search still makes me angry.


There are other factors contributing to the sorry state of Web search today. Note: I am narrowing my focus to the “free” Web search systems. If I have the energy, I may focus on the remarkable performance of “enterprise search.” But not today.

Here are the reasons Web search fell to laughable levels of utility:

  1. Google adopted the GoTo / Overture / Yahoo approach to determining relevance. This is the pay-to-play model.
  2. Search engine optimization “experts” figured out that Google allowed some fiddling with how it determined “relevance.” Google and other ad supported search systems then suggested that those listings might decay. The fix? Buy ads.
  3. Users who were born with mobile phones and flexible fingers styled themselves “search experts” along with any other individual who obtains information by looking for “answers” in a “free” Web search system.
  4. The willful abandonment of editorial policies, yardsticks like precision and recall, and human indexing guaranteed that smart software would put the nails in the coffin of relevance. Note: artificial intelligence and super-duper automated indexing systems are right about 80 percent of the time when hammering scientific, technical, and engineering information. Toss in blog posts, tweets, and Web content created by people who skipped high school English and the accuracy plummets. Way down, folks. Just like facial recognition systems.
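The yardsticks named in point four are simple to state in code, which makes their abandonment all the more striking. A minimal sketch with toy result sets:

```python
# Precision and recall, the classic relevance yardsticks. The
# retrieved/relevant document sets below are invented toy data.
def precision(retrieved, relevant):
    """Fraction of retrieved results that are actually relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

retrieved = {"doc1", "doc2", "doc3", "doc4"}
relevant = {"doc2", "doc4", "doc5"}
print(precision(retrieved, relevant), recall(retrieved, relevant))
```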

The information presented in “As Search Engines Increasingly Turn To AI They Are Harming Search” is astounding. Not because it is new, but because it is a reflection of what I call the Web search mentality.

Here’s an example:

Yet over the past few years, search engines of all kinds have increasingly turned to deep learning-powered categorization and recommendation algorithms to augment and slowly replace the traditional keyword search. Behavioral and interest-based personalization has further eroded the impact of keyword searches, meaning that if ten people all search for the same thing, they may all get different results. As search engines depreciate traditional raw “search” in favor of AI-assisted navigation, the concept of informational access is being harmed and our digital world is being redefined by the limitations of today’s AI.

The problem is not artificial intelligence.

Read more

Trovicor: A Slogan as an Equation

August 2, 2019

We spotted this slogan on the Trovicor Web site:

The Trovicor formula: Actionable Intelligence = f (data generation; fusion; analysis; visualization)

The function consists of four buzzwords used by vendors of policeware and intelware:

  • Data generation (which suggests metadata assigned to intercepted, scraped, or provided content objects)
  • Fusion (which means in DarkCyber’s world a single index to disparate data)
  • Analysis (numerical recipes to identify patterns or other interesting data)
  • Visualization (graphic presentation of the analyzed data, for example link charts and timelines, so patterns are easy to spot).
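Read as code, the slogan’s f(data generation; fusion; analysis; visualization) is a four-stage pipeline. The sketch below is entirely our own invention: stub functions illustrating the composition, not Trovicor’s system.

```python
# Hypothetical pipeline matching the four buzzwords in the slogan.
# Every function and the sample data are invented for illustration.
def generate(raw_items):
    """Data generation: attach metadata to content objects."""
    return [{"source": s, "text": t} for s, t in raw_items]

def fuse(records):
    """Fusion: a single index over disparate data."""
    return {i: r for i, r in enumerate(records)}

def analyze(index):
    """Analysis: a trivial 'pattern' - sources seen more than once."""
    counts = {}
    for r in index.values():
        counts[r["source"]] = counts.get(r["source"], 0) + 1
    return {s: n for s, n in counts.items() if n > 1}

def visualize(patterns):
    """Visualization: render findings as text lines."""
    return [f"{s}: {n} items" for s, n in sorted(patterns.items())]

def actionable_intelligence(raw_items):
    """The whole formula as function composition."""
    return visualize(analyze(fuse(generate(raw_items))))

data = [("intercept", "call A"), ("intercept", "call B"), ("scrape", "post")]
print(actionable_intelligence(data))
```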

The buzzwords make it easy to identify other companies providing somewhat similar services.

Trovicor maintains a low profile. But obtaining open source information about the company may be a helpful activity.

Stephen E Arnold, August 2, 2019

Need a Summary of a Web Page?

July 28, 2019

DarkCyber prefers to read articles. There are people like MBAs, engineers, and accountants who have a need for getting information fast. Minimal words, maximum optimization.


No poetry for these specialists. If you find yourself pressed for eyeball time, navigate to the Hacker Yogi. Paste your URL or text in the appropriate box and the free service will spit out a usable abstract. We tried it on some of our DarkCyber posts. Worked well. We plugged in a paywalled WSJ article, and the Hacker Yogi refused to summarize the content.
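The Hacker Yogi itself is a black box. For reference, many such tools start from classic frequency-based extractive summarization, sketched here; the function names, stopword list, and sample text are ours, not the service’s.

```python
# Hypothetical sketch of extractive summarization: score sentences
# by content-word frequency and keep the top n in original order.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "of", "to", "and", "in", "it", "by"}

def summarize(text, n_sentences=1):
    """Return the n highest-scoring sentences from the text."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOPWORDS]
    freq = Counter(words)  # stopwords score zero via Counter's default
    def score(sent):
        return sum(freq[w] for w in re.findall(r"[a-z']+", sent.lower()))
    ranked = sorted(sentences, key=score, reverse=True)
    top = set(ranked[:n_sentences])
    return " ".join(s for s in sentences if s in top)

text = ("Search engines index documents. Search engines rank documents "
        "by relevance. Cats sleep a lot.")
print(summarize(text))
```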

Useful for those who want a summary.

Stephen E Arnold, July 28, 2019


Need a Machine Learning Algorithm?

July 17, 2019


The Web site published “101 Machine Learning Algorithms for Data Science with Cheat Sheets.” The write up recycles information from DataScienceDojo, and some of the information looks familiar. But lists of algorithms are not original. They are useful. What sets this list apart is the inclusion of “cheat sheets.”

What’s a cheat sheet?

In this particular collection, a cheat sheet looks like this:

[Screenshot: a sample cheat sheet entry]

You can see the entry for the algorithm: Bernoulli Naive Bayes with a definition. The “cheat sheet” is a link to a python example. In this case, the example is a link to an explanation on the Chris Albon blog.

What’s interesting is that the 101 algorithms are grouped under 18 categories. Of these 18, Bayes and derivative methods total five.
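For readers who want the gist without clicking through, here is a from-scratch Bernoulli Naive Bayes, the algorithm named in the cheat sheet entry above. This is our own minimal sketch with invented toy data, not the code from the linked blog:

```python
# Bernoulli Naive Bayes on binary feature vectors, with add-one
# (Laplace) smoothing. Toy spam/ham data invented for illustration.
import math

def train(X, y):
    """X: list of binary feature vectors; y: class labels."""
    classes = set(y)
    n_features = len(X[0])
    model = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / len(X)
        # P(feature_j = 1 | class c) with add-one smoothing
        probs = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                 for j in range(n_features)]
        model[c] = (math.log(prior), probs)
    return model

def predict(model, x):
    """Pick the class with the highest log posterior."""
    best, best_lp = None, -math.inf
    for c, (log_prior, probs) in model.items():
        lp = log_prior + sum(
            math.log(p) if xi else math.log(1 - p)
            for xi, p in zip(x, probs))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

# Toy features: [contains "free", contains "meeting"]
X = [[1, 0], [1, 0], [0, 1], [0, 1]]
y = ["spam", "spam", "ham", "ham"]
model = train(X, y)
print(predict(model, [1, 0]))
```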

No big deal, but in my lectures about widely used algorithms I highlight 10, mostly because it is a nice round number. The point is that most of the analytics vendors use the same basic algorithms. Variations among products built on these algorithms are significant.

As analytics systems become more modular — that is, like Lego blocks — it seems that the trajectory of development will be to select, preconfigure thresholds, and streamline processes in a black box.

Is this good or bad?

It depends on whether one’s black box is a dominant solution or platform.

Will users know that this almost inevitable narrowing has upsides and downsides?


Stephen E Arnold, July 17, 2019

New Jargon: Consultants, Start Your Engines

July 13, 2019

I read “What Is ‘Cognitive Linguistics’?” The article appeared in Psychology Today. Disclaimer: I did some work for this outfit a long time ago. Anybody remember Charles Tillinghast, “CRM” when it referred to people, not a baloney discipline for a Rolodex filled with sales leads, and the use of Psychology Today as a text in a couple of universities? Yeah, I thought not. The Ziff connection is probably lost in the smudges of thumb typing too.

Onward: The write up explains a new spin on psychology, linguistics, and digital interaction. The jargon for this discipline or practice, if you will is:

Cognitive Linguistics

I must assume that the editorial processes at today’s Psychology Today are genetically linked to the procedures in use in — what was it, 1972? — but who knows.


Here’s the definition:

The cognitive linguistics enterprise is characterized by two key commitments. These are:
i) the Generalization Commitment: a commitment to the characterization of general principles that are responsible for all aspects of human language, and
ii) the Cognitive Commitment: a commitment to providing a characterization of general principles for language that accords with what is known about the mind and brain from other disciplines. As these commitments are what imbue cognitive linguistics with its distinctive character, and differentiate it from formal linguistics.

If you are into psychology and figuring out how to manipulate people or a Google ranking, perhaps this is the intellectual gold worth more than stolen treasure from Montezuma.

Several observations:

  1. I eagerly await an estimate from IDC for the size of the cognitive linguistics market, and I am panting with anticipation for a Gartner magic quadrant which positions companies as leaders, followers, outfits which did not pay for coverage, and names found with a Google search at the Starbucks south of the old PanAm Building. Cognitive linguistics will have to wait until the two giants of expertise figure out how to define “personal computer market”, however.
  2. A series of posts from Dave Amerland and assorted wizards at SEO blogs which explain how to use the magic of cognitive linguistics to make a blog page — regardless of content, value, and coherence — number one for a Google query.
  3. A how-to book from Wiley Publishing called “Cognitive Linguistics for Dummies” with online reference material which may or may not actually be available via the link in the printed book.
  4. A series of conferences run by assorted “instant conference” organizers with titles like “The Cognitive Linguistics Summit” or “Cognitive Linguistics: Global Impact”.

So many opportunities. Be still, my heart.

Cognitive linguistics — its time has come. Not a minute too soon for a couple of floundering enterprise search vendors to snag the buzzword and pivot to implementing cognitive linguistics for solving “all your information needs.” Which search company will embrace this technology: Coveo, IBM Watson, Sinequa?

DarkCyber is excited.

Stephen E Arnold, July 13, 2019

Sentiment Analysis: Can a Monkey Do It?

June 27, 2019

Sentiment analysis is a machine learning tool companies are employing to understand how their customers feel about their services and products. It is mainly deployed on social media platforms, including Facebook, Instagram, and Twitter. The Monkey Learn blog details how sentiment analysis is specifically being used on Twitter in the post, “Sentiment Analysis Of Twitter.”

Using sentiment analysis is not a new phenomenon, but there are still individuals unaware of the possible power at their fingertips. Monkey Learn specializes in custom machine learning solutions that include intent, keywords, and, of course, sentiment analysis. The post is a guide on the basics of sentiment analysis: what it is, how it works, and real life examples. Monkey Learn defines sentiment analysis as:

Sentiment analysis (a.k.a. opinion mining) is the automated process of identifying and extracting the subjective information that underlies a text. This can be either an opinion, a judgment, or a feeling about a particular topic or subject. The most common type of sentiment analysis is called ‘polarity detection’ and consists in classifying a statement as ‘positive’, ‘negative’ or ‘neutral’.

It also relies on natural language processing (NLP) to understand the information’s context.

Monkey Learn explains that sentiment analysis is important because most of the world’s digital data is unstructured. Machine learning with NLP’s assistance can quickly sort large data sets and detect their polarity. Monkey Learn promises with their sentiment analysis to bring their customers scalability, consistent criteria, and real-time analysis. Many companies are using Twitter sentiment analysis for customer service, brand monitoring, market research, and political campaigns.
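The “polarity detection” defined above can be approximated with nothing more than a word lexicon. A toy sketch with invented word lists; real systems, as the post notes, rely on NLP to handle context, negation, and the like:

```python
# Hypothetical lexicon-based polarity detection: count positive and
# negative words and classify by the sign of the difference.
POSITIVE = {"love", "great", "excellent", "happy", "good"}
NEGATIVE = {"hate", "terrible", "awful", "bad", "broken"}

def polarity(text):
    """Classify a statement as positive, negative, or neutral."""
    words = text.lower().split()
    score = (sum(w in POSITIVE for w in words)
             - sum(w in NEGATIVE for w in words))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("I love this great product"))
print(polarity("The update is terrible and broken"))
print(polarity("Shipped on Tuesday"))
```

A lexicon approach fails on sentences like “not bad at all,” which is one reason the vendors lean on NLP for context.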

The article is basically a promotional piece for Monkey Learn, but it does work as a starting guide for sentiment analysis.

Whitney Grace, June 27, 2019
