Smart Software and Textualists: Are You a Textualist?

June 13, 2022

Many thought it was simply a massive bad decision from an inexperienced judge. But there was more to it—it was a massive bad decision from an inexperienced textualist judge with an overreliance on big data. The Verge discusses “The Linguistics Search Engine that Overturned the Federal Mask Mandate.” Search is useful, but it must be accompanied by good judgment. When a lawsuit challenging the federal mask mandate came across her bench, federal judge Kathryn Mizelle turned to the letter of the law. Literally. Reporter Nicole Wetsman tells us:

“Mizelle took a textualist approach to the question — looking specifically at the meaning of the words in the law. But along with consulting dictionaries, she consulted a database of language, called a corpus, built by a Brigham Young University linguistics professor for other linguists. Pulling every example of the word ‘sanitation’ from 1930 to 1944, she concluded that ‘sanitation’ was used to describe actively making something clean — not as a way to keep something clean. So, she decided, masks aren’t actually ‘sanitation.’”

That is some fine hair splitting. The high-profile decision illustrates a trend in US courts that has been growing since 2018—basing legal decisions on large collections of texts meant for academic exploration. The article explains:

“A corpus is a vast database of written language that can include things like books, articles, speeches, and other texts, amounting to hundreds of millions of lines of text or more. Linguists usually use corpora for scholarly projects to break down how language is used and what words are used for. Linguists are concerned that judges aren’t actually trained well enough to use the tools properly. ‘It really worries me that naive judges would be spending their lunch hour doing quick-and-dirty searches of corpora, and getting data that is going to inform their opinion,’ says Mark Davies, the now-retired Brigham Young University linguistics professor who built both the Corpus of Contemporary American English and the Corpus of Historical American English. These two corpora have become the tools most commonly used by judges who favor legal corpus linguistics.”

Here is an example of how a lack of careful consideration while using the corpora can lead to a bad decision: the most frequent usage of a particular word (like “sanitation”) is not always the most commonly understood usage. Linguists emphasize the proper use of these databases requires skilled interpretation, a finesse a growing number of justices either do not possess or choose not to use. Such textualists apply a strictly literal interpretation to the words that make up a law, ignoring both the intent of lawmakers and legislative history. This approach means judges can avoid having to think too deeply or give reasons on the merits for their interpretations. Why, one might ask, should we have justices at all when we could just ask a database? Perhaps we are headed that way. We suppose it would save a lot of tax dollars.
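
To see why raw counts mislead, consider a minimal sketch in Python. The five-sentence "corpus" and the hand-assigned sense labels below are invented for illustration; they are not drawn from the Corpus of Historical American English or from the judge's search. The sketch only shows that counting hits reveals which sense turns up most often in a sample, not which senses a reader would have accepted.

```python
from collections import Counter

# Hypothetical mini-corpus: (sentence, hand-labeled sense of "sanitation").
# Both the sentences and the labels are made up for illustration.
SAMPLE = [
    ("The crew handled sanitation of the ward after the outbreak.", "clean"),
    ("Sanitation of the kitchen was done every night.", "clean"),
    ("City sanitation removed the refuse on Monday.", "clean"),
    ("Good sanitation keeps the water supply safe.", "keep_clean"),
    ("Sanitation rules require covers on food to prevent contamination.", "keep_clean"),
]

def sense_counts(corpus, word):
    """Count the hand-labeled senses of sentences that contain `word`."""
    counts = Counter()
    for sentence, sense in corpus:
        if word in sentence.lower():
            counts[sense] += 1
    return counts

counts = sense_counts(SAMPLE, "sanitation")
most_frequent = counts.most_common(1)[0][0]  # the sense with the most hits
attested = set(counts)                       # every sense that appears at all

print(counts)
print("most frequent sense:", most_frequent)
print("all attested senses:", sorted(attested))
```

The "keep clean" sense loses the frequency contest yet is plainly in use. Picking the winner of the count and discarding the rest is a choice made by the person running the search, not a finding delivered by the database.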

See the article for more on legal corpora and how judges use them, textualism, and the problems with this simplified approach. If judges won’t respect the opinion of the very authors of the corpora on how they should and should not be used, where does that leave us?

Cynthia Murrell, June 13, 2022

Deepset: Following the Trail of DR LINK, Fast Search and Transfer, and Other Intrepid Enterprise Search Vendors

April 29, 2022

I noted a Yahooooo! news story called “Deepset Raises $14M to Help Companies Build NLP Apps.” To me the headline could mean:

Customization is our business and services revenue our monetization model

Precursor enterprise search vendors tried to get gullible prospects to believe a company could install software and employees could locate the information needed to answer a business question. STAIRS III, Personal Library Software / SMART, and the outfit with forward truncation (InQuire) among others were there to deliver.

Then reality happened. Autonomy and Verity upped the ante with assorted claims. The Golden Age of Enterprise Search was poking its rosy fingers through the cloud of darkness related to finding an answer.

Quite a ride: The buzzwords sawed through the doubt and outfits like Delphis, Entopia, Inference, and many others embraced variations on the smart software theme. Excursions into asking the system a question to get an answer gained steam. Remember the hand crafted AskJeeves or the mind boggling DR LINK; that is, document retrieval via linguistic knowledge.

Today there are many choices for enterprise search: Free Elastic, Algolia, Funnelback now the delightfully named Squiz, Fabasoft Mindbreeze, and, of course, many, many more.

Now we have Deepset, “the startup behind the open source NLP framework Haystack,” not to be confused with Matt Dunie’s memorable “haystack with needles” metaphor, the intelware company Haystack, or a basic pile of dead grass.

The article states:

CEO Milos Rusic co-founded Deepset with Malte Pietsch and Timo Möller in 2018. Pietsch and Möller — who have data science backgrounds — came from Plista, an adtech startup, where they worked on products including an AI-powered ad creation tool. Haystack lets developers build pipelines for NLP use cases. Originally created for search applications, the framework can power engines that answer specific questions (e.g., “Why are startups moving to Berlin?”) or sift through documents. Haystack can also field “knowledge-based” searches that look for granular information on websites with a lot of data or internal wikis.

What strikes me? Three things:

  1. This is essentially a consulting and services approach
  2. Enterprise becomes apps for a situation, department, or specific need
  3. The buzzwords are interesting: NLP, semantic search, BERT, and humor.

Humor is a necessary quality when trying to make decades old technology work for distributed, heterogeneous data, email on a sales professional’s mobile, videos, audio recordings, images, engineering diagrams along with the nifty datasets for the gizmos in the illustration, etc.

A question: Is $14 million enough?

Crickets.

Stephen E Arnold, April 29, 2022

Monopolies Know Best: The Amazon Method Involves a Better Status Page

December 13, 2021

Here’s the fix for the Amazon AWS outage: An updated status page. “Amazon Web Services Explains Outage and Will Make It Easier to Track Future Ones” reports:

A major Amazon Web Services outage on Tuesday started after network devices got overloaded, the company said on Friday [December 10, 2021]. Amazon ran into issues updating the public and taking support inquiries, and now will revamp those systems.

Several questions arise:

  1. How are those two pizza technical methods working out?
  2. What about automatic regional load balancing and redundancy?
  3. What is up with replicating the mainframe single point of failure in a cloudy world?

Neither the write up nor Amazon has answers. I have a thought, however. Monopolies see efficiency arising from:

  1. Streamlining by shifting human intermediated work to smart software which sort of works until it does not.
  2. Talking about technical prowess via marketing centric content and letting the engineering sort of muddle along until it eventually, if ever, catches up to the Mad Ave prose, PowerPoints, and rah rah speeches at bespoke conferences.
  3. Cutting costs where one can; for example, robust network devices and infrastructure.

The AT&T approach is a goner, but it seems to be back, just in the form of Baby Bell thinking applied to an online bookstore which dabbles in national security systems and methods, selling third party products with mysterious origins, and promoting audio books to those who have cancelled the service due to endless email promotions.

Yep, outstanding, just from Wall Street’s point of view. From my vantage point, another sign of deep seated issues. What outfit is up next? Google, Microsoft, or some back office provider of which most humans have never heard?

The new and improved approach to an AT&T type business is just juicy with wonderfulness. Two pizzas. Yummy.

Stephen E Arnold, December 13, 2021

Semantics and the Web: A Snort of Pisco?

November 16, 2021

I read a transcript for the video called “Semantics and the Web: An Awkward History.” I have done a little work in the semantic space, including a stint as an advisor to a couple of outfits. I signed confidentiality agreements with the firms and even though both have entered the well-known Content Processing Cemetery, I won’t name these outfits. However, I thought of the ghosts of these companies as I worked my way through the transcript. I don’t think I will have nightmares, but my hunch is that investors in these failed outfits may have bad dreams. A couple may experience post traumatic stress. Hey, I am just suggesting people read the document, not go bonkers over its implications in our thumbtyping world.

I want to highlight a handful of gems I identified in the write up. If I get involved in another world-saving semantic project, I will want to have these in my treasure chest.

First, I noted this statement:

“Generic coding”, later known as markup, first emerged in the late 1960s, when William Tunnicliffe, Stanley Rice, and Norman Scharpf got the ideas going at the Graphics Communication Association, the GCA. Goldfarb’s implementations at IBM, with his colleagues Edward Mosher and Raymond Lorie, the G, M, and L, made him the point person for these conversations.

What’s not mentioned is that some in the US government became quite enthusiastic. Imagine the benefit of putting tags in text and providing electronic copies of documents. Much better than loose-leaf notebooks. I wish I had a penny for every time I heard this statement. How does the government produce documents today? The only technology not in wide use is hot metal type. It’s been, what, a half century?

Second, I circled this passage:

SGML included a sample vocabulary, built on a model from the earliest days of GML. The American Association of Publishers and others used it regularly.

Indeed wonderful. The phrase “slicing and dicing” captured the essence of SGML. Why have human editors? Use SGML. Extract chunks. Presto! A new book. That worked really well but for one drawback: The proliferating wild and crazy “books” were tough to sell. Experts in SGML were and remain a rare breed of cat. There were SGML ecosystems but adding smarts to content was and remains a work in progress. Yes, I am thinking of Snorkel too.
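
For those who never sat through the slice and dice pitch, here is a minimal sketch of the idea. It uses XML and Python’s standard xml.etree.ElementTree module as a stand-in, because SGML itself needs a dedicated parser. The element names (book, chapter, topic) and the sample text are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged source document; names and content are invented.
SOURCE = """
<book title="Big Reference Work">
  <chapter topic="search"><title>Indexing</title><para>Build the index.</para></chapter>
  <chapter topic="markup"><title>Tags</title><para>Tag the text.</para></chapter>
  <chapter topic="search"><title>Queries</title><para>Ask a question.</para></chapter>
</book>
"""

def slice_and_dice(xml_text, topic, new_title):
    """Collect every chapter tagged with `topic` into a new 'book' element."""
    root = ET.fromstring(xml_text.strip())
    new_book = ET.Element("book", {"title": new_title})
    for chapter in root.findall("chapter"):
        if chapter.get("topic") == topic:
            new_book.append(chapter)
    return new_book

new_book = slice_and_dice(SOURCE, "search", "Search Made Easy")
titles = [chapter.findtext("title") for chapter in new_book.findall("chapter")]
print(titles)
```

The extraction is the easy part. Nothing in the sketch checks whether the two harvested chapters make a coherent book, which is the job the human editors were doing.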

Third, I like this observation too:

Dumpsters are available in a variety of sizes and styles. To be honest, though, these have always been available. Demolition of old projects, waste, and disasters are common and frequent parts of computing.

The Web as well as social media are dumpsters. Let’s toss in TikTok type videos too. I think meta meta tags can burn in our cherry red garbage container. Why not?

What do these observations have to do with “semantics”?

  1. Move from SGML to XML. Much better. Allow XML to run some functions. Yes, great idea.
  2. Create a way to allow content objects to be anywhere. Just pull them together. Was this the precursor to micro services?
  3. One major consequence of tagging or the lack of it or just really lousy tagging, marking up, and relying on software allegedly doing the heavy lifting is an active demand for a way to “make sense” of content. The problem is that an increasing amount of content is non textual. Ooops.

What’s the fix? The semantic Web revivified? The use of pre-structured, by golly, correct mark up editors? A law that says students must learn how to mark up and tag? (Problem: Schools don’t teach math and logic anymore. Oh, well, there’s an online course for those who don’t understand consistency and rules.)

The write up makes clear there are numerous opportunities for innovation. And the non-textual information? Academics have some interesting ideas. Why not go SAILing or revisit the world of semantic search?

Stephen E Arnold, November 16, 2021

Facebook Targets Paginas Amarillas: Never Enough, Zuck?

October 14, 2021

Facebook is working to make one of its properties more profitable. The Next Web reports, “WhatsApp Reinvents the ‘Yellow Pages’ and Proves there Are No New Ideas.” The company will test out a new business directory feature in São Paulo, Brazil, where local users will be able to search for “businesses nearby” through the app. Writer Ivan Mehta reports:

“For years, Facebook and Instagram have been trying to connect you to businesses and make you shop through their platforms. While the WhatsApp Business app has been around, you couldn’t really search for businesses using the app, unless you’ve interacted with them previously. WhatsApp already offers payment services in Brazil. So it makes sense for it to provide discovery services for local businesses, so you can shop for goods in person, and pay through the platform. The chat app doesn’t have any ads, unlike Facebook and Instagram, so business interactions and transactions are one of the biggest ways for Facebook to earn some moolah out of it. In June, the company integrated its Shops feature in WhatsApp. So, we can expect more business-facing features in near future.”

India and Indonesia are likely next on the list for the project, according to Facebook’s Matt Idema. We are assured the company will track neither users’ locations nor the businesses they search for. Have we heard similar promises before?

Cynthia Murrell, October 14, 2021

Ex-Googlers Work On Biased NLP Solutions

October 6, 2021

Google is on top of the world when it comes to money and technology. Google is the world’s most used search engine, its Chrome Web browser is used by two-thirds of users, and about 29% of 2021 digital advertising was Google ads. Fast Company asks and investigates important questions about Google’s product quality in: “It’s Not Just You. Google Search Really Is Getting Worse.”

Over 80% of the revenue of Alphabet Inc., Google’s parent company, comes from advertising, and about 85% of the world’s search engine traffic feeds through Google. Google controls a lot of users’ screen time. The quality of the search engine’s results has been studied, and researchers have learned that very few users scroll past the “fold” (the content visible on a screen without scrolling). Advertising space at the top of search results is therefore incredibly valuable. It also means that users are forced to scroll further and further to reach non-paid results.

Alphabet Inc. has another revenue generating platform, YouTube. A huge portion of videos include multiple ads. Users can avoid ads by paying for a premium subscription, but very few do.

Google does want to improve its search quality. Currently a lot of the information that answers a query is distributed across multiple Web sites. Google wants to condense everything:

“Google is working on bringing this information together. The search engine now uses sophisticated “natural language processing” software called BERT, developed in 2018, that tries to identify the intention behind a search, rather than simply searching strings of text. AskJeeves tried something similar in 1997, but the technology is now more advanced.

BERT will soon be succeeded by MUM (Multitask Unified Model), which tries to go a step further and understand the context of a search and provide more refined answers. Google claims MUM may be 1,000 times more powerful than BERT, and be able to provide the kind of advice a human expert might for questions without a direct answer.”

Google controls a huge portion of the Internet and how users utilize it. Alphabet Inc. is here to stay for a long time, but there are alternatives such as Bing, DuckDuckGo, Ecosia, and Tor browsers. Google, however, will one day fade. Sears Roebuck, Blockbuster, Kmart, cassettes, etc. were all household names, until they became obsolete.

Whitney Grace, October 6, 2021

Data Federation: Sure, Works Perfectly

June 1, 2021

How easy is it to snag a dozen sets of data, normalize them, parse them, and extract useful index terms, assign classifications, and other useful hooks? “Automated Data Wrangling” provides an answer sharply different from what marketers assert.

A former space explorer, now marooned on a beautiful dying world, explains that the marketing assurances of dozens upon dozens of companies are baloney. Here’s a passage I noted:

Most public data is a mess. The knowledge required to clean it up exists. Cloud based computational infrastructure is pretty easily available and cost effective. But currently there seems to be a gap in the open source tooling. We can keep hacking away at it with custom rule-based processes informed by our modest domain expertise, and we’ll make progress, but as the leading researchers in the field point out, this doesn’t scale very well. If these kinds of powerful automated data wrangling tools are only really available for commercial purposes, I’m afraid that the current gap in data accessibility will not only persist, but grow over time. More commercial data producers and consumers will learn how to make use of them, and dedicate financial resources to doing so, knowing that they’ll reap financial rewards. While folks working in the public interest trying to create universal public goods with public data and open source software will be left behind struggling with messy data forever.

Marketing is just easier than telling the truth about what’s needed in order to generate information which can be processed by a downstream procedure.
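
A minimal sketch of what “custom rule-based processes” look like in practice, using only the Python standard library. The two sources, their column names, the date formats, and the units are made up for illustration; real public data is messier.

```python
import csv
import io
from datetime import datetime

# Two hypothetical "public datasets" describing the same kind of record,
# with different delimiters, column names, date formats, and units.
SOURCE_A = "plant,date,output_mwh\nAlpha Station,2021-06-01,120\n"
SOURCE_B = "Plant Name;Report Date;Output (kWh)\n alpha station ;06/01/2021;95000\n"

def normalize_a(text):
    """Hand-written rules for source A: comma-delimited, ISO dates, MWh."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {
            "plant": row["plant"].strip().lower(),
            "date": datetime.strptime(row["date"], "%Y-%m-%d").date(),
            "mwh": float(row["output_mwh"]),
        }

def normalize_b(text):
    """Hand-written rules for source B: semicolons, US-style dates, kWh."""
    for row in csv.DictReader(io.StringIO(text), delimiter=";"):
        yield {
            "plant": row["Plant Name"].strip().lower(),
            # Assumes month/day/year; a human had to decide that.
            "date": datetime.strptime(row["Report Date"], "%m/%d/%Y").date(),
            "mwh": float(row["Output (kWh)"]) / 1000.0,  # kWh to MWh
        }

records = list(normalize_a(SOURCE_A)) + list(normalize_b(SOURCE_B))
for record in records:
    print(record)
```

Two sources needed two hand-built normalizers, each encoding decisions a person had to make about names, dates, and units. Multiply by a dozen sources and the gap between the sales pitch and the work becomes visible.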

Stephen E Arnold, June 1, 2021

More about Bert: Will TikTok Videos Be Next?

May 28, 2021

Google asserts its new AI model will deliver significant improvements. SEO Hacker discusses “Google MUM: New Search Technology.” We are told MUM, or Multitask Unified Model, is like BERT but much more powerful. We learn:

“They are built on the same Transformer architecture, but MUM is 1000x more powerful than its predecessor. … Another difference between MUM and BERT is that MUM is trained across 75 languages – not just one language (usually English). This enables the search engine, through the use of MUM, to connect information from all around the world without going through language barriers. Additionally, Google mentioned that MUM is multimodal, so it understands and processes information from modalities such as text and images. They also brought up the possibility for MUM to expand to other modalities such as videos and audio files.”

For an example of how the new model will work, see either the SEO Hacker write-up or Google’s blog post on the subject. The illustration involves Mt. Fuji. Naturally, the Search Engine Optimization site ponders how the change might affect SEO. Writer Sean Si predicts MUM’s understanding of 75 languages means non-English content will find much wider audiences. The revised algorithm will also serve up more types of content, like podcasts and videos, alongside text-based resources. Both of those sound like positives, at least for searchers. Other ramifications on the field remain to be seen, but Si anticipates SEO pros will have to develop entirely new approaches. Of course, producing quality content relevant to one’s site should remain the top recommendation.

Cynthia Murrell, May 28, 2021

UCF Cracks Sarcasm: With a Crocodile Smile?

May 18, 2021

I read some big news from Big News. The story “Researchers Develop A.I. That Can Detect Sarcasm” explains that smart software can parse text so that the degree of non-smarty writing can be determined. The article states:

The team taught the computer model to find patterns that often indicate sarcasm and combined that with teaching the program to correctly pick out cue words in sequences that were more likely to indicate sarcasm. They taught the model to do this by feeding it large data sets and then checked its accuracy.

Presumably the hand-crafting of the training set is able to keep pace with the language of those seeking customer support. I have commented about the brilliance and responsiveness of the customer support available from major companies; for example, Microsoft and Verizon. Improving upon the clarity of information available from these organizations is difficult for me to envision. The excellent handling of SolarWinds by Microsoft and the management acumen demonstrated by Verizon with regard to Yahoo chisel a benchmark in marketing effectiveness.

The write up adds:

The multi-head self-attention module aids in identifying crucial sarcastic cue-words from the input, and the recurrent units learn long-range dependencies between these cue-words to better classify the input text.

Mix in sentiment analysis, and the simplicity of the method is evident.

I noted this statement:

Sarcasm detection in online communications from social networking platforms is much more challenging.

It seems that one of the final frontiers of human utterance has been crossed. Sarcasm has been cracked. As I write this I manifest a crocodile smile. The reason? The time and cost of maintaining the training set so it reflects what TikTok and Dread users “do” with language may be a sticking point. Then the rules must be updated in near real time, assuming that the data flows are related to crime, war fighting, or financial fraud.

A big crocodile? Yes, and a big smile. But research grants and graduate students are eager to contribute because… degree.

Stephen E Arnold, May 18, 2021

GitHub: Amusing Security Management

April 8, 2021

I got a kick out of “GitHub Investigating Crypto-Mining Campaign Abusing Its Server Infrastructure.” I am not sure if the write up is spot on, but it is entertaining to think about Microsoft’s security systems struggling to identify an unwanted service running in GitHub. The write up asserts:

Code-hosting service GitHub is actively investigating a series of attacks against its cloud infrastructure that allowed cybercriminals to implant and abuse the company’s servers for illicit crypto-mining operations…

In the wake of the SolarWinds’ and Exchange Server “missteps,” Microsoft has been making noises about the tough time it has dealing with bad actors. I think one MSFT big dog said there were 1,000 hackers attacking the company.

The main idea is that attackers allegedly mine cryptocurrency on GitHub’s own servers.

This is post SolarWinds and Exchange Server “missteps”, right?

What’s the problem with cyber security systems that monitor real time threats and uncertified processes?

Oh, I forgot. These aggressively marketed cyber systems still don’t work it seems.

Stephen E Arnold, April 8, 2021
