Newton and Shoulders of Giants? Baloney. Is It Everyday Theft?

January 31, 2023

Here I am in rural Kentucky. I have been thinking about the failure of education. I recall learning from Ms. Blackburn, my high school algebra teacher, this statement by Sir Isaac Newton, the apple and calculus guy:

If I have seen further, it is by standing on the shoulders of giants.

Did Sir Isaac actually say this? I don’t know, and I don’t care too much. It is the gist of the sentence that matters. Why? I just finished reading — and this is the actual article title — “CNET’s AI Journalist Appears to Have Committed Extensive Plagiarism. CNET’s AI-Written Articles Aren’t Just Riddled with Errors. They Also Appear to Be Substantially Plagiarized.”

How is any self-respecting, super buzzy smart software supposed to know anything without ingesting, indexing, vectorizing, and applying whatever other math magic the developers have baked into the system? Did Brunelleschi wake up one day and do the Eureka! thing? Maybe he stood in line, entered the Pantheon, and looked up? Maybe he found a wasp’s nest, cut it in half, and studied what the feisty insects did to build a home? Obviously intellectual theft. Never mind that the dome still stands; when it falls, he will be exposed as an untrustworthy architect-engineer. Argument nailed.

The write up focuses on other ideas; namely, being incorrect and stealing content. Okay, those are interesting and possibly valid points. The write up states:

All told, a pattern quickly emerges. Essentially, CNET‘s AI seems to approach a topic by examining similar articles that have already been published and ripping sentences out of them. As it goes, it makes adjustments — sometimes minor, sometimes major — to the original sentence’s syntax, word choice, and structure. Sometimes it mashes two sentences together, or breaks one apart, or assembles chunks into new Frankensentences. Then it seems to repeat the process until it’s cooked up an entire article.
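Spotting that kind of remixing is not rocket science. As a toy illustration (my sketch, not Futurism’s actual method; the sentences are invented), one can score a suspect sentence against previously published ones and flag near matches:

```python
# Toy near-duplicate check: flag a "new" sentence that is a light
# rewrite of an already-published one. Threshold chosen arbitrarily.
from difflib import SequenceMatcher

published = [
    "A certificate of deposit locks your money up for a fixed term.",
    "Savings accounts let you withdraw cash at any time.",
]
suspect = "A certificate of deposit locks up your cash for a fixed term."

for source in published:
    score = SequenceMatcher(None, suspect.lower(), source.lower()).ratio()
    if score > 0.7:
        print(f"possible rewrite (similarity {score:.2f}): {source}")
```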

For a short (very, very brief) time I taught freshman English at a big-time university. What the Futurism article describes is how I interpreted the work process of my students. Those entitled and enquiring minds just wanted to crank out an essay that would meet my requirements and hopefully earn an A or a 10, which was a signal that Bryce or Helen was a very good student. Did they then go to a local hangout and talk about Heidegger? Nope, mostly about the opposite sex, music, and getting their hands on a copy of Dr. Oehling’s test from last semester for European History 104. Substitute the topics you talked about to make my statement more “accurate,” please.

I loved the final paragraphs of the Futurism article. Not only is a competitor tossed over the argument’s wall, but the Google and its outstanding relevance find themselves targets. Imagine. Google. Criticized. The article’s final statements are interesting; to wit:

As The Verge reported in a fascinating deep dive last week, the company’s primary strategy is to post massive quantities of content, carefully engineered to rank highly in Google, and loaded with lucrative affiliate links. For Red Ventures, The Verge found, those priorities have transformed the once-venerable CNET into an “AI-powered SEO money machine.” That might work well for Red Ventures’ bottom line, but the specter of that model oozing outward into the rest of the publishing industry should probably alarm anybody concerned with quality journalism or — especially if you’re a CNET reader these days — trustworthy information.

Do you like the word trustworthy? I do. Does Sir Isaac fit into this future-leaning analysis? Nope, he’s still preoccupied with proving that the evil Gottfried Wilhelm Leibniz was tipped off about tiny rectangles and the methods thereof. Perhaps Futurism can blame smart software?

Stephen E Arnold, January 31, 2023

Transcription Services: Three Sort of New Ones

December 19, 2022

Update: 2 pm Eastern US time, December 19, 2022. One of the research team members pointed out that the article we posted earlier today chopped out a pointer to a YouTube video transcription service. YouTube Transcript accepts a URL and outputs a transcript. You can obtain more information at https://youtubetranscript.com/.

One of the Arnold IT research team members spotted two new or newish online transcription services. If you want the text of an audio file or the text of a video, maybe one of these services will be useful to you. We have not tested either; we are just passing along what appear to be interesting examples of useful semi-smart software.

The first is called Deepgram. (The name echoes n-gram, grammar, and grandma.) Once a person signs up, the registrant gets 200 hours of free transcription. That is approximately a month of Jason Calacanis podcasts. The documentation and information about the service’s SDK may be found at this link.
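For the curious, here is a minimal sketch of what a request to a hosted transcription service might look like. It assumes Deepgram’s v1/listen REST endpoint, an API key stored in an environment variable, and the response layout described in the service’s documentation; verify all of these against the current docs before relying on it:

```python
# Hedged sketch: send an audio URL to Deepgram's hosted transcription
# endpoint and print the transcript. Assumes the v1/listen endpoint,
# Token-style auth, and the documented results layout.
import os
import requests

API_KEY = os.environ["DEEPGRAM_API_KEY"]  # assumption: key stored in env

response = requests.post(
    "https://api.deepgram.com/v1/listen",
    headers={
        "Authorization": f"Token {API_KEY}",
        "Content-Type": "application/json",
    },
    json={"url": "https://example.com/podcast-episode.mp3"},  # placeholder URL
    timeout=300,
)
response.raise_for_status()

# The transcript text sits inside a nested results structure.
payload = response.json()
print(payload["results"]["channels"][0]["alternatives"][0]["transcript"])
```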

The second service is Equature. The idea is, according to Yahoo Finance:

a first-of-its-kind transcription and full-text search engine. Equature Transcription provides automated transcription of audio from 9-1-1 calls, radio transmissions, Equature Armor Body-worn Camera video, and any other form of media captured within the Equature recording system. Once transcribed, all written text is searchable within the system.

Equature’s service is tailored to public safety applications. You can get more information from the firm’s Web site.

Oh, we don’t listen to Mr. Calacanis, but we do scan the transcript and skip the name dropping, Musk cheers, and quasi-academic pontification.

Stephen E Arnold, December 19, 2022

Don Quixote Rides Again: Instead of Windmills, the Target Is Official and True Government Documents

December 8, 2022

I read “Archiving Official Documents as an Act of Radical Journalism.” The main idea is that a non-governmental entity will collect official and “true” government documents, save them, and make them searchable. Now this is an interesting idea, and it is one that most of the countries for which I have provided consulting services related to archiving information have already addressed with solutions of their own. The solutions range from the wild and woolly methods used in the Japanese government to the logical approach implemented in Sweden. There’s a carnival atmosphere in Brazil, and there is a fairly interesting method in Croatia. France? Mais oui.

In each of these countries, one has to have quite specific know-how in order to obtain an official and true government document. I know from experience that a person who is not a resident of some of these countries has pretty much zero chance of getting a public transcript of a public hearing. In some cases, even with appropriate insider assistance, finding the documents is often impossible. Sure, the documents are “there.” But due to budget constraints, lousy technology, or staff procedures — not a chance. The Vatican Library has a number of little-discussed incidents in which pages were chopped out of priceless old volumes. Where are those pages now? Hey, where’s that hymn book from the 14th century?

I want you to notice that I did not mention the US. In America we have what some might call “let many flowers bloom” methods. You might think the Library of Congress has government documents. Yeah, sort of, well, some. Keep in mind that the US Senate has documents, as does the House. Where are the working drafts of a bill? Try chasing that one down, assuming you have connections and appropriate documentation to poke around. Who has the photos of government nuclear facilities from the 1950s? I know where they used to be in the “old” building in Germantown, Maryland. I even know how to run the wonky vertical lift to look in the cardboard boxes. Now? You have to be kidding. What about the public documents from Health and Human Services related to MIC, RAC, and ZPIC? Oh, you haven’t heard about these? Good luck finding them. I could work through every US government agency in which I have worked and provide what I think are fun examples of official government documents that are often quite, quite, quite difficult to locate.

The write up explains its idea which puts a windmill in the targeting device:

Democracy’s Library, a new project of the Internet Archive that launched last month, has begun collecting the world’s government publications into a single, permanent, searchable online repository, so that everyone—journalists, authors, academics, and interested citizens—will always be able to find, read, and use them. It’s a very fundamental form of journalism.

I am not sure the idea is a good one. In some countries, collecting government documents could become what I would characterize as a “problem.” What type of problem? How about fines, jail time, or unpleasantness that can follow you around like Shakespeare’s spaniels at your heels?

Several observations:

  1. Public official government documents change, they disappear, and they become non-public without warning. An archive of public government documents will become quite a management challenge when classifications change, regimes change, and government bureaucracies change course. Chase down a librarian at a US government repository library near you and ask some questions. Let me know how that works out when you bring up some of the administrative issues for documents in a collection.
  2. A collection of official and true documents which tries to be comprehensive for a single country is going to be radioactive. Searchable information is problematic. That’s why enterprise search vendors who say, “All the information in your organization is searchable,” evoke statements like “Get this outfit out of my office.” Some data are harmless in isolation. Pile data and information together, and the stuff can go critical.
  3. Electronic official and true government documents are often inaccessible. Examples range from public information stored in Lotus Notes (not the world’s best document system, in my opinion) to PowerPoint reports prepared for a public conference about the US Army’s Distributed Common Ground System. Now try to get the public document, and you may find that what was okay for a small-fish conference in Tysons Corner evokes some interesting responses as the requests buck up the line.
  4. Collecting and piling up official and true information sounds good … to some. Others may view the effort with some skepticism because public government information is essentially infinite. Once collected, those data may never go away. Never is a long time. How about those FOIA requests?

What’s the fix? Answer: Don Quixote became an icon for a reason, and it was not just elegant Spanish prose.

Stephen E Arnold, December 8, 2022

The Failure of Search: Let Many Flowers Bloom and… Die Alone and Sad

November 1, 2022

I read “Taxonomy is Hard.” No argument from me. Yesterday (October 31, 2022) I spoke with a long-time colleague and friend. Our conversations usually include some discussion about the loss of the expertise embodied in the early commercial database firms. The old frameworks, work processes, and shared beliefs among the top 15 or 20 for-fee online database companies seem to have been scattered and recycled in a quantum-crazy digital world. We did not mention Google once, but we could have. My colleague and I agreed on several points:

  • Those who want to make information digital must have an informing editorial policy; that is, what’s the content space, what’s included, what’s excluded, and what problem does the commercial database solve?
  • Finding information today is more difficult than it has been at any point in our two professional lives. We don’t know if the data are current and accurate (are online corrections applied when publications issue fixes?), whether they fit within the editorial policy if there is one, or whether the lack of a policy is shaped by the invisible hand of politics, advertising, and indifference to intellectual nuance. In some services, “old” data are disappeared, presumably due to the cost of maintaining and updating them (if that is actually done) and of working out how to make in-depth queries work within available time and budget constraints
  • The steady erosion of precision and recall as reliable yardsticks for determining what a search system can find within a specific body of content (a short sketch of these two yardsticks follows this list)
  • Professional indexing and content curation are being compressed or ignored by many firms. The process is expensive, time-consuming, and intellectually difficult.
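For readers who never lived with the old yardsticks, here is a minimal sketch of how precision and recall are computed for a single query; the document identifiers and relevance judgments are invented:

```python
# Precision: what fraction of the returned documents are relevant?
# Recall: what fraction of the relevant documents were returned?
relevant = {"doc1", "doc3", "doc7", "doc9"}    # ground-truth judgments
retrieved = {"doc1", "doc2", "doc3", "doc5"}   # what the engine returned

hits = relevant & retrieved                    # doc1 and doc3
precision = len(hits) / len(retrieved)         # 2 / 4 = 0.50
recall = len(hits) / len(relevant)             # 2 / 4 = 0.50

print(f"precision={precision:.2f} recall={recall:.2f}")
```

Without curated test collections and explicit editorial policies, neither number can even be measured, which is the erosion we have in mind.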

The cited article reflects some of these issues. However, the mirror is shaped by the systems and methods in use today. The approaches pivot on metadata (index terms) and tagging (more indexing). The approach is understandable. The shift to technology which slashes the need for subject matter experts, manual methods, meetings about specific terms or categories, and the other impedimenta is the new normal.

A couple of observations:

  1. The problems of social media boil down to editorial policies. Without these guard rails and the specialists needed to maintain them, finding specific items of information on widely used platforms like Facebook, TikTok, or Twitter, among others, is difficult
  2. The challenges of processing video are enormous. The obvious fix is to gate the volume and implement specific editorial guidelines before content is made available to a user. Skipping this basic work task leads to the craziness evident in many services today
  3. Indexing can be supplemented by smart software. However, that smart software can drift off course, so specialists have to intervene and recalibrate the system.
  4. Semantic, statistical, or behavior-centric methods for identifying and suggesting possibly relevant content require the same expert-centric approach. There is no free lunch in automated indexing, even for narrow-vocabulary technical fields like nuclear physics or engineered materials. Which smart software knows how to deal with new breakthroughs in physics which emerge from the study of inter-cell behavior among proteins in the human brain?

Net net: Is it time to re-evaluate some discarded systems and methods? Is it time to accept the fact that technology cannot, in isolation, solve certain problems? Is it time to recognize that close enough for horseshoes and good enough are not appropriate when it comes to knowledge-centric activities? Search engines die when the information garden cannot support the buds and shoots of the useful information users seek.

Stephen E Arnold, November 1, 2022

Smart Software and Textualists: Are You a Textualist?

June 13, 2022

Many thought it was simply a massive bad decision from an inexperienced judge. But there was more to it—it was a massive bad decision from an inexperienced textualist judge with an overreliance on big data. The Verge discusses “The Linguistics Search Engine that Overturned the Federal Mask Mandate.” Search is useful, but it must be accompanied by good judgment. When a lawsuit challenging the federal mask mandate came across her bench, federal judge Kathryn Mizelle turned to the letter of the law. Literally. Reporter Nicole Wetsman tells us:

“Mizelle took a textualist approach to the question — looking specifically at the meaning of the words in the law. But along with consulting dictionaries, she consulted a database of language, called a corpus, built by a Brigham Young University linguistics professor for other linguists. Pulling every example of the word ‘sanitation’ from 1930 to 1944, she concluded that ‘sanitation’ was used to describe actively making something clean — not as a way to keep something clean. So, she decided, masks aren’t actually ‘sanitation.’”

That is some fine hair splitting. The high-profile decision illustrates a trend in US courts that has been growing since 2018—basing legal decisions on large collections of texts meant for academic exploration. The article explains:

“A corpus is a vast database of written language that can include things like books, articles, speeches, and other texts, amounting to hundreds of millions of lines of text or more. Linguists usually use corpora for scholarly projects to break down how language is used and what words are used for. Linguists are concerned that judges aren’t actually trained well enough to use the tools properly. ‘It really worries me that naive judges would be spending their lunch hour doing quick-and-dirty searches of corpora, and getting data that is going to inform their opinion,’ says Mark Davies, the now-retired Brigham Young University linguistics professor who built both the Corpus of Contemporary American English and the Corpus of Historical American English. These two corpora have become the tools most commonly used by judges who favor legal corpus linguistics.”

Here is an example of how a lack of careful consideration while using the corpora can lead to a bad decision: the most frequent usage of a particular word (like “sanitation”) is not always the most commonly understood usage. Linguists emphasize that the proper use of these databases requires skilled interpretation, a finesse that a growing number of justices either do not possess or choose not to use. Such textualists apply a strictly literal interpretation to the words that make up a law, ignoring both the intent of lawmakers and legislative history. This approach means judges can avoid having to think too deeply or give reasons on the merits for their interpretations. Why, one might ask, should we have justices at all when we could just ask a database? Perhaps we are headed that way. We suppose it would save a lot of tax dollars.
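To make the pitfall concrete, here is a toy version of the corpus pull described above, with a three-line invented “corpus”; real corpora run to hundreds of millions of words, but the counting logic is the same, and the tally alone says nothing about which sense a reader would assume:

```python
# Toy concordance: pull every line containing the target word and
# tally the hits. Frequency counting is the easy part; deciding what
# the hits mean is the part that requires a trained linguist.
corpus = [
    "the city improved sanitation by dredging the canals",
    "sanitation workers removed the refuse each morning",
    "masks were praised as a sanitation measure on trains",
]

target = "sanitation"
hits = [line for line in corpus if target in line]

print(f"{len(hits)} occurrences of {target!r}")
for line in hits:
    print("  ...", line)
```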

See the article for more on legal corpora and how judges use them, textualism, and the problems with this simplified approach. If judges won’t respect the opinion of the very authors of the corpora on how they should and should not be used, where does that leave us?

Cynthia Murrell, June 13, 2022

France and French: The Language of Diplomacy Says “Non, Non” to Gamer Lingo

May 31, 2022

I like France. Years ago I shipped my son to Paris to learn French. He learned other things. So, as a good daddy, I shipped him off to a language immersion school in Poitiers. He learned other things. Logically, responding as a good shepherd of my only son, I shipped him to Jarnac to work for a cognac outfit. He learned other things. Finally, I shipped him to Montpellier. How was his French? Coming along, I think.

He knew many slang terms.

Most of these were unknown to my wife (a French teacher) and me (a dolt from central Illinois). We bought a book of French slang, and it was useless. The French language zips right along: words and phrases from French-speaking Swiss people (mon dieu), words and phrases from North Africans (what’s the term for head butt?), and words and phrases from the Middle East popular among certain fringe groups.

Over the decades, French has become Franglais. But the rock of Gibraltar (which should be a French rock, according to some French historians) is the Académie française and its mission (a tiny snippet follows, but there is a lot more at this link):

La mission confiée à l’Académie est claire : « La principale fonction de l’Académie sera de travailler, avec tout le soin et toute la diligence possibles, à donner des règles certaines à notre langue et à la rendre pure, éloquente et capable de traiter les arts et les sciences. » (Roughly: “The mission entrusted to the Académie is clear: the Académie’s principal function will be to work, with all possible care and diligence, to give definite rules to our language and to render it pure, eloquent, and capable of treating the arts and sciences.”)

Who cares? The French culture ministry. (Do we have one in the US other than Disneyland?)

“France Bans English Gaming Tech Jargon in Push to Preserve Language Purity” explains:

Among several terms to be given official French alternatives were “cloud gaming”, which becomes “jeu video en nuage”, and “eSports”, which will now be translated as “jeu video de competition”. The ministry said experts had searched video game websites and magazines to see if French terms already existed. The overall idea, said the ministry, was to allow the population to communicate more easily.

Will those French “joueur-animateur en direct” abandon the word “streamer”?

Sure, and France will once again dominate Europe, parts of Africa, and the beaver-rich lands in North America. And Gibraltar? Sure, why not?

Stephen E Arnold, May 31, 2022

The FLoc Disperses: Are There Sheep Called Topics?

February 9, 2022

It looks like that FLoC thing is not working out for Google after all, so now it is trying another cookie-alternative called Topics. According to Inc., with this move, “Google Just Gave You the Best Reason Yet to Finally Quit Using Chrome.” Writer Jason Aten explains:

“Google said it would introduce an alternative known as Federated Learning of Cohorts, or FLoC. The short version is that Chrome would track your browsing history and use it to identify you as a part of a cohort of other users with similar interests. … The thing is, no one likes FLoC. Privacy experts hate it because it’s not actually more private just because the tracking and profiling happens in your browser. Advertisers and ad-tech companies don’t like FLoC because, well, they like cookies. They’d mostly prefer Google just leave things alone since cookies are what let them know exactly when you click on an ad, put something in your cart, and buy it. Now, Google is introducing an alternative it calls Topics. The idea is that Chrome will look at your browsing activity and identify up to five topics that it thinks you’re interested in. When you visit a website, Chrome will show it three of those topics, with the idea that the site will then show you an ad that matches your interest.”
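As a thought experiment, here is a toy simulation of the mechanism the quoted passage describes: the browser holds up to five inferred topics and discloses three to any site that asks. The topic names are invented, and this is emphatically not Google’s API:

```python
# Toy Topics simulation: five inferred interests, three disclosed per
# site. Seeding with the site name keeps the disclosure stable, so a
# given site always sees the same three topics.
import random

inferred_topics = ["fitness", "travel", "jazz", "cookware", "sci-fi"]

def topics_for_site(site: str) -> list[str]:
    rng = random.Random(site)          # per-site deterministic choice
    return rng.sample(inferred_topics, 3)

print(topics_for_site("example-news.com"))
print(topics_for_site("example-recipes.com"))
```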

Of course, all Chrome users will be enrolled in Topics by default. Google will provide a way to opt out, but it is well aware most users will not bother. If privacy is really important, why not just do away with targeted advertising altogether? Do not be silly—ad revenue is what Google is all about, even when it tries to pretend otherwise. Aten notes that Safari and Brave both allow users to block third-party cookies and neither had planned to support FLoC. Other browsers have ways to block them, too. According to this write-up, it is time to give up on Chrome altogether and choose a browser that actually respects users’ privacy.

Cynthia Murrell, February 9, 2022

Fuzzifying Data: Yeah, Sure

January 19, 2022

Data are often alleged to be anonymous, but they may not be. Companies such as LexisNexis, Acxiom, and the mobile phone providers argue that as long as personal identifiers, including names, addresses, etc., are removed, the data are rendered harmless. Unfortunately, data can be de-anonymized without too much trouble. Wired posted Justin Sherman’s article, “Big Data May Not Know Your Name. But It Knows Everything Else.”

Despite humans having similar habits, there is some truth in the phrase “everyone is unique.” With a few white hat or black hat tactics, user data can be traced back to the originator. Data prove to be individualized based not only on a user’s unique identity; there are also minute ways to gather personal information, ranging from Internet search history to GPS logs and IP addresses. Companies that want to sell you goods and services purchase the data, and governments and law enforcement agencies do as well.
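A toy example makes the re-identification point. Strip the names from a handful of records and count how many people share each combination of so-called quasi-identifiers; the records below are invented:

```python
# Count records per (zip_code, birth_year, gender) combination. Any
# combination held by exactly one record pinpoints a person, no name
# required: the classic quasi-identifier problem.
from collections import Counter

records = [
    ("40207", 1957, "M"),
    ("40207", 1957, "M"),
    ("40204", 1989, "F"),
    ("40031", 1972, "F"),
]

combos = Counter(records)
unique = [c for c, n in combos.items() if n == 1]
print(f"{len(unique)} of {len(combos)} combinations match exactly one record")
```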

There are stringent privacy regulations in place, but in the face of the almighty dollar and governments bypassing their own laws, it is like spitting in the wind. The scariest fact is that nothing is secret anymore:

“The irony that data brokers claim that their “anonymized” data is risk-free is absurd: Their entire business model and marketing pitch rests on the premise that they can intimately and highly selectively track, understand, and micro target individual people.

This argument isn’t just flawed; it’s also a distraction. Not only do these companies usually know your name anyway, but data simply does not need to have a name or social security number attached to cause harm. Predatory loan companies and health insurance providers can buy access to advertising networks and exploit vulnerable populations without first needing those people’s names. Foreign governments can run disinformation and propaganda campaigns on social media platforms, leveraging those companies’ intimate data on their users, without needing to see who those individuals are.”

Companies and organizations need to regulate themselves, while governments need to pass laws that protect their citizens from bad actors. Self-regulation in the face of dollar signs is like asking a person with a sweet tooth to stop eating sugar. However, if governments concentrated on regulating specific types of data and specific methods of data collection and sharing, rather than reaching for a blanket solution, they could protect users.

Let’s think about the implications. No, let’s not.

Whitney Grace, January 19, 2022

Semantics and the Web: A Snort of Pisco?

November 16, 2021

I read a transcript for the video called “Semantics and the Web: An Awkward History.” I have done a little work in the semantic space, including a stint as an advisor to a couple of outfits. I signed confidentiality agreements with the firms, and even though both have entered the well-known Content Processing Cemetery, I won’t name these outfits. However, I thought of the ghosts of these companies as I worked my way through the transcript. I don’t think I will have nightmares, but my hunch is that investors in these failed outfits may have bad dreams. A couple may experience post-traumatic stress. Hey, I am just suggesting people read the document, not go bonkers over its implications in our thumbtyping world.

I want to highlight a handful of gems I identified in the write up. If I get involved in another world-saving semantic project, I will want to have these in my treasure chest.

First, I noted this statement:

“Generic coding”, later known as markup, first emerged in the late 1960s, when William Tunnicliffe, Stanley Rice, and Norman Scharpf got the ideas going at the Graphics Communication Association, the GCA.  Goldfarb’s implementations at IBM, with his colleagues Edward Mosher and Raymond Lorie, the G, M, and L, made him the point person for these conversations.

What’s not mentioned is that some in the US government became quite enthusiastic. Imagine the benefit of putting tags in text and providing electronic copies of documents. Much better than loose-leaf notebooks. I wish I had a penny for every time I heard this statement. How does the government produce documents today? The only technology not in wide use is hot metal type. It’s been — what? — a half century?

Second, I circled this passage:

SGML included a sample vocabulary, built on a model from the earliest days of GML. The American Association of Publishers and others used it regularly.

Indeed wonderful. The phrase “slicing and dicing” captured the essence of SGML. Why have human editors? Use SGML. Extract chunks. Presto! A new book. That worked really well except for one drawback: the wild and crazy “books” that proliferated were tough to sell. Experts in SGML were and remain a rare breed of cat. There were SGML ecosystems, but adding smarts to content was and remains a work in progress. Yes, I am thinking of Snorkel too.

Third, I like this observation too:

Dumpsters are available in a variety of sizes and styles.  To be honest, though, these have always been available.  Demolition of old projects, waste, and disasters are common and frequent parts of computing.

The Web as well as social media are dumpsters. Let’s toss in TikTok type videos too. I think meta meta tags can burn in our cherry red garbage container. Why not?

What do these observations have to do with “semantics”?

  1. Move from SGML to XML. Much better. Allow XML to run some functions. Yes, great idea. (A tiny markup example follows this list.)
  2. Create a way to allow content objects to live anywhere and just pull them together. Was this the precursor to microservices?
  3. One major consequence of tagging (or the lack of it, or just really lousy tagging and marking up, with software allegedly doing the heavy lifting) is an active demand for a way to “make sense” of content. The problem is that an increasing amount of content is non-textual. Oops.
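For anyone who never marked up a document, here is a minimal sketch of the tagged-text idea with invented element names; the structure rides along with the content, which is what made the slicing and dicing possible:

```python
# Generic coding in miniature: tags carry structure and metadata, so
# software can slice out content objects without caring about layout.
import xml.etree.ElementTree as ET

doc = """
<report>
  <title>Dome Stress Analysis</title>
  <section topic="masonry">The herringbone brickwork carries the load.</section>
</report>
"""

root = ET.fromstring(doc)
print(root.find("title").text)            # extract one content object
print(root.find("section").get("topic"))  # metadata travels with the text
```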

What’s the fix? The semantic Web revivified? The use of pre-structured, by golly, correct markup editors? A law that says students must learn how to mark up and tag? (Problem: Schools don’t teach math and logic anymore. Oh, well, there’s an online course for those who don’t understand consistency and rules.)

The write up makes clear there are numerous opportunities for innovation. And then there is the non-textual information. Academics have some interesting ideas. Why not go SAILing or revisit the world of semantic search?

Stephen E Arnold, November 16, 2021

Exposing Big Data: A Movie Person Explains Fancy Math

April 16, 2021

I am not “into” movies. Some people are. I knew a couple of Hollywood types, but I was dumbfounded by their thought processes. One of these professionals dreamed of crafting a motion picture about riding a boat powered by the wind. I think I understand because I skimmed one novel by Herman Melville, who grew up with servants in the house. Yep, in touch with the real world of fish and storms at sea.

However, perhaps an exception is necessary. A movie type offered some interesting ideas in the BBC “real” news story “Documentary Filmmaker Adam Curtis on the Myth of Big Data’s Predictive Power: It’s a Modern Ghost Story.” Note: This article is behind a paywall designed to compensate content innovators for their highly creative work. You have been warned.

Here are several statements I circled in bright True Blue marker ink:

  • “The best metaphor for it is that Amazon slogan, which is: ‘If you like that, then you’ll like this,’” said [Adam] Curtis [the documentary film maker].
  • [Adam Curtis] pointed to the US National Security Agency’s failure to intercept a single terrorist attack, despite monitoring the communications of millions of Americans for the better part of two decades.
  • [Big data and online advertising are] “a bit like sending someone with a flyer advertising pizzas to the lobby of a pizza restaurant,” said Curtis. “You give each person one of those flyers as they come into the restaurant and they walk out with a pizza. It looks like it’s one of your flyers that’s done it. But it wasn’t – it’s a pizza restaurant.”

Maybe I should pay more attention to the filmic mind. These observations strike me as accurate.

Predictive analytics, fancy math, and smart software? Ghosts.

But what if ghosts are real?

Stephen E Arnold, April 16, 2021
