A Semantic Search Use Case: But What about General Business Content with Words and Charts?

September 9, 2022

I am okay with semantic search. The idea is that a relatively standard suite of mathematical procedures delivers “close enough for horse shoes” matches germane to a user’s query. Elastic is now combining key word with some semantic goodness. The idea is that mixing methods delivers more useful results. Is this an accurate statement?

The answer is, “It depends on the use cases.”

How Semantic Search Improves Search Accuracy” explains a use case that is anchored in a technical corpus. Now I don’t want to get crossways with a group of search experts. I would submit that, in general, the vocabulary for scientific, medical, and technical information is more constrained. One does not expect to find “cheugy” or OG* in a write up about octonitrocubane.

In my limited experience, what happens is that a constrained corpus allows the developer of a finding system to use precise taxonomies, and some dinobabies may employ controlled vocabularies like those kicking around old-school commercial databases.

However, what happens when the finding system ingests a range of content objects from tweets, online news services, and TikTok-type content?

The write up says:

One particular advantage of semantic search is the resolution of ambiguous terminology and that all specific subtypes (“children”) of a technical term will be found without the need to mention them in the query explicitly.

Sounds good, particularly for scientific and technical content. What about those pesky charts and graphs? These are often useful, but many times are chock full of fudged data. What about the query, “Octonitrocubane invalid data”? I want to have the search system present links to content which may be in an article. Why? I want to make sure the alleged data set squares with my limited knowledge of statistical principles. Yeah, sorry.

The write up asserts:

A lexical search will deliver back all documents in which “pesticides” is mentioned as the text string “pesticides” plus variants thereof. A semantic search will, in addition to all documents containing the text string “pesticides”, also return documents that contain specific pesticides like bixafen, boscalid, or imazamox.

What about a chemical structure search? I want a document with structure information. Few words, just nifty structures just like the stuff inorganic and organic chemists inhale each day. Sorry about that.

Net net: Writing about search is tough when the specific corpus, the content objects, and the presence of controlled terms in addition to strings in a content object are not spelled out. Without this information, the assertions are a bit fluffy.

And the video thing? The DoD, NIST, and other outfits are making videos. Things that go boom are based on chemistry. Can semantic search find the videos and the results of tests?

Yeah, sure. The PowerPoint deck probably says so. Hands on search experience may not. Search-enabled applications may work better than plain old search jazzed up with close enough for horse shoes methods.

Stephen E Arnold, September 9, 2022

[* OG means original gangster]

Not a Eulogy for the Semantic Web, Maybe an Elegy

August 22, 2022

If you are into the semantic Web, you will enjoy “The Semantic Web is Dead – Long Live the Semantic Web!” The article has examples, some explanation, and a prediction. Spoiler: The Semantic Web will rise like Lazarus, just wearing Amazon normcore clothing. You know: For everyone.

I noted one passage and circled it in blue:

The political economy of academia and its interaction with industry is the origin of our current lack of a functional Semantic Web. Academia is structured in a way that there is very little incentive for anyone to build useable software. Instead you are elevated for rapidly throwing together an idea, a tiny proof of concept, and to iterate on microscopic variations of this thing to produce as many papers as possible. In engineering the devil is in the detail. You really need to get into the weeds before you can know what the right thing to do is. This is simultaneously a devastating situation for industry and academia. Nobody is going to wait around for a team of engineers to finish building a system to write about it in Academia. You’ll be passed immediately by legions of paper pushers. And in industry, you can’t just be mucking about with a system that you might have to throw away. We have structured collaboration as the worst of both worlds. Academics drop in random ideas, and industry try them, find them useless, and move on.

I believe in the Tooth Fairy, not Jack Benny’s Blue Fairy. The write up, like me, is mostly optimistic. I learned:

The Future of the Semantic Web is there, the Semantic Web will rise, but it will not be the Semantic Web of the past. Humanities access to data is of ever increasing importance, and the ability to make resilient and distributed methods of curating, updating and utilizing this information is key. The ideas which drove the creation of the Semantic Web are nowhere near obsolete, even if the tool chain and technologies which have defined it up to day are fated to go the way of the dinosaur.

Now I am a dinobaby. What about Web 3-ized well-formed XML? Great idea, right?

Stephen E Arnold, August 22, 2022

Smart Software and Textualists: Are You a Textualist?

June 13, 2022

Many thought it was simply a massive bad decision from an inexperienced judge. But there was more to it—it was a massive bad decision from an inexperienced textualist judge with an overreliance on big data. The Verge discusses “The Linguistics Search Engine that Overturned the Federal Mask Mandate.” Search is useful, but it must be accompanied by good judgment. When a lawsuit challenging the federal mask mandate came across her bench, federal judge Kathryn Mizelle turned to the letter of the law. Literally. Reporter Nicole Wetsman tells us:

“Mizelle took a textualist approach to the question — looking specifically at the meaning of the words in the law. But along with consulting dictionaries, she consulted a database of language, called a corpus, built by a Brigham Young University linguistics professor for other linguists. Pulling every example of the word ‘sanitation’ from 1930 to 1944, she concluded that ‘sanitation’ was used to describe actively making something clean — not as a way to keep something clean. So, she decided, masks aren’t actually ‘sanitation.’”

That is some fine hair splitting. The high-profile decision illustrates a trend in US courts that has been growing since 2018—basing legal decisions on large collections of texts meant for academic exploration. The article explains:

“A corpus is a vast database of written language that can include things like books, articles, speeches, and other texts, amounting to hundreds of millions of lines of text or more. Linguists usually use corpora for scholarly projects to break down how language is used and what words are used for. Linguists are concerned that judges aren’t actually trained well enough to use the tools properly. ‘It really worries me that naive judges would be spending their lunch hour doing quick-and-dirty searches of corpora, and getting data that is going to inform their opinion,’ says Mark Davies, the now-retired Brigham Young University linguistics professor who built both the Corpus of Contemporary American English and the Corpus of Historical American English. These two corpora have become the tools most commonly used by judges who favor legal corpus linguistics.”

Here is an example of how a lack of careful consideration while using the corpora can lead to a bad decision: the most frequent usage of a particular word (like “sanitation”) is not always the most commonly understood usage. Linguists emphasize the proper use of these databases requires skilled interpretation, a finesse a growing number of justices either do not possess or choose not to use. Such textualists apply a strictly literal interpretation to the words that make up a law, ignoring both the intent of lawmakers and legislative history. This approach means judges can avoid having to think too deeply or give reasons on the merits for their interpretations. Why, one might ask, should we have justices at all when we could just ask a database? Perhaps we are headed that way. We suppose it would save a lot of tax dollars.

See the article for more on legal corpora and how judges use them, textualism, and the problems with this simplified approach. If judges won’t respect the opinion of the very authors of the corpora on how they should and should not be used, where does that leave us?

Cynthia Murrell, June 13, 2022

Economical Semantics: Check Out GitHub

June 9, 2022

A person asked me at lunch this week, “How can we do a sentiment analysis search on the cheap?” My reaction was, “There are many options. Check out GitHub and let it rip.” After lunch, one of my trust researchers reminded me that our files contained a cop of a 2021 article called “Semantic Search on the Cheap.” I re-read the article and noticed that I had circled this passage in October 2021:

Innovative models are being released at a blistering pace, with different architectures and better scores against the benchmarks. The models are almost always bigger networks, with billions of parameters, requiring more and more GPU power. These models are extremely expressive, dynamic and can be fine-tuned to solve a multitude of problems.

Despite the cratering of some tech juggernauts, the pace of marketing in the smart software sector continues to outpace innovation. The write up is interesting because it raised a number of questions on Thursday, June 2, 2022. In a post-lunch stupor, I asked myself these questions:

  1. How many organizations want to know the “sentiment” of a chunk of text. The early sentiment analysis systems operated on word lists. Some of the words and phrases in a customer email, for example, reveal the emotional payload of a customer’s message; for example, “sue you” or “terminate our agreement.” The semantic sentiment has launched a thousand PowerPoints, but what about the emotional payload of an employee complaining on TikTok?
  2. Is 85 percent accuracy the high water mark? If it is, the “accuracy” scores are in what I continue to call the “close enough for horse shoes” playing area. In 100 text passages, the best one can do is generate 15 misses. Lower “scores” mean more misses. This is okay for online advertising, but what about diagnosing a child’s medical condition. Hey, only 15 get worse and that is the best case. No sentiment score for the parents’ communications with a malpractice attorney is necessary.
  3. Is cheap the optimal way to get good “performance”? The answer is that it costs money to go fast. Plus, smart software has a nasty tendency to drift. As the content fed into the system reflects words and concepts not part of the system’s furniture, the camp chairs get mixed up with the love seats. For certain applications like customer service in companies that don’t want to hear from customers, this approach is perfect.

Google wants everyone to Snorkel. Meta or Zuckbook wants everyone to embrace the outputs of FAIR (Facebook Artificial Intelligence Research). Clever, eh? Amazon and Microsoft are players too. We must not forget IBM. Who could ever forget Watson and DataFountain?

Net net: Download stuff from GitHub or another open source repository and get coding. Reserve time for a zippy PowerPoint too.

Stephen E Arnold, June 9, 2022

France and French: The Language of Diplomacy Says “Non, Non” to Gamer Lingo

May 31, 2022

I like France. Years ago I shipped my son to Paris to learn French. He learned other things. So, as a good daddy, I shipped him off to a language immersion school in Poitier. He learned other things. Logically, I responded as a good shepherd of my only son, I shipped him to Jarnac, to work for a cognac outfit. He learned other things. Finally, I shipped him to Montpellier. How was his French? Coming along I think.

He knew many slang terms.

Most of these were unknown to my wife (a French teacher) and me (a dolt from central Illinois). We bought a book of French slang, and it was useless. The French language zips right along: Words and phrases from French speaking Swiss people (mon dieu). Words and phrases from North Africans (what’s the term for head butt?). Words and phrases from the Middle East popular among certain fringe groups.

Over the decades, French has become Franglish. But the rock of Gibraltar (which should be a French rock, according to some French historians) is the Académie française e and its mission (a tiny snippet follows but there is a lot more at this link.

La mission confiée à l’Académie est claire : « La principale fonction de l’Académie sera de travailler, avec tout le soin et toute la diligence possibles, à donner des règles certaines à notre langue et à la rendre pure, éloquente et capable de traiter les arts et les sciences.»

Who cares? The French culture ministry (do we have one in the US other than Disneyland?)

France Bans English Gaming Tech Jargon in Push to Preserve Language Purity” explains:

Among several terms to be given official French alternatives were “cloud gaming”, which becomes “jeu video en nuage”, and “eSports”, which will now be translated as “jeu video de competition”. The ministry said experts had searched video game websites and magazines to see if French terms already existed. The overall idea, said the ministry, was to allow the population to communicate more easily.

Will those French “joueur-animateur en direct” abandon the word “streamer”?

Sure, and France will once again dominate Europe, parts of Africa, and the beaver-rich lands in North America. And Gibraltar? Sure, why not?

Stephen E Arnold, May 30, 2022

Deepset: Following the Trail of DR LINK, Fast Search and Transfer, and Other Intrepid Enterprise Search Vendors

April 29, 2022

I noted a Yahooooo! news story called “Deepset Raises $14M to Help Companies Build NLP Apps.” To me the headline could mean:

Customization is our business and services revenue our monetization model

Precursor enterprise search vendors tried to get gullible prospects to believe a company could install software and employees could locate the information needed to answer a business question. STAIRS III, Personal Library Software / SMART, and the outfit with forward truncation (InQuire) among others were there to deliver.

Then reality happened. Autonomy and Verity upped the ante with assorted claims. The Golden Age of Enterprise Search was poking its rosy fingers through the cloud of darkness related to finding an answer.

Quite a ride: The buzzwords sawed through the doubt and outfits like Delphis, Entopia, Inference, and many others embraced variations on the smart software theme. Excursions into asking the system a question to get an answer gained steam. Remember the hand crafted AskJeeves or the mind boggling DR LINK; that was, document retrieval via linguistic knowledge.

Today there are many choices for enterprise search: Free Elastic, Algolia, Funnelback now the delightfully named Squiz, Fabasoft Mindbreeze, and, of course, many, many more.

Now we have Deepset, “the startup behind the open source NLP framework Haystack, not to be confused with Matt Dunie’s memorable “haystack with needles” metaphor, the intelware company Haystack, or a basic piles of dead grass.

The article states:

CEO Milos Rusic co-founded Deepset with Malte Pietsch and Timo Möller in 2018. Pietsch and Möller — who have data science backgrounds — came from Plista, an adtech startup, where they worked on products including an AI-powered ad creation tool. Haystack lets developers build pipelines for NLP use cases. Originally created for search applications, the framework can power engines that answer specific questions (e.g., “Why are startups moving to Berlin?”) or sift through documents. Haystack can also field “knowledge-based” searches that look for granular information on websites with a lot of data or internal wikis.

What strikes me? Three things:

  1. This is essentially a consulting and services approach
  2. Enterprise becomes apps for a situation, department, or specific need
  3. The buzzwords are interesting: NLP, semantic search, BERT,  and humor.

Humor is a necessary quality which trying to make decades old technology work for distributed, heterogeneous data, email on a sales professionals mobile, videos, audio recordings, images, engineering diagrams along with the nifty datasets for the gizmos in the illustration, etc.

A question: Is $14 million enough?

Crickets.

Stephen E Arnold, April 29, 2022

Semantics Have Become an Architecture: Sounds Good but

December 17, 2021

Semantic Architecture Is A Big Data Cash Grab

A few years ago, big data was the hot topic term and in its wake a surge of techno babble followed. Many technology companies develop their own techno babble to peddle their wares, while some of the jargon does have legitimate means to exist. Epiexpress has the lowdown on one term that does have actual meaning: “What Is Semantic Architecture, And How To Build One?”

The semantic data layer is a system’s brain or hub, because most data can be found through a basic search. It overlays the more complex data in a system. Companies can leverage the semantic layer for business decisions and discover new insights. The semantic layer uses an ontology model and enterprise knowledge graph to organize data. Before building the architecture, one should consider the following:

“1. Defining and listing the organizational needs

When developing a semantic enterprise solution, properly-outlined use cases provide the critical questions that the semantic architecture will answer. It, in turn, gives a better knowledge of the stakeholders and users, defines the business value, and facilitates the definition of measurable success criteria.

2. Survey the relevant business data

Many enterprises possess a data architecture founded on data warehouses, relational databases, and an array of hybrid cloud systems and applications that aid analytics and data analysis abilities
In such enterprises, employing relevant unification processes and model mapping practices based on the enterprise’s use cases, staff skill-sets, and enterprise architecture capabilities will be an effective approach for data modeling and mapping from source systems.

3. Using semantic web standards for ensuring governance and interoperability

When implementing semantic architecture, it is important to use semantic technology such as graph management apps to be middleware. Middleware acts as organizational tools for proper metadata governance. Do not forger that users will need tools to interact with the data, such as enterprise search, chatbots, and data visualization tools.

Semantic babble?

Whitney Grace, December 17, 2021

Semantics and the Web: A Snort of Pisco?

November 16, 2021

I read a transcript for the video called “Semantics and the Web: An Awkward History.” I have done a little work in the semantic space, including a stint as an advisor to a couple of outfits. I signed confidentiality agreements with the firms and even though both have entered the well-known Content Processing Cemetery, I won’t name these outfits. However, I thought of the ghosts of these companies as I worked my way through the transcript. I don’t think I will have nightmares, but my hunch is that investors in these failed outfits may have bad dreams. A couple may experience post traumatic stress. Hey, I am just suggesting people read the document, not go bonkers over its implications in our thumbtyping world.

I want to highlight a handful of gems I identified in the write up. If I get involved in another world-saving semantic project, I will want to have these in my treasure chest.

First, I noted this statement:

“Generic coding”, later known as markup, first emerged in the late 1960s, when William Tunnicliffe, Stanley Rice, and Norman Scharpf got the ideas going at the Graphics Communication Association, the GCA.  Goldfarb’s implementations at IBM, with his colleagues Edward Mosher and Raymond Lorie, the G, M, and L, made him the point person for these conversations.

What’s not mentioned is that some in the US government became quite enthusiastic. Imagine the benefit of putting tags in text and providing electronic copies of documents. Much better than loose-leaf notebooks. I wish I have a penny for every time I heard this statement. How does the government produce documents today? The only technology not in wide use is hot metal type. It’s been — what? — a half century?

Second, I circled this passage:

SGML included a sample vocabulary, built on a model from the earliest days of GML. The American Association of Publishers and others used it regularly.

Indeed wonderful. The phrase “slicing and dicing” captured the essence of SGML. Why have human editors? Use SGML. Extract chunks. Presto! A new book. That worked really well but for one drawback: The proliferation of wild and crazy “books” were tough to sell. Experts in SGML were and remain a rare breed of cat. There were SGML ecosystems but adding smarts to content was and remains a work in progress. Yes, I am thinking of Snorkel too.

Third, I like this observation too:

Dumpsters are available in a variety of sizes and styles.  To be honest, though, these have always been available.  Demolition of old projects, waste, and disasters are common and frequent parts of computing.

The Web as well as social media are dumpsters. Let’s toss in TikTok type videos too. I think meta meta tags can burn in our cherry red garbage container. Why not?

What do these observations have to do with “semantics”?

  1. Move from SGML to XML. Much better. Allow XML to run some functions. Yes, great idea.
  2. Create a way to allow content objects to be anywhere. Just pull them together. Was this the precursor to micro services?
  3. One major consequence of tagging or the lack of it or just really lousy tagging, marking up, and relying of software allegedly doing the heavy lifting is an active demand for a way to “make sense” of content. The problem is that an increasing amount of content is non textual. Ooops.

What’s the fix? The semantic Web revivified? The use of pre-structured, by golly, correct mark up editors? A law that says students must learn how to mark up and tag? (Problem: Schools don’t teach math and logic anymore. Oh, well, there’s an online course for those who don’t understand consistency and rules.)

The write up makes clear there are numerous opportunities for innovation. And the non-textual information. Academics have some interesting ideas. Why not go SAILing or revisit the world of semantic search?

Stephen E Arnold, November 16, 2021

Semantic: Scholar and Search

September 1, 2021

The new three musketeers could be named Semantic, Scholar, and Search. What’s missing is a digital d’Artagnan. What are three valiant mousquetaires up to? Fixing search for scholarly information.

To learn why smart software goes off the rails, navigate to “Building a Better Search Engine for Semantic Scholar.” The essay documents how a group of guardsmen fixed up search which is sort of intelligent and sort of sensitive to language ambiguities like “cell”: A biological cell or “cell” in wireless call admission control. Yep, English and other languages require context to figure out what someone might be trying to say. Less tricky for bounded domains, but quite interesting for essay writing or tweets.

Please, read the article because it makes clear some of the manual interventions required to make search deliver objective, on point results. The essay is important because it talks about issues most search and retrieval “experts” prefer to keep under their kepis. Imagine what one can do with the knobs and dials in this system to generate non-objective and off point results. That would be exciting in certain scholarly fields I think.

Here are some quotes which suggest that Fancy Dan algorithmic shortcuts like those enabled by Snorkel-type solutions; for example:

Quote A

The best-trained model still makes some bizarre mistakes, and posthoc correction is needed to fix them.

Meaning: Expensive human and maybe machine processes are needed to get the model outputs back into the realm of mostly accurate.

Quote B

Here’s another:

Machine learning wisdom 101 says that “the more data the better,” but this is an oversimplification. The data has to be relevant, and it’s helpful to remove irrelevant data. We ended up needing to remove about one-third of our data that didn’t satisfy a heuristic “does it make sense” filter.

Meaning: Rough sets may be cheaper to produce but may be more expensive in the long run. Why? The outputs are just wonky, at odds with what an expert in a field knows, or just plain wrong. Does this make you curious about black box smart software? If not, it should.

Quote C

And what about this statement:

The model learned that recent papers are better than older papers, even though there was no monotonicity constraint on this feature (the only feature without such a constraint). Academic search users like recent papers, as one might expect!

Meaning: The three musketeers like their information new, fresh, and crunchy. From my point of view, this is a great reason to delete the backfiles. Even thought “old” papers may contain high value information, the new breed wants recent papers. Give ‘em what they want and save money on storage and other computational processes.

Net Net

My hunch is that many people think that search is solved. What’s the big deal? Everything is available on the Web. Free Web search is great. But commercial search systems like LexisNexis and Compendex with for fee content are chugging along.

A free and open source approach is a good concept. The trajectory of innovation points to a need for continued research and innovation. The three musketeers might find themselves replaced with a more efficient and unmanageable force like smart software trained by the Légion étrangère drunk on digital pastis.

Stephen E Arnold, September 1, 2021

SEO Relevance Destroyers and Semantic Search

August 18, 2021

Search Engine Journal describes to SEO professionals how the game has changed since early days, when it was all about keywords and backlinks, in “Semantic Search: What it Is & Why it Matters.” Writer Aleh Barysevich emphasizes:

“Now, you need to understand what those keywords mean, provide rich information that contextualizes those keywords, and firmly understand user intent. These things are vital for SEO in an age of semantic search, where machine learning and natural language processing are helping search engines understand context and consumers better. In this piece, you’ll learn what semantic search is, why it’s essential for SEO, and how to optimize your content for it.”

Semantic search strives to comprehend each searcher’s intent, a query’s context, and the relationships between words. The increased use of voice search adds another level of complexity. Barysevich traces Google’s semantic search evolution from 2012’s Knowledge Graph to 2019’s BERT. SEO advice follows, including tips like these: focus on topics instead of keywords, optimize site structure, and continue to offer authoritative backlinks. The write-up concludes:

“Understanding how Google understands intent in intelligent ways is essential to SEO. Semantic search should be top of mind when creating content. In conjunction, do not forget about how this works with Google E-A-T principles. Mediocre content offerings and old-school SEO tricks simply won’t cut it anymore, especially as search engines get better at understanding context, the relationships between concepts, and user intent. Content should be relevant and high-quality, but it should also zero in on searcher intent and be technically optimized for indexing and ranking. If you manage to strike that balance, then you’re on the right track.”

Or one could simply purchase Google ads. That’s where traffic really comes from, right?

Cynthia Murrell, August 17, 2021

Next Page »

  • Archives

  • Recent Posts

  • Meta