AI: Multi Modal Wu Dao

June 24, 2021

Last summer OpenAI’s GPT-3 text generator was the impressive AI of the season, creating passages of text most could not discern from human-penned prose. Now we are told a model out of the Beijing Academy of Artificial Intelligence (BAAI) has surpassed that software. According to Yahoo, “China’s Gigantic Multi-Modal AI is No One-Trick Pony.” The new deep learning model, named Wu Dao, can emulate human writers as well as GPT-3 and then some. Reporter Andrew Tarantola asserts:

“First off, Wu Dao is flat out enormous. It’s been trained on 1.75 trillion parameters (essentially, the model’s self-selected coefficients) which is a full ten times larger than the 175 billion GPT-3 was trained on and 150 billion parameters larger than Google’s Switch Transformers. In order to train a model on this many parameters and do so quickly — Wu Dao 2.0 arrived just three months after version 1.0’s release in March — the BAAI researchers first developed an open-source learning system akin to Google’s Mixture of Experts, dubbed FastMoE. This system, which is operable on PyTorch, enabled the model to be trained both on clusters of supercomputers and conventional GPUs. This gave FastMoE more flexibility than Google’s system since FastMoE doesn’t require proprietary hardware like Google’s TPUs and can therefore run on off-the-shelf hardware — supercomputing clusters notwithstanding. With all that computing power comes a whole bunch of capabilities. Unlike most deep learning models which perform a single task — write copy, generate deep fakes, recognize faces, win at Go — Wu Dao is multi-modal, similar in theory to Facebook’s anti-hatespeech AI or Google’s recently released MUM.”

Tarantola checked out the researchers’ recent demo. While OpenAI taught us that software can now mimic news stories and similar content, Wu Dao takes language further by generating essays, poems, and couplets in traditional Chinese. It can also take clues from static images to write relevant text and can create almost photorealistic images from natural-language descriptions. With the help of Microsoft’s XiaoIce, Wu Dao can also power virtual idols and predict 3D protein structures a la AlphaFold. Talk about use cases from different ends of the spectrum. BAAI chair Dr. Zhang Hongjiang declares the key to AI’s future lies in “big models and a big computer.” Perhaps those models can divine a way to minimize their own power consumption and work without alleged biases toward everyone not in CompSci 410.

Cynthia Murrell, June 24, 2021

First, a Cabinet, Then a Laptop? Quantum Computing Hype Escalates

June 21, 2021

I read “Compact Quantum Computer for Server Centers.” The write up explains:

“Our quantum computing experiments usually fill 30- to 50-square-meter laboratories,” says Thomas Monz of the University of Innsbruck. “We were now looking to fit the technologies developed here in Innsbruck into the smallest possible space while meeting standards commonly used in industry.” The new device aims to show that quantum computers will soon be ready for use in data centers. “We were able to show that compactness does not have to come at the expense of functionality,” adds Christian Marciniak from the Innsbruck team.

I think this is an interesting idea. The big radio in homes in the 1920s became the micro circuits in a mobile phone. Tiny is better. Quantum computers are going to become smaller too. A desktop device? Maybe a laptop? How about a mobile phone?

Is it important to skip over issues like software and applications, error rates, and figuring out how to know exactly what the constantly vibrating tiny things are doing?

Trivial issues obviously.

The write up explains that the ion trap in the vacuum chamber has been made smaller. That’s good. What happens if someone gives the device a hard knock? Heat? No problema. Commercial use cases? Certainly. How about word processing or calculating whether it will rain this weekend? Absolutely.

What this write up said to me was, “We are doing good stuff and we need more funding.” How many other EU quantum wizards will cite this work and generate non reproducible and non verifiable results? What? Academics fudging stuff? Never.

Stephen E Arnold, June 21, 2021

Microsoft: Timing and Distraction

June 16, 2021

From my point of view, the defining event of 2021 was the one-two punch of SolarWinds and the Microsoft Exchange Server breaches. I call these “missteps” because the cyber wizards at the Redmond outfit and the legions of cyber security vendors use jargon to talk around compromised systems in ways which are mind boggling. Yep, a “misstep.” Not worth worrying about.

I scanned the research data in “Unsuccessful Tech Projects Get Axed During the Pandemic” and checked off these items with my trusty red ink ballpoint pen. Let’s just assume these data are close enough for horse shoes, shall we?

  • 30 percent of a sample of 700 plus “professionals” say they killed one or more unsuccessful digital transformation projects. Okay, one third failure rate. How’s that work if one is building 100 school buses? Yep, one third go up in flames, presumably killing some of the occupants. Call it 20 children per bus when one detonates. That works out to 600 no longer functioning children. Acceptable? Okay for software, just not for school buses.
  • 65 percent of the sample are going to try and try again. Improving methods? No data on that, so we can figure one third of these digital adventures will drive off a cliff I assume.
  • Making the right decision is almost a guess. The article’s data suggest that 29 percent of those in the sample “struggle to keep pace with technological developments.” So let’s do marketing, maybe hand waving, or just some Jazz Age razzle dazzle, right?

That’s what I thought when I read “Windows 11 Has Leaked Online: What the Next Version of Windows Looks Like.” This write up does not talk about addressing the software update methods, the trust mechanisms within the Windows ecosystem, or the vulnerabilities of decades old practices for libraries and dynamic link libraries, among others. Nope. It’s this in my opinion:

[image]

Image source: Noemi P.

A new look, snappy dance moves, and distraction. The tune is probably going to be a toe tapper. The only hitch is that the SolarWinds and Microsoft Exchange Server missteps might throw the marketing routine off beat.

Stephen E Arnold, June 16, 2021

Are 15 Square Feet Enough? A Question for the Google

June 15, 2021

I flipped through the dead tree edition of the outstanding sun-like Wall Street Journal this morning (June 15, 2021). And what did I find inside the edition which sometimes makes its way to Harrod’s Creek, Kentucky? The answer was a four page ad in the Murdoch infused Wall Street Journal. Each page is about 23 inches by 24 inches. That works out to 552 square inches (give or take a few due to variances in trim sizes) per page. With four pages, the total is about 2,208 square inches of dead tree space, which is larger than the vinyl floor protector under my discount store office chair and larger than the one under my assistant’s chair. Which is better: vinyl floor protectors or dead tree paper? I am on the fence.
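For anyone who wants to check the dead tree arithmetic behind the headline's 15 square feet, here is a quick sketch. The page dimensions are the approximate ones given above; the variable names are mine, and trim sizes vary, so treat the result as a ballpark figure.

```python
# Approximate page dimensions from the post; actual trim sizes vary.
PAGE_WIDTH_IN = 23
PAGE_HEIGHT_IN = 24
PAGES = 4
SQ_IN_PER_SQ_FT = 12 * 12  # 144 square inches in a square foot

page_area_sq_in = PAGE_WIDTH_IN * PAGE_HEIGHT_IN   # area of one page
total_sq_in = page_area_sq_in * PAGES              # area of the whole ad
total_sq_ft = total_sq_in / SQ_IN_PER_SQ_FT        # converted to square feet

print(f"{page_area_sq_in} sq in per page, {total_sq_in} sq in total, "
      f"about {total_sq_ft:.1f} sq ft")
```

The total lands a bit above 15 square feet, which is where the question in the headline comes from.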

[image: a google ad 61521]

Above is a thumbnail of the four page Google ad in the June 15, 2021, Wall Street Journal.

What’s the message in the ad? At first glance, the ad is pitching a free Google service. Some people perceive Google free services as having a modest cost. Here in Harrod’s Creek, we love the freebies from the Google. In this particular case, Google is pitching this message:

If you want to show the world how it’s done, you have to change the way you do things.

Change is hard, and it depends on whether the change is motivated internally like the good old but out of fashion notion of self improvement, gumption, and Go West, young man! Or whether the change is imposed on one; for example, Rupert Murdoch had constraints on unauthorized telephone tapping imposed on his otherwise outstanding organization. There is also an Orwellian type change which can be more difficult for those lacking critical thinking skills to identify. A good example of this is assertions made under oath in the US Congress that certain high technology companies will do better. The companies then keep on keepin’ on as some in Harrod’s Creek say.

The interior two pages convey this message:

Say hello to Google Workspace.

The text explains that Google Workspace is pretty much like Salesforce Slack, Microsoft Teams, the ever wonderful and avant garde Cisco Webex service, and the somewhat popular Zoom, among others. The most interesting passage in the advertisement is the explanation of “how we do it here too”:

All 100K+ Google employees – from engineering, to marketing, to the PhDs in the quantum lab – rely on Google Workspace every day. Our scientists leave comments in research docs, and the security team keeps our inboxes clear of spam and viruses. Google’s entire business is riding on it, just like yours. Because no matter the task at hand, when your customers are depending on you, Google Workspace is how it’s done.

What came to mind was “how it’s done” in staff management. Dare I mention Dr. Timnit Gebru? No, I don’t dare. What about the subtle management vibes at DeepMind? Nope, I know zero about that too. What about … Nope, no more of this management thinking. Life’s too short. (I wonder if critiques of Dr. Gebru’s AI ethics paper were handled within this Workspace thing?)

The final page lists alleged customers (users) of Google Workspace. These include Grandma’s, Operation BBQ Relief, and Ms. Kim’s class, among others.

Some observations are warranted by this lavish presentation of the Google Workspace message in the dead tree edition of a traditional newspaper nestled within the woke empire of News Corp. Herewith:

  1. I find it amusing to think that the world’s largest online advertising outfit is pitching its Workspace product in a medium which is centuries old, non digital, and mostly reporting that water which has passed under the bridge over information.
  2. I would like to see the ad reach data and conversion estimate for pulling new customers based on this rather impressive expanse of newspaper. My hunch is that the Google wanted to send a message, probably to Microsoft. Why not email the outstanding leader working hard to eliminate cyber security risks?
  3. The organizations mentioned as customers (users) are interesting. Links to case examples of what’s shaking at Grandma’s or Ms. Kim’s class would be fascinating. The wonky little icons in the ad are interesting but “yinka” was a bit of a puzzle to me.

Net net: Is Google changing or does Google want others to change from Microsoft Teams to Workspace? My hunch is that Google is assuming that the Greek god Koalemos will make their endeavor a home run.

Stephen E Arnold, June 15, 2021

Don Quixote Lives: Another Assault on Data Silos

June 3, 2021

Keep in mind that in some organizations data silos are necessary: Poaching colleagues (hello, big pharma), government security requirements (yep, the top Beltway bandits too), and common sense (lawyers heading to trial with a judge who has a certain reputation). Data silos are like everywhere. There were a couple of firms which billed themselves as “silo breakers.” How is that working out? The answer to the question resides in an analyst’s “data silo.” There you go.

Security is the biggest reason much-maligned data silos, also known as fragmented data, persist. Google now hopes to change that, we learn from “Google Cloud Launches New Services for a Unified Data Platform” at IT Brief. The company asserts its new solutions mean organizations can now forget about data silos and securely analyze their data in the cloud. We have yet to see detailed evidence for that claim, however. We will continue to keep our sensitive data separated, thank you very much.

Writer Ryan Morris-Reade describes the three new services upon which Google is pinning its cloudy unification hopes:

  • “Datastream, a new serverless Change Data Capture and replication service. Datastream enables customers to replicate data streams in real-time, from Oracle and MySQL databases to Google Cloud services such as BigQuery, Cloud SQL, Google Cloud Storage, and Cloud Spanner. This solution allows businesses to power real-time analytics, database replication, and event-driven architectures.
  • Analytics Hub, a new capability that allows companies to create, curate, and manage analytics exchanges securely and in real-time. With Analytics Hub, customers can share data and insights, including dynamic dashboards and machine learning models securely inside and outside their organization.
  • Dataplex, an intelligent data fabric that provides an integrated analytics experience, bringing the best of Google Cloud and open-source together, to enable users to rapidly curate, secure, integrate, and analyze their data at scale. Automated data quality allows data scientists and analysts to address data consistency across the tools of their choice, to unify and manage data without data movement or duplication. With built-in data intelligence using Google’s best-in-class AI and Machine Learning capabilities, organizations spend less time with infrastructure complexities and more time using data to deliver business outcomes.”
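For readers unfamiliar with the term, Change Data Capture boils down to shipping a stream of insert, update, and delete events from a source database and replaying them on a copy. The sketch below shows that general idea only. It is not Datastream's API; the event format, the `apply_event` helper, and the sample rows are all hypothetical.

```python
# Hypothetical change events, in commit order, as a CDC feed might emit them.
EVENTS = [
    {"op": "insert", "key": 1, "row": {"name": "Ada", "city": "London"}},
    {"op": "insert", "key": 2, "row": {"name": "Grace", "city": "New York"}},
    {"op": "update", "key": 1, "row": {"city": "Cambridge"}},
    {"op": "delete", "key": 2},
]


def apply_event(replica, event):
    """Apply one change event to an in-memory replica keyed by primary key."""
    op, key = event["op"], event["key"]
    if op == "insert":
        replica[key] = dict(event["row"])
    elif op == "update":
        # Merge only the changed columns into the existing row.
        replica.setdefault(key, {}).update(event["row"])
    elif op == "delete":
        replica.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op}")


replica = {}
for event in EVENTS:
    apply_event(replica, event)

print(replica)
```

Note that the replica is a second copy of the data. Replication moves rows around; whether that removes a silo or just builds another one is a separate question.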

We learn consulting firm Deloitte is helping Google implement these solutions. That company’s global chief commercial officer emphasizes the tools provide “enhanced data experiences” for companies with siloed data by simplifying implementation and management. We are also told that Equifax and Deutsche Bank trust Google Cloud with their data. I guess that is supposed to mean we should, too.

But Google is quite the fan of data silos. Remember “universal search”? Google has separate indexes for news, scholarly information, and other content types. Universal implies breaking down “data silos.” But it is easier to talk about solving the data silo problem than delivering.

And what about Deloitte? This firm was fined about $20 million US because it had data silos which partitioned some partners from the work of the professionals working for Autonomy.

Yep, data silos. Persistent and embarrassing when someone thinks of “universal search” and Deloitte’s internal oversight methods.

Cynthia Murrell, June 03, 2021

Making Life Easier for Professional Publishers: A Call for More Blatant Fraud

May 31, 2021

I enjoyed “Please Commit More Blatant Academic Fraud.” The intent is to highlight the disgusting underbelly of academic underbellies of naked mole rats. The author picks up on the fraudulent peer cheerleading for research related to artificial intelligence, but when tenure is at stake, I wager that professors teaching ethics can be manipulation minded as well. It just depends upon how one frames the argument, right?

The essay has a very interesting quote; to wit:

It would, of course, be quite difficult to actually distinguish the papers published fraudulently from those published “legitimately”. (That fact alone tells you all you really need to know about the current state of AI research.)

I want to add a slightly different quantum entanglement to the nuclear nature of the academic fraud issue. The professional publishers must be considered. These are the outstanding executives who often publish research known to be wonky. The professional publishers create journals filled with hocus pocus, wrapped in the magic of peer reviewing, and touted as the beacons of “real” information.

If anyone wants more and crazier research written by authors and institutions willing to pay assorted fees to get their estimable contributions to knowledge published, it is the publishers. When an author makes a change, the outstanding professional publishers often charge to fix up a passage. Want reprints? Just get out that electronic payment system. Order away.

The professional publishers are struggling to get libraries to buy, subscribe, license, and renew automatically if possible. More junk research and increased content manipulation will improve the professional publishing system itself.

Imagine. Bogus research in medicine, social science, and quantum computing. When something actually reproducible and substantive becomes available, a researcher will have to spend more time on for fee commercial databases, apply more research assistant labor, and scan more tweets to figure out what’s “real” and what’s fake.

The advancement of knowledge is enabled, and even the professional publishers can get behind the call for action expressed in “Please Commit More Blatant Academic Fraud.” Marketing is more important for everyone it seems now.

Stephen E Arnold, May 31, 2021

Marketers Assert AI Perfect for eDiscovery

May 24, 2021

Automated eDiscovery firm ZyLab makes a case for AI in the law firm with its post, “A Chief Legal Officer’s Guide to AI-Based eDiscovery and Analytics,” shared at JDSupra. Writer Jeffrey Wolff begins by outlining the job of a CLO. He notes lawyers in that position tend to be most comfortable with the “traditional” duties of risk mitigation, monitoring legal matters, and minding laws and regulations. According to a Deloitte study, however, executives would like to see their CLOs work more on guiding the company culture and squaring legal concerns with company goals. Wolff suggests outsourcing this part of the CLO role. (We observe his company happens to offer such expert professional services.)

After that pitch, we learn why CLOs should consider AI. We’re told:

“AI excels at sifting through massive quantities of data to identify specific terms or concepts, even when those concepts are expressed in different terms. Because an AI system can scan data faster than any human and doesn’t get tired or distracted, it can evaluate data sets faster and more easily than a human while maintaining accuracy. A machine can also manage repetitive, laborious tasks quickly and effectively without falling prey to boredom or wandering attention. Legal departments can therefore use AI to streamline processes, reduce costs, and increase their productivity. Given that ‘nearly two-thirds (63 percent) of [legal department] respondents say recurring tasks and data management constraints prevent their legal teams from creating value at their organization,’ AI offers a way for CLOs to offload those time-consuming responsibilities and focus on the strategy and growth priorities that matter to the company’s future.”

A good place to start is with ZyLab’s specialty, eDiscovery. That area does involve a mind-boggling amount of data and AI can be quite valuable, even indispensable for larger firms. Wolff describes six ways AI tools can help with corporate eDiscovery: completing early case assessment, structuring data through concept clustering, using Technology-Assisted Review, redacting personal information, generating eDiscovery analytics, and managing eDiscovery costs. See the write-up for more on each of these tasks.

The company’s technology dates from 1983 (38 years ago). Today’s ZyLab supplies eDiscovery and Information Governance tech to large corporations, government organizations, regulatory agencies, and law firms around the world. The company launched with its release of the first full-text retrieval software for the PC. Its eDiscovery/Information Management platform was introduced in 2010. ZyLab is based in Amsterdam and has embraced the lingo of smart software like other eDiscovery firms.

Cynthia Murrell, May 24, 2021

What the Colonial Pipeline Affair Has Disclosed

May 21, 2021

I worked through some of the analyses of the Colonial Pipeline event. You can get the “predictive analytics” view in Recorded Future’s marketing-centric blog post “DarkSide Ransomware Gang Says It Lost Control of Its Servers & Money a Day after Biden Threat.” You can get the “digital currency can be deanonymized” view in the marketing-oriented “Elliptic Follows the Bitcoin Ransoms Paid by Colonial Pipeline and Other Dark Side Ransomware Victims.” You can also get the marketing-oriented “Colonial Pipeline Ransomware Attack: What We Know So Far.” Please, read these after-action reports, pull out nuggets of information, and learn how well hindsight works. What’s hindsight? Here’s a definition:

the ability to understand an event or situation only after it has happened (Cambridge.org)

The definition edges close to the situation in which cyber security (not Colonial) finds itself; namely, I have seen no names of the individuals responsible. I have seen no identification of the sources of funding and support for the group responsible. I have seen no print outs illustrating the formation of the attack plan or of the log data making explicit an attack was underway.

The cyber security industry is a club, and the members of the club know their in-crowd has a license to send invoices. Not even IBM in its FUD days could have created a more effective way to sell products and services. These range from real time threat intelligence, to predictive reports explaining that lightning is about to strike, to smart autonomous cyber nervous systems sounding alarms.

Nope, not that I have heard.

Here are some issues which Colonial raised when I participated in a conference call with a couple of LE and intel types less than 24 hours ago:

  1. The existing threat intelligence, Dark Web scanners, and super AI infused whiz bang systems don’t work. They missed SolarWinds, Exchange Server, and now the Colonial Pipeline affair. Yikes. Don’t work? Right. Don’t work. If even one of the cyber security systems “worked”, then none of these breaches would have been possible. What did I hear in Harrod’s Creek? Crickets.
  2. In the case of Colonial, how much of the problem was related to business matters, not the unknown, undetected wizards of Dark Side? Who knows if the bad actors were the problem or if Colonial found the unpleasantness an opportunity for some breathing room for other activities? Where are the real journalists from Bloomberg, the New York Times, the Wall Street Journal, the Washington Post, et al.? Yep, sources produced nothing and now the after action analyses will flow for a while.
  3. What about the specialist firms clustered in Herzliya? What about the monitoring and alerting systems among Cambridge, Cheltenham, and London? What about the outfits clustered near government centers in Brussels, Berlin, and Prague? I have not heard or seen anything in the feeds I monitor. Zippo.

Let’s step back.

The current cyber security set up is almost entirely reactive. Any breach is explained in terms of China and Russia. Some toss in Iran and North Korea. Okay, add them to the list of malefactors. That does not change the calculus of these escalating cyber breaches.

The math looks like this: 1 + 0 = 32

Let me explain:

The “1” represents a cyber breach

The “0” represents the failure of existing cyber security systems to notice and/or block the bad actor’s method

The 32 means the impact is exponential—in favor of the bad actors.

With no meaningful proactive measures working in a reliable fashion, the cyber security systems now in place are sitting ducks.

Somebody said, “Our reaction to a situation literally has the power to change the situation itself.” Too bad this aphorism is dead wrong.

When the reactions are twisted into marketing opportunities and the fix does not work, where are we? I would suggest in a place that warrants more than sales lingo, jargon, and hand waving.

The talk about cyber security and threat intelligence sounds similar to the phrase, “Please, take off your shoes.”

Stephen E Arnold, May 21, 2021

Microsoft Partners Up for Smarter Security

May 13, 2021

I noted “Microsoft Partners with Darktrace to Help Customers Combat Cyber Threats with AI.” You may know that Microsoft has been the subject of some attention. No, I am not talking about Windows 10 updates which cause printers to become doorstops. Nope. I am not talking about the fate of a leaner, meaner version of Windows. Yep, I am making a reference to the SolarWinds’ misstep and the alleged manipulation of Microsoft Exchange Server to create a reprise of “waiting on line for fuel.” This was a popular side show in the Washington, DC, area in the mid-1970s.

How does Microsoft address its security PR challenge? There are white papers from Microsoft threat experts. There are meetings in DC ostensibly about JEDI but which may — just by happenstance — bring up the issue of security. No big deal, of course. And Microsoft forms new security-centric partnerships.

The partner mentioned in the write up is Darktrace. The company relies on technology somewhat related to the systems and methods packaged in the Autonomy content processing system. That technology included Bayesian methods, was at one time owned by Cambridge Neurodynamics, and licensed to Autonomy. (A summary of Autonomy is available at this link. The write up points out that Bayesian methods are centuries old and often criticized because humans have to set thresholds for some applications of the numerical recipes. Thus, outputs are not “objective” and can vary as the method iterates.) Darktrace’s origins are in Cambridge and some of the firm’s funding came from Michael Lynch-affiliated Invoke Capital. The firm’s Web page states:

Founded by celebrated technologist and entrepreneur, Dr Mike Lynch OBE, Invoke Capital founds, invests in and advises fast-growing fundamental technology companies in Europe. With deep expertise in identifying and commercializing artificial intelligence research and a close relationship with the University of Cambridge, Invoke exists to realize the commercial possibilities of Britain’s extraordinary science and deep technology base. Since 2012, Invoke has been instrumental in founding, creating and developing prominent technologies, and then finding the right teams to scale them into global businesses. Invoke’s companies include Darktrace, a world-leading cyber AI company that employs more than 1,500 people globally, Luminance, an award-winning machine learning platform for the legal industry, and AI fraud-detection engine, Featurespace. Invoke exited data-driven medicine experts, Sophia Genetics, in 2020.

(The Register provides a rundown of some of the legal activity associated with Mr. Lynch at this link.)

The item presenting the tie up of Microsoft and Darktrace states:

Microsoft announced today a new partnership with Darktrace, a UK-based cyber security AI firm that works with customers to address threats using what it describes as “self-learning artificial intelligence”. Darktrace’s threat response system is designed to counter insider threats, espionage, supply chain attacks, phishing, and ransomware. The partnership between Microsoft and Darktrace is meant to give organizations an automated way of investigating threats across multiple platforms. Darktrace’s system works by learning the data within a specific environment as well as how users behave. The goal is to tell which activity is benign or malicious.

For more information about Darktrace, one can consult the firm’s Web site. For a different view, an entity with the handle OneWithCommonSense provides his/her assessment of the system. You can find that document (verified online on May 13, 2021) at this link.

Why is this interesting?

  1. The use of a system and method which may be related to how the Autonomy system operates may be an example of how one mathematical method can be extended to a different suite of use cases; specifically, cyber security.
  2. The Darktrace disclosures about its technology make it clear that the technology is in the category of “artificial intelligence” or what I call smart software. Systems and methods which are more efficient, economical, and more effective are reasons why smart software is an important product category to watch.
  3. Darktrace (to my knowledge) may have the capability to recognize and issue an alert about SolarWinds-type incursions. Other cyber security firms’ smart software dropped the ball and many were blindsided by the subsequent Microsoft Exchange Server and shell exploits.

As a side note, Microsoft acquired the Fast Search & Transfer company after there were legal inquiries into the company. That was a company based in Norway. With the Darktrace deal, Microsoft is again looking offshore for a solution to what on the surface seems to be the Achilles’ heel of the company’s product portfolio: Its operating system and related services.

Will Darktrace’s technology address the debilitating foot injury Microsoft has suffered? Worth watching because bad actors are having a field day with free ice cream as a result of the revelations related to Microsoft’s security engineering. Windows Defender may get an injection of a technology that caught Dr. Lynch’s eye. Quick is better in my opinion.

Stephen E Arnold, May 13, 2021

More Search Explaining: Will It Help an Employee Locate an Errant PowerPoint?

May 13, 2021

“Semantics, Ambiguity, and the role of Probability in NLU” is a search-and-retrieval explainer. After half a century of search explaining, one would think that the technology required to enter a keyword and get a list of documents in which the key word appears would be nailed down. Wrong.

“Search” in 2021 embraces many sub disciplines. These range from explicit index terms like the date of a document to more elusive tags like “sentiment” and “aboutness.” Boolean has been kicked to the curb. Users want to talk to search, at least to Alexa and smartphones. Users want smart software to deliver results without the user having to enter a query. When I worked at Booz, Allen & Hamilton, one of my colleagues (I think his name was Harvey Poppel, the smart person who coined the phrase “paperless office”) suggested that someday a smart system would know when a manager walked into his or her office. The smart software would display what the person needed to know for that day. The idea, I think, was that whilst drinking herbal tea, the smart person would read the smart outputs and be more smart when meeting with a client. That was in the late 1970s, and where are we? On Zooms and looking at smartphones. Search is an exercise in frustration, and I think that is why venture firms continue to pour money into ideas, methods, concepts, and demos which have been recycled many times.

I once reproduced a chunk of Autonomy’s marketing collateral in a slide in one of my presentations. I asked those in the audience to guess at what company wrote the text snippet. There were many suggestions, but none was Autonomy. I doubt that today’s search experts are familiar with the lingo of search vendors like Endeca, Verity, InQuira, et al. That’s too bad because the prose used to describe those systems could be recycled with little or no editing for today’s search system prospects.

The write up in question is serious. The author penned the report late last year, but Medium emailed me a link to it a day ago along with a “begging for dollars” plea. Ah, modern online blogs. Works of art indeed.

The article covers these topics as part of the “search” explainer:

  • Ambiguity
  • Understanding
  • Probability

Ambiguity is interesting. One example is a search for the word “terminal.” Does the person submitting the query want information about a computer terminal, a bus terminal, or some other type of terminal; for instance the post terminal on the transformer to my model train set circa 1951? Smart software struggles with this type of ambiguity. I want to point out that a subject matter expert can assign a “field code” to the term and eliminate the ambiguity, but SMEs are expensive and they lose their index precision capability as the work day progresses.
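The field code idea is easy to sketch. In the toy index below, each document carries a subject tag of the sort a subject matter expert might assign. A bare query for “terminal” returns everything, while a query restricted to one field code returns only the intended sense. The documents, the tag names, and the `search` function are all made up for illustration; no vendor's system is being described.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    title: str
    text: str
    field_code: str  # hypothetical SME-assigned subject tag


DOCS = [
    Doc("VT100 manual", "Configuring a serial terminal for the mainframe", "computing"),
    Doc("Greyhound schedule", "Departures from the downtown bus terminal", "transport"),
    Doc("Model train wiring", "Attach the wire to the post terminal on the transformer", "electrical"),
]


def search(term, field_code=None):
    """Return titles of documents containing term, optionally limited to one field code."""
    term = term.lower()
    return [
        d.title
        for d in DOCS
        if term in d.text.lower() and (field_code is None or d.field_code == field_code)
    ]


print(search("terminal"))                          # all three senses come back
print(search("terminal", field_code="transport"))  # only the bus terminal
```

The code is trivial. The expensive part is the human who has to assign `field_code` correctly to every document, all day long.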

To deal with the “terminal” example, the modern system has to understand [a] what the user wants and [b] what the content objects are about. Yep, aboutness. Today’s smart software does an okay job with technical text because jargon like Octanitrocubane allows relatively on point identification of a document relevant to a chemist in Columbus, Ohio. Toss in a chemical structure diagram, and the precision of the aboutness ticks up a notch. However, if you search for a word replete with social justice meaning, smart software often has a difficult time figuring out the aboutness. One example is a reference to Skokie, Illinois. Is that a radical right wing code word or a town loved for Potawatomi linguistic heritage?

Probability is a bit more specific, usually. The idea in search is that numbers can illuminate some of the dark corners of text’s meaning. Examples are plentiful. Curious about Miley Cyrus on SNL and then at the after party? The search engine will display the most probable content based on whatever data is sluiced through the query matcher and stored in a cache. If others looked at specific articles, then, by golly, those articles are likely or highly probable to be just what the person searching for Miley wanted. The difference between ambiguity, understanding, and probability is, in my opinion, part of the problem search vendors face. No one can explain why, after 50 years of SMART, Personal Library Software, STAIRS, et al., finding on point information remains frustrating, expensive, and ineffective.
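The “what others looked at” approach can be reduced to a few lines. The sketch below ranks articles for a query by the share of past clicks each one received. The click log, the article titles, and the function name are invented for the example; real engines blend many more signals, so this shows only the bare popularity idea.

```python
from collections import Counter

# Hypothetical click log: (query, article the searcher clicked).
CLICK_LOG = [
    ("miley cyrus", "SNL performance recap"),
    ("miley cyrus", "SNL performance recap"),
    ("miley cyrus", "After party photos"),
    ("miley cyrus", "SNL performance recap"),
    ("miley cyrus", "Tour dates"),
]


def rank_by_click_probability(query, log):
    """Rank articles for a query by their share of past clicks, highest first."""
    clicks = Counter(article for q, article in log if q == query)
    total = sum(clicks.values())
    if total == 0:
        return []  # no history for this query, so nothing to estimate
    return [(article, count / total) for article, count in clicks.most_common()]


for article, probability in rank_by_click_probability("miley cyrus", CLICK_LOG):
    print(f"{probability:.0%}  {article}")
```

The estimate says what was popular, not what this particular searcher meant. A query with no click history gets an empty list.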

The write up states:

ambiguity was not invented to create uncertainty — it was invented as a genius compression technique for effective communication. And it works like magic, because on the receiving end of the message, there is a genius decoding and decompression technique/algorithm to uncover all that was not said to get at the intended thought behind the message. Now we know very well how we compress our thoughts into a message using a genius encoding scheme, let us now concentrate on finding that genius decoding scheme — a task that we all call now ‘natural language understanding’.

Sounds great. Now try this test. You have a recollection of viewing a PowerPoint a couple of weeks ago at an offsite. You know who the speaker was, and you want the slide with the number of instant messages sent per day on WhatsApp. How do you find that data?

[a] Run a query on your Fabasoft, SearchUnify, or Yext system?

[b] Run a query on Google in the hopes that the GOOG will point you to Statista, a company you believe will have the data?

[c] Send an email to the speaker?

[d] All of the above.

I would just send the speaker a text message and hope for an answer. If today’s search systems were smart, wouldn’t the single PowerPoint slide be in my email anyway? Sure, someday.

Stephen E Arnold, May 13, 2021
