The Only Dataset Search Tool: What Does That Tell Us about Google?

April 11, 2024

This essay is the work of a dumb dinobaby. No smart software required.

If you like semi-jazzy, academic write ups, you will revel in “Discovering Datasets on the Web Scale: Challenges and Recommendations for Google Dataset Search.” The write up appears in a publication associated with Jeffrey Epstein’s favorite university. It may be worth noting that MIT and Google have teamed up to offer a free course in artificial intelligence. That is the next big thing, which does hallucinate at times while creating considerable marketing angst among the techno-giants jousting to emerge as the go-to source of the technology.

Back to the write up. Google created a search tool to allow a user to locate datasets accessible via the Internet. There are more than 700 data brokers in the US. These outfits will sell data to most people who can pony up the cash. Examples range from six figure fees for the Twitter stream to a few hundred bucks for boat license holders in states without much water.

The write up says:

Our team at Google developed Dataset Search, which differs from existing dataset search tools because of its scope and openness: potentially any dataset on the web is in scope.


A very large, money oriented creature enjoins a worker to gather data. If someone asks, “Why?”, the monster says, “Make up something.” Thanks MSFT Copilot. How is your security today? Oh, that’s too bad.

The write up does the academic thing of citing articles which talk about data on the Web. There is even a table which organizes the types of data discovery tools. The categorization of general and specific is brilliant. Who would have thought there were two categories of vertical search engine focused on Web-accessible data? I thought there was just one category; namely, gettable. The idea is that if the data are exposed, take them. Asking permission just costs time and money. The idea is that one can apologize and keep the data.

The article includes a Googley graphic. The French portal, the Italian “special” portal, and the Harvard “dataverse” are identified. Were there other Web accessible collections? My hunch is that Google’s spiders suck down, as one famous Googler said, “all” the world’s information. I will leave it to your imagination to fill in other sources for the dataset pages. (I want to point out that Google has some interesting technology related to converting data sets into normalized data structures. If you are curious about the patents, just write benkent2020 at yahoo dot com, and one of my researchers will send along a couple of US patent numbers. Impressive system and method.)

The section “Making Sense of Heterogeneous Datasets” is peculiar. First, the Googlers discovered the basic fact of data from different sources: the data structures vary. Think in terms of grapes and deer droppings. Second, the data cannot be “trusted.” The team writing the paper offers no fix for this issue. Third, the authors appear to be unaware of the patents I mentioned, particularly the useful example about gathering and normalizing data about digital cameras. The method applies to other types of processed data as well.

I want to jump to the “beyond metadata” idea. This is the mental equivalent of “popping” up a perceptual level. Metadata are quite important and useful. (Isn’t it odd that Google strips high value metadata from its search results; for example, time and date?) The authors of the paper work hard to explain that the Google approach to dataset search adds value by grouping, sorting, and tagging with information not in any one data set. This is common sense, but the Googley spin on this is to build “trust.” Remember: This is an alleged monopolist engaged in online advertising and co-opting certain Web services.

Several observations:

  1. This is another of Google’s high-class PR moves. Hooking up with MIT and delivering razz-ma-tazz about identifying spiderable content collections in the name of the greater good is part of the 2024 Code Red playbook, it seems. From humble brags about smart software to crazy assertions like quantum supremacy, today’s Google is a remarkable entity.
  2. The work on this “project” is divorced from time. I checked my file of Google-related information, and I found no information about the start date of a vertical search engine project focused on spidering and indexing data sets. My hunch is that it has been in the works for a while, although I can pinpoint 2006 as a year in which Google’s technology wizards began to talk about building master data sets. Why no time specifics?
  3. I found the absence of AI talk notable. Perhaps Google does not think a reader will ask, “What’s with the use of these data?” I can’t use this tool, so why spend the time, effort, and money to index information from a country like France, which is not one of Google’s biggest fans? (Paris was, however, the roll out choice for the answer to Microsoft and ChatGPT’s smart software announcement. Plus that presentation featured incorrect information, as I recall.)

Net net: I think this write up, with its quasi-academic blessing, is a bit of advance information to use in the coming wave of litigation about Google’s use of content to train its AI systems. This is just a hunch, but there are too many weirdnesses in the academic write up to write off as intern work or careless research writing, which is more difficult to get away with in the wake of the stochastic monkey dust up.

Stephen E Arnold, April 11, 2024

Backpressure: A Bit of a Problem in Enterprise Search in 2024

March 27, 2024


I have noticed numerous references to search and retrieval in the last few months. Most of these articles and podcasts focus on making an organization’s data accessible. That’s the same old story told since the days of STAIRS III and other dinobaby artifacts. The gist of the flow of search-related articles is that information is locked up or silo-ized. Using a combination of “artificial intelligence,” “open source” software, and powerful computing resources — problem solved.


A modern enterprise search content processing system struggles to keep pace with the changes to already processed content (the deltas) and the flow of new content in a wide range of file types and formats. Thanks, MSFT Copilot. You have learned from your experience with Fast Search & Transfer file indexing it seems.

The 2019 essay “Backpressure Explained — The Resisted Flow of Data Through Software” is pertinent in 2024. The essay, written by Jay Phelps, states:

The purpose of software is to take input data and turn it into some desired output data. That output data might be JSON from an API, it might be HTML for a webpage, or the pixels displayed on your monitor. Backpressure is when the progress of turning that input to output is resisted in some way. In most cases that resistance is computational speed — trouble computing the output as fast as the input comes in — so that’s by far the easiest way to look at it.

Mr. Phelps identifies several types of backpressure. These are:

  1. More info to be processed than a system can handle
  2. Reading and writing file speeds are not up to the demand for reading and writing
  3. Communication “pipes” between and among servers are too small, slow, or unstable
  4. A group of hardware and software components cannot move data where it is needed fast enough.
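
The first type — a producer outrunning a consumer — can be sketched with a bounded queue. This is a minimal illustration, not something from Mr. Phelps’s essay; the names and the artificial delay are invented for the example. When the queue fills, the producer blocks, which is backpressure made visible:

```python
import queue
import threading
import time

# A bounded queue: when full, put() blocks, pushing backpressure onto the producer.
jobs = queue.Queue(maxsize=5)
processed = []

def producer():
    for doc_id in range(20):
        jobs.put(doc_id)   # blocks whenever the consumer falls behind
    jobs.put(None)         # sentinel: no more work

def consumer():
    while True:
        doc_id = jobs.get()
        if doc_id is None:
            break
        time.sleep(0.01)   # simulate slow content processing
        processed.append(doc_id)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start()
t2.start()
t1.join()
t2.join()
print(len(processed))  # 20: every item processed, none dropped
```

The alternative to blocking is dropping or buffering without bound, which is how indexing pipelines fall behind or fall over.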

I have simplified his more elegantly expressed points. Please, consult the original 2019 document for the information I have hip hopped over.

My point is that in the chatter about enterprise search and retrieval, there are a number of situations (use cases to those non-dinobabies) which create some interesting issues. Let me highlight these and then wrap up this short essay.

In an enterprise, the following situations exist and are often ignored or dismissed as irrelevant. When people pooh-pooh my observations, it is clear to me that these people have [a] never been subject to a legal discovery process associated with enterprise search fraud and [b] are entitled whiz kids who don’t do too much in the quite dirty, messy, “real” world. (I do like the variety in T shirts and lumberjack shirts, however.)

First, in an enterprise, content changes. These “deltas” are a giant problem. None of the systems I have examined, tested, installed, or advised about has a procedure to identify, in anything close to real time, a change made to a PowerPoint, presented to a client, and converted to an email confirming a deal, price, or technical feature. In fact, no one may know until the president’s laptop is examined by an investigator who discovers the “forgotten” information. Even more exciting is when the opposing legal team’s review of a laptop dump as part of a discovery process “finds” the sequence of messages and connects the dots. Exciting, right? But “deltas” pose another problem. These modified content objects proliferate like gerbils. One can talk about information governance, but it is just that — talk, meaningless jabber.
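
A minimal way to notice a delta, assuming nothing about any vendor’s pipeline, is to keep a content fingerprint per document and compare it on each crawl. The file names and contents below are hypothetical; the point is that without stored state from the previous pass, a changed PowerPoint looks identical to an unchanged one:

```python
import hashlib

def fingerprint(content: bytes) -> str:
    """Stable fingerprint of a document's bytes."""
    return hashlib.sha256(content).hexdigest()

# State from the previous indexing pass: path -> fingerprint
previous = {
    "deck.pptx": fingerprint(b"original slide deck"),
    "terms.docx": fingerprint(b"contract terms v1"),
}

# What the crawler sees on the current pass
current = {
    "deck.pptx": fingerprint(b"original slide deck"),       # unchanged
    "terms.docx": fingerprint(b"contract terms v2"),        # modified: a delta
    "confirm.eml": fingerprint(b"email confirming a deal"), # new object
}

deltas = [p for p in current if p in previous and current[p] != previous[p]]
new_items = [p for p in current if p not in previous]
print(deltas)     # ['terms.docx']
print(new_items)  # ['confirm.eml']
```

Even this toy version shows the cost: the system must re-fetch and re-hash everything, every pass, just to learn what changed — which is why “near real time” delta detection is talk, not practice.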

Second, the content which an employee needs to answer a business question in a timely manner can reside on an employee’s laptop or a mobile phone, in a digital notebook, in a Vimeo video or one of those nifty “private” YouTube videos, behind the locked doors and specialized security systems loved by some pharma company’s research units, in a Word document in something other than English, etc. Now the content is changed. The enterprise search fast talkers ignore identifying and indexing these documents with metadata that pinpoints the time of the change and who made it. Is this important? Some contract issues require this level of information access. Who asks for this stuff? How about a COTR for a billion dollar government contract?

Third, I have heard and read that modern enterprise search systems “use,” “apply,” or “operate within” industry standard authentication systems. Sure they do, within very narrowly defined situations. If the authorization system does not work, then quite problematic things happen. Examples range from an employee failing to find the information needed and making a really bad decision to an employee going on an Easter egg hunt which may or may not work; if the egg found is good enough, then that’s what gets used. What happens? Bad things can happen. Have you ridden in an old Pinto? Access control is a tough problem, and it costs money to solve. Enterprise search solutions, even the whiz bang cloud centric distributed systems, implement something, which is often not the “right” thing.
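
The access control point can be illustrated with a toy “security trimming” pass: filter the hit list against an access control list before the user ever sees it. The documents and group names here are invented for the sketch; real systems must do this against a live directory service, which is exactly where they break:

```python
# Toy security trimming: drop search hits the user is not entitled to see.
# ACL: document -> set of groups allowed to read it (hypothetical data).
acl = {
    "q3-forecast.xlsx": {"finance"},
    "press-release.txt": {"finance", "marketing", "engineering"},
    "pay-bands.pdf": {"hr"},
}

def trim(hits, user_groups):
    """Return only the hits whose ACL intersects the user's groups."""
    return [doc for doc in hits if acl.get(doc, set()) & user_groups]

hits = ["q3-forecast.xlsx", "press-release.txt", "pay-bands.pdf"]
print(trim(hits, {"marketing"}))             # ['press-release.txt']
print(trim(hits, {"finance", "marketing"}))  # ['q3-forecast.xlsx', 'press-release.txt']
```

Note the default in `acl.get(doc, set())`: a document with no ACL entry is hidden, not shown. Many shipped systems make the opposite choice, which is how the Easter egg hunts begin.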

Fourth, and I am going to stop here, there is the problem of end-to-end encrypted messaging systems. If you think employees do not use these, I suggest you do a bit of Easter egg hunting. What about the content in those systems? You can tell me, “Our company does not use these.” I say, “Fine. I am a dinobaby, and I don’t have time to talk with you because you are so much more informed than I am.”

Why did I romp through this rather unpleasant issue in enterprise search and retrieval? The answer is, “Enterprise search remains a problematic concept.” I believe there is some litigation underway about how the problem of search can morph into a fantasy of a huge business because “we have a solution.”

Sorry. Not yet. Marketing and closing deals are different from solving findability issues in an enterprise.

Stephen E Arnold, March 27, 2024

A Look at Web Search: Useful for Some OSINT Work

February 22, 2024


I read “A Look at Search Engines with Their Own Indexes.” For me, the most useful part of the 6,000 word article is the identified search systems. The author, a person with the identity Seirdy, has gathered in one location a reasonably complete list of Web search systems. Pulling such a list together takes time and reflects well on Seirdy’s attention to a difficult task. There are some omissions; for example, the iSeek education search service (recently repositioned), and Biznar.com, developed by one of the founders of Verity. I am not identifying problems; I just want to underscore that tracking down, verifying, and describing Web search tools is a difficult task. For a person involved in OSINT, the list may surface a number of search services which could prove useful; for example, the Chinese and Vietnamese systems.


A new search vendor explains the advantages of a used convertible driven by an elderly person to take a French bulldog to the park once a day. The clueless fellow behind the wheel wants to buy a snazzy set of wheels. The son in the yellow shirt loves the vehicle. What does that car sales professional do? Some might suggest that certain marketers lie, sell useless add ons, patch up problems, and fiddle the interest rate financing. Could this be similar to search engine cheerleaders and the experts who explain them? Thanks ImageFX. A good enough illustration with just a touch of bias.

I do want to offer several observations:

  1. Google dominates Web search. There is an important distinction not usually discussed when some experts analyze Google; that is, Google delivers “search without search.” The idea is simple. A person uses a Google service, of which there are many. Take, for example, Google Maps. Google runs queries when users take non-search actions; for example, clicking on another part of a map. That’s a search for restaurants, fuel services, etc. Sure, much of the data are cached, but this is an invisible search. Competitors and would-be competitors often forget that Google search is not limited to the Google.com search box. That’s why Google’s reach is going to be difficult to erode quickly. Google has other search tricks up its very high-tech ski jacket’s sleeve. Think about search-enabled applications.
  2. There is an important difference between building one’s own index of Web content and sending queries to other services. The original Web indexers have become like rhinos and white tigers. It is faster, easier, and cheaper to create a search engine which just uses other people’s indexes. This is called metasearch. I have followed the confusion between search and metasearch for many years. Most people do not understand or care about the difference in approaches. This list illustrates how Web search is perceived by many people.
  3. Web search is expensive. Years ago, when I was an advisor to Bear Stearns (an estimable outfit indeed), my client and I were on a conference call with Prabhakar Raghavan (then a Yahoo senior “search” wizard). He told me and my client, “Indexing the Web costs only $300,000 US.” Sorry, Dr. Raghavan (now the Googler who made the absolutely stellar Google Bard presentation in France after MSFT and OpenAI caught Googzilla with its gym shorts around its ankles in early 2023), you were wrong. That’s why most “new” search systems look for shortcuts. These range from recycling open source indexes to ignoring pesky robots.txt files to paying some money to use assorted also-ran indexes.
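
The “pesky robots.txt files” are machine readable, and honoring them costs a crawler almost nothing. Python’s standard library can check whether a bot is even allowed to fetch a page; the rules and the bot name below are made up for the example:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical site's robots.txt (normally fetched from /robots.txt).
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A polite crawler checks before fetching; ignoring this is the "shortcut."
print(rp.can_fetch("MyBot", "https://example.com/private/data.csv"))  # False
print(rp.can_fetch("MyBot", "https://example.com/public/page.html"))  # True
print(rp.crawl_delay("MyBot"))  # 10 seconds between requests
```

The convention is voluntary, which is the point: the systems that skip the check are choosing to, not failing to.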

Net net: Web search is a complex, fast-moving, and little-understood business. People who know now do other things. The Google means overt search, embedded search, and AI-centric search. Why? That is a darned good question which I have tried to answer in my different writings. No one cares. Just Google it.

PS. Download the article. It is a useful reference point.

Stephen E Arnold, February 22, 2024

The Next Big Thing in Search: A Directory of Web Sites

February 12, 2024


In the early 1990s, an entrepreneur with whom I had worked in the 1980s convinced me to work on a directory of Web sites. Yahoo was popular at the time, but my colleague had a better idea. The good news is that our idea worked, and the online service we built became part of the CMGI empire. Our service was absorbed by one of the leading finding services at the time. Remember Lycos? My partner and I do. Now the Web directory is back, decades after those original Yahooligans and our team provided a useful way to locate a Web site.

“Search Chatbots? Pah, This Startup’s Trying on Yahoo’s Old Outfit of Web Directories” presents information about the utility of a directory of Web sites and captures some interesting observations about the findability service El Toco.


The innovator driving the directory concept is Thomas Chopping, a “UK based economist.” He made several observations in a recent article published by the British outfit The Register; for example:

“During the decades since it launched, we’ve been watching Google steadily trying to make search more predictive, by adding things like autocomplete and eventually instant answers,” Chopping told The Register. “This has the byproduct of increasing the amount of time users spend on their site, at the expense of visiting the underlying sources of the data.”

The founder of El Toco also notes:

It’s impossible to browse with conversational-style search tools, which are entirely focused on answering questions. “Right now, this is playing into the hands of Meta and TikTok, because it takes so much effort to find good quality websites via search engines that people stopped bothering.”

El Toco wants to facilitate browsing, and the model is a directory listing. The user can browse and click. The system displays a Web site for the user to scan, read, or bookmark.

Another El Toco principle is:

“We don’t need the user’s personal data to work out which results to show, because the user can express this on their own. We don’t need AI to turn the search into a conversation, because this can be done with a few clicks of the user interface.”

The economist-turned-entrepreneur points out:

“Charging users for Web search is a model which clearly doesn’t work, thanks to Neeva for demonstrating that, so we allow adverts but if the users care they can go into a menu and simply switch them off.”

Will El Toco gain traction? My team and I have been involved in information retrieval for decades, from indexing information about nuclear facilities to providing some advice to an AI search start up a few months ago. I have learned that predicting what will become the next big thing in findability is quite difficult.

A number of interesting Web search solutions are available. Some are niche-focused like Biznar. Others are next-generation “phinding” services like Phind.com. Others are metasearch solutions like iSeek. Some are less crazy Google-style systems like Swisscows. And there are more coming every day.

Why? Let me share several observations or “learnings” from a half century of working in the information retrieval sector:

  1. People have different information needs and a one-size-fits-all search system is fraught with problems. One person wants to search for “pizza near me”. Another wants information about Dark Web secure chat services.
  2. Almost everyone considers themselves a good or great online searcher. Nothing could be further from the truth. Just ask the OSINT professionals at any intelligence conference.
  3. Search companies with some success often give in to budgeting for a minimally viable system, selling traffic or user data, and to dark patterns in pursuit of greater revenue.
  4. Finding information requires effort. Convenience, however, is the key feature of most finding systems. Microfilm is not convenient; therefore, it sucks. Looking at research data takes time and expertise; therefore, old-fashioned work sucks. Library work involving books is not for everyone; therefore, library research sucks. Only a tiny percentage of online users want to exert significant effort finding, validating, and making sense of information. Most people prefer to doom scroll or watch dance videos on a mobile device.

Net net: El Toco is worth a close look. I hope that editorial policies, human curation, and frequent updating become the new normal. I am just going to remain open minded. Information is an extremely potent tool. If I tell you human teeth can explode, do you ask for a citation? Do you dismiss the idea because of your lack of knowledge? Do you begin to investigate the effect of high voltage on the body of a person who works around a 133 kV transmission line? Do you dismiss my statement because I am obviously making up a fact, since everyone knows that electricity is 115 to 125 volts?

Unfortunately only subject matter experts operating within an editorial policy and given adequate time can figure out if a scientific paper contains valid data or made-up stuff like that allegedly crafted by the former presidents of Harvard and Stanford University and probably faculty at the university closest to your home.

Our 1992 service had a simple premise. We selected Web sites which contained valid and useful information. We did not list porn sites, stolen software repositories, and similar potentially illegal or harmful purveyors of information. We provided the sites our editors selected with an image file that was our version of the old Good Housekeeping Seal of Approval.

Point (Top 5% of the Internet)

The idea was that in the early days of the Internet and Web sites, a parent or teacher could use our service without too much worry about setting off a porn storm or a parent storm. It worked, we sold, and we made some money.

Will the formula work today? Sure, but excellence and selectivity have been key attributes for decades. Give El Toco a look.

Stephen E Arnold, February 12, 2024

The American Way: Loose the Legal Eagles! AI, Gray Lady, AI.

December 29, 2023


With the demands of the holidays, I have been remiss in commenting upon the festering legal sores plaguing the “real” news outfits. Advertising is tough to sell. Readers want some stories, not every story. Subscribers churn. The dead tree version of “real” news turns yellow in the windows of the shrinking number of bodegas, delis, and coffee shops interested in losing floor space to “real” news displays.


A youthful senior manager enters Dante’s fifth circle of Hades, the Flaming Legal Eagles Nest. Beelzebub wishes the “real” news professional good luck. Thanks, MSFT Copilot, I encountered no warnings when I used the word “Dante.” Good enough.

Google may be coming out of the dog training school with some slightly improved behavior. The leash does not connect to a shock collar, but maybe the courts will curtail some of the firm’s more interesting behaviors. The Zuckbook and X.com are news shy. But the smart software outfits are ripping the heart out of “real” news. That hurts, and someone is going to pay.

Enter the legal eagles. The target is AI or smart software companies. The legal eagles say, “AI, gray lady, AI.”

How do I know? Navigate to “New York Times Sues OpenAI, Microsoft over Millions of Articles Used to Train ChatGPT.” The write up reports:

The New York Times has sued Microsoft and OpenAI, claiming the duo infringed the newspaper’s copyright by using its articles without permission to build ChatGPT and similar models. It is the first major American media outfit to drag the tech pair to court over the use of stories in training data.

The article points out:

However, to drive traffic to its site, the NYT also permits search engines to access and index its content. "Inherent in this value exchange is the idea that the search engines will direct users to The Times’s own websites and mobile applications, rather than exploit The Times’s content to keep users within their own search ecosystem." The Times added it has never permitted anyone – including Microsoft and OpenAI – to use its content for generative AI purposes. And therein lies the rub. According to the paper, it contacted Microsoft and OpenAI in April 2023 to deal with the issue amicably. It stated bluntly: "These efforts have not produced a resolution."

I think this means that the NYT used online search services to generate visibility, access, and revenue. However, it did not expect, understand, or consider that when a system indexes content, that content is used for other search services. Am I right? A doorway works two ways. The NYT wants it to work one way only. I may be off base, but the NYT is aggrieved because it did not understand the direction of AI research which has been chugging along for 50 years.

What do smart systems require? Information. Where do companies get content? From online sources accessible via a crawler. How long has this practice been chugging along? Since the early 1990s, even earlier if one considers text and command line only systems. Plus, the NYT tried its own online service and failed. Then it hooked up with LexisNexis, only to pull out of the deal because the “real” news was worth more than LexisNexis would pay. Then the NYT spun up its own indexing service. Next the NYT dabbled in another online service. Plus the outfit acquired About.com. (Where did those writers get that content? I know the answer, but does the Gray Lady remember?)

Now, with the success of another generation of software which the Gray Lady overlooked, did not understand, or blew off because it was dealing with high school management methods in its newsroom, the Gray Lady has let loose the legal eagles.

What do I make of the NYT and online? Here are the conclusions I reached working on the Business Dateline database and then as an advisor to one of the NYT’s efforts to distribute the “real” news to hotels and steam ships via facsimile:

  1. Newspapers are not very good at software. Hey, those Linotype machines were killers, but the XyWrite software and subsequent online efforts have demonstrated remarkable ways to spend money and progress slowly.
  2. The smart software crowd is not in touch with the thought processes of those in senior management positions in publishing. When the groups try to find common ground, arguments over who pays for lunch are more common than a deal.
  3. Legal disputes are expensive. Many of those engaged reach some type of deal before letting a judge or a jury decide which side is the winner. Perhaps the NYT is confident that a jury of its peers will find the evil AI outfits guilty of a range of heinous crimes. But maybe not? Is the NYT a risk taker? Who knows. But the NYT will pay some hefty legal bills as it rushes to do battle.

Net net: I find the NYT’s efforts following a basic game plan. Ask for money. Learn that the money offered is less than the value the NYT slaps on its “real” news. The smart software outfit does what it has been doing. The NYT takes legal action. The lawyers engage. As the fees stack up, the idea that a deal is needed makes sense.

The NYT will do a deal, declare victory, and go back to creating “real” news. Sigh. Why? Microsoft has more money and can tie up the matter in court until Hell freezes over in my opinion. If the Gray Lady prevails, chalk up a win. But the losers can just up their cash offer, and the Gray Lady will smile a happy smile.

Stephen E Arnold, December 29, 2023

Google: Rock Solid Arguments or Fanciful Confections?

November 17, 2023


I read some “real” news from a “real” newspaper. My belief is that a “real” journalist, an editor, and probably some supervisory body reviewed the write up. Therefore, by golly, the article is objective, clear, and actual factual. What does “What Google Argued to Defend Itself in Landmark Antitrust Trial” say?


“I say that my worthy opponent’s assertions are — ahem, harrumph — totally incorrect. I do, I say, I do offer that comment with the greatest respect. My competitors are intellectual giants compared to the regulators who struggle to use Google Maps on an iPhone,” opines a legal eagle who supports Google. Thanks, Microsoft Bing. You have the “chubby attorney” concept firmly in your digital grasp.

First, the write up says zero about the secrecy in which the case is wrapped. Second, it does not offer any comment about the amount the Google paid to be the default search engine other than offering the allegedly consumer-sensitive, routine, and completely logical fees Google paid. Hey, buying traffic is important, particularly for outfits accused of operating in a way that requires a US government action. Third, the support structure for the Google arguments is not evident. I could not discern the logical thread that linked the components presented in such lucid prose.

The pillars of the logical structure are:

  1. Appropriate payments for traffic; that is, the Google became the default search engine. Do users change defaults? Well, sure they do. If true, then why pay to be the default in the first place? What are the choices? A Russian search engine, a Chinese search engine, a shadow of Google (Bing, I think), or a metasearch engine (little or no original indexing, just Vivisimo-inspired mash up results)? But pay the “appropriate” amount Google did.
  2. Google is not the only game in town. Nice terse statement of questionable accuracy. That’s my opinion which I articulated in the three monographs I wrote about Google.
  3. Google fosters competition. Okay, it sure does. Look at the many choices one has: Swisscows.com, Qwant.com, and the estimable Mojeek, among others.
  4. Google spends lots of money on research to make “its product great.”
  5. Google’s innovations have helped people around the world?
  6. Google’s actions have been anticompetitive, but not too anticompetitive.

Well, I believe each of these assertions. Would a high school debater buy into the arguments? I know for a fact that my debate partner and I would not.

Stephen E Arnold, November 17, 2023

By Golly, the Gray Lady Will Not Miss This AI Tech Revolution!

November 2, 2023


The technology beacon of the “real” newspaper is shining brightly. Flash, the New York Times Online. Flash, terminating the exclusive with LexisNexis. Flash. The shift to a — wait for it — a Web site. Flash. The in-house indexing system. Flash. Buying About.com. Flash. Doing podcasts. My goodness, the flashes have impaired my vision. And where are we today after labor strife, newsroom craziness, and a list of bestsellers that gets data from…? I don’t really know, and I just haven’t bothered to do some online poking around.


A real journalist of today uses smart software to write listicles for Buzzfeed, essays for high school students, and feature stories for certain high profile newspapers. Thanks for the drawing Microsoft Bing. Trite but okay.

I thought about the technology flashes from the Gray Lady’s beacon high atop its building sort of close to Times Square. Nice branding. I wonder if mobile phone users know why the tourist destination is called Times Square. Since I no longer work in New York, I have forgotten. I do remember the high intensity pinks and greens of a certain type of retail establishment. In fact, I used to know the fellow who created this design motif. Ah, you don’t remember. My hunch is that there are other factoids you and I won’t remember.

For example, what’s the byline on a New York Times’s story? I thought it was the name or names of the many people who worked long hours, made phone calls, visited specific locations, and sometimes visited the morgue (no, the newspaper morgue, not the “real” morgue where the bodies of compromised sources ended up).

If the information in that estimable source Showbiz411.com is accurate, the Gray Lady may cite zeros and ones. The article is “The New York Times Help Wanted: Looking for an AI Editor to Start Publishing Stories. Six Figure Salary.” Now that’s an interesting assertion. A person like me might ask, “Why not let a recent college graduate crank out machine generated stories?” My assumption is that most people trying to meet a deadline and in sync with Taylor Swift will know about machine-generated information. But, if the story is true, here’s what’s up:

… it looks like the Times is going let bots do their journalism. They’re looking for “a senior editor to lead the newsroom’s efforts to ambitiously and responsibly make use of generative artificial intelligence.” I’m not kidding. How the mighty have fallen. It’s on their job listings.

The Showbiz411.com story allegedly quotes the Gray Lady’s help wanted ad as saying:

“This editor will be responsible for ensuring that The Times is a leader in GenAI innovation and its applications for journalism. They will lead our efforts to use GenAI tools in reader-facing ways as well as internally in the newsroom. To do so, they will shape the vision for how we approach this technology and will serve as the newsroom’s leading voice on its opportunity as well as its limits and risks. “

There are a bunch of requirements for this job. My instinct is that a few high school students could jump into this role. What’s the difference between a ChatGPT output about crossing the Delaware and writing a “real” news article about fashion trends seen at Otto’s Shrunken Head?

Several observations:

  • What does this ominous development mean to the accountants who will calculate the cost of “real” journalists versus a license to smart software? My thought is that the general reaction will be positive. Imagine: No vacays, no sick days, and no humanoid protests. The Promised Land has arrived.
  • How will the Gray Lady’s management team explain this cuddling up to smart software? Perhaps it is just one of those newsroom romances? On the other hand, what if something serious develops and the smart software moves in? Yipes.
  • What will “informed” readers think of stories crafted by the intellectual engine behind a high school student’s essay about great moments in American history? Perhaps the “informed” readers won’t care?

Exciting stuff in the world of real journalism down the street from Times Square and the furries, pickpockets, and gawkers from Ames, Iowa. I wonder if the hallucinating smart software will be as clever as the journalist who fabricates a story? Probably not. “Real” journalists do not shape, weaponize, or filter the actual factual. Is John Wiley & Sons ready to take the leap?

Stephen E Arnold, November 2, 2023


Is Google Setting a Trap for Its AI Competition?

October 6, 2023

Vea4_thumb_thumb_thumb_thumb_thumb_tNote: This essay is the work of a real and still-alive dinobaby. No smart software involved, just a dumb humanoid.

The litigation about the use of Web content to train smart generative software is ramping up. Outfits like OpenAI, Microsoft, and Amazon and its new best friend will be snagged in the US legal system.

But what big outfit will be ready to offer those hungry to use smart software without legal risk? The answer is the Google.

How is this going to work?

Simple. Google is beavering away with its synthetic data. Some real data are used to train sophisticated stacks of numerical recipes. The idea is that these algorithms will be “good enough”; thus, the need for “real” information is obviated. And Google has another trick up its sleeve. The company has coveys of coders working on trimmed-down systems and methods. The idea is that using less information will produce more and better results than the crazy idea of indexing content from wherever in real time. The small data can be licensed when the competitors are spending their days with lawyers.

How do I know this? I don’t, but Google is providing tantalizing clues in marketing collateral like “Researchers from the University of Washington and Google have Developed Distilling Step-by-Step Technology to Train a Dedicated Small Machine Learning Model with Less Data.” The author is a student who provides sources for the information about the “less is more” approach to smart software training.

And, may the Googlers sing her praises, she cites Google technical papers. In fact, one of the papers is described by the fledgling Googler as “groundbreaking.” Okay.

What’s really being broken is the approach of some of Google’s most formidable competition.
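Strip away the marketing gloss and the “less is more” trick is knowledge distillation: a small “student” model is trained to match the softened output distribution of a big “teacher,” so a modest pile of data goes further. Here is a minimal sketch of the core loss computation in plain Python. The logits, temperature, and function names are illustrative assumptions for this essay, not anything lifted from Google’s paper:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to a probability distribution, optionally softened."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions.

    A higher temperature exposes the teacher's so-called dark knowledge:
    the relative probabilities it assigns to the wrong classes."""
    p = softmax(teacher_logits, temperature)  # teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A student that matches the teacher incurs zero loss; a divergent one does not.
print(round(distillation_loss([3.0, 1.0, 0.2], [3.0, 1.0, 0.2]), 6))  # 0.0
print(distillation_loss([3.0, 1.0, 0.2], [0.2, 1.0, 3.0]) > 0)        # True
```

Minimizing that loss, rather than grinding through billions of labeled examples, is the whole economy of the small-model pitch.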

When will the Google spring its trap? It won’t. But as the competitors get stuck in legal mud, the Google will be an increasingly attractive alternative.

The last line of the Google marketing piece says:

Check out the Paper and Google AI Article. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 30k+ ML SubReddit, 40k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

Get that young marketer a Google mouse pad.

Stephen E Arnold, October 6, 2023

HP Autonomy: A Modest Disagreement Escalates

May 15, 2023

Vea4_thumb_thumb_thumb_thumb_thumb_tNote: This essay is the work of a real and still-alive dinobaby. No smart software involved, just a dumb humanoid.

About 12 years ago, Hewlett Packard acquired Autonomy. As I understand the deal, HP wanted to snap up Autonomy to make a move in the enterprise services business. Autonomy was one of the major providers of search and some related content processing services in 2010. Autonomy’s revenues were nosing toward $800 million, a level no other search and retrieval software company had previously achieved.

However, as Qatalyst Partners reported in an Autonomy profile, the share price was not exactly hitting home runs each quarter:


Source: Autonomy Trading and Financial Statistics, 2011 by Qatalyst Partners

After some HP executive turmoil, the deal was done. After a year or so, HP analysts determined that the Silicon Valley company paid too much for Autonomy. The result was high-profile litigation. One Autonomy executive lost in court and suffered the embarrassment of jail time.

“Autonomy Founder Mike Lynch Flown to US for HPE Fraud Trial” reports:

Autonomy founder Mike Lynch has been extradited to the US under criminal charges that he defrauded HP when he sold his software business to them for $11 billion in 2011. The 57-year-old is facing allegations that he inflated the books at Autonomy to generate a higher sale price for the business, the value of which HP subsequently wrote down by billions of dollars.

Although I did some consulting work for Autonomy, I have no unique information about the company, the HP allegations, or the legal process which will unspool in the US.

In a recent conversation with a person who had first hand knowledge of the deal, I learned that HP was disappointed with the Autonomy approach to business. I pushed back and pointed out three things to a person who was quite agitated that I did not share his outrage. My points, as I recall, were:

  1. A number of search-and-retrieval companies failed to generate revenue sufficient to meet their investors’ expectations. These included outfits like Convera (formerly Excalibur Technologies), Entopia, and numerous other firms. Some were sold and were operated as reasonably successful businesses; for example, Dassault Systèmes and Exalead. Others were folded into a larger business; for example, Microsoft’s purchase of Fast Search & Transfer and Oracle’s acquisition of Endeca. The period from 2008 to 2013 was particularly difficult for vendors of enterprise search and content processing systems. I documented these issues in The Enterprise Search Report and a couple of other books I wrote.
  2. Enterprise search vendors and some hybrid outfits which developed search-related products and services used bundling as a way to make sales. The idea was not new. IBM refined the approach. Buy a mainframe and get support free for a period of time. Then the customer could pay a license fee for the software and upgrades and pay for services. IBM charged me $850 to roll a specialist to look at my three out-of-warranty PC 704 servers. (That was the end of my reliance on IBM equipment and its marvelous ServeRAID technology.) Libraries, for example, could acquire hardware. The “soft” components had a different budget cycle. The solution? Split up the deal. I think Autonomy emulated this approach and added some unique features. Nevertheless, the market for search and content related services was and is a difficult one. Fast Search & Transfer had its own approach. That landed the company in hot water and the founder on the pages of newspapers across Scandinavia.
  3. Sales professionals could generate interest in search and content processing systems by describing the benefits of finding information buried in a company’s file cabinets, tucked into PowerPoint presentations, and sleeping peacefully in email. Like the current buzz about OpenAI and ChatGPT, expectations are loftier than the reality of some implementations. Enterprise search vendors like Autonomy had to deal with angry licensees who could not find information, heated objections to the cost of reindexing content to make it possible for employees to find the file saved yesterday (an expensive and difficult task even today), and howls of outrage because certain functions had to be coded to meet the specific content requirements of a particular licensee. Remember that a large company does not need just one search and retrieval system; it needs several. There are many, quite specific requirements. These range from engineering drawings in the R&D center to the super sensitive employee compensation data, from the legal department’s need to process discovery information to the mandated classified documents associated with a government contract.

These issues remain today. Autonomy is now back in the spotlight. The British government, as I understand the situation, is not chasing Dr. Lynch for his methods. HP and the US legal system are.

The person with whom I spoke was not interested in my three points. He has a Harvard education and I am a geriatric. I will survive his anger toward Autonomy and his obvious affection for the estimable HP, its eavesdropping Board and its executive revolving door.

What few recall is that Autonomy was one of the first vendors of search to use smart software. The implementation was described as Bayesian inference. Like today’s smart software, the functioning of the Autonomy core technology was a black box. I assume the litigation will expose this Autonomy black box. Is there a message for the ChatGPT-type outfits blossoming at a prodigious rate?

Yes, the enterprise search sector is about to undergo a rebirth. Organizations have information. Findability remains difficult. The fix? Merge ChatGPT-type methods with an organization’s content. What do you get? A party which faded away in 2010 is coming back. The Beatles and Elvis vibe will be live, on stage. Act fast.
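The “merge ChatGPT-type methods with an organization’s content” fix is what the trade now calls retrieval-augmented generation: retrieve the relevant internal documents first, then hand them to the language model as context. A bare-bones sketch of the retrieval half in plain Python follows. Keyword overlap stands in for the embedding similarity search a production system would use, and every document, identifier, and query here is invented for illustration:

```python
# Toy in-house document store; a real enterprise system would index far more,
# with access controls per department (HR, engineering, legal, and so on).
documents = {
    "hr-001": "Employee compensation bands are reviewed each April.",
    "eng-014": "Engineering drawings are stored in the R&D vault.",
    "leg-007": "Discovery documents must be processed by the legal team.",
}

def tokenize(text):
    """Lowercase and split on whitespace, stripping common punctuation."""
    return {word.strip(".,?!").lower() for word in text.split()}

def retrieve(query, k=2):
    """Rank documents by keyword overlap with the query.

    This is a dime-store stand-in for the vector similarity search
    a real retrieval-augmented pipeline would run."""
    q = tokenize(query)
    scored = sorted(
        documents.items(),
        key=lambda item: len(q & tokenize(item[1])),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]

def build_prompt(query):
    """Assemble the context block a generative model would receive."""
    context = "\n".join(documents[d] for d in retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(retrieve("Where are engineering drawings stored?", k=1))  # ['eng-014']
```

Bolt a generative model onto `build_prompt` and you have the rebirth in miniature: the old findability problem wearing a new ChatGPT-style costume.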

Stephen E Arnold, May 15, 2023

A Googley Rah Rah for Synthetic Data

April 27, 2023

Vea4_thumb_thumb_thumbNote: This essay is the work of a real and still-alive dinobaby. No smart software involved, just a dumb humanoid.

I want to keep this short. I know from experience that most people don’t think too much about synthetic data. The idea is important, but other concepts are important and no one really cares too much. When was the last time Euler’s Number came up at lunch?

A gaggle of Googlers extolls the virtues of synthetic data in a 19-page arXiv document called “Synthetic Data from Diffusion Models Improves ImageNet Classification.” The main idea is that data derived from “real” data are an expedient way to improve some indexing tasks.

I am not sure that a quote from the paper will do much to elucidate this facet of the generative model world. The paper includes charts, graphs, references to math, footnotes, a few email addresses, some pictures, wonky jargon, and this conclusion:

And we have shown improvements to ImageNet classification accuracy extend to large amounts of generated data, across a range of ResNet and Transformer-based models.

The specific portion of this quote which is quite important in my experience is the segment “across a range of ResNet and Transformer-based models.” Translating to Harrod’s Creek lingo, I think the wizards are saying, “Synthetic data is really good for text too.”

What’s bubbling beneath the surface of this archly-written paper? Here are my answers to this question:

  1. Synthetic data are a heck of a lot cheaper to generate for model training; therefore, embrace “good enough” and move forward. (Think profits and bonuses.)
  2. Synthetic data can be produced and updated more easily than fooling around with “real” data. Assembling training sets, tests, deploying and reprocessing are time sucks. (There is more work to do than humanoids to do it when it comes to training, which is needed frequently for some applications.)
  3. Synthetic datasets can be smaller. Even baby Satan aka Sam Altman is down with synthetic data. Why? Elon could only buy so many Nvidia processing units. Thus, finding a way to train models with synthetic data works around a supply bottleneck.
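The cheaper-faster pitch in the list above can be shown with a toy augmentation loop: take a handful of “real” samples and fabricate synthetic variants to fatten the training set. In the sketch below, random jitter stands in for a diffusion model (which it emphatically is not), and all of the data and names are invented for illustration:

```python
import random

random.seed(7)  # deterministic for the sake of the example

# A tiny "real" dataset: (feature_vector, label) pairs.
real_data = [
    ([1.0, 1.1], "cat"),
    ([0.9, 1.0], "cat"),
    ([5.0, 5.2], "dog"),
    ([5.1, 4.9], "dog"),
]

def make_synthetic(samples, copies=10, noise=0.3):
    """Fabricate synthetic samples by jittering real ones.

    A diffusion model would generate far richer variants; bounded
    random noise is the cheap stand-in for this sketch."""
    fake = []
    for features, label in samples:
        for _ in range(copies):
            jittered = [x + random.uniform(-noise, noise) for x in features]
            fake.append((jittered, label))
    return fake

def nearest_centroid_classify(train, point):
    """Classify a point by distance to the per-label centroids of the training set."""
    groups = {}
    for features, label in train:
        groups.setdefault(label, []).append(features)
    best_label, best_dist = None, float("inf")
    for label, rows in groups.items():
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(centroid, point))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Four real samples balloon into a 44-sample training set.
train = real_data + make_synthetic(real_data)
print(len(train))                                    # 44
print(nearest_centroid_classify(train, [1.0, 1.0]))  # cat
print(nearest_centroid_classify(train, [5.0, 5.0]))  # dog
```

Four hand-labeled samples become forty-four training samples at zero labeling cost, which is the whole “better, faster, cheaper” argument in miniature; the open question, as with Google’s paper, is whether the fakes are faithful enough.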

My summary of the Googlers’ article is briefer than the original: Better, faster, cheaper.

You don’t have to pick one. Just believe the Google. Who does not trust the Google? Why not buy synthetic data and ready-to-deploy models for your next AutoGPT product? Google’s approach worked like a champ for online ads. Therefore, Google’s approach will work for your smart software. Trust Google.

Stephen E Arnold, April 27, 2023
