Google’s Government Indexing Desire

December 13, 2008

I laughed when I read the Washington Post article “Firms Push for a More Searchable Federal Web” by Peter Whoriskey. I wiped away my tears and pondered the revelation that Google wants to index the US government’s information. The leap from Eric Schmidt to the Smithsonian to search engine optimizer par excellence was almost too much for me. I assume that Google’s man in Washington did not recall the procurements in which Google participated. Procurements, I might add, that Google did not win. The company lost out to Inktomi, Fast Search, Microsoft, and Vivisimo. Now it seems Google wants to get back in the game.

Google is in the game. The company has an index of some US government information here. The service is called Google US Government Search, and it even has an American flag to remind you that Google is indexing some of the US government’s public facing content. When I compare the coverage of Microsoft Vivisimo’s index here with that of Google’s US government index, I think Google delivers more on-point information. Furthermore, the clutter-free Google search pages let me concentrate on search.

The question that does not occur to Mr. Whoriskey is, “Why doesn’t the US government use Google for the USA.gov service?” I don’t know the answer to this question, but I have a hunch that Google did not put much emphasis on competing for a government-wide indexing contract. Now I think Google wants that contract, and it is making its interest known. With cheerful yellow Google Search Appliances sprouting like daisies in government agencies, the GOOG wants more.

How good is Google’s existing index of US government information? I ran a few test queries. Here is a summary of my results comparing Google’s service with the USA.gov service provided by Microsoft Vivisimo. The first query is “nuclear eccs”. I wanted to see on the first results page pointers to emergency core cooling system information available from the Nuclear Regulatory Commission and a couple of other places that host US government nuclear-related information. Well, Google nailed my preferred source with a direct link to the NRC Revision of Appendix K, which is about ECCS. USA.gov provided a pointer to a GE document, but the second link was to my preferred document. Close enough for horseshoes.

My second query was for “Laura Bush charity”. Google delivered a useful hit at item 3 “$2 Million Grant for Literacy Programs.” USA.gov nailed the result at the number one position in its hit list.

My third query was for “companycommand.com”. Google presented the link to CompanyCommand.com at the top of the results list and displayed a breakdown of the main sections of the Web site. USA.gov delivered the hit at the top of the results list.

My test, as unscientific as it was, revealed that neither Google nor Microsoft Vivisimo clearly outperforms the other. Neither service sucks the content from the depths of the Department of Commerce. Where are those consulting reports prepared for small businesses?

What’s the difference between Google’s government index and the Microsoft Vivisimo government index?

The answer is simple, “Buzz”.

No one in my circle of contacts in the US government gets jazzed about Microsoft Vivisimo. But mention Google, and you have the Tang of search. Google strikes me as responding to requests from departments to buy Google Maps or the Google Search Appliance.

If Google wants to nail larger sales in the US government, the company will need the support of partners who can deliver the support and understanding government executives expect and warrant. The messages refracted through a newspaper with support from wizards in the business of getting a public Web site to appear at the top of a results list won’t do the job.

In my opinion, both Google and the Washington Post have to be more precise in their communications. Search is a complicated beastie even though many parvenu consultants love the words “Easy,” “Simple” and “Awesome performance.” That’s the sizzle, not the steak. Getting content to the user is almost as messy as converting Elsie the cow into hamburgers.

Google’s the search system with buzz. The incumbent Microsoft Vivisimo has the government contract. If Google wants that deal and others of that scale, Google will want to find partners who can deliver, not partners who are in Google’s social circle. Google will want to leverage its buzz and make clear in its communications that search engine optimization is not exactly what is needed for the Google Search Appliance to deliver a solution to a government agency. Finally, the Washington Post may want to dig a bit deeper when writing about search in order to enhance clarity and precision.

Stephen Arnold, December 12, 2008

Tribune Says: Google’s Automated Indexing Not Good

September 11, 2008

I have been a critic of Sam Zell’s Tribune since I tangled with the site for my 86-year-old father. You can read my negative views of the site’s usability, its indexing, and its method of displaying content here.

Now on with my comments on this Marketwatch story titled “Tribune Blames Google for Damaging News Story” by John Letzing, a good journalist in my book. Mr. Letzing reports that Google’s automated crawler and indexing system could not figure out that a story from 2002 was old. As a result, the “old” story appeared in Google News and the stock of United Airlines took a hit. The Tribune, according to the story, blames Google.

Hold your horses. This problem is identical to the one created by folks who say, “Index my servers. The information on them is what we want indexed.” As soon as the index goes live, these same folks complain that the search engine has processed ripped-off music, software from mysterious sources, Cub Scout fund-raising materials, and some content I don’t want to mention in a Web log. How do I know? I have heard this type of rationalization many times. Malformed XML, duplicate content, and other problems mean content mismanagement, not bad indexing by a search system.

Most people don’t have a clue what’s on their public facing servers. The content management system may be at fault. The users might be careless. Management may not have policies or may not create an environment in which those policies are observed. Most people don’t know that “dates” are assigned and may not correlate with the “date” embedded in a document. In fact, some documents contain many dates. Entity extraction can discover a date, but when there are multiple dates, which date is the “right one”? What’s a search system supposed to do? Well, search systems process what’s exposed on a public facing server or a source identified in the administrative controls for the content acquisition system.
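
To see why the “right date” question is not trivial, here is a minimal sketch. The document text, the regex patterns, and the closing comment are invented for illustration; this is not how Google News or any vendor’s crawler actually works.

```python
import re
from datetime import datetime

# A hypothetical document carrying several dates: a dateline, a
# reference to an earlier event, and boilerplate copyright text.
doc = """Chicago, Dec. 10, 2002: United Airlines filed for bankruptcy.
The carrier confirmed the filing on 12/09/2002.
Copyright 2008 Tribune Company."""

# Patterns for three common date formats; real systems handle many more.
patterns = [
    (r"[A-Z][a-z]{2}\. \d{1,2}, \d{4}", "%b. %d, %Y"),
    (r"\d{2}/\d{2}/\d{4}", "%m/%d/%Y"),
    (r"(?<=Copyright )\d{4}", "%Y"),  # a bare year in boilerplate
]

candidates = []
for pattern, fmt in patterns:
    for match in re.findall(pattern, doc):
        candidates.append(datetime.strptime(match, fmt))

for d in sorted(candidates):
    print(d.date())  # 2002-12-09, 2002-12-10, 2008-01-01

# Three plausible dates. A system that trusts the newest one treats a
# 2002 story as 2008 news, which is the United Airlines scenario above.
```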

Blaming a software system for lousy content management is a flashing yellow sign that says to me “Uninformed ahead. Detour around problem.”

Based on my experience with indexing content managed by people who were too busy to know what was on their machines, I think blaming Google is typical of the level of understanding in traditional media about how automated or semi-automated systems work. Furthermore, when I examined the Tribune’s for-fee service referenced above, it was clear that the level of expertise brought to bear on the service was, in my opinion, rudimentary.

Traditional media is eager to find fault with Google. Yet some of these outfits use automated systems to index content and cut headcount. The indexing generated by these systems is acceptable, but there are errors. Some traditional publishers not only index in a casual manner; they also charge for each query. A user may have to experiment in order to find relevant documents. Each search puts money in the publisher’s pocket. The Tribune charges for an online service that is essentially unusable by my 86-year-old father.

If a Tribune company does not know what’s on its servers and exposes those servers on the Internet, the problem is not Google’s. The problem is the Tribune’s.

Stephen Arnold, September 11, 2008

Indexing Dynamic Databased Content

April 20, 2008

In the last week, there’s been considerable discussion of what is now called “deep Web” content. The idea is that some content requires the user to enter a query. The system processes the query and generates a search result from a database. This function is easier to illustrate than explain in words.

Look at the screen shot below. I have navigated to Southwest Airlines Web page and entered a query for flights from Louisville, Kentucky, to Baltimore, Maryland.

[Screenshot: the Southwest Airlines search form]

Here’s what the system shows me:

[Screenshot: the Southwest Airlines results page]

If you do a search on Google, Live.com, or Yahoo, you won’t see the specific listing of flights shown below:

[Screenshot: the Southwest Airlines flight listing]
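
Here is a minimal sketch of why a crawler never sees that listing. The URLs and parameter names are hypothetical stand-ins, not Southwest’s actual endpoints; the point is the structural difference between following links and submitting a form.

```python
import requests

# A crawler discovers pages by following the href links in HTML.
# A plain GET of the search page returns only the empty form:
page = requests.get("https://airline.example.com/flights")
# page.text contains form markup, but no flight data.

# The flight listing exists only after the form is submitted with
# user-supplied parameters (names here are hypothetical):
results = requests.post(
    "https://airline.example.com/flights/search",
    data={"origin": "SDF", "destination": "BWI", "date": "2008-04-21"},
)
# results.text is generated from a database at request time. A crawler
# has no list of every origin/destination/date combination to submit,
# so these pages are never enumerated. That is the "deep Web."
```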


Indexing Hot Spots

February 29, 2008

Introduction

This is the third in a series about cost hot spots in behind-the-firewall search. This essay does not duplicate the information in Beyond Search, my new study for the Gilbane Group. This document is designed to highlight several functions or operations in an indexing subsystem that can cause system slowdowns or bottlenecks. No specific vendors’ systems are referenced in this essay. I see no value in finger pointing because no indexing subsystem is without potential for performance degradation in a real world installation. – Stephen Arnold, February 29, 2008

Indexing: Often a Mysterious Series of Multiple Operations

One of the most misunderstood parts of a behind-the-firewall search system is indexing. The term indexing itself is the problem. For most people, an index is the key word listing that appears at the back of a book. For those hip to the ways of online, indexing means metatagging, usually in the form of a series of words or phrases assigned to a Web page or an element in a document; for example, an image and its associated text. The actual index in your search system may not be one data table. The index may be multiple tables or numeric values that “float” mysteriously within the larger search system. The “index” may not even be in one system. Parts of the index may be in different places, updated in a series of processes that cannot be easily recreated after a crash, software glitch, or other corruption. This CartoonStock.com image makes clear the impact of a search system crash.
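
To make the abstraction concrete, here is a toy sketch of the structure at the heart of most engines: an inverted index mapping each term to the documents containing it. No vendor’s design looks this simple; real indexes add term positions, weights, and multiple physical files.

```python
from collections import defaultdict

# Three tiny "documents" standing in for processed content.
docs = {
    1: "emergency core cooling system",
    2: "core memory pricing report",
    3: "cooling tower maintenance schedule",
}

# Build the inverted index: term -> set of document IDs.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

# A query is an intersection of posting lists.
def search(*terms):
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

print(search("core", "cooling"))  # {1}
```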

Centuries ago, people lucky enough to have a “book” learned that some sort of system was needed to find a scroll, stored in a leather or clay tube, sometimes chained to the wall to keep the source document from wandering off. In the so-called Dark Ages, information was not free, nor did it flow freely. Information was something special and of high value.

Today, we talk about information as a flood, a tidal wave, a problem. It is ubiquitous, without provenance, and digital. Information wants to be free, fluid, moving around, unstable, dynamic. For indexing to work, you have a specific object at a point in time to process; otherwise, the index is useless. Also, the index must be “fresh”. Fresh means that the most recent information is in the system and therefore available to users. With lots of new and changed information, you have to determine how fresh is fresh enough. Real-time data also provides a challenge. If your system can index 100 megabytes a minute but must keep up with larger volumes of new and changed data, something’s got to give. You may have to prioritize what you index. You handle high-priority documents first, then shift to lower-priority documents until new higher-priority documents arrive. This triage affects the freshness of the index; alternatively, you can throw more hardware at your system, thus increasing capital investment and operational cost.

Index freshness is important. A person in a professional setting cannot do “work” unless the digital information can be located. Once located, the information must be the “right” information. Freshness is important, but there are also issues of versions of documents. These are indexing challenges and can require considerable intellectual effort to resolve. You have to get freshness right for a search system to be useful to your colleagues. In general, the more involved your indexing, the more important the architecture and engineering of the “moving parts” in your search system’s indexing subsystem become.

Why is indexing a cost hot spot? Let’s look at some hot spots I have encountered in the last nine months.
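
But first, the freshness triage described above is, at bottom, a priority queue. A minimal sketch, with invented priority labels and file names; a production pipeline adds persistence, batching, and monitoring.

```python
import heapq

# (priority, document) pairs; a lower number means indexed sooner.
# The priorities are invented for illustration.
queue = []
heapq.heappush(queue, (1, "press-release-today.html"))    # high priority
heapq.heappush(queue, (3, "archived-report-2005.pdf"))    # low priority
heapq.heappush(queue, (1, "intranet-policy-update.doc"))  # high priority
heapq.heappush(queue, (2, "weekly-newsletter.html"))

# The indexer drains high-priority items first. Low-priority documents
# wait, which is exactly the freshness trade-off described above.
while queue:
    priority, doc = heapq.heappop(queue)
    print(f"indexing (priority {priority}): {doc}")
```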

Remediating Indiscriminate Indexing

When you deploy your behind-the-firewall search or content processing system, you have to tell your system how to process the content. You can operate an advanced system in default mode, but you may want to select certain features and levels of stringency, and make sure that you are familiar with the various controls available to you. Investing time in testing prior to deployment may be useful when troubleshooting. The first cost hot spot is encountering disc thrashing or long indexing times. You come in one morning, check the logs, and learn no content was processed. In Beyond Search I talk about some steps you can take to troubleshoot this condition. If you can’t remediate the situation by rebooting the indexing subsystem, then you will have to work through the vendor’s technical support group, restore the system to a known good state, or, in some cases, reinstall the system. When you reinstall, some systems cannot use the backup index files. If you find that your backups won’t work or deliver erratic results on test queries, then you may have to rebuild the index. In a small two-person business, the time and cost are trivial. In an organization with hundreds of servers, the process can consume significant resources.

Updating the Index or Indexes

Your search or content processing system allows you to specify how frequently the index updates. When your system has robust resources, you can specify indexing to occur as soon as content becomes available. Some vendors talk about their systems as “real time” indexing engines. If you find that your indexing engine starts to slow down, you may have encountered a “big document” problem. Indexing systems make short work of HTML pages, short PDFs, and emails. But when document size grows, the indexing subsystem needs more “time” to process long documents. I have encountered situations in which a Word document includes objects that are large. The Word document requires the indexing subsystem to grind away on this monster file. If you hit a patch characterized by a large number of big documents, the indexing subsystem will appear to be busy, but indexing subsystem outputs fall sharply.

Let’s assume you build your roll-out index based on a thorough document analysis. You have verified security and access controls so the “right” people see the information to which they have appropriate access. You know that the majority of the documents your system processes are in the 600-kilobyte range over the first three months of indexing subsystem operation. Suddenly the document size leaps to six megabytes, and the number of big documents becomes more than 20 percent of the document throughput. You may learn that the set up of your indexing subsystem or the resources available are hot spots.

Another situation concerns different versions of documents. Some search and content processing systems identify duplicates using date and time stamps. Other systems include algorithms to identify duplicate content and remove it or tag it so the duplicates may or may not be displayed under certain conditions. A surge in duplicates may occur when an organization is preparing for a trade show. Emails with different versions of a PowerPoint may proliferate rapidly. Obviously indexing every six-megabyte PowerPoint makes sense if each PowerPoint is different. How your indexing subsystem handles duplicates is important. A hot spot occurs when a surge of files with the same name and different date and time stamps is fed into the indexing system. The hot spot may be remediated by identifying the problem files and deleting them manually or via your system’s administrative controls. Versions of documents can become an issue under certain circumstances such as a legal matter. Unexpected indexing subsystem behavior may be related to a duplicate file situation.

Depending on your system, you will have some fiddling to do in order to handle different versions of documents in a way that makes sense to your users. You also have to set up a de-duplication process in order to make it easy for your users to find the specific version of the document needed to perform a work task. These administrative interventions are not difficult when you know where to look for the problem. If you are not able to pinpoint a specific problem, the hunt for the hot spot can become time consuming.
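
To make the duplicate-detection contrast concrete, here is a minimal sketch with invented file names and content. Production systems use shingling or fuzzy hashing rather than this exact byte-level hash, but the logic is the same.

```python
import hashlib

# Same deck, different date/time stamps: the trade-show scenario.
files = {
    "pitch_v1.ppt": ("2008-02-01 09:00", b"slide content A"),
    "pitch_v2.ppt": ("2008-02-02 14:30", b"slide content A"),  # true duplicate
    "pitch_v3.ppt": ("2008-02-03 11:15", b"slide content B"),  # real new version
}

# Date/time-stamp dedup sees three distinct files and indexes all three.
# Content hashing notices that v1 and v2 are byte-identical.
seen = {}
for name, (stamp, body) in files.items():
    digest = hashlib.sha1(body).hexdigest()
    if digest in seen:
        print(f"{name} duplicates {seen[digest]}: skip it or tag it")
    else:
        seen[digest] = name
        print(f"{name} is new content: index it")
```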

Common Operations Become a Problem

Once an index has been constructed – a process often called indexation – incremental updates are generally trouble free. Notice that I said generally. Let’s look at some situations that can arise, albeit infrequently.

Index Rebuild

You have a crash. The restore operation fails. You have to reindex the content. Why is this expensive? You have to plan reindexing and then babysit the update. For reindexing you will need the resources required when you performed the first indexation of your content. In addition, you have to work through the normal verifications for access, duplicates, and content processing each time you update. Whatever caused the index restore operation to fail must be remediated, a backup created when reindexing is completed, and then a test run to make sure the new backup restores correctly.

Indexing New or Changed Content

Let’s assume that you have a system, and you have been performing incremental indexes for six months with no observable problems and no red flags from users. Then users with no prior history of complaining about the search system complain that certain new documents are not in the system. Depending on your search system’s configuration, you may have a hot spot in the incremental indexing update process. The cause may be related to volume, configuration, or an unexpected software glitch. You need to identify the problem and figure out a fix. Some systems maintain separate indexes based on a maximum index size. When the index grows beyond a certain size, the system creates or allows the system administrator to create a second index. Parallelization makes it possible to query index components with no appreciable increase in system response time. A hot spot can result when a configuration error causes an index to exceed its maximum size, halting the system or corrupting the index itself, although other symptoms may be observable. Again, the key to resolving this hot spot is often configuration and infrastructure.

Value-Added Content Processing

New search and content processing systems incorporate more sophisticated procedures, systems, and methods than systems did a few years ago. Fortunately, faster processors, 64-bit chips, and plummeting prices for memory and storage devices allow indexing systems to pile on the operations and maintain good indexing throughput, easily several megabytes a minute to five gigabytes of content per hour or more. If you experience slowdowns in index updating, you face some stark choices when you saturate your machine capacity or storage. In my experience, these are:

  • Reduce the number of documents processed
  • Expand the indexing infrastructure; that is, throw hardware at the problem
  • Turn off certain resource-intensive indexing operations; in effect, eliminate some of the processes that use statistical, linguistic, or syntactic functions

One of the questions that comes up frequently is, “Why are value-added processing systems more prone to slowdowns?” The answer is that when the number of documents processed goes up or the size of documents rises, the infrastructure cannot handle the load. Indexing subsystems require constant monitoring and routine hardware upgrades.

Iterative systems cycle through processes two or more times. Some iterative functions are dependent on other processes; for example, until the linguistic processes complete, another component such as entity extraction cannot be completed. Many current indexing systems are parallelized. But situations can arise in which indexing slows to a crawl because a software glitch fails to keep the internal pipelines flowing smoothly. If process A slows down, the lack of available data to process means process B waits. Log analysis can be useful in resolving this hot spot.

Crashes: Still Occur

Many modern indexing systems can hiccup and corrupt an index. The way to fix a corrupt index is to have two systems. When one fails, the other system continues to function. But many organizations can’t afford tandem operation and hot failovers. When an index corruption occurs, some organizations restore the index to a prior state. A gap may exist between the point captured in the backup and the index state at the time of the failure. Most systems can determine which content must be processed to “catch up”. Checking the rebuilt indexes is a useful step to take when a crash has taken place and the index has been restored and rebuilt. Keep in mind that backups are not foolproof. Test your system’s backup and restore procedures to make sure you can survive a crash and have the system operational again.
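
Before wrapping up, a sketch of the parallelization point above: once an index is split into components, a query fans out to every piece and the partial hits merge, so two half-size indexes answer roughly as fast as one. The shard contents are invented; real components sit on separate servers or disks.

```python
from concurrent.futures import ThreadPoolExecutor

# Two index components, created when the first index hit its size cap.
shard_a = {"budget": [101, 102], "nuclear": [103]}
shard_b = {"budget": [204], "safety": [205, 206]}

def query_shard(shard, term):
    # Stand-in for a real per-shard lookup (disk seek, network call).
    return shard.get(term, [])

def query(term, shards=(shard_a, shard_b)):
    # Fan the query out to every shard concurrently, then merge.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(query_shard, shards, [term] * len(shards)))
    return sorted(doc for partial in partials for doc in partial)

print(query("budget"))  # [101, 102, 204]
```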

Wrap Up

Let’s step back. The hot spots for indexing fall into three categories. First, you have to have adequate infrastructure. Ideally your infrastructure will be engineered to permit pipelined functions to operate rapidly and without latency. Second, you will want to have specific throughput targets so you can handle new and changed content whether your vendor requires one index or multiple indexes. Third, you will want to understand how to recover from a failure and have procedures in place to restore an index or “roll back” to a known good state and then process content to ensure nothing is lost.

In general, the more value-added content processing you use, the greater your potential for hot spots. Search used to be simpler from an operational point of view. Key word indexing is very straightforward compared to some of the advanced content processing systems in use today. The performance of any system fluctuates to some extent. As sophisticated as today’s systems are, there is room for innovation in the design, architecture, and administration of indexing subsystems. Keep in mind that more specific information appears in Beyond Search, due out in April 2008.

Stephen Arnold, February 29, 2008

A Discernment Challenge for Those Who Are Dull Normal

June 24, 2024

This essay is the work of a dinobaby. Unlike some folks, no smart software improved my native ineptness.

Techradar, an online information service, published “Ahead of GPT-5 Launch, Another Test Shows That People Cannot Distinguish ChatGPT from a Human in a Conversation Test — Is It a Watershed Moment for AI?”  The headline implies “change everything” rhetoric, but that is routine AI jargon-hype.

Once again, academics who are unable to land a job in a “real” smart software company studied the work of their former colleagues who make a lot more money than those teaching do. Well, what do academic researchers do when they are not sitting in the student union or the snack area in the lab whilst waiting for a graduate student to finish a task? In my experience, some think about their CVs or résumés. Others ponder the flaws in a commercial or allegedly commercial product or service.


A young shopper explains that the outputs of egg laying chickens share a similarity. Insightful observation from a dumb carp. Thanks, MSFT Copilot. How’s that Recall project coming along?

The write up reports:

The Department of Cognitive Science at UC San Diego decided to see how modern AI systems fared and evaluated ELIZA (a simple rules-based chatbot from the 1960’s included as a baseline in the experiment), GPT-3.5, and GPT-4 in a controlled Turing Test. Participants had a five-minute conversation with either a human or an AI and then had to decide whether their conversation partner was human.

Here’s the research set up:

In the study, 500 participants were assigned to one of five groups. They engaged in a conversation with either a human or one of the three AI systems. The game interface resembled a typical messaging app. After five minutes, participants judged whether they believed their conversation partner was human or AI and provided reasons for their decisions.

And what did the intrepid academics find? Factoids that will get them a job at a Perplexity-type of company? Information that will put smart software into focus for the elected officials writing draft rules and laws to prevent AI from making The Terminator come true?

The results were interesting. GPT-4 was identified as human 54% of the time, ahead of GPT-3.5 (50%), with both significantly outperforming ELIZA (22%) but lagging behind actual humans (67%). Participants were no better than chance at identifying GPT-4 as AI, indicating that current AI systems can deceive people into believing they are human.

What does this mean for those labeled dull normal, a nifty term applied to some lucky people taking IQ tests? I wanted to be a dull normal, but I was able to score in the lowest possible quartile. I think it was called dumb carp. Yes!

Several observations to disrupt your clear thinking about smart software and research into how the hot dogs are made:

  1. The smart software seems to have stalled. In our tests of You.com, which allows one to select which model parrots information, it is tough to differentiate the outputs. Cut from the same transformer cloth maybe?
  2. Those judging, differentiating, and testing smart software outputs can discern differences only if they are way above dull normal or my classification, dumb carp. This means that indexing systems, people, and “new” models will be bamboozled into thinking what’s incorrect is a-okay. So much for the informed citizen.
  3. Will the next innovation in smart software revolutionize something? Yep, some lucky investors.

Net net: Confusion ahead for those like me: Dumb carp. Dull normals may be flummoxed. But those super-brainy folks have a chance to rule the world. Bust out the party hats and little horns.

Stephen E Arnold, June 24, 2024

Will AI Kill Us All? No, But the Hype Can Be Damaging to Mental Health

June 11, 2024

This essay is the work of a dinobaby. Unlike some folks, no smart software improved my native ineptness.

I missed the talk about how AI will kill us all. Planned? Nah, heavy traffic. From what I heard, none of the cyber investigators believed the person trying hard to frighten law enforcement cyber investigators. There are other, slightly more tangible, threats. One of the attendees whose name I did not bother to remember asked me, “What do you think about artificial intelligence?” My answer was, “Meh.”


A contrarian walks alone. Why? It is hard to make money being negative. At the conference I attended June 4, 5, and 6, attendees with whom I spoke just did not care. Thanks, MSFT Copilot. Good enough.

Why, you may ask? My method of handling the question is to refer to articles like this: “AI Appears to Rapidly Be Approaching a Brick Wall Where It Can’t Get Smarter.” This write up offers an opinion not popular among the AI cheerleaders:

Researchers are ringing the alarm bells, warning that companies like OpenAI and Google are rapidly running out of human-written training data for their AI models. And without new training data, it’s likely the models won’t be able to get any smarter, a point of reckoning for the burgeoning AI industry

Like the argument that AI will change everything, this claim applies to systems based upon indexing human content. I am reasonably certain that more advanced smart software with different concepts will emerge. I am not holding my breath because much of the current AI hoo-hah has been gestating longer than a newborn baby elephant.

So what’s with the doom pitch? Law enforcement apparently does not buy the idea. My team doesn’t. For the foreseeable future, applied smart software operating within some boundaries will allow some tasks to be completed quickly and with acceptable reliability.  Robocop is not likely for a while.

One interesting question is why the polarization exists. First, it is easy. And, second, one can cash in. If one is a cheerleader, one can invest in a promising AI start-up and make (in theory) oodles of money. By being a contrarian, one can tap into the segment of people who think the sky is falling. Being a contrarian is “different.” Plus, by predicting implosion and the end of life, one can get attention. That’s okay. I try to avoid being the eccentric carrying a sign.

The current AI bubble relies in a significant way on a Google recipe: Indexing text. The approach reflects Google’s baked in biases. It indexes the Web; therefore, it should be able to answer questions by plucking factoids. Sorry, that doesn’t work. Glue cheese to pizza? Sure.

Hopefully new lines of investigation may reveal different approaches. I am skeptical about synthetic data (made-up data that is probably correct). My fear is that we will require another 10, 20, or 30 years of research to move beyond shuffling content blocks around. There has to be a higher level of abstraction operating. But machines are machines, and wetware (human brains) is different.

Will life end? Probably but not because of AI unless someone turns over nuclear launches to “smart” software. In that case, the crazy eccentric could be on the beam.

Stephen E Arnold, June 11, 2024

Flawed AI Will Still Take Jobs

May 16, 2024

This essay is the work of a dinobaby. Unlike some folks, no smart software improved my native ineptness.

Shocker. Organizations are using smart software which [a] operates in a way its creators cannot explain, [b] makes up information, and [c] appears to be dominated by a handful of “above the law” outfits. Does this characterization seem unfair? If so, stop reading. If it seems anchored in reality, you may find my comments about jobs for GenX, GenY or GenWhy?, millennials, and Alphas (I think this is what marketers call wee lads and lasses) somewhat in line with the IMF’s view of AI.


The answer is, “Your daughter should be very, very intelligent and very, very good at an in-demand skill. If she is not, then it is doom scrolling for sure.” Thanks, MSFT Copilot. Do your part for the good of mankind today.

“Artificial Intelligence Hitting Labour Forces Like a Tsunami – IMF Chief” screws up the metaphor. A tsunami builds, travels, dissipates. I am not sure what the headline writer thinks will dissipate in AI land. Jobs for sure. But AI seems to have some sticking power.

What does the IMF say? Here’s a bit of insight:

Artificial intelligence is likely to impact 60% of jobs in advanced economies and 40% of jobs around the world in the next two years…

So what? The IMF Big Dog adds:

“It could bring tremendous increase in productivity if we manage it well, but it can also lead to more misinformation and, of course, more inequality in our society.”

Could. I think it will, but only for those who know their way around AI and are in the tippy-top of smart people. ATM users, TikTok consumers, and those who think school is stupid may not emerge as winners.

I find it interesting to consider what a two-tier society in the US and Western Europe will manifest. What will the people who do not have jobs do? Volunteer to work at the local animal shelter, pick up trash, or just kick back. Yeah, that’s fun.

What if one looks back over the last 50 years? When I grew up, my father had a job. My mother worked at home. I went to school. The text books were passed along year to year. The teachers grouped students by ability and segregated some students into an “advanced” track. My free time was spent outside “playing” or inside reading. When I was 15, I worked as a car hop. No mobile phones. No computer. Just radio, a record player, and a crappy black-and-white television which displayed fuzzy programs. The neighbors knew me and the other “kids.” From my eighth grade class, everyone went to college after high school. In my high school class of 1962, everyone was thinking about an advanced degree. Social was something a church sponsored. Its main feature was ice cream. After getting an advanced degree in 1965 I believe, I got a job because someone heard me give a talk about indexing Latin sermons and said, “We need you.” Easy.

A half century later, what is the landscape? AI is eliminating jobs. Many of these will be intermediating jobs like doing email spam for a PR firm’s client or doing legal research. In the future, knowledge work will move up the Great Chain of Being. Most won’t be able to do the climbing to make it up to a rung with decent pay, some reasonable challenges, and a bit of power.

Let’s go back to the somewhat off-the-mark tsunami metaphor. AI is going to become more reliable. The improvements will continue. Think about what an IBM PC looked like in the 1980s. Now think about the MacBook Air you or your colleague has. They are similar but not equivalent. What happens when AI systems and methods keep improving? That’s tough to predict. What’s obvious is that the improvements and innovations in smart software are not a tsunami.

I liken it more to the continuous pressure in a petroleum cracking facility. Work is placed in contact with smart software, and stuff vaporizes. The first component to be consumed is human jobs. Next, the smart software will transform “work” itself. Most work is busy work; smart software wants “real” work. As long as the electricity stays on, the impact of AI will be ongoing. AI will transform. A tsunami crashes, makes a mess, and then is entropified. AI is a different and much hardier development.

The IMF is on the right track; it is just not making clear how much change is now underway.

Stephen E Arnold, May 16, 2024

Google Search Is Broken

May 10, 2024

ChatGPT and other generative AI engines have screwed up search engines, including the all-powerful Google. The Blaze article “Why Google Search Is Broken” explains why Internet search is broken and what the causes are. The Internet is full of information, and the best way to get noticed in search results is to use SEO. A black hat technique (it will probably be considered old school in the near future) to manipulate search results is to litter a post with keywords, aka “keyword stuffing.”
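
A back-of-the-envelope sketch of what keyword stuffing looks like to a detector: one phrase’s share of all the words on the page. The sample strings and the comparison are invented; Google’s actual signals are far more involved.

```python
def keyword_density(text, keyword):
    words = text.lower().split()
    hits = sum(1 for word in words if word == keyword.lower())
    return hits / len(words) if words else 0.0

normal = "our guide covers budget travel tips for families visiting europe"
stuffed = "cheap flights cheap hotels cheap deals book cheap flights cheap"

# A natural page keeps any single term's share low; a stuffed page spikes.
print(f"normal:  {keyword_density(normal, 'cheap'):.0%}")   # 0%
print(f"stuffed: {keyword_density(stuffed, 'cheap'):.0%}")  # 50%
```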

ChatGPT users realized that it’s a fantastic tool for SEO, because they can tell the AI algorithm to draft a post with a specific keyword and it generates a decent one. Google’s search algorithm then reads that post and pushes it to the top of search results. ChatGPT was designed to read and learn language the same way as Google: skim the Internet, scoop up information from Web sites, and then use it to teach the algorithm. This threatens Google’s search profit margins, and Alphabet Inc. doesn’t like that:

“By and large, people don’t want to read AI-generated content, no matter how accurate it is. But the trouble for Google is that it can’t reliably detect and filter AI-generated content. I’ve used several AI detection apps, and they are 50% accurate at best. Google’s brain trust can probably do a much better job, but even then, it’s computationally expensive, and even the mighty Google can’t analyze every single page on the web, so the company must find workarounds.

This past fall, Google rolled out its Helpful Content Update, in which Google started to strongly emphasize sites based on user-generated content in search results, such as forums. The site that received the most notable boost in search rankings was Reddit. Meanwhile, many independent bloggers saw their traffic crash, whether or not they used AI.”

Google wants to save money by offloading AI detection and monitoring to forum moderators who usually aren’t paid. Unfortunately, SEO experts figured out Google’s new trick and are now spamming user-content-driven Web sites. Google recently signed a deal with Reddit to acquire its user data to train its AI project, Gemini.

Google hates AI-generated SEO and people who game its search algorithms. Google doesn’t have the resources to detect all the SEO experts, but when they are found, Google exacts vengeance with deindexing and better detection tools. Google released a new update to its spam policies to remove low-quality, unoriginal content made to abuse its search algorithm. The overall goal is to remove AI-generated sites from search results.

If you read between the lines, Google doesn’t want to lose more revenue and is calling out bad actors.

Whitney Grace, May 10, 2024

Which Came First? Cliffs Notes or Info Short Cuts

May 8, 2024

This essay is the work of a dinobaby. Unlike some folks, no smart software improved my native ineptness.

The first online index I learned about was the Stanford Research Institute’s Online System. I think I was a sophomore in college working on a project for Dr. William Gillis. He wanted me to figure out how to index poems for a grant he had. The SRI system opened my eyes to what online indexes could do.

Later I learned that SRI was taking ideas from people like Valerius Maximus (30 CE) and letting a big, expensive, mostly hot group of machines do what a scribe would do in a room filled with rolled-up papyri. My hunch is that other workers amid similar “documents” figured out that some type of labeling and grouping system made sense. Sure, anyone could grab a roll, untie the string keeping it together, and check out its contents. “Hey,” someone said, “Put a label on it and make a list of the labels. Alphabetize the list while you are at it.”


An old-fashioned teacher struggles to get students to produce acceptable work. She cannot write TL;DR. The parents will find their scrolling adepts above such criticism. Thanks, MSFT Copilot. How’s the security work coming?

I thought about the common sense approach to keeping track of and finding information when I read “The Defensive Arrogance of TL;DR.” The essay, or probably more accurately the polemic, calls attention to the précis, abstract, or summary often included with a long online essay. The inclusion of what is now dubbed TL;DR is presented as meaning, “I did not read this long document. I think it is about this subject.”

On one hand, I agree with this statement:

We’re at a rolling boil, and there’s a lot of pressure to turn our work and the work we consume to steam. The steam analogy is worthwhile: a thirsty person can’t subsist on steam. And while there’s a lot of it, you’re unlikely to collect enough as a creator to produce much value.

The idea is that content is often hot air. The essay includes a chart called “The Rise of Dopamine Culture,” created by Ted Gioia. Notice that the world of Valerius Maximus is not in the chart. The graphic begins with “slow traditional culture” and zips forward to the razz-ma-tazz datasphere in which we try to survive.

I would suggest that the march from bits of grass, animal skins, clay tablets, and pieces of tree bark to such examples of “slow traditional culture” like film and TV, albums, and newspapers ignores the following:

  1. Indexing and summarizing remained unchanged for centuries until the SRI demonstration
  2. In the last 61 years, manual access to content has been pushed aside by machine-centric methods
  3. Human inputs are less useful

As a result, the TL;DR tells us a number of important things:

  1. The person using the tag and the “bullets” referenced in the essay reveals that the perceived quality of the document is low or poor. I think of this TL;DR as a reverse Good Housekeeping Seal of Approval. We have a user-assigned “Seal of Disapproval.” That’s useful.
  2. The tag makes it possible either to NOT out (exclude) the content with a TL;DR tag or to group documents by the author so tagged for review. It is possible an error has been made or the document is an aberration which provides useful information about the author.
  3. The person using the tag TL;DR creates a set of content which can be either processed by smart software or a human to learn about the tagger. An index term is a useful data point when creating a profile.

I think the speed with which electronic content has ripped through culture has caused a number of jarring effects. I won’t go into them in this brief post. Part of the “information problem” is that the old-fashioned processes of finding, reading, and writing about something took a long time. Now Amazon presents machine-generated books whipped up in a day or two, maybe less.

TL;DR may have more utility in today’s digital environment.

Stephen E Arnold, May 8, 2024

Kagi Search Beat Down

April 17, 2024

This essay is the work of a dumb dinobaby. No smart software required.

People surprise me. It is difficult to craft a search engine. Sure, a recent compsci graduate will tell you, “Piece of cake.” It is not. Even with oodles of open source technology, easily gettable content, and a few valiant individuals who actually want relevant results — search and retrieval are tough to get right. The secret to good search, in my opinion, is to define a domain, preferably a technical field, identify the relevant content, obtain rights, if necessary, and then do the indexing and the other “stuff.”

In my experience, it is a good idea to have either a friend with deep pockets, a US government grant (hello, NSF, said Google decades ago), or a credit card with a hefty credit line. Failing these generally acceptable solutions, one can venture into the land of other people’s money. When that runs out or just does not work, one can become a pay-to-play outfit. We know what that business model delivers. But for a tiny percentage of online users, a subscription service makes perfect sense. The only problem is that selling subscriptions is expensive, and there is the problem of churn. Lose a customer and spend quite a bit of money replacing that individual. Lose a big customer and spend oodles and oodles of money replacing that big spender.

I read “Do Not Use Kagi.” This, in turn, directed me to “Why I Lost Faith in Kagi.” Okay, what’s up with the Kagi booing? The “Lost Faith” article runs about 4,000 words. The key passage for me is:

Between the absolute blasé attitude towards privacy, the 100% dedication to AI being the future of search, and the completely misguided use of the company’s limited funds, I honestly can’t see Kagi as something I could ever recommend to people.

I looked at Kagi when it first became available, and I wrote a short email to the “Vlad” persona. I am not sure if I followed up. I was curious about how the blend of artificial intelligence and metasearch was going to deal with such issues as:

  1. Deduplication of results
  2. Latency when a complex query in a metasearch system has to wait for a module to do its thing
  3. How the business model was going to work: expensive subscription, venture funding, collateral sales of the interface to law enforcement, advertising, etc.
  4. Controlling the cost of the pings, pipes, and power for the plumbing
  5. Spam control

I know from experience that those dabbling in the search game ignore some of my routine questions. The reasons range from “we are smarter than you” to “our approach just handles these issues.”


Thanks, MSFT Copilot. Recognize anyone in the image you created?

I still struggle with the business model of non-ad supported search and retrieval systems. Subscriptions work. Well, they worked out of the gate for ChatGPT, but how many smart search systems do I want to join? Answer: Zero.

Metasearch systems are simply sucker fish on the shark bodies of a Web search operator. Bing is in the metasearch game because it is a fraction of the Googzilla operation. It is doing what it can to boost its user base. Just look at the wonky Edge ads and the rumored minuscule gain the addition of smart search has delivered to Bing traffic. Poor Yandex is relocating and finds itself in a different world from the cheerful environment of Russia.

Web content indexing is expensive, difficult, and tricky.

But why pick on Kagi? Beats me. Why not write about dogpile.com, ask.com, the duck thing, or startpage.com (formerly ixquick.com)? Each embodies a certain subsonic vibe, right?

Maybe it is the AI flavor of Kagi? Maybe it is the amateur hour approach taken with some functions? Maybe it is just a disconnect between an informed user and an entrepreneurial outfit running a mile a minute with a sign that says, “Subscribe”?

I don’t know, but it is interesting that, when Web search is essentially a massive disappointment, some bright GenX’er has not figured out a solution.

Stephen E Arnold, April 17, 2024
