An Oath from the Past: Yahoo Web Scale Semantic Search

October 9, 2020

I spotted a link to “Yahoo: Web Scale Semantic Search.” You remember Yahoo, don’t you? This is the outfit with the data breaches, the clueless business model, and the sale to the Baby Bell Verizon. The executives too are memorable: Marissa, Alex, Terry, and the Peanut Butter memo man.

The link displayed a presentation by Edgar Meij, a laborer in Yahoo Labs. The topic was an X ray view from Mt. Olympus intended to reveal Web scale semantic search.

The slide deck requires 62 clicks to traverse. There are many riches in the presentation. I want to highlight three of these, and invite you to make your own determination of these insights.

First, there is a “text” accompanying the deck. It contains a riot of jargon and buzzwords. In fact, I have saved the text, despite a portion being truncated, as a glossary of Web search jive talk; for example “s a sequence of terms s 2 s drawn from the set S, s ? Multinomial(?s) e a set of entities e 2 e.” (I knew you would experience the same thrill I did when I read this line.) True to Slideshare’s attention to detail, the text for slides 32 to 62 has been removed. Great loss indeed.

Second, Yahoo cares about knowledge. Consider this diagram:

image

The idea is that one acquires knowledge (I assume this means scraping and indexing Web site content), knowledge integration (creating a big index), and knowledge consumption (maybe finding something when a user or system sends a query to the search subsystem). The key point is “knowledge” is important. How about that? Yahoo search was focusing on knowledge? Is that why Yahoo floundered in search for many, many years before bowing to failure?

Third, Yahoo’s approach to semantic search requires humans. Here’s proof:

image

When Yahoo announced Vin Diesel was dead, he was alive. So much for smart software.

Why am I mentioning this blast from the past?

Knowledge was talked about in my interview/discussion with Dr. Stavros Macrakis. We tackled the difference between Web search and enterprise search. This Yahoo deck illustrates that talk about knowledge is one thing. Delivering useful results to a user is quite another.

Jargon in search and retrieval has made more progress than search technology itself. That’s why the Yahoo deck could have been crafted yesterday by one of the search vendors still chasing a huge market in the era of Lucene/Solr and “good enough” information access.

Stephen E Arnold, October 9, 2020

Technical Debt: Making Something Ignored Understandable to Suits

September 29, 2020

I have been fortunate to have been on the edges of several start ups; for example, The Point (Top 5% of the Internet), a system ultimately sold to CMGI / Lycos in the late 1990s. When the small team began work on the product, we used available servers, available software, and methods based on our prior experience. When we started work on The Point in 1994, the task seemed pretty simple: Use what we know and provide an index to curated Web sites. At that time, it was possible to scan a list of new Web sites and select ones which seemed promising. This was at first a manual process, but the handful of people working on this figured out ways to reduce the drudgery.

I learned (as one of the resources for hardware, software, and money) that those early decisions were similar to established business economics in some ways and quite different in others. Let me give one brief example and then address the information in “Most Technical Debt Is Just Bullsh*t.”

For The Point one of the wizards on my team used Paradox. I know the Georgia Tech grad and Westinghouse Science winner asked me and I just grunted. Who cared? This was a mere test of an idea, not a project for an outfit like Thomson Corporation or the US government. My partner and I had worked on a CD of bird songs, and The Point seemed similar to that effort. Who knew in 1994? I sure did not.

That Paradox decision created technical debt. The database was okay, but it was not designed for multiple humans and software systems to update the files on a continuous basis. We could not do real time because the cheapo Sparc server I had was designed to run an indexing system called STAR. We figured out how to make Paradox work, but those early decisions had lasting impact.

I realized that making a database decision was similar to Henry Ford’s River Rouge. That concrete and building built at one end of the giant complex was not going anywhere. First, Mr. Ford was busy making cars. Second, Mr. Ford needed the resources to be directed at making more cars. Third, Mr. Ford had to make decisions about now problems, not problems that were not fully understood at the time of making fresh decisions. As a result, River Rouge became a giant thing that was mostly unchanged and unchangeable. The same observation can be made about Google-type companies. (Think of new features as software wrappers, not changes to the plumbing.)

That’s technical debt. The focus, resources, and understanding to change what has been put in place and actually working is not a hot topic for a robust discussion of “Let’s do this over again.” Nope.

The Louwrentius article dances around this reality in my opinion; for example, recasting Ward Cunningham, who coined the bound phrase, the write up states: Technical debt exists:

as a form of prototyping. To try out and test design/architecture to see if it fits the problem space at hand. But it also incorporates the willingness to spend extra time in the future to change the code to better reflect the current understanding of the problem at hand.

The write up ends with this statement:

Although Cunningham meant well, I think the metaphor of technical debt started to take on a life of its own. To a point where code that doesn’t conform to some Platonic ideal is called technical debt. Every mistake, every changing requirement, every tradeoff that becomes a bottleneck within the development process is labeled ‘technical debt’. I don’t think that this is constructive. I think my friend was right: the concept of technical debt has become bullshit. It doesn’t convey any better insight or meaning. On the contrary, it seems to obfuscate the true cause of a bottleneck. At this point, when people talk about technical debt, I would be very skeptical and would want more details. Technical debt doesn’t actually explain why we are where we are. It has become a hollow, hand-wavy ‘explanation’. With all due respect to Cunningham, because the concept is so widely misunderstood and abused, it may be better to retire it.

My personal view is:

  1. Technical debt is a bad way to say, “A software product or service is like a building that can either be a money loser or be torn down.” As long as it works — generates revenue — do as little as possible to keep the revenue flowing.
  2. Technical debt is fungible. It is like the poorly designed intake infrastructure for River Rouge. The bricks and concrete are not going away without significant investment and disruption.
  3. Technical debt is poorly understood. Humans are not very good at knowing what one does not know. I suppose that is Paradox-like. Who knew?

The good news is that CMGI’s check cleared the bank and The Point is now mostly forgotten like its technical debt. Who paid it off? I didn’t.

Stephen E Arnold, September 29, 2020

A Push for ISYS Search. Sorry, Lexmark. Oh, Right, Hyland

September 9, 2020

Those 1980s and enterprise search were a combo. Ian Davies’ search and retrieval system was very good. In fact, a long time ago I visited the old Crow’s Nest offices and sold a small job. After all, how many people from rural Kentucky end up in Sydney wanting to talk about search? Answer: Not too many. I wrangled an invitation because I complained about how the system displayed PDF files in a results list.

Flash forward and ISYS Search moved some operations to the US. Eventually the excitement waned and ISYS Search became a property of Lexmark. Lexington, Kentucky, had spawned a weird enterprise wide content management system which fetched a pretty price, and I assumed that Lexmark wanted its own content-centric technology. The wheels of time turned like a grindstone and Lexmark was caught between the business ends of the grist wheel. ISYS Search was now getting long in the tooth, and the company was sold to Hyland, also in the content management business. At this time, Lexmark had looped the loop from IBM to Chinese ownership.

Hence I was surprised to read “Why a Government Agency Needs Enterprise Search in the Modern World.” This was a message ISYS in the late 1980s was emitting as it marketed its system to law enforcement agencies with reasonable success. The write up by Hyland’s Australia manager states:

Enterprise Search is becoming an essential ‘uber tool’ for content organization and discovery. More than just adding yet another layer of applications to the department’s arsenal of tools, enterprise search allows for the organized creation, indexing and retrieval of data – both structured and unstructured – through one simple interface.

In an interview with me in 2008, Ian Davies said:

What defined search back then was the significance of the need — users were after information that truly was mission critical.  Now , juxtapose that with today, where search has expanded to address usability and the need to leverage corporate knowledge. What we have is a keen demand for mission critical search and retrieval, content processing, and analysis. In addition, there are large numbers of organizations that are trying to make the best use of the information in digital form. Mission-critical search manipulates information to identify a criminal, which may be a matter of public safety, or extract the key fact from information related to a legal matter. Essential search helps employees find the answer or the information needed to do their work today. Both drive the growth of ISYS. I don’t see either need diminishing going forward.

Similar? Yep.

Observations:

  • Enterprise search is a challenge and shall remain so
  • The lingo used to explain enterprise search is almost timeless
  • The technology and its “promises” have persisted at ISYS for more than two decades.

Why hasn’t ISYS generated greater traction? Why has the core plumbing remained the same for decades? Those are important questions because they reveal much about the enterprise search sector which seems like an easy way to generate oodles of cash.

One issue is that enterprise search, like most policeware and intelware systems, serves a very difficult market sector. One of the most popular enterprise search systems is, for instance, open source and free of license fees. That’s new. The sales pitch and arguments for paying for search are not.

Stephen E Arnold, September 9, 2020

SlideShare: Some Work to Do

August 12, 2020

DarkCyber noted “Scribd Acquires Presentation Sharing Service SlideShare from LinkedIn.” In 2004, one could locate presentations on Google by searching for the extension ppt and its variants. In 2006, SlideShare became available. Then something happened. PowerPoints became more difficult to locate. When an online search pointed to a PowerPoint deck, the content was:

  1. Marketing fluff
  2. Incorrectly rendered with weird typography and wonky graphics
  3. Corrupted files.

What about today? DarkCyber’s most recent foray into the slide deck content wilderness produced zero; for example, SlideShare search produced identical pages of search results. The query retrieved slide decks on unrelated topics. Even worse, a query would result in SlideShare’s sending email upon email pointing to other slide decks. The one characteristic of these related slide decks was/is that they were unrelated to the information we sought.

There are online presentation services. There are open source presentation tools like SoftMaker’s. There is the venerable Keynote which never quite converts a PowerPoint file correctly.

Is there a future in a searchable collection of slide decks? In theory, yes. In reality, the cost of finding, indexing, and making searchable presentations faces some big hurdles; for example:

  1. Many organizations — for example, DARPA — standardize on PDF file formats. These are okay, but indexing these can be an interesting challenge
  2. Some presenters put their talks in the cloud, hoping that an Internet connection will allow their slides to display
  3. The Zoom world puts PowerPoints and other presentation materials on the user’s computer, never to make it into a more publicly accessible repository.

Like the dream of collecting conferences, presentations, and poster sessions, some content remains beyond the reach of researchers and analysts. The desire to get anyone looking for a slide deck to subscribe to a service gives operators of this service a chance to engage in spreadsheet fever. Here’s how this works: If there are X researchers and we get Y percent of them, we can charge Z per year. By substituting guesstimates for the variables, the service becomes a winner.
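The spreadsheet fever arithmetic fits in a few lines of Python. This is only a sketch: the function name and the sample numbers are hypothetical, chosen to show how the guesstimate for Y drives the whole result.

```python
def projected_revenue(researchers: int, capture_pct: float, price_per_year: float) -> float:
    """Annual revenue if capture_pct percent of researchers pay price_per_year."""
    subscribers = researchers * capture_pct / 100.0
    return subscribers * price_per_year


if __name__ == "__main__":
    # Hypothetical guesstimates: X = 1,000,000 researchers, Z = $100 per year.
    for pct in (0.1, 1.0, 5.0):
        revenue = projected_revenue(1_000_000, pct, 100.0)
        print(f"Y = {pct}% -> ${revenue:,.0f} per year")
```

Nudge Y from a tenth of a percent to five percent and the same service looks fifty times better on paper. Nothing about the product changed.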

The reality is that finding information in slide decks is more difficult today than it was in 2004. Access to information is becoming more difficult. DarkCyber would like to experience a SlideShare with useful content, more effective search and retrieval, and far fewer one page duplicates of ads for books.

Someday. Maybe?

Stephen E Arnold, August 12, 2020

NetDocuments Employs BA Insight Tech for Enterprise Search

August 10, 2020

For a secure, cloud-based data solution, many law firms, legal departments, and compliance teams turn to NetDocuments. Now the platform has adopted technology from a familiar name to simplify its clients’ access to information. A post at PRWeb reveals, “NetDocuments Introduces NetKnowledge Enterprise Search Powered by BA Insight.” We find it interesting that the 16-year-old BA Insight is licensing its askable-knowledge system to create the new tool, NetKnowledge. The press release describes the system’s advantages:

“Eliminate Downloading and Indexing Data for Search: No longer does content within NetDocuments need to be downloaded and indexed to be part of an organization’s enterprise search. Simply search within the NetDocuments platform, and NetKnowledge will find relevant data–along with information from other sources —and present it to users.

“Enforce Access Controls on Sensitive Information: Sensitive information may need to be restricted to certain individuals, but that data also needs to be available to others via enterprise search. NetKnowledge respects data restriction policies at the source and will only present data to individuals with proper access rights.

“Manage Large and Disparate Data Sets Across the Organization: NetKnowledge helps organizations bring all its data together to form a single source of truth, so users do not have to perform multiple searches in different places to get the information they need.”
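The access control claim in the press release is a familiar pattern, often called security trimming: filter results against the permissions recorded at the source before showing them. The sketch below illustrates the general idea only. It is not NetKnowledge or BA Insight code, and every name in it (Document, allowed_groups, search) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    title: str
    text: str
    # Groups permitted to see this document, as recorded at the source.
    allowed_groups: set = field(default_factory=set)


def search(documents, query, user_groups):
    """Return titles of matching documents that the user is allowed to see."""
    query = query.lower()
    groups = set(user_groups)
    return [
        doc.title
        for doc in documents
        if query in doc.text.lower() and doc.allowed_groups & groups
    ]


if __name__ == "__main__":
    docs = [
        Document("Merger memo", "Confidential merger terms", {"partners"}),
        Document("Holiday schedule", "Office closed after the merger", {"partners", "staff"}),
    ]
    print(search(docs, "merger", {"staff"}))     # only what staff may see
    print(search(docs, "merger", {"partners"}))  # both documents
```

The trimming happens at query time, so a change to a document's permissions takes effect on the next search.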

Founded in 2004, BA Insight is based in Boston, Massachusetts. The company is dedicated to making information easier to find for organizations of all stripes. NetDocuments is headquartered in Lehi, Utah. The company was founded in 1999 and acquired by Clearlake Capital Group in 2017.

Cynthia Murrell, August 10, 2020

Twitter: Another Almost Adult Moment

August 7, 2020

Indexing is useful. Twitter seems to be recognizing this fact. “Twitter to Label State-Controlled News Accounts” reports:

The company will also label the accounts of government-linked media, as well as “key government officials” from China, France, Russia, the UK and US. Russia’s RT and China’s Xinhua News will both be affected by the change. Twitter said it was acting to provide people with more context about what they see on the social network.

Long overdue, the idea of an explicit index term may allow some tweeters to get some help when trying to figure out where certain stories originate.

Twitter, a particularly corrosive social media system, has avoided adult actions. The firm’s security was characterized in a recent DarkCyber video as a clown car operation. No words were needed. The video showed a clown car.

Several questions from the DarkCyber team:

  1. When will Twitter verify user identities, thus eliminating sock puppet accounts? Developers of freeware manage this type of registration and verification process, not perfectly but certainly better than some other organizations’.
  2. When will Twitter recognize that a tiny percentage of its tweeters account for the majority of the messages and implement a Twitch-like system to generate revenue from these individuals? Pay-per-use can be implemented in many ways, so can begging for dollars. Either way, Twitter gets an identification point which may have other functions.
  3. When will Twitter innovate? The service is valuable because a user or sock puppet can automate content regardless of its accuracy. Twitter has been the same for a number of Internet years. Dogs do age.

Is Twitter, for whatever reason, stuck in the management mentality of a high school science club which attracts good students, just not the whiz kids who are starting companies and working for Google type outfits from their parents’ living room?

Stephen E Arnold, August 7, 2020

The Gray Lady: A New Approach to Real News

July 31, 2020

DarkCyber wants to capture a couple of quotes from “Newsonomics: The New York Times’ New CEO, Meredith Levien, on Building a World-Class Digital Media Business — and a Tech Company.”

The first thing DarkCyber noted is that the NYT will pivot from “real” news to a different business: Technology. Publishing companies have a long track record of innovation in technology. The pivot, therefore, is going to be a continuation of this success trend line, right?

We noted this statement:

The publisher [40-year-old A.G. Sulzberger] is a decade younger than me. The thing that I’ve always said about him, which I think is true about both of us, is that we’re both wired as old souls. Most of what we’re both trying to do is to think what this is going to feel like three years from now, five years from now. And I think he thinks, as the whole family thinks, what’s this going to mean 10 years from now, 20 years from now? We might not have the years, but we’re certainly pushing ourselves to have that mindset. It’s been my experience that everybody in the Sulzberger family has that mindset.

Remarkable: a techno-news outfit thinking in terms of decades. How long is that in Wall Street time? How long in Internet time?

And a final quote:

Engineering now is the second largest functional area at the New York Times, only behind journalism, and the largest function by far on the business side.

How will the NYT deal with technical debt? Will the NYT emulate Amazon, Apple, Facebook, and Google?

And what about objectivity?

Technology is objective. It is the use of technology which has political, social, and economic consequences. And what about the two decade view?

Wall Street and TikTok types have a somewhat more truncated view of “time” as well. The NYT’s digital history seems to be forgotten. The LexisNexis “exclusive,” the Jeff Pemberton Times Online thing, the indexing operation in New Jersey, etc. etc.

Today’s revolution has taken about 50 years to arrive. The result? A newspaper company becoming a technology company. And technical debt? Right.

Stephen E Arnold, July 30, 2020

Digital Shadows: Cyber Monitoring Inside

July 29, 2020

DarkCyber has pointed out in this blog and in the DarkCyber video news programs that cyber security generates hyperbole. Funding sources pump in cash. Companies buy not one cyber security system; big and mid-sized outfits buy new ones with each change of security professionals.

Why?

Most of the cyber security systems focus on what happened in the past. However, bad actors — some well funded by low profile operators — focus on the here and now.

Not surprisingly, competing claims, pricing plays, and fearful prospects keep the wheel spinning.

“Digital Shadows Announces Integration with Atlassian Jira” indicates that the stealthy Digital Shadows has moved inside an issue tracking and project tracking platform. Presumably the “SearchLight Inside” deal will deliver better security to Jira users. Will this tie up boost Atlassian stock?

DarkCyber assumes that other Dark Web and cyber threat indexing services will pursue similar “inside” deals.

The real test comes when licensees of these “inside” cyber threat solutions demonstrate they can avoid Garmin- and Twitter-type security breaches.

Stephen E Arnold, July 29, 2020

The Curious Case of a SEO Expert Who Sees a Link Between Dining and DarkCyber

July 29, 2020

This is another “SEO Follies” write up by the DarkCyber research team. The essay falls into three parts: First, an explanation of why irrelevant backlinks are the rage among search engine optimization experts; second, how language becomes an irritant and a reflection on the search engine optimization company’s business methods; and, third, some reflections on the stupidity of some SEO or search engine optimization sales methods. I want to point out that SEO professionals mostly bilk unsuspecting customers by promising them that their Web page will be more findable in Google. If a company wants traffic, the company will either have to buy ads from Google or remain deep in a search results list.

The Quest for Backlinks

In my first Google monograph “The Google Legacy”, my research team and I compiled a list of publicly disclosed ranking factors. The list has been used by some universities in their information science courses (example: Syracuse University). The majority of these factors are recycled ideas from other search systems, research conducted by IBM Almaden (example: the CLEVER system), and common sense (example: the more links pointing to a Web page, the “better” that Web page is if one uses de Tocqueville’s concept of average as a way to determine what’s “good”).
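The link counting factor in the last example can be sketched as a simple tally of inbound links. This illustrates the common sense idea only, not Google's method; the page names and link pairs are made up.

```python
from collections import Counter


def inbound_link_counts(links):
    """Count how many distinct pages link to each target page.

    links is an iterable of (source_page, target_page) pairs.
    Self-links and duplicate pairs are ignored.
    """
    unique_links = {(src, dst) for src, dst in links if src != dst}
    return Counter(dst for _, dst in unique_links)


if __name__ == "__main__":
    # Hypothetical link graph.
    links = [
        ("a.example", "c.example"),
        ("b.example", "c.example"),
        ("a.example", "b.example"),
        ("a.example", "c.example"),  # duplicate, counted once
        ("c.example", "c.example"),  # self-link, ignored
    ]
    for page, count in inbound_link_counts(links).most_common():
        print(page, count)
```

A tally like this treats every link as a vote and takes no account of whether the linking page has anything to do with the target.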

The current rage for backlinks is little more than an effort to generate a false “good” score for a Web page. The present technique is practiced by charlatans like the Hustler who makes crazy videos and by companies like the once prestigious Boston Consulting Group. Since The Google Legacy, two changes have taken place at Google. First, the company has expanded its grip on online advertising despite the best efforts of Amazon and Facebook. Second, the options for getting independent, objective search results from Google have decreased. The reality is that a business either buys traffic or pays a charlatan pitching search engine magic. Either way, a business has to pay for traffic. There are exceptions, but these are forced upon Google due to exogenous circumstances and most organizations cannot rely on an anomaly to publicize their existence.

The trend I have noticed is that requests for backlinks are coming more and more frequently. Here’s an example I received from a company in the UK authored by an SEO marketer delightfully named Izaak Crook. He wrote:

HI Sa,

How’s it going? I’m Izaak from AppInstitute.

I was browsing arnoldit.com and I noticed you’d covered Restaurant Technology before, linking to https://restauranttechnologynews.com/2019/07/online-food-delivery-fraud-increasing-can-tech-address-problem from http://arnoldit.com/wordpress/category/statistics/

I wondered if you’d be interested in checking out our post “7 Restaurant Technology Trends to Watch Out for in 2020”. We take a look at some of the key restaurant technologies to watch out for and how they’re going to change the industry.

If you deem it worthy of a link from arnoldit.com that would be a dream come true.

Either way, it’d be cool to discuss how we can collaborate in the future. Enjoy your Friday and speak soon!

Kind Regards,

Izaak

I had my team poke around and we learned that Mr. Crook (love that name, right?) works at a firm named AppInstitute. According to the company’s Web site, the group develops “apps.” Why is an app development company wanting me to link to a story about restaurant fraud? Sure, DarkCyber covers cybercrime, but odd ball references just underline my point about SEO silliness and the belief among SEO experts that backlinks will get significant traction with the new, revenue hungry Google. (Sure, Google generates a great deal of money, but the company is smart enough to realize that the unregulated, anything goes world of the pre-Trump era is ending. Plus, Google costs are getting very difficult to control. Then Google has to consider the Amazon and Facebook advertising competitors. These companies are not Excite and Lycos.)

The Language of the SEO World

In an email exchange with Mr. Crook (wonderful, evocative name, is it not?) he used this phrase:

Okay, Boomer.

I am a septuagenarian, 76 soon to be 77. I had to contact two members of my DarkCyber research team to get a read on the phrase “Okay, Boomer.” I was aware that a baby boomer described people born after World War II. From my team, I learned that it is:

  • An age-biased slur when used to indicate that a person of age is out of touch with someone who is a thumbtyper, TikToker, and Facebook champion
  • A derogatory term for a person who is older than a Gen X or Millennial
  • An indicator that the person called a “boomer” is stupid, out of touch, irrelevant, a nuisance, etc.

The team told me that if I were called a Boomer in a public setting, I could contact one of my attorneys and pursue the hate speech angle. Hate speech. Directed at a person soon to be 77. Over an overly familiar email asking for something for free.

The AppInstitute email is representative of the SEO junk I receive on a daily basis. I did not like the tone of the email and I was not happy to learn that boomer was a slur.

First, the familiarity of the “Hi” and the use of two of my initials indicates a certain casual mindset, a thought process incapable of understanding how familiarity is interpreted by someone like me as either careless or stupid. Call me old fashioned, but “Mr. Arnold” is what I prefer in business email.

Second, the reference to a Beyond Search/DarkCyber story about restaurants is amusing. I don’t write about restaurants; I eat at restaurants. Anyone who has looked at any of the more than 13,000 articles in this blog can figure out that feeding people is not one of my primary, secondary, or tertiary interests. I am not going to “check out” a frothy, probably substandard report about the restaurant industry. Apparently Izaak has not seen Yelp’s report about the state of the restaurant business. Here’s a story called “Nearly 16,000 Restaurants Have Closed Permanently Due to the Pandemic, Yelp Data Shows.” Read it, Izaak Crook. Yelp’s information obviates the need for a “report”.

Third, note the word choice. Izaak had a thesaurus handy, I would wager, or his prestigious employer provided him with a spam script and ready-to-roll bot:

One syllable fancy word: deem

Two syllable fancy word: worthy

Four syllable jargony word: collaborate

Then there were colloquial phrases like:

checking out

watch out for

going to change

dream come true

cool.

Izaak adds this thoughtful postscript: “Don’t want emails from us anymore? Reply to this email with the word “UNSUBSCRIBE” in the subject line.”

The entire email screams spam, carelessness, failure to know to whom one is writing, and arrogance. Am I going to do something for an unknown entity named Izaak Crook without sending a bill? Answer: Not a chance.

My research team told me that the Izaak Crook entity is a person. He is the head of marketing and he is a T shaped marketer, a growth hacker, and an “SaaS fanatic.” He’s spent 14 months performing search engine optimization and conversion rate optimization for the first class outfit AppInstitute. He was a digital communications apprentice for a company called Champions UK plc. Before that he was a social media manager. I love the “apprentice” role as part of an SEO’s work history.

Now Mr. Crook (an evocative name, is it not?) markets AppInstitute. A quick check reveals that this top drawer outfit is an “AppBuilder for busy small business owners.” The CEO is Ian Naylor who is a serial entrepreneur. I have been told by one of my researchers that the company is small and seems to do many things, not just apps. SEO is one of those many things.

Stepping Back

I have written a number of blog posts, articles, and essays about the loss of relevance in ad-supported Web search systems. The erosion of relevance, to summarize my conclusions, is the result of three factors:

  1. A need to generate revenue in order to pay for indexing, updating, and serving answers to users’ queries.
  2. A desire on the part of marketers and webmasters to get coverage in search result pages without having to pay Google for traffic
  3. The more recent imperative of the ad-supported Web search engines to extend their control over flows of user behavior data.

In this environment, clicks, psychological tells via clicks, and surveillance technology mean that comprehensive data collection is essential. Traffic results from feedback loops and intentional presentation of certain content. Free visibility is not part of the game plan.

SEO marketing is going to fail. Some tactics may spoof Mother Google and deliver a short term boost. However, Mother Google wants SEO to fail because it forces the SEO wizard to herd those desperate for traffic into the advertising kill pen.

Who knows this simple game plan? Maybe the SEO expert who also moonlights as an ad sales rep for Google does? I surmise that Google continues to cultivate SEO professionals as part of the company’s ad sales strategy.

Why write me? Certainly Mr. Crook cares not a whit for my blog and content. He wants a link. He wants to make a sale. He wants to get the client to buy more and more SEO and then AppInstitute will probably sell that customer Google advertising and get a commission.

No thanks, Mr. Crook. I want no part of your SEO scam. I don’t want to help out AppInstitute. However, I do hope that the upcoming Congressional hearings lead to meaningful regulation of certain large high technology firms.

But we live in Rona times, and I must admit, the odds of ethical, responsible behavior are long.

And the links Mr. Crook (tasty, evocative name) wants? Here they are:

Izaak Crook: izaak@appinstitute.co

AppInstitute: https://appinstitute.com/

It seems to some of the DarkCyber team that “Boomer” is hate speech.

Slick marketing method indeed.

Stephen E Arnold aka “Boomer”, July 29, 2020

Close Enough for Horse Shoes? Why Drifting Off Course Has Become a Standard Operating Procedure

July 14, 2020

One of the DarkCyber research team sent me a link to a post on Hacker News: “How Can I Quickly Trim My AWS Bill?” In the write up were some suggestions from a range of people, mostly anonymous. One suggestion caught my researcher’s attention and I too found it suggestive.

Here’s the statement the DarkCyber team member flagged for me:

If instead this is all about training / the volume of your input data: sample it, change your batch sizes, just don’t re-train, whatever you’ve gotta do.

Some context. Certain cloud functions are more “expensive” than others. Tips range from dumping GPUs for CPUs to “Buy some hardware and host it at home/office/etc.”

I kept coming back to the suggestion “don’t retrain.”

One of the magical things about certain smart software is that the little code devils learn from what goes through the system. The training gets the little devils or daemons to come out of bed and into the smart software gym.

However, in many smart processes, the content objects processed include signals not in the original training set. Off the shelf training sets are vulnerable just like those cooked up by three people working from home with zero interest in validating the “training data” from the “real world data.”

What happens?

The indexing or metadata assignments “drift.” This means that the smart software devils index a content object in a way that is different from what that content object should be tagged.

Examples range from this person matches that person to we indexed the food truck as a vehicle used in a robbery. Other examples are even more colorful or tragic depending on what smart software output one examines. Detroit facial recognition ring a bell?

Who cares?

I care. The person directly affected by shoddy thinking about training and retraining smart software, however, does not.

That’s what is troubling about this suggestion. Care and thought are mandatory for initial model training. Then as the model operates, informed humans have to monitor the smart software devils and retrain the system when the indexing goes off track.
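One way to picture the monitoring step is a rough drift check. The sketch below compares the mix of tags a model assigned on validated data with the mix it assigns later, and flags the system for human review when the two diverge. The function names, the sample tags, and the 0.2 threshold are all hypothetical; this is not any vendor's method.

```python
from collections import Counter


def tag_distribution(tags):
    """Return the share of each tag in a list of assigned tags."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


def drift_score(baseline_tags, current_tags):
    """Total variation distance between two tag distributions (0.0 to 1.0)."""
    base = tag_distribution(baseline_tags)
    curr = tag_distribution(current_tags)
    all_tags = set(base) | set(curr)
    return 0.5 * sum(abs(base.get(t, 0.0) - curr.get(t, 0.0)) for t in all_tags)


def needs_review(baseline_tags, current_tags, threshold=0.2):
    """Flag the model for human review when drift exceeds the threshold."""
    return drift_score(baseline_tags, current_tags) > threshold


if __name__ == "__main__":
    # Hypothetical tag samples: validated output versus recent output.
    baseline = ["vehicle"] * 8 + ["person"] * 2
    current = ["vehicle"] * 4 + ["person"] * 2 + ["robbery"] * 4
    print(drift_score(baseline, current), needs_review(baseline, current))
```

A shift in the tag mix is a signal, not proof of bad indexing. An informed human still has to sample the outputs and decide whether to retrain, and that is the step the “don’t retrain” advice skips.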

The big or maybe I should type BIG problem today is that very few individuals want to do this even if an enlightened superior says, “Do the retraining right.”

Ho ho ho.

The enlightened boss is not going to do much checking and the outputs of a smart system just keep getting farther off track.

In some contexts like Google advertising, getting rid of inventory is more important than digging into the characteristics of Oingo (later Applied Semantics) methods. Get rid of the inventory is job one.

For other model developers, shapers, and tweakers, the suggestion to skip retraining is “good enough.”

That’s the problem.

Good enough has become the way to refactor excellence into substandard work processes.

Stephen E Arnold, July 14, 2020
