Shaping Data Is Indeed a Thing and Necessary

April 12, 2021

I gave a lecture at Microsoft Research many years ago. I brought up the topic of Kolmogorov’s complexity idea and making fast and slow smart software sort of work. (Remember that Microsoft bought Fast Search & Transfer which danced around making automated indexing really super wonderful like herring worked over by a big time cook.) My recollection of the Microsoft group’s reaction was, “What is this person talking about?” There you go.

If you are curious about the link between a Russian math person (one once dumb enough to hire one of my relatives to do some grunt work) and today’s smart software, check out the 2019 essay “Are Deep Neural Networks Dramatically Overfitted?” Spoiler: You betcha.

The essay explains that mathy tests signal when a dataset is just right: no more and no less data than needed. Thus, if the data are “just right,” the outputs will be on the money, accurate, and close enough for horse shoes.

The write up states:

The number of parameters is not correlated with model overfitting in the field of deep learning, suggesting that parameter counting cannot indicate the true complexity of deep neural networks.

Simplifying: “Oh, oh.”

Then there is a workaround. The write up points out:

The lottery ticket hypothesis states that a randomly initialized, dense, feed-forward network contains a pool of subnetworks and among them only a subset are “winning tickets” which can achieve the optimal performance when trained in isolation. The idea is motivated by network pruning techniques — removing unnecessary weights (i.e. tiny weights that are almost negligible) without harming the model performance. Although the final network size can be reduced dramatically, it is hard to train such a pruned network architecture successfully from scratch.

Simplifying again: “Yep, close enough for most applications.”
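To make the pruning idea concrete, here is a minimal sketch of magnitude pruning (my own illustration in Python, not code from the essay): keep the heavyweight parameters, zero out the featherweights, and hope a “winning ticket” is left standing.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights, keeping the largest (1 - sparsity) share."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return weights * (np.abs(weights) >= threshold)

# Toy "layer" with 1,000 weights, pruned to 90 percent sparsity.
rng = np.random.default_rng(0)
layer = rng.normal(size=1_000)
pruned = magnitude_prune(layer, sparsity=0.9)
print(f"non-zero weights before: {np.count_nonzero(layer)}, after: {np.count_nonzero(pruned)}")
```

The hard part, as the essay notes, is not the arithmetic; it is that a network pruned this aggressively usually cannot be trained successfully from scratch.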

What’s the fix? Keep the data small.

Doesn’t that create other issues? Sure does. For example, what about real time streaming data which diverge from the data used to train the smart software? You know, the “change” thing, when historical data no longer apply. Smart software is possible as long as the aperture is small and the data shaped.

There you go. Outputs are good enough but may be “blind” in some ways.

Stephen E Arnold, April 12, 2021

An Exploration of Search Code

April 9, 2021

Software engineer Bart de Goede posts an exercise in search coding on his blog—“Building a Full-Text Search Engine in 150 Lines of Python Code.” He has pared down the thousands and thousands of lines of code found in proprietary search systems to the essentials. Of course, those platforms have many more bells and whistles, but this gives one an idea of the basic components. Navigate to the write-up for the technical details and code snippets that I do not pretend to follow completely. The headings de Goede walks us through include Data, Data preparation, Indexing, Analysis, Indexing the corpus, Searching, Relevancy, Term frequency, and Inverse document frequency. He concludes:

“You can find all the code on Github, and I’ve provided a utility function that will download the Wikipedia abstracts and build an index. Install the requirements, run it in your Python console of choice and have fun messing with the data structures and searching. Now, obviously this is a project to illustrate the concepts of search and how it can be so fast (even with ranking, I can search and rank 6.27m documents on my laptop with a ‘slow’ language like Python) and not production grade software. It runs entirely in memory on my laptop, whereas libraries like Lucene utilize hyper-efficient data structures and even optimize disk seeks, and software like Elasticsearch and Solr scale Lucene to hundreds if not thousands of machines. That doesn’t mean that we can’t think about fun expansions on this basic functionality though; for example, we assume that every field in the document has the same contribution to relevancy, whereas a query term match in the title should probably be weighted more strongly than a match in the description. Another fun project could be to expand the query parsing; there’s no reason why either all or just one term need to match.”
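For the curious, the core machinery really does fit in a few dozen lines. Here is a minimal sketch of an inverted index with TF-IDF ranking; it is my own illustration of the concepts, not de Goede’s code, and it skips his analysis steps such as stemming and stop word removal.

```python
import math
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on whitespace; a real analyzer also strips punctuation, stems, etc."""
    return text.lower().split()

class TinyIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.docs = {}                    # document id -> token list

    def add(self, doc_id, text):
        tokens = tokenize(text)
        self.docs[doc_id] = tokens
        for term in tokens:
            self.postings[term].add(doc_id)

    def search(self, query):
        """AND query: return documents containing every term, ranked by a simple TF-IDF score."""
        terms = tokenize(query)
        if not terms or any(t not in self.postings for t in terms):
            return []
        candidates = set.intersection(*(self.postings[t] for t in terms))
        n = len(self.docs)
        scored = []
        for doc_id in candidates:
            score = sum(
                self.docs[doc_id].count(t) * math.log(n / len(self.postings[t]))
                for t in terms
            )
            scored.append((score, doc_id))
        return sorted(scored, reverse=True)

index = TinyIndex()
index.add(1, "london is the capital of great britain")
index.add(2, "paris is the capital of france")
index.add(3, "the capital of france is paris")
print(index.search("capital france"))  # doc ids 2 and 3, highest score first
```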

For more information, de Goede recommends curious readers navigate to MonkeyLearn’s post “What is TF-IDF?” and to an explanation of “Term Frequency and Weighting” posted by Stanford’s NLP Group. Happy coding.

Cynthia Murrell, April 9, 2021

Let Us Now Consider Wonky Data and Tagging

March 31, 2021

As you may know, I find MIT endlessly amusing. From the Jeffrey Epstein matter to smart people who moonlight for other interesting entities, the esteemed university does not disappoint. I noted an article about an MIT finding which is interesting. “MIT’s AI Dataset Study and Startling Findings” reports:

MIT Researchers analyzed 10 test sets from datasets, including ImageNet, and found over 2,900 errors in the ImageNet validation set alone. When used as a benchmark data set, the errors in the dataset were proved to have an incorrect position in correlation to direct observation or ground truth.

So what?

Garbage in, garbage out.

This is not a surprise and it certainly seems obvious. If anything, the researchers’ error rate seems low. There is no information about data pushed into the “exception” folder for indexing systems.

Stephen E Arnold, March 31, 2021

So You Wanna Be a Google?

March 31, 2021

Just a short item which may be of interest to Web indexing wannabes: Datashake has rolled out its Web Scraper API. You can read about how to:

Scrape the web with proxies, CAPTCHA solving, headless browsers and more to avoid being blocked.

You will have to sign up to get “early access” to the service. The service is not free … because scraping Web sites is neither easy nor inexpensive.
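For what it is worth, scrape-as-a-service offerings generally boil down to an authenticated HTTP call. The sketch below shows the general shape in Python; the endpoint, parameter names, and token are hypothetical placeholders, not Datashake’s documented interface.

```python
import requests

API_TOKEN = "YOUR_EARLY_ACCESS_TOKEN"  # placeholder credential

# Hypothetical scraper-API call: the URL and parameter names are illustrative only.
response = requests.get(
    "https://api.example-scraper.com/v1/scrape",     # hypothetical endpoint
    params={
        "url": "https://example.com/page-to-index",  # the page you want fetched
        "render_js": "true",                         # ask for a headless-browser render
        "use_proxy": "true",                         # rotate proxies to dodge blocks
    },
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=30,
)
response.raise_for_status()
print(response.json().get("html", "")[:500])         # first 500 characters of the payload
```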

There’s not much info about this API as of March 23, 2021, but this type of service beats the pants off trying to cook up our content acquisition scripts in 1993 for The Point (Top 5% of the Internet). You remember that, don’t you?

Of course, thumbtypers will say, “Okay, boomer, what’s up with that ancient history?”

Sigh.

Stephen E Arnold, March 31, 2021

High Tech Tension: Sparks Visible, Escalation Likely

March 25, 2021

I read Google’s “Our Ongoing Commitment to Supporting Journalism.” The write up is interesting because it seems to be a dig at a couple of other technology giants. The bone of contention is news, specifically, indexing and displaying it.

The write up begins with a remarkable statement:

Google has always been committed to providing high-quality and relevant information, and to supporting the news publishers who help create it.

This is a sentence pregnant with baby Googzillas. Note the word “always.” I am not certain that Google is in the “always” business nor am I sure that the company had much commitment. As I recall, when Google News went live, it created some modest conversation. Then Google News was fenced out of the nuclear ad machinery. Over time, Google negotiated and kept on doing what feisty, mom and pop Silicon Valley companies do; namely, keep doing what they want and then ask for forgiveness.

Flash forward to Australia. That country wanted to get money in exchange for Australian news. Google made some growling noises, but in the end the company agreed to pay some money. Facebook, on the other hand, resisted, turned off its service, and returned to the Australian negotiating table.

Where was Microsoft in this technical square dance?

Microsoft was a cheerleader for the forces of truth, justice, and the Microsoft way. This Google blog post strikes me as Google’s reminding Microsoft that Google wants to be the new Microsoft. Microsoft has not done itself any favors because the battle lines between these two giants are swathed in the cloud of business war.

Google has mobile devices. Microsoft has the enterprise. Google has the Chromebook. Microsoft has the Surface. And on it goes.

Now Microsoft is on the ropes: SolarWinds, the Exchange glitch, and wonky updates which have required the invention of KIR (an update to remove bad updates). Microsoft may be a JEDI warrior with the feature-burdened Teams and the military’s go-to software, PowerPoint. Google knows that every bump and scrape slows the reflexes of the Redmond giant.

Both mom and pop outfits are looking after their own self-interest. Fancy words and big ideas are window dressing.

Stephen E Arnold, March 25, 2021

Historical Revisionism: Twitter and Wikipedia

March 24, 2021

I wish I could recall the name of the slow talking wild-eyed professor who lectured about Mr. Stalin’s desire to have the history of the Soviet Union modified. The tendency was evident early in his career. Ioseb Besarionis dze Jughashvili became Stalin, so fiddling with received wisdom verified by Ivory Tower types should come as no surprise.

Now we have Google and the right to be forgotten. As awkward as deleting pointers to content may be, digital information invites “reeducation”.

I learned in “Twitter to Appoint Representative to Turkey” that the extremely positive social media outfit will interact with the country’s government. The idea is to make sure content is just A-Okay. Changing tweets for money is a pretty good idea. Even better is coordinating the filtering of information with a nation state. But Apple and China seem to be finding a path forward. Maybe Apple in Russia will be a similar success.

A much more interesting approach to shaping reality is alleged in “Non-English Editions of Wikipedia Have a Misinformation Problem.” Wikipedia has a stellar track record of providing fact-rich, neutral information, I believe. This “real news” story states:

The misinformation on Wikipedia reflects something larger going on in Japanese society. These WWII-era war crimes continue to affect Japan’s relationships with its neighbors. In recent years, as Japan has seen an increase in the rise of nationalism, then-Prime Minister Shinzo Abe argued that there was no evidence of Japanese government coercion in the comfort women system, while others tried to claim the Nanjing Massacre never happened.

I am interested in these examples because each provides some color to one of my information “laws”. I have dubbed these “Arnold’s Precepts of Online Information.” Here’s the specific law which provides a shade tree for these examples:

Online information invites revisionism.

Stated another way, when “facts” are online, they are malleable, shapeable, and subjective.

When one runs a query on swisscows.com and then the same query on bing.com, ask:

Are these services indexing the same content?

The answer for me is, “No.” Filters, decisions about what to index, and update calendars shape the reality depicted online. Primary sources are a fine idea, but when those sources are shaped as well, what does one do?

The answer is like one of those Borges stories. Deleting and shaping content is more environmentally friendly than burning written records. A Python script works with less smoke.

Stephen E Arnold, March 24, 2021

IBM Watson: Learn How to Build a Recommendation Engine with Watson NLP

February 17, 2021

I came across this IBM free lesson: “Build a Recommendation Engine with Watson Natural Language Understanding.”

The preliminary set up, according to the write up, takes about an hour. Once that hour has been invested, the IBM Watson Knowledge Studio service will allow you to whip up your own recommendation engine. Plus, with Watson, the system will understand what humans write.

What are the preliminary steps? No big deal. Get an IBM cloud account, then navigate to the IBM Cloud console. Pick a pricing plan. (Just choose “free”; otherwise only the lesson is free, not building the recommendation solution, you silly goose.) Then follow the steps for provisioning a Watson Knowledge Studio instance. Choose “free” again.

Next you have an opportunity to work through six additional steps:

  1. Define entity types and subtypes
  2. Create “Relation Types”
  3. Collect documents that describe your domain language
  4. Annotate Documents
  5. Generate a Machine Learning Model
  6. Deploy model to Natural Language Understanding service.

The system seems to enjoy documents which are no larger than 2,000 words, preferably smaller. And the documents must be in ASCII, PDF, DOC, or HTML. The IBM information says Zip files are supported, but zip files can contain non-text objects and long text documents. (That’s why people zip long text files, right?) The student can also upload documents in the UIMA CAS XMI format. If you are not familiar with this file format, you can get oriented by looking at documents like this.

Once you have worked through steps one through five (obviously without making an error), you will need your Natural Language Understanding API key, which, per the write up, “can be found by navigating to your Watson Natural Language Understanding instance page and looking in the Credentials section.”
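Wiring those credentials and the deployed model into a script looks roughly like the sketch below. This is my illustration, assuming the ibm-watson Python SDK; the API key, service URL, and custom model ID are placeholders, not values from the lesson.

```python
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_watson.natural_language_understanding_v1 import EntitiesOptions, Features
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

# Placeholders: substitute the credentials shown on your instance page.
authenticator = IAMAuthenticator("YOUR_NLU_API_KEY")
nlu = NaturalLanguageUnderstandingV1(version="2021-03-25", authenticator=authenticator)
nlu.set_service_url("YOUR_NLU_SERVICE_URL")

# Analyze a support ticket with the custom entity model deployed from Knowledge Studio.
result = nlu.analyze(
    text="The conveyor controller throws error E42 after the latest firmware update.",
    features=Features(entities=EntitiesOptions(model="YOUR_CUSTOM_MODEL_ID", limit=10)),
).get_result()

for entity in result.get("entities", []):
    print(entity["type"], "->", entity["text"])
```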

No problem.

But what if the customer support system relies on voice? What if the customer is asked to upload a screenshot or a file containing data displayed when a fault occurs? What if the customer has paid for “premier” support which features a Zoom session? What if the person who wants to learn about the Watson recommendation engine works for a small trucking company?

Good questions. You may want to set aside some time to work through steps one through five which encapsulate some specialized college courses and hands-on experience with smart software, search, indexing, etc.

Perhaps hiring an IBM partner to set up the system and walk you through its quirks and features is a more practical solution.

On the other hand, check out Amazon’s off the shelf machine learning systems.

Stephen E Arnold, February 17, 2021

Google and Broad Match

February 11, 2021

I read “Google Is Moving on From Broad Match Modifier.” The essay’s angle is search engine optimization; that is, spoofing Google’s now diluted relevance methods. The write up says:

Google says it has been getting better at learning the intent behind a query, and is therefore more confident it can correctly map advertisements to queries. As that ability improves, the differences between Phrase Match and Broad Match Modified diminishes. Moving forward, there will be three match types, each with specific benefits:

  • Exact match: for precision
  • Broad match: for reach
  • Phrase match: in Google’s words, to combine the best of both.

Let’s assume that these are the reasons. Exact match delivers precision. Broad match casts a wide net. No thumbtyper wants a null set. Obviously there is zero information in a null set in the mind of the GenXers and Millennials, right? The phrase match is supposed to combine precision and recall. Oh, my goodness, precision and recall. What happened to cause the Google to reach into the deep history of STAIRS III and RECON for this notion?

Google hasn’t and won’t.

The missing factor in the write up’s analysis is answering the question, “When will each of the three approaches be used, under what conditions, and what happens if the bus drives to the wrong city?” (This bus analogy is my happy way of expressing the idea that Google search results often have little to do with either the words in the user’s query or the “intent” of the user, allegedly determined by Google’s knowledge of each user and the magic of more than 100 “factors” for determining what to present.)

The key is the word “reach.” Changes to Google’s methods are, from my point of view, designed to accomplish one thing: burn through ad inventory.

By killing off functioning Boolean, deprecating search operators, ignoring meaningful time indexing, and tossing disambiguation into the wind blowing a Google volleyball into Shoreline traffic — the company’s core search methods have been shaped to produce money.

SEO experts don’t like this viewpoint. Google doesn’t care as long as the money keeps flowing. With Google investing less in infrastructure and facing significant pressure from government investigators and outfits like Amazon and Facebook, re-explaining search boils down to showing content which transports ads.

Where’s that leave the SEO experts? Answer: ad sales reps for the Google. Traffic comes to advertisers. But the big bucks are in the big advertisers’ campaigns, which expose a message to as many eyeballs as possible. That’s why “broad reach” is the fox in the relevance hen house.

Stephen E Arnold, February 11, 2021

SEO Semantics and the Vibrant Vivid Vees

January 29, 2021

Years ago, one of the executives at Vivisimo, which was acquired by IBM, told me about the three Vees. These were the Vees of Vivisimo’s metasearch system. The individual, who shall remain nameless, whispered: Volume, Velocity, and Variety. He smiled enigmatically. In a short time, the three Vees were popping up in the context of machine learning, artificial intelligence, and content discovery.

The three Vivisimo Vees seem to capture the magic and mystery of digital data flows. I am not on that wheezing bus in Havana.

Volume is indeed a characteristic of online information. Even if one has a trickle of Word documents to review each day, the individual reading, editing, and commenting on a report has a sense that there are more Word documents flying around than the handful in this morning’s email. But in the context of our datasphere, no one knows how much digital data exist, what it contains, who has access, etc. Volume is a fundamental characteristic of today’s datasphere. The only way to contain data is to pull the plug. That is not going to happen unless there is something larger than Google. Maybe a massive cyber attack?

The second Vee is variety. From the point of view of the Vivisimo person, variety referred to the content that a text-centric system processed. Text, unlike a tidy database file, is usually a mess. Because text lacks structure, extract, transform, and load outfits have been working for decades to convert the messy into the orderly, or at least to pull out certain chunks so that one can extract key words, dates, and named entities with reasonable accuracy. Today there is a lot of variety; however, for every new variant old ones become irrelevant. At best, the variety challenge is like a person in a raft trying to paddle to keep from being swamped with intentional and unintentional content types. How about those encrypted messages? Another hurdle for the indexing outfit: Decryption, metadata extraction and assignment, and processing throughput. So the variety Vee is handled by focusing on a subset of content. Too bad for those who think that “all” information is online.

The third Vee, velocity, is a fave among the real time crowd. The idea is that streams and flows of data in real time can be processed on the fly, patterns identified, advanced analytics applied, and high value data emitted. This notion is a good one when working in a print shop in the 17th century. Those workflows don’t make any sense when figuring out the stream of data produced by an unidentified drone which may be weaponized. Furthermore, if a monitoring device notes a several millisecond pattern before a person’s heart attack, that’s not too helpful when the afflicted individual falls over dead a second later. What is “real time”? Answer: There are many types, so the fix is to focus, narrow, winnow, and go for a high probability signal. Sometimes it works; sometimes it doesn’t.

The three Vees are a clever and memorable marketing play. A company can explain how its system manages each of these issues for a particular customer use case. The one size fit all idea is not what generates information processing revenues. Service fees, subscriptions, and customization are the money spinners.

The write up “The Four V’s of Semantic Search” adds another Vee to the Vivisimo three: Veracity. I don’t want to argue “truth” because in the datasphere for every factoid on one side of an argument, even a Bing search can generate counter examples. What’s interesting is that this veracity Vee is presented as part of search engine optimization using semantic techniques. Here’s a segment I circled:

The fourth V is about how accurate the information is that you share, which speaks about your expertise in the given subject and to your honesty. Google cares about whether the information you share is true or not and real or not, because this is what Googles [sic] audience cares about. That’s why you won’t usually get search results that point to the fake news sites.

Got that? Marketing hoo hah, sloganeering, and word candy — just like the three Vivisimo Vees.

Stephen E Arnold, January 29, 2021

SEO Trends for the New Year

January 13, 2021

Here at Beyond Search, we are quite skeptical of search engine optimization. In fact, our toothless leader says, “SEO is a way to get struggling search engine optimization experts to become Google ad sales reps.” True or false, that’s what the old goose says.

Here are the latest pearls of wisdom from the search engine optimization world. Hackernoon discusses the “Top SEO Trends Every Digital Marketer Should Know in 2021.” Writer Muhammad Waleed Siddiqui specifies four factors to be aware of: the importance of mobile-first indexing, the impact of Google’s EAT algorithm, the changes brought by voice search, and the ascension of search intent over keywords. We actually welcome point number two, which ideally will result in more good information and less junk online. Siddiqui writes:

“The E.A.T. word stands for Expertise, Authority, and Trust. The concept itself covers the idea of a copywriter/writer being an original and proven professional in the industry he or she works in. Google will only accept that publicly available content that has no negative impact on people’s lives. With E.A.T. implemented in 2021, Google will genuinely look for quality and authentic content. One best way to succeed with this E.A.T. algorithm is to get good reviews from the customers and the community. If Google finds positive feedback about your website and services, your business will be considered an expert that authorized users can trust. For the SEO geeks, having high quality backlinks will help in this case. Pro-tip: If you can disavow all the bad or suspicious backlinks, it will help build up your E.A.T. score to Google.”

Anything that discourages dummy articles is a step in the right direction, we believe. Navigate to the article for details on the move away from desktop toward mobile, the difference between voiced and typed searches, and searcher intent vs. keywords.

Cynthia Murrell, January 13, 2021
