Semantic Search for arXiv Papers

January 12, 2023

An artificial intelligence research engineer named Tom Tumiel (InstaDeep) created a Web site called arXivxplorer.com.

According to his Twitter message (posted on January 7, 2023), the system is a “semantic search engine.” The service uses OpenAI’s embedding model. The idea is that this search method allows a user to “find the most relevant papers.” There is a stream of tweets at this link about the service. Mr. Tumiel states:

I’ve even discovered a few interesting papers I hadn’t seen before using traditional search tools like Google or arXiv’s own search function or even from the ML twitter hive mind… One can search for similar or “more like this” papers by “pasting the arXiv url directly” in the search box or “click the More Like This” button.
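For readers who want to see what “semantic search” and “More Like This” might mean in practice, here is a minimal sketch. It is not Mr. Tumiel’s code. The three-number vectors, the paper titles, and the function names `cosine`, `search`, and `more_like_this` are all made up for illustration; a real embedding model returns much longer vectors, and how arXiv Xplorer ranks results is not stated in the tweets. The sketch only shows the general technique of ranking documents by cosine similarity to a query vector.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical toy "embeddings"; real ones come from an embedding model
PAPERS = {
    "Eigenvector centrality for ranking": [0.9, 0.1, 0.0],
    "PageRank revisited": [0.8, 0.2, 0.1],
    "Protein folding with transformers": [0.0, 0.1, 0.9],
}

def search(query_vec, papers=PAPERS, top_k=2):
    """Return the top_k titles whose vectors are closest to the query vector."""
    ranked = sorted(papers, key=lambda t: cosine(query_vec, papers[t]), reverse=True)
    return ranked[:top_k]

def more_like_this(title, papers=PAPERS, top_k=1):
    """Use a paper's own vector as the query, excluding the paper itself."""
    others = {t: v for t, v in papers.items() if t != title}
    return search(papers[title], others, top_k)
```

In this toy setup, “More Like This” is the same search with a paper’s vector standing in for the query, which may be why pasting an arXiv URL into the search box works the way the tweet describes.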

I ran several test queries, including this one: “Google Eigenvector.” The system surfaced generally useful papers, including one from January 2022. However, when I included the date 2023 in the search string, arXiv Xplorer did not return a null set. The system displayed hits which did not include the date.

Several quick observations:

  1. The system seems to be “time blind,” which is a common feature of modern search systems
  2. The system provides the abstract when one clicks on a link. The “view” button in the pop up displays the PDF
  3. Comparing result sets shows that a query with additional search terms returns a smaller result set, a refreshing change from queries which display “infinite scrolling” of irrelevant documents.
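The “time blind” behavior in the first observation could, in principle, be addressed with a date filter applied after the similarity ranking. The sketch below is an assumption about how such a control might look, not a description of arXiv Xplorer. The record fields (`title`, `published`, `score`) and the function name `filter_by_year` are hypothetical.

```python
from datetime import date

# Hypothetical ranked hits; the field names are illustrative only
RESULTS = [
    {"title": "Paper A", "published": date(2022, 1, 14), "score": 0.91},
    {"title": "Paper B", "published": date(2023, 1, 3), "score": 0.84},
    {"title": "Paper C", "published": date(2021, 6, 30), "score": 0.80},
]

def filter_by_year(results, year):
    """Keep only hits published in the given year, preserving rank order."""
    return [r for r in results if r["published"].year == year]
```

With a filter like this, a query limited to a year with no matching papers returns an empty list, which is the null set I expected to see.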

For those interested in academic or research papers, will OpenAI become aware of the value of dates, limiting queries to endnotes, and displaying a relationship map among topics or authors in a manner similar to Maltego? By combining more search controls with the OpenAI content and query processing, the service might leapfrog the Lucene/Solr type methods. I think that would be a good thing.

Will the implementation of this system add to Google’s search anxiety? My hunch is that Google is not sure what causes the Google system to perturbate. It may well be that the twitching, the sudden changes in direction, and the coverage of OpenAI itself in blogs may be the equivalent of tremors, soft speaking, and managerial dizziness. Oh, my, that sounds serious.

Stephen E Arnold, January 12, 2023

The EU Has the Google in Targeting Range for 2023

January 10, 2023

Unlike the United States, the European Union does not allow Google to collect user data. The EU has passed several laws to protect its citizens’ privacy; however, Google can still deploy tools like Google Analytics with stipulations. Tutanota explains how Google operates inside the EU laws in “Is Google Analytics Illegal In The EU? Yes And No, But Mostly Yes.”

Max Schrems is a lawyer who successfully sued Facebook for violating the privacy of Europeans. He won again, this time against Google. France and Austria decided that Google Analytics is illegal to use in Europe, but Denmark’s and Norway’s data protection authorities developed legally compliant ways to use the analytics service.

Organizations were using Google Analytics to collect user information, but that violated Europeans’ privacy rights because it exposed them to American surveillance. The tech industry did not listen to the ruling, so Schrems sued:

“However, the Silicon Valley tech industry largely ignored the ruling. This has now led to the ruling that Google Analytics is banned in Europe. NOYB says:

‘While this (=invalidation of Privacy Shield) sent shock waves through the tech industry, US providers and EU data exporters have largely ignored the case. Just like Microsoft, Facebook or Amazon, Google has relied on so-called ‘standard Contract Clauses’ to continue data transfers and calm its European business partners.’

Now, the Austrian Data Protection Authority strikes the same chord as the European court when declaring Privacy Shield as invalid: It has decided that the use of Google Analytics is illegal as it violates the General Data Protection Regulation (GDPR). Google is ‘subject to surveillance by US intelligence services and can be ordered to disclose data of European citizens to them’. Therefore, the data of European citizens may not be transferred across the Atlantic.”

There are alternatives to Google services such as Gmail and Google Analytics, and these alternatives are based in Europe, Canada, and the United States. This appears to be one more example of the EU lining up financial missiles to strike the Google.

Whitney Grace, January 10, 2023

Thinking about Google in 2023: Hopefully Not Like Stuff Does

January 5, 2023

I have not been thinking about Google per se. I do think about [a] its management methods (Hello, Dr. Timnit Gebru), [b] its attempt to solve death, [c] the Googlers’ love of American basketball March Madness, [d] assorted semi-quiet settlements for alleged line-crossing activities, [e] the fish bowl culture with some species of fish decidedly further up the Great Chain of Googley Creatures, [f] the efforts to control costs using methods that are mostly invisible like possibly indexing less, embracing the snorkel view of the fish bowl, and abandoning the quaint notions of precision and recall, and [g] efforts to craft remarkable explanations for why the firm’s quantum supremacy and smart software have been on the receiving end of ChatGPT supersonic fly-bys.

The Stuff article “What to Expect from Google in 2023” takes a different approach; specifically, the article highlights a folding Pixel phone (er, hasn’t this been accomplished already?), gaming Chromebooks (er, what about the Stadia money pit, service termination, and refunds?), a Pixel tablet (er, another one-trick limping pony in the mobile device race?), a Pixel watch (er, Apple are you prepared?), Android refresh (er, how about that Android fragmentation?), and a reference to Google’s penchant for killing services. Do you remember Dodgeball or Waze?

My concern with this type of Google in 2023 article is that it misses the major challenges Google faces. I am not sure Google is aware of the challenges it faces. Life in a fish bowl is good until it isn’t. Not even a snappier snorkel will help. And if the water is fouled, what will the rank ordered fish do?

I know. I know. Solve death.

Stephen E Arnold, January 6, 2023

Google and Its View of Copy and Paste: Not Okay, No, No, No!

January 4, 2023

Another day, another hoot. Today (January 4, 2023) I read a “real” news story from the trust outfit Thomson Reuters titled “Google Alleges India Antitrust Body Copied Parts of EU Order on Android Abuse.” Yes, that’s the title. Google. Copying. India. Abuse.

I ran through my mind a few instances of allegations of the Google doing the copying. First, there was the online advertising dust up. My belief is that most people are not aware that Google paid Yahoo to make a dispute about online advertising technology go away. This was in 2004, and the Saul Hansell (who?) story is online at this link. To make a long story short, for me the deal allowed the Google to become an alleged monopoly in online advertising. It also made clear to me that innovation at Google meant copying. Interesting? I think so.

Then there were the hassles with newspapers and publishers about Google News. Wikipedia has a summary of the jousting. You can find the “Controversies with Publishers” thumbnail at this link. I would summarize the history of Google News this way: Others create timely information and Google copies it. Google emphasizes its service to users; publishers talk about copying without payment. The dismal copy paste drama began in 2002 and continues to this day.

I would be remiss if I did not mention Google’s scanning of books. I think of book scanning as similar to my photocopying a journal article when I was in college. I preferred to mark up the copy and create my University of Chicago style manual approved footnotes sitting in a cheap donut shop miles from the university library. After a decade of insisting that copying books was okay, the courts agreed. Google could copy. How are those clicks on Google Books and Google Scholar going in 2023? You can read about this copying decision in “After 10 Years, Google Books Is Legal.”

Copying is good, true, high value, and important to users and obviously to the Google.

Now what did the Reuters’ article tell me today? Let’s take a look:

Google has told a tribunal in India that the country’s antitrust investigators copied parts of a European ruling against the U.S. firm for abusing the market dominance of its Android operating system, arguing the decision be quashed, legal papers show.

Google is objecting to a nation state’s use of legal language copied from a European Union document.

Yep, copied.

Does Google care about copying and the role it has played at Google? In my opinion, no. What Google cares about is the rising tide of litigation and the deafening sound of cash registers ringing as a result of Google’s behavior.

Yep, copying. That’s a hoot. How does Google think laws, regulations, and bills are made? In my experience, it’s control C and control V.

Stephen E Arnold, January 4, 2023

Google Results Are Relevant… to Google and the Googley

January 3, 2023

We know that NoNeedforGPS will not be joining Prabhakar Raghavan (Google’s alleged head of search) and the many Googlers repurposed to deal with a threat, a real threat. That existential demon is ChatGPT. Dr. Raghavan (formerly of the estimable Verity which was absorbed into the even more estimable Autonomy which is a terra incognita unto itself) is getting quite a bit of Google guidance, help, support, and New Year cheer from those Googlers thrown into a Soviet style project to make that existential threat go away.

NoNeedforGPS questioned on Reddit.com the relevance of Google’s ad-supported sort of Web search engine. The plaintive cry in the post, an image which is essentially impossible to read, says:

Why does Google show results that have nothing to do with what is searched?

You silly goose, NoNeedforGPS. You fail to understand the purpose of Google search, and you obviously are not privy to discussions by search wizards who embrace a noble concept: It is better to return a result than a null result. A footnote to this brilliant insight is that a null result (that is, a search results page which says, “Sorry, no matches for your query”) makes it tough to match ads and convince the lucky advertiser on a blank page that a null result conveys information.

What? A null result conveys information! Are you crazy there in rural Kentucky with snow piled to a height of four French bulldogs standing atop one another?

No, I don’t think I am crazy, which is a negative word, according to some experts at Stanford University.

When I run a query like “Flokinet climate activist”, I really want to see a null result set. My hunch is that some folks in Eastern Europe want me to see an empty set as well.

Let me put the display of irrelevant “hits” in response to a query in context:

  1. With a result set — relevant or irrelevant is irrelevant — Google’s super duper ad matcher can do its magic. Once an ad is displayed (even in a list of irrelevant results to the user), some users click on the ads. In fact, some users cannot tell the difference between a relevant hit and an ad. Whatever the reason for the click, Google gets money.
  2. Many users who run a query don’t know what they are looking for. Here’s an example: A person searches Google for a Greek restaurant. Google knows that there is no Greek restaurant anywhere near the location of the Google user. Therefore, the system displays results for restaurants close to the user. Google may toss in ads for Greek groceries, sponges from Greece, or a Greek history museum near Dunedin, Florida. Google figures one of these “hits” might solve the user’s problem and result in a click that is related to an ad. Thus, there are no irrelevant results when viewed from Google’s UX (user experience) viewpoint via the crystal lenses of Ad Words, SEO partner teams, or a Googler who has his/her/its finger on the scale of Google objectivity.
  3. The quaint notions of precision and recall have been lost in the mists of time. My hunch is that those who remember that a user often enters a word or phrase in the hopes of getting relevant information related to that which was typed into the query processor are not interested in old fashioned lists of relevant content. The basic reason is that Google gave up on relevance around 2006, and the company has been pursuing money, high school science projects like solving death, and trying to manage the chaos resulting from a management approach best described as anti-suit and pro fun. The fact that Google sort of works is amazing to me.
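For anyone who has forgotten what the quaint notions measure, here is a short reminder. Precision is the share of returned results that are relevant; recall is the share of relevant documents that were returned. The function name `precision_recall` and the document identifiers in the usage note are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a result list against a set of relevant items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Guard against empty inputs so a null result set does not divide by zero
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

If a system returns four documents (d1 through d4) and only d1 is among the two relevant documents (d1 and d5), precision is one in four and recall is one in two. A page full of irrelevant hits drags precision toward zero, no matter how many ads are matched to it.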

The sad reality is that Google handles more than 90 percent of the online searches in North America. Years ago I learned that in Denmark, Google handles 100 percent of the online search traffic. Dr. Raghavan can lash hundreds of Googlers to the ChatGPT response meetings, but change may be difficult. Google believes that its approach to smart software is just better. Google has technology that is smarter, more adept at creating college admission essays, and blog posts like this one. Google can do biology, quantum computing, and write marketing copy while wearing a Snorkel and letting code do deep dives.

Net net: NoNeedforGPS does express a viewpoint which is causing people who think they are “expert searchers” to try out DuckDuckGo, You.com, and even the Russian service Yandex.com, among others. Thus, Google is scared. Those looking for information may find a system using ChatGPT returns results that are useful. Once users mired in irrelevant results realize that they have been operating in the dark, a new dawn may emerge. That’s Dr. Raghavan’s problem, and it may prove to be easier to impress those at a high school reunion than advertisers.

Stephen E Arnold, January 3, 2023

Are Facebook and Google Monopolies: Nope, Shrinking Share of Online Ads. Proof!

December 29, 2022

I read an interesting article, but I have my doubts about the numbers. The story is from one of the “last person standing” in the Silicon Valley real news datasphere. In the last month or so, the tone of write ups about two of America’s most lovable and well managed companies has turned south, well, maybe south by southwest.

“Share of US Digital Ad Spend, by Company Type” reports:

Google and Meta will together capture 48.4% of all U.S. digital ad revenue this year (28.8% for Google and 19.6% for Meta), down from 54.7% at their peak in 2017 (34.7% for Google and 20.0% for Meta), per data from Insider Intelligence.

And what about the lovable Bezos bulldozer driven pedal to the metal by Andy Jassy? The article states:

  • By far, the biggest threat to their collective ad dominance is Amazon, which has grown its ad business to over $30 billion dollars annually.
  • By 2024, Amazon is expected to capture 12.7% of all U.S. digital ad dollars, while Meta is expected to capture 17.9%.

TikTok is no big whoop. I suppose that’s why the tech giants are becoming pretzels in their effort to create short form content.

Several observations:

  1. I am not sure how these data were gathered or which methods were used. Presenting such remarkable precision as 54.7 percent in a prediction is an indication that someone did not pay attention in Statistics 101.
  2. Amazon’s ad data are more interesting when the firm’s ad revenue in 2018 is plotted against its ad revenue in 2021. That’s a slope!
  3. Blowing off TikTok is problematic. Does the data consider influencers who accept some type of compensation in return for merchandise, trips, or some other fungible asset like a super duper hair curling device?

To sum up: I am not prepared to label those wonderful wizards at Facebook and Google as crew on a doomed steamship named MY Failure.

Stephen E Arnold, December 2022

Google: Rank Ordering Its Wizards, Shamen, and Necromancers

December 27, 2022

Okay, six percent of the magic workers are not sufficiently Google. The figure does not count Timnit Gebru types.

Google is not afraid to fire anyone who ignites controversy within the company related to diversity and women. Sometimes it is not bad press that causes Google to lay off its employees; instead, it is the economy. The Daily Hunt reports that “Google Asked Managers To Fire 10,000 ‘Poor Performers’ As Mass Layoffs Hit Tech Sector.”

The US federal government is raising interest rates, and tech companies that make a large portion of their profits from ads are feeling the pain. Meta, Google, Amazon, Twitter, and more companies are firing workers. Alphabet is telling its managers to lay off all employees who are rated as “poor performers.” The hope is to get rid of at least 10,000 workers, and there might be some subterfuge behind it:

“As per a report from Forbes, Google might even bank on these rankings to avoid paying bonuses and stock grants. Google’s managers have been reportedly asked to categorize 10,000 employees as “poor performers” so that 10,000 people can be fired. Alphabet has a total workforce of 187,000 people, which is one of the largest workforces in tech.”

Google’s workforce is described as bloated, and the company pays its employees 70% more than Microsoft compensates its staff, or 153% compared to the top twenty big tech companies. Google pays more than its competition to hoard talent and increase its stranglehold on the tech industry.

Googzilla has to pay for NFL football any way it can.

Whitney Grace, December 27, 2022

Can Google React to a Code Red? Yeah, Sure, Jumping Right on It

December 22, 2022

The New York Times, The Guardian, and even the relentlessly innovative Business Insider have embraced the idea of Code Red. What is a Code Red? If you spent time at a cyber security conference a few years ago, Code Red was a snazzy name for a computer worm. Have you spent quality time in a hospital in the US, preferably in a smaller town? If so, you may recall hearing “Code Red.” The idea was to alert the motivated, enthusiastic, and empathetic professionals that there was a barn burner of a fire raging around oxygen tanks adjacent to intensive care, operating theaters, or recovery rooms. The term could also refer to bad weather, a billing opportunity’s arrival (aka patient), or something really bad happening like a grain silo explosion in Canton, Illinois, in which local farmers were blasted, burned, or gassed. (Yep, grain dust does go bang.) Code Red to some US Department of Defense types, or at least to some US Marines, means that the weather is more bad than the previous day’s weather. However, to some trained at Quantico, the term only suggests that the weather will be worse than it was yesterday.

For the “real news” professionals, the idea is that Code Red means emergency. Examples appear in a number of articles like this one: “Google’s Management Has Reportedly Issued a Code Red amid the Rising Popularity of the ChatGPT AI.” The idea is that Google’s estimated 90 percent share of the US and Western European online search market is now in jeopardy. You judge.

Here’s a passage from the write up:

Sundar Pichai, the CEO of Google’s parent company, Alphabet, participated in several meetings around Google’s AI strategy and has directed numerous groups in the company to refocus their efforts on addressing the threat that ChatGPT poses on its search engine business…In particular, teams in Google’s research, Trust and Safety division among other departments have been directed to switch gears to assist in the development and launch of new AI prototypes and products, the Times reported. Some employees have even been tasked to build AI products that generate art and graphics similar to OpenAI’s DALL-E used by millions of people…

Okay, meetings in the midst of holiday season. Perilously close to New Year’s festivities. Google Meet sessions with dogs barking or significant others saying, “Will you get off that call? Right now!”

The idea is that Google is going to face a challenge, maybe an existential threat! Google has to react immediately. Another grain silo will explode. That boom? Yeah, the emergency room oxygen tanks exploded. No one knows how many were injured or even killed. Horrible. More staff shortages! The sky is falling because our billing stream is blocked. Double Code Red!

Smash cut.

This image represents Google, courtesy of the free but legally ambiguous Craiyon.com:

[Image: fish in a fish bowl]

Really original artwork courtesy of https://www.craiyon.com/

Yes, a fish bowl, not a frog. The fish takes the world’s data. I have heard that some pet fish watch television when an influencer is streaming to the big flat panel in the spacious 300 square foot apartment in Florham Park, New Jersey. This metaphorical fish is master of its universe; however, neither the leaking Russian ISS space capsule nor the flaws of companies with “seeing stones” are on its radar.

If we were to slowly heat the water in this fish’s bowl, our fish may discover too late that fleeing, transforming, or getting on a flight to Argentina are low percentage options. (Kiddies, please, do not test this theory and torture a fish unless you are a PhD student eager to work on live animal testing in a lab near Palo Alto.)

The key point is that until death has its paws on our fish, frantic action does not take place. Nothing stops the grim reaper from having a boiled fish appetizer.

May I share some of my unpopular, historically ignored observations about the Google? Oh, you say, “No.” Tough luck. Here I go:

  1. The Google of today understands its environment within its fish bowl. As with the fish, comprehension of the wider world is, if not impossible, distorted due to the nature of the boundary between the watery world and the bigger outside world. Changing a world view ain’t gonna happen. Why? Business process momentum, perceptual acuity, and Googley thinking keep the systems doing what they do: Selling ads.
  2. Google engineers truly believe that their technology is THE BEST THING EVER. Keep in mind that invention can come via acquisition, unauthorized borrowing, or a late night Backrub discussion in a Stanford dorm. Today’s Google has substituted hard cash derived from monetizing user clicks for reasonably useful search of textual Web content. Executive compensation translates to “If it ain’t broke, don’t fix it!”
  3. Google is chugging along, uncertain that the bright light some Googlers have noticed is a stream from Nadine Breaty or a fire in the room housing the fish bowl. From the fish’s perspective, there are no big problems in the fish bowl. Pay attention but carry on. Signals carry noise, so dig out the meaningful signal. Verify. Plan. Test. (Ooops. The fish is now dead. Bad. So sudden. It was a nice fish before it went Madison Avenue, of course.)

The chatter about ChatGPT is interesting to me. The technology is interesting, and its performance is getting useful tweaks. Use cases are emerging. Worriers are letting their worry gene influence their thinking. Entrepreneurs are entrepreneuring because getting rich quick on open source software may be a better idea than applying to be a carpetland dweller at the Twitter thing. Smart software will put lawyers, journalists, and — gasp! — blue chip consultants out of work.

But what’s this suggest about Google?

Keep in mind that I dubbed Google Googzilla in 2003. Big, ferocious, an icon of rapaciousness. True then and truer now. But big reptiles share a common characteristic with gold fish. Trapped in one ecosystem, the creatures don’t know what’s happening until it is too late. Then freneticism marks the onset of death. What’s a frightened, crazed Googzilla like?

We’re not there yet. I think of Google’s Code Red as the first stage of Google’s way of dying. I told you: Unpopular. Nothing new.

The Five Stages of Grief makes clear that Google is just now working through the denial stage. Next up is anger. Then deal making. Depression sweeps through the company. And finally (finally!) staff accept that the run of behavior without consequences has drawn to a close. Elisabeth Kubler Ross and David Kessler left out the final stage, which is the stuff of popular songs like “Memories.” Tip: Newly minted OSINT experts, move beyond Google.

Stephen E Arnold, December 22, 2022

Google to Microsoft: We Are Trying to Be Helpful

December 16, 2022

Ah, those fun loving alleged monopolies are in the news again. Microsoft — famous in some circles for its interesting approach to security issues — allegedly has an Internet Explorer security problem. Wait! I thought the whole wide world was using Microsoft Edge, the new and improved solution to Web access.

According to “CVE-2022-41128: Type Confusion in Internet Explorer’s JScript9 Engine,” Internet Explorer, after decades of continuous improvement and its replacement, has a security vulnerability. Are you still using Internet Explorer? The answer may be, “Sure you are.”

With Internet Explorer following Bob down the trail of Microsoft’s most impressive software, the Redmond crowd’s Microsoft Office application still uses bits and pieces of Internet Explorer. Thrilling, right?

Google explains the Microsoft issue this way:

The JIT compiler generates code that will perform a type check on the variable q at the entry of the boom function. The JIT compiler wrongly assumes the type will not change throughout the rest of the function. This assumption is broken when q is changed from d (an Int32Array) to e (an Object). When executing q[0] = 0x42424242, the compiled code still thinks it is dealing with the previous Int32Array and uses the corresponding offsets. In reality, it is writing to wherever e.e points to in the case of a 32-bit process or e.d in the case of a 64-bit process. Based on the patch, the bug seems to lie within a flawed check in GlobOpt::OptArraySrc, one of the optimization phases. GlobOpt::OptArraySrc calls ShouldExpectConventionalArrayIndexValue and based on its return value will (in some cases wrongly) skip some code.

Got that.
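For those who did not get that, the general idea of a stale type check can be modeled with a toy. The sketch below is not JScript9 code, not the proof of concept from Google’s write up, and not an exploit. The `Value` class, `raw_store`, `boom_stale_check`, and `boom_rechecked` are hypothetical names invented here. The only assumption carried over from the quoted passage is the sequence of events: the type is checked once at function entry, the variable is then pointed at a different kind of value, and a write proceeds as if nothing changed.

```python
class Value:
    """Toy stand-in for a runtime value: a type tag plus raw storage slots."""
    def __init__(self, kind, slots):
        self.kind = kind    # "Int32Array" or "Object"
        self.slots = slots  # array elements, or an object's property slots

def raw_store(value, index, number):
    """Unchecked write, standing in for compiled code that trusts an earlier check."""
    value.slots[index] = number

def boom_stale_check(d, e):
    q = d
    if q.kind != "Int32Array":   # type check performed once, at function entry
        raise TypeError("expected Int32Array")
    q = e                        # q now refers to a plain Object
    raw_store(q, 0, 0x42424242)  # stale assumption: q is still an Int32Array

def boom_rechecked(d, e):
    q = d
    if q.kind != "Int32Array":
        raise TypeError("expected Int32Array")
    q = e
    if q.kind != "Int32Array":   # re-check after the variable changed
        raise TypeError("type changed after the entry check")
    raw_store(q, 0, 0x42424242)
```

In the toy, `boom_stale_check` overwrites the first slot of the Object with 0x42424242, while `boom_rechecked` raises a `TypeError` instead. In a real engine the slots are memory locations rather than list items, which is why the consequences are more serious than a clobbered Python list.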

The main idea is that Google is calling attention to the future great online game company’s approach to software engineering. In a word or two, “Poor to poorer.”

My view of the helpful announcement is that Microsoft Certified Professionals will have to explain this problem. Google’s sales team will happily point out this and other flaws in the Microsoft approach to enterprise software.

If you can’t trust a Web browser or remove flawed code from a widely used app, what’s the fix?

Ready for the answer: “Helpful cyber security revelations that make the online ad giant look like a friendly, fluffy Googzilla.” Being helpful is the optimal way to conduct business.

Stephen E Arnold, December 16, 2022

Microsoft and the London Stock Exchange: Lock In Maybe?

December 12, 2022

I believe everything I read on the Internet. That’s one way I keep in touch with my inner GenZ self. Sometimes, however, stories ring true; for example, “Microsoft buys Near 4% Stake in London Stock Exchange As Part of 10 Year Cloud Deal.” I read the title via my dinobaby translation system and understood, “Yep, lock in, kiddo. Oh, Amazon AWS and Google Cloud professionals. Do not bother to call us. We will call you, okay.”

You may disagree with my dinobaby translator. That’s okay. I let many flowers bloom, unlike the London Stock Exchange which goes at life in what appear to be 10 year contracts. That’s a long time in techno-cloud land in my opinion.

The write up says:

Scott Guthrie, Microsoft’s executive vice president for the Cloud and AI Group, will be appointed as a non-executive director of LSEG.

I wonder if he will demo Microsoft Teams egames features and the security systems for Microsoft Exchange Server? Will he offer helpful inputs to those who might want to give an off the shelf AWS Sagemaker system a spin? What about the ever reliable Google VPN service which is super reliable and in demand right now?

The answer to these questions strikes me as obvious. Azure is better, faster, cheaper, more reliable, and easier. I wonder if these benefits entered into the negotiation. (Personally I like the security angle and the cheaper plus.) My instinct has a tiny voice too. It is whispering to me, “Microsoft will deliver premier service to the London Stock Exchange when (which is unlikely) the Azure system hiccups.”

I noted this passage too:

Microsoft and LSEG will also work together in developing new professional collaboration tools. LSEG has developed a product called Workspace, a data and analytics platform. The two companies will be working on advancing this product and integrating it with Microsoft Teams, the firm’s messaging app.

I am tempted to reference the source of the stake, but I won’t. The parties involved make content marketing hay around the “trust” word.

I have a couple of observations:

  1. Microsoft has added a neon underline to the old marketing concept of “lock in.”
  2. The Redmond security giant can point to a big time financial customer and market its secure cloud solutions. Well, they are secure… at this time.
  3. The Amazon and Google cloud professionals will definitely find a way to respond.

Net net: Isn’t it wonderful that big tech innovation involves owning financial plumbing and access?

Stephen E Arnold, December 12, 2022
