Does GitHub Data Grab for AI Training Violate Licenses?

July 22, 2021

Programmer Nora Tindall has taken to Twitter to call out Microsoft property GitHub for violating licenses to train its algorithms. She shares a screenshot of an exchange she had with GitHub Support that seems to confirm her charge:

[Tindall] I am specifically asking if any code from my GitHub account, most of which is licensed GPL, was used in the training set. It is a simple question.

[GitHub] Sorry about the delay in getting back to you. I reached out to the team about this. Apparently all public GitHub code was used in training. We don’t distinguish by license type. I hope that answers your question!

It does indeed answer Tindall’s question, and she vows to pursue legal action. Predictably, the post prompted a flurry of comments, so navigate there to read that debate. It seems like the legality of this data usage is nebulous until courts weigh in. We note this exchange:

[Daniel Monte] Is there any precedent for training an AI on copyrighted content being a violation of said copyright?

[Nora Tindall] No, there’s no precedent in any of this. This is the deciding moment for the future of the copyleft ideal, and of free software in general. Maybe for copyright as a whole, actually, since this has applications outside software.

[Laurie] The law on all of this is basically non-existent. And there aren’t enough people who really understand the nuances who are also lawyers. It’s a whole mess which results in companies getting to decide for themselves. Not good.

[Critical Oil Theory Salesman] Hard agree. I’d imagine that we would see a completely different set of legal interpretations if the open source community trained a GPT3 model on Microsoft’s publicly available code.

Perhaps. That would be an interesting experiment. Is Microsoft really ignoring licenses? If not, Twitter is disseminating incorrect information. If yes, then Microsoft has designs on open source information in a way that outfoxes Amazon-type open source maneuvers. But Microsoft is busy securing its own code and may want to envelop GitHub in the same cyber goodness.

Cynthia Murrell, July 22, 2021

The Future of Open Source Software: Who Knows? Maybe VCs and Data Aggregators?

July 5, 2021

I read some of the write ups about Audacity, an audio editor, maybe a wannabe digital audio workstation. One group of write ups foretells the future of this particular open source software as spyware. A good example is “Audacity 3.0 Called Spyware over Data Collection Changes by New Owner.” This is an interesting premise. In some countries, companies, even those wearing open source penguins and waving FOSS flags, must comply. I suppose the issue is intentional data collection. From a business perspective, there’s money in them thar data elements. And money, not free, is the name of the game in some circles; for example, developers who cut corners in code and construction.

There’s another side to the argument. A reasonable example is “Audacity Is a Poster Child for What Can Be Achieved with Open-Source Software.” The main idea is that if one does not like Audacity, a skilled person can whip up an Audacity variant. (Is the process as simple as creating a Covid variant?) The write up points out:

the first version of Audacity was released in 1999 (at the time the name was different)…Still, while it might look outdated, Audacity doesn’t lack when it comes to features. If you can find them.

High praise?

The author of “Poster Child” opines:

From what I’ve seen over the last two months, Muse Group seems to have its heart in the right place. And if the opposite comes true, there’s always GitHub’s fork button. For now, though, it seems that not only is Audacity in good hands, but that it might be finally getting that design refresh it desperately needs.

The point I carried away from these two write ups is that open source is at an inflection point.

Making money from software is more difficult than it seems. Services, subscribing, consulting, for-fee extras, training, customizing, and platforming are quite different from picking up a box with a disc inside.

Observations:

  1. Open source software is supposed to give users a way to free themselves from the handcuffs of license agreements. Aren’t the new monetization methods a form of handcuffing?
  2. The value of sucked up data, packaged and licensed to an aggregator, is an interesting path forward. Hey, Intuit licensed its small business user data to a quite interesting ConAgra of data collection. What’s good for the proprietary goose may be very, very good for an open source gander.
  3. Monopolization of software functionality hooked into the ever-secure ever-so-reliable cloud sets the stage for no-code alternatives. Will these be free and open? My hunch: Unlikely.

Net net: Open source, she be changin’.

Stephen E Arnold, July 5, 2021

Search and the Bezos Bulldozer

April 13, 2021

For the last three years, I have been giving lectures about the lock in methods implemented by Amazon. I refer to the company as the online bookstore in order to remind those in my audiences that Amazon has a friendly facet. That’s exemplified by the smile logo. Amazon also has a Wall Street persona which is built upon the precepts of MBAism.

I will be talking about Amazon and its policeware strategy at the 2021 National Cyber Crime Conference. If you want a similar presentation tailored to commercial interests, let me know. I can be reached via benkent2020 at yahoo dot com. My LE and intel work is pro bono; commercial work incurs a fee.

I want to mention a subject I won’t be addressing directly in my upcoming lecture later this month. The subject is an Amazon blog post titled “Introducing OpenSearch.” I would also direct your attention to the comments submitted to the Ycombinator discussion of the announcement. You can find those interesting and varied remarks from hundreds of people at this link.

The news is that Amazon is taking quite predictable steps to recast search and retrieval so that it becomes another of the hundreds of functions, services, and features of Amazon Web Services. AWS hired people from Lucid Imagination (now LucidWorks) years ago. Many have forgotten that Amazon also operated A9, a Web search system with a street view function. There are other findability functions embedded in Amazon as well; for example, the “search” function in Amazon’s blockchain inventions. (Yes, I have a for fee lecture about that technology too. Because money laundering is a growing problem, the Amazon methods are likely to become increasingly important to certain government agencies in the future.)

The little secret about open source software, which many overlook, is that the strongest supporters of FOSS and community supported code are large companies. I did a series of reports for the IDC outfit, and I am not sure what that now dismantled organization did with the data. A couple of chapters were sold on Amazon for $3,000, but the topic was not a magnet when we assembled the information six or seven years ago.

Since Amazon is engaged in a battle for one part of the “enterprise” with Microsoft, the online bookstore is actively seeking ways to attract large organizations as customers, lock them in, and then implement the tactics which benefit from Amazon’s knowledge of its customers’ behavior. The use of the “retail” tactic (watch, duplicate, and leverage house brands) is documented in the reports from vendors who have had their toes nipped by the bulldozer’s steel caterpillar traction system.

Why’s this germane to “search”? Here are the reasons:

  • Search and retrieval is an essential utility for modern work. Amazon wants to generate revenue and other business benefits by having a “better” and (if possible) community supported software base. Search will become part of the lubricant for other Amazon enterprise services; for example, locating tax avoiders.
  • Search becomes the glue and the circulatory system for information analysis and use. No search; no high value outputs. Machine learning is little more than a supporting technology to finding needed information. Many disagree with me, but marketing clouds many experts’ thinking. Search is a core function and requires many subsystems and technical methods.
  • Once users become habituated to search, change is difficult. Amazon is one of the few outfits to have undermined Google search. Product searches are increasingly under Amazon’s control. The ElasticSearch “play” is going to become the vehicle for a broader utility attack.

I have quipped that Amazon has targeted Elastic and the ElasticSearch “system” because it has the same name as some of Amazon’s services. If Amazon is successful in its search maneuver, Shay Banon’s findability play will be marginalized.

There are larger implications quite beyond a comment made to elicit a laugh at a reception at an enterprise search conference. These include:

  • Seamless integration with SageMaker and other advanced functionalities available from Amazon
  • A lever for technical and financial leverage for innovators who use Amazon as the plumbing for their start ups, not Microsoft technology
  • A model for Amazon and maybe other companies to use for shifting open source software into a variation on the FUD (fear, uncertainty, and doubt) approach to closing deals. The mantra could become “Nobody ever got fired for buying AWS.”

For the companies generating scorecards for enterprise search vendors, significant change is likely. The numerous vendors of proprietary enterprise search will have to make some changes in their approach to Amazon. Many of these Elastic alternatives use AWS for certain functions. What happens if the pricing structure, the legalese, or the access to certain AWS services “evolve”? What will start ups and Amazon partners do if access to search functions becomes free or requires contributions to the AWS version of open source?

Worth watching, right? The answer is, “Nah, you are way off base.” Yep, just as I was in my analysis of Google for Bear Stearns many years ago. I have a track record of getting thrown out as I head for second base.

Stephen E Arnold, April 13, 2021

Microsoft Exchange After Action Action: Adulting or Covering Up?

March 12, 2021

I read “Researcher Publishes Code to Exploit Microsoft Exchange Vulnerabilities on GitHub.” The allegedly accurate “real” news report states:

On Wednesday, independent security researcher Nguyen Jang published on GitHub a proof-of-concept tool to hack Microsoft Exchange servers that combined two of those vulnerabilities. Essentially, he published code that could be used to hack Microsoft customers, exploiting a bug used by Chinese government hackers—on an open-source platform owned by Microsoft.

What happened?

Microsoft took down the hacking tool. “GitHub took down it,” the researcher told Motherboard in an email. “They just send [sic] me an email.” On Thursday, a GitHub spokesperson confirmed to Motherboard that the company removed the code due to the potential damage it could cause.

Interesting.

Two questions crossed my mind:

  1. Is Microsoft showing more management responsibility with regard to the data posted on GitHub? Editorial control is often useful, particularly when the outputting mechanism provides a wealth of information and code. Some of these items can be used to create issues. Microsoft purchased GitHub and may now be forced to take a more adult view of the service.
  2. Is Microsoft covering up the flaws in its core processes? After reading Microsoft’s explanations of the SolarWinds misstep, the injection of marketing spin and intriguing rhetoric about responsibility open the door to a bit of Home Depoting; that is, paint, wood panel, and a bit of carpet make an ageing condo look better.

Worth watching: both the breaches, which are concerning, and the GitHub service, which can cause some individuals’ brows to furrow.

Stephen E Arnold, March 12, 2021

Elastic and Its Approach to Its Search Business

February 16, 2021

This blog post is about Elastic, the Shay Banon information retrieval company, not Amazon AWS Elastic services. Confused yet? The confusion will only increase over time because the “name” Elastic is going to be difficult to keep intact due to Amazon’s ability to erode brand names.

But that’s just one challenge for the Elastic search company founded by the magic behind Compass Search. An excellent analysis of Elastic search’s challenges appears in “Elastic Has Stretched the Patience of Many in Open Source. But Is There Room for a Third Way?”

The write up quotes an open source expert as saying:

Let’s be really clear – it’s a move from open to proprietary as a consequence of a failed business model decision…. Elastic should have thought their revenue model through up front. By the time the team made the decision to open source their code, the platform economy existed and their decisions to open source ought to have been aligned to an appropriate business model.

I circled this statement in the article:

Sympathy for Elastic’s position comes from a perhaps unexpected source. Matt Asay, principal at Elastic’s bête noire AWS, believes it’s time to revisit the idea of “shared source”, a licensing scheme originally dreamed up by Microsoft two decades ago as an answer to the then-novel open source concept. In shared source, code is open – as in visible – but its uses are restricted… The heart of the problem is about who gets to profit from open source software. To help resolve that problem, we just might need new licensing.

Information retrieval is not about precision and recall, providing answers to users, or removing confusion about terms and product names — search is about money. Making big bucks from a utility service continues to lure some and smack down others. Now it is time to be squishy and bouncy I suppose.

Stephen E Arnold, February 16, 2021

Open Source Software: The Community Model in 2021

January 25, 2021

I read “Why I Wouldn’t Invest in Open-Source Companies, Even Though I Ran One.” I became interested in open source search when I was assembling the first of three editions of Enterprise Search Report in the early 2000s. I debated whether to include Compass Search, the precursor to Shay Banon’s Elasticsearch reprise. Over the years, I have kept my eye on open source search and retrieval. I prepared a report for the outfit IDC, which happily published sections of the document and offered my write ups for $3,000 on Amazon. Too bad IDC had no agreement with me, managers who made Daffy Duck look like a model for MBAs, and a keen desire to find a buyer. Ah, the book still resides on one of my back up drives, and it contains a run down of where open source was getting traction. I wrote the report in 2011 before getting the shaft-a-rama from a mid tier consulting firm. Great experience!

The report included a few nuggets which in 2011 not many experts in enterprise search recognized; for instance:

  1. Large companies were early and enthusiastic adopters of open source search; for example Lucene. Why? Reduce costs and get out of the crazy environment which put Fast Search & Transfer-type executives in prison for violating some rules and regulations. The phrase I heard in some of my interviews was, “We want to get out of the proprietary software handcuffs.” Plus big outfits had plenty of information technology resources to throw at balky open source software.
  2. Developers saw open source in general and contributing to open source information retrieval projects as a really super duper way to get hired. For example, IBM — an early enthusiast for a search system which mostly worked — used the committers as feedstock. The practice became popular among other outfits as well.
  3. Venture outfits stuffed with oh-so-technical MBAs realized that consulting services could be wrapped around free software. Sure, there were legal niceties in the open source licenses, but these were not a big deal when Silicon Valley super lawyers were just a text message away.

There were other findings as well, including the initiatives underway to embed open source search, content processing, and related functions into commercial products. Attivio (formed by former super star managers from Fast Search & Transfer), Lucid Works, IBM, and other bright lights adopted open source software to [a] reduce costs, [b] eliminate the R&D required to implement certain new features, and [c] develop expensive, proprietary components, training, and services.

Enterprise Search: Flexible and Stretchy. Er, No.

January 21, 2021

Enterprise search, the utility service, thrills users and information technology professionals alike. There are quite a few search and retrieval vendors chasing revenue. Frankly I have given up trying to keep track of outfits like Luigi’s Box, Yext (yes, enterprise search!), and quite a few repackagers of Lucene; e.g., IBM, Attivio, Voyager Search, and more. There are some proprietary outfits as well.

Then there is the Compass Search sibling Elastic and its Elasticsearch. Open source search makes a great deal of sense to:

  • Companies wanting a no cost or low cost way to provide search and retrieval-type functionality to an application
  • Penny pinchers who want “the community” to fix bugs so that cash is freed up to lease fancy cars, receive bonuses, and focus on more important software features which can be offered for a fee and a license handcuff
  • Competitors who want to provide a familiar environment to those with cash to spend and wave the magic wand of open source in front of young believers who think proprietary software is a crime against humanity.

The history of Elasticsearch and Amazon reaches back to the era when Lucid Works (né Lucid Imagination) lost some staff to Amazon’s Burlingame, California, office. That was the bell which sounded when the Bezos bulldozer decided A9 was not enough. Sure, A9 works but the folks from the Lucene/Solr outfit would map the route from A9 to a more open, folksy world of open source search.

The open source version of Lucene was the beating heart of Elastic, the now public company.

Then Amazon did what Amazon does: The company shifted the bulldozer into gear and went for open source search developers who could seamlessly (sort of) move into the newly blazed path to AWS. Once inside, the fruits of the thousand plus services, features, and functions were just a click away. Policeware vendors, start ups, and some big outfits followed the Bezos bulldozer. The updated IBM slogan reads, “Nobody gets fired for buying AWS.”

Elastic was upset.

“Amazon: NOT OK – Why We Had to Change Elastic Licensing” picks up this story and explains where Elastic fits into the world crafted by the Bezos bulldozer.

The write up explains:

Our license change is aimed at preventing companies from taking our Elasticsearch and Kibana products and providing them directly as a service without collaborating with us.

Elastic’s essay notes:

We think that Amazon’s behavior is inconsistent with the norms and values that are especially important in the open source ecosystem. Our hope is to take our presence in the market and use it to stand up to this now so others don’t face these same issues in the future.

The essay concludes:

I believe in the core values of the Open Source Community: transparency, collaboration, openness. Building great products to the benefit of users across the world. Amazing things have been built and will continue to be built using Elasticsearch and Kibana. And to be clear, this change most likely has zero effect on you, our users. And no effect on our customers that engage with us either in cloud or on premises.

Several observations:

  1. Commercial behemoths like Amazon use open source the way my neighbor burns firewood, old carpets, and newspapers. The goal is to optimize available cash.
  2. Amazon’s move into Elastic’s territory began more than five years ago. Amazon does kill off loser products like health and food delivery but it keeps others in tall cotton when it pays off.
  3. Those completing [a] Amazon certification, [b] partner indoctrination, or [c] inputs from free or low cost Amazon training arrive ready to do the search thing Amazon’s way.

Net net: Beyond Search understands Elastic’s anguish and actions. Perhaps the license shift and the assumptions about open source are unlikely to stand up to the Bezos bulldozer? Open source Elasticsearch is a bargain. It may be tough to compete with free plus discounts for AWS goodies and other Amazon benefits. Legal or illegal, fair or unfair, open source or closed source — the bulldozer grinds forward.

Stephen E Arnold, January 21, 2021

Mobile and Social Media Users: Check Out the Utility of Metadata

January 15, 2021

Policeware vendors once commanded big, big bucks to match a person of interest to a location. Over the last decade prices have come down. Some useful products cost a fraction of the industrial strength, incredibly clumsy tools. If you are thinking about the hassle of manipulating data in IBM or Palantir products, you are in the murky field of prediction. I have not named the products which I think are the winners of this particular race.

Source: https://thepatr10t.github.io/yall-Qaeda/

The focus of this write up is the useful information derived from the deplatformed Parler social media outfit. An enterprising individual named Patri10tic performed the sort of trick which Geofeedia made semi famous. You can check the map placing specific Parler users in particular locations based on their messages at this link. What’s the time frame? The unusual protest at the US Capitol.

The point of this short post is different. I want to highlight several points:

  1. Metadata can be more useful than the content of a particular message or voice call
  2. Metadata can be mapped through time creating a nifty path of an individual’s movements
  3. Metadata can be cross correlated with other data. (If you attended one of my Amazon policeware lectures, the cross correlation figures prominently.)
  4. Metadata can be analyzed in more than two dimensions.
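Point 2 in the list above is easy to illustrate. What follows is a minimal sketch, not the method used for the Parler map: it assumes each metadata record is a dictionary with hypothetical `user`, `ts` (timestamp), `lat`, and `lon` fields, groups the records by user, sorts each group by time, and measures the resulting path with the standard haversine formula. The sample records are invented for illustration.

```python
from collections import defaultdict
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def build_paths(records):
    """Group records by user and order each user's points by timestamp."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["user"]].append((rec["ts"], (rec["lat"], rec["lon"])))
    return {
        user: [point for _, point in sorted(points, key=lambda p: p[0])]
        for user, points in grouped.items()
    }


def path_length_km(path):
    """Sum the distances between consecutive points on a path."""
    return sum(haversine_km(p, q) for p, q in zip(path, path[1:]))


# Hypothetical records; field names and values are assumptions for this sketch.
records = [
    {"user": "u1", "ts": 1700, "lat": 38.8899, "lon": -77.0091},
    {"user": "u1", "ts": 1500, "lat": 38.8977, "lon": -77.0365},
    {"user": "u2", "ts": 1600, "lat": 38.8893, "lon": -77.0502},
]

for user, path in build_paths(records).items():
    print(user, len(path), "points", round(path_length_km(path), 2), "km")
```

The design choice is deliberately plain: no mapping library and no database, just the standard library. That is the point about cost. Cross correlation (point 3) would amount to joining these per-user paths against another data set keyed on time or place.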

To sum up, I want to remind journalists that this type of data detritus has enormous value. That is the reason third parties attempt to bundle data together and provide authorized users with access to them.

What’s this have to do with policeware? From my point of view, almost anyone can replicate what systems costing as much as seven figures a year or more do, working from a laptop at an outdoor table near a coffee shop.

Policeware vendors want to charge a lot. The Parler analysis demonstrates that there are many uses for low or zero cost geo manipulations.

Stephen E Arnold, January 15, 2021

Open Source: Does It Mean What You Think It Means?

January 15, 2021

I spotted an article on Newswire called “Tech Giant Technology Is Open Source for the Pandemic, So Why Does It Feel So Closed?” The awkward title intrigued me. Open means, according to Dictionary.com:

not closed or barred at the time, as a doorway by a door, a window by a sash, or a gateway by a gate: to leave the windows open at night.

(of a door, gate, window sash, or the like) set so as to permit passage through the opening it can be used to close.

Pretty obvious. But open appears to mean closed. The “source” refers to software I assumed.

The write up sets me straight:

“The term ‘open source’ is being applied to the final design of an instrument – and I’m pleased to say there has been a willingness during the pandemic to share these final designs – but the design process itself also needs to be open, something it isn’t now,” explains physics researcher Dr Julian Stirling.

Okay, the “design process” has to be available. To get more insight into this open is closed issue, navigate to the original technical paper at this link. So far the paper is open, but as I have learned, open can be closed and often locked up behind a paywall.

Stephen E Arnold, January 15, 2021

Does Open Source Create Open Doors?

December 21, 2020

Here’s an interesting question I asked on a phone call on Sunday, December 20, 2020: “How many cyber security firms rely on open source software?”

Give up?

As far as my research team has been able to determine, no study is available to us to answer the question. I told the team that based on comments made in presentations, at lectures, and in booth demonstrations at law enforcement and intelligence conferences, most of the firms do. Whether it is a utility function like Elasticsearch or a component (code or library) that detects malicious traffic, open source is the go-to source.

The reasons are not far to seek and include:

  • Grabbing open source code is easy
  • Open source software is usually less costly than a proprietary commercial tool
  • Licensing allows some fancy dancing
  • Using what’s readily available and maintained by a magical community of one, two or three people is quick
  • Assuming that the open source code is “safe”; that is, not malicious.

My question was prompted after I read “How US Agencies’ Trust in Untested Software Opened the Door to Hackers.” The write up states:

The federal government conducts only cursory security inspections of the software it buys from private companies for a wide range of activities, from managing databases to operating internal chat applications.

That write up ignores the open source components commercial cyber security firms use. The reason many of the services look and function in a similar manner is due to a reliance on open source methods as well as the nine or 10 work horse algorithms taught in university engineering programs.

What’s the result? A SolarWinds type of challenge. No one knows the scope, no one knows the optimal remediation path, and no one knows how many vulnerabilities exist and are actively being exploited.

Here’s another question, “How many of the whiz kids working in US government agencies communicate the exact process for selecting, vetting, and implementing open source components directly (via 18f type projects) or from vendors of proprietary cyber security software?”

Stephen E Arnold, December 21, 2020
