Open Source: Dietary Insights

May 5, 2022

One of the more benign news briefs about Russia these days concerns the eating habits of the country’s secret police. The Verge explains how delivery apps revealed Russian law enforcement’s food preferences: “Data Leak From Russian Delivery App Shows Dining Habits Of The Secret Police.” A massive data leak from Yandex Food, a large food delivery service in Russia, contained names, addresses, phone numbers, and delivery instructions related to the secret police.

Yandex Food is a subsidiary of the Russian search engine of the same name. The data leak occurred on March 1 and Yandex blamed it on the bad actions of one of its employees. The leak did not include users’ login information. The Roskomnadzor, the Russian government agency responsible for mass media, threatened Yandex with a 100,000 ruble fine and it also blocked a map containing citizen and secret police data.

Bellingcat researchers were investigating leads on the poisoning of Alexey Navalny, the Russian opposition leader. They searched the Yandex Food database collected from a prior investigation and discovered a person who was in contact with Russia’s Federal Security Service (FSB) to plan Navalny’s poisoning. The individual used his work email to register with Yandex Food. They also searched for phone numbers linked to Russia’s Main Intelligence Directorate (GRU). Bellingcat found interesting information in the leak:

“Bellingcat uncovered some valuable information by searching the database for specific addresses as well. When researchers looked for the GRU headquarters in Moscow, they found just four results — a potential sign that workers just don’t use the delivery app, or opt to order from restaurants within walking distance instead. When Bellingcat searched for FSB’s Special Operation Center in a Moscow suburb, however, it yielded 20 results. Several results contained interesting delivery instructions, warning drivers that the delivery location is a military base. One user told their driver “Go up to the three boom barriers near the blue booth and call. After the stop for bus 110 up to the end,” while another said ‘Closed territory. Go up to the checkpoint. Call [number] ten minutes before you arrive!’”

The most scandalous information leaked from the Yandex Food breach was information about Putin’s former mistress and their “suspected daughter.”

While it is hilarious to read about Russian law enforcement’s eating habits, it is alarming when the situation is applied to the United States. Imagine all of the information DoorDash, Grubhub, Uber Eats, and other delivery services collect on customers. There was a DoorDash data leak in 2019 that affected 4.9 million people and it was much larger than the Yandex Food leak.

Whitney Grace, May 5, 2022

OSINT for Amateurs

January 13, 2022

Today I had a New Year chat with a person whom I met at specialized services conferences. I relayed to my friend the news that Robert David Steele, whom I had known since 1986, died in the autumn of 2021. Steele, a former US government professional, was described as one of the people who pushed open source intelligence down the bobsled run to broad use in government entities. Was he the “father of OSINT”? I don’t know. He and I talked via voice and email each week for more than 30 years. Our conversations explored the value of open source intelligence and how to obtain it.

After the call I read “How to Find Anyone on the Internet for Free.”

Wow, shallow. Steele would have had sharp words for the article.

The suggestions are just okay. Plus it is clear that a lack of awareness about OSINT exists.

My suggestion is that anyone writing about this subject spend some time learning about OSINT. There are books from professionals like Steele as well as my CyberOSINT: Next Generation Information Access. Also, attending a virtual conference about OSINT offered by those who have a background in intelligence would be useful. Finally, there are numerous resources available from intelligence gathering organizations. Some of these “lists” include a description of each site, service, or system mentioned.

For me and my team’s part, we are working to create 60 second videos which we will make available on Instagram-type services. Each short profile of an OSINT resource will appear under the banner “OSINT Radar.” These will be high value OSINT resources. Some of this information will also be presented in a new series of short articles and videos that Meg Coker, a former senior telecommunications executive, and I will create. Look for these in LinkedIn and other online channels.

Hopefully the information from OSINT Radar and the Coker-Arnold collaboration will point readers to OSINT resources which are useful and effective. Free and OSINT can go together, but the hard reality is that an increasing number of OSINT resources charge for the information on offer.

OSINT, unfortunately, is getting more difficult to obtain. Examples include China’s cut offs of technology information and the loss of shipping and train information from Ukraine. And there are more choke points; for example, Iran and North Korea. This means that OSINT is likely to require more effort than previously. The mix of machine and human work is changing. Consequently more informed and substantive information about OSINT will be required in 2022. The OSINT for amateurs approach is an outdated game.

Coker and Arnold are playing a new game.

Stephen E Arnold, January 13, 2022

Cherche: A Neural Search Pipeline

January 10, 2022

For fans of open source search, Cherche is available. The GitHub write up states:

Cherche is meant to be used with small to medium sized corpora. Cherche’s main strength is its ability to build diverse and end-to-end pipelines.

The “neural search” module includes ElasticSearch. The programming team for Cherche consists of Raphaël Sourty and François-Paul Servant. Beyond Search has not fired up the system and run it against our test corpus. We did have in our files a paper called “Knowledge Base Embedding by Cooperative Knowledge Distillation.” That paper states:

Given a set of KBs, our proposed approach KDMKB, learns KB embeddings by mutually and jointly distilling knowledge within a dynamic teacher-student setting. Experimental results on two standard datasets show that knowledge distillation between KBs through entity and relation inference is actually observed. We also show that cooperative learning significantly outperforms the two proposed baselines, namely traditional and sequential distillation.

The idea is that instead of retrieving strings, broader tags (concepts and classifications) appear to provide an advantage; pushing “beyond” old school search.
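The “pipeline” notion in the GitHub write up can be pictured with a small sketch. The code below is not Cherche's API; the class names `KeywordRetriever` and `TfIdfRanker`, the `pipeline` helper, and the three sample documents are all hypothetical, and a plain TF-IDF score stands in for the neural ranker a real system would use. It shows only the general shape: a cheap first stage pulls candidates, and a second stage reorders them.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


class KeywordRetriever:
    """First stage (hypothetical): keep documents sharing a token with the query."""

    def __init__(self, documents):
        self.documents = documents
        self.tokens = [set(tokenize(d["text"])) for d in documents]

    def __call__(self, query):
        wanted = set(tokenize(query))
        return [d for d, toks in zip(self.documents, self.tokens) if wanted & toks]


class TfIdfRanker:
    """Second stage (hypothetical): TF-IDF stand-in for a neural ranker."""

    def __init__(self, documents):
        n = len(documents)
        df = Counter()
        for d in documents:
            df.update(set(tokenize(d["text"])))
        # Smoothed inverse document frequency per token.
        self.idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}

    def score(self, query, doc):
        tf = Counter(tokenize(doc["text"]))
        return sum(tf[t] * self.idf.get(t, 0.0) for t in tokenize(query))

    def __call__(self, query, candidates):
        return sorted(candidates, key=lambda d: self.score(query, d), reverse=True)


def pipeline(query, retriever, ranker, k=3):
    """Chain the two stages: retrieve candidates, then rank and truncate."""
    return ranker(query, retriever(query))[:k]


# Illustrative documents, invented for this sketch.
docs = [
    {"id": 1, "text": "open source search pipeline"},
    {"id": 2, "text": "neural search with open source models"},
    {"id": 3, "text": "food delivery data leak"},
]

results = pipeline("neural search", KeywordRetriever(docs), TfIdfRanker(docs))
print([d["id"] for d in results])
```

Swapping either stage for a different component (a semantic retriever, a learned ranker) leaves the `pipeline` function untouched, which is the appeal of the approach for small to medium sized corpora.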

Stephen E Arnold, January 10, 2022

Microsoft: A Legitimate Point about Good Enough

October 20, 2021

A post by Stefan Kanthak caught my attention. The reason was an assertion that highlights what may be the “good enough” approach to software. The article is “Defense in Depth — the Microsoft Way (Part 78): Completely Outdated, Vulnerable Open Source Component(s) Shipped with Windows 10&11.” I am in the ethical epicenter of the US not too far from some imposing buildings in Washington, DC. This means I have not been able to get one of my researchers to verify the information in the Stefan Kanthak post. I, therefore, want to point out that it may be horse feathers.

Here’s the point I noted in the write up:

Most obviously Microsoft’s processes are so bad that they can’t build a current version and have to ship ROTTEN software instead!

What’s “rotten”?

The super security conscious outfit is shipping outdated versions of two open source software components: Curl.exe and Tar.exe.

If true, Stefan Kanthak may have identified another example of the “good enough” approach to software. If not true, Microsoft is making sure its software is really super duper secure.
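A reader who wants to check the claim on a local machine could compare the version reported by each tool against a known release. The sketch below is one hedged way to do that. The helper names, the sample banner string, and the `HYPOTHETICAL_MINIMUM` threshold are all invented for illustration; none of the version numbers comes from the Kanthak post, and the real reference point would be the current release listed by each project.

```python
import re
import subprocess


def parse_version(banner):
    """Return the first dotted version in a --version banner as a tuple."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", banner)
    if match is None:
        return None
    return tuple(int(part) if part else 0 for part in match.groups())


def installed_version(tool):
    """Run `<tool> --version` and parse its output; None if unavailable."""
    try:
        done = subprocess.run(
            [tool, "--version"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return parse_version(done.stdout)


def is_outdated(found, minimum):
    """True when a version was found and it is older than the minimum."""
    return found is not None and found < minimum


# Illustrative banner and threshold only; both are assumptions.
SAMPLE_BANNER = "curl 7.55.1 (Windows) libcurl/7.55.1 WinSSL"
HYPOTHETICAL_MINIMUM = (7, 80, 0)

if __name__ == "__main__":
    for tool in ("curl", "tar"):
        print(tool, installed_version(tool))
```

Comparing tuples rather than strings avoids the trap where "7.9" sorts after "7.55" as text.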

Stephen E Arnold, October 20, 2021

Quote to Note: An Open Source Developer Speaks Truth

August 10, 2021

Navigate to “Lessons Learned from 15 Years of SumatraPDF, an Open Source Windows App.” Please, read the article. It is excellent and applicable to commercial software as well.

Here’s the quote I circled and enhanced with an exclamation point:

… changing things takes effort and the path of least resistance is to do nothing.

Keep this statement in mind when Microsoft says it has enhanced the security of its updating method or when Google explains that it has improved its search algorithm.

The author of “Lessons Learned…” quotes Jeff Bezos (the cowboy hat wearing multi billionaire who sent interesting images which were stunning I have heard) as saying:

There will never be a time when users want bloated and slow apps so being small and fast is a permanent advantage.

I would add that moving data rapidly out of an AWS module evokes an Arnold corollary:

Speed costs more, often a lot more.

The essay is a good one, and I recommend that you read it, not just the quotes I reproduced in this positive comment about the content.

Stephen E Arnold, August 10, 2021

Does GitHub Data Grab for AI Training Violate Licenses?

July 22, 2021

Programmer Nora Tindall has taken to Twitter to call out Microsoft property GitHub on violating licenses for algorithm training purposes. She shares a screenshot of an exchange she had with GitHub Support that seems to confirm her charge:

[Tindall] I am specifically asking if any code from my GitHub account, most of which is licensed GPL, was used in the training set. It is a simple question.

[GitHub] Sorry about the delay in getting back to you. I reached out to the team about this. Apparently all public GitHub code was used in training. We don’t distinguish by license type. I hope that answers your question!

It does indeed answer Tindall’s question, and she vows to pursue legal action. Predictably, the post prompted a flurry of comments, so navigate there to read that debate. It seems like the legality of this data usage is nebulous until courts weigh in. We note this exchange:

[Daniel Monte] Is there any precedent for training an AI on copyrighted content being a violation of said copyright?

[Nora Tindall] No, there’s no precedent in any of this. This is the deciding moment for the future of the copyleft ideal, and of free software in general. Maybe for copyright as a whole, actually, since this has applications outside software.

[Laurie] The law on all of this is basically non-existent. And there aren’t enough people who really understand the nuances who are also lawyers. It’s a whole mess which results in companies getting to decide for themselves. Not good.

[Critical Oil Theory Salesman] Hard agree. I’d imagine that we would see a completely different set of legal interpretations if the open source community trained a GPT3 model on Microsoft’s publicly available code.

Perhaps. That would be an interesting experiment. Is Microsoft really ignoring licenses? If not, Twitter is disseminating incorrect information. If yes, then Microsoft has designs on open source information in a way that outfoxes Amazon-type open source maneuvers. But Microsoft is busy securing its own code and may want to envelop GitHub in the same cyber goodness.

Cynthia Murrell, July 22, 2021

The Future of Open Source Software: Who Knows? Maybe VCs and Data Aggregators?

July 5, 2021

I read some of the write ups about Audacity, an audio editor, maybe a wannabe digital audio workstation. One group of write ups foretells the future of this particular open source software as spyware. A good example is “Audacity 3.0 Called Spyware over Data Collection Changes by New Owner.” This is an interesting premise. In some countries, companies, even those wearing open source penguins and waving FOSS flags, must comply. I suppose the issue is intentional data collection. From a business perspective, there’s money in them thar data elements. And money, not free, is the name of the game in some circles; for example, developers who cut corners in code and construction.

There’s another side to the argument. A reasonable example is “Audacity Is a Poster Child for What Can Be Achieved with Open-Source Software.” The main idea is that if one does not like Audacity, a skilled person can whip up an Audacity variant. (Is the process as simple as creating a Covid variant?) The write up points out:

the first version of Audacity was released in 1999 (at the time the name was different)…Still, while it might look outdated, Audacity doesn’t lack when it comes to features. If you can find them.

High praise?

The author of “Poster Child” opines:

From what I’ve seen over the last two months, Muse Group seems to have its heart in the right place. And if the opposite comes true, there’s always GitHub’s fork button. For now, though, it seems that not only is Audacity in good hands, but that it might be finally getting that design refresh it desperately needs.

The point I carried away from these two write ups is that open source is at an inflection point.

Making money from software is more difficult than it seems. Services, subscribing, consulting, for-fee extras, training, customizing, and platforming are quite different from picking up a box with a disc inside.

Observations:

  1. Open source software is supposed to give users a way to free themselves from the handcuffs of license agreements. Aren’t the new monetization methods a form of handcuffing?
  2. The value of sucked up data, packaged and licensed to an aggregator, is an interesting path forward. Hey, Intuit licensed its small business user data to a quite interesting ConAgra of data collection. What’s good for the proprietary goose may be very, very good for an open source gander.
  3. Monopolization of software functionality hooked into the ever-secure ever-so-reliable cloud sets the stage for no-code alternatives. Will these be free and open? My hunch: Unlikely.

Net net: Open source, she be changin’.

Stephen E Arnold, July 5, 2021

Search and the Bezos Bulldozer

April 13, 2021

For the last three years, I have been giving lectures about the lock in methods implemented by Amazon. I refer to the company as the online bookstore in order to remind those in my audiences that Amazon has a friendly facet. That’s exemplified by the smile logo. Amazon also has a Wall Street persona which is built upon the precepts of MBAism.

I will be talking about Amazon and its policeware strategy at the 2021 National Cyber Crime Conference. If you want a similar presentation tailored to commercial interests, let me know. I can be reached via benkent2020 at yahoo dot com. My LE and intel work are pro bono; commercial work incurs a fee.

I want to mention a subject I won’t be addressing directly in my upcoming lecture later this month. The subject is an Amazon blog post titled “Introducing OpenSearch.” I would also direct your attention to the comments submitted to the Ycombinator discussion of the announcement. You can find those interesting and varied remarks from hundreds of people at this link.

The news is that Amazon is taking quite predictable steps to recast search and retrieval so that it becomes another of the hundreds of functions, services, and features of Amazon Web Services. AWS hired people from Lucid Imagination (now LucidWorks) years ago. Many have forgotten that Amazon operated A9, a Web search system with a street view function, as well. There are other findability functions embedded in Amazon as well; for example, the “search” function in Amazon’s blockchain inventions. (Yes, I have a for fee lecture about that technology as well. Because money laundering is a growing problem, the Amazon methods are likely to become increasingly important to certain government agencies in the future.)

The little secret about open source software, which many overlook, is that the strongest supporters of FOSS and community supported code are large companies. I did a series of reports for the IDC outfit, and I am not sure what that now dismantled organization did with the data. A couple of chapters were sold on Amazon for $3,000, but the topic was not a magnet when we assembled the information six or seven years ago.

Since Amazon is engaged in a battle for one part of the “enterprise” with Microsoft, the online bookstore is actively seeking ways to attract large organizations as customers, lock them in, and then implement the tactics which benefit from Amazon’s knowledge of its customers’ behavior. The use of the “retail” tactic watch, duplicate, and leverage house brands is documented in the reports from vendors who have had their toes nipped by the bulldozer’s steel caterpillar traction system.

Why’s this germane to “search”? Here are the reasons:

  • Search and retrieval is an essential utility for modern work. Amazon wants to generate revenue and other business benefits by having a “better” and (if possible) community supported software base. Search will become part of the lubricant for other Amazon enterprise services; for example, locating tax avoiders.
  • Search becomes the glue and the circulatory system for information analysis and use. No search; no high value outputs. Machine learning is little more than a supporting technology to finding needed information. Many disagree with me, but marketing clouds many experts’ thinking. Search is a core function and requires many subsystems and technical methods.
  • Once users become habituated to search, change is difficult. Amazon is one of the few outfits to have undermined Google search. Product searches are increasingly under Amazon’s control. The ElasticSearch “play” is going to become the vehicle for a broader utility attack.

I have quipped that Amazon has targeted Elastic and the ElasticSearch “system” because it has the same name as some of Amazon’s services. If Amazon is successful in its search maneuver, Shay Banon’s findability play will be marginalized.

There are larger implications quite beyond a comment made to elicit a laugh at a reception at an enterprise search conference. These include:

  • Seamless integration with SageMaker and other advanced functionalities available from Amazon
  • A lever for technical and financial leverage for innovators who use Amazon as the plumbing for their start ups, not Microsoft technology
  • A model for Amazon and maybe other companies to use for shifting open source software into a variation on the FUD (fear, uncertainty, and doubt) approach to closing deals. The mantra could become “Nobody ever got fired for buying AWS.”

For the companies generating scorecards for enterprise search vendors, significant change is likely. The numerous vendors of proprietary enterprise search will have to make some changes in their approach to Amazon. Many of these Elastic alternatives use AWS for certain functions. What happens if the pricing structure, the legalese, or the access to certain AWS services “evolve”? What will start ups and Amazon partners do if access to search functions becomes free or requires contributions to the AWS version of open source?

Worth watching, right? The answer is, “Nah, you are way off base.” Yep, just as I was in my analysis of Google for BearStearns many years ago. I have a track record of getting thrown out as I head for second base.

Stephen E Arnold, April 13, 2021

Microsoft Exchange After Action Action: Adulting or Covering Up?

March 12, 2021

I read “Researcher Publishes Code to Exploit Microsoft Exchange Vulnerabilities on GitHub.” The allegedly accurate “real” news report states:

On Wednesday, independent security researcher Nguyen Jang published on GitHub a proof-of-concept tool to hack Microsoft Exchange servers that combined two of those vulnerabilities. Essentially, he published code that could be used to hack Microsoft customers, exploiting a bug used by Chinese government hackers—on an open-source platform owned by Microsoft.

What happened?

Microsoft took down the hacking tool. “GitHub took down it,” the researcher told Motherboard in an email. “They just send [sic] me an email.” On Thursday, a GitHub spokesperson confirmed to Motherboard that the company removed the code due to the potential damage it could cause.

Interesting.

Two questions crossed my mind:

  1. Is Microsoft showing more management responsibility with regard to the data posted on GitHub? Editorial control is often useful, particularly when the outputting mechanism provides a wealth of information and code. Some of these items can be used to create issues. Microsoft purchased GitHub and may now be forced to take a more adult view of the service.
  2. Is Microsoft covering up the flaws in its core processes? After reading Microsoft’s explanations of the SolarWinds misstep, the injection of marketing spin and intriguing rhetoric about responsibility open the door to a bit of Home Depoting; that is, paint, wood panel, and a bit of carpet make an ageing condo look better.

Worth watching both the breaches which are concerning and the GitHub service which can cause some individuals’ brows to furrow.

Stephen E Arnold, March 12, 2021

Elastic and Its Approach to Its Search Business

February 16, 2021

This blog post is about Elastic, the Shay Banon information retrieval company, not Amazon AWS Elastic services. Confused yet? The confusion will only increase over time because the “name” Elastic is going to be difficult to keep intact due to Amazon’s ability to erode brand names.

But that’s just one challenge facing the Elastic search company founded by the magic behind Compass Search. An excellent analysis of Elastic search’s challenges appears in “Elastic Has Stretched the Patience of Many in Open Source. But Is There Room for a Third Way?”

The write up quotes an open source expert as saying:

Let’s be really clear – it’s a move from open to proprietary as a consequence of a failed business model decision…. Elastic should have thought their revenue model through up front. By the time the team made the decision to open source their code, the platform economy existed and their decisions to open source ought to have been aligned to an appropriate business model.

I circled this statement in the article:

Sympathy for Elastic’s position comes from a perhaps unexpected source. Matt Asay, principal at Elastic’s bête noire AWS, believes it’s time to revisit the idea of “shared source”, a licensing scheme originally dreamed up by Microsoft two decades ago as an answer to the then-novel open source concept. In shared source, code is open – as in visible – but its uses are restricted… The heart of the problem is about who gets to profit from open source software. To help resolve that problem, we just might need new licensing.

Information retrieval is not about precision and recall, providing answers to users, or removing confusion about terms and product names — search is about money. Making big bucks from a utility service continues to lure some and smack down others. Now it is time to be squishy and bouncy I suppose.

Stephen E Arnold, February 16, 2021
