TikTok: Allegations of Data Sharing with China! Why?

June 21, 2022

If one takes a long view about an operation, some planners find information about the behavior of children or older, yet immature, creatures potentially useful. What if a teenager puts up a TikTok video presenting allegedly “real” illegal actions? Might that teen in three or four years be a target for soft persuasion? Leaking the video to an employer? No, of course not. Who would take such an action?

I read “Leaked Audio from 80 Internal TikTok Meetings Shows That US User Data Has Been Repeatedly Accessed from China.” Let’s assume that this allegation has a tiny shred of credibility. The financially-challenged Buzzfeed might be angling for clicks. Nevertheless, I noted this passage:

…according to leaked audio from more than 80 internal TikTok meetings, China-based employees of ByteDance have repeatedly accessed nonpublic data about US TikTok users…

Is the audio deeply faked? Could the audio be edited by a budding sound engineer?


And what’s with the TikTok “connection” to Oracle? Probably just a coincidence like one of Oracle’s investment units participating in Board meetings for Voyager Labs. A China-linked firm was on the Board for a while. No big deal. Voyager Labs? What does that outfit do? Perhaps it is the Manchester Square office and the delightful restaurants close at hand?

The write up refers to data brokers too. That’s interesting. If a nation state wants app generated data, why not license it? No one pays much attention to “marketing services” which acquire and normalize user data, right?

Buzzfeed tried to reach a wizard at Booz, Allen. That did not work out. Why not drive to Tyson’s Corner and hang out in the Ritz Carlton at lunch time? Get a Booz, Allen expert in the wild.

Yep, China. No problem. Take a longer-term view for creating something interesting like an insider who provides a user name and password. Happens every day and will into the future. Plan ahead I assume.

Real news? Good question.

Stephen E Arnold, June 21, 2022

Near: A Complement to ClearView AI?

May 26, 2022

“Data Intelligence Startup Near, with 1.6B anonymized User IDs, Lists on NASDAQ via SPAC at a $1B Market Cap; Raises $100M” is an interesting story. On one hand, in the midst of some financial headwinds, the outfit Near is a unicorn. That’s exciting for some. The most significant part of the short item is this passage: Near offers

anonymised, location-based profiles of users based on a trove of information that Near sources and then merges from phones, data partners, carriers and its customers. It claims the database has been built “with privacy by design.”

The word merging as in “merging data from different sources” is not jargony enough. The Near write up uses the term “stitching” as in “threads which hold the parts of a football together.” I prefer the term “federating” as in “federating data.”

The idea is a good one. Take information from different sources, index it (assign tags today, of course) and group information about a person under that entity’s “name.” This is a useful workflow, and my hunch is that the system works best for individuals leaving digital footprints and crumbs of ones and zeros behind as these “entities” go about their business.
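The workflow in the paragraph above can be sketched in a few lines. This is only an illustration under stated assumptions: the records, the field names (`source`, `entity_id`, `tag`, `value`), and the `federate` helper are hypothetical and do not reflect Near’s actual data or methods.

```python
from collections import defaultdict

# Hypothetical records from three made-up sources (not Near's schema).
RECORDS = [
    {"source": "phone_app", "entity_id": "u-001", "tag": "location", "value": "Louisville"},
    {"source": "carrier", "entity_id": "u-001", "tag": "device", "value": "Android"},
    {"source": "data_partner", "entity_id": "u-002", "tag": "location", "value": "Chicago"},
    {"source": "data_partner", "entity_id": "u-001", "tag": "interest", "value": "fishing"},
]


def federate(records):
    """Group records from different sources under a shared entity key."""
    profiles = defaultdict(lambda: defaultdict(list))
    for record in records:
        # Index each value under its tag and keep the source for provenance.
        profiles[record["entity_id"]][record["tag"]].append(
            (record["value"], record["source"])
        )
    # Convert the nested defaultdicts to plain dicts for predictable output.
    return {entity: dict(tags) for entity, tags in profiles.items()}


profiles = federate(RECORDS)
```

The sketch assumes every source already shares one entity key. In practice, matching identifiers across sources is the hard part, and the sketch does not attempt it.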

The successful merging and profiling will give Near a competitive advantage. Like ClearView and many other companies, scraping and licensing commercial datasets can produce a valuable data asset.

On the downside, as ClearView has learned as it explained its business to legal eagles, some concerns for privacy can arise. Assurances of privacy have created some issues for firms performing similar work for government agencies. Law enforcement and intelligence professionals are likely to show some interest in Near’s products and services.

Successfully navigating marketing to commercial outfits and selling to government agencies is like sailing into an unfamiliar port with a very large boat.

Kudos to Near for its funding. Now it will be interesting to watch the firm’s management walk the marketing tightrope over the Niagara Falls of cash flow as legal eagles circle.

Stephen E Arnold, May 26, 2022

Synthetic Data: Cheap, Like Fast Food

May 25, 2022

Fabricated data may well solve some of the privacy issues around healthcare-related machine learning, but what new problems might it create? The Wall Street Journal examines the technology in, “Anthem Looks to Fuel AI Efforts with Petabytes of Synthetic Data.” Reporter Isabelle Bousquette informs us Anthem CIO Anil Bhatt has teamed up with Google Cloud to build the synthetic data platform. Interesting choice, considering the health insurance company has been using AWS since 2017.

The article points out synthetic data can refer to either anonymized personal information or entirely fabricated data. Anthem’s effort involves the second type. Bousquette cites Bhatt as well as AI and automation expert Ritu Jyoti as she writes:

“Anthem said the synthetic data will be used to validate and train AI algorithms that identify things like fraudulent claims or abnormalities in a person’s health records, and those AI algorithms will then be able to run on real-world member data. Anthem already uses AI algorithms to search for fraud and abuse in insurance claims, but the new synthetic data platform will allow it to scale. Personalizing care for members and running AI algorithms that identify when they may require medical intervention is a more long-term goal, said Mr. Bhatt. In addition to alleviating privacy concerns, Ms. Jyoti said another advantage of synthetic data is that it can reduce biases that exist in real-world data sets. That said, she added, you can also end up with data sets that are worse than real-world ones. ‘The variation of the data is going to be very, very important,’ said Mr. Bhatt, adding that he believes the variation in the synthetic data will ultimately be better than the company’s real-world data sets.”

The article notes the use of synthetic data is on the rise. Increasing privacy and reducing bias both sound great, but that bit about potentially worse data sets is concerning. Bhatt’s assurance is pleasant enough, but how will we know whether his confidence pans out? Big corporations are not exactly known for their transparency.

Cynthia Murrell, May 25, 2022

Data Federation? Loser. Go with a Data Lake House

February 8, 2022

I have been seeing the phrase “data lake house” or “datalake house.” I noted some bold claims about a new data lake house approach in “Managed Data Lakehouse Startup Onehouse Launches with $8M in Funding.” The write up states:

One of the flagship features of Onehouse’s lakehouse service is a technology called incremental processing. It allows companies to start analyzing their data soon after it’s generated, which is difficult when using traditional technologies.

The write up adds:

The company’s lakehouse service automatically optimizes customers’ data ingestion workflows to improve performance, the startup says. Because the service is delivered via the cloud on a fully managed basis, customers don’t have to manage the underlying infrastructure.

The idea of course is that traditional methods of handling data are [a] slow, [b] expensive, and [c] difficult to implement.

The premise is that the data lake house delivers more efficient use of data and a way to “future proof the data architected for machine learning / data science down the line.”

When I read this I thought of Vivisimo’s explanation of its federating method. IBM bought Vivisimo, and I assume that it is one of the ingredients in IBM’s secret big data sauce. MarkLogic also suggested in one presentation I sat through that its system would ingest data and the MarkLogic system (once eyed by the Google as a possible acquisition) would allow near real time access to the data. One person in the audience was affiliated with the US Library of Congress, and that individual seemed quite enthused about MarkLogic. And there are companies which facilitate data manipulation; for example, Kofax and its data connectors.

From my point of view, the challenge is that today large volumes of data are available. These data have to be moved from point A to point B. Ideally data do not require transformation. At some point in the flow, data in motion can be processed. There are firms which offer real time or near real time data analytics; for example, Trendalyze.com.

Conversion, moving, saving, and then doing something “more” with the data remain challenges. Maybe Onehouse has the answer?

Stephen E Arnold, February 8, 2022

Coalesce: Tackling the Bottleneck Few Talk About

February 1, 2022

Coalesce went stealth, the fancier and more modern techno slang for “going dark,” to work on projects in secret. The company has returned to the light, says Crowd Fund Insider, with a robust business plan and product, plus loads of funding: “Coalesce Debuts From Stealth, Attracts $5.92M For Analytics Platform.”

Coalesce is run by a former Oracle employee and it develops products and services similar to Oracle, but with a MarkLogic spin. That is one way to interpret how Coalesce announced its big return with its Coalesce Data Transformation platform that offers modeling, cleansing, governance, and documentation of data with analytical efficiency and flexibility. Do not forget that 11.2 Capital and GreatPoint Ventures raised $5.92 million in seed funding for the new data platform. Coalesce plans to use the funding for engineering functions, developing marketing strategy, and expanding sales.

Coalesce noticed that there is a weak link between organizations’ cloud analytics and actively making use of data:

“ ‘The largest bottleneck in the data analytics supply chain today is transformations. As more companies move to the cloud, the weaknesses in their data transformation layer are becoming apparent,’ said Armon Petrossian, the co-founder and CEO of Coalesce. ‘Data teams are struggling to keep up with the demands from the business, and this problem has only continued to grow with the volumes and complexity of data combined with the shortage of skilled people. We are on a mission to radically improve the analytics landscape by making enterprise-scale data transformations as efficient and flexible as possible.’”

Coalesce might be duplicating Oracle and MarkLogic, but if they have discovered a niche market in cloud analytics then they are about to rocket from their stealth. Hopefully the company will solve the transformation problem instead of issuing marketing statements as many other firms do.

Whitney Grace, February 1, 2022

Fuzzifying Data: Yeah, Sure

January 19, 2022

Data are often alleged to be anonymous, but they may not be. Expert companies such as LexisNexis, Acxiom, and mobile phone providers argue that as long as personal identifiers, including names, addresses, etc., are removed from data it is rendered harmless. Unfortunately data can be de-anonymized without too much trouble. Wired posted Justin Sherman’s article, “Big Data May Not Know Your Name. But It Knows Everything Else.”

Despite humans having similar habits, there is some truth in the phrase “everyone is unique.” With a few white hat or black hat tactics, user data can be traced back to the originator. Data are individualized by a user’s unique identity, and there are also minute ways to gather personal information, ranging from Internet search history to GPS logs and IP addresses. Companies that want to sell you goods and services purchase the data, and so do governments and law enforcement agencies.
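One way to see why removing names does not make data harmless is to count how many rows share the same combination of leftover attributes. The sketch below is a simple uniqueness check, not a method from the Wired article. The rows, the choice of `zip`, `birth_year`, and `device` as quasi-identifiers, and the `unique_rows` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical "anonymized" rows: names removed, quasi-identifiers kept.
ROWS = [
    {"zip": "40202", "birth_year": 1984, "device": "iPhone"},
    {"zip": "40202", "birth_year": 1984, "device": "iPhone"},
    {"zip": "40205", "birth_year": 1991, "device": "Android"},
    {"zip": "40207", "birth_year": 1975, "device": "iPhone"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "device")


def unique_rows(rows, keys=QUASI_IDENTIFIERS):
    """Return rows whose combination of quasi-identifiers appears only once."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return [row for row in rows if counts[tuple(row[k] for k in keys)] == 1]


# Rows that stand alone are the ones most exposed to being traced back.
singled_out = unique_rows(ROWS)
```

In this toy set, two of the four rows are one of a kind even though no name appears anywhere. Anyone holding a second dataset with the same three attributes could, in principle, match them.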

There are stringent privacy regulations in place, but in the face of the almighty dollar and governments bypassing their own laws, it is like spitting in the wind. The scariest fact is that nothing is secret anymore:

“The irony that data brokers claim that their “anonymized” data is risk-free is absurd: Their entire business model and marketing pitch rests on the premise that they can intimately and highly selectively track, understand, and micro target individual people.

This argument isn’t just flawed; it’s also a distraction. Not only do these companies usually know your name anyway, but data simply does not need to have a name or social security number attached to cause harm. Predatory loan companies and health insurance providers can buy access to advertising networks and exploit vulnerable populations without first needing those people’s names. Foreign governments can run disinformation and propaganda campaigns on social media platforms, leveraging those companies’ intimate data on their users, without needing to see who those individuals are.”

Companies and organizations need to regulate themselves, while governments need to pass laws that protect their citizens from bad actors. Self-regulation in the face of dollar signs is like asking a person with a sweet tooth to stop eating sugar. However, if governments concentrated on regulating specific types of data, data collection, and data sharing rather than on a blanket solution, they could protect users.

Let’s think about the implications. No, let’s not.

Whitney Grace, January 19, 2022

What Is Better Than One Logic? Two Logics?

December 22, 2021

Search, database, intelligence, data management and analytics firm MarkLogic continues to evolve and grow. Business Wire reveals, “MarkLogic Acquires Leading Metadata Management Provider Smartlogic.” Good choice—we have found Smartlogic to be innovative, reliable, and responsive. We expect MarkLogic will be able to preserve these characteristics, considering Smartlogic’s top brass will be sticking around. The press release tells us:

“As part of the transaction, Smartlogic’s founder and Chief Executive Officer, Jeremy Bentley, as well as other members of the senior management team, will join the MarkLogic executive team. Financial terms of the transaction were not disclosed. Founded in 2006, Smartlogic has deciphered, filtered, and connected data for many of the world’s largest organizations to help solve their complex data problems. Global organizations in the energy, healthcare, life sciences, financial services, government and intelligence, media and publishing, and high-tech manufacturing industries rely on Smartlogic’s metadata and AI platform every day to enrich enterprise information with context and meaning, as well as extract critical facts, entities, and relationships to power their businesses. For the past four years, Smartlogic has been recognized as a leader by Gartner’s Magic Quadrant for Metadata Management Solutions and by Info-Tech as the preeminent leader of the Data Quadrant for Metadata Management (May 2021).”

Based in San Carlos, California, MarkLogic was founded in 2001 and gained steam in 2012 when it picked up former Oracle database division leader Gary Bloom. Smartlogic is headquartered in San Jose, less than 30 miles away. Perhaps MarkLogic’s XML with taxonomy management will triumph in more markets and bring the Oracle outfit to its knees? Perhaps index term management is the killer app?

Cynthia Murrell, December 22, 2021

What Google Knows about the Honest You

December 10, 2021

I read this quote in a Kleenex story about Google’s lists of popular searches:

“You’re never as honest as you are with your search engine. You get a sense of what people genuinely care about and genuinely want to know — and not just how they’re presenting themselves to the rest of the world.”

The alleged Googler crafting this statement is a data editor. You can read more about the highly selective and unverified Google search trends in “What Google’s Trending Searches Say about America in 2021.”

For me, the statement allows several observations:

  1. A person acting in an unguarded way reveals information not usually disseminated in “guarded” settings; for example, a job interview
  2. The word “honest” implies an unvarnished look at the psycho-social factors within a single person
  3. A collection of data points about the psycho-social aspects of a single person makes it possible to tag, classify, and relate that individual to others. Numerical procedures allow a person or system with access to those data to predict certain behaviors, predispositions, or actions.

Thus, the collection of searches, clicks, and items created by an individual using Google services such as Gmail and YouTube creates a palette of color from which a data maestro can paint a picture.

Predestination has never been easier, more automatable, or cheaper to convert into an actionable knowledgebase for smart software. Yep, just simple queries. Useful indeed.

Stephen E Arnold, December 10, 2021

Microsoft: Amazing Quote about Support

August 12, 2021

I read “El Reg talks to Azure Data veep as Microsoft flicks the switch on Azure Arc for SQL Managed Instances: Longevity, PostgreSQL, and the Default Relational Database of Choice.” I like the phrase “default relational database of choice.” Okay, confidence can be a positive.

Most of the interview is not-so-surprising stuff: End-of-life assurances, hints of a catholic approach to the Codd structure, and a general indifference to the Amazon database initiatives. That’s okay. The expert is Rohan Kumar, who is going to speak Redmond, a peculiar dialect of jargon which often reveals little relevant to the ordinary person trying to restore a trashed SQL Server table.

I did spot one tiny comment. Here is this remarkable assertion:

“We will never let any of our customers run into challenges because Microsoft decided, ‘hey, we’re not going to support you’.”

No kidding? For real? I mean none of the code blocking, security challenging stuff?

Stephen E Arnold, August 12, 2021

Elasticsearch Versus RocksDB: The Old Real Time Razzle Dazzle

July 22, 2021

Something happens. The “event” is captured and written to the file. Even if you are watching the “something” happening, there is latency between the event and the sensor or the human perceiving the event. The calculus of real time is mostly avoiding too much talk about latency. But real time is hot because who wants to look at old data? Not TikTok fans and not the money-fueled lovers of Robinhood.

“Rockset CEO on Mission to Bring Real-Time Analytics to the Stack” uses lots of buzzwords, sidesteps inherent latency, and avoids commentary on other allegedly real-time analytics systems. Rockset is built on RocksDB, which is open source software. Nevertheless, there is some interesting information about Elasticsearch; for example:

  • Unsupported factoids like: “Every enterprise is now generating more data than what Google had to index in [year] 2000.”
  • No definition or baseline for “simple”: “The combination of the converged index along with the distributed SQL engine is what allows Rockset to be fast, scalable, and quite simple to operate.”
  • Different from Elasticsearch and RocksDB: “So the biggest difference between Elastic and RocksDB comes from the fact that we support full-featured SQL including JOINs, GROUP BY, ORDER BY, window functions, and everything you might expect from a SQL database. Rockset can do this. Elasticsearch cannot.”
  • Similarities with Rockset: “So Lucene and Elasticsearch have a few things in common with Rockset, such as the idea to use indexes for efficient data retrieval.”
  • Jargon and unique selling proposition: “We use converged indexes, which deliver both what you might get from a database index and also what you might get from an inverted search index in the same data structure. Lucene gives you half of what a converged index would give you. A data warehouse or columnar database will give you the other half. Converged indexes are a very efficient way to build both.”
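To make the “converged index” jargon a little more concrete, here is a toy sketch that keeps an inverted lookup and a column store side by side and updates both on every write. It is an illustration of the general idea in the quote only, not Rockset’s design. The class name `ToyConvergedIndex`, its methods, and the sample documents are hypothetical.

```python
from collections import defaultdict


class ToyConvergedIndex:
    """Illustrative only: one object holding an inverted index and a column store."""

    def __init__(self):
        self.inverted = defaultdict(set)   # (field, value) -> doc ids, for lookups
        self.columns = defaultdict(dict)   # field -> {doc id: value}, for scans

    def add(self, doc_id, doc):
        # Every write updates both views so they never drift apart.
        for field, value in doc.items():
            self.inverted[(field, value)].add(doc_id)
            self.columns[field][doc_id] = value

    def find(self, field, value):
        """Search-style lookup: which documents have field == value?"""
        return sorted(self.inverted.get((field, value), set()))

    def total(self, field):
        """Warehouse-style scan: sum one numeric column across all documents."""
        return sum(self.columns[field].values())


index = ToyConvergedIndex()
index.add(1, {"ticker": "HOOD", "shares": 10})
index.add(2, {"ticker": "ORCL", "shares": 5})
index.add(3, {"ticker": "HOOD", "shares": 7})
```

Here `index.find("ticker", "HOOD")` answers the search-engine style question and `index.total("shares")` answers the columnar one. The cost is that each write touches two structures, which is one reason “simple” deserves a baseline.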

Amazon has rolled out its real time system, and there are a number of options available from vendors like Trendalyze.

Each of these vendors emphasizes real time. The problem, however, is that latency exists regardless of system. Each has use cases which make their system seem to be the solution to real time data analysis. That’s what makes horse races interesting. These unfold in real time if one is at the track. Fractional delays have big consequences for those betting their solution is the least latent.
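The latency point can be put in numbers. The sketch below assumes each record carries two timestamps: when the event happened and when a system made it available for queries. The field names `event_at` and `queryable_at`, the sample values, and both helper functions are hypothetical, not taken from any vendor.

```python
from datetime import datetime, timedelta

# Hypothetical timestamps: when each event happened and when it became queryable.
EVENTS = [
    {"event_at": datetime(2021, 7, 22, 9, 30, 0),
     "queryable_at": datetime(2021, 7, 22, 9, 30, 0, 250000)},
    {"event_at": datetime(2021, 7, 22, 9, 30, 0),
     "queryable_at": datetime(2021, 7, 22, 9, 30, 1, 200000)},
    {"event_at": datetime(2021, 7, 22, 9, 30, 2),
     "queryable_at": datetime(2021, 7, 22, 9, 30, 2, 40000)},
]


def latencies(events):
    """Gap between an event occurring and the system being able to answer about it."""
    return [e["queryable_at"] - e["event_at"] for e in events]


def worst_latency(events):
    """The slowest case is the one that matters to someone placing a bet."""
    return max(latencies(events))


slowest = worst_latency(EVENTS)
```

Even in this made-up sample no gap is zero, and the worst case is 1.2 seconds. “Real time” is a matter of how small the gap is, not whether one exists.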

Stephen E Arnold, July 22, 2021
