Data Federation: K2View Seizes Lance, Mounts Horse, and Sallies Forth

August 13, 2020

DarkCyber noted “K2View Raises $28 million to Automate Enterprise Data Unification.”

Here’s the write up’s explanation of the K2View technology:

K2View’s “micro-database” Fabric technology connects virtually to sources (e.g., internet of things devices, big data warehouses and data lakes, web services, and cloud apps) to organize data around segments like customers, stores, transactions, and products while storing it in secure servers and exposing it to devices, apps, and services. A graphical interface and auto-discovery feature facilitate the creation of two-way connections between app data sources and databases via microservices, or loosely coupled software systems. K2View says it leverages in-memory technology to perform transformations and continually keep target databases up to date.
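
The entity-centric organization described above can be sketched in a few lines of plain Python: records from heterogeneous sources are pivoted into one small record set per business entity. The entity key, field names, and grouping logic here are illustrative stand-ins, not K2View’s actual Fabric API.

```python
# Illustrative sketch of a "micro-database": group rows from several source
# systems under one business-entity key (here, a hypothetical customer_id).
from collections import defaultdict

def build_micro_databases(sources):
    """sources: mapping of source name -> list of record dicts."""
    micro_dbs = defaultdict(list)
    for source_name, records in sources.items():
        for record in records:
            # Tag each record with its origin so lineage survives federation.
            micro_dbs[record["customer_id"]].append({"source": source_name, **record})
    return dict(micro_dbs)

crm = [{"customer_id": "c1", "name": "Acme"}]
orders = [{"customer_id": "c1", "total": 99.0}, {"customer_id": "c2", "total": 5.0}]
dbs = build_micro_databases({"crm": crm, "orders": orders})
print(sorted(dbs))      # ['c1', 'c2']
print(len(dbs["c1"]))   # 2: CRM row and order row federated for customer c1
```

The open question from the list below still applies: what happens when an incoming record lacks the key the grouping depends on?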

The write up contains a block diagram. Looking at it, three questions come to mind:

  1. It is difficult to determine how much manual (human) work will be required to deal with content objects not recognized by the K2View system
  2. What happens if the Internet connection to a data source goes down?
  3. What is the fall back when a microservice is not available or removed from service?

Many organizations offer solutions for disparate types of data scattered across many systems. Perhaps K2View will slay the digital windmills of silos, incompatible data types, and unstable connections? Silos have been part of the data landscape for as long as Don Quixote has been tilting at windmills.

Stephen E Arnold, August 13, 2020

TileDB Developing a Solution to Database Headaches

July 27, 2020

Developers at TileDB are working on a solution to the many problems traditional and NoSQL databases create, and now they have secured more funding to help them complete their platform. The company’s blog reports, “TileDB Closes $15M Series A for Industry’s First Universal Data Engine.” The funding round is led by Two Bear Capital, whose managing partner will be joining TileDB’s board of directors. The company’s CEO, Stavros Papadopoulos, writes:

“The Series A financing comes after TileDB was chosen by customers who experienced two key pains: scalability for complex data and deployment. Whole-genome population data, single-cell gene data, spatio-temporal satellite imagery, and asset-trading data all share multi-dimensional structures that are poorly handled by monolithic databases, tables, and legacy file formats. Newer computational frameworks evolved to offer ‘pluggable storage’ but that forces another part of the stack to deal with data management. As a result, organizations waste resources on managing a sea of files and optimizing storage performance, tasks traditionally done by the database. Moreover, developers and data scientists are spending excessive time in data engineering and deployment, instead of actual analysis and collaboration. …

“We invented a database that focuses on universal storage and data management rather than the compute layer, which we’ve instead made ‘pluggable.’ We cleared the path for analytics professionals and data scientists by taking over the messiest parts of data management, such as optimized storage for all data types on numerous backends, data versioning, metadata, access control within or outside organizational boundaries, and logging.”
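
The “multi-dimensional structures” pain point is easier to see with a toy example: cells addressed by coordinates rather than rows in a table. The sketch below is a plain-Python illustration of coordinate-addressed sparse storage, not TileDB’s actual engine or API.

```python
# A toy coordinate-addressed sparse array: the kind of structure (genomic
# positions, spatio-temporal imagery) that fits tables poorly.
class SparseArray:
    def __init__(self, ndim):
        self.ndim = ndim
        self.cells = {}  # coordinate tuple -> value

    def write(self, coords, value):
        assert len(coords) == self.ndim
        self.cells[coords] = value

    def slice(self, dim, lo, hi):
        """Return all cells whose coordinate on `dim` falls in [lo, hi)."""
        return {c: v for c, v in self.cells.items() if lo <= c[dim] < hi}

# e.g. (sample, gene position) for population genomics
a = SparseArray(ndim=2)
a.write((0, 1000), "A")
a.write((0, 2000), "G")
a.write((1, 1000), "T")
print(len(a.slice(1, 0, 1500)))  # 2 cells in position range [0, 1500)
```

A “pluggable” compute layer would then run analytics over slices like these without caring how the cells are laid out on disk.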

So with this tool, developers will be freed from tedious manual steps, leaving more time to innovate and draw conclusions from their complex data. TileDB has also developed APIs to facilitate integration with tools like Spark, Dask, MariaDB, and PrestoDB, while TileDB Cloud enables easy, secure sharing and scalability. See the write-up for praise from excited customers-to-be, or check out the company’s website. Readers can also access the open-source TileDB Embedded storage engine on GitHub. Founded in 2017, TileDB is based in Cambridge, Massachusetts.

Cynthia Murrell, July 27, 2020

Fragmented Data: Still a Problem?

January 28, 2019

Digital transitions are a major shift for organizations. The shift includes new technology and better ways to serve clients, but it also includes massive amounts of data. All organizations with a successful digital implementation rely on data. Too much data, however, can hinder organizations’ performance. IT Pro Portal explains how something called mass data fragmentation is becoming a major issue in the article, “What Is Mass Data Fragmentation, And What Are IT Leaders So Worried About It?”

The biggest question is: what exactly is mass data fragmentation? I learned:

“We believe one of the major culprits is a phenomenon called mass data fragmentation. This is essentially just a technical way of saying, ’data that is siloed, scattered and copied all over the place’ leading to an incomplete view of the data and an inability to extract real value from it. Most of the data in question is what’s called secondary data: data sets used for backups, archives, object stores, file shares, test and development, and analytics. Secondary data makes up the vast majority of an organization’s data (approximately 80 per cent).”

The article compares secondary data to an iceberg: most of it is hidden beneath the surface. The poor visibility leads to compliance and vulnerability risks; in other words, security issues that put the entire organization at risk. Most organizations, however, view their secondary data as a storage bill, a compliance risk (at least that is good), and a giant headache.

When organizations were surveyed about the amount of secondary data they have, it was discovered that they had multiple copies of the same data spread over cloud and on-premise locations. IT teams are expected to manage the secondary data across all the locations, but without the right tools and technology the task is unending, unmanageable, and the root of more problems.
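
Locating those redundant copies is, at bottom, a content-hashing exercise: hash file contents and group identical digests. A minimal sketch follows; the paths are hypothetical, and real tooling must also stream large files, respect permissions, and handle near-duplicates.

```python
# Group files by SHA-256 digest; identical content collides on the same key.
import hashlib
from collections import defaultdict

def find_duplicates(files):
    """files: mapping of path -> bytes. Returns digest -> paths (copies only)."""
    by_digest = defaultdict(list)
    for path, data in files.items():
        by_digest[hashlib.sha256(data).hexdigest()].append(path)
    return {d: paths for d, paths in by_digest.items() if len(paths) > 1}

files = {
    "onprem/backup/report.doc": b"q3 figures",
    "cloud/archive/report-copy.doc": b"q3 figures",
    "cloud/test/fixture.doc": b"test data",
}
dupes = find_duplicates(files)
print(len(dupes))  # 1: one group of identical copies across locations
```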

If organizations managed their mass data fragmentation efficiently, they would improve their bottom line, reduce costs, and reduce security risks. When access points to sensitive data multiply and are not secured, the risk of hacking and stolen information increases.

Whitney Grace, January 28, 2019

Data in One or Two Places: What Could Go Wrong?

July 11, 2018

Silos of data have become the term du jour for many folks thinking about search and machine learning. By piling all that info into one convenient place, we can accomplish amazing feats. However, as those silos get larger and start gobbling up smaller silos, what are we left with? This was a concern brought up in a recent Tech Dirt think piece, “The Death of Google Reader and the Rise of Silos.”

According to the story:

“Many people have pointed to the death of Google Reader as a point at which news reading online shifted from things like RSS feeds to proprietary platforms like Facebook and Twitter. It might seem odd (or ironic) to bemoan a move by one of the companies now considered one of the major silos for killing off a product…”

While this piece holds a pseudo-funeral for Google Reader and, somewhat poignantly, points out that this is the downfall of the Internet, it overlooks the value of silos. Maybe it’s not all so bad?

That’s what one commentator for the Daily Journal pointed out, remarking on the amount of innovation that has come about as a result of these mega silos. Clearly, there’s no perfect balance, and we suspect your opinion on silos depends on what industry you are in.

Federation of information seems like a good idea. But perhaps it is even better when federation occurs in two or three online structures? If data are online, those data are accurate. That’s one view.

Patrick Roland, July 11, 2018

Antidot: Fluid Topics

June 5, 2017

I find French innovators creative. Over the years I have found the visualizations of DATOPS, the architecture of Exalead, the wonkiness of Kartoo, the intriguing Semio, and the numerous attempts to federate data and workflow like digital librarians and subject matter experts. The Descartes- and Fermat-inspired engineers created software and systems which try to trim the pointy end off many information thorns.

I read “Antidot Enables ‘Interactive’ Tech Docs That Are Easier To Publish, More Relevant To Users – and Actually Get Read.” Antidot, for those not familiar with the company, was founded in 1999. Today the company bills itself as a specialist in semantic search and content classification. The search system is named Taruqa, and the classification component is called “Classifier.”

The Fluid Topics product combines a number of content processing functions in a workflow designed to provide authorized users with the right information at the right time.

According to the write up:

Antidot has updated its document delivery platform with new features aimed at making it easier to create user-friendly interactive docs.  Docs are created and consumed thanks to a combination of semantic search, content enrichment, automatic content tagging and more.

The phrase “content enrichment” suggests to me that multiple indexing and metadata identification subroutines crunch on text. The idea is that a query can be expanded, tap into entity extraction, and make use of text analytics to identify documents which keyword matching would overlook.
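
The expansion idea can be sketched simply. In the toy example below, a tiny synonym table stands in for the enrichment and entity-extraction subroutines the write up alludes to; Antidot’s actual machinery is surely more elaborate.

```python
# Query expansion: broaden the user's terms before matching, so a document
# that never contains the literal keyword can still be found.
SYNONYMS = {"car": {"automobile", "vehicle"}}  # stand-in for real enrichment

def expand(query_terms):
    expanded = set(query_terms)
    for term in query_terms:
        expanded |= SYNONYMS.get(term, set())
    return expanded

def match(doc_text, query_terms):
    words = set(doc_text.lower().split())
    return bool(expand(query_terms) & words)

print(match("the automobile stalled", ["car"]))  # True: matched via synonym
```

Keyword matching alone would have missed this document, which is exactly the gap expansion is meant to close.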

The Fluid Topic angle is that documentation and other types of enterprise information can be indexed and matched to a user’s profile or to a user’s query. The result is that the needed document is findable.

The slicing and dicing of processed content makes it possible for the system to assemble snippets or complete documents into an “interactive document.” The idea is that most workers today are not too thrilled to get a results list and the job of opening, scanning, extracting, and closing links. The Easter egg hunt approach to finding business information is less entertaining than looking at Snapchat images or checking what’s new with pals on Facebook.

The write up states:

Users can read, search, navigate, annotate, create alerts, send feedback to writers, with a rich and intuitive user experience.

I noted this list of benefits from the Fluid Topics approach:

  • Quick, easy access to the right information at the right time, making searching for technical product knowledge really efficient.
  • Combine and transform technical content into relevant, useful information by slicing and dicing data from virtually any source to create a unified knowledge hub.
  • Freedom for any user to tailor documentation and provide useful feedback to writers.
  • Knowledge of how documentation is actually used.

Applications include:

  • Casual publishing, which means a user can create a “personal” book of content and share it.
  • Content organization, which organizes the often chaotic and scattered source information.
  • Markdown, which means formatting information in a consistent way.

Fluid Topics is a hybrid which combines automatic indexing and metadata extraction, search, and publishing.

More information about Fluid Topics is available at a separate Antidot Web site called “Fluid Topics.” The company provides a video which explains how you can transform your world when you tackle search, customer support, and content federation and repurposing. Fluid Topics also performs text analytics for the “age of limitless technical content delivery.”

Hewlett Packard invested significantly in workflow based content management technology. MarkLogic’s XML data management system can be tweaked to perform similar functions. Dozens of other companies offer content workflow solutions. The sector is active, but sales cycles are lengthy. Crafty millennials can make Slack perform some content tricks as well. Those on a tight budget might find that Google’s hit and miss services are good enough for many content operations. For those in love with SharePoint, even that remarkable collection of fragmented services, APIs, and software can deliver good enough solutions.

I think it is worth watching how Antidot’s Fluid Topics performs in what strikes me as a crowded, volatile market for content federation and information workflow.

Stephen E Arnold, June 5, 2017

Pitching All Source Analysis: Just Do Dark Data. Really?

November 25, 2016

I read “Shedding Light on Dark Data: How to Get Started.” Okay, Dark Data. Like Big Data, the phrase is the fruit of the nomads at Gartner Group. The outfit embracing this sort of old concept is OdinText. Spoiler: I thought the write up was going to identify outfits like BAE Systems, Centrifuge Systems, IBM Analyst’s Notebook, Palantir Technologies, and Recorded Future (an In-Q-Tel and Google backed outfit). Was I wrong? Yes.

The write up explains that a company has to tackle a range of information in order to be aware, informed, or insightful. Pick one. Here’s the list of Dark Data types, which the aforementioned companies have been working to capture, analyze, and make sense of for almost 20 years in the case of NetReveal (Detica) and Analyst’s Notebook. The other companies are comparative spring chickens with an average of seven years’ experience in this effort.

  • Customer relationship management data
  • Data warehouse information
  • Enterprise resource planning information
  • Log files
  • Machine data
  • Mainframe data
  • Semi structured information
  • Social media content
  • Unstructured data
  • Web content.

I think the company or nonprofit which tries to suck in these data types and process them may run into some cost and legal issues. Analyzing tweets and Facebook posts can be useful, but there are costs and license fees involved. Frankly, not even law enforcement and intelligence entities are able to do a crackerjack job with these content streams due to their volume, cryptic nature, and pesky quirks related to metadata tagging. But let’s move on. To this statement:

Phone transcripts, chat logs and email are often dark data that text analytics can help illuminate. Would it be helpful to understand how personnel deal with incoming customer questions? Which of your products are discussed with which of your other products or competitors’ products more often? What problems or opportunities are mentioned in conjunction with them? Are there any patterns over time?

Yep, that will work really well in many legal environments. Phone transcripts are particularly exciting.

How does one think about Dark Data? Easy. Here’s a visualization from the OdinText folks:


Notice that there are data types in this diagram NOT included in the listing above. I can’t figure out if this is just carelessness or an insight which escapes me.

How does one deal with Dark Data? OdinText, of course. Yep, of course. Easy.

Stephen E Arnold, November 25, 2016

Facial Recognition Fraught with Inaccuracies

November 2, 2016

Images of more than 117 million adult Americans are with law enforcement agencies, yet the rate of accurately identifying people is minuscule.

A news report by The Register titled “Meanwhile, in America: Half of adults’ faces are in police databases” says:

One in four American law enforcement agencies across federal, state, and local levels use facial recognition technology, the study estimates. And now some US police departments have begun deploying real-time facial recognition systems.

Though facial recognition software vendors claim accuracy rates anywhere from 60 to 95 percent, statistics tell an entirely different story:

“Of the FBI’s 36,420 searches of state license photo and mug shot databases, only 210 (0.6 per cent) yielded likely candidates for further investigations,” the study says. “Overall, 8,590 (4 per cent) of the FBI’s 214,920 searches yielded likely matches.”
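
The quoted rates check out arithmetically:

```python
# Likely candidates divided by total searches, from the study's figures.
state_rate = 210 / 36_420 * 100        # state photo databases: ~0.6 per cent
overall_rate = 8_590 / 214_920 * 100   # all FBI searches: ~4 per cent
print(round(state_rate, 1), round(overall_rate, 1))
```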

Some of the impediments to accuracy include low light conditions in which the images are captured, low processing power, numerous simultaneous search requests, and slow search algorithms. The report also reveals that human involvement reduces overall accuracy by more than 50 percent.

The report also touches on a very pertinent point: privacy. Police departments and other law enforcement agencies are increasingly deploying real-time facial recognition. Not only is it an invasion of privacy, but the vulnerable networks can also be tapped into by non-state actors. Facial recognition should be used only in cases of serious crimes; using it indiscriminately is an absolute no-no. It can be used in many ways to track people, even though they may not be criminals. Thus, a question remains to be answered: who will watch the watchmen?

Vishal Ingole, November 2, 2016
Sponsored by, publisher of the CyberOSINT monograph

GAO DCGS Letter B-412746

June 1, 2016

A few days ago, I stumbled upon a copy of a letter from the GAO concerning Palantir Technologies dated May 18, 2016. The letter became available to me a few days after the 18th, and the US holiday probably limited circulation of the document. The letter is from the US Government Accountability Office and signed by Susan A. Poling, general counsel. There are eight recipients, some from Palantir, some from the US Army, and two in the GAO.

Has the US Army put Palantir in an untenable spot? Is there a deus ex machina about to resolve the apparent checkmate?

The letter tells Palantir Technologies that its protest of the DCGS Increment 2 award to another contractor is denied. I don’t want to revisit the history or the details as I understand them of the DCGS project. (DCGS, pronounced “dsigs”, is a US government information fusion project associated with the US Army but seemingly applicable to other Department of Defense entities like the Air Force and the Navy.)

The passage in the letter I found interesting was:

While the market research revealed that commercial items were available to meet some of the DCGS-A2 requirements, the agency concluded that there was no commercial solution that could meet all the requirements of DCGS-A2. As the agency explained in its report, the DCGS-A2 contractor will need to do a great deal of development and integration work, which will include importing capabilities from DCGS-A1 and designing mature interfaces for them. Because the agency concluded that significant portions of the anticipated DCSG-A2 scope of work were not available as a commercial product, the agency determined that the DCGS-A2 development effort could not be procured as a commercial product under FAR part 12 procedures. The protester has failed to show that the agency’s determination in this regard was unreasonable.

The “importing” point is a big deal. I find it difficult to imagine that IBM i2 engineers will be eager to permit the Palantir Gotham system to work like one happy family. The importation and manipulation of i2 data in a third party system is more difficult than opening an RTF file in Word in my experience. My recollection is that the unfortunate i2-Palantir legal matter was, in part, related to figuring out how to deal with ANB files. (ANB is i2 shorthand for Analysts Notebook’s file format, a somewhat complex and closely-held construct.)

Net net: Palantir Technologies will not be the dog wagging the tail of IBM i2 and a number of other major US government integrators. The good news is that there will be quite a bit of work available for firms able to support the prime contractors and the vendors eligible and selected to provide for-fee products and services.

Was this a shoot-from-the-hip decision to deny Palantir’s objection to the award? No. I believe the FAR procurement guidelines and the content of the statement of work provided the framework for the decision. However, context is important as are past experiences and perceptions of vendors in the running for substantive US government programs.


Data Fusion: Not Yet, Not Cheap, Not Easy

November 9, 2015

I clipped an item to read on the fabulous flight from America to shouting distance of Antarctica. Yep, it’s getting smaller.

The write up was “So Far, Tepid Responses to Growing Cloud Integration Hairball.” I think the words “hair” and “ball” convinced me to add this gem to my in flight reading list.

The article is based on a survey (nope, I don’t have the utmost confidence in vendor surveys). Apparently the 300 IT “leaders” surveyed experience

pain around application and data integration between on premises and cloud based systems.

I had to take a couple of deep breaths to calm down. I thought the marketing voodoo from vendors embracing utility services (Lexmark/Kapow), metasearch (Vivisimo, et al), unified services (Attivio, Coveo, et al), and licensees of conversion routines from outfits ranging from Oracle to “search consulting” in the “search technology” business had this problem solved.

If the vendors can’t do it, why not just dump everything in a data lake and let an open source software system figure everything out. Failing that, why not convert the data into XML and use the magic of well formed XML objects to deal with these issues?

It seems that the solutions don’t work with the slam dunk regularity of a 23-year-old Michael Jordan.


The write up explains:

The old methods may not cut it when it comes to pulling things together. Two in three respondents, 59%, indicate they are not satisfied with their ability to synch data between cloud and on-premise systems — a clear barrier for businesses that seek to move beyond integration fundamentals like enabling reporting and basic analytics. Still, and quite surprisingly, there isn’t a great deal of support for applying more resources to cloud application integration. Premise-to-cloud integration, cloud-to-cloud integration, and cloud data replication are top priorities for only 16%, 10% and 10% of enterprises, respectively. Instead, IT shops make do with custom coding, which remains the leading approach to integration, the survey finds.

My hunch is that the survey finds that hoo-hah is not the same as the grunt work required to take data from A, integrate it with data from B, and then do something productive with the data unless humans get involved.


I noted this point:

As the survey’s authors observe, “companies consistently underestimate the cost associated with custom code, as often there are hidden costs not readily visible to IT and business leaders.”

Reality just won’t go away when it comes to integrating disparate digital content. Neither will the costs.

Stephen E Arnold, November 9, 2015

Lexmark Chases New Revenue: Printers to DTM

September 11, 2015

I know what a printer is. The machine accepts instructions and, if the paper does not jam, outputs something I can read. Magic.

I find it interesting to contemplate my printers and visualize them as an enterprise content management system. In the late 1990s, my team and I worked on a project involving a Xerox DocuTech scanner and printer. The idea was that the scanner would convert a paper document to an image with many digital features. Great idea, but the scanner gizmo was not talking to the printer thing. We got them working and shipped the software, the machines, and an invoice to the client. Happy day. We were paid.

The gap between that vision from a Xerox unit and the reality of the hardware was significant. But many companies have stepped forward to help convert organizations which once relied on experienced middle managers into hollowed out outfits trying to rely on software. My recollection is that Fulcrum Technologies nosed into this thorn bush with DOCSFulcrum a decade before the DocuTech was delivered by a big truck to my office. And, not to forget our friends to the East, the French have had a commitment to this approach to information access. Today, one can tap Polyspot or Sinequa for business process centric methods.

The question is, “Which of these outfits is making enough money to beat the dozens of outfits running with the other bulls in digital content processing land?” (My bet is on the completely different animals described in my new study CyberOSINT: Next Generation Information Access.)

Years later I spoke with an outfit called Brainware. The company was a reinvention of an earlier firm, which I think was called SER or something like that. Brainware’s idea was that its system could process text which could be scanned or in a common file format. The index allowed a user to locate text matching a query. Instead of looking for words, the Brainware system used trigrams (sequences of three letters) to locate similar content.

Similar to the Xerox idea. The idea is not a new one.
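
Trigram matching of the sort attributed to Brainware can be sketched as follows. Jaccard overlap is my illustrative choice of similarity score; Brainware’s actual method surely differed in its details.

```python
# Break text into overlapping three-letter sequences and score similarity by
# set overlap. Robust to OCR noise: a single garbled character disturbs at
# most three trigrams.
def trigrams(text):
    text = text.lower().replace(" ", "")
    return {text[i:i + 3] for i in range(len(text) - 2)}

def similarity(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

# An OCR-style corruption ("l" read for "i") still scores far above an
# unrelated string.
print(similarity("invoice number", "lnvoice number") >
      similarity("invoice number", "purchase order"))  # True
```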

I read two write ups about Lexmark, which used to be part of IBM. Lexmark is just down the dirt road from me in Lexington, Kentucky. Its financial health is a matter of interest for some folks in these here parts.

The first write up was “How Lexmark Evolved into an Enterprise Content Management Contender.” The main idea pivots on my knowing what content management is. I am not sure what this buzzword embraces. I do know that organizations have minimal ability to manage the digital information produced by employees and contractors. I also know that most organizations struggle with what their employees do with social media. Toss in the penchant units of a company have for creating information silos, and most companies look for silver bullets which may solve a specific problem in the firm’s legal department but leave many other content issues flapping in the wind.

According to the write up:

Lexmark is "moving from being a hardware provider to a broader provider of higher-value solutions, which are hardware, software and services," Rooke [a Lexmark senior manager] said.

Easy to say. The firm’s financial reports suggest that Lexmark faces some challenges. Google’s financial chart for the outfit displays declining revenues and profits.


The Brainware, ISYS Search Software, and Kofax units have not been able to provide the revenue boost I expected Lexmark to report. HP and IBM, which have somewhat similar strategies for their content processing units, have also struggled. My thought is that it may be more difficult for companies which once were good at manufacturing fungible devices to generate massive streams of new revenue from fuzzy stuff like software.

The write up does not have a hint of the urgency and difficulty of the Lexmark task. I learned from the article:

Lexmark is its own "first customer" to ensure that its technologies actually deliver on the capabilities and efficiency gains promoted by the company, Moody [Lexmark senior manager] said. To date, the company has been able to digitize and automate incoming data by at least 90 percent, contributing to cost reductions of 25 percent and a savings of $100 million, he reported. Cost savings aside, Lexmark wants to help CIOs better and more efficiently incorporate unstructured data from emails, scanned documents and a variety of other sources into their business processes.

The sentiment is one I encountered years ago. My recollection is that the precursor of Convera explained this approach to me in the 1980s when the angle was presented as Excalibur Technologies.

The words today are as fresh as they were decades ago. The challenge, in my opinion, remains.

I also read “How to Build an Effective Digital Transaction Management Platform.” This article is also from eWeek, the outfit which published the “How Lexmark Evolved” piece.

What does this listicle state about Lexmark?

I learned that I need a digital transaction management system. A what? A DTM looks like workflow and information processing. I get it. Digital printing. Instead of paper, a DTM allows a worker to create a Word file or an email. Ah, revolutionary. Then a DTM automates the workflow. I think this is a great idea, but I seem to recall that many companies offer these services. Then I need to integrate my information. There goes the silo, even if regulatory or contractual requirements suggest otherwise. Then I can slice and dice documents. My recollection is that firms have been automating document production for a while. Then I can use esignatures which are trustworthy. Okay. Trustworthy. Then I can do customer interaction “anytime, anywhere.” I suppose this is good when one relies on innovative ways to deal with customer questions about printer drivers. And I can integrate with “enterprise content management.” Oh, oh. I thought enterprise content management was sort of a persistent, intractable problem. Well, not if I include “process intelligence and visibility.” Er, what about those confidential documents relative to a legal dispute?

The temporal coincidence of a fluffy Lexmark write up and the listicle suggests several things to me:

  1. Lexmark is doing the content marketing that public relations and advertising professionals enjoy selling. I assume that my write up, which you are reading, will be an indication of the effectiveness of this one-two punch.
  2. The financial reports warrant some positive action. I think that closing significant deals and differentiating the Lexmark services from those of OpenText and dozens of other firms would have been higher on the priority list.
  3. Lexmark has made a strategic decision to use the rocket fuel of two ageing Atlas systems (Brainware and ISYS) and one Saturn system (Kofax’s Kapow) to generate billions in new revenue. I am not confident that these systems can get the payload into orbit.

Net net: Lexmark is following a logic path already stomped on by Hewlett Packard and IBM, among others. In today’s economic environment, how many federating, digital business process, content management systems can thrive?

My hunch is that the Lexmark approach may generate revenue. Will that revenue be sufficient to compensate for the decline in printer and ink revenues?

What are Lexmark’s options? Based on these two eWeek write ups, it seems as if marketing is the short term best bet. I am not sure I need another buzzword for well worn concepts. But, hey, I live in rural Kentucky and know zero about the big city views crafted down the road in Lexington, Kentucky.

Stephen E Arnold, September 11, 2015
