November 25, 2016
I read “Shedding Light on Dark Data: How to Get Started.” Okay, Dark Data. Like Big Data, the phrase is the fruit of the nomads at Gartner Group. The outfit embracing this sort of old concept is OdinText. Spoiler: I thought the write up was going to identify outfits like BAE Systems, Centrifuge Systems, IBM Analyst’s Notebook, Palantir Technologies, and Recorded Future (an In-Q-Tel and Google backed outfit). Was I wrong? Yes.
The write up explains that a company has to tackle a range of information in order to be aware, informed, or insightful. Pick one. Here’s the list of Dark Data types, which the aforementioned companies have been working to capture, analyze, and make sense of for almost 20 years in the case of NetReveal (Detica) and Analyst’s Notebook. The other companies are comparative spring chickens with an average of seven years’ experience in this effort.
- Customer relationship management data
- Data warehouse information
- Enterprise resource planning information
- Log files
- Machine data
- Mainframe data
- Semi-structured information
- Social media content
- Unstructured data
- Web content.
I think the company or non-profit which tries to suck in these data types and process them may run into some cost and legal issues. Analyzing tweets and Facebook posts can be useful, but there are costs and license fees involved. Frankly, not even law enforcement and intelligence entities are able to do a crackerjack job with these content streams due to their volume, cryptic nature, and pesky quirks related to metadata tagging. But let’s move on. To this statement:
Phone transcripts, chat logs and email are often dark data that text analytics can help illuminate. Would it be helpful to understand how personnel deal with incoming customer questions? Which of your products are discussed with which of your other products or competitors’ products more often? What problems or opportunities are mentioned in conjunction with them? Are there any patterns over time?
Yep, that will work really well in many legal environments. Phone transcripts are particularly exciting.
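Legal constraints aside, the product co-mention analysis the quoted passage describes is easy to sketch. Here is a minimal Python illustration with hypothetical product names and log entries; it is not OdinText’s method, just the general idea of counting which products get mentioned together:

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical product names; in practice these would come from a catalog.
PRODUCTS = {"widget", "gadget", "sprocket"}

def co_mentions(logs):
    """Count which products are mentioned together in the same log entry."""
    pairs = Counter()
    for entry in logs:
        words = set(re.findall(r"[a-z]+", entry.lower()))
        for pair in combinations(sorted(PRODUCTS & words), 2):
            pairs[pair] += 1
    return pairs

logs = [
    "The widget broke when I attached the sprocket",
    "Is the gadget compatible with the widget?",
    "My sprocket arrived late",
]
print(co_mentions(logs).most_common())
```

Add a time stamp per entry and the same counter, bucketed by month, answers the “patterns over time” question.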
How does one think about Dark Data? Easy. Here’s a visualization from the OdinText folks:
Notice that there are data types in this diagram NOT included in the listing above. I can’t figure out if this is just carelessness or an insight which escapes me.
How does one deal with Dark Data? OdinText, of course. Yep, of course. Easy.
Stephen E Arnold, November 25, 2016
November 2, 2016
Images of more than 117 million adult Americans are held by law enforcement agencies, yet the rate of accurately identifying people is minuscule.
A news report by The Register titled Meanwhile, in America: Half of adults’ faces are in police databases says:
One in four American law enforcement agencies across federal, state, and local levels use facial recognition technology, the study estimates. And now some US police departments have begun deploying real-time facial recognition systems.
Though facial recognition software vendors claim accuracy rates anywhere from 60 to 95 percent, the statistics tell an entirely different story:
“Of the FBI’s 36,420 searches of state license photo and mug shot databases, only 210 (0.6 per cent) yielded likely candidates for further investigations,” the study says. “Overall, 8,590 (4 per cent) of the FBI’s 214,920 searches yielded likely matches.”
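The quoted figures are internally consistent; the raw counts reproduce the reported percentages:

```python
# Verify the study's reported hit rates from the raw counts in the quote.
state_hits = 210 / 36_420       # state license photo and mug shot searches
overall_hits = 8_590 / 214_920  # all FBI searches

print(f"{state_hits:.1%}")    # ~0.6%
print(f"{overall_hits:.1%}")  # ~4.0%
```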
Some of the impediments to accuracy include low-light conditions in which the images are captured, limited processing power, numerous simultaneous search requests, and slow search algorithms. The report also reveals that human involvement reduces overall accuracy by more than 50 percent.
The report also touches on a very pertinent point: privacy. Police departments and other law enforcement agencies are increasingly deploying real-time facial recognition. Not only is this an invasion of privacy, but the vulnerable networks can also be tapped by non-state actors. Facial recognition should be used only in cases of serious crime; using it indiscriminately is an absolute no-no. It can be used in many ways to track people, even people who are not criminals. Thus, a question remains unanswered: who will watch the watchmen?
June 1, 2016
A few days ago, I stumbled upon a copy of a letter from the GAO concerning Palantir Technologies dated May 18, 2016. The letter became available to me a few days after the 18th, and the US holiday probably limited circulation of the document. The letter is from the US Government Accountability Office and signed by Susan A. Poling, general counsel. There are eight recipients, some from Palantir, some from the US Army, and two in the GAO.
Has the US Army put Palantir in an untenable spot? Is there a deus ex machina about to resolve the apparent checkmate?
The letter tells Palantir Technologies that its protest of the DCGS Increment 2 award to another contractor is denied. I don’t want to revisit the history or the details as I understand them of the DCGS project. (DCGS, pronounced “dsigs”, is a US government information fusion project associated with the US Army but seemingly applicable to other Department of Defense entities like the Air Force and the Navy.)
The passage in the letter I found interesting was:
While the market research revealed that commercial items were available to meet some of the DCGS-A2 requirements, the agency concluded that there was no commercial solution that could meet all the requirements of DCGS-A2. As the agency explained in its report, the DCGS-A2 contractor will need to do a great deal of development and integration work, which will include importing capabilities from DCGS-A1 and designing mature interfaces for them. Because the agency concluded that significant portions of the anticipated DCSG-A2 scope of work were not available as a commercial product, the agency determined that the DCGS-A2 development effort could not be procured as a commercial product under FAR part 12 procedures. The protester has failed to show that the agency’s determination in this regard was unreasonable.
The “importing” point is a big deal. I find it difficult to imagine that IBM i2 engineers will be eager to permit the Palantir Gotham system to work like one happy family. The importation and manipulation of i2 data in a third party system is more difficult than opening an RTF file in Word in my experience. My recollection is that the unfortunate i2-Palantir legal matter was, in part, related to figuring out how to deal with ANB files. (ANB is i2 shorthand for the Analyst’s Notebook file format, a somewhat complex and closely held construct.)
Net net: Palantir Technologies will not be the dog wagging the tail of IBM i2 and a number of other major US government integrators. The good news is that there will be quite a bit of work available for firms able to support the prime contractors and the vendors eligible and selected to provide for-fee products and services.
Was this a shoot-from-the-hip decision to deny Palantir’s objection to the award? No. I believe the FAR procurement guidelines and the content of the statement of work provided the framework for the decision. However, context is important as are past experiences and perceptions of vendors in the running for substantive US government programs.
November 9, 2015
I clipped an item to read on the fabulous flight from America to shouting distance of Antarctica. Yep, it’s getting smaller.
The write up was “So Far, Tepid Responses to Growing Cloud Integration Hairball.” I think the words “hair” and “ball” convinced me to add this gem to my in flight reading list.
The article is based on a survey (nope, I don’t have the utmost confidence in vendor surveys). Apparently the 300 IT “leaders” experience
pain around application and data integration between on premises and cloud based systems.
I had to take a couple of deep breaths to calm down. I thought the marketing voodoo from vendors embracing utility services (Lexmark/Kapow), metasearch (Vivisimo, et al), unified services (Attivio, Coveo, et al), and licensees of conversion routines from outfits ranging from Oracle to “search consulting” in the “search technology” business had this problem solved.
If the vendors can’t do it, why not just dump everything in a data lake and let an open source software system figure everything out. Failing that, why not convert the data into XML and use the magic of well formed XML objects to deal with these issues?
It seems that the solutions don’t work with the slam dunk regularity of a 23 year old Michael Jordan.
The write up explains:
The old methods may not cut it when it comes to pulling things together. Two in three respondents, 59%, indicate they are not satisfied with their ability to synch data between cloud and on-premise systems — a clear barrier for businesses that seek to move beyond integration fundamentals like enabling reporting and basic analytics. Still, and quite surprisingly, there isn’t a great deal of support for applying more resources to cloud application integration. Premise-to-cloud integration, cloud-to-cloud integration, and cloud data replication are top priorities for only 16%, 10% and 10% of enterprises, respectively. Instead, IT shops make do with custom coding, which remains the leading approach to integration, the survey finds.
My hunch is that the survey finds that hoo-hah is not the same as the grunt work required to take data from A, integrate it with data from B, and then do something productive with the data unless humans get involved.
I noted this point:
As the survey’s authors observe, “companies consistently underestimate the cost associated with custom code, as often there are hidden costs not readily visible to IT and business leaders.”
Reality just won’t go away when it comes to integrating disparate digital content. Neither will the costs.
Stephen E Arnold, November 9, 2015
September 11, 2015
I know what a printer is. The machine accepts instructions and, if the paper does not jam, outputs something I can read. Magic.
I find it interesting to contemplate my printers and visualize them as an enterprise content management system. In the late 1990s, my team and I worked on a project involving a Xerox DocuTech scanner and printer. The idea was that the scanner would convert a paper document to an image with many digital features. Great idea, but the scanner gizmo was not talking to the printer thing. We got them working and shipped the software, the machines, and an invoice to the client. Happy day. We were paid.
The gap between that vision from a Xerox unit and the reality of the hardware was significant. But many companies have stepped forward to convert knowledge-resident systems relying on experienced middle managers into hollowed-out outfits trying to rely on software. My recollection is that Fulcrum Technologies nosed into this thorn bush with DOCSFulcrum a decade before the DocuTech was delivered by a big truck to my office. And, not to forget our friends to the East, the French have had a commitment to this approach to information access. Today, one can tap Polyspot or Sinequa for business process centric methods.
The question is, “Which of these outfits is making enough money to beat the dozens of outfits running with the other bulls in digital content processing land?” (My bet is on the completely different animals described in my new study CyberOSINT: Next Generation Information Access.)
Years later I spoke with an outfit called Brainware. The company was a reinvention of an earlier firm, which I think was called SER or something like that. Brainware’s idea was that its system could process text, whether scanned or in a common file format. The index allowed a user to locate text matching a query. Instead of looking for words, the Brainware system used trigrams (sequences of three letters) to locate similar content.
Similar to the Xerox idea. The idea is not a new one.
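Trigram matching of this sort can be sketched in a few lines of Python. This is an illustrative approximation of the general technique, not Brainware’s actual system:

```python
def trigrams(text):
    """Break a string into overlapping three-letter sequences."""
    s = text.lower().replace(" ", "")
    return {s[i:i + 3] for i in range(len(s) - 2)}

def similarity(a, b):
    """Jaccard overlap of trigram sets: 1.0 identical, 0.0 disjoint."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

# Misspellings still score well because most trigrams survive the typo.
print(similarity("federated search", "federated serch"))
```

The appeal is that a query with a typo or OCR error still lands near the right documents, because most of its trigrams match, which matters when the input is scanned paper.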
I read two write ups about Lexmark, which used to be part of IBM. Lexmark is just down the dirt road from me in Lexington, Kentucky. Its financial health is a matter of interest for some folks in these here parts.
The first write up was “How Lexmark Evolved into an Enterprise Content Management Contender.” The main idea pivots on my knowing what content management is. I am not sure what this buzzword embraces. I do know that organizations have minimal ability to manage the digital information produced by employees and contractors. I also know that most organizations struggle with what their employees do with social media. Toss in the penchant units of a company have for creating information silos, and most companies look for silver bullets which may solve a specific problem in the firm’s legal department but leave many other content issues flapping in the wind.
According to the write up:
Lexmark is "moving from being a hardware provider to a broader provider of higher-value solutions, which are hardware, software and services," Rooke [a Lexmark senior manager] said.
Easy to say. The firm’s financial reports suggest that Lexmark faces some challenges. Google’s financial chart for the outfit displays declining revenues and profits:
The Brainware, ISYS Search Software, and Kofax units have not been able to provide the revenue boost I expected Lexmark to report. HP and IBM, which have somewhat similar strategies for their content processing units, have also struggled. My thought is that it may be more difficult for companies which once were good at manufacturing fungible devices to generate massive streams of new revenue from fuzzy stuff like software.
The write up does not have a hint of the urgency and difficulty of the Lexmark task. I learned from the article:
Lexmark is its own "first customer" to ensure that its technologies actually deliver on the capabilities and efficiency gains promoted by the company, Moody [Lexmark senior manager] said. To date, the company has been able to digitize and automate incoming data by at least 90 percent, contributing to cost reductions of 25 percent and a savings of $100 million, he reported. Cost savings aside, Lexmark wants to help CIOs better and more efficiently incorporate unstructured data from emails, scanned documents and a variety of other sources into their business processes.
The sentiment is one I encountered years ago. My recollection is that the precursor of Convera explained this approach to me in the 1980s when the angle was presented as Excalibur Technologies.
The words today are as fresh as they were decades ago. The challenge, in my opinion, remains.
I also read “How to Build an Effective Digital Transaction Management Platform.” This article is also from eWeek, the outfit which published the “How Lexmark Evolved” piece.
What does this listicle state about Lexmark?
I learned that I need a digital transaction management system. A what? A DTM looks like workflow and information processing. I get it. Digital printing. Instead of paper, a DTM allows a worker to create a Word file or an email. Ah, revolutionary. Then a DTM automates the workflow. I think this is a great idea, but I seem to recall that many companies offer these services. Then I need to integrate my information. There goes the silo even if regulatory or contractual requirements suggest otherwise. Then I can slice and dice documents. My recollection is that firms have been automating document production for a while. Then I can use esignatures which are trustworthy. Okay. Trustworthy. Then I can do customer interaction “anytime, anywhere.” I suppose this is good when one relies on innovative ways to deal with customer questions about printer drivers. Then I can integrate with “enterprise content management.” Oh, oh. I thought enterprise content management was sort of a persistent, intractable problem. Well, not if I include “process intelligence and visibility.” Er, what about those confidential documents relative to a legal dispute?
The temporal coincidence of a fluffy Lexmark write up and the listicle suggests several things to me:
- Lexmark is doing the content marketing that public relations and advertising professionals enjoy selling. I assume that my write up, which you are reading, will be an indication of the effectiveness of this one-two punch.
- The financial reports warrant some positive action. I think that closing significant deals and differentiating the Lexmark services from those of OpenText and dozens of other firms would have been higher on the priority list.
- Lexmark has made a strategic decision to use the rocket fuel of two ageing Atlas systems (Brainware and ISYS) and one Saturn system (Kofax’s Kapow) to generate billions in new revenue. I am not confident that these systems can get the payload into orbit.
Net net: Lexmark is following a logic path already stomped on by Hewlett Packard and IBM, among others. In today’s economic environment, how many federating, digital business process, content management systems can thrive?
My hunch is that the Lexmark approach may generate revenue. Will that revenue be sufficient to compensate for the decline in printer and ink revenues?
What are Lexmark’s options? Based on these two eWeek write ups, it seems as if marketing is the short term best bet. I am not sure I need another buzzword for well worn concepts. But, hey, I live in rural Kentucky and know zero about the big city views crafted down the road in Lexington, Kentucky.
Stephen E Arnold, September 11, 2015
September 4, 2015
The Dark Web is not only used to buy and sell illegal drugs, but it is also used to perpetuate sex trafficking, especially of children. The work of law enforcement agencies working to prevent the abuse of sex trafficking victims is detailed in a report by the Australian Broadcasting Corporation called “Secret ‘Dark Net’ Operation Saves Scores Of Children From Abuse; Ringleader Shannon McCoole Behind Bars After Police Take Over Child Porn Site.” For ten months, Argos, the Queensland police anti-pedophile taskforce, tracked usage on an Internet bulletin board with 45,000 members that viewed and uploaded child pornography.
The Dark Web is notorious for encrypting user information, and that is one of its main draws: users can conduct business or other illegal activities, such as viewing child pornography, without fear of retribution. Even the Dark Web, however, leaves a digital trail, and Argos was able to track down the Web site’s administrator. The administrator turned out to be an Australian childcare worker, who was sentenced to 35 years in jail for sexually abusing seven children in his care and sharing child pornography.
Argos was able to catch the perpetrator by noticing patterns in his language usage in posts he made to the bulletin board (he used the greeting “hiya”). Using advanced search techniques, the police sifted through results and narrowed them down to a Facebook page and a photograph. From the Facebook page, they got the administrator’s name and made an arrest.
After arresting the ringleader, Argos took over the community and started to track down the rest of the users.
“Phase two was to take over the network, assume control of the network, try to identify as many of the key administrators as we could and remove them,” Detective Inspector Jon Rouse said. “Ultimately, you had a child sex offender network that was being administered by police.”
When they took over the network, the police were required to work in real-time to interact with the users and gather information to make arrests.
Even though the Queensland police were able to end one Dark Web child pornography ring and save many children from abuse, there are still many Dark Web sites centered on child sex trafficking.
July 25, 2015
Short honk: I scanned my Twitter feed this morning. What did I see? An impossible assertion from the marketing crazed folks at IBM Watson. Let me tell you, IBM Watson and its minions output a hefty flow of tweets. A year or so ago, IBM relied on mid-tier consulting firm experts like Dave Schubmehl (yep, the fellow who sold my research on Amazon without my permission). Now there are other voices.
But the message, not just the medium, is important. IBM’s assertion is that there will be no more “data silos in enterprise search.” You can learn about IBM’s “reality” in a webcast.
Now, I am not planning on sitting through a webcast. I would, however, like to enumerate several learnings from my decades of enterprise information access work. You can use this list as a jump start for your questions to the IBM wizards. Here goes:
- In an enterprise, what happens when an indexing system makes available in a federated search system information related to a legal matter which is not supposed to be available to anyone except the attorneys involved in the matter?
- In an enterprise, what happens if information pertinent to a classified government project is made available in a federated search system which has not been audited for access control compliance?
- What happens when personnel information containing data about a medical issue is indexed and made available in an enterprise search system when email attachments are automatically indexed?
- How does the federated system deal with content in servers located in a research facility engaged in new product research?
- What happens when sales and pricing data shared among key account executives is indexed and made available to a contractor advising the company?
- What is the process for removing pointers to data which are not supposed to be in the enterprise search system?
- What security measures are in place to ensure that a lost or stolen mobile device does not have access to an enterprise search system?
- How much manual work is required before an organization turns on the Watson indexing system?
These will get you started on the cross-silo issues.
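Several of these questions reduce to enforcing access controls in the result pipeline. A minimal sketch of late-binding security trimming (filtering hits against each document’s access list at query time), using hypothetical documents, group names, and a toy substring matcher; real enterprise deployments are far more involved:

```python
# Hypothetical index entries: each document carries an access-control list.
DOCS = [
    {"id": 1, "text": "Q3 pricing for key accounts", "acl": {"sales"}},
    {"id": 2, "text": "litigation hold memo",        "acl": {"legal"}},
    {"id": 3, "text": "printer driver FAQ",          "acl": {"everyone"}},
]

def search(query, user_groups):
    """Return only hits the user is entitled to see (late-binding trimming)."""
    hits = [d for d in DOCS if query in d["text"]]
    return [d for d in hits if d["acl"] & (user_groups | {"everyone"})]

# A contractor in no special group sees only public documents.
print(search("printer", {"contractor"}))  # doc 3 only
print(search("pricing", {"contractor"}))  # nothing: the ACL blocks it
```

The auditing questions above are about exactly this layer: if the index is built before the ACLs are mapped correctly, the trimming step has nothing reliable to filter on.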
Oh, the answer to these questions is that the person identified as responsible for making the data available may get to find a future elsewhere. Amazon warehouses are hiring in southern Indiana.
Alternatively one can saddle up a white stallion, snag a lance, and head for the nearest windmill.
Stephen E Arnold, July 25, 2015
October 21, 2014
Through a post at their blog Coveo Insights, enterprise-search firm Coveo urges, “Power Your Customer Service with Unified Search Driven Knowledge.” The write-up gives a few reasons why such “omni-channel” (federated) search functionality is a wise choice for customer service. Writer and Coveo marketing director Tucker Hall explains:
“Customers … engage with companies across a growing number of channels — from self-service portals and contact centers, to social media and field service engagements. Today’s savvy customer expects (and deserves) a seamless and consistent service experience across all of these channels. Omni-channel customer service has now become essential for companies hoping to maximize customer engagement, satisfaction, and retention.
“Successful omni-channel customer service can prove difficult regardless of the specific technologies and systems an organization has in place. That’s because success demands that customers and support personnel alike have swift, intuitive access to the case-resolving knowledge and expertise they need, when and how they need it.”
Hall asserts that many companies are missing out because they “fail to appreciate” the reasons to choose federated search: data and expertise are located in many systems, crowd-sourcing is a thing, and analytics must be actionable. But you, dear reader, already knew those, didn’t you? More on these points can be found in Coveo’s solution brief on the subject (registration required).
Coveo serves organizations large, medium, and small with solutions that aim to be agile and easy to use yet scalable, fast, and efficient. The company was founded in 2005 by members of the team which developed Copernic Desktop Search. Coveo maintains offices in the U.S., Netherlands, and Quebec.
Cynthia Murrell, October 21, 2014
September 14, 2013
Do we dare broach the subject of health care information and electronic medical records? Yes, we do, and we take into account “Dr. Karl Kochendorfer: Bridging The Knowledge Gap In Health Care” from the Federated Search Blog. Dr. Karl Kochendorfer wants there to be an official federated search for the national health care system. His idea is to connect health care professionals to authoritative information with an instantaneous return. He notes that doctors and nurses are relying on Wikipedia and Google searches rather than authorized databases, because it is faster. Notice the danger?
Dr. Kochendorfer mentions this fact in a TED talk he gave in April called “Seek And Ye Shall.” He presents the idea for a federated search in this discussion, along with more of these facts:
- “There are 3 billion terabytes of information out there.
- There are 700,000 articles added to the medical literature every year.
- Information overload was described 140 years ago by a German surgeon: “It has become increasingly difficult to keep abreast of the reports which accumulate day after day … one suffocates through exposure to the massive body of rapidly growing information.”
- With better search tools, 275 million improved decisions could be made.
- Clinicians spend 1/3 of their time looking for information.”
Dr. Kochendorfer’s idea is grand, but how many academic databases are lining up to offer their information for free or without a hefty subscription fee? Academia is already desperate for money; asking it to share its wealth of knowledge without green will not go over well. Should there be a federated search with authoritative information and instantaneous results? Yes. Will it happen? Keep fixing the plumbing.
Whitney Grace, September 14, 2013
September 12, 2013
The general search engines available on the web are simply not adequate for healthcare professionals looking for the latest pertinent information (let alone personalized data on their patients). The Federated Search Blog shares an important Tedx Talk in its piece, “Dr. Karl Kochendorfer: Bridging the Knowledge Gap in Health Care,” which advocates the adoption of federated search for the healthcare industry. I recommend the video not only for those in the healthcare or search fields, but for anyone interested in getting the best care for themselves and their families. The write-up tells us:
“As a family physician and leader in the effort to connect healthcare workers to the information they need, Dr. Kochendorfer acknowledges what those of us in the federated search world already know – Google and the surface web contain so little of the critical information your doctor and his staff need to support important medical decision-making.”
The write-up summarizes highlights from the talk, including the statistic that says a third of clinicians’ time is spent hunting down information. No wonder doctors are spending less time with patients! The article continues:
“And, the most compelling reason to get federated search into healthcare is the sobering thought by Dr. Kochendorfer that doctors are now starting to use Wikipedia to get answers to their questions instead of the best evidence-based sources out there just because Wikipedia is so easy for them to use. Scary.”
Yes, scary is a good word for it. It is true that data reservoirs that feed federated searches can contain errors—a point Kochendorfer does not address in this video. Still, I have to agree with the write-up: the doctor makes a compelling case on this important issue. The video concludes with a call for listeners to support the development of federated healthcare search tools like MedSocket and open standards like Infobuttons. Sounds like a good idea to me.
Cynthia Murrell, September 12, 2013