Short Honk: Elasticsearch Information
July 10, 2015
Short honk: For information about Elastic’s Elasticsearch, the open source search system which has proprietary search vendors cowering in fear, navigate to Elasticsearch: The Definitive Guide. Elasticsearch is not perfect, but what software is? Ask United Airlines, the New York Stock Exchange, and the Wall Street Journal about their systems. The book includes useful information about geolocation functions plus some meaty stuff about administering the system once you are up and running. Worth a look.
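For readers who want a concrete taste of those geolocation functions, the snippet below is a minimal sketch of a geo_distance filter sent to a local Elasticsearch instance. The index name, field names, and endpoint are assumptions for illustration, not examples taken from the book:

# Minimal sketch: find documents within 5 km of a point using a
# geo_distance filter. The "places" index, its "location" geo_point
# field, and the localhost endpoint are assumed for illustration.
import json
import requests

query = {
    "query": {
        "bool": {
            "filter": {
                "geo_distance": {
                    "distance": "5km",
                    "location": {"lat": 40.7128, "lon": -74.0060},
                }
            }
        }
    }
}

resp = requests.post(
    "http://localhost:9200/places/_search",
    headers={"Content-Type": "application/json"},
    data=json.dumps(query),
)
for hit in resp.json().get("hits", {}).get("hits", []):
    print(hit["_id"], hit["_source"].get("name"))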
Stephen E Arnold, July 10, 2015
Researchers Glean Audio from Video
July 10, 2015
Now, this is fascinating. Scary, but fascinating. MIT News explains how a team of researchers from MIT, Microsoft, and Adobe are “Extracting Audio from Visual Information.” The article includes a video in which one can clearly hear the poem “Mary Had a Little Lamb” as extrapolated from video of a potato chip bag’s vibrations filmed through soundproof glass, among other amazing feats. I highly recommend you take four-and-a-half minutes to watch the video.
Writer Larry Hardesty lists some other surfaces from which the team was able to reproduce audio by filming vibrations: aluminum foil, water, and plant leaves. The researchers plan to present a paper on their results at this year’s Siggraph computer graphics conference. See the article for some details on the research, including camera specs and algorithm development.
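The team’s actual pipeline (high speed video plus phase based motion analysis) is far more sophisticated, but the core idea, treating tiny per frame motion as a one dimensional signal sampled at the camera’s frame rate, can be sketched in a few lines. The toy example below is my own illustration under that assumption, not the researchers’ algorithm; the input file name and patch location are hypothetical:

# Toy sketch: reduce each video frame to one brightness value and treat
# the sequence as an audio-rate signal. This is NOT the MIT/Microsoft/Adobe
# method; it only illustrates the "vibration as signal" idea.
import cv2
import numpy as np
from scipy.io import wavfile

cap = cv2.VideoCapture("chip_bag.mp4")          # hypothetical high-speed clip
fps = cap.get(cv2.CAP_PROP_FPS) or 2200.0       # assume a high-speed camera

samples = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Mean intensity of a small patch acts as a crude vibration proxy.
    samples.append(gray[100:150, 100:150].mean())
cap.release()

signal = np.asarray(samples, dtype=np.float64)
signal -= signal.mean()                          # remove DC offset
signal /= (np.abs(signal).max() or 1.0)          # normalize to [-1, 1]
wavfile.write("recovered.wav", int(fps), (signal * 32767).astype(np.int16))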
So, will this tech have any non-spying related applications? Hardesty cites MIT grad student Abe Davis, first author of the team’s paper:
“The researchers’ technique has obvious applications in law enforcement and forensics, but Davis is more enthusiastic about the possibility of what he describes as a ‘new kind of imaging.’
“‘We’re recovering sounds from objects,’ he says. ‘That gives us a lot of information about the sound that’s going on around the object, but it also gives us a lot of information about the object itself, because different objects are going to respond to sound in different ways.’ In ongoing work, the researchers have begun trying to determine material and structural properties of objects from their visible response to short bursts of sound.”
That’s one idea. Researchers are confident other uses will emerge, ones no one has thought of yet. This is a technology to keep tabs on, and not just to decide when to start holding all private conversations in windowless rooms.
Cynthia Murrell, July 10, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
SAS Text Miner Promises Unstructured Insight
July 10, 2015
Big data tools help organizations analyze more than their old, legacy data. While legacy data does help an organization study how its processes have changed, that data is old and does not reflect immediate, real time trends. SAS offers a product that bridges old data with new as well as unstructured data with structured data.
The SAS Text Miner is built from Teragram technology. It features document theme discovery, a function that finds relations between document collections; automatic Boolean rule generation; high performance text mining that quickly evaluates large document collections; term profiling and trending, which evaluates term relevance in a collection and how terms are used; multiple language support; visual interrogation of results; easy text import; flexible entity options; and a user friendly interface.
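Document theme discovery of this general sort is usually some flavor of topic modeling. The sketch below illustrates the concept with TF-IDF and non negative matrix factorization; it is a hypothetical stand in, not SAS’s Teragram-based implementation, and the sample documents are invented:

# Hypothetical illustration of "document theme discovery" in general,
# not SAS Text Miner's actual algorithm. Sample documents are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

documents = [
    "customer complained about late delivery and refund delays",
    "shipment arrived damaged, warehouse packing issue reported",
    "invoice dispute resolved after billing system error",
    "refund issued for duplicate billing charge on invoice",
]

# Turn free text into weighted term vectors.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(documents)

# Factor the term-document matrix into a small number of "themes."
nmf = NMF(n_components=2, random_state=0)
doc_themes = nmf.fit_transform(tfidf)

# Show the top terms for each discovered theme.
terms = vectorizer.get_feature_names_out()
for i, weights in enumerate(nmf.components_):
    top = [terms[j] for j in weights.argsort()[::-1][:5]]
    print(f"Theme {i}: {', '.join(top)}")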
The SAS Text Miner is specifically programmed to discover data relationships, automate activities, and determine keywords and phrases. The software uses predictive models to analyze data and discover new insights:
“Predictive models use situational knowledge to describe future scenarios. Yet important circumstances and events described in comment fields, notes, reports, inquiries, web commentaries, etc., aren’t captured in structured fields that can be analyzed easily. Now you can add insights gleaned from text-based sources to your predictive models for more powerful predictions.”
Text mining software reveals insights between old and new data, making it one of the basic components of big data.
Whitney Grace, July 10, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
Enterprise Search and the Mythical Five Year Replacement Cycle
July 9, 2015
I have been around enterprise search for a number of years. In the research we did in 2002 and 2003 for the Enterprise Search Report, my subsequent analyses of enterprise search both proprietary and open source, and the ad hoc work we have done related to enterprise search, we obviously missed something.
Ah, the addled goose and my hapless goslings. The degrees, the experience, the books, and the knowledge had a giant lacuna, a goose egg, a zero, a void. You get the idea.
We did not know that an enterprise licensing an open source or proprietary enterprise search system replaced that system every 60 months. We did document the following enterprise search behaviors:
- Users express dissatisfaction with any installed enterprise search system. Regardless of vendor, anywhere from 50 to 75 percent of users find the system a source of dissatisfaction. That suggests that enterprise search is not pulling the hay wagon for quite a few users.
- Organizations, particularly the Fortune 500 firms we polled in 2003, had more than five enterprise search systems installed and in use. The reason for the grandfathering is that each system had its ardent supporters. Companies just grandfathered the existing system and looked for another in the hopes of finding one that improved information access. Our conclusion was that no one replaced anything.
- Enterprise search systems did not change much from year to year. In fact, the fancy buzzwords used today to describe open source and proprietary systems have been in use since the early 1980s. Dig out some of Fulcrum’s marketing collateral or the explanation of ISYS Search Software from 1986 and look for words like clustering, automatic indexing, semantics, etc. A short cut is to read some of the free profiles of enterprise search vendors on my Xenky.com Web site.
I learned about a white paper, which is 21st century jargon for a marketing essay, titled “Best Practices for Enterprise Search: Breaking the Five-Year Replacement Cycle.” The write up comes from a company called Knowledgent. The company describes itself this way on its Who We Are Web page:
Knowledgent [is] a precision-focused data and analytics firm with consistent, field-proven results across industries.
The essay begins with a reference to Lexis, which Don Wilson (may he rest in peace) and a couple of colleagues founded. The problem with the reference is that the Lexis search engine was not an enterprise search and retrieval system. The Lexis OBAR system, built for the Ohio State Bar Association, was tailored to the needs of legal researchers, not general employees. Note that Lexis’ marketing in 1973 suggested that anyone could use the command line interface. The OBAR system required content in quite specific formats before the system could index it. The mainframe roots of OBAR influenced the subsequent iterations of the LexisNexis text retrieval system: Think mainframes, folks. The point is that OBAR was not a system that was replaced in five years. The dog was in the kennel for many years. (For more about the history of Lexis search, see Bourne and Hahn, A History of Online Information Services, 1963-1976.) By 2010, LexisNexis had migrated to XML and moved from mainframes to lower cost architectures. But the OBAR system’s methods can still be seen in today’s system. Five years. What are the supporting data?
The white paper leaps from the five year “assertion” to an explanation of the “cycle.” In my experience, what organizations do is react to an information access problem and then begin a procurement cycle. Increasingly, as the research for our CyberOSINT study shows, savvy organizations are looking for systems that deliver more than keyword and taxonomy-centric access. Words just won’t work for many organizations today. More content is available in videos, images, and real time, almost ephemeral “documents” which can be difficult to capture, parse, and make findable. Organizations need systems which provide usable information, not more work for already overextended employees.
The white paper addresses the subject of the value of search. In our research, search is a commodity. The high value information access systems go “beyond search.” One can get okay search in an open source solution or whatever is baked into a must-have enterprise application. Search vendors have a problem because after decades of selling search as a high value system, the licensees know that search is a cost sinkhole and not what is needed to deal with real world information challenges.
What “wisdom” does the white paper impart about the “value” of search? Here’s a representative passage:
There are also important qualitative measures you can use to determine the value and ROI of search in your organization. Surveys can quickly help identify fundamental gaps in content or capability. (Be sure to collect enterprise demographics, too. It is important to understand the needs of specific teams.) An even better approach is to ask users to rate the results produced by the search engine. Simply capturing a basic “thumbs up” or “thumbs down” rating can quickly identify weak spots. Ultimately, some combination of qualitative and quantitative methods will yield an estimate of search, and the value it has to the company.
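A thumbs up and thumbs down tally does yield a number, and the arithmetic involved is modest. Here is a hypothetical back of the envelope sketch with invented figures, offered only to show how little machinery sits behind such an “estimate”:

# Back-of-the-envelope "value of search" estimate from thumbs-up/down
# ratings. Every figure below is invented for illustration.
ratings = {"up": 1340, "down": 860}
satisfaction = ratings["up"] / (ratings["up"] + ratings["down"])

employees = 5000
searches_per_employee_per_day = 6
minutes_lost_per_failed_search = 4
loaded_cost_per_minute = 1.00   # dollars, assumed
working_days = 230

failed_searches_per_day = (
    employees * searches_per_employee_per_day * (1 - satisfaction)
)
annual_cost_of_failure = (
    failed_searches_per_day * minutes_lost_per_failed_search
    * loaded_cost_per_minute * working_days
)

print(f"Satisfaction rate: {satisfaction:.0%}")
print(f"Estimated annual cost of failed searches: ${annual_cost_of_failure:,.0f}")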
I have zero clue how this set of comments can be used to justify the direct and indirect costs of implementing a keyword enterprise search system. The advice is essentially irrelevant to the acquisition of a more advanced system from a leading edge next generation information access vendor like BAE Systems (NetReveal), IBM (not the Watson stuff, however), or Palantir. The fact underscored by our research over the last decade is tough to dispute: Connecting an enterprise search system to demonstrable value is a darned difficult thing to accomplish.
It is far easier to focus on a niche like legal search and eDiscovery or the retrieval of scientific and research data for the firm’s engineering units than to boil the ocean. The idea of “boil the ocean” is that a vendor presents a text centric system (essentially a one trick pony) as an animal with the best of stallions, dogs, tigers, and grubs. The spam about enterprise search value is less satisfying than the steak of showing that an eDiscovery system helped the legal eagles win a case. That, gentle reader, is value. No court judgment. No fine. No PR hit. A grumpy marketer who cannot find a Web article is not value no matter how one spins the story.
HP: A Trusted Source for Advice about Big Data?
July 9, 2015
Remember that Hewlett Packard bought Autonomy. As part of that process, the company had access to data, Big Data. There were Autonomy financials; there were documents from various analysts and experts; there were internal analyses. The company’s Board of Directors and the senior management of the HP organization decided to purchase Autonomy for $11 billion in October 2011. I assume that HP worked through these data in a methodical, thorough manner, emulating the type of pre-implosion interest in detail that made Arthur Andersen a successful outfit until the era of the 2001 Enron short cut and the firm’s implosion. A failure to deal with data took out Andersen, and I harbor a suspicion that HP’s inability to analyze the Autonomy data has been an early warning of issues at HP.
I was lugging my mental baggage with me when I read “Six Signs That Your Big Data Expert, Isn’t?” I worked through the points cited in the write up, which appeared in the HP Big Data Blog. Let me highlight three of these items and urge you, gentle reader, to check out the article for the six pack of points. I do not recommend drinking a six pack when you peruse the source because the points might seem quite like the statements in Dr. Benjamin Spock’s book on child rearing.
Item 2 from the list of six: “They [your Big Data experts] talk about technology, rather than the business.” Wow, this hit a chord with me when I considered HP’s spending $11 billion and then writing off $7 or $8 billion, blaming Autonomy for tricking Hewlett Packard. My thought was, “Maybe HP is the ideal case study to be cited when pointing out that someone is focusing on the wrong thing.” For example, Autonomy’s “black box” approach is nifty, but it has been in the market since 1995-1996. The system requires care and feeding, and it can be a demanding taskmistress to set up, configure, optimize, and maintain. For a buyer not to examine the “Big Data” relevant to 15 years of business history strikes me as skipping an important and basic step in the acquisition process. Did HP talk about the Autonomy business, or did HP get tangled in the Dynamic Reasoning Engine, the Intelligent Data Operating Layer, patents, and Bayesian-Laplacian-Markovian methods?
Item 4 from the list of six: “They [your Big Data experts] talk about conclusions rather than correlations.” Autonomy, as I reported in the first three editions of the late, lamented Enterprise Search Report, grew its revenue through a series of savvy acquisitions. The cost and sluggishness of enterprise software sales related to IDOL needed some vitamin supplements. Dr. Mike Lynch and his capable management team built the Autonomy revenue base by nosing into video, content management, and fraud detection. IDOL was the “brand,” and the revenue flowed from the sale of a diverse line up of products and services. My hypothesis is that the HP acquisition team looked at the hundreds of millions in Autonomy revenue and concluded, “Hey, this is a no brainer for us. We can sell much more of this IDOL thing. Our marketing is much more effective than that of the wonks in Cambridge. Our HP sales force is more capable than the peddlers Autonomy has on the street.” HP then jumped to the conclusion that it could take $700 or $800 million in existing revenue and push it into the stratosphere. Well, how is that working out? Again, I would suggest that the case to reference in this Item 4 is HP itself.
Item 6 from the list of six: “They [your Big Data experts] talk about data quality, rather than data validity.” This is an interesting item. In the land of databases, the meaning of data quality is often conflated with consistency; that is, ingesting the data does not generate exceptions during processing. An exception is a record which the content processing system kicks out as malformed. The notion of data validity means that the data that makes it into a database is accurate by some agreed upon yardstick. Databases can be filled with misinformation, disinformation, and reformed information like a flood of messages from Mr. Putin’s social media campaigns. HP may have accepted estimates from Qatalyst Partners, its own in house team, and from third party due diligence firms. HP’s senior management used these data, which I assume were neither too little nor too big to shore up their decision to buy Autonomy for $11 billion. As HP learned, data, whether meaty or scrawny, may be secondary to the reasoning process applied to the information. Well, HP demonstrated that it made a slip up in its understanding of Autonomy. I would have liked to see this point include a reference to HP’s Autonomy experience.
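The distinction is easy to show in code. In the hypothetical sketch below, with invented figures unrelated to Autonomy’s actual numbers, a record sails through the consistency checks (it parses, the types are right, no exception is thrown) yet fails a validity check against an agreed upon yardstick:

# Hypothetical illustration of "data quality" (consistency) versus
# "data validity" (accuracy against an agreed-upon yardstick).
from dataclasses import dataclass

@dataclass
class RevenueRecord:
    company: str
    fiscal_year: int
    revenue_usd: float

def is_consistent(raw: dict) -> bool:
    """Quality/consistency: the record parses and the types are right."""
    try:
        RevenueRecord(str(raw["company"]), int(raw["fiscal_year"]),
                      float(raw["revenue_usd"]))
        return True
    except (KeyError, TypeError, ValueError):
        return False

def is_valid(record: RevenueRecord, audited_revenue: dict) -> bool:
    """Validity: the value agrees with an external yardstick (audited figures)."""
    yardstick = audited_revenue.get((record.company, record.fiscal_year))
    return yardstick is not None and abs(record.revenue_usd - yardstick) / yardstick < 0.05

raw = {"company": "ExampleCo", "fiscal_year": 2010, "revenue_usd": 870_000_000.0}
audited = {("ExampleCo", 2010): 610_000_000.0}   # invented yardstick

record = RevenueRecord(**raw)
print("consistent:", is_consistent(raw))         # True: it ingests cleanly
print("valid:", is_valid(record, audited))       # False: it overstates revenue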
Net net: HP is pitching advice related to Big Data. That’s okay, but I find that a company which appears to have struggled with Big Data related to the Autonomy acquisition may not be the best, most objective, and most reliable source of advice.
Talk is easy. Performance is difficult. HP is mired in a break up plan. The company has not convinced me that it is able to deal with Big Data. Verbal assurances are one thing; top line performance and profits, happy customers, and wide adoption of Autonomy technology are another.
The other three points can be related to Autonomy. I will leave it to you, gentle reader, to map HP’s adult-sounding advice to HP’s actual use of Big Data. As the HP blog’s cartoon says, “Well, maybe.”
Stephen E Arnold, July 9, 2015
IBM Ultradense Computer Chips. Who Will Profit from the Innovation?
July 9, 2015
I don’t want to rain on IBM’s seven nanometer chips. To experience the rah rah wonder of the innovation, navigate to “IBM Announces Computer Chips More Powerful Than Any in Existence.” Note: You may have to purchase a dead tree edition of the Gray Lady or cough up money to deal with the pay wall.
The write up reveals the finished product, much like an automobile rebuilding television program in which Chip Foose shows the completed vehicle, not the components:
The company said on Thursday that it had working samples of chips with seven-nanometer transistors. It made the research advance by using silicon-germanium instead of pure silicon in key regions of the molecular-size switches. The new material makes possible faster transistor switching and lower power requirements. The tiny size of these transistors suggests that further advances will require new materials and new manufacturing techniques. As points of comparison to the size of the seven-nanometer transistors, a strand of DNA is about 2.5 nanometers in diameter and a red blood cell is roughly 7,500 nanometers in diameter. IBM said that would make it possible to build microprocessors with more than 20 billion transistors.
Okay. Good.
My question is, “Has IBM the capability to manufacture these chips, package them in hardware that savvy information technology professionals will want, and then support the rapidly growing ecosystem?”
Like the pre-Judge Greene Bell Labs, IBM can invent or engineer something nifty. But the Bell Labs folks were not the leaders in the productization field. IBM seems to point to its “international consortium” and the $3 billion “public private partnership” as evidence that revenue is just around the corner.
Like the Watson PR, IBM’s ability to get its tales of technical prowess in front of me may be greater than the company’s ability to generate substantial top line growth and a healthy pile of cash after taxes.
From my vantage point in rural Kentucky, my hunch is that the outfits which build the equipment, work out the manufacturing processes, and then increase chip yields will be the big winners. The proven ability to make things may have more revenue potential than the achievement, significant as it is, of a seven nanometer chip.
Who will be the winner? The folks at Samsung who could use a win? The contractors involved in the project? IBM?
No answers, but my hunch is that core manufacturing expertise might be a winner going forward. Once a chip is made smaller, others know it can be done which allows the followers to move forward. IBM, however, has more than an innovator’s dilemma. Will Watson become more of a market force with these new chips? If so, when? One week, one year, 10 years?
Also, IBM has to deal with the allegedly accurate statements about the company which appear in the Alliance@IBM blog.
Stephen E Arnold, July 9, 2015
Cloud is Featured in SharePoint 2016
July 9, 2015
Users are eager to learn all they can about the upcoming release of SharePoint Server 2016. Mark Kashman recently gave a presentation with additional information, which is covered in the Redmond Channel Partner article, “Microsoft: Cloud Will Play Prominent Role in SharePoint 2016.”
The article begins:
“Microsoft recently detailed its vision for SharePoint Server 2016, which appears to be very cloud-centric. Microsoft is planning a beta release of the new SharePoint Server 2016 by the end of this year, with final product release planned for Q2 2016. Mark Kashman, a senior product manager at Microsoft on the SharePoint team, gave more details about Microsoft’s plans for the server during a June 17 presentation at the SPBiz Conference titled ‘SharePoint Vision and Roadmap.’”
Users are still waiting to hear how this “cloud-centric” approach affects the overall usability of the product. As more details become available, stay tuned to ArnoldIT.com for the highlights. Stephen E. Arnold is a longtime leader in search, and his distillation of SharePoint news, tips, and tricks on his dedicated SharePoint feed is a way for users to stay on top of the changes without a huge investment in time.
Emily Rae Aldridge, July 9, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
Watson Still Has Much to Learn About Healthcare
July 9, 2015
If you’ve wondered what is taking Watson so long to get its proverbial medical degree, check out IEEE Spectrum’s article, “IBM’s Dr. Watson Will See You… Someday.” When IBM’s AI Watson won Jeopardy in 2011, folks tasked with dragging healthcare into the digital landscape naturally eyed the software as a potential solution, and IBM has been happy to oblige. However, “training” Watson in healthcare documentation is proving an extended process. Reporter Brandon Keim writes:
“Where’s the delay? It’s in our own minds, mostly. IBM’s extraordinary AI has matured in powerful ways, and the appearance that things are going slowly reflects mostly on our own unrealistic expectations of instant disruption in a world of Uber and Airbnb.”
Well that, and the complexities of our healthcare system. Though the version of Watson that beat Jeopardy’s human champion was advanced and powerful, tailoring it to manage medicine calls for a wealth of very specific tweaking. In fact, there are now several versions of “Doctor” Watson being developed in partnership with individual healthcare and research facilities, insurance companies, and healthcare-related software makers. The article continues:
“Watson’s training is an arduous process, bringing together computer scientists and clinicians to assemble a reference database, enter case studies, and ask thousands of questions. When the program makes mistakes, it self-adjusts. This is what’s known as machine learning, although Watson doesn’t learn alone. Researchers also evaluate the answers and manually tweak Watson’s underlying algorithms to generate better output.
“Here there’s a gulf between medicine as something that can be extrapolated in a straightforward manner from textbooks, journal articles, and clinical guidelines, and the much more complicated challenge of also codifying how a good doctor thinks. To some extent those thought processes—weighing evidence, sifting through thousands of potentially important pieces of data and alighting on a few, handling uncertainty, employing the elusive but essential quality of insight—are amenable to machine learning, but much handcrafting is also involved.”
Yes, incorporating human judgement is time-consuming. See the article for more on the challenges Watson faces in the field of healthcare, and for some of the organizations contributing to the task. We still don’t know how much longer it will take for the famous AI (and perhaps others like it) to dominate the healthcare field. When that day arrives, will it have been worth the effort?
Cynthia Murrell, July 9, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
Is SharePoint a Knowledge Management Tool?
July 9, 2015
One of the questions information experts are asked most often is, “Is SharePoint a knowledge management tool?” The answer, according to Lucidea, is: it depends. The answer is vague, but a blog post on Lucidea’s Web site explains why: “But Isn’t SharePoint A KM Application?”
SharePoint’s usefulness is explained in this one quote:
“SharePoint is a very powerful and flexible platform for building all sorts of applications. Many organizations have adopted SharePoint because of its promise to displace all sorts of big and little applications. With SharePoint, IT can learn one framework and build out applications on an as-needed basis, rather than buying and then maintaining 1001 different applications, all with various system requirements, etc. But the key thing is that you need someone to build out the SharePoint platform and actually turn it into a useful application.”
The post cannot stress enough the importance of customizing SharePoint to make it function as a knowledge management tool. If that were not enough, SharePoint needs continuous development to keep working well.
Lucidea does explain that SharePoint is not a good knowledge management application if you expect it to be implemented in a short time frame, to focus on a single problem, to be improved by its users, and to meet immediate knowledge management needs.
The biggest thing to understand is that knowledge management is a process. There are applications that can take control of immediate knowledge management needs, but for the long term the actual terms “knowledge” and “management” need to be defined in order to determine what actually needs to be managed.
Whitney Grace, July 9, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
Does America Want to Forget Some Items in the Google Index?
July 8, 2015
The idea that the Google sucks in data without much editorial control is just now grabbing brain cells in some folks. The Web indexing approach has traditionally allowed the crawlers to index what was available without too much latency. If there were servers which dropped a connection or returned an error, some Web crawlers would try again. Our Point crawler just kept on truckin’. I like the mantra, “Never go back.”
Google developed a more nuanced approach to Web indexing. The link thing, the popularity thing, and the hundred plus “factors” allowed the Google to figure out what to index, how often, and how deeply (no, grasshopper, not every page on a Web site is indexed with every crawl).
The notion of “right to be forgotten” amounts to a third party asking the GOOG to delete an index pointer in an index. This is sort of a hassle and can create some exciting moments for the programmers who have to manage the “forget me” function across distributed indexes and keep the eager beaver crawler from reindexing a content object.
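One plausible, and purely hypothetical, way to handle a “forget me” request without ripping pointers out of every shard is a suppression list consulted both when results are assembled and when recrawls are scheduled. The sketch below assumes nothing about Google’s actual plumbing:

# Hypothetical sketch of a "forget me" suppression list. Nothing here
# reflects Google's actual architecture; it only shows the two places a
# removal has to be enforced: result filtering and recrawl scheduling.
FORGOTTEN_URLS = {
    "https://example.com/old-news/embarrassing-story",
}

def filter_results(hits: list[dict]) -> list[dict]:
    """Drop suppressed pointers from results returned by the distributed index."""
    return [h for h in hits if h["url"] not in FORGOTTEN_URLS]

def should_crawl(url: str, last_crawled_days_ago: int) -> bool:
    """Keep the eager crawler from re-adding a suppressed content object."""
    if url in FORGOTTEN_URLS:
        return False
    return last_crawled_days_ago > 30   # arbitrary recrawl interval

hits = [
    {"url": "https://example.com/old-news/embarrassing-story", "score": 0.91},
    {"url": "https://example.com/recipes/soup", "score": 0.42},
]
print(filter_results(hits))
print(should_crawl("https://example.com/old-news/embarrassing-story", 90))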
The Google has to provide this type of third party editing for most of the requests from individuals who want one or more documents to be “forgotten”; that is, no longer included in the Google index against which the public’s queries “hit” for results.
The issue surfaces in “Google Is Facing a Fight over Americans’ Right to Be Forgotten.” The write up states:
Consumer Watchdog’s privacy project director John Simpson wrote to the FTC yesterday, complaining that though Google claims to be dedicated to user privacy, its reluctance to allow Americans to remove ‘irrelevant’ search results is “unfair and deceptive.”
I am not sure how quickly the various political bodies will move to make being forgotten a real thing. My hunch is that it will become an issue with legs. Down the road, the third party editing is likely to be required. The First Amendment is a hurdle, but when it comes time to fund a campaign or deal with winning an election, there may be some flexibility in third party editing’s appeal.
From my point of view, an index is an index. I have seen some frisky analyses of my blog articles and my for fee essays. I am not sure I want criticism of my work to be forgotten. Without an editorial policy, third party, ad hoc deletion of index pointers distorts the results as much as, if not more than, results skewed by advertisers’ personal charm.
How about an editorial policy and then the application of that policy so that results are within applicable guidelines and representative of the information available on the public Internet?
Wow, that sounds old fashioned. The notion of an editorial policy is often confused with information governance. Nope. Editorial policies inform the database user of the rules of the game and what is included and excluded from an online service.
I like dinosaurs too. Like a cloned brontosaurus, is it time to clone the notion of editorial policies for corpus indices?
Stephen E Arnold, July 8, 2015