Lucidworks (Really?) Does Fusion Too
July 23, 2015
I read “Lucidworks Delivers Fusion 2.0 with Spark Integration.” The idea is that search is not exactly flying off the shelves. Why not download Elasticsearch and move on? The way to make search relevant is to make it a Big Data thing. This is the hard to believe path IBM took with Vivisimo’s technology. Where is Vivisimo in the IBM revenue picture? Well, that picture seems gloomy. Maybe the Big Data thing doesn’t work particularly well.
In terms of venture backed Lucidworks, the write up explains:
Fusion 2.0 provides an organization with access to a streamlined, consumer-like search experience with enterprise-grade speed and scalability. The new release integrates Lucidworks’ Fusion with Apache Spark to enable real-time data analytics. Fusion 2.0 also features a new version of the company’s SiLK user interface (UI) that simplifies dashboard visualizations and enhances the user experience. The SiLK UI runs on top of Fusion and the Apache Solr search platform, upon which Fusion is based. SiLK gives users the power to perform ad-hoc search and analysis of massive amounts of multi-structured and time series data. Users can swiftly transform their findings into visualizations and dashboards.
I think I understand. Wrappers of software provide more developer-friendly tools. There may be one slight hitch in the git along. Those familiar with the technology of open source and fluent in the mumbo jumbo jargon that Lucid and other repositioning enterprise search vendors employ may not comprise a giant pool of prospects.
In short, writing wrappers is hard work. Dealing with fusion in an effective manner is harder work. Eliminating the latency that accompanies layers and handoffs is the hardest work of all.
The challenge will be generating substantial organic revenue and having enough profit to satisfy the investors who have been very patient with the Lucidworks outfit. No, really.
Stephen E Arnold, July 23, 2015
IBM SAP Versus SAS: A Faux Dust Up
July 22, 2015
Ah, the freebie statistics are like gnats. One or two make no difference when one is eating a chicken leg. Toss in 20,000 or more and the leg eating becomes a chore.
I read an oblique write up called “SAS UK Chief: Envious Rivals, Skills Gap and Analytics in the Cloud.” The topics are interesting because they are mixed together, a fruit salad to go with that picnic chicken.
The write up begins with a statement attributed to an IBM SAP executive along the lines: “SAS could be entirely replaced.” That seems a bit of fortune telling which might not be entirely in line with some SAS users’ plans. IBM, as you may know, is fresh from 13 straight quarters of revenue decline. I interpreted the feisty comment as a signal to IBM management that the much loved SAP division is replete with machismo and doing its bit to increase revenues. There’s nothing like a statistics squabble to pump up the sales spice.
As I understand the write up, that allegedly “put ‘em up, chump” statement caused an SAS executive to flounder. SAS’s problem is that it is still a little chunk of graduate school. SAS faces competition from upstarts like Talend. SAP, on the other hand, is chasing consulting and giant IBM cloud-type things. But the two outfits are old school operations. For proof just ask a graduate student in statistics.
The reality is that both SAP and SAS may be victims of the same market shifts. In order to get either company’s products to deliver a perfect grilled chicken, one has to know about statistics and have resources (money, gentle reader).
Big companies are okay with these requirements. But the buzz in the analytics world is for open source, point and click, ready to run solutions. The outputs of these next generation systems may not meet the standards of the SAPs and the SASs of the world, but the customers don’t care.
These two firms are facing many gnats. Neither is going to have a pleasant meal. The good old days of sunshine, blue skies, and a bug free experience are gone.
Stephen E Arnold, July 22, 2015
Big Data Vendor List
July 19, 2015
I scanned the Big Data list. I won’t linger too long. You can too. (Apologies to Robert Frost and “The Pasture.” The clarity part I will leave to you.)
The list appears in this article: “42 Big Data Startups.” One reader added 16 other companies. I am unclear. I tried to “wait to watch the water clear” but it did not.
Main thoughts:
- What’s a start up? A number of the companies in the list have been around for a while; for example, Talend was founded in 2005. Let’s see, despite the muddy water, that works out to a decade.
- Why is there just one company with “search” solutions on the list? The search-aware outfit is Datastax. But the company’s information access capability was not mentioned. The list totters as a result like the “little calf that’s standing by the mother.”
- What’s the rationale for clumping in an earthworm type laundry list services, software, applications that sit on top of data management systems, and outfits which focus on a niche like geolocation or search engine optimization? There are no horses, sheep, or pigs in the Frost poem. At least, I did not discern any nor did the person who came along.
Listicles can be interesting, humorous, and informative. Lists without logic are not particularly useful unless one is eager to demonstrate the importance of specified criteria and some sort of useful classification of the items in the list.
Stephen E Arnold, July 19, 2015
Kashman to Host Session at SharePoint Fest Seattle
July 14, 2015
Mark Kashman, Senior Product Manager at Microsoft, will deliver a presentation at the upcoming SharePoint Fest Seattle in August. All eyes remain peeled for any news about the new SharePoint Server 2016 release, so his talk entitled, “SharePoint at the Core of Reinventing Productivity,” should be well watched. Benzinga gives a sneak peek with their article, “Microsoft’s Mark Kashman to Deliver Session at SharePoint Fest Seattle.”
The article begins:
“Mark Kashman will deliver a session at SharePoint Fest Seattle on August 19, 2015. His session will be held at the Washington State Convention Center in downtown Seattle. SharePoint Fest is a two-day training conference (plus an optional day of workshops) that will have over 70 sessions spread across multiple tracks that brings together SharePoint enthusiasts and practitioners with many of the leading SharePoint experts and solution providers in the country.”
Stephen E. Arnold is also keeping an eye out for the latest news surrounding SharePoint and its upcoming release. His Web service ArnoldIT.com efficiently synthesizes and summarizes essential tips, tricks, and news surrounding all things search, including SharePoint. The dedicated SharePoint feed can save users time by serving as a one-stop-shop for the most pertinent pieces for users and managers alike.
Emily Rae Aldridge, July 14, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
SAS Explains Big Data. Includes Cartoon, Excludes Information about Cost
July 13, 2015
I know that it is easy to say Big Data. It is easy to say Hadoop. It is easy to make statements in marketing collateral, in speeches, and in blogs written by addled geese. Honk!
I wish to point out that any use of these terms in the same sentence requires an important catalyst: Money. Money that has been, in the words of the government procurement officer, “allocated, not just budgeted.”
Here are the words:
- Big Data
- Hadoop
- Unstructured data.
Point your monitored browser at “Marketers Ask: What Can Hadoop Do That My Data Warehouse Can’t?” The write up originates with SAS. When a company is anchored in statistics, I expect some familiarity with numbers. (Yep, just like the class you have blocked from your mind. The mid term? What mid term?)
The write up points out that unstructured data comes in many flavors. This chart, complete with cartoon, identifies 15 content types. I was amazed. Just 15. What about the data in that home brew content management system or tucked in the index of the no longer supported DEC 20 TIPS system? Yes, that data.
How does Hadoop deal with the orange and blue? Pretty well but you and the curious marketer must attend to three steps. Count ‘em off, please:
- Identify the business issue. I think this means know what problem one is trying to solve. This is a good idea, but I think most marketing problems boil down to generating revenue and proving it to senior management. Marketing looks for silver bullets when the sales are not dropping from the sky like packages for the believers in the Cargo Cult.
- Get top management support. Yep, this is a good idea because the catalyst—money—has to be available to clean, acquire, and load the goodies in the blue boxes and the wonky stuff from the home brew CMS.
- Develop a multi play plan. I think this means that the marketer has zero clue how complicated the Hadoop magic is. The excitement of extract, transform, and load. The thrill of batch processing awaits. Then the joy of looking at outputs which baffle the marketer more comfortable selecting colors and looking at Adwords’ reports than Hadoop data.
My thought is that SAS understands data, statistical methods, and the reality of a revolution which is taking place without the strictures of SAS approaches.
I do like the cartoon. I do not like the omission of the money part of the task. Doing the orange and blue thing for marketers is expensive. Do the marketers know this?
Nope.
Stephen E Arnold, July 13, 2015
Business Intelligence: The Grunt Work? Time for a Latte
July 10, 2015
I read “One Third of BI Pros Spend Up to 90% of Time Cleaning Data.” Well, well, well. Good old and frail eWeek has reported what those involved in data work have known for what? Decades, maybe centuries? The write up states with typical feather duster verbiage:
A recent survey commissioned by data integration platform provider Xplenty indicates that nearly one-third of business intelligence (BI) professionals are little more than “data janitors,” as they spend a majority of their time cleaning raw data for analytics.
What this means is that the grunt work in analytics still has to be done. This is difficult and tedious work even with normalization tools and nifty hand crafted scripts. Who wants to do this work? Not the MBAs who need slick charts to nail their bonus. Not the frantic marketer who has to add some juice to the pale and wan vice president’s talk at the Rotary Club. Not anyone, except those who understand the importance of scrutinizing data.
The write up points out that extract, transform, and load functions, or ETL in the jingoism of Sillycon Valley, are work. Guess what? The eWeek story uses these words to explain what the grunt work entails:
- Integrating data from different platforms
- Transforming data
- Cleansing data
- Formatting data.
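For those who have never pushed the broom, here is a minimal sketch of the four chores in the list above. Everything in it is an assumption made for illustration: the two "platforms," the field names, the date layouts, and the sample rows are invented and do not come from the eWeek story or the Xplenty survey.

```python
from datetime import datetime

# Hypothetical rows from two made-up platforms; every field name is an assumption.
crm_rows = [
    {"email": " Ann@Example.com ", "signup": "2015-07-01", "spend": "120.50"},
    {"email": "bob@example.com", "signup": "07/03/2015", "spend": "n/a"},
]
web_rows = [
    {"user_email": "ann@example.com", "visits": 4},
    {"user_email": "carol@example.com", "visits": 9},
]

def parse_date(text):
    """Formatting: accept the two date layouts in the sample, else return None."""
    for layout in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(text, layout).date()
        except ValueError:
            continue
    return None

def parse_amount(text):
    """Transforming: turn a text amount into a float, else return None."""
    try:
        return float(text)
    except ValueError:
        return None

def clean(crm, web):
    # Integrating: index the web rows by a normalized email key.
    visits = {row["user_email"].strip().lower(): row["visits"] for row in web}
    cleaned, needs_care = [], []
    for row in crm:
        email = row["email"].strip().lower()  # Cleansing: trim and lowercase.
        record = {
            "email": email,
            "signup": parse_date(row["signup"]),
            "spend": parse_amount(row["spend"]),
            "visits": visits.get(email, 0),
        }
        # Any field that could not be parsed sends the row to the hand-work pile.
        (needs_care if None in record.values() else cleaned).append(record)
    return cleaned, needs_care

cleaned, needs_care = clean(crm_rows, web_rows)
print(len(cleaned), "clean;", len(needs_care), "need hand work")
```

Note that the sketch only sorts rows into two piles. Deciding what to do with the second pile is where the janitor's hours go.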
But here’s the most important item in the article: If the report on which the article is based is correct, 21 percent of the data require special care and feeding. How’s that grab you for a task when you are pumping a terabyte of social media or intercept data a day? Right. Time for a bit of Facebook and a trip to Starbucks.
What happens if the data are not ship shape? Well, think about the fine decisions flowing from organizations which are dependent on data analytics. Why not chase down good old United Airlines and ask the outfit if anyone processed log files for the network which effectively grounded all flights? Know anyone at the Office of Personnel Management? You might ask the same question.
Ignoring data or looking at outputs without going through the grunt work is little better than guessing. No, wait. Guessing would probably return better outcomes. Time for some Foosball.
Stephen E Arnold, July 10, 2015
HP: A Trusted Source for Advice about Big Data?
July 9, 2015
Remember that Hewlett Packard bought Autonomy. As part of that process, the company had access to data, Big Data. There were Autonomy financials; there were documents from various analysts and experts; there were internal analyses. The company’s Board of Directors and the senior management of the HP organization decided to purchase Autonomy for $11 billion in October 2011. I assume that HP worked through these data in a methodical, thorough manner, emulating the type of pre-implosion interest in detail that made Arthur Andersen a successful outfit until the era of the 2001 Enron short cut and the firm’s implosion. A failure to deal with data took out Andersen, and I harbor a suspicion that HP’s inability to analyze the Autonomy data has been an early warning of issues at HP.
I was lugging my mental baggage with me when I read “Six Signs That Your big Data Expert, Isn’t?” I worked through the points cited in the write up which appeared in the HP Big Data Blog. Let me highlight three of these items and urge you, gentle reader, to check out the article for the six pack of points. I do not recommend drinking a six pack when you peruse the source because the points might seem quite like the statements in Dr. Benjamin Spock’s book on child rearing.
Item 2 from the list of six: “They [your Big Data experts] talk about technology, rather than the business.” Wow, this struck a chord with me when I considered HP’s spending $11 billion and then writing off $7 or $8 billion, blaming Autonomy for tricking Hewlett Packard. My thought was, “Maybe HP is the ideal case study to be cited when pointing out that someone is focusing on the wrong thing.” For example, Autonomy’s “black box” approach is nifty, but it has been in the market since 1995-1996. The system requires care and feeding, and it can be a demanding task mistress to set up, configure, optimize, and maintain. For a buyer not to examine the “Big Data” relevant to 15 years of business history strikes me as skipping an important and basic step in the acquisition process. Did HP talk about the Autonomy business, or did HP get tangled in the Dynamic Reasoning Engine, the Intelligent Data Operating Layer, patents, and Bayesian-LaPlacian-Markovian methods?
Item 4 from the list of six: “They [your Big Data experts] talk about conclusions rather than correlations.” Autonomy, as I reported in the first three editions of the late, lamented Enterprise Search Report, grew its revenue through a series of savvy acquisitions. The cost and sluggishness of enterprise software sales related to IDOL needed some vitamin supplements. Dr. Mike Lynch and his capable management team built the Autonomy revenue base by nosing into video, content management, and fraud detection. IDOL was the “brand,” and the revenue flowed from the sale of a diverse line up of products and services. My hypothesis is that the HP acquisition team looked at the hundreds of millions in Autonomy revenue and concluded, “Hey, this is a no brainer for us. We can sell much more of this IDOL thing. Our marketing is much more effective than that of the wonks in Cambridge. Our HP sales force is more capable than the peddlers Autonomy has on the street.” HP then jumped to the conclusion that it could take $700 or $800 million in existing revenue and push it into the stratosphere. Well, how is that working out? Again, I would suggest that the case to reference in this Item 4 is HP itself.
Item 6 from the list of six: “They [your Big Data experts] talk about data quality, rather than data validity.” This is an interesting item. In the land of databases, the meaning of data quality is often conflated with consistency; that is, ingesting the data does not generate exceptions during processing. An exception is a record which the content processing system kicks out as malformed. The notion of data validity means that the data that makes it into a database is accurate by some agreed upon yardstick. Databases can be filled with misinformation, disinformation, and reformed information like a flood of messages from Mr. Putin’s social media campaigns. HP may have accepted estimates from Qatalyst Partners, its own in house team, and from third party due diligence firms. HP’s senior management used these data, which I assume were neither too little nor too big, to shore up their decision to buy Autonomy for $11 billion. As HP learned, data, whether meaty or scrawny, may be secondary to the reasoning process applied to the information. Well, HP demonstrated that it made a slip up in its understanding of Autonomy. I would have liked to see this point include a reference to HP’s Autonomy experience.
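The difference between a record that loads and a record that is right can be shown in a few lines. What follows is a rough sketch, not anyone's production method: the company names, the revenue figures, the "audited" yardstick, and the five percent tolerance are all invented for the illustration.

```python
# Hypothetical revenue claims; names, numbers, and fields are assumptions.
records = [
    {"company": "Acme", "revenue_musd": "870"},
    {"company": "Beta", "revenue_musd": "eight hundred"},  # malformed
    {"company": "Gamma", "revenue_musd": "950"},           # well formed, but wrong
]
# The agreed upon yardstick, also made up: audited revenue in millions.
audited = {"Acme": 870, "Gamma": 700}

def ingest(rows):
    """Quality as consistency: keep rows that parse, kick out the exceptions."""
    loaded, exceptions = [], []
    for row in rows:
        try:
            loaded.append(
                {"company": row["company"], "revenue_musd": int(row["revenue_musd"])}
            )
        except (KeyError, ValueError):
            exceptions.append(row)
    return loaded, exceptions

def invalid(loaded, yardstick, tolerance=0.05):
    """Validity: flag loaded rows that miss the yardstick by more than tolerance."""
    bad = []
    for row in loaded:
        truth = yardstick.get(row["company"])
        if truth is None or abs(row["revenue_musd"] - truth) > tolerance * truth:
            bad.append(row["company"])
    return bad

loaded, exceptions = ingest(records)
print("kicked out as malformed:", [row["company"] for row in exceptions])
print("loaded cleanly but not valid:", invalid(loaded, audited))
```

The Gamma row sails through ingestion without an exception. Only the comparison against the yardstick catches it, which is the point of Item 6.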
Net net: HP is pitching advice related to Big Data. That’s okay, but I find that a company which appears to have struggled with Big Data related to the Autonomy acquisition may not be the best, most objective, and most reliable source of advice.
Talk is easy. Performance is difficult. HP is mired in a break up plan. The company has not convinced me that it is able to deal with Big Data. Verbal assurances are one thing; top line performance and profits, happy customers, and wide adoption of Autonomy technology are another.
The other three points can be related to Autonomy. I will leave it to you, gentle reader, to map HP’s adult-sounding advice to HP’s actual use of Big Data. As the HP blog’s cartoon says, “Well, maybe.”
Stephen E Arnold, July 9, 2015
Computational Constraints: Big Data Are Big
July 8, 2015
Navigate to “Genome Researchers Raise Alarm over Big Data.” The point of the write up is that “genome data will exceed the computing challenges of YouTube and Twitter.” This may be a surprise to the faux Big Data experts. The write up points out:
… they [computer wizards] agree that the computing needs of genomics will be enormous as sequencing costs drop and ever more genomes are analyzed. By 2025, between 100 million and 2 billion human genomes could have been sequenced, according to the report, which is published in the journal PLoS Biology. The data-storage demands for this alone could run to as much as 2 to 40 exabytes (1 exabyte is 10^18 bytes), because the number of data that must be stored for a single genome are 30 times larger than the size of the genome itself, to make up for errors incurred during sequencing and preliminary analysis.
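A back-of-envelope check shows how quickly the exabytes pile up. This is a sketch under stated assumptions: the three billion bases per genome and the two bits per base are my round numbers, not figures from the article, and the genome counts are illustrative. Only the 30-times multiplier comes from the quote, and an exabyte is taken as 10^18 bytes.

```python
# Assumptions, not figures from the article:
BASES_PER_GENOME = 3_000_000_000  # assumed: roughly three billion bases
BITS_PER_BASE = 2                 # assumed: four letters fit in two bits

# From the quote: stored data runs 30 times the size of the genome itself.
OVERHEAD = 30
EXABYTE = 10**18                  # bytes

def storage_exabytes(genomes):
    """Estimate storage in exabytes for a given number of sequenced genomes."""
    bytes_per_genome = BASES_PER_GENOME * BITS_PER_BASE / 8
    return genomes * bytes_per_genome * OVERHEAD / EXABYTE

# Illustrative genome counts: one hundred million and one billion.
for genomes in (100_000_000, 1_000_000_000):
    print(f"{genomes:,} genomes -> {storage_exabytes(genomes):.2f} exabytes")
```

Under these assumptions each genome needs about 22.5 gigabytes once the overhead is included, so the totals reach exabytes as soon as the genome count passes tens of millions.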
Until computing resources are sufficiently robust and affordable, the write up states:
Nevertheless, Desai [an expert] says, genomics will have to address the fundamental question of how much data it should generate. “The world has a limited capacity for data collection and analysis, and it should be used well. Because of the accessibility of sequencing, the explosive growth of the community has occurred in a largely decentralized fashion, which can’t easily address questions like this,” he says. Other resource-intensive disciplines, such as high-energy physics, are more centralized; they “require coordination and consensus for instrument design, data collection and sampling strategies”, he adds. But genomics data sets are more balkanized, despite the recent interest of cloud-computing companies in centrally storing large amounts of genomics data.
Will the reality of Big Data increase awareness of the need for Little Data; that is, trimmed sets? Nah, probably not.
Stephen E Arnold, July 8, 2015
Walmart and the Big Data Elephant Riders
July 7, 2015
Navigate to the Capitalist Tool’s write up “Walmart: The big Data Skills Crisis and Recruiting Analytics Talent.” Stating the obvious is something that most jargon delivery mechanisms avoid. Why be clear when obfuscation provides so many MBA-type chuckles?
The write up states about Big Data:
There just aren’t enough people with the required skills to analyze and interpret this information–transforming it from raw numerical (or other) data into actionable insights – the ultimate aim of any Big Data-driven initiative.
I had to sit down. Imagine. Specific skills are required to assemble data, formulate hypotheses, configure the numerical recipes, obtain outputs, and then analyze what the magic of math delivers.
Who would have thought that the average marketer might be a tiny bit under equipped to deal with Big Data in the here and now?
The write up states:
Last year, they [sic. The reference is to the single firm Walmart] turned to crowd sourced analytics competition platform Kaggle. At Kaggle, an army of “armchair data scientists” apply their skills to analytical problems submitted by companies, with the designer of the best solution being rewarded – sometimes financially, in this case with a job.
That’s a great solution. No problem with confidentiality in the crowdsourcing ecosystem. But Walmart hired candidates. Walmart explains what it seeks:
“Fundamentally,” says Thakur [Walmart manager], “we need people who are absolute data geeks–people who love data, and can slice it, dice it and make it do what they want it to do.”
Walmart also uses an “analytics rotation program.” I assume this is designed to ensure that the Big Data analytics wizard can “run in the right direction.”
Walmart, it appears, is the leader in using crowd sourced methods for finding talent. Perhaps Walmart perceives itself as one of the leaders in the use of this method. It is good to be a visionary in Walmart land. What is Walmart’s next innovation? I cannot anticipate the next revolutionary breakthrough from the retailer many local retail stores perceive as a good neighbor made better with Big Data.
Stephen E Arnold, July 7, 2015
An Oddly Mystical, Whimsical Listicle Combining Big Data and Search
July 4, 2015
Some listicles are clearly the work of college students after a tough beer pong tournament. Others seem as if they emanate from beyond Pluto’s orbit. I am not sure where on this spectrum between the addled and extraterrestrial the listicle in “Top 11 Open Source big Data Enterprise Search Software” falls.
Here’s the list for your contemplation. I have added some questions after each company’s name. Consult the original write up for the explanation of the inclusion of these systems in the list. I found the write ups without much heft or “wood” to use a Google term.
- Apache Solr. Yep, uses Lucene libraries, right. Performance? Exciting sometimes.
- Apache Lucene Core. Ah, Lego blocks for the engineer with some aspirations for continuous employment.
- Elasticsearch. The leader in search and retrieval. To do big data, there are some other components required. Make sure your programming and engineering expertise are up to the job.
- Sphinx. Okay, workable for structured data. Work required to stuff unstructured content into this system.
- Constellio. Isn’t this a part time project of a consulting firm focused on Canadian government work?
- DataparkSearch Engine. Yikes.
- ApexKB. Okay, a script. For enterprise applications. Big Data? Wow.
- Searchdaimon ES. Useful, speedier than either Lucene or Elasticsearch. Not a big data engine without some extra work. Come to think of it. A lot of work.
- mnoGoSearch. Well, maybe for text.
- Nutch. Long in the tooth. Why not use Lucene?
- Xapian. Very robust. Make certain that you have programming expertise and engineering knowledge. Often ignored which is too bad. But be prepared for some heavy lifting or paying a wizard with a mental fork lift to do the job.
Now which of these systems can do “big data”? In one sense, if you are exceptionally gifted with engineering and programming skills, I suppose any of these can do tricks. As Samuel Johnson allegedly observed to his biographer:
“Sir, a woman’s preaching is like a dog’s walking on his hind legs. It is not done well; but you are surprised to find it done at all.”
On the other hand, these programs can be used as a utility within a more robust content processing system which has been purpose built to deal with large flows of structured and unstructured content. But even that takes work.
Anyone want to give Constellio a shot at processing real time Facebook posts? Anyone want to use any of these systems to solve that type of search problem? Show of hands, please?
Stephen E Arnold, July 4, 2015