Basho Riak Gets Developer Love: Syslog Indexing
April 25, 2012
If you are not familiar with Basho Riak, you can work through the www.basho.com Web site, or you can navigate to www.opensearchnews.com and request our profile of the company. (Click on the “Profile” link at the top of the page.) You may want to check out “Full Text Indexing of Syslog Messages with Riak.” The article describes a tool called riak-syslog. The utility sucks up syslog messages and allows the user to search those messages using the Riak full text search system. The write up also points to a post on indexing syslog messages with Solr. Useful.
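To make the idea concrete, here is a minimal sketch of what "index syslog messages, then search them" involves. It is not riak-syslog and it does not call Riak: the regular expression, the `parse_syslog` helper, and the `TinySyslogIndex` class are all hypothetical names, and a small in-memory inverted index stands in for the Riak full text search system. It assumes classic BSD-style syslog lines.

```python
import re
from collections import defaultdict

# Assumed input format: "<Mon DD HH:MM:SS> <host> <tag>[pid]: <message>"
SYSLOG_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s"
    r"(?P<host>\S+)\s"
    r"(?P<tag>[^\s:\[]+)(?:\[(?P<pid>\d+)\])?:\s"
    r"(?P<message>.*)$"
)


def parse_syslog(line):
    """Split one syslog line into fields; return None if it does not match."""
    match = SYSLOG_RE.match(line)
    return match.groupdict() if match else None


class TinySyslogIndex:
    """Hypothetical in-memory stand-in for a full text search backend."""

    def __init__(self):
        self.docs = []                    # parsed records, position = doc id
        self.postings = defaultdict(set)  # term -> set of doc ids

    def add(self, line):
        """Parse and index one line; return False if the line is not syslog."""
        record = parse_syslog(line)
        if record is None:
            return False
        doc_id = len(self.docs)
        self.docs.append(record)
        for term in re.findall(r"[a-z0-9]+", record["message"].lower()):
            self.postings[term].add(doc_id)
        return True

    def search(self, query):
        """Return records whose message contains every term in the query."""
        terms = re.findall(r"[a-z0-9]+", query.lower())
        if not terms:
            return []
        ids = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return [self.docs[i] for i in sorted(ids)]


if __name__ == "__main__":
    index = TinySyslogIndex()
    index.add("Apr 25 10:15:02 web01 sshd[4242]: Failed password for root from 10.0.0.5 port 22 ssh2")
    index.add("Apr 25 10:17:11 db01 kernel: Out of memory: Kill process 991")
    for hit in index.search("failed password"):
        print(hit["host"], hit["tag"], hit["message"])
```

A real deployment would hand the parsed fields to the search system rather than keep them in a Python dictionary; the sketch only shows the parse, tokenize, and look-up steps.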
Stephen E Arnold, April 25, 2012
Sponsored by PolySpot
Big Data: Implications for Open Source and Proprietary Tools
April 21, 2012
During a Web cast at OpenWorld Tokyo this month, Oracle President Mark Hurd zeroed in on the developments his company has made in the area of analytics. The overall theme of the presentation appears in “Oracle’s Mark Hurd Spells Out Analytics Vision.”
Hurd framed his remarks around the perils and promise held in ever-increasing amounts of digital information. “The amount of data on the planet is just huge,” he said. “I have bad news. It’s going to get worse.” He added:
The true question is how to get the right information to the right person at the right time to make the right decision. This is hard.
Come to think of it, all of the other major players in analytics – Microsoft, IBM, and SAP – talk about it in a similar light. The gist is that they’re making Big Data analytics technology available to businesses so that they can delve into both structured and unstructured data to unearth actionable knowledge. That is, minus the risks traditionally associated with it.
Included in the updates that Hurd announced was an upgrade of Hyperion Enterprise Performance Management (EPM) to version 11.1.2.2. This new version has modules for account reconciliation and financial planning, support for Exalytics, and an enhanced user experience, among others. Oracle also announced the release of Endeca Information Discovery, which is a system that’s capable of combining both unstructured and structured data sans modeling.
However, Oracle isn’t the only analytics player that is continuously expanding its feature set. SAP recently launched ActiveEmbedded. Several open source analytics players are going strong as well. Examples of these are Ikanow and Revolution Analytics.
So what does this mean for proprietary solutions?
Enterprises continue to struggle with the amount of data that they have to manage as that amount skyrockets into the petabyte stage. Hence, they also have to upgrade their infrastructure which means bigger costs on top of the license fees of proprietary tools. Open source analytics, aside from being free, allows businesses to create their own custom-fit analytics solution.
However, I believe that while open-source analytics will eventually be more widely used, proprietary technologies will remain viable, and over time we’ll see a blend of both being used by companies to handle big data.
Lauren Llamanzares, April 24, 2012
Sponsored by PolySpot
Cleo: Open Source Search Tools from LinkedIn
March 10, 2012
LinkedIn’s Engineering page provides insights into the site’s inner workings in “Cleo: the Open Source Technology behind LinkedIn’s Typeahead Search.” Open sourced by LinkedIn under the Apache Software License 2.0, Cleo is a “flexible software library for enabling rapid development of partial, out-of-order, real-time typeahead and autocomplete services.”
The typeahead services fall into two broad categories. Generic Typeahead does not take a member’s social network into account. Network Typeahead, on the other hand, does just that; it filters according to the degree of connections in a member’s social network.
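The difference between the two categories can be sketched in a few lines. This is an illustration only, not how Cleo is implemented: the `Entry` class and the `matches`, `generic_typeahead`, and `network_typeahead` functions are hypothetical names, the names in the sample data are made up, and a linear scan stands in for whatever index structures a production service would use. "Partial, out-of-order" is modeled as: every query token must be a prefix of some token in the name, in any order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    member_id: int
    name: str


def matches(query, name):
    """True if every query token is a prefix of some name token, in any order."""
    name_tokens = name.lower().split()
    query_tokens = query.lower().split()
    return bool(query_tokens) and all(
        any(token.startswith(q) for token in name_tokens) for q in query_tokens
    )


def generic_typeahead(query, entries, limit=5):
    """Generic typeahead: ignores who is asking."""
    return [e for e in entries if matches(query, e.name)][:limit]


def network_typeahead(query, entries, connections, limit=5):
    """Network typeahead: only entries inside the member's network qualify.

    `connections` is assumed to be a set of member ids already known to be
    within the desired degree of the searching member.
    """
    return [
        e for e in entries
        if e.member_id in connections and matches(query, e.name)
    ][:limit]


if __name__ == "__main__":
    people = [Entry(1, "Ada Lovelace"), Entry(2, "Alan Turing"), Entry(3, "Adam Lane")]
    print(generic_typeahead("lo ad", people))       # partial, out-of-order query
    print(network_typeahead("a", people, {2}))      # restricted to one connection
```

The network variant is the generic one plus a membership test, which is why the interesting engineering in a real service lies in making that filter cheap at scale.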
LinkedIn Principal Engineer Jingwei Wu reveals:
“Cleo updates in real time: as soon as new members, companies, or groups join LinkedIn, they become immediately searchable through LinkedIn typeahead services. This provides a natural extension to the user search experience and makes it easy for members to engage in social activities such as discovering and connecting with professionals, following companies, and joining groups.”
The article goes into depth from high-level design to samples of code on the inner workings of the typeahead service. See the post for more details. Is LinkedIn an open source player, or is the company positioning itself for more than findability tools?
Cynthia Murrell, March 10, 2012
Sponsored by Pandia.com
Inteltrax: Top Stories, February 20 to February 24
February 27, 2012
Inteltrax, the data fusion and business intelligence information service, captured three key stories germane to search this week, specifically, the happenings within the ever-expanding world of unstructured data.
Our first story, “Unstructured Data Growing at Astronomical Rates,” gives proof that the info known as unstructured data (tweets, videos, blog posts, etc.) is growing wild, but thankfully smart tools are here to tame it.
Another look, “Unstructured Data is Never Perfect,” offers proof of how hard it is to make sense of this info, but sheds light on companies like Digital Reasoning, who are conquering the unstructured.
In a related realm, “Unstructured Data Storage Demands Equally Powerful Software” shows that, in order to make the most of unstructured data, a combination of powerful software and massive storage is necessary.
Unstructured data is easily the biggest buzz word in big data analytics. That’s because our collection of this useful ephemera is growing at massive rates. Luckily, there are countless tools to help us draw useful insights from this confusing data, and we’ll be tracking them every step of the way.
Follow the Inteltrax news stream by visiting www.inteltrax.com
Patrick Roland, Editor, Inteltrax, February 27, 2012
Inforbix Cracks Next Generation Search for SolidWorks Users
February 13, 2012
Search means advertising to most Google users. In an enterprise—according to the LinkedIn discussions about enterprise search—the approach is anchored in the 1990s. The problem is that finding information requires a system which can handle content types that are of little interest to lawyers, accountants, and MBAs running a business today.
Without efficient access to such content as engineering drawings, specifications, quality control reports, and run-of-the-mill office information, costs go up. What’s worse is that more time is needed to locate a prior version of a component or locate the supplier who delivered work on time, on budget, and to the specification. So expensive professionals end up performing what I call Easter egg hunt research. The approach involves looking for colleagues, paging through lists of file names, and the “open, browse, close” approach to information retrieval.
Not surprisingly, the so called experts steer clear of pivotal information retrieval problems. Most search systems pick the ripe apples which are close to the ground. This means indexing Word documents, the versions of information in a content management system, or email.
I learned today that Inforbix, a company we have been tracking because it takes search to the next level, has rolled out two new products. These innovations are data apps which seamlessly aggregate product data from different file types, sources, and locations. The new Inforbix apps will help SolidWorks users get more out of their product data and become more productive while improving decision-making. Plus, Inforbix said that it would expand the data access to SolidWorks EPDM, making it possible for SolidWorks customers to get more from data managed by their PDM system.
The two products are Inforbix Charts and Inforbix Dashboard. Both complement Inforbix Tables which was released in October 2011.
Oleg Shilovitsky, founder of Inforbix, told me:
Manufacturing companies are drowning in the growing amount of product data generated and found within different file types, sources, and company data-silos. They are increasingly using a mix of vendor packages and solutions, all which generate, contain, manage, or store product data, creating a hodgepodge of resources to be combed through. Product data generated in a typical manufacturing company can be both unstructured (valuable BOM and assembly information spread out across different CAD drawings) and structured (CAD drawings within a PDM or PLM system). Our apps are tools that address specific product data tasks such as finding, re-using, and sharing product data. Inforbix can access product data within PDM systems such as ENOVIA SmarTeam and Autodesk Vault and make it available in meaningful ways to CAD and non-CAD users.
When I reviewed the system, I noted that Inforbix’s apps utilize product data semantic technology that automatically infers relationships between disparate sources of data. For example, Inforbix can semantically connect or link a SolidWorks CAD assembly found within EPDM with a related Excel file containing a BOM table stored on a file server in another department.
Inforbix Charts visualizes and presents data saved from Inforbix Tables. The product data is presented in charts that include information to help engineers better manage and run processes by identifying trends and patterns and improving data control. For example, Inforbix Charts visually presents the approval statuses of CAD and ECO documents by author, date approved, last modified date, etc.
Inforbix Dashboard dynamically collects and presents important statistics about engineering and manufacturing data and processes, such as how many versions of a particular CAD drawing currently exist, how many design revisions did it take to complete a CAD drawing, or the number of ECOs processed on time. Easy and intuitive to use, Inforbix Dashboard is an ideal tool for project managers.
SolidWorks users can access Inforbix apps and their product data online. Current Inforbix customers can immediately begin using the Inforbix iPad app, available for free on the Apple App Store at http://www.inforbix.com/inforbix-mobile-search-for-cad-and-product-data-on-the-ipad/. Account access taps existing Inforbix credentials. New users are encouraged to register with Inforbix to enable the iPad app to access product data within their company. The apps soon will be available on Android devices.
A video preview of the iPad app is posted at http://www.inforbix.com/inforbix-ipad-app-first-preview/. For more information on Inforbix apps, visit http://www.inforbix.com.
Inforbix is a company on the move.
Stephen E Arnold, February 13, 2012
Sponsored by Pandia.com
MapMaking Used to Prevent Public Health Threats
February 10, 2012
Science Blogs recently reported on a new tool that blows Google Maps out of the water in the article, “New Mapping Tools Bring Public Health Surveillance to the Masses.”
According to the article, HealthMap is a team of researchers, epidemiologists and software developers at Children’s Hospital Boston who use online sources to track disease outbreaks and deliver real-time surveillance on emerging public health threats. They also enlist local residents to help with research.
Blogger Kim Krisberg writes:
“HealthMap, which debuted in 2006, scours the Internet for relevant information, aggregating data from online news services, eyewitness reports, professional discussion rooms and official sources. The result? The possibility to map disease trends in places where no public health or health care infrastructures even exist, Brownstein told me. And because HealthMap works non-stop, continually monitoring, sorting and visualizing online information, the system can also serve as an early warning system for disease outbreaks.”
Mapmaking and public health are hardly strangers. Public health practitioners use maps to guide interventions. Despite the complexity of most disease outbreaks, maps can still help health professionals raise public awareness about prevention and target interventions in ways that make the most of limited resources.
Jasmine Ashton, February 10, 2012
Sponsored by Pandia.com
Pingar Sets Up Shop in Silicon Valley
February 1, 2012
Pingar, smaller than Google’s catering staff, sets up shop in Silicon Valley. The Bay of Plenty Times announces, “Tauranga Firm Sets Up Silicon Valley Base.” The New Zealand publication reports that co-founders Peter and Jacqui Wren-Hilton were impressed by the size of the big dogs’ campuses when they visited. Pingar follows three other New Zealand tech companies into Silicon Valley: Endace, Xero, and SLI Systems.
In addition to the Valley, Pingar has offices in two New Zealand locations and in London, Hong Kong, Bangalore, and, soon, Singapore. Its innovative search engine works by asking specific questions. The company also offers an API, with 18 components accessible to developers. It is looking to break into the scanner market, with a unique product that automatically applies metadata to scanned documents. Yes, that would be helpful!
The company was recognized by the Silicon Valley Association of Startup Entrepreneurs as one of 30 hot emerging tech companies from around the world. Pingar is growing into its success; the article notes:
Twelve months ago Pingar employed 12 people, now the number is 30 and Mr Wren-Hilton predicts the staff will double to 60 by the end of next year; involving 20 in research and development, and 40 in business development, marketing and support services.
Twenty-five of them will be based in Auckland and Tauranga, and 35 will be overseas, including seven in Silicon Valley.
Nicely played, Pingar.
Cynthia Murrell, February 1, 2012
Sponsored by Pandia.com
File Extension List
January 28, 2012
Need a handy list of all known file extensions and types? Look no further. Nosa Lee at Seek The Sun Slowly has kindly provided such a list in “The Known File Extensions/ Types References – A” through “Z.” In a translation from the original Chinese, the listing explains:
Now, I collected all the known file extensions/types for your reference, I grouped them according to the first character due to there are too many file extensions/types.
Yes, there’s a page for each letter, and even “Number” and “Symbol.” To download them all in one fell swoop, click here.
I knew there were a lot of file types, but seeing them all in one place really puts the matter into perspective.
Cynthia Murrell, January 28, 2012
Sponsored by Pandia.com
Talend Pitches Holistic Integration
December 21, 2011
Connectors get some new lingo; holistic integration is a term we learned from Talend’s press release, “Talend V5: Democratizing Holistic Integration.” The company defends its coinage of the term:
Frankly, IT often uses loosely some terms from the general corpus. But in this case, holistic does the trick. . . . The promise of Talend v5 is to enable IT organizations to converge traditionally disparate integration efforts and practices through a common set of products, tools and best practices. When an organization deploys Talend v5, it will deploy essentially one platform, regardless of the integration need: data integration, application integration, process integration.
That does fit the definition of the term, but it is a little grand, don’t you think? Hmm, maybe not in a field titled “Big Data.”
Talend positions this release as the result of the changes its products have undergone since it bought the German Sopera this time last year. The company is quick to point out that this comprehensive approach does not result in bloatware. Each product included in the platform works independently; customers need deploy only the parts they need.
The write up emphasizes that Talend’s products are still based on the open source underpinnings on which they were founded. The company boasts of being a leader in the open source data management market.
Cynthia Murrell, December 21, 2011
Sponsored by Pandia.com
Predictions on Big Data Miss the Real Big Trend
December 18, 2011
Athena the goddess of wisdom does not spend much time in Harrod’s Creek, Kentucky. I don’t think she’s ever visited. However, I know that she is not hanging out at some of the “real journalists’” haunts. I zipped through “Big Data in 2012: Five Predictions”. These are lists which are often assembled over a lunch time chat or a meeting with quite a few editorial issues on the agenda. At year’s end, the prediction lunch was a popular activity when I worked in New York City, which is different in mental zip from rural Kentucky.
The write up churns through some ideas that are evident when one skims blog posts or looks at the conference programs for “big data.” For example—are you sitting down?—the write up asserts: “Increased understanding of and demand for visualization.” There you go. I don’t know about you, but when I sit in on “intelligence” briefings in the government or business environment, I have been enjoying the sticky tarts of visualization for years. Nah, decades. Now visualization is a trend? Helpful, right?
Let me identify one trend which is, in my opinion, an actual big deal. Navigate to “The Maximal Information Coefficient.” You will see a link and a good summary of a statistical method which allows a person to process “big data” in order to determine if there are gems within. More important, the potential gems pop out of a list of correlations. Why is this important? Without MIC methods, the only way to “know” what may be useful within big data was to run the process. If you remember guys like Kolmogorov, the “we have to do it because it is already as small as it can be” issue is an annoying time consumer. To access the original paper, you will need to go to the AAAS and pay money.
The abstract for “Detecting Novel Associations in Large Data Sets” by David N. Reshef, Yakir A. Reshef, Hilary K. Finucane, Sharon R. Grossman, Gilean McVean, Peter Turnbaugh, Eric S. Lander, Michael Mitzenmacher, Pardis C. Sabeti, Science, December 16, 2011, is:
Identifying interesting relationships between pairs of variables in large data sets is increasingly important. Here, we present a measure of dependence for two-variable relationships: the maximal information coefficient (MIC). MIC captures a wide range of associations both functional and not, and for functional relationships provides a score that roughly equals the coefficient of determination (R^2) of the data relative to the regression function. MIC belongs to a larger class of maximal information-based nonparametric exploration (MINE) statistics for identifying and classifying relationships. We apply MIC and MINE to data sets in global health, gene expression, major-league baseball, and the human gut microbiota and identify known and novel relationships.
Stating a very interesting although admittedly complex numerical recipe in a simple way is difficult, but I think this paragraph from “The Maximal Information Coefficient” does a very good job:
The authors [Reshef et al] go on showing that the MIC (which is based on “gridding” the correlation space at different resolutions, finding the grid partitioning with the largest mutual information at each resolution, normalizing the mutual information values, and choosing the maximum value among all considered resolutions as the MIC) fulfills this requirement, and works well when applied to several real world datasets. There is a MINE Website with more information and code on this algorithm, and a blog entry by Michael Mitzenmacher which might also link to more information on the paper in the future.
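The recipe in that paragraph (grid, score with mutual information, normalize, take the maximum) can be sketched in plain Python. This is a simplified approximation, not the authors' MINE code: instead of searching for the best partition at each resolution, it tries only equal-frequency grids, so it can only underestimate the score an exhaustive search would find. The function names are hypothetical, the grid-size budget of n^0.6 follows the default suggested in the paper, and tied values are split across bins arbitrarily.

```python
import math
from collections import Counter


def equal_frequency_bins(values, n_bins):
    """Assign each value a bin index so bins hold roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins


def mutual_information(xb, yb):
    """Mutual information (in nats) between two sequences of bin labels."""
    n = len(xb)
    joint = Counter(zip(xb, yb))
    px, py = Counter(xb), Counter(yb)
    return sum(
        (c / n) * math.log((c * n) / (px[i] * py[j]))
        for (i, j), c in joint.items()
    )


def approx_mic(x, y, exponent=0.6):
    """Approximate MIC using equal-frequency grids only (a simplification)."""
    n = len(x)
    if n != len(y) or n < 4:
        raise ValueError("x and y must have the same length, at least 4")
    budget = max(4, int(n ** exponent))  # largest allowed number of grid cells
    best = 0.0
    for nx in range(2, budget // 2 + 1):
        xb = equal_frequency_bins(x, nx)
        for ny in range(2, budget // nx + 1):
            yb = equal_frequency_bins(y, ny)
            # Normalize so that scores from different grid shapes are comparable.
            score = mutual_information(xb, yb) / math.log(min(nx, ny))
            best = max(best, score)
    return best


if __name__ == "__main__":
    xs = list(range(200))
    print("line     :", approx_mic(xs, [2 * v + 1 for v in xs]))
    print("parabola :", approx_mic(xs, [(v - 100) ** 2 for v in xs]))
```

The parabola case is the instructive one: a symmetric U-shaped relationship has essentially no linear correlation, yet a grid-based score picks it up, which is the "functional and not" coverage the abstract describes.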
Another take on the MIC innovation appears in “Maximal Information Coefficient Teases Out Multiple Vast Data Sets”. Worth reading as well.
Forbes will definitely catch up with this trend in a few years. For now, methods such as MIC point the way to making “big data” a more practical part of decision making. Yep, a trend. Why? There’s a lot of talk about “big data” but most organizations lack the expertise and the computational know how to perform meaningful analyses. Similar methods are available from Digital Reasoning and the Google love child Recorded Future. Palantir is more into the make pictures world of analytics. For me, MIC and related methods are not just a trend; they are the harbinger of processes which make big data useful, not a public relations, marketing, or PowerPoint chunk of baloney. Honk.
Stephen E Arnold, December 18, 2011
Sponsored by Pandia.com, a company located where high school graduates actually can do math.