December 18, 2013
Interesting: it seems the venerated Thomas Bayes is now with us in database land. BayesDB is being developed, in conjunction with an analysis method called CrossCat, by a team of folks from MIT’s Probabilistic Computing Project and the Shafto Lab at the University of Louisville.
The project’s page explains:
“BayesDB, a Bayesian database table, lets users query the probable implications of their data as easily as a SQL database lets them query the data itself. Using the built-in Bayesian Query Language (BQL), users with no statistics training can solve basic data science problems, such as detecting predictive relationships between variables, inferring missing values, simulating probable observations, and identifying statistically similar database entries.
BayesDB is suitable for analyzing complex, heterogeneous data tables with up to tens of thousands of rows and hundreds of variables. No preprocessing or parameter adjustment is required, though experts can override BayesDB’s default assumptions when appropriate.
BayesDB’s inferences are based in part on CrossCat, a new, nonparametric Bayesian machine learning method, that automatically estimates the full joint distribution behind arbitrary data tables.”
The database is designed for two types of folks: those with no statistics chops who nonetheless have tabular data to analyze, and those proficient with statistics who have a non-standard problem or who have no time or patience for custom modeling. The team credits CrossCat in part with making BayesDB possible, but also says the BQL language was key to its development.
The description includes examples, a discussion of which types of data and problems the database addresses best, reasons to trust the results, why they named it BayesDB, and more. Check out the page for all the details.
Cynthia Murrell, December 18, 2013
December 16, 2013
Through their blog, Attivio weighs in on the HealthCare.gov service: “Could IBM or Oracle Have Been the Miracle Cure for Healthcare.gov?” The telling subtitle reads, “if you believe that, then I have a bridge to sell you.” Yes, Attivio comes out against pinning all the blame on a refusal to go with the tried and true (or outdated and limited, depending on one’s perspective).
Senior Attivio marketing VP (and blogger) MaryAnne Sinville observes that the latest trend in the finger-pointing crusade is to assert that the site’s database component should have gone to an old stalwart like IBM, Oracle, or Microsoft instead of to the NoSQL firm MarkLogic. Not because those databases are better suited to the project, necessarily, but because it is easier to find technicians familiar with those systems.
“Does anyone really believe a better solution to a project involving many disparate sources of information, complex logic, and a dynamic interface, which must be built in a very short timeframe would have been to select IBM, Microsoft or Oracle? The idea that legacy mega-vendors have the agility required for a project of this scope is absurd, as the states of Oregon, Pennsylvania and the US Air Force have all recently learned the hard way.
Let’s take a look at the real issues at play here. Selecting a NoSQL database like MarkLogic, or more precisely in this case, an XML database, means that all of the Healthcare.gov data sources would have to be converted to XML. Of course that’s a monumental task, but it’s no more difficult and time consuming than the arduous extract, transform and load (ETL) processes required by traditional relational databases because of their fixed schema. The enormous time and cost associated with ETL is precisely why new technologies are emerging.”
For a nation that prides itself on innovation, we seem to have a lot of folks afraid of progress. Granted, Attivio has a stake in encouraging organizations to break away from traditional database providers. Still, I agree that a project this size called for the most up-to-date approach available. Let us turn our accusatory gaze from MarkLogic, which after all represents a small fraction of the vendors involved with this website, to where it belongs: on our government’s unwieldy and outdated procurement process. Granted, addressing that will be much tougher than assigning a scapegoat, but the approach has a singular advantage—it might actually fix a problem currently poised to cause us trouble for years to come.
Cynthia Murrell, December 16, 2013
November 24, 2013
I read “DB-Engines Ranking.” What struck me is that search engines were included in the list. More remarkable, some of the search systems are not data management systems at all. One data management system bills itself as a search engine. I was surprised to find the Google Search Appliance listed. The system is expensive and garners only basic support from the “search experts” at Google.
Let me highlight the search related notes I made as I worked through the list of 171 systems.
- At position 12 is Solr. This is the open source faceted search engine that can be downloaded and installed—usually.
- At position 21 is ElasticSearch. The person who created Compass whipped up ElasticSearch and made some changes to enhance system performance. With $39 million in venture funding, ElasticSearch can be many things, but for me the company does search and retrieval.
- At position 27 is Sphinx Search. This system makes it easy to retrieve information from MySQL and some other databases without writing formal SQL queries.
- At position 38, MarkLogic is the polymath among the group. The company bills itself as enterprise search, XML data management system, and business intelligence vendor. The company also enjoys some notoriety due to its contributions to the exceptional Healthcare.gov project.
- In position 44 is the Google Search Appliance. The system is among the most expensive appliances I have examined. Is the GSA an end of life project? Is the GSA a database system? My view is that it is a somewhat limited way to get Google style results for users who see Google as the champion in the search derby.
- At position 104 is Xapian. Again, I don’t think of Xapian and its enthusiastic supporters as card carrying members of the database society. For me, Xapian evokes thoughts of Flax.
- At position 124 is CloudSearch, Amazon’s somewhat old-fashioned search system. Frankly I think of Amazon as more of a database services outfit than a search outfit.
- At position 127 is the end of life Compass Search. This was the precursor to ElasticSearch. There are those who are happy with an old school open source solution. Good for them.
- At position 149 is SearchBlox. Now SearchBlox uses ElasticSearch. Interesting?
- At position 163 is SRCH2. This vendor is one that has some organizational challenges. The focus of the company seems to be shifting to mobile search.
Quite an eclectic list. Some of the systems mentioned are search engines; for example, Basho Riak. In terms of list “points”, ElasticSearch looks like the big winner. Shay Banon made the list with Compass. ElasticSearch is moving up the charts. SearchBlox uses ElasticSearch in its product. What happened to LucidWorks and reflexive search?
Which of these systems would you select for data management? My thought is that one should check out the software before taking a list at face value.
The confusion about search is evident in this list. No wonder the LinkedIn discussion groups want to do surveys to figure out what search means.
Stephen E Arnold
November 13, 2013
Basho has released a technical preview of Riak 2.0, the company announced at the Ricon West developers’ conference last month in San Francisco. Several key improvements have been made to the open source distributed database: additional Riak data types; the option for strong consistency; full-text search integration with Apache Solr; more flexibility in security administration; simplified configuration management; and the option of storing fewer replicas across multiple data centers. See the article for details on each of these changes.
The press release emphasizes that this is not the final release of Riak 2.0, and that Basho would like users’ feedback:
“Please note that this is only a Technical Preview of Riak 2.0. This means that it has been tested extensively, as we do with all of our release candidates, but there is still work to be completed to ensure its production hardened. Between now and the final release, we will be continuing manual and automated testing, creating detailed use cases, gathering performance statistics, and updating the documentation for both usage and deployment. As we are finalizing Riak 2.0, we welcome your feedback for our Technical Preview. We are always available to discuss via the Riak Users mailing list, IRC (#riak on freenode), or contact us.”
Riak is developed by Basho Technologies, which naturally offers a commercial edition of the NoSQL database. The company also offers Riak CS, a cloud-based object storage system deployable on top of Riak. Basho positions its enterprise version as the solution for companies whose needs go beyond the traditional database or who have wrestled with scalability constraints within relational databases. Founded in 2008, Basho is headquartered in Cambridge, Massachusetts, and maintains offices in London, San Francisco, Tokyo, and Washington D.C.
Cynthia Murrell, November 13, 2013
November 13, 2013
We already knew that MarkLogic is good at search. Now the company is being recognized for its database management chops, we learn from “MarkLogic Featured in the Gartner Magic Quadrant for Operational Database Management Systems” at BWW Geeks World.
The press release tells us:
“MarkLogic has been positioned for its ability to execute and is the only Enterprise NoSQL database vendor featured in the report that integrates search and application services. . . .
MarkLogic is the only schema-agnostic Enterprise NoSQL database that integrates semantics, search and application services with the enterprise features customers require for production applications. This combination helps enterprises make better-informed decisions and create robust, scalable applications to drive revenue, streamline operations, manage risk and make the world safer. MarkLogic features ACID transactions, horizontal scaling, real-time indexing, high availability, disaster recovery, and government-grade security.”
CEO Gary Bloom does not let us forget his company’s search success. He points out that they also captured a place on Gartner’s 2013 Magic Quadrant for Enterprise Search roster, and that they are the only company to be included in both reports. He understandably takes this achievement as evidence that MarkLogic is on the right track with its integrated approach. The company focuses on scalability, enterprise-readiness, and leveraging the latest technology. Founded in 2001, MarkLogic is headquartered in Silicon Valley and maintains offices around the world.
Cynthia Murrell, November 13, 2013
October 26, 2013
We would like to let you in on a curious database out of the U.K., Springfield! Springfield!, where a vast entertainment wasteland is (semi-)documented. If you have ever had a fit of nostalgia and wished you could find scripts for television shows or films from years gone by, this is the site to check. Ditto if you’d like to chew over the plot of a show you saw last week. The description on the home page is concise:
“Springfield! Springfield! hosts a database containing thousands of TV show episode scripts and movie scripts. TV show episode scripts are available for all the latest top TV shows including… Breaking Bad, Doctor Who, Family Guy, Game of Thrones, How I Met Your Mother, Glee, My Little Pony, Orange Is the New Black, Pretty Little Liars, Sons of Anarchy, The Walking Dead, The Simpsons, True Blood, The Big Bang Theory”
Now that is a diverse list. As one might expect from the name, the site celebrates the longest-running scripted TV series in the U.S. (Springfield is the name of the town in which The Simpsons is set, for those unfamiliar with the show.) However, scripts from many, many other shows and movies can be found at the site, dating an impressive way back. It does not have a search function, but the scripts are listed in a straightforward, browsable alphabetical format. There is related content, too, like episode guides, character lists, and screen grabs. Whether you want to take a walk down memory lane or catch up on recent episodes of your current favorites, check out Springfield! Springfield!
Cynthia Murrell, October 26, 2013
August 30, 2013
Specialized hardware vendor MaxxCAT offers a SQL connector, allowing their appliances to directly access SQL databases. We read about that tool, named BobCAT, at the company’s Search Connect page. We would like to note that the company’s web site has made it easier to locate their expanding range of appliances for search and storage.
Naturally, BobCAT can be configured for use with Microsoft SQL Server, Oracle, and MySQL, among other ODBC databases. The connector’s integration with MaxxCAT’s appliances makes it easier to establish crawls and customize output using tools like JSON, HTML and SQL. The write-up emphasizes:
“The results returned from the BobCAT connector can be integrated into web pages, applications, or other systems that use the search appliance as a compute server performing the specialized function of high performance search across large data sets.
“In addition to indexing raw data, The BobCAT connector provides the capability for raw integrators to index business intelligence and back office systems from disparate applications, and can grant the enterprise user a single portal of access to data coming from customer management, ERP or proprietary systems.”
MaxxCAT does not stop with its SQL connector. Their Lynx Connector facilitates connection to their enterprise search appliances by developers, integrators, and connector foundries. The same Search Connect page explains:
“The connector consists of two components, the input bytestream and a subset of the MaxxCAT API that controls the processing of collections and the appliance.
“There are many applications of the Lynx Connector, including building plugins and connector modules that connect MaxxCAT to external software systems, document formats and proprietary cloud or application infrastructure. Users of the Lynx Connector have a straightforward path to take advantage of MaxxCAT’s specialized and high performance retrieval engine in building solutions.”
Developers interested in building around the Lynx framework are asked to email the company for more information, including a line on development hardware and support resources. MaxxCAT was founded in 2007 to capitalize on the high-performance, specialized hardware corner of the enterprise search market. The company manages to offer competitive pricing without sacrificing its focus on performance, simplicity, and ease of integration. We continue to applaud MaxxCAT’s recently launched program for nonprofits.
Cynthia Murrell, August 30, 2013
August 26, 2013
Despite enterprise companies moving away from SQL databases to the more robust NoSQL, Oracle has updated its database with new features, including an XQuery Full Text search. We found an article that examines how the new function will affect Oracle and where it seems to point. The article from Amis Technology Blog, “Oracle Database 12c: XQuery Full Text,” explains that the XQuery Full Text search was made to handle unstructured XML content. It does so by extending the XQuery XMLDB language. This finally makes Oracle capable of working with all types of XML. The rest of the article focuses on the XQuery code.
When the new feature was tested on Wikipedia content, as well as other XML content, the results were positive:
“During tests it proved very fast on English Wikipedia content (10++ Gb) and delivered the results within less than a second. But such a statement will only be picked up very efficiently if the new, introduced in 12c, corresponding Oracle XQuery Full-Text Index has been created.”
Oracle is trying to improve its technology as more of its users switch over to NoSQL databases. Improving the search function and other features keeps Oracle in the competition and proves that relational tables still have some kick in them. Interestingly enough, Oracle appears to be focusing its energies on MarkLogic’s technology to keep in the race.
Whitney Grace, August 26, 2013
August 16, 2013
Without further ado, from Basho.com: “Basho Announces Availability Of Riak 1.4,” the popular NoSQL database. Technology news Web sites have been buzzing about the new Riak upgrade and what it will offer its users. According to the article, version 1.4 offers more functionality, resolves issues, and adds functions requested by its users. It also gives a small taste of what to expect in version 2.0, which will be available for download later in 2013.
Here is what the upgrade features:
· Secondary Indexing Improvements: Query results are now sorted and paginated, offering developers much richer semantics
· Introducing Counters in Riak: Counters, Riak’s first distributed data type, provide automatic conflict resolution after a network partition
· Simplified Cluster Management With Riak Control: New capabilities in Riak’s GUI-based administration tool improve the cluster management page for preparing and applying changes to the cluster
· Reduced Object Storage Overhead: Values and associated metadata are stored and transmitted using a more compact format, reducing disk and network overhead
· Handoff Progress Reporting: Makes operating the cluster, identifying and troubleshooting issues, and monitoring the cluster simpler
· Improved Backpressure: Riak responds with an overload message if a vnode has too many messages in queue
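For readers wondering how a counter can resolve conflicts on its own after a network partition, here is a minimal sketch of one common design, a state-based "PN-counter." The announcement does not describe Riak's internals, so treat this as an illustration of the general idea only: the class name, method names, and node names below are hypothetical and are not Riak's API.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class PNCounter:
    """Illustrative state-based counter (hypothetical names, not Riak's API)."""
    node_id: str
    # Per-node running totals of increments and decrements.
    increments: Dict[str, int] = field(default_factory=dict)
    decrements: Dict[str, int] = field(default_factory=dict)

    def increment(self, amount: int = 1) -> None:
        # Each replica only ever updates its own slot.
        self.increments[self.node_id] = self.increments.get(self.node_id, 0) + amount

    def decrement(self, amount: int = 1) -> None:
        self.decrements[self.node_id] = self.decrements.get(self.node_id, 0) + amount

    def value(self) -> int:
        return sum(self.increments.values()) - sum(self.decrements.values())

    def merge(self, other: "PNCounter") -> None:
        # Take the per-node maximum. Because slots only grow, merging is
        # commutative and idempotent, so replicas converge without manual
        # conflict resolution.
        for mine, theirs in ((self.increments, other.increments),
                             (self.decrements, other.decrements)):
            for node, count in theirs.items():
                mine[node] = max(mine.get(node, 0), count)


def simulate_partition() -> Tuple[int, int]:
    """Two replicas accept writes independently, then exchange state."""
    a = PNCounter("node_a")
    b = PNCounter("node_b")
    a.increment(3)   # writes accepted on one side of the partition
    b.increment(2)   # writes accepted on the other side
    b.decrement(1)
    a.merge(b)       # partition heals; replicas swap state
    b.merge(a)
    return a.value(), b.value()
```

Calling `simulate_partition()` shows both replicas settling on the same total once they have merged, whichever side merges first.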
Users will be happy with how Riak 1.4 provides better functionality and management for clusters and datacenters. The updates and the 2.0 sample are enough to be excited about. There does not seem to be a thing NoSQL databases cannot do.
Whitney Grace, August 16, 2013
August 16, 2013
Before the better Internet we have today, school children had to rely on poor search and hacked-together Web sites to cheat on their homework. The Wolfram Alpha database did not exist, so school children had to rely on their own skills. Wolfram Alpha is a powerhouse database with a snarky attitude that can answer veritably any question. Makeuseof.com points out “10 Surprising Things You Didn’t Know Wolfram Alpha Could Do” and how you can harness the tool to do more than cheat on chemistry homework. Wolfram Alpha was originally built as a computational math engine, and geeks have since added other, often fun, features. You can upload an image to see how it would look as a comic book, through a dog’s eyes, or via color blindness.
Want a Morse code translator or statistics on everything associated with the NFL for the past twenty-five years? Look no further. You can also get a head start on your Christmas shopping by using it as a product comparison tool:
“Instead of using filters on any shopping website, you can try an English language query in the search box and see if it helps narrow down your shopping choices. Wolfram Alpha handshakes with Best Buy’s API to source the results, so the results are America and Canada centric. You can also use Wolfram Alpha to make a direct comparison between two products in the same product line by typing in their brand names and model numbers. The results page includes enough details to help you bore down to the right choices.”
A calorie burning calculator, an anniversary gift recommender, and a medical prescription decoder are yet more uses. The most artful and mathematical one takes Wolfram Alpha back to its original purpose…almost. The database can take any image and render it into a mathematical equation. What does the Mona Lisa look like in numbers? Play around with Wolfram Alpha and do not forget to ask it a Douglas Adams inspired question.
Whitney Grace, August 16, 2013