A Lament for the State of Analysis Tech in Economics

September 18, 2013

It seems like state-of-the-art analysis tools would be a priority in a data-rich field like finance. That’s why it is startling to learn that the technology used by economic analysts and consultants seems stuck in the era of Windows 95. About Data shares a data-loving former economist’s lament in “Bridging Economics and Data Science.”

Blogger Sam Bhagwat majored in economics because he was intrigued by innovative uses of data in that field; for example, a professor of his had gleaned conclusions about European patent law from a set of 19th century industrial-fair records. As he progressed, though, Bhagwat came to the disappointing realization that his field still relies on technology for which “outdated” is putting it mildly. He writes:

“When I graduated, the questions had changed, but the fundamental tools of analysis remained constant. Half of my classmates, including me, were headed to consulting or investment banking. These are ‘spreadsheet monkey’ positions analyzing client financial and operational data.

“In terms of relationship-building, this is great. Joining high strategy or high finance, you walk through the halls of power and learn to feel comfortable there. But in terms of technical skill-set, not so great. You begin to specialize in spreadsheets, a tool which hasn’t significantly improved since 1995.

“For someone like me, who wants to solve the most interesting problems out there, dealing with gigabytes and terabytes of data, realizing this was bitter medicine. Computational data analysis has changed a lot in the last twenty years, but my career track — economics, consulting, finance — hadn’t.”

So that is how one inquiring mind decided to make the leap from economics to data science. Bhagwat says he taught himself programming so he could pursue work he actually found challenging. I wonder, though—will he use his dual expertise to help bridge the gap between the two disciplines, or has he moved on, never to look back?

Cynthia Murrell, September 18, 2013

Sponsored by ArnoldIT.com, developer of Augmentext

Academic Integrity Questioned Because of Forgotten Supplemental Note

September 16, 2013

The academic community is supposed to represent integrity, research, and knowledge. When a project goes awry, researchers can understandably get upset, because several things may be on the line: job, funding, tenure, etc. To make the findings come out the way they want, researchers may be tempted to falsify data. A recent post on Slashdot points to a questionable academic situation: “Request To Falsify Data Published In Chemistry Journal.” Is this one situation where data was falsified? Read the original post:

“A note inadvertently left in the ‘supplemental information’ of a journal article appears to instruct a subordinate scientist to fabricate data. Quoting: ‘The first author of the article, “Synthesis, Structure, and Catalytic Studies of Palladium and Platinum Bis-Sulfoxide Complexes,” published online ahead of print in the American Chemical Society (ACS) journal Organometallics, is Emma E. Drinkel of the University of Zurich in Switzerland. The online version of the article includes a link to this supporting information file. The bottom of page 12 of the document contains this instruction: “Emma, please insert NMR data here! where are they? and for this compound, just make up an elemental analysis …” We are making no judgments here. We don’t know who wrote this, and some commenters have noted that “just make up” could be an awkward choice of words by a non-native speaker of English who intended to instruct his student to make up a sample and then conduct the elemental analysis. Other commenters aren’t buying it.'”

“Make up an elemental analysis…” Does that statement sound credible to you? Researchers are supposed to question and analyze every iota of data until there is nothing left to explore. Making something up only produces false data and will cause future studies to be inaccurate. Is this how all academics operate, or is it just an isolated incident?
Whitney Grace, September 16, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Pipe Information Dreams Often Forget

September 14, 2013

Do we dare broach the subject of health care information and electronic medical records? Yes, we do, taking our cue from “Dr. Karl Kochendorfer: Bridging The Knowledge Gap In Health Care” from the Federated Search Blog. Dr. Karl Kochendorfer wants an official federated search for the national health care system. His idea is to connect health care professionals to authoritative information with instantaneous results. He notes that doctors and nurses rely on Wikipedia and Google searches rather than authoritative databases because they are faster. Notice the danger?

Dr. Kochendorfer made this point in a TED talk he gave in April called “Seek And Ye Shall.” There he presents the idea for a federated search, along with these facts:

  1. “There are 3 billion terabytes of information out there.
  2. There are 700,000 articles added to the medical literature every year.
  3. Information overload was described 140 years ago by a German surgeon: “It has become increasingly difficult to keep abreast of the reports which accumulate day after day … one suffocates through exposure to the massive body of rapidly growing information.”
  4. With better search tools, 275 million improved decisions could be made.
  5. Clinicians spend 1/3 of their time looking for information.”
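For readers new to the term, federated search simply fans a single query out to several authoritative sources at once and merges the results into one answer set. Here is a minimal sketch of that pattern in Python; the endpoints and the “results” response field are hypothetical placeholders, not real medical databases:

```python
# Minimal federated search sketch: fan one query out to several sources
# in parallel and merge the results. The endpoints and the "results"
# response field are hypothetical placeholders, not real medical databases.
from concurrent.futures import ThreadPoolExecutor

import requests

SOURCES = {
    "source_a": "https://example.org/source-a/search",  # placeholder URL
    "source_b": "https://example.org/source-b/search",  # placeholder URL
}


def query_source(name, url, term):
    """Query one source and return (source name, list of hit dicts)."""
    resp = requests.get(url, params={"q": term}, timeout=5)
    resp.raise_for_status()
    return name, resp.json().get("results", [])


def federated_search(term):
    """Send the same query to every source concurrently and merge the hits."""
    merged = []
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        futures = [
            pool.submit(query_source, name, url, term)
            for name, url in SOURCES.items()
        ]
        for future in futures:
            name, hits = future.result()
            merged.extend({"source": name, **hit} for hit in hits)
    return merged


if __name__ == "__main__":
    print(federated_search("atrial fibrillation anticoagulation"))
```

The hard parts in practice are ranking and de-duplicating the merged hits, and licensing access to the underlying sources, which is exactly where the subscription question comes in.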

Dr. Kochendorfer’s idea is grand, but how many academic databases are lining up to offer their information for free or without a hefty subscription fee? Academia is already desperate for money; asking publishers to share their wealth of knowledge without compensation will not go over well. Should there be a federated search with authoritative information and instantaneous results? Yes. Will it happen? Keep fixing the plumbing.

Whitney Grace, September 14, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Do Not Disregard Salience

August 27, 2013

Here is a story for all of you object-relational fans out there from the Lexalytics Development Blog: “Exploratory Text Analytics Using Object-Relational Mappings.” The post starts out explaining how Salience, an engine that examines content for things like mentions of a specific item or a document’s tone, is a business intelligence tool with a lot of potential. Many users, however, do not know what Salience can actually do with their data. Another problem is that Salience is a low-level engine inside any customer application, so the customer must design an application around it to extract and analyze the data.

The good news is that there is a viable solution:

“…[A] couple of enabling technologies have been developed that allow customers to take the initial data analysis phase back into their own hands. The first enabling technology is that of automated object/relation mapping (ORM) frameworks. ORM frameworks store the internal data objects produced by object-oriented programming languages (like Java or C#) into a relational database, where they can be made accessible to any application. ORM frameworks have been around for decades, but they required (painful) manual configuration to set them up. Modern ORM frameworks now have automated mapping capabilities that [allow] them to configure themselves from the structure of the data objects. What this means for Salience is that [it] is now easy to dump everything that Salience extracts—everything—into a database.”
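To make the quoted ORM idea concrete, here is a minimal sketch using Python and SQLAlchemy’s declarative mapping; the table and field names are generic stand-ins for whatever a text analytics engine such as Salience might emit, not its actual output schema:

```python
# Minimal ORM sketch: map text-analytics output objects to a relational table
# so any downstream tool can query them. The table and field names are generic
# stand-ins, not the actual Salience output schema.
from sqlalchemy import Column, Float, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class EntityMention(Base):
    """One extracted entity mention from one analyzed document."""
    __tablename__ = "entity_mentions"

    id = Column(Integer, primary_key=True)
    document_id = Column(String, nullable=False)
    entity_text = Column(String, nullable=False)  # e.g. a product or brand name
    entity_type = Column(String)                  # e.g. "Company", "Person"
    sentiment = Column(Float)                     # tone score for the mention


engine = create_engine("sqlite:///analytics.db")
Base.metadata.create_all(engine)  # the ORM generates the schema from the class

with Session(engine) as session:
    session.add(EntityMention(document_id="doc-001",
                              entity_text="Acme Widgets",
                              entity_type="Company",
                              sentiment=0.42))
    session.commit()
```

Because the mapping is generated from the class definition, exposing a new piece of extracted data to analysts is just a matter of adding a column.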

The post runs through an ORM implementation and how to get a Salience application set up. Salience output sounds a lot like big data. Could salience-detection apps be the next big data trend?

Whitney Grace, August 27, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Compressor Contest

August 15, 2013

If you want to squish text, here’s a useful resource. Blogger and tech strategist Matt Mahoney hosts a contest that puts lossless data compression programs to the test. Using a particular text dump, the English version of Wikipedia from March 3, 2006, he examines the compressed size of the data’s first billion bytes. He explains the reason for the initiative:

“The goal of this benchmark is not to find the best overall compression program, but to encourage research in artificial intelligence and natural language processing (NLP). A fundamental problem in both NLP and text compression is modeling: the ability to distinguish between high probability strings like recognize speech and low probability strings like reckon eyes peach. . . .

“Compressors are ranked by the compressed size of enwik9 (10⁹ bytes) plus the size of a zip archive containing the decompresser. Options are selected for maximum compression at the cost of speed and memory. Other data in the table does not affect rankings. This benchmark is for informational purposes only. There is no prize money for a top ranking.”
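The contest entrants are specialized modeling programs, but the measurement itself is easy to mimic on a small scale. Here is a rough sketch that compares the general-purpose codecs in the Python standard library on any local text file standing in for enwik9 (the decompresser-size term of the real score is omitted):

```python
# Rough sketch of the benchmark's measurement: compress the same text with
# several general-purpose codecs and compare output sizes. The real benchmark
# uses enwik9 (the first 10^9 bytes of a 2006 Wikipedia dump) and also adds
# the zipped size of each decompresser to the score.
import bz2
import lzma
import zlib


def compare_codecs(path):
    data = open(path, "rb").read()
    sizes = {
        "zlib (level 9)": len(zlib.compress(data, 9)),
        "bz2 (level 9)": len(bz2.compress(data, 9)),
        "lzma/xz (preset 9)": len(lzma.compress(data, preset=9)),
    }
    for name, size in sorted(sizes.items(), key=lambda kv: kv[1]):
        print(f"{name:>20}: {size:,} bytes ({size / len(data):.1%} of original)")


if __name__ == "__main__":
    compare_codecs("sample.txt")  # any local text file stands in for enwik9
```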

Still, bragging rights alone should be worth it for the winner. See the write-up for all the technical details, including a detailed rundown of each compressor.

Cynthia Murrell, August 15, 2013

Sponsored by ArnoldIT.com, developer of Augmentext

Solr Update Goes Live

August 12, 2013

The Solr 4.4 update has been big news, exciting the open source search community. LucidWorks builds its LucidWorks Search and LucidWorks Big Data solutions on top of the Apache Lucene/Solr project. Its developer-focused blog, SearchHub, covers the latest update in the article, “Solr 4.4 Went Live this Week! – A Brief Summary.”

The article jumps right into the changes users can expect in Solr 4.4:

“Probably the biggest news is the new schemaless mode, or perhaps more aptly named schema guessing, where Solr tries to figure out what data-types to use based on the data you submit. While this puts to rest one of the bigger remaining complaints about Solr, it’s not recommended for production; this is the same recommendation that other schemaless engines advise. For example, if you’re going to be sorting and faceting over millions and millions of documents, you want to use the most compact numeric type that will suffice for your numeric fields.”
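To see the schema guessing in action, you can index a document without declaring any fields and let Solr infer types from the values. A minimal sketch in Python, assuming a local Solr instance running in schemaless (data-driven) mode with a collection named collection1; adjust the URL and collection name to your setup:

```python
# Minimal sketch: post a document to a schemaless Solr collection and let
# Solr guess field types from the JSON values. Assumes a local Solr instance
# in schemaless (data-driven) mode with a collection named "collection1".
import requests

SOLR_UPDATE = "http://localhost:8983/solr/collection1/update"

doc = {
    "id": "doc-1",
    "title": "Solr 4.4 release notes",    # guessed as a text field
    "published": "2013-07-23T00:00:00Z",  # guessed as a date field
    "downloads": 12345,                   # guessed as a numeric field
}

resp = requests.post(
    SOLR_UPDATE,
    params={"commit": "true", "wt": "json"},  # commit now, ask for a JSON reply
    json=[doc],                               # the update handler takes a JSON array
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```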

The good news for value-added open source enterprise solutions like LucidWorks is that when the underlying open source project receives an update, the benefits automatically trickle down into the value-added products. Users of LucidWorks know that they can count on the latest technology, fully supported by an award-winning company.

Emily Rae Aldridge, August 12, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Red Hat Partners with MongoDB

August 8, 2013

Red Hat is a major leader in the world of open source. Founded in 1993, the company is considered one of the major forerunners of the present-day open source boom. So the latest Red Hat news is usually a harbinger, and worth following. Read the latest from Red Hat in the PC World article, “Red Hat Enterprise Linux Gets Cozy with MongoDB.”

The article describes the recent Red Hat partnership with MongoDB:

“Easing the path for organizations to launch big data-styled services, Red Hat has coupled the 10gen MongoDB data store to its new identity management package for the Red Hat Enterprise Linux (RHEL) distribution . . . Although it already has been fairly easy to set up a copy of MongoDB on RHEL — by using Red Hat’s package installation tools — the new integration minimizes a lot of work of initializing new user and administrator accounts in the data store software.”

The partnership between Red Hat and MongoDB can only mean good things for the open source community. In fact, we have been seeing more and more of these like-minded partnerships over the last several months. LucidWorks announced a partnership with MapR to strengthen its LucidWorks Big Data offering. LucidWorks is worth keeping an eye on, as it is constantly seeking innovation and advancement for the open source community.

Emily Rae Aldridge, August 8, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Dangerous Glitches Still to be Worked Out in Electronic Medical Records

July 30, 2013

Electronic medical records are part of the constantly evolving big-data landscape, and there have been some issues, according to Bloomberg’s “Digital Health Records’ Risks Emerge as Deaths Blamed on Systems.” Not surprisingly, the period just after a new EMR system is implemented has been found to be the most dangerous time. See the article for examples of harm caused by errors. Journalist Jordan Robertson writes:

“Electronic health records are supposed to improve medical care by providing physicians quick and easy access to a patient’s history, prescriptions, lab results and other vital data. While the new computerized systems have decreased some kinds of errors, such as those caused by doctors’ illegible prescriptions, the shift away from paper has also created new problems, with sometimes dire consequences.”

Perhaps it would help if docs could access necessary data. Yet, even now, medical practices have trouble prying clinical data from their EMR systems. Apparently, a lack of data integration is the culprit, according to “Why do Docs Struggle with Population Health Data?” at Government HealthIT. That article summarizes:

“Today, with modern EHR systems, clinicians may have an easier time getting clinical data — but not all of it, which is a problem for providers pursuing population health goals. It’s also a problem as federal health officials and patient-safety organizations like the National Quality Forum try to transition from process-based quality measurements. . . to outcomes-based metrics.”

Are such digital-data challenges confined to health records alone? Unlikely, though in few (if any) other fields is big data consistently a life-and-death issue. We must remember that digital health records have also been shown to improve outcomes, but are we netting more good than bad? I suspect so, though we will probably never know for certain. One thing is sure: there’s no turning back now. Surely, mistakes will decline as systems are refined and staff become acclimated. I know that is cold comfort to anyone who has suffered such a preventable loss.

Cynthia Murrell, July 30, 2013

Sponsored by ArnoldIT.com, developer of Augmentext

Teradata Uses Revelytix Loom for Smarter Hadoop

July 26, 2013

The popular open source data-processing framework Hadoop has many strategic partners that have created portfolios for integrating with it. We heard about another one in Data Center Knowledge’s article “Teradata Announces Portfolio for Hadoop.” Teradata specializes in analytic technologies like data warehousing and business intelligence.

Teradata’s portfolio for Hadoop features Hadoop-based product platforms, software, consulting services, training and customer support. It uses Revelytix Loom. The press release “Revelytix Announces General Availability of Loom for Hadoop” tells us more about that.

According to Scott Gnau, president, Teradata Labs:

“Teradata is now off and running as a trusted single source for all things Hadoop with many leading customers such as Dell, Inc., Otto Group, PT XL Axiata Tbk mobile telecommunications, Swisscom Schweiz AG, and Wells Fargo Bank. We built The Teradata Portfolio for Hadoop to support organizations struggling with Hadoop implementations by taking the complexity and cost out of deploying and managing the solutions.”

As for Revelytix Loom, it provides a smarter Hadoop for the Teradata Appliance for Hadoop. What does that mean? Just a little something like dynamic dataset management and automatic parsing of new files.

Megan Feil, July 26, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Mondeca Adds to Linked Open Vocabularies

July 26, 2013

The web of linked data is growing not only in volume but also in its set of vocabularies. We recently saw on the Open Knowledge Foundation’s site that Mondeca’s Linked Open Vocabularies (LOV), a collection of vocabulary spaces, has been updated.

Users are able to find vocabularies listed and individually described by metadata, classified by vocabulary spaces and interlinked using the dedicated vocabulary VOAF.

We learned more about what LOV aims to do:

“Most popular ones form now a core of Semantic Web standards de jure (SKOS, Dublin Core, FRBR …) or de facto (FOAF, Event Ontology …). But many more are published and used. Not only linked data leverage a growing set of vocabularies, but vocabularies themselves rely more and more on each other through reusing, refining or extending, stating equivalences, declaring metadata. LOV objective is to provide easy access methods to this ecosystem of vocabularies, and in particular by making explicit the ways they link to each other and providing metrics on how they are used in the linked data cloud, help to improve their understanding, visibility and usability, and overall quality.”
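For a sense of what vocabulary-about-vocabularies metadata looks like, here is a small sketch in Python with rdflib that describes a made-up vocabulary and its reuse of another. The VOAF namespace URI, class, and property names are assumptions for illustration; check the published vocabulary before relying on them:

```python
# Small sketch: describe a made-up vocabulary and its reuse of another, in the
# spirit of LOV's VOAF-based metadata. The VOAF namespace URI, class, and
# property names below are assumptions for illustration only.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

VOAF = Namespace("http://purl.org/vocommons/voaf#")  # assumed namespace URI
FOAF_VOCAB = URIRef("http://xmlns.com/foaf/0.1/")

g = Graph()
g.bind("voaf", VOAF)
g.bind("dcterms", DCTERMS)

vocab = URIRef("http://example.org/ns/widgets#")     # made-up vocabulary
g.add((vocab, RDF.type, VOAF.Vocabulary))            # assumed class name
g.add((vocab, DCTERMS.title, Literal("Widget Ontology", lang="en")))
g.add((vocab, DCTERMS.modified, Literal("2013-07-26")))
g.add((vocab, VOAF.reliesOn, FOAF_VOCAB))            # assumed property name

print(g.serialize(format="turtle"))
```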

There are myriad ways for those interested to feed their inner controlled-vocabulary demon, one of which is to suggest a new vocabulary to add to LOV.

Megan Feil, July 26, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search
