Building Lovely Data Visualizations

June 25, 2014

Data is no longer just facts, figures, and black-and-white graphs. Data visualizations are becoming an increasingly important way to present and communicate data (and even Big Data). A few data visualization solutions are making big waves, and Visage is one on the rise. It is highlighted in the FastCompany article, “A Tool For Building Beautiful Data Visualizations.”

The article begins:

“Visage, a newly launched platform, provides custom templates for graphics. There are myriad tools on the market that do this (for a gander at 30 of them, check out this list), but Visage is the latest, and it’s gaining traction with designers at Mashable, MSNBC, and A&E. That’s due in part to Visage’s offerings, which are designed to be more flexible, and more personalized, than other services.”

More and more companies are working on ways to help organizations decipher and make sense of Big Data. But what good is the information if it cannot be effectively communicated? This is where data visualizations come in – helping to communicate complex data through clean visuals.

Emily Rae Aldridge, June 25, 2014

Sponsored by ArnoldIT.com, developer of Augmentext

Connotate Shows Growth And Webdata Browser

June 20, 2014

In February 2014, NJTC TechWire published the article “Connotate Announces 25% YOY Growth In Total Contract Value For 2013.” Connotate has made a name for itself as a leading provider of Webdata extraction and monitoring solutions. The company’s revenue grew 25% in 2013, and other positives for Connotate included the release of Connotate 4.0, a new Web site, and new multi-year deal renewals. On top of the record growth, BIIA reports that “Connotate Launches Connotate4,” a Web browser that simplifies and streamlines Webdata extraction. Connotate4 will do more than provide users with a custom browser:

• “Inline data transformations within the Agent development process is a powerful new capability that will ease data integration and customization.

• Enhanced change detection with highlighting can be requested during the Agent development process via a simple point-and-click checkbox, enabling highlighted change detection that is easily illustrated at the character, word or phrase level.

• Parallel extraction tasks makes it faster to complete tasks, allowing even more scalability for even larger extractions.

• Build and expand capabilities turn the act of re-using a single Agent for related extraction tasks a one-click event, allowing for faster Agent creation.

• A simplified user interface enabling simplified and faster Agent development.”

Connotate brags that the new browser will give users access to around 95% of Webdata and will adapt as new technologies emerge. Connotate aims to place itself in the next wave of indispensable enterprise tools.
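Connotate’s browser is proprietary, but the parallel extraction idea highlighted above is easy to illustrate generically. Below is a minimal sketch, not Connotate’s API, that fetches several pages concurrently and pulls their titles; the URLs and the regular expression are placeholder assumptions.

```python
# Generic parallel web data extraction sketch -- not Connotate's API.
# Fetches a handful of pages concurrently and extracts each page title.
import re
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

URLS = [  # placeholder URLs for illustration
    "https://example.com/",
    "https://example.org/",
    "https://example.net/",
]

def extract_title(url):
    """Download a page and return (url, <title> text or 'n/a')."""
    html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return url, match.group(1).strip() if match else "n/a"

if __name__ == "__main__":
    # Running the extraction tasks in parallel is what lets large jobs scale.
    with ThreadPoolExecutor(max_workers=3) as pool:
        for url, title in pool.map(extract_title, URLS):
            print(f"{url} -> {title}")
```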

Whitney Grace, June 20, 2014
Sponsored by ArnoldIT.com, developer of Augmentext

Rising Startup Tamr Has Big Plans for Data Cleanup

June 13, 2014

An article on Gigaom is titled “Michael Stonebraker’s New Startup, Tamr, Wants to Help Get Messy Data in Shape.” With $16 million in backing from Google Ventures and New Enterprise Associates, Stonebraker and partner Andy Palmer are working to crack the ongoing problem of data transformation and normalization. The article explains,

“Essentially, the Tamr tool is a data cleanup automation tool. The machine-learning algorithms and software can do the dirty work of organizing messy data sets that would otherwise take a person thousands of hours to do the same, Palmer said. It’s an especially big problem for older companies whose data is often jumbled up in numerous data sources and in need of better organization in order for any data analytic tool to actually work with it.”

Teaching machines some human-like insight into repetitive cleanup work just might be the trick. Tamr still requires a human in the management seat, known as the data steward: someone who reviews a proposed match between records from two separate data sets and decides whether the relationship is a good one. Tamr has been compared to Trifacta, but Palmer insists that Tamr is preferable for its ability to compare thousands of data sources under a data steward’s oversight. He also noted that Trifacta co-founder Joe Hellerstein was a student of Stonebraker’s in a PhD program.
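Tamr’s algorithms are proprietary, but the basic workflow, machine-scored candidate matches handed to a human steward, can be sketched generically. The record lists and similarity thresholds below are illustrative assumptions, not Tamr’s implementation.

```python
# Generic sketch of machine-assisted record matching -- not Tamr's implementation.
# Score candidate pairs from two sources and queue the ambiguous ones for a steward.
from difflib import SequenceMatcher

source_a = ["Acme Corp., 100 Main St", "Globex Corporation", "Initech LLC"]
source_b = ["ACME Corp, 100 Main Street", "Globex Corp.", "Umbrella Inc."]

def similarity(left, right):
    """Rough string similarity between two records (0.0 to 1.0)."""
    return SequenceMatcher(None, left.lower(), right.lower()).ratio()

AUTO_ACCEPT = 0.90   # assumed thresholds for illustration
AUTO_REJECT = 0.50

for a in source_a:
    for b in source_b:
        score = similarity(a, b)
        if score >= AUTO_ACCEPT:
            verdict = "auto-match"
        elif score <= AUTO_REJECT:
            verdict = "auto-reject"
        else:
            verdict = "send to data steward"   # a human decides the relationship
        print(f"{score:.2f}  {verdict:20}  {a!r} <-> {b!r}")
```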

Chelsea Kerwin, June 13, 2014

Sponsored by ArnoldIT.com, developer of Augmentext

Palantir Advises More Abstraction for Less Frustration

June 10, 2014

At this year’s Gigaom Structure Data conference, Palantir’s Ari Gesher offered an apt parallel for the data field’s current growing pains: using computers before the dawn of operating systems. Gigaom summarizes his explanation in, “Palantir: Big Data Needs to Get Even More Abstract(ions).” Writer Tom Krazit tells us:

“Gesher took attendees on a bit of a computer history lesson, recalling how computers once required their users to manually reconfigure the machine each time they wanted to run a new program. This took a fair amount of time and effort: ‘if you wanted to use a computer to solve a problem, most of the effort went into organizing the pieces of hardware instead of doing what you wanted to do.’

“Operating systems brought abstraction, or a way to separate the busy work from the higher-level duties assigned to the computer. This is the foundation of modern computing, but it’s not widely used in the practice of data science.

“In other words, the current state of data science is like ‘yak shaving,’ a techie meme for a situation in which a bunch of tedious tasks that appear pointless actually solve a greater problem. ‘We need operating system abstractions for data problems,’ Gesher said.”

An operating system for data analysis? That’s one way to look at it, I suppose. The article invites us to click through to a video of the session, but as of this writing it is not functioning. Perhaps they will heed the request of one commenter and fix it soon.

Based in Palo Alto, California, Palantir focuses on improving the methods their customers use to analyze data. The company was founded in 2004 by some folks from PayPal and from Stanford University. The write-up makes a point of noting that Palantir is “notoriously secretive” and that part(s) of the U.S. government can be found among its clients. I’m not exactly sure, though, how that ties into Gesher’s observations. Does Krazit suspect it is the federal government calling for better organization and a simplified user experience? Now, that would be interesting.

Cynthia Murrell, June 10, 2014

Sponsored by ArnoldIT.com, developer of Augmentext

Data Journalism Handbook

June 2, 2014

In the fast moving world of technology, updated resources are especially important. The Data Journalism Handbook is a new one that is worth a second look. Available in a variety of languages, the handbook aims to be a primer for the emerging world of data journalism.

The overview states:

“The Data Journalism Handbook is a free, open source reference book for anyone interested in the emerging field of data journalism. It was born at a 48 hour workshop at MozFest 2011 in London. It subsequently spilled over into an international, collaborative effort involving dozens of data journalism’s leading advocates and best practitioners.”

Freely available online via a Creative Commons license, the handbook is an initiative of the European Journalism Centre. Download your free copy today to see if data journalism is a field in which you can participate.

Emily Rae Aldridge, June 2, 2014

Sponsored by ArnoldIT.com, developer of Augmentext

Centrifuge Says It Offers More Insights

May 29, 2014

According to a press release on Virtual Strategy, Centrifuge Systems, a company that develops big data software, has created four new data connectors within its visual link analysis software. “Centrifuge Expands Their Big Data Discovery Integration Footprint” explains that the additional data connectors will help users make better business decisions.

“ ‘Without the ability to connect disparate data – the potential for meaningful insight and actionable business decisions is limited,’ says Stan Dushko, Chief Product Officer at Centrifuge Systems. ‘It’s like driving your car with a blindfold on. We all take the same route to the office every day, but wouldn’t it be nice to know that today there was an accident and we had the option to consider an alternate path.’ ”

The new connectors offer real-time access to the ANX file structure, JSON, LDAP, and Apache Hadoop with Cloudera Impala. Centrifuge’s goal is to add more data points that give users a broader and more detailed perspective on their data. Centrifuge likes to think of itself as the business intelligence tool of the future. Other companies, though, offer similar functions with their software. What makes Centrifuge different from the competition?
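Centrifuge does not publish its connector API, but the idea of a uniform connector layer over dissimilar sources is straightforward to sketch. The class names and the local JSON file below are assumptions for illustration only.

```python
# Generic data-connector sketch -- illustrative only, not Centrifuge's API.
# Each connector exposes the same fetch() interface over a different source.
import json
from abc import ABC, abstractmethod

class Connector(ABC):
    @abstractmethod
    def fetch(self):
        """Yield records from the underlying source as plain dicts."""

class JsonFileConnector(Connector):
    def __init__(self, path):
        self.path = path

    def fetch(self):
        # Assumes the file holds a top-level JSON array of records.
        with open(self.path, encoding="utf-8") as handle:
            yield from json.load(handle)

# An LDAP or Impala connector would implement the same interface with its own
# client library; the link analysis code never needs to know which source it has.
def gather_records(connectors):
    records = []
    for connector in connectors:
        records.extend(connector.fetch())
    return records

if __name__ == "__main__":
    # "people.json" is a hypothetical local file used only for this sketch.
    print(gather_records([JsonFileConnector("people.json")]))
```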

Whitney Grace, May 29, 2014
Sponsored by ArnoldIT.com, developer of Augmentext

Fusion Problems

May 29, 2014

Brett Slatkin at One Big Fluke makes a thought-provoking point in his blog post “Data Fusion Has No Error Bounds” about how error-prone this kind of data analysis can be. Slatkin relates how he has come across many data fusion issues in his career. Data fusion problems occur when people want to merge two or more data sets that share no linking source. Companies have tried to rectify data fusion problems, but no matter how they advertise their software, code, or gimmick, Slatkin shows that there will always be some margin of error. How does he do it? Math.

Slatkin illustrates data fusion with three data sets that have little to no relation. He outlines all the possible outcomes of each data set and ends with a portion that cannot be measured. He shows that despite all of the careful planning, mapping out the possible outcomes yields a phantom zone. His response to this simple outcome is:

“There are two outcomes in data fusion: you measure so you can calculate the error bars, or you make a wild guess.”
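The argument is easy to see with a toy example (mine, not Slatkin’s): if one data set says 60 percent of customers are mobile users and a separate, unlinked data set says 50 percent are repeat buyers, the overlap between the two groups was never measured, so any fused figure is a guess within wide bounds.

```python
# Toy illustration of the data fusion problem (my example, not Slatkin's).
# Two unlinked data sets give marginal rates; the joint rate is unmeasured,
# so all we can state are bounds, not an error bar around a single value.
p_mobile = 0.60   # measured in data set 1
p_repeat = 0.50   # measured in data set 2

lower = max(0.0, p_mobile + p_repeat - 1.0)   # the overlap cannot be smaller
upper = min(p_mobile, p_repeat)               # the overlap cannot be larger

print(f"P(mobile AND repeat) lies somewhere in [{lower:.2f}, {upper:.2f}]")
# Output: P(mobile AND repeat) lies somewhere in [0.10, 0.50]
# Without a shared key linking the two sets, nothing narrows that range:
# you either measure the link, or you make a wild guess.
```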

What have we learned from this? Despite all attempts to overcome any errors, data analysis is still error prone. Big data vendors will not like that.

Whitney Grace, May 29, 2014
Sponsored by ArnoldIT.com, developer of Augmentext

Interview with Jeff Catlin on the Future of Enterprise Data

May 22, 2014

The interview titled “Text Analytics 2014: Jeff Catlin, Lexalytics,” published on Breakthrough Analysis, may be overstating its case when it is billed as breakthrough analysis. Most of the questions cover state-of-the-industry topics and Lexalytics promotion. Still, Catlin offers insight into the world of enterprise data and the future of the industry. For example, when asked about new features for 2014 and the near future, Catlin responded,

“As a company, Lexalytics is tackling both the basic improvements and the new features with a major new release, Salience 6.0, which will be landing sometime in the second half of the year. The core text processing and grammatic parsing of the content will improve significantly, which will in turn enhance all of our core features of the engine. Additionally, this improved grammatic understanding will allow us to be the key to detecting intention, which is the big new feature in Salience 6.0.”
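Lexalytics has not published how Salience detects intention, but the general idea, mapping patterns in text to an intent label, can be hinted at with a trivial rule-based sketch. The cue phrases and labels below are invented for illustration and have nothing to do with Salience 6.0’s actual approach.

```python
# Trivial rule-based intent sketch -- an invented illustration, not Salience 6.0.
INTENT_CUES = {
    "purchase": ["i want to buy", "looking to purchase", "where can i order"],
    "churn":    ["cancel my account", "switching to", "done with this service"],
    "support":  ["how do i", "not working", "need help with"],
}

def detect_intent(text):
    """Return the first intent whose cue phrase appears in the text."""
    lowered = text.lower()
    for intent, cues in INTENT_CUES.items():
        if any(cue in lowered for cue in cues):
            return intent
    return "unknown"

print(detect_intent("I'm looking to purchase a new phone next week"))  # purchase
print(detect_intent("This app is not working after the update"))       # support
```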

Catlin repeats in several of his answers that the industry is in flux, and that vendors can only scramble to keep up, even going so far as to compare 2013 and 2014 enterprise data to the Berlin Wall. He describes two “fronts”, one involving improving core technology, and the other focused on vertical market prospects.

Chelsea Kerwin, May 22, 2014

Sponsored by ArnoldIT.com, developer of Augmentext

Using Real Data to Mislead

May 14, 2014

Viewers of graphs, beware! Data visualization has been around for a very long time, but it has become ubiquitous since the onset of Big Data. Now, the Heap Data Blog warns us to pay closer attention in, “How to Lie with Data Visualization.” Illustrating his explanation with clear examples, writer Ravi Parikh outlines three common ways a graphic can be manipulated to present a picture that actually contradicts the data used to build it. The first is the truncated Y-axis. Parikh writes:

“One of the easiest ways to misrepresent your data is by messing with the y-axis of a bar graph, line graph, or scatter plot. In most cases, the y-axis ranges from 0 to a maximum value that encompasses the range of the data. However, sometimes we change the range to better highlight the differences. Taken to an extreme, this technique can make differences in data seem much larger than they are.”

The example here presents two charts on rising interest rates. On the first, the Y-axis ranges from 3.140% to 3.154% — a narrow range that makes the rise from 2008 to 2012 look quite dramatic. However, on the next chart the rise seems nigh non-existent; this one presents a more relevant span of 0.00% to 3.50% on the Y-axis.
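It only takes a few lines to see the trick at work. The sketch below uses made-up interest-rate figures, not the article’s actual data, to draw the same series twice: once with a truncated y-axis and once starting at zero.

```python
# Truncated y-axis versus full y-axis, using made-up interest-rate figures.
import matplotlib.pyplot as plt

years = [2008, 2009, 2010, 2011, 2012]
rates = [3.141, 3.144, 3.147, 3.150, 3.154]   # invented values for illustration

fig, (truncated, honest) = plt.subplots(1, 2, figsize=(10, 4))

truncated.plot(years, rates, marker="o")
truncated.set_ylim(3.140, 3.154)              # narrow range: the rise looks dramatic
truncated.set_title("Truncated y-axis")

honest.plot(years, rates, marker="o")
honest.set_ylim(0.0, 3.5)                     # full range: the rise nearly vanishes
honest.set_title("Y-axis starting at zero")

for ax in (truncated, honest):
    ax.set_xlabel("Year")
    ax.set_ylabel("Interest rate (%)")

plt.tight_layout()
plt.show()
```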

Another method of misrepresentation is to present numbers, particularly revenue, cumulatively instead of from year-to-year or quarter-to-quarter. Parikh notes that Apple’s iPhone sales graph from last September is a prominent example of this tactic.

Finally, one can mislead one’s audience by violating conventions. The real-world example here presents a pie chart in which the slices add up to 193%. The network that created it had to know that cursory viewers would pay more attention to the bright colors than to the numbers. The write-up observes:

“The three slices of the pie don’t add up to 100%. The survey presumably allowed for multiple responses, in which case a bar chart would be more appropriate. Instead, we get the impression that each of the three candidates have about a third of the support, which isn’t the case.”

See the article for more examples, but the upshot is clear. Parikh concludes:

“Be careful when designing visualizations, and be extra careful when interpreting graphs created by others. We’ve covered three common techniques, but it’s just the surface of how people use data visualization to mislead.”

Cynthia Murrell, May 14, 2014

Sponsored by ArnoldIT.com, developer of Augmentext

The Hadoop Elephant Offers A Helping Trunk

May 13, 2014

It is time for people to understand that relational databases were not made to handle big data. There is just too much data jogging around in servers and mainframes, and the terabytes run circles around relational database frameworks. It is sort of like a smart fox toying with a dim hunter. It is time for more robust and reliable software, like Hadoop. GCN says that there are “5 Ways Agencies Can Use Hadoop.”

Hadoop is an open source programming framework that spreads data across server clusters. It is faster and less expensive than proprietary software. The federal government is always searching for ways to slash spending, and if it turns to Hadoop it might save a bit in tech costs.

“It is estimated that half the world’s data will be processed by Hadoop within five years.  Hadoop-based solutions are already successfully being used to serve citizens with critical information faster than ever before in areas such as scientific research, law enforcement, defense and intelligence, fraud detection and computer security. This is a step in the right direction, but the framework can be better leveraged.”

The five ways the government can use Hadoop are to store and analyze unstructured and semi-structured data, improve initial discovery and exploration, make all data available for analysis, stage data for warehouses and analytic data stores, and lower the cost of data storage.
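As a concrete starting point for the first of these, analyzing unstructured data, the classic word-count job can run on a cluster with nothing more than two small scripts via Hadoop Streaming. This is a generic sketch, not taken from the GCN article; the file names and HDFS paths in the usage line are placeholders.

```python
# mapper.py -- emit "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word.lower()}\t1")
```

```python
# reducer.py -- sum the counts for each word (Hadoop delivers input sorted by key).
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")
```

A job like this would be launched with something along the lines of hadoop jar /path/to/hadoop-streaming.jar -files mapper.py,reducer.py -mapper "python mapper.py" -reducer "python reducer.py" -input /user/data/raw -output /user/data/wordcount, with the jar and directory paths adjusted to the local cluster.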

So can someone explain why this has not been done yet?

Whitney Grace, May 13, 2014
Sponsored by ArnoldIT.com, developer of Augmentext
