Statistics for the Statistically Inclined
June 10, 2011
Due to a strong bias against everyone’s favorite search engine, it is difficult for me to become excited over new Google developments. However, having endured a number of statistics classes, I will certainly give credit where credit is due.
I was recently directed to Google Correlate and spent a solid twenty-five minutes entertaining myself by testing statistical relationships. The service compares an uploaded data set against real search data courtesy of the search mogul. Google ranks results by the Pearson correlation coefficient (r), favoring values nearest to 1.0, so the user sees the most positively correlated queries. One can customize the results in a number of ways: for negative relationships, against a time series or regional location, for a normalized sine function or a scatter plot, etc.
For any glazed-over eyes out there, the Web site sums up the intent this way:
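For readers who want to see the arithmetic, here is a minimal sketch of ranking by Pearson's r. It is not Google's implementation; the function names, the query labels, and the toy numbers are all made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def rank_queries(target, candidates):
    """Sort candidate series from most positively to most negatively correlated."""
    scored = [(name, pearson_r(target, series)) for name, series in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical data: an uploaded target series and three made-up query series.
target = [1, 2, 3, 4, 5]
candidates = {
    "query_a": [2, 4, 6, 8, 10],  # rises in step with the target
    "query_b": [5, 4, 3, 2, 1],   # falls as the target rises
    "query_c": [1, 3, 2, 5, 4],   # roughly follows the target
}

for name, r in rank_queries(target, candidates):
    print(f"{name}: r = {r:.2f}")
```

Reversing the sort order would surface the most negatively correlated queries instead, which is the "negative relationships" option mentioned above.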
“Google Correlate is like Google Trends in reverse. With Google Trends, you type in a query and get back a series of its frequency (over time, or in each US state). With Google Correlate, you enter a data series (the target) and get back queries whose frequency follows a similar pattern.”
Don’t worry, there is a tutorial.
It should also be noted that this service is tagged as “experimental.” I fear that, due to a lack of popularity, it may dissolve in its very own time series in sad, monthly increments.
I imagine this tool is providing certain students some relief, but what of regular users? In the words of the head gander, how many Google mobile users know what correlate means? Without crunching the data, I think our r may be approaching -1.0.
Sarah Rogers, June 10, 2011
Sponsored by ArnoldIT.com, the resource for enterprise search information and current news about data fusion
ProQuest: A Typo or Marketing?
June 10, 2011
I was poking around with the bound phrase “deep indexing.” I had a briefing from a startup called Correlation Concepts. The conversation focused on the firm’s method of figuring out relationships among concepts within text documents. If you want to know more about Correlation Concepts, you can get more information from the firm’s Web site at http://goo.gl/gnBz6.
I mentioned to Correlation Concepts Dr. Zbigniew Michalewicz’s work in mereology and genetic algorithms and also referenced the deep extraction methods developed by Dr. David Bean at Attensity. I also commented on some of the methods disclosed in Google’s open source content. But Google has become less interesting to me as new approaches have become known to me. Deep extraction requires focus, and I find it difficult to reconcile focus with the paint gun approach Google is now taking in disciplines far removed from my narrow area of interest.
A typo is a typo. An intentional mistake may be a joke or maybe disinformation. Source: http://thiiran-muru-arul.blogspot.com/2010/11/dealing-with-mistakes.html
After the interesting demo given to me by Correlation Concepts, I did some patent surfing. I use a number of tools to find, crunch, and figure out which crazily worded filing relates to other, equally crazily worded documents. I don’t think the patent system is much more than an exotic work of fiction and fancy similar to Spenser’s The Faerie Queene.
Deep indexing is important. In some cases, keyword indexing does not capture the “aboutness” of a document. As metadata becomes more important, indexing outfits have to cut costs. Human indexers are like tall grass in an upscale subdivision. Someone is going to trim that surplus. In indexing, humans get pushed out for fancy automated systems. Initially more expensive than humans, the automated systems don’t require retirement, health care, or much management. The problem is that humans still index certain content better than automated systems. Toss out high quality indexing and insert algorithmic methods, and you get search results which can vary from indexing update to indexing update.
Will Schema.org Limit Web Developer Choices?
June 10, 2011
We just don’t know. We noted on Slashdot the article “Schema.org—Google, Microsoft and Yahoo! Agree on Markup Vocabulary.” At first glance, this is another technical hoe down. The goal of standardization promised by Schema.org looks like a good move. The stated goal is improved search results. What could be wrong with that?
In reality, it’s a case of the big boys collaborating to make decisions for the rest of us, like in the good old days with Boss Tweed and Commodore Vanderbilt.
The Slashdot blurb points to Manu Sporny’s piece “The False Choice of Schema.org.” Sporny details the choices that will be lost by adopting this model. RDFa and Microformats would become unsupported, unnecessarily narrowing developer choice to Microdata only. The stated advantages of reducing complexity do not outweigh the losses:
Those [RDFa] features aren’t just there to be purely complex – they were specifically requested by the Web community when building RDFa. Microdata is lacking many of those community-requested features, which does make it simpler, but it also makes it so that it doesn’t solve the problems that the ‘complex’ features were designed for. RDFa is designed to solve a wider range of problems than just those of the search companies. Yes, complexity is bad – but so is cutting features that the Web community has specifically requested and needs to make structured data on the Web everything that it can be.
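To make the formats concrete, here is a rough sketch that states the same two facts about a person once in Microdata and once in RDFa, then pulls the values out with Python's standard HTML parser. The markup snippets, the person, and the helper names are invented for illustration; the parser handles only this trivial case and is not a conforming Microdata or RDFa processor.

```python
from html.parser import HTMLParser

# Hypothetical snippets: the same person described in two syntaxes.
MICRODATA = (
    '<div itemscope itemtype="http://schema.org/Person">'
    '<span itemprop="name">Jane Example</span> '
    '<span itemprop="jobTitle">Developer</span></div>'
)
RDFA = (
    '<div vocab="http://schema.org/" typeof="Person">'
    '<span property="name">Jane Example</span> '
    '<span property="jobTitle">Developer</span></div>'
)


class PropertyCollector(HTMLParser):
    """Collect the text of elements that carry a given property attribute."""

    def __init__(self, attr_name):
        super().__init__()
        self.attr_name = attr_name
        self.properties = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        # Remember the property name until its text content arrives.
        name = dict(attrs).get(self.attr_name)
        if name is not None:
            self._current = name

    def handle_data(self, data):
        if self._current is not None and data.strip():
            self.properties[self._current] = data.strip()
            self._current = None

    def handle_endtag(self, tag):
        # Simplification: drop a pending property if no text was found.
        self._current = None


def extract(markup, attr_name):
    parser = PropertyCollector(attr_name)
    parser.feed(markup)
    parser.close()
    return parser.properties


print(extract(MICRODATA, "itemprop"))
print(extract(RDFA, "property"))
```

In this simple case the two syntaxes carry identical data. Sporny's objection concerns the features beyond it, which this sketch does not exercise.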
Because business success today depends so much on search ranking, few businesses are likely to resist the changes once in place. It’s possible, though, that enough protest now will cause the bosses to rethink their edict. As Sporny declares, “this is not how we do things on the Web.”
We also recall that Google has some serious standards horsepower working in the Googleplex. Is it possible that Google wants to move more quickly than standard practice allows? Worth watching.
Stephen E Arnold, June 10, 2011
Stormy Weather for the Eucalyptus Grove?
June 10, 2011
Still feel safe in the cloud? Have you heard from Eucalyptus lately?
According to “Critical Vulnerability in Open Source Eucalyptus Clouds”, there has been another break-in. At least a theoretical one; university researchers have found a hole in the cloud. Per the article:
“An attacker can, with access to the network traffic, intercept Eucalyptus SOAP commands and either modify them or issue their own arbitrary commands. To achieve this, the attacker needs only to copy the signature from one of the XML packets sent by Eucalyptus to the user. As Eucalyptus did not properly validate SOAP requests, the attacker could use the copy in their own commands sent to the SOAP interface and have them executed as the authenticated user.”
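The flaw described above can be modeled in a few lines. This is a toy illustration only: it uses an HMAC over a plain string rather than the XML signatures a SOAP stack relies on, and the key, field names, and command names are assumptions, not Eucalyptus code. The point is that a signature protects a command only if the verifier checks it against the command it is about to run.

```python
import hashlib
import hmac

SECRET = b"demo-shared-secret"  # hypothetical key, for illustration only


def sign(body: str) -> str:
    """Return a hex HMAC-SHA256 signature over the given body."""
    return hmac.new(SECRET, body.encode("utf-8"), hashlib.sha256).hexdigest()


def lax_verify(request: dict) -> bool:
    """Flawed check: validates the signed part but never ties it to the command."""
    expected = sign(request["signed_body"])
    return hmac.compare_digest(expected, request["signature"])


def strict_verify(request: dict) -> bool:
    """Safer check: the signature must cover the command that will be executed."""
    expected = sign(request["command"])
    return hmac.compare_digest(expected, request["signature"])


# A legitimate request observed on the network (command name is illustrative).
legit = {
    "command": "DescribeInstances",
    "signed_body": "DescribeInstances",
    "signature": sign("DescribeInstances"),
}

# The attacker copies the signed part and signature, then swaps in a new command.
forged = {
    "command": "TerminateInstances",
    "signed_body": legit["signed_body"],
    "signature": legit["signature"],
}

print("lax accepts forged request:   ", lax_verify(forged))
print("strict accepts forged request:", strict_verify(forged))
```

A production service would also need protection against straight replays of a captured request, such as timestamps or nonces, which this sketch leaves out.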
The platform has already provided a newer, downloadable version that corrects the issue. Eucalyptus has warned that its services may be a little spotty while the rest of the system recognizes the fix.
Go ahead and tally another tick mark against the cloud. What’s worse, besides the discovered threat, users must contend with the hassle of outages related to the fix. I could be wrong, but it seems it is only a matter of time before some serious consequences arise from lax attitudes concerning data storage.
How about putting enterprise data in the cloud with a search interface? Or maybe a bank of social security numbers? Now what about a security lapse?
Sarah Rogers, June 10, 2011
Tech Parade: Rain Forecast with Hail and High Winds
June 9, 2011
Oh, oh. Is the tech parade scheduled for a July celebration headed for bad weather?
AdAge Blogs’ “Affluency: Being ‘Technology-Infused’ Proves Taxing for Affluent” offers the results of a survey that shows how the lives of the affluent have become technology-infused.
While this group has seen explosive growth in smartphone, e-reader, and tablet ownership, the technology has also complicated their lives. Advertisers and media:
“must understand the growing adoption and use of new technology, as well as the evolving “topography” of platforms and occasions. At each point in this topography, [they] must understand consumers’ level of engagement, receptivity to advertising, preferences for apps vs. Web-based content, unmet information needs and much more. And [they] must do it all in an environment in which consumers feel they are facing more complex and stressful decisions than ever before.”
With 30 percent of searches now associated with mobile devices, and, according to Google, 40 percent of those searches local, it’s easy to see how mobile data is having a real impact on purchase decisions. All of this takes on added importance when you realize that the mobile devices are in the hands of affluent households, a market everyone’s chasing.
Too many companies are chasing too few customers. Could a miserable summer be upon us?
Stephen E Arnold, June 9, 2011
Slapping Facebook and Muting At Work Users
June 9, 2011
Have workplace bans on technology ever been effective? In “Half of UK Businesses Ban Social Media at Work,” The Next Women business magazine examines the issue.
A study of 2,500 UK businesses found that “48% ban their workers from posting updates on Twitter, Facebook and other social networking sites.” While employers may claim they are worried about protecting sensitive information or employees writing detrimental things about the company, “it’s the seamless integration between work and social media that is really concerning companies.”
How do you craft a policy that allows employees to use their smart phones for calls and e-mails but bans social networking? And who’s going to enforce it? This kind of negative management is never going to be considered a best practice.
Our view is that when 20-somethings join a “real” organization, the organization is going to have to work overtime to curtail what the 20-somethings perceive as normal behavior. Can organizations slap Facebook and mute its users at work? Good luck with that.
What happens if the hot new hire who cost a bonus, a new auto as an inducement, and a big salary takes a hike over a muting policy? Expensive for sure.
Stephen E Arnold, June 9, 2011
Google Abandons Another No Brainer Database
June 9, 2011
In “Google Kills Google News Archive,” Techspot’s reporting the end of the Internet giant’s newspaper archiving project. We learned:
“Newspapers that have their own digital archives can still add material to Google’s news archive via sitemaps, but the search giant will no longer spend its own money toward the cause.” Users can continue to search digitized newspapers in the archive, but the company isn’t going “to introduce any further features or functionality to the Google News Archive.”
Seems like Google now understands what commercial database publishers have known for some time–searchable newspaper databases are commodity products with thin profit margins.
It’s no surprise that the company has retreated from the market. Google’s threat to commercial online services, seemingly so real several years ago, has yet to materialize.
What does Google’s pullout mean for ProQuest and similar outfits? First, Google is going after bigger fish. Second, consolidation may be the path to stabilizing revenues from what is a shrinking library market.
There are other options, but the goose is not honking.
Stephen E Arnold, June 9, 2011
The GOOG May Have Arthritis
June 9, 2011
“Ex-Google Engineer Dubs Goofrastructure ‘Truly Obsolete’” jarred me from end-of-day wind down. Read the article. The basic idea is that Google, which is 13 years old, is now getting technical arthritis. Here’s the passage that caught my attention:
In a blog post published earlier this week, Dhanji R. Prasanna announced that he had resigned from the company, and though he praised Google in many ways, he made a point of saying that the company’s famously distributed back-end is behind the times. “Here is something you may have heard but never quite believed before: Google’s vaunted scalable software infrastructure is obsolete,” he wrote. “Don’t get me wrong, their hardware and datacenters are the best in the world, and as far as I know, nobody is close to matching it. But the software stack on top of it is 10 years old, aging and designed for building search engines and crawlers. And it is well and truly obsolete.”
True or false? I don’t know. But few dare to criticize the GOOG. Even fewer Xooglers get too frisky in their post Google adventures.
One thing is certain: The GOOG faces some real competition. As with other online tech companies such as Dialog Information Services, the cost of keeping current is just too great. Measure the cost in management cycles, coding, or attacks on nation states. Has Google faced challenges in social media because of technical limitations?
Stephen E Arnold, June 9, 2011
Digital Reasoning Adds Chinese Support to Synthesys
June 9, 2011
“Digital Reasoning Introduces Chinese Language Support for Big Data Analytics,” announces the company’s press release. This latest advance from the natural language wizards acknowledges the growing prevalence of Chinese on the Web. The support augments their premier product, Synthesys:
“Synthesys can now analyze the unstructured data from a variety of sources in both English and Chinese to uncover potential threats, fraud, and political unrest. By automating this process, intelligence analysts can gain actionable intelligence in context quickly and without translation.”
This key development is the sort of thing that makes us view Digital Reasoning as a breakout company in content processing. Their math-based approach to natural language analytics puts them ahead of the curve in this increasingly important field. Synthesys has become an essential tool for government agencies and businesses alike.
This support for Chinese is just the beginning. Rob Metcalf, President and COO, knows that “the next generation of Big Data solutions for unstructured data will need to natively support the world’s most widely spoken languages.”
We’re delighted to see Digital Reasoning continue to excel.
Cynthia Murrell, June 8, 2011