IBM at Wimbledon

August 5, 2013

The realms of tennis and technology intersect at Wimbledon, Britain’s New Statesman reminds us in “IBM and Wimbledon: The Tech that Takes You Closer to the Tennis. Brought to You by Wimbledon Insights.” The prestigious tournament, held at the All England Club in Wimbledon, London, since 1877, has relied on IBM tech since 1994. We wonder: does IBM’s involvement boost Wimbledon’s ratings in the US?

The article takes us through the history of that partnership, listing developments like the 2000 debut of the Wimbledon Information System and the Match Analysis DVDs distributed to singles players beginning in 2007. See the article for more, but, personally, I am most interested in the incorporation of predictive analytics via SlamTracker. (Ah, but did Watson predict Serena’s recent loss? The article does not say.) We learn from the write-up:

“Introduced last year, IBM SlamTracker was enhanced with a ‘Keys to the match’ feature. Using over eight years of Grand Slam tennis data and 41 million data points, IBM is able to find the patterns and styles of play for particular head-to-head matches (or between players of similar styles if the players in question have not met before).

“In the run-up to a match, the data for one player is compared to that of his or her opponent, along with players of a similar style to determine the ‘keys to the match’: the three targets that player has to hit if they want to enhance their chance of winning. These keys are selected by analysing 45 potential match dynamics – 19 offensive, 9 defensive, 9 endurance and 8 style – to identify the ones that will be vital to each player in this specific match.”
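Conceptually, the “keys to the match” selection is a ranking problem: score each candidate dynamic for the specific matchup, then keep the top three. Here is a minimal sketch of that idea; the dynamic names and scores are invented for illustration, since IBM’s actual model and data are proprietary:

```python
# Toy illustration of picking "keys to the match": rank candidate match
# dynamics by an importance score and keep the top three. The real system
# scores 45 dynamics (19 offensive, 9 defensive, 9 endurance, 8 style)
# against eight years of Grand Slam data; these scores are made up.

def keys_to_the_match(dynamic_scores, n_keys=3):
    """Return the n_keys dynamics with the highest importance scores."""
    ranked = sorted(dynamic_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:n_keys]]

scores = {
    "first-serve percentage over 65%": 0.91,      # offensive
    "win over 55% of 4-to-9 shot rallies": 0.84,  # endurance
    "break-point conversion over 40%": 0.77,      # offensive
    "average under 30 unforced errors": 0.52,     # defensive
}
print(keys_to_the_match(scores))
```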

My, how sports have changed since 1877! It almost makes one long for simpler times.

Cynthia Murrell, August 05, 2013

Sponsored by ArnoldIT.com, developer of Augmentext

Replacing dtSearch is Easier than it Sounds

August 5, 2013

DtSearch is an interesting topic. Certainly once considered a high-water mark for text retrieval systems, it has mostly fallen off the cultural radar. However, that has not stopped one industrious company from… replacing it? We learned more from a recent Flax article, “An Open Source Replacement for the dtSearch Closed Source Search Engine.”

According to the story:

…we developed a new Lucene Analyzer that speaks the same syntax as dtSearch, allowing us to index text input. On the search side we have a Lucene QueryParser that shares this syntax. To make it easier to use we’ve wrapped the whole lot in a modified Solr server. As we needed some features of very recent Lucene code, our modifications are based on a patch to Lucene trunk.
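To get a feel for the translation problem, consider dtSearch’s proximity operator: `apple w/5 pear` means the two terms within five words of each other, which Lucene’s query syntax writes as `"apple pear"~5`. The toy regex below handles only that one simple case; Flax’s actual work is a full Lucene Analyzer and QueryParser in Java that speaks the whole dtSearch grammar:

```python
import re

# Toy sketch of dtSearch-to-Lucene query translation. dtSearch writes
# proximity as "a w/N b"; Lucene query syntax writes it as "\"a b\"~N".
# This handles only the simple two-term case; a real parser must cope
# with nesting, wildcards, field qualifiers, and the rest of the grammar.

DTSEARCH_PROX = re.compile(r"(\w+)\s+w/(\d+)\s+(\w+)", re.IGNORECASE)

def dtsearch_to_lucene(query: str) -> str:
    return DTSEARCH_PROX.sub(
        lambda m: f'"{m.group(1)} {m.group(3)}"~{m.group(2)}', query)

print(dtsearch_to_lucene("apple w/5 pear"))  # "apple pear"~5
```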

Our best response here is, well, whoopee. Saying you’ve replaced dtSearch is like Chevy claiming it has replaced the horse and buggy with its 2014 model. Frankly, we weren’t aware of too many people still using that software. For goodness sake, a Google search only brought up a single news piece. Chances are most people moved on a long time ago, so we will be stunned to hear about anyone jumping for joy because of this open source option.

Patrick Roland, August 05, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Spotter Makes its Name with Sarcasm

August 5, 2013

While we are generally cheerleaders for all things big data and analytics, we are not blind to the weaknesses. One major weakness would give most big data platforms a devil of a time parsing much from, say, an episode of Seinfeld. That’s right, we’re talking about their inability to detect sarcasm. However, Slashdot reports that one company might have the answer, in the recent article “Tech Companies Looking into Sarcasm Detection.”

According to the story:

Spotter’s platform scans social media and other sources to create reputation reports for clients such as the EU Commission and Air France. As with most analytics packages that determine popular sentiment, the software parses semantics, heuristics and linguistics. However, automated data-analytics systems often have a difficult time with some of the more nuanced elements of human speech, such as sarcasm and irony — an issue that Spotter has apparently overcome to some degree, although company executives admit that their solution isn’t perfect. (Duh.)
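To see why sarcasm is hard, consider the crudest possible heuristic: flag text where strongly positive words collide with an obviously negative situation. The word lists below are invented for illustration; Spotter’s real system layers semantics, heuristics, and linguistics far beyond anything this simple:

```python
# Toy sarcasm heuristic: a positive sentiment word co-occurring with a
# clearly negative situation word is flagged as possibly sarcastic.
# The word lists are invented; real systems are far more sophisticated.

POSITIVE = {"great", "love", "wonderful", "fantastic"}
NEGATIVE_CONTEXT = {"delayed", "cancelled", "broken", "lost", "stuck"}

def maybe_sarcastic(text: str) -> bool:
    words = {w.strip(".,!?").lower() for w in text.split()}
    return bool(words & POSITIVE) and bool(words & NEGATIVE_CONTEXT)

print(maybe_sarcastic("I just love it when my flight gets cancelled."))  # True
print(maybe_sarcastic("I love this airline."))                           # False
```

Even this tiny example hints at the failure modes: genuine praise next to a complaint about something else would be falsely flagged, which is why Spotter’s executives concede their solution is not perfect.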

Spotter is really making a name for itself. We fell in love with the company a long while ago, after an ArnoldIT interview set the tone. This is a sharp company, and if its sarcasm detection comes through, it will be an industry leader.

Patrick Roland, August 05, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Open Source to Help Secure Cloud Storage

August 5, 2013

As technology advances quickly, so do security concerns. It stands to reason that new technologies open up new vulnerabilities. But open source is working to combat those challenges in an agile and cost-effective way. Read the latest on the topic in IT World Canada’s story, “Open-Source Project Aims to Secure Cloud Storage.”

The article begins:

“The open source software project named Crypton is working on a solution that would enable developers to easily create encrypted cloud-based collaboration environments. There are very few cloud services that offer effective encryption protection for data storage, according to Crypton. Security has always been the top concern for many enterprise organizations when it comes to cloud services and applications.”
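The zero-knowledge idea behind such a project can be sketched in a few lines: the client derives a key locally and hands the cloud service only ciphertext, so the provider never sees the plaintext. The sketch below is an illustration only, not Crypton’s code; in particular, the homemade stream cipher exists purely to show the flow and should never stand in for a vetted cryptographic library:

```python
import hashlib
import hmac
import secrets

# Toy sketch of client-side ("zero-knowledge") encryption: derive a key
# from a passphrase, encrypt locally, upload only ciphertext.
# NOT production cryptography -- use a vetted library in real code.

def derive_key(passphrase: bytes, salt: bytes) -> bytes:
    # PBKDF2 slows down brute-force guessing of the passphrase.
    return hashlib.pbkdf2_hmac("sha256", passphrase, salt, 100_000)

def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    # HMAC in counter mode as a stand-in keystream generator.
    blocks, counter = [], 0
    while sum(len(b) for b in blocks) < length:
        blocks.append(hmac.new(key, nonce + counter.to_bytes(8, "big"),
                               hashlib.sha256).digest())
        counter += 1
    return b"".join(blocks)[:length]

def encrypt(key: bytes, plaintext: bytes) -> bytes:
    nonce = secrets.token_bytes(16)
    ks = _keystream(key, nonce, len(plaintext))
    return nonce + bytes(a ^ b for a, b in zip(plaintext, ks))

def decrypt(key: bytes, blob: bytes) -> bytes:
    nonce, ct = blob[:16], blob[16:]
    return bytes(a ^ b for a, b in zip(ct, _keystream(key, nonce, len(ct))))
```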

It is reasonable that enterprises are concerned about security when it comes to cloud services and storage. For that reason, many prefer on-site hosting and storage. However, some open source companies, like LucidWorks, build value-added solutions on top of open source software and guarantee security as well as support and training. And while LucidWorks offers on-site hosting as well, those who venture into the Cloud can have the best of both worlds with cost-effective open source software and the support of an industry leader.

Emily Rae Aldridge, August 5, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Search Engine Optimization and the Google Search Rewrite

August 4, 2013

I read an amazing article called “In Mastering Machine Intelligence, Google Rewrites Search Engine Rules.” For a person who takes a casual interest in Google, the write up appears to be “about” artificial intelligence and search engine optimization or SEO.

For Google watchers, the article contains a number of gem-like assertions and a couple of factoids that warrant discussion.

First, let’s look at the gem-like assertions.

Artificial intelligence. The article highlights the Google[x] Labs’ self-driving vehicle. This is a project nominally under the direction of the person who created an online learning system. The idea is that a self-driving vehicle demonstrates prowess in “artificial intelligence,” a term which is not defined. Another example is Google’s voice-to-text capability. The article emphasizes “artificial intelligence.” In my work, the performance of Google’s voice-to-text is more dependent on knowledgebases and brute force methods than “artificial intelligence.” The third example is embedded in this passage:

Google has finally started to figure out how to stop bad actors from gaming its crown jewel – the Google search engine. We say finally because it’s something Google has always talked about, but, until recently, has never actually been able to do.

The idea combines “artificial intelligence” with figuring out how search results have been shaped by search engine optimization experts. The idea behind SEO is that when a user enters a query, the search system displays the link to the Web page that the SEO expert is boosting. Relevance to the user? Well, maybe not so much. The result is that “relevance” is no longer precision and recall. Relevance is what I interpret as a spoof, a trick, or a cheat.

And what about SEO? Forget that I think SEO is a stepping stone to buying online advertising or paying money to appear in a results list. A more subtle version of this is the filtering that some news release vendors do to ensure that nothing enters the stream which is negative to a certain gatekeeper. Each of these actions contributes to distortion of search results. Users may get information which is incomplete or presented to advance a specific agenda such as the SEO expert’s client.

Here’s what the article says about SEO:

Google chasing down and excluding content from bad actors is a huge opportunity for web content creators. Creating great content and working with SEO professionals from inception through maintenance can produce amazing results. Some of our sites have even doubled in Google traffic over the past 12 months. So don’t think of Google’s changes as another offensive in the ongoing SEO battles. If played correctly, everyone will be better off now.

What’s between the examples of Google’s “artificial intelligence” and the blunt “factoid” that SEO is really important?

The answer is a series of tips about content. The idea is that a Web site which must render correctly on any device has to contain certain characteristics or features. I don’t want to repeat what’s in the source article, but I can flag three points that I found interesting:

  1. Keep sites simple. The author’s phrase is “clean, well-structured site architecture.” Now how many legacy sites are “clean and well structured”? In my experience, exactly zero. Legacy sites are everywhere. Anyone who has tried to reengineer a legacy site knows that the work required is like plastic surgery on an unattractive person — expensive and almost certain to disappoint. Talk is easy. Remediating a legacy site is something that few organizations in today’s financial environment embrace eagerly.
  2. Content: interesting, open, and original. Who knowingly creates uninteresting, closed, and imitative content? My guess: SEO experts, marketing managers who do not know what makes content sing or even stand without stumbling, and folks who think they are a combination of F. Scott Fitzgerald and Jane Austen. Remember college freshman English? Remember how many students cranked out disappointing essays week after week? Are those folks doing Web content? Some are.
  3. Markup. Yep, the glory of unstructured content sucks up processing cycles. The future belongs to tagged content which conforms to the guidelines promulgated by some large, irritable gorillas. Author tag? Insert it, now. Follow the rules or the “artificial intelligence” will give you a lower grade just like that college teacher grading those English 101 personal experience essays.

My reaction to this article is positive. Here’s why:

First, the summary of the problems with Google’s Web search system is clearly articulated. The author of the article has first-hand experience dealing with queries that generate results which are surprising or unexpected.

Second, the article illustrates how the general perception of Google’s preeminent position in search has become part of the furniture of living. Even facts about flawed results do not tarnish the belief that Google’s artificial intelligence is outstanding and getting better.

Third, the unwavering support of SEO is exactly what the SEO experts need. Many firms have spent large sums of money on SEO only to see no significant impact. Some SEO activities make clients really happy. Is this because clients were clueless in the first place? Other SEO activities produce cancelled contracts. Is this because a particular site was demoted or removed from the index?

I urge you to read the article. Spend much money on SEO. Follow the guidelines for “better” content. Life will be good. Remember: one can buy traffic or use online advertising to produce visitors to a Web site. Are the visitors going to buy? Well, that’s not part of the source document’s analysis. Hire an SEO expert to explain the details.

Stephen E Arnold, August 4, 2013

Sponsored by Xenky

Crowdsourcing Helps Keep Big Data Companies Straight

August 4, 2013

As big data analytics begins picking up steam, we are seeing more and more interesting outlets for learning about the different platforms to choose from: not just catalogs and boastful corporation sites, but insightful criticism. One such recent stop came when we happened upon the “About” page of Bamboo DiRT.

According to the site:

Bamboo DiRT is a tool, service, and collection registry of digital research tools for scholarly use. Developed by Project Bamboo, Bamboo DiRT is an evolution of Lisa Spiro’s DiRT wiki and makes it easy for digital humanists and others conducting digital research to find and compare resources ranging from content management systems to music OCR, statistical analysis packages to mindmapping software.

One look at its tips for analyzing data and we were sold. Here we were turned on to such intriguing companies as 140kit and Dataverse. The user-supported recommendations were the best. About Dataverse, it said: “Researchers and data authors get credit, publishers and distributors get credit, affiliated institutions get credit.” Concise and giving all the needed vitals, this type of crowdsourcing recommendation site could really catch on as the world of big data analytics keeps growing beyond most users’ capacity to keep up.

Patrick Roland, August 04, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Search and Null: Not Good News for Some

August 3, 2013

I read “How Can I Pass the String ‘Null’ through WSDL (SOAP)…” My hunch is that only a handful of folks will dig into this issue. Most senior managers buy the baloney generated by search and content processing vendors. Yesterday I reviewed, for one of the outfits publishing my “real” (for-fee) columns, a slide deck stuffed full of “all’s” and “every’s”. The message was that this particular modern system, which boasted a hefty price tag, could do just about anything one wanted with flows of content.

Happily overlooked was the problem of a person with a wonky name. Case in point: “Null”. The link from Hacker News to the Stackoverflow item gathered a couple of hundred comments. You can find these here. If you are involved in one of the next-generation, super-wonderful content processing systems, you may find a few minutes with the comments interesting and possibly helpful.

My scan of the comments plus the code in the “How Can I” post underscored the disconnect between what people believe a system can do and what a here-and-now system can actually do. Marketers say one thing, buyers believe another, and the installed software does something completely different.

Examples:

  1. A person’s name—in this case ‘Null’—cannot be located in a search system. With all the hoo-hah about Fancy Dan systems, is this issue with a named entity important? I think it is because it means that certain entities may not be findable without expensive, time-consuming human curation and indexing. Oh, oh.
  2. Non English names pose additional problems. Migrating a name in one language into a string that a native speaker of a different language can understand introduces some problems. Instead of finding one person, the system finds multiple people. Looking for a batch of 50 people each incorrectly identified during processing generates a lot of names which guarantees more work for expensive humans or many, many false drops. Operate this type of entity extraction system a number of times and one generates so much work there is not enough money or people to figure out what’s what. Oh, oh.
  3. Validating named entities requires considerable work. Knowledgebases today are “built automatically and on-the-fly.” Rules are no longer created by humans. Rules, like some of Google’s “janitor” technology, figure out the rules themselves and then “workers” modify those rules on-the-fly. So what happens when errors are introduced via “rules”? The system keeps on truckin’. Anyone who has worked through fixing up the known tags from a smart system like Autonomy IDOL knows that degradation can set in when the training set does not represent the actual content flow. Any wonder why precision and recall scores have not improved much in the last 20 years? Oh, oh.
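The failure mode in point 1 is easy to reproduce. The sketch below shows a naive ingestion step (the marker list and records are invented for the example) that conflates the surname “Null” with a missing value, so the person silently drops out of the index:

```python
# Toy demonstration of the "Null" problem: an ingestion step that treats
# the literal string "Null" as a missing value silently drops a person
# whose surname really is Null, so search can never find her.

MISSING_MARKERS = {"", "null", "none", "n/a"}

def naive_clean(value):
    """Common but dangerous: conflate the name 'Null' with 'no value'."""
    return None if value.strip().lower() in MISSING_MARKERS else value

records = [{"surname": "Smith"}, {"surname": "Null"}, {"surname": ""}]
index = [r["surname"] for r in records
         if naive_clean(r["surname"]) is not None]
print(index)  # Ms. Null has vanished from the index
```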

I think this item about “Null” highlights the very real and important problems with assumptions about automated content processing, whether the corpus is a telephone directory with a handful of names or the mind-boggling flows which stream from various content channels.

Buying a system does not solve long-standing, complicated problems in text processing. Fast talk like that which appears in some of the Search Wizards Speak interviews does not change the false drop problem.

So what does this mean for vendors of Fancy Dan systems? Ignorance on the part of buyers is one reason why deals may close. What does this mean for users of systems which generate false drops and dependent reports which are off base? Ignorance on the part of users makes it easy to use “good enough” information to make important decisions.

Interesting, Null?

Stephen E Arnold, August 3, 2013

Sponsored by Xenky

Treparel Makes Big Data Waves Overseas

August 3, 2013

Belgium is not a country we instantly associate with big data dominance. But the small nation has recently shown that it has an excellent sense of analytics and of who does a good job. We discovered just how from a Treparel article, “Treparel Wins LT-Innovate Award 2013,” which declares:

Treparel just recently announced its new strategy to collaborate with other software and solution vendors to enhance their solutions with advanced content analytics and visualizations using the KMX API. Winning the LT-Innovate Award 2013 is a reward from colleagues in the language technology and text analytics industry in Europe that we are in the right place, at the right time on the right track. And it’s a salute to the committed and hard work of our growing team!

Frankly, this company has not just been on the Belgian radar, but on ours as well. We fell head over heels after surfing its website and discovering a powerful vision: “Our solutions use the SVM algorithm in our unique methodology that dramatically changes the way people obtain information from data by means of text mining and visualization.” Keep this company in your head; we suspect this Belgian award is not the last big trophy it will snag.
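The SVM-on-text approach the company describes can be illustrated with a tiny example. The sketch below trains a minimal linear SVM (Pegasos-style sub-gradient descent on the hinge loss) over bag-of-words features; the documents, labels, and vocabulary are invented, and none of this is KMX’s implementation:

```python
import random

# Minimal linear SVM trained by sub-gradient descent on the hinge loss
# (Pegasos-style) over bag-of-words features. Purely illustrative of the
# general technique; the training documents and labels are made up.

def featurize(text, vocab):
    vec = [0.0] * len(vocab)
    for w in text.lower().split():
        if w in vocab:
            vec[vocab[w]] += 1.0
    return vec

def train_svm(samples, vocab, lam=0.01, epochs=200, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(vocab)
    t = 0
    for _ in range(epochs):
        rng.shuffle(samples)
        for text, y in samples:
            t += 1
            eta = 1.0 / (lam * t)                    # decaying step size
            x = featurize(text, vocab)
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            w = [(1 - eta * lam) * wi for wi in w]   # regularization shrink
            if margin < 1:                           # hinge-loss violation
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

def predict(w, text, vocab):
    score = sum(wi * xi for wi, xi in zip(w, featurize(text, vocab)))
    return 1 if score >= 0 else -1

docs = [("patent claims novel compound", 1),
        ("chemical patent filing compound", 1),
        ("quarterly revenue earnings report", -1),
        ("annual earnings revenue growth", -1)]
vocab = {w: i for i, w in enumerate(sorted({t for d, _ in docs for t in d.split()}))}
w = train_svm(list(docs), vocab)
```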

Patrick Roland, August 03, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Search Engine Plumbing: The Autonomy IDOL Diagram

August 2, 2013

Short honk: Documentation for enterprise search systems can be tough to get even when one is a licensee. Public information about the way the inner gears turn is often as rare as hen’s teeth or, in my case, geese’s teeth.

For anyone wondering what Autonomy IDOL’s help system looks like, the Hamilton IT Blog supplies an example titled simply, “IDOL Online Help.” The example sports functional tabs (“Action commands”, “Config params”, “Index commands”, and “Service commands”) with expandable category lists. If you are curious, check out the post.

Hurry, before it goes dark. Posting this type of information can lead to some interesting actions on the part of the vendor whose plumbing secrets are made evident.

Cynthia Murrell, August 2, 2013

Sponsored by Xenky

Autonomy ArcSight Tackles Security

August 2, 2013

HP Autonomy is chasing the Oracle SES angle: security for search. We took a look at the company’s pages about HAVEn, Autonomy’s latest big data platform. Regarding the security feature, ArcSight Logger, the company promises:

“With HP ArcSight Logger you can improve everything from compliance and risk management to security intelligence to IT operations to efforts that prevent insider and advanced persistent threats. This universal log management solution collects machine data from any log-generating source and unifies the data for searching, indexing, reporting, analysis, and retention. And in the age of BYOD and mobility, it enables you to comprehensively manage an increasing volume of log data from an increasing number of sources.”

More information on HAVEn can be found in the YouTube video, “Brian Weiss Talks HAVEn: Inside Track with HP Autonomy.” At the 1:34 mark, Autonomy VP Weiss briefly describes how ArcSight analyzes the data itself, from not only inside but also outside an enterprise, for security clues. For example, a threatening post in social media might indicate a potential cyber-attack. It is an interesting approach. Can HP make this a high revenue angle?

Cynthia Murrell, August 02, 2013

Sponsored by ArnoldIT.com, developer of Augmentext
