Mozenda and the Zen of Screen Scraping

September 27, 2011

Mozenda, or “More Zenful Data,” is offering a new approach to comprehensive web data gathering. The company’s “zen” business style combines with a functional SaaS application to create a new productivity tool. We learn from the company’s Web site:

This concept of creating a productivity tool rather than another application for the IT department resonated well with existing Mozenda customers. In 2008, Mozenda accomplished this goal by launching the first of its kind Software as a Service (SaaS) application for performing comprehensive web data gathering (a.k.a web data extraction, screen scraping, web crawling, web harvesting, etc.), data management, and data publishing. Mozenda, or “More Zenful Data”, is now a reality.

Customers like Attensity and Yahoo! are using Mozenda to obtain content. So when you can’t or don’t want to search using human power, Mozenda could be a good alternative for generating relevant content. An affordable and compelling option, Mozenda will easily compete in the field. Screen scraping is an interesting technical function, and it is one which may lead to some dust-ups between content owners and those who repurpose the content. Aggregating many different factoids into a giant repository can reduce some production costs, but the method can put the squeeze on those who create original information.
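Mozenda’s point-and-click service hides the mechanics, but the basic idea of screen scraping can be sketched in a few lines of code. The following Python fragment is a minimal, generic illustration, not Mozenda’s technology; it assumes the third-party requests and BeautifulSoup libraries are installed and uses a placeholder URL and selector:

```python
# Minimal screen-scraping sketch -- a generic illustration, not Mozenda's method.
# Assumes the third-party packages `requests` and `beautifulsoup4` are installed.
import requests
from bs4 import BeautifulSoup

def scrape_headlines(url):
    """Fetch a page and pull out headline text and links."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # The "h2 a" selector is a placeholder; real pages need site-specific rules.
    return [(a.get_text(strip=True), a.get("href")) for a in soup.select("h2 a")]

if __name__ == "__main__":
    for title, link in scrape_headlines("http://example.com/news"):
        print(title, "->", link)
```

The hard part, as Mozenda’s customers presumably know, is not fetching a page but keeping hundreds of such site-specific rules working as pages change.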

Emily Rae Aldridge, September 27, 2011

Sponsored by Pandia.com

Search Technologies: Competency and a Search Milestone

September 27, 2011

Search Technologies, one of the world’s largest independent providers of search engine expertise and implementation services, has a lot to be proud of this week. In addition to receiving its second Microsoft Gold Competency award, this time for digital marketing, the company has also signed its 100th Fast customer for search engine implementation services and consulting. We learned in “Search Technologies Achieves Microsoft Gold Competency in Digital Marketing”:

To earn this gold competency, organizations must complete a rigorous set of tests to prove their level of technology expertise, especially in delivering internet solutions on SharePoint 2010, Microsoft Fast Search Server, and related technologies. They must have the right number of Microsoft Certified Professionals, submit customer references and demonstrate their commitment to customer satisfaction by participating in an annual survey.

Search Technologies has a long history with the Fast product dating back to its origins, and it was honored as Fast Search & Transfer’s worldwide Alliance Partner of the Year back in 2006.

We have been monitoring the changes at Microsoft related to search. Our view is that interest in what is now Microsoft’s flagship search product remains strong. Search Technologies has begun more than a dozen new customer engagements thus far in 2011. Congratulations.

Jasmine Ashton, September 27, 2011

Sponsored by Pandia.com

Protected: Amazon Converts to SharePoint 2010

September 27, 2011


Inteltrax: Top Stories, September 19 to September 23

September 26, 2011

Inteltrax, the data fusion and business intelligence information service, captured three key stories germane to search this week, specifically, the idea of data analytics and business intelligence coming to the rescue in one way or another.

Our first story came from the article “BI Rescues Legal World” (http://inteltrax.com/?p=2392), in which we took a look at how the legal billing world is saving firms money by using business intelligence.

Another rescue tale appeared in the aptly titled “Cloud Computing to Rescue Struggling Ledgers” (http://inteltrax.com/?p=2407), which used Amazon as an example of how melding cloud computing and BI is putting many companies into the black.

Finally, in “Wine Gets the Big Data Treatment” (http://inteltrax.com/?p=2412), we explored how the wine industry, which has taken quite a hit during tough economic times, is staying afloat with big data analytic techniques.

Whether an organization is dishing up legal briefs or chardonnay, there seems to be a need for book balancing by way of big data. We’ll keep an eye on this development as economic belt tightening continues around the world.

Follow the Inteltrax news stream by visiting www.inteltrax.com

Patrick Roland, Editor, Inteltrax. September 25, 2011

Email Management Goes Mobile with Recommind

September 26, 2011

Recommind, famous for predictive information management software, has released a new version of its collaborative email management software. Check out the press release, “Recommind Takes Email Management Mobile With Decisiv Email 3.6.” We learned:

While smartphones, tablets and other mobile devices have made enterprise email more accessible than ever, they have also made managing and classifying it more difficult. By giving employees the ability to access and share email-based content around the clock, mobile devices have increased the amount of email content stored on corporate networks, the number of platforms used to read it and, most difficult to address, the complexity of managing it all.

In light of the above problem, Decisiv Email offers an innovative and functional approach to email filing and access. Filing is collaborative: when one employee files an email, it is automatically filed for everyone else. The automatic Outlook sync feature and the ability to file from any mobile device will quickly make Decisiv Email an invaluable time saver for busy employees. Peripatetic eDiscovery plus enterprise search!
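How might collaborative filing work under the hood? Here is a toy Python sketch of the general idea: a shared store maps a message identifier to a matter or folder, so one person’s filing decision is visible to everyone. This is purely illustrative and is not Recommind’s actual design; the class and field names are invented for the example.

```python
# Toy illustration of collaborative email filing -- not Recommind's design.
# A shared mapping from message ID to folder means one user's filing decision
# is immediately visible to every other user, on the desktop or on a phone.
class SharedFilingStore:
    def __init__(self):
        self._filings = {}  # message_id -> {"folder": ..., "filed_by": ...}

    def file_message(self, message_id, folder, filed_by):
        """Record a filing decision once, on behalf of the whole team."""
        self._filings[message_id] = {"folder": folder, "filed_by": filed_by}

    def folder_for(self, message_id):
        entry = self._filings.get(message_id)
        return entry["folder"] if entry else None


store = SharedFilingStore()
store.file_message("msg-001", "Acme v. Widget Co.", filed_by="alice")
# Bob, checking from his phone, sees the filing Alice made at her desk.
print(store.folder_for("msg-001"))  # -> Acme v. Widget Co.
```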

Emily Rae Aldridge, September 26, 2011

Sponsored by Pandia.com

UK Demands Stricter Piracy Regulations From Google

September 26, 2011

Recently the UK government, spearheaded by Culture Secretary Jeremy Hunt, has been pressuring Google to make life difficult for companies that break copyright laws. One way of doing this would be to limit their searchability. According to the Silicon.com article “Seek and Ye Shall Not Find: Google Asked to Hide Piracy Sites From Search,” Hunt said:
We intend to take measures to make it more and more difficult to access sites that deliberately facilitate infringement, misleading consumers and depriving creators of a fair reward for their creativity.
According to a Google blog post from September 2, the company has been working to fight this issue since December. Its strategy has been to act on reliable copyright takedown requests within 24 hours, to prevent terms closely related to piracy from appearing in Autocomplete, to improve its anti-piracy review process, and to increase the visibility of authorized preview content in search results.

Requiring Google to monitor the organizations that appear in its search results seems like a fairly logical idea. However, it leads me to ask: who will monitor the person doing the scrubbing?
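The Autocomplete piece of that strategy amounts to filtering candidate suggestions against a blocklist of piracy-related terms before they are shown to the user. Google has not published how its filter works, so the following Python fragment is only a hypothetical sketch of the idea, with an invented blocklist:

```python
# Hypothetical sketch of blocklist-based autocomplete filtering.
# Google's actual mechanism is not public; this only illustrates the idea.
BLOCKED_TERMS = {"torrent", "free download", "cracked"}

def filter_suggestions(candidates):
    """Drop any suggestion that contains a blocked term."""
    return [s for s in candidates
            if not any(term in s.lower() for term in BLOCKED_TERMS)]

print(filter_suggestions(["movie reviews", "movie torrent", "movie showtimes"]))
# -> ['movie reviews', 'movie showtimes']
```

The hard part, of course, is deciding what belongs on the blocklist, which brings us back to the question of who watches the scrubber.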

Jasmine Ashton, September 26, 2011

Sponsored by Pandia.com

Endeca Clicks into Real Time Search with DataSift

September 26, 2011

Endeca, known for its e-commerce software, is pairing with DataSift, a provider of aggregated social data feeds at Web scale. Their partnership will produce visualizations and advanced analytics on semi-structured content in real time. Benzinga covers the latest in, “Endeca and DataSift Team to Analyze the Real Time Web.” The write up asserts:

Pairing Endeca Latitude®, an Agile BI platform, with the breadth of social data like Facebook, Twitter, and WordPress as well as other popular social solutions, enables organizations to react to the “big data fire hose” alongside internal data, for marketing analytics, customer intelligence, CRM and competitive intelligence. Endeca and DataSift will demonstrate their joint offering at O’Reilly’s Strata Conference on September 22-23 in New York.

DataSift’s granular and modular sifting abilities combine with Endeca Latitude’s intuitive interface to produce a product that is both powerful and cost-effective. The as-yet-unnamed offering will help companies mine business value out of the gushing well of new social data.

Our view is that “latency” exists across the six major types of “real time” solutions. What does “real time” mean? Well, it means different things depending upon the application. Some solutions are mind-bogglingly expensive. Think Thomson Reuters’ feeds of financial data on certain investments. Others are pretty leisurely; for example, what is trending in the world of Lady Gaga. Interesting tie-up. No solid definition of latency yet. We are watching and waiting. You know. Latency.
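Latency is easier to measure than to define. A rough Python sketch, assuming each incoming item carries a timestamp of when the underlying event occurred (the field names here are invented for illustration):

```python
# Rough sketch of measuring end-to-end latency in a "real time" feed.
# Assumes each item carries the timestamp at which the event was created.
import time

def report_latency(items):
    """Report how stale each item is when it reaches the analytics layer."""
    for item in items:
        lag = time.time() - item["created_at"]
        print("item %s arrived %.1f seconds after creation" % (item["id"], lag))

# A financial feed may demand sub-second lag; a trending-topics feed can
# tolerate minutes. Same measurement, very different expectations.
report_latency([{"id": "tweet-1", "created_at": time.time() - 42.0}])
```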

Emily Rae Aldridge, September 23, 2011

Protected: Unraveling SharePoint 2010 Architecture

September 26, 2011


Traditional Entity Extraction’s Six Weaknesses

September 26, 2011

Editor’s Note: This is an article written by Tim Estes, founder of Digital Reasoning, one of the world’s leading providers of technology for entity based analytics. You can learn more about Digital Reasoning at www.digitalreasoning.com.

Most university programming courses ignore entity extraction. Some professors talk about the challenges of identifying people, places, things, events, and Social Security Numbers, and then leave the rest to the students. Other professors may have an assignment related to parsing text and detecting anomalies or bound phrases. But most of those emerging with a degree in computer science consign the challenge of entity extraction to the Miscellaneous file.

Entity extraction means processing text to identify, tag, and properly account for those elements that are the names of persons, numbers, organizations, locations, and expressions such as a telephone number, among other items. An entity can consist of a single word like Cher or a bound sequence of words like White House. The challenge of figuring out names is a tough one for several reasons. Many names exist in richly varied forms. You can find interesting naming conventions in street addresses in Madrid, Spain, and for the owner of a falafel shop in Tripoli.
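For readers who have never seen it done, here is a deliberately naive Python sketch of what “identify and tag” means in practice: a regular expression for one style of telephone number plus a heuristic that treats runs of capitalized words (such as “White House”) as candidate names. The patterns and sample sentence are my own illustrations, not Digital Reasoning’s method, and the false positive on the sentence-initial word hints at why naive rules are not enough.

```python
# Deliberately naive entity extraction sketch -- illustration only.
import re

# One North American phone number style; real systems need many formats.
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
# A run of one or more capitalized words, e.g. "Cher" or "White House".
CAPITALIZED_RUN = re.compile(r"\b(?:[A-Z][a-z]+)(?:\s+[A-Z][a-z]+)*\b")

def extract_entities(text):
    entities = [("PHONE", m.group()) for m in PHONE.finditer(text)]
    entities += [("NAME?", m.group()) for m in CAPITALIZED_RUN.finditer(text)]
    return entities

print(extract_entities("Call the White House at 202-456-1111 and ask for Cher."))
# Note that "Call" is wrongly flagged as a name -- a hint of why such
# simple rules break down on real text.
```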

Entities, as information retrieval experts have learned since the first DARPA conference on the subject in 1987, are quite important to certain types of content analysis. Digital Reasoning has been working for more than 11 years on entity extraction and related content processing problems. Entity oriented analytics have become a very important issue these days as companies deal with too much data, the need to understand the meaning, not just the statistics, of the data, and finally the need to understand entities in context, which is critical to understanding code terms and the like.

I want to highlight the six weaknesses of traditional entity extraction and contrast them with Digital Reasoning’s patented, fully automated method. Let’s look at the weaknesses.

1 Prior Knowledge

Traditional entity extraction systems assume that the system will “know” about the entities. This information has been obtained via training or specialized knowledge bases. The idea is that a system processes content similar to that which the system will process when fully operational. When the system is able to locate or a human “helps” the system locate an entity, the software will “remember” the entity. In effect, entity extraction assumes that the system either has a list of entities to identify and tag or a human will interact with various parsing methods to “teach” the system about the entities. The obvious problem is that when a new entity becomes available and is mentioned one time, the system may not identify the entity.
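Much of the prior-knowledge approach boils down to a gazetteer, a lookup list of known entities, and reducing it to a lookup makes the weakness obvious: anything absent from the list goes untagged. A hypothetical Python sketch, with an invented entity list and an invented company name:

```python
# Sketch of a gazetteer (knowledge-base) tagger and its blind spot.
# The entity list and the sample sentence are invented for illustration.
KNOWN_ENTITIES = {
    "White House": "FACILITY",
    "Cher": "PERSON",
}

def tag_with_gazetteer(text):
    """Tag only the entities the system already knows about."""
    return [(name, label) for name, label in KNOWN_ENTITIES.items() if name in text]

print(tag_with_gazetteer("Cher toured a startup called Frobnitz Labs."))
# -> [('Cher', 'PERSON')] -- the never-before-seen "Frobnitz Labs" is missed.
```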

2 Human Inputs

I have already mentioned the need for a human to interact with the system. The approach is widely used, even in the sophisticated systems associated with firms such as Hewlett Packard Autonomy and Microsoft Fast Search. The problem with relying on humans is a time and cost equation. As the volume of data to be processed goes up, more human time is needed to make sure the system is identifying and tagging correctly. In our era of data doubling every four months, the cost of coping with massive data flows makes human intermediated entity identification impractical.
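The arithmetic behind that cost equation is unforgiving. If content volume doubles every four months, it grows roughly eightfold in a year, and any workflow that depends on human review scales with it. A back-of-the-envelope Python sketch, with starting figures that are purely illustrative assumptions:

```python
# Back-of-the-envelope cost of human-reviewed entity tagging.
# The starting figures are illustrative assumptions, not measured data.
docs_per_day = 10000          # today's processing volume
review_minutes_per_doc = 0.5  # human spot-checking time per document

for year in range(3):
    volume = docs_per_day * (2 ** (3 * year))  # doubling every 4 months = 8x per year
    hours_per_day = volume * review_minutes_per_doc / 60
    print("year %d: %d docs/day, %.0f review hours/day" % (year, volume, hours_per_day))
```

Even with these modest assumptions, the required review time balloons from about 83 hours a day to more than 5,000 within two years, which is why human-intermediated identification breaks down.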

3 Slow Throughput

Most content processing systems talk about high performance, scalability, and massively parallel computing. The reality is that most of the subsystems required to manipulate content for the purpose of identifying, tagging, and performing other operations on entities are bottlenecks. What is the solution? Most vendors of entity extraction solutions push the problem back to the client. Most information technology managers solve performance problems by adding hardware to either an on premises or cloud-based solution. The problem is that adding hardware is at best a temporary fix. In the present era of big data, content volume will increase. The appetite for adding hardware lessens in a business climate characterized by financial constraints. Not surprisingly entity extraction systems are often “turned off” because the client cannot afford the infrastructure required to deal with the volume of data to be processed. A great system that is too expensive introduces some flaws in the analytic process.

Read more

Squiz and Funnelback Try Blog Marketing

September 25, 2011

The Squiz and Funnelback product suite continues to adapt to downward Web traffic trends among vendors of enterprise search. A spot check of traffic to the Funnelback Web site on Compete.com reported an alleged 2,845 unique visitors in July 2011, down from 5,988 in January 2011. For comparison, Compete reported an alleged 22,508 visitors in July 2011. Squiz acquired Funnelback in 2009, and the firm has been marketing in some interesting ways.

The Funnelback blog has adopted a Pulse-like presentation, an interesting marketing angle that has yet to prove itself. Check out the Squiz and Funnelback blog for more. The joint venture has made inroads into higher education in the past, such as the project mentioned here:

Daniel Jackson, Development Manager at City University, presented a case study of their recent web project to redesign and rebuild the university’s two corporate websites and create a new intranet for staff and students. This huge undertaking incorporated a new CMS (Squiz Matrix), new search engine (Funnelback Search), new servers, new network, new content, new IA, new design, new business processes… the list goes on!

Here’s what the blog looked like when we checked it on September 23, 2011:

[Screenshot: the Squiz and Funnelback blog as it appeared on September 23, 2011]

The design reminded us of how Pulse presents information on its iPad application. What is interesting is that the content focuses almost exclusively on Squiz and Funnelback. In our extensive tests of blogs, we find that company-centric information does not generate much traffic. The exceptions are firms which offer a wide range of solutions and are “standards” in the enterprise. IBM, for example, generated 968,158 unique visitors in July 2011, itself a sharp drop from an alleged 1.5 million.

How does a firm get “found” when a Bing or Google user is thrashing for a solution to an information problem? Check out www.augmentext.com for a possible solution. Company-centric content satisfies egos, but a different content method is required to cope with how findability is working in the real world. Some search and content processing vendors essentially attract zero traffic to their Web sites. Is it the content, the topic area, or a killer Panda with a hunger for search and content processing vendors?

Emily Rae Aldridge, September 25, 2011

Sponsored by Pandia.com
