Voice Web Sites: New Frontier for Search
March 16, 2009
The Economic Times (India) reported that IBM has developed a technology for voice only Web sites. The story “IBM Develops a Technology That Will Allow Users to Talk to Web” here reported:
“People will talk to the web and the web will respond. The research technology is analogous to the Internet. Unlike personal computers it will work on mobile phones where people can simply create their voice sites,” IBM India Research Laboratory Associate Director Manish Gupta said.
The notion of a spoken Web is interesting. The question I have is, “What technology will one use to search these sites?” I find that as I age, certain frequencies become difficult for me to hear and certain speech patterns become unparseable for me. Has IBM a breakthrough technology to address the challenges of searching voice only Web sites?
Stephen Arnold, March 15, 2009
EveryZing: Exclusive Interview with Tom Wilde, CEO
March 16, 2009
Tom Wilde, CEO of EveryZing, will be one of the speakers at the April 2009 Boston Search Engine Meeting. To meet innovators like Mr. Wilde, click here and reserve your space. Unlike “boat show” conferences that thrive on walk in gawkers, the Boston Search Engine Meeting is content muscle. Click here to reserve your spot.
EveryZing here is a “universal search and video SEO (vSEO)” firm, and it recently launched MediaCloud, the Internet’s first cloud-based computing service for generating and managing metadata. Considered the “currency” of multimedia content, metadata includes the speech transcripts, time-stamped tags, categories/topics, named entities, geo-location and tagged thumbnails that comprise the backbone of the interactive web.
With MediaCloud, companies across the Web can post live or archived feeds of video, audio, image and text content to the cloud-based service and receive back a rich set of metadata. Prior to MediaCloud and the other solutions in EveryZing’s product suite — including ezSEARCH, ezSEO, MetaPlayer and RAMP — discovery and publishing of multimedia content had been restricted to the indexing of just titles and tags. Delivered in a software-as-a-service package, MediaCloud requires no software to purchase, install or maintain. Furthermore, customers only pay for the processing they need, while obtaining access to a service that has virtually unlimited scalability to handle even large content collections in near real-time. The company’s core intellectual property and capabilities include speech-to-text technology and natural language processing.
Harry Collier (Infonortics Ltd) and I spoke with Mr. Wilde on March 12, 2009. The full text of our interview with him appears below.
Will you describe briefly your company and its search / content processing technology?
EveryZing originally spun out of BBN Technologies in Cambridge, MA. BBN was truly one of the godfathers of the Internet, and developed the email @ protocol among other breakthroughs. Over the last 20 years, the US Government has spent approximately $100MM with BBN on speech-to-text and natural language processing technologies. These technologies were spun out in 2006 and EveryZing was formed. EveryZing has developed a unique Media Merchandising Engine which is able to connect audio and video content across the web with the search economy. By generating high quality metadata from audio and video clips, processing it with our NLP technology to automatically “tag” the content, and pushing it through our turnkey publishing system, we are able to make this content discoverable across the major search engines.
What are the three major challenges you see in search / content processing in 2009?
1) Indexing and discovery of audio and video content in search; 2) Deriving structured data from unstructured content; 3) Creating better user experiences for search & navigation.
What is your approach to problem solving in search and content processing?
Well, yes, meaning that all three are critical. However, the key is to start with the user expectation. Users expect to be able to find all relevant content for a given key term from a single search box. This is generally known as “universal search”. This requires then that all content formats can be easily indexed by the search engines, be they web search engines like Google or Yahoo, as well as site search engines. Further, users want to be able to alternately search and browse content at will. These user expectations drive how we have developed and deployed our products. First, we have the best audio and video content processing in the world. This enables us to richly mark up these files and make them far more searchable. Second, our ability to auto-tag the content makes it eminently more browsable. Third, developing a video search result page that behaves just like a text result page (i.e. keyword in context, sortability, relevance tuning) means users can more easily navigate large video results. Finally, plumbing our metadata through the video player means users can search within videos and jump to the precise points in these videos that are relevant to their interests. Combining all of the efforts together means we can deliver a great user experience, which in turn means more engagement and consumption for our publishing partners.
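The "keyword in context" and "jump to the precise points" ideas in the answer above can be illustrated with a small sketch. This is not EveryZing's code. It assumes a hypothetical transcript format in which each word carries a start time in seconds, which is one plausible shape for time-stamped speech-to-text output. The names `Word` and `keyword_in_context` are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One transcript token with its start time in seconds (hypothetical schema)."""
    start: float
    text: str


def keyword_in_context(transcript, keyword, window=3):
    """Return (start_seconds, snippet) for each occurrence of keyword.

    The snippet holds up to `window` words on either side of the hit,
    and the start time is where a player could seek to.
    """
    needle = keyword.lower()
    hits = []
    for i, word in enumerate(transcript):
        # Ignore case and trailing punctuation when comparing tokens.
        if word.text.lower().strip(".,!?") == needle:
            lo = max(0, i - window)
            hi = min(len(transcript), i + window + 1)
            snippet = " ".join(w.text for w in transcript[lo:hi])
            hits.append((word.start, snippet))
    return hits


# Tiny made-up transcript; a real one would come from a speech-to-text step.
transcript = [
    Word(0.0, "Welcome"), Word(0.4, "to"), Word(0.6, "the"), Word(0.8, "search"),
    Word(1.2, "meeting."), Word(2.0, "Video"), Word(2.5, "search"),
    Word(3.0, "is"), Word(3.2, "hard."),
]

hits = keyword_in_context(transcript, "search", window=2)
for start, snippet in hits:
    print(f"{start:5.1f}s  {snippet}")
```

Each hit pairs a snippet for the result page with a time offset a video player could use as a seek target. A production system would need stemming, phrase matching, and relevance ranking, none of which is attempted here.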
Search / content processing systems have been integrated into such diverse functions as business intelligence and customer support. Do you see search / content processing becoming increasingly integrated into enterprise applications?
Yes, absolutely. Enterprises are facing a growing pile of structured and unstructured content, as well as an explosion in multimedia content with the advent of telepresence, Webex, videoconferencing, distance learning etc. At the same time, they face increasing requirements around discovery and compliance that require them to be able to index all of this content. Search is rapidly gaining the same stature as databases and document management systems as core platforms.
Microsoft acquired Fast Search & Transfer. SAS acquired Teragram. Autonomy acquired Interwoven and Zantaz. In your opinion, will this consolidation create opportunities or shut doors?
Major companies are increasingly looking to vendors with deep pockets and bench strength around support and R&D. This has driven some rapid market consolidation. However, these firms are unlikely to be the innovators, and will continue to make acquisitions to broaden their offerings. There is also a requirement to more deeply integrate search into the broader enterprise IT footprint, and this is also driving acquisitions.
Multi core processors provide significant performance boosts. But search / content processing often faces bottlenecks and latency in indexing and query processing. What’s your view on the performance of your system or systems with which you are familiar?
Yes, CPU power has directly benefited search applications. In the case of EveryZing, our cloud architecture takes advantage of quad-core computing so we can deliver triple threaded processing on each box. This enables us to create multiple quality of service tiers so we can optimize our system for latency or throughput, and do it on a customer by customer basis. This wouldn’t be possible without advances in computing power.
Graphical interfaces and portals (now called composite applications) are making a comeback. Semantic technology can make point and click interfaces more useful. What other uses of semantic technology do you see gaining significance in 2009?
Semantic analysis is core to our offering. Every clip we process is run through our NLP platform, which automatically extracts tags and key concepts. One of the great struggles publishers face today is having the resources to adequately tag and title all of their video assets. They are certainly aware of the importance of doing this, but are seeking more scalable approaches. Our system can use both an unsupervised and a supervised approach to tagging content for customers.
Where can I find more information about your products, services, and research?
Our Web site is www.everyzing.com.
Autonomy Knipsel
March 12, 2009
A news release turned up in my newsreader with an interesting set of tags. You can read the story about Autonomy, the meaning based computing company, here. If the link goes dead, you will be able to find the original story on the Autonomy Web site here. My newsreader presented me with this headline, “Autonomy Powers Pioneering News Portal – MSN MoneyCentral”. What I think happened is that the news release title has the appended source, “MSN Money Central” as the full title. I don’t know if the parser jammed the two separate fields together or if it was some other type of human or system error. I was expecting to learn that Autonomy sold its search system to MSN Money Central. What the item told me was that Autonomy landed a news service about which I knew nothing. I found this interesting because my Overflight service makes some assumptions about what is a title and what is not a title. I will have to revisit that logic.
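To make the title problem concrete, here is a minimal sketch of one way a feed parser might separate a headline from a source name appended to it. This is not how Overflight works. The function name, the separator list, and the idea of checking the tail against a list of known sources are all assumptions made for illustration.

```python
import re

# Separators that often sit between a headline and an appended source name:
# en dash, em dash, vertical bar, or hyphen, with whitespace on both sides.
# This list is an assumption, not a standard.
_SEPARATOR = re.compile(r"\s+[\u2013\u2014|-]\s+")


def split_title_and_source(headline, known_sources):
    """Split 'Title <sep> Source' into (title, source) when the tail is a known source.

    Returns (headline, None) when no known source is appended.
    """
    text = headline.strip()
    matches = list(_SEPARATOR.finditer(text))
    if not matches:
        return text, None
    last = matches[-1]
    title, tail = text[: last.start()], text[last.end():]
    # Compare without spaces or case so "MSN MoneyCentral" matches "MSN Money Central".
    normalized = tail.replace(" ", "").lower()
    for source in known_sources:
        if source.replace(" ", "").lower() == normalized:
            return title, source
    return text, None


headline = "Autonomy Powers Pioneering News Portal \u2013 MSN MoneyCentral"
title, source = split_title_and_source(headline, ["MSN Money Central"])
print(title)
print(source)
```

Checking the tail against known sources keeps the function from chopping legitimate titles that happen to contain a dash. The cost is that someone has to maintain the source list.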
Stephen Arnold, March 12, 2009
Media Cloud: Foggy Payoff
March 12, 2009
I wrote about Calais in 2008. You can find that article here. Calais makes use of ClearForest technology to perform semantic tagging. I am cautious when large companies make services available at a low or no cost. Now, Calais was pegged to a project at Harvard University. You can read the ReadWriteWeb.com story here. The Media Cloud project delivers some of the Google Trends or Compete.com type outputs from content processed with Calais. For me, the most interesting comment in the write up was:
we see this as an example of how the Internet is driving traditional media to change and respond in new ways. We are excited by the scope and potential that Media Cloud brings to anyone interested in following news and media trends.
I have a different view. A university demo project is just that: a demo with an academic spin. Traditional media need to do more than a demo before the money in the checking and savings accounts runs dry.
Stephen Arnold, March 12, 2009
Database Content: Take or Use
March 12, 2009
You may want to read Out-Law.com’s “Database Infringements Depend on Taking, Not Usage of Data” here. The article tackles an issue that has triggered a European Court of Justice ruling. For me the key statement in the Out-Law.com synopsis of the ruling was:
The Directive protects against “extraction and/or re-utilisation of the whole or of a substantial part…of the contents of that database”. The ECJ said that infringement was independent of the use to which someone wants to put the information.
Does this ruling matter in the US or elsewhere?
In my opinion, the ruling underscores the difference between how a person who compiles and provides access to that specific compilation of data perceives the value of the data and the person who wants to repurpose some of the data in that database. I am no lawyer, but I do work with clients who can click to a Web site and find useful information; for example, the data available from a government Web site or the patent information I have compiled for my Google patent search service.
Software can now slice and dice data. A programmer can make many information “meals” with these amazing software tools.
There are different ways to view the structured data such as airline flight information or condos for sale in Baltimore, Maryland or loosely structured data such as an RSS feed or well formed XML documents.
An innovator / entrepreneur can see these data as raw material for something new. The idea is that individual data items may gain utility when assembled or organized in a way different from the way the information appears on a specific Web site. Because the information is viewable in a browser, it seems to the innovator / entrepreneur that the data or their constituent elements like a phone number are like molecules in a mixture. These can be combined without losing their original chemical structure. The data are publicly available, so the data are meant to be used.
Search May Not Mean Search
March 11, 2009
Last week, I had a disturbing conversation with a very confident 30 something. After more than a year of planning, I learned that the company had decided to deploy a key word search system from a big name vendor. I asked, “What do the employees need? Keyword retrieval? Reports? Alerts?”
The answer was, “We have that information from informal discussions. Keyword search.”
I thanked the person for lunch and walked away shaking my head. Businesses are struggling for revenue, and employees in the organizations I have visited since October 2008 strike me as wanting to make their companies successful. Employees are savvy and know that if their employer goes down the drain finding another job might not be easy.
For some, there will be increased competition. Darwinianism is an abstract concept until a person can’t find work.
The 30 something had a job. An important job. The information technology unit at this services firm had search systems, but employees did not use them. The IT budget was getting scrutiny, so the manager and tech staff decided it was time to get a “new” search system.
The problem was that I had in 2003 and 2004 conducted interviews with a number of senior managers at this organization. I even knew the president of one of the operating units socially. Although my lunch took place in 2009, I realized that the IT department was going to make the same errors it had with its previous search procurements. Every two or three years, the company licensed another system. After a honeymoon of six months, the results were predictable in my opinion. Grousing and declining usage.
Vendors have a tough time breaking the cycle. Some search companies pitch a “simple solution” that is like a One a Day vitamin. Others deliver a toolkit that is far too complicated for the IT team to get working, and scarce budget dollars cannot be pumped into what amounts to a customized search system.
If this scenario resonates, you may want to navigate to LLrX and read the article, “Knowledge Discovery Resources 2009: An Internet MiniGuide Annotated Link Compilation” here. The listing was compiled by the prolific Marcus P. Zillman, Internet expert. What I liked about the meaty listing was it made clear to me one point: Search does not mean keyword retrieval. The list provided me with a meaty link burger. I discovered a number of useful resources. You will want to download it and do some exploration.
I did not send the list to my lunch pal, the 30 something who knows what his users want without bothering with surveys, interviews, focus groups, and observation of users in action. As long as organizations hire information technology professionals who know what “search” means, a list won’t make much difference.
You might have a more open mind. I hope so. Search defined as keyword retrieval is about as relevant today as a bronze surgical instrument in an emergency room in a big city hospital. Access to information in a way that meets the needs of individual users is, in my opinion, what search means.
Stephen Arnold, March 11, 2009
Search: Still in Its Infancy
March 9, 2009
Click here and read the job postings for intelligence professionals. Notice that the skills are those that require an ability to manipulate information, not just in English but in other languages. Here’s a portion of one posting:
Core Collector-certified Collection Management Officers (CMO’s) oversee and facilitate the collection, evaluation, classification, and dissemination of foreign intelligence developed from clandestine sources. CMO’s play a critical role in ensuring that foreign intelligence collected by clandestine sources is relevant,
I keep reading that search is stable and search is simple. I don’t think so. Because language is complex, the challenge for search and content processing vendors is significant. With more than 150 systems available to manipulate information, one would think that software could handle basic collection and analysis, right? Software helps but search is still in its infancy. The source of the jobs? The US Central Intelligence Agency, which is reasonably well equipped with search, text processing, and content analysis systems. Too bad the reality of search is complex, but some find it easy to say the problem is solved and move on in a fog of wackiness.
Stephen Arnold, March 9, 2009
ODNI Data Mining Report Available
March 8, 2009
If you want to keep a scorecard for data mining projects in some US government agencies, you may find the “Data Mining Report” (unclassified) interesting. You can download a copy here. You will need an acronym knowledgebase to make sense of some of the jargon.
For me, there were two interesting points:
- Video is a sticky wicket: lots of data and the tools are still evolving.
- Coordination remains a challenge.
Enjoy.
Stephen Arnold, March 8, 2009
MyRoar: NLP Financial Information Centric Service
March 6, 2009
A happy quack to the reader who alerted me to MyRoar.com. This is a vertical search service that relies on natural language processing. I did some sleuthing and learned that François Schiettecatte joined the company earlier this year. Mr. Schiettecatte has a distinguished track record in search, natural language processing, and content processing. French by birth, he went to university in the UK and has lived and worked in the US for many years. Here’s what the company says about MyRoar.com:
In today’s current political and economic environment people have never had more questions. MyRoar helps people sort through the hype to find just the answers they are looking for. Extraneous information is eliminated, while saving hours of time or abandonment of search. We provide a fun new interface that keeps users up to date on current news, which helps them formulate the best questions to ask. MyRoar is a Natural Language Processing Question Answering Search Engine. Using integrated technologies we are able to offer high precision allowing users to ask questions relating to finance and news. MyRoar integrates proprietary Question Answer matching techniques with the best English NLP tools that span the globe.
You can use the system here. The system performed quite well on my test queries; for example, “What are the current financials for Parker Hannifin?” returned two results with the data I wanted. I will try to get Mr. Schiettecatte to participate in the Search Wizards Speak interview series. Give the system a whirl.
Stephen Arnold, March 6, 2009
Metadata Perp Walk
March 5, 2009
I mentioned the problems of eDiscovery in a briefing I did last year for a content processing company. I have not published that information. Maybe some day. The point that drew a chuckle from the client was my mentioning the legal risk associated with metadata. I was reporting what I learned in one of my expert witness projects. Short take: bad metadata could mean a perp walk. Mike Fernandes’ “Think You’re Compliant? Corrupt Metadata Could Land You in Jail” here tackled this subject in a more informed way than my anecdote. He does a good job of explaining why metadata are important. Then he hits the marrow of this info bone:
Data recovery cannot be treated as the ugly stepsister of enterprise backup, and the special needs that ECM systems place on backup must not be ignored. Regulatory authorities and industry experts are beginning to demand more ECM- and compliance-savvy recovery management strategies, thereby setting new industry-wide legal precedents. One misstep can lead to disaster; however, there are approaches and ECM solutions that help avoid noncompliance, downtime and other incidents.
If you are floating through life assuming that your metadata are shipshape, you will want to make a copy of Mr. Fernandes’ excellent write up. Oh, and why the perp walk? Bad metadata can annoy a judge. More to the point, bad metadata in the hands of the attorney from the other side can land you in jail. You might not have an Enron problem, but from the inside of a cell, the view is the same.
Stephen Arnold, March 5, 2009