Cuil.com Gets Better

March 30, 2009

I did a fly over of the Cuil.com Web site. What triggered an overflight was a Google patent; specifically, US20090070312, “Integrating External Related Phrase Information into a Phrase-Based Indexing Information Retrieval System”. The application was filed in September 2007, and the USPTO spit it out on March 12, 2009. I discussed a chain of Dr. Patterson’s inventions in my 2007 study Google Version 2.0 here. Dr. Patterson is no longer a full-time Googler, but the tendrils of her research from Xift to Cuil pass through the GOOG. When I looked at Cuil.com today (March 29, 2009), I ran my suite of test queries. Most of them returned more useful and accurate results than my first look at the system in July 2008 here.

Several points I noticed:

  • The mismatching of images to hits has mostly been corrected. The use of my logo for another company, which was in the search engine optimization business, was annoying. No more. That part of the algorithm soup has been filtered.
  • The gratuitous pornography did not pester me again. I ran my favorites such as pr0n and similar code words. There were some slips, which some of my more young-at-heart readers will eagerly attempt to locate.
  • The suggested queries feature has become more useful.
  • My old chestnut “enterprise search” flopped. The hits were to sources that are not particularly useful in my experience. The Fast Forward conference is no more, but there’s a link to the now absorbed user group. The link to the enterprise search summit surprised me. The conference has been promoting like crazy despite the somewhat shocking turnout last year in San Jose, so it’s obvious that flooding information into sites fools the Cuil.com relevancy engine.
  • The Explore by Category is now quite useful. One can argue whether it is better than the “improved” Endeca. I think Cuil.com’s automated and high-speed method may be more economical to operate. Dr. Patterson and her team deserve a happy quack.

I am delighted to see that the improvements in Cuil.com are coming along nicely. Is the system better than Google’s or Microsoft’s Web search system? Without more testing, I don’t think I can make a definitive statement. I am certain that there will be PhD candidates or ASIS members who will rise to fill this gap in my understanding.

I have, however, added the Cuil.com system to my list of services to ping when I am looking for information.

Stephen Arnold, March 30, 2009

Storage a Problem for Most Organizations

March 30, 2009

Most people don’t know too much about Kroll, a unit of a diversified financial services firm. I was surprised, therefore, to see a public story about a survey conducted by this ultra low profile outfit. The article was “Storage Practices Don’t Match Policies” in IDM.Net, an Australian Web log here. The point of the write up was that in the Kroll survey storage policies were not particularly well conceived. The most important comment in the write up was:

The survey found that 40 percent of individuals stated that their company has a policy regarding where data should be stored. However, the survey results also revealed that 61 percent of respondents “usually” save to a local drive instead of a company network.

Makers of automated backup systems will rejoice. Attorneys suing an organization with lousy backup practices are probably dancing in the streets. Where there are informal collections of data, there is gold for the eDiscovery prospector.

If you want to know more about Kroll, click here and read the Search Wizards Speak with David Chaplin, one of the developers of Engenium, interesting software for extracting nuggets from these data gold mines.

Stephen Arnold, March 30, 2009

Google Interview Worth Reading

March 25, 2009

The interview with Alfred Spector in ComputerWorld is interesting for what it says and what it omits. You can find the article “The Grill: Google’s Alfred Spector on the Hot Seat” here. This is a three part interview. Mr. Spector is billed as Google’s vice president of research. For me, the most interesting comment was:

Do you have plans to go after that huge body of information on the Internet that is not currently searched? There is stuff on the Web, the so-called Deep Web, that is only “materialized” when a particular query is given by filling fields in a form. Since crawlers only follow HTML links, they cannot get to that “hidden” content. We have developed technologies to enable the Google crawler to get content behind forms and therefore expose it to our users. In general, this kind of Deep Web tends to be tabular in nature. It covers a very broad set of topics. It’s a challenge, but we’ve made progress.

I would hope so. Google has Drs. Guha and Halevy chugging away (or had them chugging away) on this problem. Furthermore, Google bought Transformic, a company to which most of the Google pundits have paid scant attention. Yep, Googzilla is making progress. Just plonking along with the fellow who worked on the semantic Web standards and the chap who invented the information manifold. I enjoy Google understatement.
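Google’s published work on “surfacing” the Deep Web describes probing a site’s search form with candidate input values and indexing the pages each probe materializes. Here is a minimal sketch of that idea; the URL, field names, and probe words are my own invented examples, not Google’s actual method or parameters:

```python
from urllib.parse import urlencode

def surface_form_urls(action_url, select_options, text_probes):
    """Generate crawlable GET URLs for a search form by enumerating
    its <select> options against a small set of probe keywords.
    Each resulting URL 'materializes' one otherwise hidden result page."""
    urls = []
    for option in select_options:
        for probe in text_probes:
            query = urlencode({"category": option, "q": probe})
            urls.append(f"{action_url}?{query}")
    return urls

# A hypothetical used-car site whose inventory is reachable only via its form.
urls = surface_form_urls(
    "http://example.com/search",
    select_options=["sedan", "truck"],
    text_probes=["ford", "toyota"],
)
for u in urls:
    print(u)
```

A real crawler would then fetch each URL, keep only the probes that return distinct result pages, and add those pages to the index.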

Stephen Arnold, March 24, 2009

Palantir: Data Analysis

March 24, 2009

In the last month, three people have asked me about Palantir Technologies. I have had several people mention the work environment and the high caliber of the team at the company. The company has about 170 employees and is privately held. I have heard that the firm is profitable, but I have that from two sources now hunting for work after their financial institutions went south. The company is one of the leaders in finance and intelligence analytics. The specialties of the company include global macro research and trading, quantitative trading, knowledge discovery, and knowledge management.

If you are not familiar with the company, you may want to navigate to www.palantirtech.com and take a look at the company’s offerings. Located in Palo Alto, the company focuses on making software that facilitates information analysis. With interest in business intelligence waxing and waning, Palantir has captured a very solid reputation for sophisticated analytics. Law enforcement and intelligence agencies “snap in” Palantir’s software to perform analysis and generate visualizations of the data. The company has been influenced by Apple in terms of the value placed upon sophisticated design and presentation. Palantir’s system makes highly complex tasks somewhat easier because of the firm’s interfaces. If you want to generate a visualization of a large, complex analytic method, Palantir can produce visually arresting graphics. If you navigate to the company’s “operation tradestop” page here, you can access demonstrations and white papers.

When I last checked the company’s demos, a number of them provided examples drawn from military and intelligence simulations. These examples provide a useful window into the sophistication of the Palantir technology. The company’s tools can manipulate data from any domain where large datasets and complex analyses must be run. The screenshot below comes from the firm’s demonstration of an entity extraction, text processing, and relationship analysis:

palantir 1

A Palantir relationship diagram. Each object is a link making it easy to drill down into the underlying data or documents.

Each object on the display is “live” so you can drill down or run other analyses about that object. The idea is to make data analysis interactive. Most of the vendors of high-end business intelligence systems offer some interactivity, but Palantir has gone further than most firms.
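The “live object” drill-down can be sketched with a toy link-analysis graph. The entities and relation types below are invented for illustration and have nothing to do with Palantir’s actual data model:

```python
# Entities and typed relationships, as on a link-analysis chart.
graph = {
    "Acme Corp": [("employs", "J. Smith"), ("pays", "Shell Co")],
    "J. Smith":  [("calls", "K. Jones")],
    "Shell Co":  [("wires funds to", "K. Jones")],
}

def drill_down(entity, depth=1):
    """Return the relationships reachable from one 'live' object,
    mimicking a click that expands a node on the chart."""
    if depth == 0:
        return []
    edges = []
    for relation, target in graph.get(entity, []):
        edges.append((entity, relation, target))
        edges.extend(drill_down(target, depth - 1))
    return edges

for edge in drill_down("Acme Corp", depth=2):
    print(" -> ".join(edge))
```

The point of interactivity is that each expansion is a fresh query against the underlying data, not a static picture.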

The company has a Web log, and it seems to be updated with reasonable frequency. The Web log does a good job of pointing out some of the features of the firm’s software. For example, I found this discussion of the Palantir monitoring server quite useful. The Web site emphasizes the visualization capabilities of the software. The Web log digs deeper into the innovations upon which the graphics rest.

Be careful when you run a Google query for Palantir. There are several firms with similar names. You will want to navigate to www.palantirtech.com. You may find yourself at another Palantir when you want the business intelligence firm.

Stephen Arnold, March 24, 2009

ISYS Search Software: Google Patent Collection

March 24, 2009

You will want to take a look at the ISYS Search Software demonstration here. The company took my collection of Google patent documents from 1998 to December 2008 and processed them. You can run a key word query, click on the names of people, and explore this window into Google’s technology hot house via the ISYS Search Version 9. When you locate a patent document that interests you, a single click will display the PDF of the patent document. You can browse the drawings and claims with the versatile ISYS system at your beck and call.

I have used the ISYS Search Software since Version 3.0. The system delivers high speed document processing, high speed query processing, and a raft of features. For more information about ISYS Version 9, click here. I have been critical of search systems for more than two decades. ISYS Search Software engineers have listened to me, and I know from experience that the teams in Crows Nest and in Denver have a long term commitment to their customers and to implementing useful features with each release.

Highly recommended. More information about ISYS Search Software is at http://www.isys-search.com/

Stephen Arnold, March 24, 2009

Financial Times: Try, Try, Try

March 20, 2009

Flashback. FT.com, year 2005. I was a paying subscriber. I got a user name and a password. I logged on. Ran a query and the system timed out. Flash forward to 2007. FT.com licenses Fast Search & Transfer. I tested the system. Slow. I was asked to test a semantic system under consideration by the Financial Times. Useful but slow, slow, slow. Now the Financial Times has tapped another point and click vendor for a “deep” search experience. Time out. The Financial Times, arguably one of the two biggest franchises in business information, has been a laggard in online search for quite a while. The FT’s parent owns a chunk of the Economist, another blue chip in business information. I was a subscriber to *both* the print and online editions until late 2007. Why did I drop these must read news sources? Too much hassle. I hope the FT’s new system moves from the “deep” to the daylight. I hope the FT monetizes its content successfully. I hope that I will be able to play in the World Cup, but I am a realist and recognize that hope does not mean accomplishment. If you are cheerleading for a dead tree outfit that once owned a wax museum, read the Guardian’s “Financial Times Launches Business-Focused Deep Search Service” here by Kevin Anderson. The article included a useful description of what the FT hopes to do with indexing:

The service allows users to search easily by news topic, organisation, person, place or theme. If a user searches for stories about business in China, the search can quickly be refined to cities in China, showing stories about Beijing, Shanghai or Hubei. Greenleaf described this as a “know before you click” model so that users can see related topics and the number of stories available for each sub-topic. In addition to automatic tagging, Newssift editors have also added other relationships to the service relevant to their business audience so that if someone looks for news about Ford Motor Company, they can also see related content from Ford suppliers.

This type of metatagging is useful, but it is both computationally intensive and demanding of human effort. But the main difference between this most recent try in FT’s quest to develop an online service that makes up for the precipitous loss of revenue from its traditional dead tree business is the economy. Too late. I wish the FT team success, but I don’t think this most recent service will deliver the cash needed to get the ship squared away for even rougher seas ahead. Red ink ahead in my opinion.
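Once stories carry tags, the “know before you click” counts the Guardian describes are straightforward to compute. A sketch with made-up stories and facets, not Newssift’s actual schema:

```python
from collections import Counter

# Hypothetical tagged stories in the spirit of Newssift's metadata.
stories = [
    {"title": "Factory output rises", "place": "Beijing",  "theme": "manufacturing"},
    {"title": "Port volumes up",      "place": "Shanghai", "theme": "trade"},
    {"title": "Steel demand grows",   "place": "Beijing",  "theme": "manufacturing"},
]

def facet_counts(results, facet):
    """Count stories per sub-topic so users know, before clicking,
    how many hits each refinement will yield."""
    return Counter(story[facet] for story in results)

print(facet_counts(stories, "place"))
print(facet_counts(stories, "theme"))
```

The expensive part is not this arithmetic; it is producing the tags in the first place, by automatic extraction plus editorial correction.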

Stephen Arnold, March 20, 2009

Marc Krellenstein Interview: Inside Lucid Imagination

March 17, 2009

Open source search is gaining more and more attention. Marc Krellenstein, one of the founders of Lucid Imagination, a search and services firm, talked about the company’s technology with Stephen E. Arnold, ArnoldIT.com. Mr. Krellenstein was the innovator behind Northern Light’s search technology, and he served as the chief technical officer for Reed Elsevier, where he was responsible for search.

In an exclusive interview, Mr. Krellenstein said:

I started Lucid in August, 2007 together with three key Lucene/Solr core developers – Erik Hatcher, Grant Ingersoll and Yonik Seeley – and with the advice and support of Doug Cutting, the creator of Lucene, because I thought Lucene/Solr was the best search technology I’d seen. However, it lacked a real company that could provide the commercial-grade support and other services needed to realize its potential to be the most used search software (which is what you’d expect of software that is both the best core technology and free). I also wanted to continue to innovate in search, and believed it is easier and more productive to do so if you start with a high quality, open source engine and a large, active community of developers.

Mr. Krellenstein’s technical team gives the company solid open source DNA. With financial pressures increasing and many organizations expressing dissatisfaction with mainstream search solutions, Lucid Imagination may be poised to enjoy rapid growth.

Mr. Krellenstein added:

I think most search companies that fail do so because they don’t offer decisively better and affordable software than the competition and/or can’t provide high quality support and other services. We aim to provide both and believe we are already working with the best and most affordable software. Our revenue comes not only from services such as training but also from support contracts and from value-add software that makes deploying Lucene/Solr applications easier and makes the applications better.

You can read the full text of the interview on the ArnoldIT.com Web site here. Search Wizards Speak is a collection of 36 candid interviews with movers and shakers in search, content processing, and business intelligence. Instead of reading what consultants say about a company’s technology, read what the people who developed the search and content processing systems say about their systems. Interviews may be reprinted and distributed without charge. Attribution and a back link to ArnoldIT.com and the company whose executive is featured in the interview are required. Stephen E. Arnold provides these interviews as a service to those interested in information retrieval.

Stephen Arnold, March 17, 2009

Voice Web Sites: New Frontier for Search

March 16, 2009

The Economic Times (India) reported that IBM has developed a technology for voice-only Web sites. The story “IBM Develops a Technology That Will Allow Users to Talk to Web” here reported:

“People will talk to the web and the web will respond. The research technology is analogous to the Internet. Unlike personal computers it will work on mobile phones where people can simply create their voice sites,” IBM India Research Laboratory Associate Director Manish Gupta said.

The notion of a spoken Web is interesting. The question I have is, “What technology will one use to search these sites?” I find that as I age, certain frequencies become difficult for me to hear and certain speech patterns become unparseable for me. Has IBM a breakthrough technology to address the challenges of searching voice-only Web sites?

Stephen Arnold, March 15, 2009

EveryZing: Exclusive Interview with Tom Wilde, CEO

March 16, 2009

Tom Wilde, CEO of EveryZing, will be one of the speakers at the April 2009 Boston Search Engine Meeting. To meet innovators like Mr. Wilde, click here and reserve your space. Unlike “boat show” conferences that thrive on walk-in gawkers, the Boston Search Engine Meeting is pure content muscle.

EveryZing here is a “universal search and video SEO (vSEO)” firm, and it recently launched MediaCloud, the Internet’s first cloud-based computing service for generating and managing metadata. Considered the “currency” of multimedia content, metadata includes the speech transcripts, time-stamped tags, categories/topics, named entities, geo-location and tagged thumbnails that comprise the backbone of the interactive web.

With MediaCloud, companies across the Web can post live or archived feeds of video, audio, image and text content to the cloud-based service and receive back a rich set of metadata.  Prior to MediaCloud and the other solutions in EveryZing’s product suite — including ezSEARCH, ezSEO, MetaPlayer and RAMP — discovery and publishing of multimedia content had been restricted to the indexing of just titles and tags.  Delivered in a software-as-a-service package, MediaCloud requires no software to purchase, install or maintain.  Furthermore, customers only pay for the processing they need, while obtaining access to a service that has virtually unlimited scalability to handle even large content collections in near real-time. The company’s core intellectual property and capabilities include speech-to-text technology and natural language processing.

Harry Collier (Infonortics Ltd) and I spoke with Mr. Wilde on March 12, 2009. The full text of our interview with him appears below.

Will you describe briefly your company and its search / content processing technology?

EveryZing originally spun out of BBN Technologies in Cambridge, MA.  BBN was truly one of the godfathers of the Internet, and developed the email @ protocol among other breakthroughs.  Over the last 20 years, the US Government has spent approximately $100MM with BBN on speech-to-text and natural language processing technologies.  These technologies were spun out in 2006 and EveryZing was formed.  EveryZing has developed a unique Media Merchandising Engine which is able to connect audio and video content across the web with the search economy.  By generating high quality metadata from audio and video clips, processing it with our NLP technology to automatically “tag” the content, and pushing it through our turnkey publishing system, we are able to make this content discoverable across the major search engines.

What are the three major challenges you see in search / content processing in 2009?

1) Indexing and discovery of audio and video content in search; 2) Deriving structured data from unstructured content; 3) Creating better user experiences for search & navigation.

What is your approach to problem solving in search and content processing?

Well, yes, meaning that all three are critical.  However, the key is to start with the user expectation.  Users expect to be able to find all relevant content for a given key term from a single search box.  This is generally known as “universal search”.  This requires then that all content formats can be easily indexed by the search engines, be they web search engines like Google or Yahoo, as well as site  search engines.  Further, users want to be able to alternately search and browse content at will.  These user expectations drive how we have developed and deployed our products.  First, we have the best audio and video content processing in the world.  This enables us to richly markup these files and make them far more searchable.  Second, our ability to auto-tag the content makes it eminently more browsable.  Third, developing a video search result page that behaves just like a text result page (i.e. keyword in context, sortability, relevance tuning) means users can more easily navigate large video results.  Finally, plumbing our meta data through the video player means users can search within videos and jump-to the precise points in these videos that are relevant to their interests.  Combining all of the efforts together means we can deliver a great user experience, which in turn means more engagement and consumption for our publishing partners.
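Mr. Wilde’s “jump-to” point reduces to searching a time-stamped transcript. A minimal sketch; the transcript, offsets, and function names below are invented for illustration and are not EveryZing’s format:

```python
# A time-stamped transcript, the kind of metadata speech-to-text yields.
transcript = [
    (0.0,  "welcome to the program"),
    (12.5, "today we discuss search engines"),
    (31.0, "video search is the next frontier"),
]

def jump_points(term):
    """Find every segment mentioning the term and return the offsets
    a video player could seek to, with keyword-in-context snippets."""
    return [(start, text) for start, text in transcript
            if term.lower() in text.lower()]

for offset, snippet in jump_points("search"):
    print(f"{offset:>6.1f}s  ...{snippet}...")
```

Because each hit carries an offset, a result page for video can behave like a text result page: keyword in context, sortable, and clickable to the exact moment in the clip.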

Search / content processing systems have been integrated into such diverse functions as business intelligence and customer support. Do you see search / content processing becoming increasingly integrated
into enterprise applications?

Yes, absolutely.  Enterprises are facing a growing pile of structured and unstructured content, as well as an explosion in multimedia content with the advent of telepresence, Webex, videoconferencing, distance learning etc.  At the same time, they face increasing requirements around discovery and compliance that requires them to be able to index all of this content.  Search is rapidly gaining  the same stature as databases and document management systems as core platforms.

Microsoft acquired Fast Search & Transfer. SAS acquired Teragram. Autonomy acquired Interwoven and Zantaz. In your opinion, will this consolidation create opportunities or shut doors?

Major companies are increasingly looking to vendors with deep pockets and bench strength around support and R&D.  This has driven some rapid market consolidation.  However, these firms are unlikely to be the innovators, and will continue to make acquisitions to broaden their offerings.  There is also a requirement to more deeply integrate search into the broader enterprise IT footprint, and this is also driving acquisitions.

Multi core processors provide significant performance boosts. But search / content processing often faces bottlenecks and latency in indexing and query processing. What’s your view on the performance of
your system or systems with which you are familiar?

Yes, CPU power has directly benefited search applications.  In the case of EveryZing, our cloud architecture takes advantage of quad-core computing so we can deliver triple threaded processing on each box.  This enables us to create multiple quality of service tiers so we can optimize our system for latency or throughput, and do it on a customer by customer basis.  This wouldn’t be possible without advances in computing power.

Graphical interfaces and portals (now called composite applications) are making a comeback. Semantic technology can make point and click interfaces more useful. What other uses of semantic technology do you see gaining significance in 2009?

Semantic analysis is core to our offering.  Every clip we process is run through our NLP platform, which automatically extracts tags and key concepts.  One of the great struggles publishers face today is having the resources to adequately tag and title all of their video assets.  They are certainly aware of the importance of doing this, but are seeking more scalable approaches.  Our system can use both unsupervised and supervised approaches to tagging content for customers.

Where can I find more information about your products, services, and research?

Our Web site is www.everyzing.com.

Autonomy Knipsel

March 12, 2009

A news release turned up in my newsreader with an interesting set of tags. You can read the story about Autonomy, the meaning based computing company, here. If the link goes dead, you will be able to find the original story on the Autonomy Web site here. My newsreader presented me with this headline, “Autonomy Powers Pioneering News Portal – MSN MoneyCentral”. What I think happened is that the appended source, “MSN MoneyCentral”, was jammed onto the news release title and the whole string treated as the title. I don’t know if the parser merged the two separate fields or if it was some other type of human or system error. I was expecting to learn that Autonomy sold its search system to MSN MoneyCentral. What the item told me was that Autonomy landed a news service about which I knew nothing. I found this interesting because my Overflight service makes some assumptions about what is a title and what is not a title. I will have to revisit that logic.
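One defensive tactic for title logic like Overflight’s is to split a feed title only when the trailing segment matches a known source name. A sketch, with a hypothetical source list and a plain-hyphen separator (real feeds may use en dashes or other delimiters):

```python
KNOWN_SOURCES = {"MSN MoneyCentral", "Reuters", "BusinessWire"}

def split_title(raw_title):
    """Split a feed title into (headline, source) when the text after
    the last ' - ' matches a known source name; otherwise treat the
    whole string as the headline."""
    if " - " in raw_title:
        head, _, tail = raw_title.rpartition(" - ")
        if tail.strip() in KNOWN_SOURCES:
            return head.strip(), tail.strip()
    return raw_title.strip(), None

print(split_title("Autonomy Powers Pioneering News Portal - MSN MoneyCentral"))
print(split_title("Plain headline with no source"))
```

The whitelist check is what keeps a legitimate dash inside a headline from being misread as a source field.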

Stephen Arnold, March 12, 2009
