An Interview with Dinesh Vadhia
Since 2008, we have talked with more than 50 individuals in the search and content processing sector. However, we have not talked to an entrepreneur in the midst of getting a search company up and running. Xyggy’s splash page points out that the system has been designed to go “beyond search.” Intrigued, I chased down the founder, Dinesh Vadhia. The hook in the Xyggy approach is the “item”. In Xyggy parlance, items can be “documents, web pages, images, database records, publications, ads, audio, articles, video, social profiles, investments, patents, resumes, medical records, in fact, any data type”, and a query of one or more items finds other similar items.
I found the demonstrations interesting, and you will want to take a look at the image search, the music recommendation, and the patent search. Links for each are at www.xyggy.com.
Last week I spoke with Mr. Vadhia, and the full text of our discussion appears below:
Getting straight to the point, what is item-search?
Xyggy’s item-search is a new framework for IR based on how people learn concepts and generalize to new items. For instance, shown one or two apples for the first time, you will thereafter be able to point to apples every time one crosses your path. The apple may appear as the fruit itself or in an image, and yet we have the remarkable ability to absorb a small amount of information and generalize to new instances. The ability to learn concepts from examples and to generalize to new items is one of the cornerstones of intelligence.
In the item-search IR framework, a query consists of one or more items and Xyggy automatically infers which other items belong to that concept. Call it intelligent information retrieval founded on Bayesian statistical machine learning.
A document and a song are both items. Using a corpus of documents as the example, the unit of indexing is the document, a query consists of one or more documents, and the results are other similar documents in ranked order. For a universe of songs, a query consists of one or more songs, and the results are other similar songs in ranked order. Xyggy doesn’t employ the tf-idf indexing and associated ranking methods of traditional, word-based text-search systems. With item-search, the atomic and operational unit is the item.
Xyggy’s item-search method is a new IR tool for solving the ‘findability’ problem. Without a new tool you have only conventional and well-travelled paths to address the problem.
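The interview doesn’t spell out the mathematics, but the linked technical overview points to a Bayesian model of learning and generalization. One published algorithm in that family is Bayesian Sets (Ghahramani and Heller, 2005), which scores how well each candidate item fits the concept implied by a set of query items. The sketch below implements that score for binary feature vectors; the item names and features are invented for illustration, and this is not necessarily Xyggy’s actual method.

```python
import math

# Invented toy corpus: binary feature vectors (e.g. word present/absent).
items = {
    "apple1": [1, 1, 0, 0],
    "apple2": [1, 1, 1, 0],
    "apple3": [1, 1, 0, 1],
    "orange": [0, 1, 1, 1],
    "car":    [0, 0, 0, 1],
}

def bayesian_sets_scores(query_ids, items, alpha=2.0, beta=2.0):
    """Rank non-query items by the log Bayesian Sets score: how well each
    item fits the concept implied by the query set, under independent
    Beta-Bernoulli models per feature. (An item-independent constant term
    is dropped; it does not affect the ranking.)"""
    n = len(query_ids)
    dim = len(next(iter(items.values())))
    # Per-feature counts across the query items.
    q_sums = [sum(items[q][j] for q in query_ids) for j in range(dim)]
    scored = []
    for name, x in items.items():
        if name in query_ids:
            continue
        log_score = 0.0
        for j in range(dim):
            a_post = alpha + q_sums[j]      # Beta 'alpha' after seeing the queries
            b_post = beta + n - q_sums[j]   # Beta 'beta' after seeing the queries
            log_score += math.log(a_post / alpha) if x[j] else math.log(b_post / beta)
        scored.append((name, log_score))
    return sorted(scored, key=lambda t: t[1], reverse=True)

# A query of two "apples" ranks the third apple above the non-apples.
print(bayesian_sets_scores(["apple1", "apple2"], items))
```

Note how the query is a set of items rather than keywords: adding or removing a query item changes the inferred concept, which matches the toggling behavior described later in the interview.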
Why call it item-search and not document-search? Also, how are items represented?
An item can be any data type including web pages, documents, images, ads, social and professional profiles, publications, audio, articles, video, investments, patents, resumes, medical records and so on.
An item is defined by some or all of the features extracted from the data to create a feature vector, which can be thought of as its signature. For example, feature vectors for documents can be defined simply with word count occurrences.
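As a concrete illustration of that simplest case, a word-count feature vector over a fixed vocabulary can be built in a few lines (the vocabulary and sample text here are invented; a real system would also handle tokenization and punctuation):

```python
from collections import Counter

def word_count_vector(text, vocabulary):
    """Represent a text item as word-count occurrences over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocabulary]

vocab = ["apple", "fruit", "image", "search"]
print(word_count_vector("Apple image search finds an apple in an image", vocab))
# → [2, 0, 2, 1]
```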
The feature vector of an item type is defined during the data analysis phase prior to indexing and is mostly a matter of common sense. There doesn’t need to be a huge amount of feature engineering, as the relevant information simply needs to be clearly present in the feature vectors. For example, if you want to search for movies, it is useful to have information about actors if you expect your system to find movies with the same actor. If you represent images only with texture features, the system won’t find images with the same colors, and vice versa. The main advantage is that it doesn’t really hurt to have too many features, as long as they are at least plausibly relevant to the search.
Depending on the search application some clever and sophisticated features can also be created. For example, feature vectors can be defined for web pages and text documents that include word count occurrences as well as phrases, semantic concepts and relationships, numbers, geo details, urls, patterns, tags and annotations and so on.
Songs for a music recommendation system can be represented simply with listener playcounts. We can take this further by adding low-level audio features, text in lyrics, tags given by users, attributes from the Music Genome Project, patterns of user-music preferences and so on.
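A sketch of that simplest song representation, using an invented play log of (listener, song, playcount) records; each song’s feature vector has one dimension per listener:

```python
# Invented play log: (listener, song, playcount) records.
plays = [
    ("lis1", "song_a", 12), ("lis2", "song_a", 3),
    ("lis1", "song_b", 10), ("lis3", "song_b", 5),
    ("lis3", "song_c", 7),
]

def song_vectors(plays):
    """One feature vector per song; each dimension is one listener's playcount."""
    listeners = sorted({listener for listener, _, _ in plays})
    index = {listener: i for i, listener in enumerate(listeners)}
    vectors = {}
    for listener, song, count in plays:
        vec = vectors.setdefault(song, [0] * len(listeners))
        vec[index[listener]] = count
    return vectors

print(song_vectors(plays))
# → {'song_a': [12, 3, 0], 'song_b': [10, 0, 5], 'song_c': [0, 0, 7]}
```

The richer features mentioned above (audio features, lyric text, tags) would simply extend each song’s vector with additional dimensions.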
What kind of item-search services can be built with Xyggy?
Xyggy supports all data types and so covers the spectrum of IR uses, including text-only, non-text, mixed and recommendation systems. The following are the data sets from which we have created demos:
- New York Times annotated corpus consisting of 1.8m articles from 1987 to 2007
- last.fm listener playcount and tag data
- Netflix ratings data
- Patents using patent bibliographic data
- Content-based image search using unlabelled and labelled Flickr images
- Legal cases with citations
Apart from web and document search, Xyggy can also be used for ad retrieval through content matching; for building recommendation systems (“if you liked this, you will also like these”, which is about understanding the user’s mindset rather than the traditional “people who liked your choice also liked these”); and for finding similar people based on profiles (e.g., for social networks, online dating, recruitment, security and medical records).
These applications illustrate the countless range of problems for which Xyggy’s item-search provides a powerful new approach to finding relevant information.
In the digital world we deal primarily with the words on web pages, documents and messages. Instead, if we consider digital objects as items that can be searched for and found then it opens up a new world of possibility and opportunity. Dealing with items should be very natural because it is precisely what we do in our non-digital world.
Could you describe the unique search box deployed by Xyggy?
Because items are the atomic unit, we can drag, drop and touch them to deliver a far more natural user experience and interface (UXI). Items can be dragged in and out of the search box, and a search can contain multiple query items, each of which can be toggled on and off, with each action automatically returning a new set of relevant results.
There is a natural affinity between the interactive Xyggy search box and the new generation of touch devices. Internally, an item is represented by a feature vector. For display purposes the item can be represented by appropriate text, urls, visual icons or images to enhance usability.
Can you describe the commercial products and services your firm makes available today?
Xyggy is building an online item-search service as well as a mobile app capability.
What are the benefits to a commercial organization or a government agency when working with your firm?
If organizations are having problems with findability (text and non-text), then I would encourage them to get in touch with Xyggy with a view to getting a prototype built so that they can see the qualitative difference between text-search and item-search.
Since Xyggy is a startup we are always looking to work with organizations who can see the upside value of item-search.
Another challenge, particularly in professional intelligence operations, is moving data from point A to point B; that is, information enters a system but it must be made available to an individual who needs that information or at least must know about the information. Does your firm offer licensees a way to address this issue?
Xyggy is building an online item-search service, so this shouldn’t be an issue. Xyggy will work with organizations that need to operate with private cloud infrastructures. Incidentally, we believe that intelligence and criminal investigation operations would benefit tremendously from item-search, particularly as they deal with multiple data types, and finding similar items would help the investigation process.
There has been a surge in interest in putting "everything" in a repository and then manipulating the indexes to the information in the repository. On the surface, this seems to be gaining traction because network resident information can "disappear" or become unavailable. What's your view of the repository versus non repository approach to content processing?
This is analogous to the concept of a continually available data warehouse of predominantly unstructured data. Xyggy is applicable to all data types (which speaks to the “everything” in the question), and I would encourage innovative organizations to get prototypes built so that they can evaluate the difference between text-search and item-search.
I am on the fence about the merging of retrieval within other applications. What's your take on the "new" method which some people describe as "search-enabled applications"?
Text-search is coming under pressure today within the enterprise and on the web. Many see it as a failure of today’s information filters in the face of the real-time social data firehose. We have always lived with information overload, and each time the information filters start breaking, a new one pops up and off we go again. A bigger issue, and one different from what has gone before, is the exponential rate of information growth.
As the current generation of information filters starts breaking, it opens up the opportunity for companies to try different methods to satisfy IR needs. So I don’t see anything wrong with this trend. It probably makes IR purists hopping mad, but innovation is preferable to going down the same path.
Xyggy can be implemented either as pure search-and-find or within an application.
There seems to be a popular perception that the world will be doing computing via iPad devices and mobile phones. My concern is that serious computing infrastructures are needed and that users are "cut off" from access to more robust systems. How does Xyggy see the computing world over the next 12 to 18 months?
The formation of the real-time web gets closer with each new mobile internet device connection. The numbers are truly staggering. The expectation is that most of the world’s five billion feature phone users will transition to mobile internet devices in the coming years. With the advent of near-field communication (NFC) chips in devices, expect mobile ecommerce to take off. As all this happens, solid computing infrastructures will be needed, and they will predominantly be located in the cloud (well, data centers to the rest of us!).
Most people, most of the time will be interacting with the web from their mobile device. This has implications for the UXI and complexity of the app (native or web). People – consumers and employees – will want really simple ways to get things done. It is going to be interesting to see how the traditional enterprise vendors deal with this new reality.
The real-time web of exponentially growing information has put the spotlight on personalization. Lots of different terms are being bandied about, including implicit search, autonomous search, search-without-search, contextual discovery, information recommenders and serendipity engines, but it all comes down to making it easier and easier to find the information you need. As you’ve said countless times, it is about “findability”, not search. Search is a process; finding is satisfying the information need.
Is Xyggy working on anything in the autonomous area?
Yes, for mobile devices. We call it “autonomous find” or “machine-assisted discovery.” Xyggy will autonomously discover new items – comics, games, music, messages, videos, sites and so on – that are of interest and value to you whenever and wherever. Xyggy will understand you, take context and proximity into consideration when appropriate, and include surprise and randomness.
Forms of “do nothing” search will become the norm for coping with an exponential information growth rate, with billions of always-connected people looking to find items of value to them. I also think this will impact web sites. Consider products like Flipboard and paper.li, which are semi-personalized pages of news, information, photos and social media streams. Going forward, people will have their own highly personalized and dynamic web site of one or a few pages – instead of going to web sites, information from other sites will come to your site. Expect the reverse, too – visiting a web site will become a personalized experience.
As this plays out on the web in the coming years, it is unclear yet how it will impact enterprise search.
Put on your wizard hat. What are the three most significant technologies that you see affecting your business or presenting an opportunity?
The real-time web. Billions of always-connected mobile internet devices. Managing trillions of data items in public and private clouds in real-time.
Where does a reader get more information about Xyggy?
Technical overview: Information Retrieval using a Bayesian Model of Learning and Generalization
Email: dinesh [at] xyggy dot com
We think Xyggy’s approach is promising. The challenge for companies developing search and content processing boils down to two factors: getting the word out to thought leaders and coping with the still sluggish financial climate. We think Xyggy warrants a test drive. Some of its functionality appears to offer a financial upside as well.
Stephen E. Arnold, January 25, 2011