Spindex Revealed
June 2, 2010
Microsoft’s Spindex is likely to add some useful functions to social media access. “Microsoft launches its Impossible Project: Spindex” provides a good description of smart software performing “personal indexing.” The idea is that a user’s social information allows Microsoft software to filter information. Only information pertinent to the user is available. When the source is a stream of Twitter messages, the Spindex system converts the noise in tweets to information related to a user’s interests. For me, the most interesting passage in the write up was:
Spindex is a way of surfacing the most shared or popular items that come through your personal news feeds on social networks like Twitter. Microsoft’s project is part of a wave of similar projects like The Cadmus, Feedera and Knowmore that try and synthesize trends and news streams from personal social networks. “Most people don’t really care about what’s trending on Twitter. They care about what’s trending in your own personal index. They want something that’s private, but that you can possibly make public and share with friends,” said Lili Cheng, who is general manager of the lab.
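To make the “personal index” notion concrete, here is a toy sketch of interest-based filtering. It is an illustration of the general idea only, not Microsoft’s implementation; the interest terms and tweets are invented.

```powershell
# Toy sketch of a "personal index" filter; not Microsoft's Spindex code.
# Interest terms (assumed to be derived from a user's social data) are
# used to keep only the tweets that mention at least one of them.

$interests = @("enterprise search", "indexing", "lucene")   # hypothetical profile
$tweets = @(
    "New white paper on enterprise search metadata",
    "What I had for lunch today",
    "Lucene 3.0 indexing performance tips"
)

# Keep a tweet when any interest term appears in its text
$personalIndex = $tweets | Where-Object {
    $tweet = $_
    @($interests | Where-Object { $tweet -match [regex]::Escape($_) }).Count -gt 0
}

$personalIndex   # returns the two search-related tweets, drops the noise
```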
Objectively, the service appears to be useful. Subjectively, Microsoft will have to make certain that privacy-centric users feel comfortable with the system.
Stephen E Arnold, June 2, 2010
Freebie
Apps Versus Browsers for Content
June 1, 2010
Fred Wilson’s A VC post “I Prefer Safari to Content Apps On The iPad” triggered some thoughts about search and findability. The main point of the write up is that some content is better when consumed through a browser. The write up identifies a number of reasons, including:
- Content as images
- Link issues
- A page at a time.
There are other reasons and you will want to read them in the original document.
I agree with most of these points, but there is a larger and I think more significant issue standing out of the spotlight. Those who create content as Apps may be making it difficult for a person looking for information to “find” the content. With the surge of interest in charging for “real” journalism or “real” essays, will search engines be able to index the content locked in Apps? The easy answer is, “Sure, you silly goose.”
But what if the publishers balk at playing ball with a Web indexing company? The outfit could be big and threatening like you-know-who in Mountain View or small and just getting its feet wet like Duck Duck Go.
Locked up content creates problems for researchers and restarts the cycle of having to have a bunch of accounts or waiting until an appropriate meta-index becomes available.
Stephen E Arnold, June 1, 2010
Freebie
Property Mappings or Why Microsoft Enterprise Search Is a Consultants’ Treasure Chest
May 31, 2010
First, navigate to “Creating Enterprise Search Metadata Property Mappings with PowerShell.” Notice that you may have difficulty reading the story because the Microsoft ad’s close button auto positions itself so you can’t get rid of the ad. Pretty annoying on some netbooks, including my Toshiba NB305.
Second, the author of the article is annoyed, but he apparently finds his solution spot on and somehow germane to open source search. Frankly, I don’t get the link between manual scripting to perform a common function and open source search. Well, that’s what comes from getting old and becoming less tolerant of stuff that simply does not work unless there is a generous amount of time to fix a commercial product.
What’s broken? Here’s the problem:
One of the things that drove me absolutely nuts about Enterprise Search in MOSS 2007 was that there was no built-in way to export your managed property mappings and install them on a new server. A third party utility on CodePlex helped, but it was still less than ideal. With SharePoint 2010, well you still really can’t export your property mappings to a file, but you do get a lot of flexibility using PowerShell.
And the fix?
You use the baker’s dozen lines of code in the write up, substitute your own variable names, and presto, you can get access to that hard won metadata. Here’s the author’s key point:
It seems like a lot but it really isn’t. I create two managed properties (TestProperty1 and TestProperty2). In the case of TestProperty2, I actually map two crawled properties to it.
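For readers who do not want to click through, here is a minimal sketch of the kind of script the article walks through, using the standard SharePoint 2010 search cmdlets. The column name is a placeholder; see the original post for the author’s exact baker’s dozen lines and the two-property example quoted above.

```powershell
# Sketch only: create a managed property and map a crawled property to it.
# Run in the SharePoint 2010 Management Shell; names below are illustrative.

$ssa = Get-SPEnterpriseSearchServiceApplication

# Create a managed property of type Text (1)
$managed = New-SPEnterpriseSearchMetadataManagedProperty -SearchApplication $ssa `
    -Name "TestProperty1" -Type 1

# Locate an existing crawled property to map (placeholder name)
$crawled = Get-SPEnterpriseSearchMetadataCrawledProperty -SearchApplication $ssa `
    -Name "ows_TestColumn1"

# Map the crawled property to the managed property
New-SPEnterpriseSearchMetadataMapping -SearchApplication $ssa `
    -ManagedProperty $managed -CrawledProperty $crawled

# A full crawl is required before the new mapping appears in query results.
```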
In my opinion, this type of manual solution is great for those with time to burn and money to pay advisors. Flip the problem. Why aren’t basic functions included in Microsoft’s enterprise search solutions? Oh, and what about that short cut for reindexing? Bet that works like a champ for some users. Little wonder that third party search solutions for SharePoint are thriving. And the open source angle? Beats me.
Stephen E Arnold, May 31, 2010
Freebie
DataparkSearch, Free Full-Featured Web Search Engine
May 24, 2010
Newslookup.com is quite a feat of news-search engineering. It is the first search engine to arrange search results by media type (television, radio, Internet, etc.) and category, display separate document parts, and effectively use metadata while crawling the Internet to provide a “snapshot look of news websites throughout the world.” The site is powered by a free, open-source search system called DataparkSearch, whose origins go back to 1998 and Russian programmer Maxim Zakharov.
Now in version 4, DataparkSearch boasts an impressive set of features, including indexing of all (x)html file types as well as MP3 and GIF files; support for http(s) and ftp URL schemes; vast language support; authentication and cookie support with session IDs in URLs; and a wide array of sorting, categorizing, and relevancy models to return specific results quickly. All of this is run through various database systems, notably SQL and ODBC.
Sochi’s Internet, a portal and search engine for the Russian city hosting the 2014 Winter Olympics, uses the DataparkSearch engine to deliver hotel, job, and real estate data for the city and surrounding area. The CGI front-end seen on the site presents the data collected by the “indexer,” described as a mechanism that “walks over hypertext references and stores found words and new references into the database.” The same mechanism allows for “fuzzy search,” accounting for spelling variations and different word forms.
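For the curious, getting a basic DataparkSearch crawl going is mostly a matter of editing indexer.conf. A minimal sketch might look like the following; the database URL and the site are placeholders, and the directives follow the stock indexer.conf style described in the project’s documentation.

```
# indexer.conf sketch with placeholder values, not a production configuration
DBAddr  mysql://dpuser:dppass@localhost/dpsearch/    # SQL back end holding the index
Server  http://www.example.com/                      # starting point for the crawler
# Run the "indexer" binary to crawl, then query through the search.cgi front end.
```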
DataparkSearch is available through its own Web site or via Google Code, where it has a busy activity log. Coded in C, the software runs on a plethora of Unix-like operating systems, including FreeBSD and Red Hat Linux. Frequency dictionaries, synonym lists, and other helpful files are available in multiple languages on the Web site as well. Support for the search engine can be found through its Wiki, forum, and Google Group.
Samuel Hartman, May 20, 2010
Freebie.
Exalead and Dassault Tie Up, Users Benefit
May 24, 2010
A happy quack to the reader who alerted us to another win by Exalead.
Dassault Systèmes (DS) (Euronext Paris: #13065, DSY.PA), one of the world leaders in 3D and Product Lifecycle Management (PLM) solutions, announced an OEM agreement with Exalead, a global software provider in the enterprise and Web search market. As a result of this partnership, Dassault will deliver discovery and advanced PLM enterprise search capabilities within the Dassault ENOVIA V6 solutions.
The Exalead CloudView OEM edition is dedicated to ISVs and integrators who want to differentiate their solutions with high-performing and highly scalable embedded search capabilities. Built on an open, modular architecture, Exalead CloudView uses minimal hardware but provides high scalability, which helps reduce overall costs. Additionally, Exalead’s CloudView uses advanced semantic technologies to analyze, categorize, enhance and align data automatically. Users benefit from more accurate, precise and relevant search results.
This partnership with Exalead demonstrates the unique capabilities of ENOVIA’s V6 PLM solutions to serve as an open federation, indexing and data warehouse platform for process and user data, for customers across multiple industries. Dassault Systèmes PLM users will benefit from its Exalead-empowered ENOVIA V6 solutions to handle large data volumes thus enabling PLM enterprise data to be easily discovered, indexed and instantaneously available for real-time search and intelligent navigation. Non-experts will have the opportunity to access PLM know-how and knowledge with the simplicity and the performance of the Web in scalable online collaborative environments. Moreover, PLM creators and collaborators will be able to instantly find IP from any generic, business, product and social content and turn it into actionable intelligence.
Stephen E Arnold, May 22, 2010
Freebie.
Social Networks, Testosterone, and Facebook
May 13, 2010
In my Information Today column which will run in the next hard copy issue, I talk about the advantage social networks have in identifying sites members perceive as useful. Examples are Delicious.com (owned by Yahoo) and StumbleUpon.com (once eBay and now back in private hands).
The idea is based in economics. Indexing the entire Web and then keeping up with changes is very expensive. With most queries answered by indexing a subset of the total Web universe, only a handful of organizations can tackle this problem. If I put on my gloom hat, the list of companies indexing as many Web pages as possible shrinks to one: Google. If I put on my happy hat, I can name a couple of other outfits. One implication is that Google may find itself spending lots of money to index content while its search traffic drifts to Facebook. Yikes. Crisis time in Mountain View?
It costs a lot less when many people identify important sites than when a lone person or company has to figure everything out for himself or herself. Image source: http://lensaunders.com/habit/img/peerpressuresmall.jpg
The idea is that when members recommend a Web site as useful, the company receiving that Web site’s URL can index the site’s content. Over time, a body of indexed content becomes useful. I routinely run specialized queries on Delicious.com and StumbleUpon.com, among others. I don’t run these queries on Google because the results list requires too much work to process. One nagging problem is Google’s failure to make it possible to sort results by time. I can get a better “time sense” from other systems.
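The cost advantage is easy to see in miniature. The sketch below (the URLs are placeholders) indexes only member-recommended pages instead of attempting a whole-Web crawl; the crawl frontier is tiny, so the bill is tiny.

```powershell
# Sketch: fetch and index only the pages members have recommended.
# A real bookmarking service would pull these URLs from its own data store.

$recommended = @(
    "http://www.example.com/report.html",
    "http://www.example.org/white-paper.html"
)

$client = New-Object System.Net.WebClient
$miniIndex = @{}

foreach ($url in $recommended) {
    # Store the page text keyed by URL; a real system would parse, tokenize,
    # and rank, but the point is the small, member-curated crawl frontier.
    $miniIndex[$url] = $client.DownloadString($url)
}

"Indexed {0} recommended pages" -f $miniIndex.Count
```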
When I read “The Big Game, Zuckerberg and Overplaying your Hand”, I interpreted these observations in the context of the information cost advantage. The write up makes the point via some interesting rhetorical touches that Facebook is off the reservation. The idea is that Facebook’s managers are seizing opportunities and creating some real problems for themselves and other companies. The round up of urls in the article is worth reviewing, and I will leave that work to you.
First, it is clear that social networks are traffic magnets because users see benefits. In fact, despite Facebook’s actions and the backlash about privacy, the Facebook system keeps on chugging along. In a sense, Facebook is operating like the captain of an ice breaker in the Arctic. Rev the engines and blast forward. Hit a penguin? Well, that’s what happens when a big ship meets a penguin. If – note, the “if” – the Facebook user community continues to grow, the behavior of the firm’s management will be encouraged. This means more ice breaker actions. In a sense, this is how Google, Microsoft, and Yahoo either operate now or operated in their youth. The motto is, “It is better to beg for forgiveness than ask for permission.”
Five Myths of Enterprise Search Marketing
May 12, 2010
The telephone and email flow has spiked. We are working to complete Google Beyond Text and people seem to be increasingly anxious (maybe desperate?) to know what can be done to sell search, content processing, indexing, and business intelligence.
Sadly there is no Betty White to generate qualified leads and close deals for most search and content processing vendors. See “From Golden Girl To It Girl: Betty White Has Become Marketing Magic.” This passage got my goose brain rolling forward:
On Saturday night, ‘SNL’ had its best ratings since 2008, with an estimated 11 million people tuning in to see Betty talk about her muffin. But more than the ratings boost was the shear hilarity of the show; for the first time in a long time, ‘SNL’ was at the center of the national conversation this Monday morning. ‘Saturday Night Live’ was good with Betty White. Really good! And that kind of chatter is something you just can’t buy.
The one thing the goose knows is that one-shot or star-centric marketing efforts are not likely to be effective. A few decades ago, I was able to promote newsletters via direct mail. The method was simple. License a list and pay a service bureau to send a four-page letter, an envelope, and a subscription card. Mail 10,000 letters and get 200 subscribers at $100 a pop. If a newsletter took off, like Plumb Bulletin Board Systems, which we sold to Alan Meckler, or MLS: Marketing Library Services, which we sold to Information Today, the math was good. Just keep mailing and when the subscription list hit 1,000 or more, sell out.
Times have changed. The cost of a direct mail program in 1980 was less than $1.00 per delivered item. Today, the costs have risen by a factor of five or more. What’s more important is that snail mail (postal delivered envelopes) is ignored. The indifferent recipient, or the recipient overwhelmed with worries about money, the kids, or getting the lawn mowed, now afflicts radio, television, cable, door knob hangers, fliers under windshield wipers, and almost any other form of marketing I used in 1970.
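Run the rough numbers from the newsletter example: in 1980, 10,000 letters at a bit under $1.00 each cost on the order of $10,000, and a two percent response at $100 per subscription brought in about $20,000. At five times the delivery cost, the same mailing runs roughly $50,000 against the same $20,000, and that is before allowing for all the envelopes that now go straight into the trash.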
I had a long call with a search entrepreneur yesterday, and in that conversation, I jotted down five points. None is specific to her business, but the points have a more universal quality in my opinion. Let me highlight each of these “myths”. A “myth” of course is a story accepted as having elements of truth.
First, sending news releases stuffed with words that assert “best,” “fastest,” “easiest,” or similar superlatives produces sales. I am not sure I have to explain this one. The language of the news release has to enhance credibility. If something is the “fastest” or “easiest,” telling me one time will not convince me. I don’t think it convinces anyone. The problem is the notion of a single news release. Another problem is the idea that baloney sells or produces high value sales leads. Another problem is that news releases disappear into the digital maw and get spit out in RSS feeds. Without substance, I ignore them. PR firms are definitely increasing their reliance on news releases, which are silly. So the myth that cooking up a news release makes a sale is false. A news release will get into the RSS stream, but will that sell? Probably a long shot.
Second, Webinars. I don’t know about you, but scheduled Webinars take time. For me to participate in one of these, I need to know that the program is substantive and that I won’t hear people stumble through impenetrable PowerPoint slides. I have done some Webinars for big name outfits, but now I am shifting to a different type of rich media. Some companies charge $10,000 or more to set up a Webinar and deliver an audience. The problem is that the audiences delivered for these fees are often either small or not prospects at all. A Webinar, like a news release, is a one shot deal, and one shot deals are less and less effective. The myth is that a Webinar is a way to make sales now. Maybe, maybe not.
Third, trade show exhibits. Trade show attendance is down. People want to go to conferences, but with the economic climate swinging wildly from day to day, funds to go to conferences are constrained. Conferences have to address a specific problem. Not surprisingly, events that are fuzzy are less likely to produce leads. I attended a user conference last week, and the exhibitors were quite happy. In fact, one vendor sent me an email saying, “I am buried in follow ups.” The myth that all trade shows yield sales is wrong. Some trade shows do; others don’t. Pick wrong and several thousand dollars can fly away in a heartbeat. For big shows, multiply that number by 10.
Fourth, Web sites sell. I don’t know about you, but Web sites are less and less effective as a selling tool. Most Web sites are brochureware unless there is some element of interactivity or stickiness. In the search world, most of the Web sites are not too helpful. Who reads Web pages? I don’t. Who reads white papers? I don’t. Who reads the baloney in the news releases or the broad descriptions of the company’s technology? I don’t. The Web sites held up as most effective are those showcased by marketers and designers. These are necessary evils, and my hunch is that Web sites will be losing effectiveness like snail mail, just more quickly. The myth is that Web sites pump money to the bottom line. Hogwash. Web sites are today’s collateral in most cases. A Web site is a necessary evil.
Fifth, social media. I know that big companies have executives who are in charge of social media. Google lacks this type of manager, but apparently the company is going to hire a “social wrangler” or “social trail boss.” Social media, like any other messaging method, requires work. A one shot social media push may be somewhat more economical and possibly more effective than a news release or two. Social media is real and hard work. The myth that it is a slam dunk is wrong.
So with these myths, what works?
I have to be candid. In the search and content processing markets, technology is not going to close deals. The companies that I hear are making sales are companies able to solve problems. In a conflicted market with great uncertainty, the marketing methods have to be assembled into a meaningful, consistent series of tactics. But tactics are not enough. The basics of defining a problem, targeting specific prospects, and creating awareness are the keys to success.
I wish I could identify some short cuts. I think consistency and professionalism have to be incorporated into ongoing activities. One shot, one kill may have worked for Buffalo Bill. I am not so sure the idea transfers to closing search deals.
Stephen E Arnold, May 12, 2010
A freebie.
A New Term for Search: Enterprise Mashup
May 12, 2010
I received a copy of “Mashups in the Enterprise IT Environment: The Impact of Enterprise Mashup Platforms on Application Development and Evolving IT Relationships with Business End Users”, written by BizTechReports.com. The white paper is about JackBe.com’s software platform.
Here is the company’s description of its product and services:
Enterprise Mashups solve the quintessential information sharing problem: accessing and combining data from disparate internal and external data sources and software systems for timely decision-making. JackBe delivers trusted mashup software that empowers organizations to create, customize and collaborate through enterprise mashups for faster decisions and better business results. Our innovative Enterprise Mashup platform, Presto®, provides dynamic mashups that leverage internal and external data while meeting the toughest enterprise security and governance requirements. Presto provides enterprise mashups delivered to the user in 3 clicks versus 3 months.
You can get more information from the firm’s Web site at www.jackbe.com. If you want a short cut to demonstrations of the firm’s technology, click here.
The company provides a platform and services to convert disparate data into meaningful information assets. What I find interesting is that the phrase “enterprise mashup” is used to reference a range of content processing activities, including content acquisition and processing, indexing, and information outputting. In short, “enterprise mashup” is a useful way to position functions that some vendors describe as search or findability.
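To make the label concrete, here is a generic toy sketch of the mashup idea itself. This is not JackBe’s Presto API; the data and field names are invented. The point is simply joining an internal data set with an external one on a shared key and presenting a combined view.

```powershell
# Generic mashup sketch (not JackBe's API): join "internal" and "external"
# records on a shared key and emit a combined view for decision makers.

$internal = @(
    @{ Sku = "A100"; Inventory = 42 },
    @{ Sku = "B200"; Inventory = 7 }
)

# In practice this would come from an external feed or REST service.
$external = @(
    @{ Sku = "A100"; MarketPrice = 19.99 },
    @{ Sku = "B200"; MarketPrice = 4.49 }
)

foreach ($row in $internal) {
    $match = $external | Where-Object { $_.Sku -eq $row.Sku }
    New-Object PSObject -Property @{
        Sku         = $row.Sku
        Inventory   = $row.Inventory
        MarketPrice = $match.MarketPrice
    }
}
```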
JackBe’s interface reminds me of other business intelligence data presentations.
I want to focus on the white paper because it provides important hints about the direction in which some types of content processing are moving.
First, the argument in the white paper hinges on an assertion that there is a “hyper dynamic environment.” To deal with this environment, a different approach to information is required. What is interesting is that the JackBe audience is a blend of developers and business professionals. Some search vendors are trying to get to the senior management of a company. JackBe is interested in two audiences.
Second, the white paper explains the concept of “mashup”. The word compresses a range of information activities into a single term. To implement a mashup, JackBe provides widgets that help reduce the time and hassle of building “situation specific” implementations. Some search vendors talk about customization and personalization. The JackBe approach sidesteps these fuzzy notions and focuses on the idea of a “snap in”, lightweight method.
Finally, the JackBe approach uses an interesting metaphor. The phrase I noted was the “Home Depot model of enterprise IT.” Instead of taking disparate components of a typical search engine, JackBe suggests that a licensee can select what’s needed to do a particular information job.
You will want to read the white paper and glean more detailed information. I want to focus on the differences in the JackBe approach. These include:
- Avoiding the overused and little understood terms such as search, taxonomies, business intelligence, and semantic technology. I am not sure JackBe’s approach is going to eliminate confusion, but it is clear to me that JackBe.com is trying to steer clear of the traditional jargon.
- The JackBe approach is more trendy than IBM’s explanation of OmniFind. The notion of a mashup itself and the references to the “long tail” concept are examples.
- To some enterprise procurement teams, JackBe’s approach may be perceived as quite different from the services of larger, higher profile vendors. In my view, this may be a positive step. Search vendors who follow in the footsteps of STAIRS III or Verity are not likely to have the sales success a more creative positioning permits.
To sum up, I think that companies with search and content processing technology will be working hard to distance themselves from the traditional vendors’ methods. The reason is that search as a stand-alone service is increasingly perceived as an island. Organizations need systems that connect the islands of information into something larger.
Is JackBe a search and content processing vendor? Yes. Will most people recognize the company’s products and services as basic search? Not likely. Will the positioning confuse some potential licensees? Maybe.
Stephen E Arnold, May 12, 2010
Unsponsored post.
Monitoring Google via Patent Documents, Definitely Fun
May 8, 2010
As soon as I returned from San Francisco, it was telephone day. Call after call. One of the callers was a testosterone charged developer in a far off land. The caller had read my three Google studies and wanted to know why my comments and analyses were at variance with what Googlers said. The caller had examples from Google executives in mobile, enterprise apps, advertising, and general management. His point was that Google says many things and none of the company’s comments reference any of the technologies I describe.
I get calls like this every couple of months. Let me provide a summary of the points I try to make when I am told that I describe one beastie and the beastie is really a unicorn, a goose, or an eagle.
First, Google is compartmentalized, based on short info streams shot between experts with sometimes quite narrow technical interests. I describe Google as a math club, which has its good points. Breadth of view and broad thinking about other subjects may not be a prerequisite to join. As a result, a Googler working in an area like rich media may not know much or even care about the challenges of scaling a data center, tracking down SEO banditry, or learning about the latest methods in ad injection for YouTube advertisers. This means that a comment by a Google expert is often accurate and shaped for that Googler’s area. Big thinking about corporate tactics may or may not be included.
Second, Google management—the top 25 or 30 executives—are pretty bright and cagey folks. Their comments are often crafted to position the company, reassure those in the audience, or instruct the listener. I have found that these individuals provide rifle shot information. On rare occasions, Google will inform people about what they should do; for example, “embrace technology” or “stand up for what’s right”. On the surface these comments are quotable, but they don’t do much to pin down the specific “potential energy” that Google has to move with agility into a new market. I read these comments, but I don’t depend on them for my information. In fact, verbal interactions with Googlers are often like a fraternity rush meeting, not a discussion of issues, probably for the reasons I mentioned in point one above.
Third, Google’s voluminous publicly available information is tough to put into a framework. I hear from my one, maybe two clients, that Google is fragmented, disorganized, chaotic, and tough to engage in discussion. No kidding. The public comments and the huge volume of information scattered across thousands of Google Web pages require a special purpose indexing operation to make them manageable. I provide a free service, in concert with Exalead, so you can search Google’s blog posts. You can see a sample of this service at www.arnoldit.com/overflight. I have a system to track certain types of Google content, and from that avalanche of stuff, I narrow my focus to content that is less subject to PR spin; namely, patent documents and papers published in journals. I check out some Google conference presentations, but these are usually delivered by one of Google’s many graduate interns or junior wizards. When a big manager talks, the presentation is subject to PR spin. Check out comments about Google Books or the decision to play hardball with China for examples.
My work, therefore, is designed to illuminate one aspect of Google that most Googlers and most Google pundits don’t pay much attention to. Nothing is quite so thrilling as reading Google patent applications, checking the references in these applications, figuring out what the disclosed system and method does, and relating the technical puzzle piece to the overall mosaic of “total Google”.
You don’t have to know much about my monographs to understand that I am describing public documents that focus on systems and methods that may or may not be part of the Google anyone can use today. In fact, patent documents may never become a product. What a patent application provides includes:
- Names of Google inventors. Example: Anna Patterson, now running Cuil.com. I don’t beat up on Cuil.com because Dr. Patterson is one sharp person and I think her work is important because she is following the research path explained in her Google patent documents, some of which have now become patents. In my experience, knowing who is “inventing” some interesting methods for Google is the equivalent of turning on a light in a dark room.
- The disclosed methods. Example: There’s a lot of chatter about how lousy Wave was and is. The reality I inhabit is that Wave makes use of a number of interesting Google methods. Reading the patent applications and checking out Wave makes it possible to calibrate where in a roll out a particular method is. For that reason, I am fascinated by Google “janitors” and other disclosures in these publicly available and allegedly legal documents.
- The disclosures through time. I pay attention to dates on which certain patent documents and technical papers appear. I plot these and then organize the inventions by type and function. Over the last eight years I have built a framework of Google capabilities that makes it possible to offer observations based on this particular body of open source information.
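As a trivial illustration of that bookkeeping, grouping filings by year and functional area is enough for clusters to show up. The records below are invented placeholders, not actual Google filings.

```powershell
# Toy sketch of organizing patent documents by year and function.
# The records are made-up placeholders.

$filings = @(
    New-Object PSObject -Property @{ Year = 2007; Area = "Indexing" },
    New-Object PSObject -Property @{ Year = 2008; Area = "Indexing" },
    New-Object PSObject -Property @{ Year = 2008; Area = "Rich media" },
    New-Object PSObject -Property @{ Year = 2009; Area = "Rich media" }
)

# Count filings per year and functional area
$filings | Group-Object Year, Area | Sort-Object Name | Select-Object Name, Count
```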
When you look at these three points and my monographs, I think it is pretty easy to see why my writings seem to describe a Google that is different from the popular line. To sum up, I focus on a specific domain and present information about Google’s technology that is described in the source documents. I offer my views of the systems and methods. I describe implications of these systems and methods.
I enjoy the emails and the phone calls, but I am greatly entertained by my source documents. My fourth Google monograph, Google Beyond Text, will be available in a month or so. Like my previous three studies, it contains some interesting discoveries and hints that Google has reached a pivot point.
Stephen E Arnold, May 8, 2010
Sponsored post. I paid myself to write this article. Such a deal.
Milward from Linguamatics Wins 2010 Evvie Award
April 28, 2010
The Search Engine Meeting, held this year in Boston, is one of the few events that focuses on the substance of information retrieval, not the marketing hyperbole of the sector. Now entering its second decade, the conference features speakers who tackle challenging subjects. This year’s presentations included “Universal Composable Indexing” by Chris Biow, Mark Logic Corporation; “Innovations in Social Search” by Jeff Fried, Microsoft; “From Structured to Unstructured and Back Again: Database Offloading” by Gregory Grefenstette, Exalead; and a dozen other important topics.
From left to right: Sue Feldman, Vice President, IDC, Dr. David Milward, Liz Diamond, Stephen E. Arnold, and Eric Rogge, Exalead.
Each year, the best paper is recognized with the Evvie Award. The “Evvie” was created in honor of Ev Brenner, one of the pioneers in machine-readable content. After a distinguished career at the American Petroleum Institute, Ev served on the planning committee for the Search Engine Meeting and contributed his insights to many search and content processing companies. One of the questions I asked after each presentation was, “What did Ev think?” I valued Ev Brenner’s viewpoint, as did many others in the field.
The winner of this year’s Evvie award is David R. Milward, Linguamatics, for his paper “From Document Search to Knowledge Discovery: Changing the Paradigm.” Dr. Milward said:
Business success is often dependent on making timely decisions based on the best information available. Typically, for text information, this has meant using document search. However, the process can be accelerated by using agile text mining to provide decision-makers directly with answers rather than sets of documents. This presentation will review the challenges faced in bringing together diverse and extensive information resources to answer business-critical R&D questions in the pharmaceutical domain. In particular, it will outline how an agile NLP-based approach for discovering facts and relationships from free text can be used to leverage scientific knowledge and move beyond search to automated profiling and hypothesis generation from millions of documents in real time.
Dr. Milward has 20 years’ experience of product development, consultancy and research in natural language processing. He is a co-founder of Linguamatics, and designed the I2E text mining system which uses a novel interactive approach to information extraction. He has been involved in applying text mining to applications in the life sciences for the last 10 years, initially as a Senior Computer Scientist at SRI International. David has a PhD from the University of Cambridge, and was a researcher and lecturer at the University of Edinburgh. He is widely published in the areas of information extraction, spoken dialogue, parsing, syntax and semantics.
Presenting this year’s award was Eric Rogge, Exalead, and Liz Diamond, niece of Ev Brenner. The award winner received a recognition award and a check for $500. A special thanks to Exalead for sponsoring this year’s Evvie.
The judges for the 2010 Evvie were Dr. David Evans (Evans Research), Sue Feldman (IDC), and Jill O’Neill (NFAIS).
Congratulations, Dr. Milward.
Stuart Schram IV, April 28, 2010
Sponsored post.