Search: A Persistent Disconnect between Reality and Innovation

August 17, 2012

Two years ago I wrote The New Landscape of Search. Originally published by Pandia in Norway, the book is now available without charge when you sign up for our new “no holds barred” search newsletter Honk!. In the discussion of Microsoft’s acquisition of Fast Search & Transfer SA in 2008, I cite documents which describe the version of Fast Search which the company hoped to release in 2009 or 2010. After the deal closed, the new version of Fast seemed to drop from view. What became available was “old” Fast.

I read the InfoWorld story “Bring Better Search to SharePoint.” Set aside the PR-iness of the write up. The main point is that SharePoint has a lousy search system. Think of the $1.2 billion Microsoft paid for what seems to be, according to the write up, a mongrel dog. My analysis of Fast Search focused on the age of the code, which dates from the late 1990s, and on its use of proprietary, third party, and open source components. The system’s complexity and its 32 bit architecture needed attention beyond refactoring.

The InfoWorld passage which caught my attention was:

Longitude Search’s AptivRank technology monitors users as they search, then promotes or demotes content’s relevance rankings based on the actions the user takes with that content. In a nutshell, it takes Microsoft’s search-ranking algorithm and makes it more intelligent…
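The passage is silent on mechanics, but the general pattern of click-feedback boosting is well known. Here is a minimal sketch of the idea in Python. The class, action names, and weights are my inventions for illustration; this is not Longitude Search’s AptivRank code or Microsoft’s ranking algorithm.

```python
# Hypothetical sketch of click-feedback relevance boosting.
# Not Longitude Search's AptivRank or Microsoft's ranking code;
# action names and weights are invented for illustration.

from collections import defaultdict

# How much each observed user action shifts a document's boost.
ACTION_WEIGHTS = {
    "open": 0.1,        # user opened the document from the result list
    "download": 0.3,    # user downloaded or saved it
    "dwell_long": 0.2,  # user stayed on the document for a while
    "bounce": -0.2,     # user returned to the results almost immediately
}

class FeedbackBooster:
    def __init__(self, decay=0.99):
        self.boosts = defaultdict(float)
        self.decay = decay  # older feedback slowly loses influence

    def record(self, doc_id, action):
        """Promote or demote a document based on a user action."""
        self.boosts[doc_id] += ACTION_WEIGHTS.get(action, 0.0)

    def age(self):
        """Apply decay so stale feedback does not dominate forever."""
        for doc_id in self.boosts:
            self.boosts[doc_id] *= self.decay

    def rescore(self, results):
        """results: list of (doc_id, base_score) pairs from the engine."""
        rescored = [(doc_id, base + self.boosts[doc_id])
                    for doc_id, base in results]
        return sorted(rescored, key=lambda pair: pair[1], reverse=True)

booster = FeedbackBooster()
booster.record("doc-42", "download")
booster.record("doc-17", "bounce")
print(booster.rescore([("doc-17", 1.0), ("doc-42", 0.9)]))
# doc-42 now outranks doc-17
```

The point of the sketch is that the “intelligence” amounts to nudging a base score up or down from observed behavior, which is exactly the kind of tweak I mean below.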

The solution to SharePoint’s woes amounts to tweaking. In my experience, there are many vendors offering similar functionality and almost identical claims regarding fixing up SharePoint. You can chase down more at www.arnoldit.com/overflight.

The efforts are focused on a product with a large market footprint. In today’s dicey economic casino, it makes sense to trumpet solutions to long standing information retrieval challenges in a product like SharePoint. Heck, if I had to pick a market to pump up my revenue, SharePoint is a better bet than some others.

Contrast InfoWorld’s “overcome SharePoint weaknesses” angle with the search assertions in “Search Technology That Can Gauge Opinion and Predict the Future.” We are jumping from the reality of a Microsoft product which has an allegedly flawed search system into the exciting world of what everyone really, really wants: serious magic. Fixing SharePoint is pretty much hobby store magic. Predicting the future: That is big time, hide the Statue of Liberty magic.

Here’s the passage which caught my attention:

A team of EU-funded researchers have developed a new kind of internet search that takes into account factors such as opinion, bias, context, time and location. The new technology, which could soon be in use commercially, can display trends in public opinion about a topic, company or person over time — and it can even be used to predict the future…Future Predictor application is able to make searches based on questions such as ‘What will oil prices be in 2050?’ or ‘How much will global temperatures rise over the next 100 years?’ and find relevant information and forecasts from today’s web. For example, a search for the year 2034 turns up ‘space travel’ as the most relevant topic indexed in today’s news.

Yep, rich indexing, facets, and understanding text are in use.
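The write up does not explain the plumbing, but the “Future Predictor” behavior suggests indexing documents by the future dates they mention and then aggregating topics per year. A toy sketch of that idea follows. The data and function names are invented, and I make no claim that the EU project works this way.

```python
# Toy sketch of "future prediction" search: index documents by the
# future years they mention, then report the most common topics per year.
# Invented example data; not the EU-funded project's actual pipeline.

import re
from collections import Counter, defaultdict

documents = [
    ("Space travel could be routine by 2034, analysts say.", ["space travel"]),
    ("A 2034 target for crewed Mars missions remains ambitious.", ["space travel", "mars"]),
    ("Oil prices in 2050 are anyone's guess.", ["oil prices"]),
]

CURRENT_YEAR = 2012
year_topics = defaultdict(Counter)

for text, topics in documents:
    for match in re.findall(r"\b(20\d{2})\b", text):
        year = int(match)
        if year > CURRENT_YEAR:          # only index statements about the future
            year_topics[year].update(topics)

def future_query(year, top_n=3):
    """Return the topics most often associated with a future year."""
    return year_topics[year].most_common(top_n)

print(future_query(2034))   # [('space travel', 2), ('mars', 1)]
```

Extracting a four digit year is easy. Knowing whether the surrounding statement is a forecast worth trusting is the hard, still unsolved part.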

What these two examples make clear, in my opinion, is that:

Search is broken. If an established product delivers inadequate findability, why hasn’t Microsoft just solved the problem? If off the shelf solutions are available from numerous vendors, why hasn’t Microsoft bought the ones which fix up SharePoint and called it a day? The answer is that none of the existing solutions deliver what users want. Sure, search gets a little better, but the SharePoint search problem has been around for a decade, and if search were an easy problem to solve, Microsoft has the money to do the job. Still a problem? Well, that’s a clue that search is a tough nut to crack in my book. Marketers don’t have to make a system meet user needs. Columnists don’t even have to use the systems about which they write. Pity the users.

Writing about whiz bang new systems funded by government agencies is more fun than figuring out how to get these systems to work in the real world. If SharePoint search does not work, what effort and investment will be required to predict the future via a search query? I am not holding my breath, but the pundits can zoom forward.

The search and retrieval sector is in turmoil, and it will stay that way. The big news in search is that free and open source options are available which work as well as Autonomy- and Endeca-like systems. The proprietary and science fiction solutions illustrate, on one hand, the problems basic search has in meeting user needs and, on the other hand, the lengths to which researchers will go to convince their funding sources and regular people that search is going to get better real soon now.

Net net: Search is a problem and it is going to stay that way. Quick fixes, big data, and predictive whatevers are not going to perform serious magic quickly, economically, or reliably without significant investment. InfoWorld seems to see chipper descriptions and assertions as evidence of better search. The Science Daily write up mingles sci-fi excitement with a government funded program to point the way to the future.

Sorry. Search is tough and will remain a chunk of elk hide until the next round of magic is spooned by public relations professionals into the coffee mugs of the mavens and real journalists.

Stephen E Arnold, August 17, 2012

Sponsored by Augmentext

 

Perfecting Web Site Semantics

August 6, 2012

Web site search is most often frustrating, and at its worst, a detriment to customers and commerce.  Fabasoft Mindbreeze, a company heralded for its advances in enterprise search, is bringing its semantic specialization to the world of Web site search with Fabasoft Mindbreeze InSite.  Daniel Fallmann, Fabasoft Mindbreeze CEO, highlights the features of the new product in his blog entry, “4 Points for Perfect Website Semantics.”

Fallmann lays out the problem:

The problem: Standard search machines, in particular the one provided by CMS, are unproductive and don’t consider the website’s sophisticated structure. The best example: enter the search term ‘product’ and the search delivers no results, even though product is its own category on the site. Even if the search produces a result for another term, there’s nothing more than a ‘relatively un-motivating list of links,’ not really much help to a website visitor.

Using semantics in the search means that the Web site is being understood, not just keyword searched.  Automatic indexing preserves the existing site structure, while providing hassle-free search for the customer.  In addition, InSite benefits the Web site developer, in that he/she can see how users are navigating the site and which elements are most often searched.
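Fallmann’s “product” example comes down to indexing the site’s own structure (sections, categories, navigation) alongside page text, so a query can match a category even when the word never appears in body copy. The snippet below is a minimal sketch of that idea with invented page data; it is not Fabasoft Mindbreeze InSite’s implementation.

```python
# Minimal sketch of structure-aware site search: each page is indexed
# with its section/category as a searchable field, so a query such as
# "product" can match the site's own taxonomy, not just body text.
# Invented data; not Fabasoft Mindbreeze InSite's implementation.

pages = [
    {"url": "/products/widget-a", "section": "Products", "body": "Widget A cuts costs."},
    {"url": "/blog/roadmap",      "section": "Blog",     "body": "Our 2013 product roadmap."},
    {"url": "/support/faq",       "section": "Support",  "body": "Frequently asked questions."},
]

def site_search(query, pages):
    query = query.lower()
    hits = []
    for page in pages:
        score = 0
        if query in page["section"].lower():
            score += 2          # a structural match outranks a plain text match
        if query in page["body"].lower():
            score += 1
        if score:
            hits.append((score, page["url"], page["section"]))
    return sorted(hits, reverse=True)

print(site_search("product", pages))
# [(2, '/products/widget-a', 'Products'), (1, '/blog/roadmap', 'Blog')]
```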

The attractive “behind-the-scenes” functioning of Fabasoft Mindbreeze InSite means that customers benefit from the intuitive, semantic search without the distraction of a clunky search layer.  Satisfy your customers and your developers by exploring InSite today.

Emily Rae Aldridge, August 6, 2012

Sponsored by ArnoldIT.com, developer of Augmentext.

Short Honk: High Value Podcast about Solr

August 4, 2012

If you are interested in Lucene/Solr and have a long commute, you will want to check out Episode 187 of the IEEE’s Software Engineering Podcast. You can find the podcast on iTunes. Grant Ingersoll, one of Lucid Imagination’s experts in open source search and a committer on the Apache Lucene/Solr project, reviews the origins of Lucene, explains the features of Solr, and covers a range of important, hard to find information about search. According to IEEE, the podcast offers a:

dive into the architecture of the Solr search engine. The architecture portion of the interview covers the Lucene full-text index, including the text ingestion process, how indexes are built, and how the search engine ranks search results.  Grant also explains some of the key differences between a search engine and a relational database, and why both have a place within modern application architectures.

One of the highlights of the podcast is Mr. Ingersoll’s explanation of vector space indexing. Even a high school brush with trigonometry is sufficient to make this important subject fascinating. Highly recommended.
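For those who want the trigonometry made concrete, here is a bare-bones sketch of vector space scoring: documents and queries become term vectors, and relevance is the cosine of the angle between them. This is a teaching toy, not Lucene’s actual (far more sophisticated) scoring code.

```python
# Bare-bones vector space model: documents and queries become term
# vectors, and relevance is the cosine of the angle between them.
# A teaching toy, not Lucene/Solr's actual scoring implementation.

import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

docs = {
    "d1": "solr is built on the lucene full text index",
    "d2": "a relational database stores rows and tables",
}
query = "lucene index"

q_vec = Counter(query.split())
for doc_id, text in docs.items():
    print(doc_id, round(cosine(q_vec, Counter(text.split())), 3))
# d1 scores higher because it shares more terms with the query
```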

Stephen E Arnold, August 4, 2012

Sponsored by Augmentext

Research and Development Innovation: A New Study from a Search Vendor

August 3, 2012

I received a message from LinkedIn about a news item called “What Are the Keys to Innovation in R&D?” I followed the links and learned that the “study” was sponsored by Coveo, a search vendor based in Canada. You can access similar information about the study by navigating to the blog post “New Study: The Keys to Innovation for R&D Organizations – Their Own, Unused Knowledge.” (You will also want to reference the news release about the study, which is on the Coveo News and Events page.)

Engineers need access to the drawings and the data behind the component or subsystem manufactured by their employer. Text based search systems cannot handle this type of specialized data without some additional work or the use of third party systems. A happy quack to PRLog: http://www.prlog.org/10416296-mechanical-design-drawing-services.jpg

The main point of the study, as I interpret it, is marketing Coveo as a tool to facilitate knowledge management. Even though I write a monthly column for the print and online publication KMWorld, I do not have a definition of knowledge management with which I am comfortable. The years I spent at Booz, Allen & Hamilton taught me that management is darned tough to define. Management as a practice is even more difficult to do well. Managing research and development is one of the more difficult tasks a CEO must handle. Not even Google has an answer. Google is now buying companies to have a future, not inventing its future with existing staff.

The unhappy state of many search and content processing companies is evidence that those with technological expertise may not be able to generate consistent and growing revenues. Innovation in search has become a matter of jazzing up interfaces and turning up the marketing volume. The $10 billion paid for Autonomy, the top dog in the search and content processing space, triggered grousing by Hewlett Packard’s top executives. Disappointing revenues may have contributed to the departure of some high profile Autonomy Corporation executives. Not even the HP way can make traditional search technology pay off as expected, hoped, and needed. Search vendors are having a tough time growing fast enough to stay ahead of spiking technical and support costs.

When I studied for a year at the Jesuit-run Duquesne University, I encountered Dr. Frances J. Chivers. The venerable PhD was an expert in epistemology with a deep appreciation for the lively St. Augustine and the comedian Johann Gottlieb Fichte. I was indexing medieval Latin sermons. I had to take “required” courses in “knowledge.” In the mid 1960s, there were not too many computer science departments in the text indexing game, so I assume that Duquesne’s administrators believed that sticking me in the epistemology track would improve the performance of my mainframe indexing software. Well, let me tell you: Knowledge is a tough nut to crack.

Now you can appreciate my consternation when the two words are juxtaposed and used by search vendors to sell indexing. Dr. Chivers did not have a clue about what I was doing and why. I tried to avoid getting involved in discussions that referenced existentialism, hermeneutics, and related subjects. Hey, I liked the indexing thing and the grant money. To this day, I avoid talking about knowledge.

Selected Findings

Back to the study. Coveo reports:

We recently polled R&D teams about how they use and share innovation across offices and departments, and the challenges they face in doing so.  Because R&D is a primary creator and consumer of knowledge, these organizations should be a model for how to utilize and share it. However, as we’ve seen in the demand for our intelligent indexing technology, and as revealed in the study, we found that R&D teams are more apt to duplicate work, lose knowledge and operate in soloed, “tribal” environments where information isn’t shared and experts can’t be found.  This creates a huge opportunity for those who get it right—to out-innovate and out-perform their competition.

The question I raised to myself was, “How were the responses from Twitter verified as coming from qualified respondents?” And, “How many engineers with professional licenses, versus individuals who, like Yahoo’s former president, arbitrarily awarded themselves a particular certification, were in the study?” Also, “What statistical tests were applied to the results to validate that the data met textbook-recommended margins of error?”
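On the last question, the textbook arithmetic is simple; what matters is whether the sample was random and how large it was. The sketch below computes the standard 95 percent margin of error for a proportion. The numbers are illustrative; the Coveo study’s sample design is not described in the materials I saw.

```python
# Standard 95 percent margin of error for a proportion from a simple
# random sample. Illustrative only; the Coveo study's sample design
# and size are not described here.

import math

def margin_of_error(p, n, z=1.96):
    """p: observed proportion, n: sample size, z: 1.96 for 95% confidence."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (50, 100, 400, 1000):
    moe = margin_of_error(0.5, n)   # p = 0.5 gives the worst case
    print(f"n={n:4d}  margin of error = +/- {moe:.1%}")
# n=  50  margin of error = +/- 13.9%
# n= 100  margin of error = +/- 9.8%
# n= 400  margin of error = +/- 4.9%
# n=1000  margin of error = +/- 3.1%
```

The formula assumes a simple random sample. Self-selected respondents from Twitter or LinkedIn violate that assumption, which is the point of my question.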

I may have the answers to these questions in the source documents. I have written about “number shaping” at some of the firms with which I have worked, and I have addressed the issue more directly in my opt in, personal news service Honk. (Honk, a free weekly newsletter, is a no-holds-barred look at one hot topic in search and content processing. Those with a propensity to high blood pressure should not subscribe.)

Read more

Making Data Easy with Training Wheels? The Nielsen Dust Up

July 31, 2012

In the Honk newsletter, I have been plugging away at some of the flights of fancy that surround big data, next generation analytics, and all things predictive. I am nervous about “training wheels” on complex mathematical processes. Like the fill-in-the-blanks functions in Excel, a person without a foundation in math can fiddle around until the software spits out a number which “looks good.” In one of my jobs, my boss was a master at “the flow.” The idea was that numbers can be shaped to support a particular point. I recall his comment to me in 1974, “Most of our clients are not smart enough to work through the math. We have to generate outputs which flow.” The idea is one that troubled me. I moved on to greener and less slippery pastures and I kept that notion of “flow” squarely in mind. Numbers should not cause the person looking at a chart or a table to say, “Wow, that number looks weird.” Hence, flow allows the reasoning process to be guided.

I just read a story which I hope is not accurate. I want to document my coming across the item, however. I think it will be an interesting touchstone as search and content processing companies race to become players in big data and analytics. The story appeared in the Hollywood Reporter, a publication about which I know little. The headline caught my attention because it resonates with advertising, and advertising automatically evokes the Google logo for me. “Nielsen Sued for Billions over Allegedly Manipulated TV Ratings” carries a hard hitting subtitle too: “In a huge new lawsuit, the business of TV ratings is fingered for rampant corruption by India’s largest TV news network.” I know even less about India than the Hollywood Reporter.

Fancy math underlies the products and services of many analytics firms which offer products and services to licensees that make interacting with data a matter of pointing and clicking. A happy quack for the equation to http://goo.gl/lBlXV

Here’s the passage I noted:

In a 194-page lawsuit filed in New York court late last week, NDTV accuses Nielsen of violating the Foreign Corrupt Practices Act by manipulating viewership data in favor of channels that are willing to provide bribes to its officials. According to NDTV, rampant manipulation of viewership data has been going on for eight years, and when presented with evidence earlier this year, top executives at Nielsen pledged to make changes. But the Indian news giant says these promises have been false ones.

Like most litigation, the story will unfold slowly and perhaps not at all. The i2 Group Palantir litigation is a relatively recent example. Based on my experience with the boss who wanted numbers to flow, I can see how the possibility of tweaking could be useful to some companies. However, with the dismal state of math skills, how can I know if the problem was a result of human intent, human error, or a training wheels type system driven over rocky terrain? I can’t, and I bet that most people thinking about this situation cannot either.

What is interesting to me, however, are these notions:

  1. How many other fancy math systems are open to similar allegations from their licensees?
  2. Will this type of legal action cause some of the vendors pitching fancy math and predictive systems to modify their marketing materials to include more caveats and real world anchors instead of bold assertions?
  3. How will the legal system deal with fancy math litigation? I don’t know many attorneys. The handful with whom I have some experience have been quick to point out that math, engineering, and science were not their strengths. Logic and reasoning were their strong suits.

With many search and content processing companies embracing fancy math, sentiment analysis, smart indexing and other math-based functions, will a search vendor find itself in the hot seat? I hope not but the market wants to buy fancy math. Understanding the fancy math may drive demand for individuals who can figure out if the systems and methods do what the licensee believes they do.

Oh, I like the word “billions.” Big money adds to the drama of analytics risk management in my opinion.

Stephen E Arnold, July 31, 2012

Microsoft and Yammer: Extending SharePoint Functionality

July 31, 2012

Yammer is an enterprise social network tool; organizations implement it to spur collaboration among users. On the Yammer homepage we found a new application which permits Microsoft SharePoint integration. After reading the specs, we found a post on the Microsoft blog, “Yammer: The Next Step for Social Networking In Schools?”

According to the post, Microsoft recently purchased Yammer. The post explains Yammer’s basic functions; the dashboard mirrors Facebook’s design with hints of Twitter. The post then digs into how Yammer would be used in schools, basically the same way it would be used in any company: staff would use it to communicate between departments, share content, and so on. It can be a boon for students too:

“We know that group work is a great way to encourage students to engage with their peers, but this isn’t easy when they all use different social networks, clouds and systems. By joining Yammer, students can create secure groups via which they can communicate their ideas, ask questions and share files, as well as allowing for their competitive side to come out through ‘Leaderboards’, which show data about who has received the most likes, replies and much.”

Students can perform group work, receive help with studying, share content, and even praise each other within Yammer. While it can be a useful tool for students, it can also make cheating and plagiarism easier if not monitored. Yammer should offer an app that is able to detect plagiarism.
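Checking a small group’s submissions for copying is not exotic; a simple shingle comparison goes a long way. Here is a minimal sketch of the idea with made-up submissions. It is an illustration only, not a Yammer feature or product.

```python
# Minimal plagiarism check via word shingles and Jaccard overlap.
# An illustration of the idea only; not an existing Yammer app.

def shingles(text, k=5):
    """Set of k-word shingles from a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

submissions = {
    "alice": "The French Revolution began in 1789 after a fiscal crisis gripped the monarchy.",
    "bob":   "The French Revolution began in 1789 after a fiscal crisis gripped the old monarchy.",
    "carol": "Industrialization changed labor patterns across nineteenth century Europe.",
}

THRESHOLD = 0.4
names = sorted(submissions)
for i, x in enumerate(names):
    for y in names[i + 1:]:
        score = jaccard(shingles(submissions[x]), shingles(submissions[y]))
        if score >= THRESHOLD:
            print(f"Possible copying: {x} vs {y} (overlap {score:.2f})")
```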

Interest in social content is surging in government agencies, commercial organizations, and educational institutions. However, indexing this content and making it findable can be a challenging task. The tools an organization uses require tight integration with a search system. Mindbreeze provides capabilities to make social content easily findable within a SharePoint environment. A Yammer style tool can enhance productivity. Mindbreeze offers a range of social and collaborative features and has the engineering expertise to resolve almost any search and retrieval issue. Check out the Mindbreeze social collaboration Web page for more information.

Whitney Grace, July 31, 2012

Sponsored by Mindbreeze

Latent Semantic Technology Tops Business Strategy List

July 26, 2012

The year 2012 is going by quickly, but there is still time to implement business strategies that could gain your company a bigger presence on the Internet. VentureBeat reported on the “Top 10 Most Important SEO and Social Marketing Tactics of 2012.” Generally these top ten lists yield information we already know: distribute content via social channels, place your social media buttons prominently on the page, enable content sharing, join Pinterest, and so on. Some of the ideas are newer: author guest blog posts, keep your own blog content fresh and interesting. But the number one suggestion that caught our attention was:

“Get an onsite SEO audit: an onsite SEO audit is the foundation of your SEO campaign. Getting one will help you answer questions like: Are your title and meta tags optimized? How’s your keyword density? Have you correlated certain pages with certain keywords? Is that evident in the copy? Have you done your LSI (latent semantic indexing) research and incorporated it into the copy? An onsite SEO audit is relatively cheap, and it’s a one-time payment that you shouldn’t need to address more than once a year.”

An SEO audit done by a professional company will work wonders; heck, if you do your research, you can provide the service for yourself. One important aspect of the audit is latent semantic indexing, a powerful component of text and document analysis.
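For readers wondering what “LSI research” actually involves, latent semantic indexing is a low-rank factorization of the term-document matrix, so documents can match on concepts rather than exact words. The sketch below shows the textbook technique using scikit-learn (assumed to be installed); it is not any SEO vendor’s audit tooling.

```python
# Latent semantic indexing in miniature: build a TF-IDF term-document
# matrix, reduce it with a truncated SVD, and compare documents in the
# resulting "concept" space. Textbook technique; assumes scikit-learn
# is installed and is not any particular vendor's audit tooling.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "affordable car insurance quotes online",
    "cheap auto insurance rates for drivers",
    "chocolate cake recipes for beginners",
    "easy dessert baking ideas",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

svd = TruncatedSVD(n_components=2, random_state=0)
lsi = svd.fit_transform(tfidf)          # documents in 2-dimensional concept space

query = vectorizer.transform(["auto insurance quote"])
query_lsi = svd.transform(query)

for doc, score in zip(docs, cosine_similarity(query_lsi, lsi)[0]):
    print(f"{score:5.2f}  {doc}")
# the insurance documents score far higher than the baking documents,
# even though only one of them shares much of the query's literal wording
```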

Whitney Grace, July 26, 2012

Sponsored by Polyspot

 

The TREC 2011 Results and Predictive Whatevers

July 20, 2012

Law.com reports in “Technology-Assisted Review Boosted in TREC 2011 Results” that technology-assisted review may be capable of taking over predictive coding’s title. TREC Legal Track is an annual, government sponsored project (the 2012 edition was canceled) to examine document review methods. The results from the 2011 TREC favor technology-assisted review, but the approach may have a way to go:

“As such, ‘There is still plenty of room for improvement in the efficiency and effectiveness of technology-assisted review efforts, and, in particular, the accuracy of intra-review recall estimation tools, so as to support a reasonable decision that ‘enough is enough’ and to declare the review complete. Commensurate with improvements in review efficiency and effectiveness is the need for improved external evaluation methodologies,’ the report states.”
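The “recall estimation” the report frets about is at bottom a sampling exercise: judge a random sample of the collection, estimate how many responsive documents exist overall, and compare that with what the review actually found. Here is a stripped-down sketch of that arithmetic with invented numbers; it is generic statistics, not any TREC participant’s protocol.

```python
# Stripped-down recall estimate for a technology-assisted review:
# judge a random sample of the collection, estimate how many responsive
# documents exist overall, and compare with how many the review found.
# Generic sampling arithmetic with invented numbers; not any TREC
# participant's protocol. Assumes a reasonably large random sample.

import math

def estimate_recall(collection_size, sample_size, responsive_in_sample,
                    responsive_found_by_review):
    prevalence = responsive_in_sample / sample_size
    estimated_total = prevalence * collection_size
    recall = responsive_found_by_review / estimated_total
    # rough 95% interval on prevalence, propagated to the recall estimate
    moe = 1.96 * math.sqrt(prevalence * (1 - prevalence) / sample_size)
    low = responsive_found_by_review / ((prevalence + moe) * collection_size)
    high = responsive_found_by_review / ((prevalence - moe) * collection_size)
    return recall, low, high

recall, low, high = estimate_recall(
    collection_size=500_000, sample_size=2_000,
    responsive_in_sample=100, responsive_found_by_review=18_000)
print(f"estimated recall ~ {recall:.0%} (roughly {low:.0%} to {high:.0%})")
# estimated recall ~ 72% (roughly 60% to 89%)
```

Even in this toy example the interval is wide, which is why declaring “enough is enough” remains a judgment call.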

The 2011 TREC asked participants to test three document review requests but, unlike past years, the rules were more specific: participants had to rank documents as well as identify which were the most responsive. The extra requirement meant that researchers were able to test hypothetical situations, but there were some downsides:

“TREC 2011 had its share of controversy. ‘Some participants may have conducted an all-out effort to achieve the best possible results, while others may have conducted experiments to illuminate selected aspects of document review technology. … Efficacy must be interpreted in light of effort,’ the report authors wrote. They noted that six teams devoted 10 or fewer hours for document review during individual rounds, two took 20 hours, one used 48 hours, and one, Recommind, invested 150 hours in one round and 500 in another.”

We noticed this passage in the write up as well:

“`It is inappropriate –- and forbidden by the TREC participation agreement –- to claim that the results presented here show that one participant’s system or approach is generally better than another’s. It is also inappropriate to compare the results of TREC 2011 with the results of past TREC Legal Track exercises, as the test conditions as well as the particular techniques and tools employed by the participating teams are not directly comparable. One TREC 2011 Legal Track participant was barred from future participation in TREC for advertising such invalid comparisons,’ the report states.”

TREC is sensitive to participants who use the data for commercial purposes. We wonder which vendor allegedly stepped over the end line. We also wonder if TREC is breaking out of the slump into which traditional indexing seems to have relaxed. Is “predictive” the future of search? We are not sure about the TREC results. We do have an opinion, however. Predictive works in certain situations. For others, there are other, more reliable tools. We also believe that there is a role for humans, particularly when the risk exists of an algorithm going crazy. A goof in placing an ad on a Web page is one thing. An error predicting more significant events? Well, we are more cautious. Marketers are afoot. We prefer the more pragmatic approach of outfits like Ikanow, and we avoid the high fliers whom we will not name.

Stephen E Arnold, July 20, 2012

Sponsored by Polyspot

 

Useful Graphics Explaining SharePoint 2013

July 19, 2012

SharePoint developers are eagerly waiting for SharePoint 2013.  A blogger at the Microsoft Blogs wrote “SharePoint 2013-Initial Take On Changes To Search” after viewing a lot of slide decks on the new version.  His favorites are at SharePoint 2012: Presentation: IT Pro Training, and all are easy to download.  He takes a look at Module 7: SharePoint Search 2013, which offers an in depth view of enterprise search, including architectural changes to physical and logical topologies and configuration details on crawling, content, and query.

Fast Search functionality is behind much of the SharePoint 2013 enterprise search capability:

“ From the SharePoint 2013 slides, it’s pretty clear that the rumors have played out and core components of SharePoint Search (particularly the Indexing pipeline) effectively got replaced by the Fast Search pipeline… although it will maintain the ‘SharePoint Search’ moniker (Disclaimer: I’m not a marketing guy and have no idea what the licenses will be, so this is just my observation).”
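The “indexing pipeline” at the center of this change is, conceptually, a chain of content processing stages that every document flows through before it reaches the index. The sketch below shows the generic pattern only; it is not the SharePoint 2013 or Fast pipeline API, whose stage names and extension points differ.

```python
# Generic content processing pipeline of the kind enterprise search
# engines use: each document passes through a chain of stages before
# indexing. Conceptual sketch only; not the SharePoint 2013 / Fast API.

def parse(doc):
    doc["text"] = doc.pop("raw").strip()
    return doc

def tokenize(doc):
    doc["tokens"] = doc["text"].lower().split()
    return doc

def enrich(doc):
    # a real pipeline might add language detection, entity extraction, etc.
    doc["word_count"] = len(doc["tokens"])
    return doc

PIPELINE = [parse, tokenize, enrich]

def run_pipeline(doc, stages=PIPELINE):
    for stage in stages:
        doc = stage(doc)
    return doc

index = []
for raw in ["  Quarterly sales review for the EMEA region  ",
            "  HR onboarding checklist, 2012 edition  "]:
    index.append(run_pipeline({"raw": raw}))

print(index[0]["tokens"][:4], index[0]["word_count"])
```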

There is a lot of content to digest from the presentations, but the article pulls out the very detailed and informative diagrams that explain how Fast Search has changed, and will change, the search architecture for SharePoint.  With more than 30,000 consultant days of Fast implementation experience at Search Technologies, we will be gearing up early to support SharePoint 2013 search rollouts.

Iain Fletcher, July 19, 2012

Sponsored by Search Technologies

Predictive Coding Sparks Some Opposition

July 11, 2012

Editor’s Note: The original write up included observations which we learned were “out of bounds.” We want to make a formal apology to those mentioned in the article.  

We wrote about a story in The New York Law Journal. You can find the source document “Judge Rejects Recusal Over Support for eDiscovery Method” at this link. We heard from one of our advertisers that Recommind (www.recommind.com), a company known for its work in predictive coding, objected to the summary and the opinions expressed in the source article. We have, based on the statements communicated to us by our advertiser, removed the 256 word summary, the quote from the source article, and the opinions of the person who wrote the story for Beyond Search. We deeply regret writing an abstract which offended Recommind.

We are sorry that our approach to creating useful pointers to articles of interest offended Recommind, the other individual referenced in the special letter to our advertiser, and any of our readers who found our article problematic. We have a procedure in place: we try to allow writers scope and to point to useful source articles. We want to help inform people about important articles. The opinions the writers express are designed to add value to the abstract. As we note in the About section of this Web log, we are performing a specific type of abstracting and indexing service. We wrote in January 2008 and updated the statement in November 2011:

The data and information provided on this site are for informational purposes only. I, Stephen E. Arnold, make no representations as to accuracy, completeness, timeliness, objectivity, suitability, or validity of any information in a write up or on this site. Our content is pegged to source materials which are reachable via a public network. If you read something in Beyond Search, it is not news or “real” journalism. My views and opinions change. Frequently. Expect to find variances when you compare certain essays with my other written work. I started the Beyond Search Web log for myself, promote my studies, capture information that won’t be in my for-fee work, and have a time stamped record of what I was thinking, why, and when. I am human; therefore, I make errors. If you read something and accept it without verifying my information and interpretation, that’s your decision. Got a problem with my approach? Do not read this Web log. The information is provided on an as-is basis and, at this time, without a fee. Just in case: If I say something dumb in the future, it’s better to be able to point out that the error is mine and a mistake should not be a surprise.

We are sorry, apologize, and will make an effort to do an even better job going forward.

Stephen E Arnold, July 16, 2012, 1:30 pm Eastern

[Source article with offending and objectionable comments, statements, and opinions removed by request.]

We have heard that predictive coding and eDiscovery are the legal community’s future, but change is always met by resistance. The New York Law Journal recounts one of the first challenges for predictive coding: “Judge Rejects Recusal Over Support for eDiscovery Method.” The article describes the case Moore v. Publicis Groupe, 11 Civ. 1279, and how Magistrate Judge Andrew Peck refused to recuse himself from the case. We noted this passage:

” ‘Here, my comments at eDiscovery conferences related to the general use of predictive coding in appropriate cases, and I did not express any opinion regarding the specific issues in the case. Consequently, neither my comments nor the fact that Losey was on some panels with me, nor the fact that MSL’s vendor Recommind sponsored different panels at LegalTech, separately or collectively, are a basis for recusal.’ “

Whitney Grace, July 11, 2012 and revised on July 16, 2012

Sponsored by Content Analyst
