Processing Content Is Easy, Right?

September 30, 2014

A mobile search app would be useful and appreciated by mobile device users. According to the URX Blog post “Deduplication Of Web Content,” it is relatively easy to create a search app, but creating a robust one is the challenge. A robust search app would need to include link prioritization, feature extraction, re-crawl estimation, and content deduplication. The post is the first in a series of articles on developing a mobile search app.

Deduplicating content is important for user experience:

“Duplicate pages in a search index poison search results. The goal of a search engine is to return both relevant and diverse documents, allowing users to decide the optimal resolution for a query. Without deduplication, the top-k results returned for a user’s query would likely contain duplicate content. In the extreme, all k results will be copies of the same page. This creates a bad user experience where, as the crawler scales out, the duplicate likelihood increases. In fact, Google’s Matt Cutts believes that up to 20% of web content is duplicated.”

The rest of the post examines the different types of duplication, how to identify them, and how to remove them from search results.
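To make the idea of spotting near-duplicate pages concrete, here is a minimal sketch in Python. It is not the URX method, which is not described here. It uses a generic textbook approach: break each page into overlapping word windows (shingles) and compare pages by Jaccard similarity. The function names, the window size of three words, and the 0.8 threshold are all assumptions chosen for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(pages, threshold=0.8, k=3):
    """Keep the first page of each near-duplicate group, in input order.

    pages is a list of (url, text) pairs. The threshold is an assumed cutoff.
    Every page is compared with every kept page, which is fine for a demo
    but too slow for a real crawl.
    """
    kept = []  # list of (url, shingle_set) for pages we decided to keep
    for url, text in pages:
        sig = shingles(text, k)
        if all(jaccard(sig, other) < threshold for _, other in kept):
            kept.append((url, sig))
    return [url for url, _ in kept]


if __name__ == "__main__":
    pages = [
        ("a", "The quick brown fox jumps over the lazy dog"),
        ("b", "the quick brown fox jumps over the lazy dog!"),
        ("c", "Completely different text about search engines and indexes"),
    ]
    print(deduplicate(pages))
```

A production crawler would likely replace the pairwise comparison with a hashing scheme so that each page is compared with only a handful of candidates, but the similarity test at the core stays the same.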

While the search app will serve an important function, it does not make sense to me why people cannot just open a Web browser on a mobile device and conduct a regular search. What I would like to see is an app that searches content on apps on a device.

Whitney Grace, September 30, 2014
Sponsored by ArnoldIT.com, developer of Augmentext

Internet Business: Slightly Different Points of View

September 29, 2014

First, navigate to “Another Top Investor Sounds the Alarm: When the Market Turns, a Bunch of Startups Are Going to Vaporize.” No big surprise here. The main idea is, in my opinion:

Over the past few years, it’s been relatively easy for startups to raise money from venture capitalists. In some cases, they’re raising hundreds of millions of dollars to keep their companies afloat. But behind the scenes, they’re plowing through that money either on marketing, overhead, or some other expense, which results in high burn rates. These bloated companies are using their millions to hide serious flaws in their business models.

At some point, those who provide the bucks to the venture firms will want a return. Many of the Fancy Dan outfits are not among the world’s most liquid operations. To raise cash, MBAs and accountants can cook up some quite remarkable solutions. The actions cascade down the line and end up pushing technology companies like those that pitch wild and crazy content technology into an Iron Maiden. This is essentially a casket with spikes protruding into the box and spikes pointing into the box on its lid.

Ta da.

The individual is placed into the Iron Maiden and the door is shut. Ouch.

Now navigate to either the Google book itself or the concepts Web site at http://bit.ly/1mr9OvS. Eric Schmidt argues that businesses should be like Google. You know the moon shots, trying stuff and failing fast (I am not sure how fast Google has failed at social networking, but I don’t want to be argumentative), and value numbers/data over any humanoid subjectivity.

For many search and content processing companies, the senior managers have been failing for years in some cases. I want to make a list of would be start ups and then provide their date of inception. Heck, why embarrass outfits like Attivio, Coveo, Digital Reasoning, Lucid Imagination (now Lucid Works, to which I am tempted to add “Really?” but I will not), and quite a few others.

The point is that we have two somewhat conflicting interpretations of the present business climate. The tweets that inspired the Business Insider write up are taking a hard look at what happens when the money goes away. No money means that affected firms fire people, raise prices, and pivot along with a half dozen or so MBA maneuvers before shutting the doors as Convera, Delphes, and Entopia did. A few lucky outfits will sell out like Endeca, Exalead, and iPhrase. A few will struggle along sort of open and sort of closed like a number of French search and content processing firms.

On one hand, these outfits are toast if more money is not “found.” On the other hand, forget money. In Google’s world view, these companies need to be more like Google or out Google Google.

The reality is that the contraction of search and content processing has already begun. Some outfits are going to have to find a way to deliver a solution that solves an actual problem and generates sustainable revenue. Companies in this spot include IBM with its Watson project, Hewlett Packard with its Autonomy IDOL technology, and Palantir, a billion dollar baby of considerable note.

My view is that the doom and gloom expressed in the Business Insider write up is more likely to occur than a Google style entity arising from the Google Moon shot and allied suggestions. I am not sure the Google recommendations apply to Google. A company that is 15 years old and has one revenue stream may be a success that fulfills Steve Ballmer’s one trick pony observation.

For search and content processing vendors, there is no easy way out unless money remains plentiful and Google’s advice actually works for an information retrieval company.

Stephen E Arnold, September 29, 2014

Why Good Enough Is the New Norm in Search

September 29, 2014

Navigate to “Postgres Full Text Search Is Good Enough.” I first heard this argument at a German information technology conference a few years ago. The idea is surprisingly easy to understand. As long as a user can bang in a couple of key words, scan a result list, and locate information that the user finds helpful—job done. The search results may consist of flawed or manipulated information. The search results may be off point for the user’s query when evaluated by old fashioned methods such as precision and recall. The user may be dumb and relies on what the user finds accurate.

Whatever.

This write up explains the good enough approach in terms of PostgreSQL, a useful open source Codd type data management system. Please, note. I am not uncomfortable with good enough search. I understand that when the herd stampedes, it is not particularly easy to stop the run. Prudence suggests that one take cover.

Here’s the guts of the write up:

What do I mean by ‘good enough’? I mean a search engine with the following features:

  • Stemming
  • Ranking / Boost
  • Support Multiple languages
  • Fuzzy search for misspelling
  • Accent support

Luckily PostgreSQL supports all these features.
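Two of the listed features, accent support and fuzzy search for misspellings, are easy to picture with a small sketch. The Python below is not PostgreSQL code and does not reproduce the write up's snippets. It uses only the Python standard library to show the general idea: fold accents away before comparing terms, then accept close matches for a misspelled query. The function names and the 0.75 cutoff are assumptions for illustration. Stemming and ranking are not covered.

```python
import difflib
import unicodedata


def fold_accents(text):
    """Strip combining marks so that an accented term matches its plain form."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(term):
    """Lowercase and accent-fold a term before indexing or matching."""
    return fold_accents(term).lower()


def fuzzy_lookup(query, vocabulary, cutoff=0.75):
    """Return vocabulary terms that approximately match a possibly misspelled query.

    The cutoff is an assumed similarity threshold between 0 and 1.
    """
    folded = {normalize(term): term for term in vocabulary}
    matches = difflib.get_close_matches(
        normalize(query), list(folded), n=3, cutoff=cutoff
    )
    return [folded[m] for m in matches]


if __name__ == "__main__":
    vocabulary = ["search", "café", "ranking"]
    print(fuzzy_lookup("serach", vocabulary))  # misspelled query
    print(fuzzy_lookup("cafe", vocabulary))    # query typed without the accent
```

A database does this kind of work at index time with dictionaries and extensions rather than at query time in application code, which is the point of the good enough argument: the feature is already in the box.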

The write up contains some useful code snippets to make use of search features. The discussion of full text search is coherent and addresses a vast swath of content. Note that proprietary vendors have tilled acres of marketing earth and fertilizer to convert search into a mind boggling range of functions.

This article includes code snippets to tackle full text within PostgreSQL.

Querying is included as well. Again, code snippets are included. (My teenage advisors said, “Very useful snippets.” Okay. Good.)

The write up concludes:

We have seen how to build a decent multi-language search engine based on a non-trivial document. This article is only an overview but it should give you enough background and examples to get you started with your own….Postgres is not as advanced as ElasticSearch and SOLR but these two are dedicated full-text search tools whereas full-text search is only a feature of PostgreSQL and a pretty good one

Reasonable observation. Worth reading.

If you are a vendor of proprietary search technology, there will be more individuals infused with the spirit of open source, not fewer. How many experts are there for proprietary systems? Fewer than the cadres of open source volk I surmise.

Stephen E Arnold, September 29, 2014

Yahoo Kills Its Directory

September 28, 2014

Yahooooo. Remember that sound. Once it was a happy yodel. Soon it will be a howl of agony. The Directory created by the original Yahoos, Messrs. Filo and Yang, is to be terminated with extreme prejudice. The top Xoogler has decided, I learned in “Yahoo to Shut Down Another Batch of Products as Activist Investor Pushes for AOL Acquisition.” The Directory spawned Web search. Web search spawned online advertising. Online advertising created the environment that killed precision and recall. In 20 years, finding information related to what the user actually wanted arrived and will soon depart.

The article asserts:

Directory, meanwhile, is one of Yahoo’s oldest services. As the name suggests, Directory is basically a directory listing designed to help users find the types of websites they’re looking for. Years ago, services like this were a valuable resource but times have certainly changes and Directory will come to an end on December 31.

My hunch is that Yahoo itself may experience the departure of a senior executive in about the same time frame. And the AOL clarion call? Two aged sparrows do not a peacock make.

Stephen E Arnold, September 28, 2014

NSA Catalog Available

September 27, 2014

Short honk: If you want a copy of the National Security Agency 2014 Technology Catalog: Technology Transfer Program, you can download it for now from this link. I found pages 26 to 40 fascinating. Will IDC issue its own version of this document, using the surfing technique demonstrated by Dave Schubmehl with my content? I will keep my eye open.

Stephen E Arnold, September 27, 2014

Google-News Corp. Marvelous, He Said, She Said

September 26, 2014

I don’t have a dog in this hunt. I think both Google and News Corp. are wonderful. Weaknesses, none. Both companies just have strengths. Google has its Washington, DC lobbying effort and News Corp. has Fox News. Google has its ups and downs with the privacy issue (except at Stanford University). News Corp. has that alleged telephone tapping matter. Google has legions of users in Europe. News Corp. has fingers clutching newspapers, eyeballs watching television, and some Web users.

One big difference. Google is a 15 year old adolescent. News Corp. is an aged information company.

The two, like a May December romance gone sour, will face nothing but irreconcilable differences. Need an example? Check out this blog post from the Google charmingly labeled “Dear Rupert.” See, Google does have a sense of humor.

I don’t have the energy to walk through the arguments and counter arguments. I do want to highlight one point and comment about it. News Corp. leaves a door open with its comment that Google’s “power” makes it hard for people to “access information independently and meaningfully” and that Google is “willing to exploit [its] dominant market position to stifle competition.”

The Google response is wonderful. I believe that Commodore Vanderbilt, Jay Gould, John D. Rockefeller (oops sorry. He’s apoplectic about his descendants’ dumping holdings in fossil fuels), and JP Morgan (you remember: the fellow whose portrait makes it appear he is holding a knife as he starts to push himself from a chair) could not have collectively inked a better response:

With the Internet, people enjoy greater choice than ever before — and because the competition is just one click away online, barriers to switching are very, very low.

Well, I sort of enjoy the one click notion, but the reality for online users is that once a habit is formed, users have a tough time breaking it. Google is a habit with a market share only a drug lord can seek: 95% of users in Denmark, 66% of users in the US, 95% of users in France (gasp, France, home of Exalead, the former Quaero brain trust centroid, and numerous search vendors), etc. For more data see http://bit.ly/1nwgGrw.

Given the monopoly position Google search controls, competitors are few and far between. Bing.com is just not able to gain significant market share from the GOOG. The hot ticket search engines according to search “experts” are Ixquick.com and DuckDuckGo.com. Er, these are metasearch engines and need access to other vendors’ indexes. As search vendors doing primary indexing bite the dust, these metasearch outfits face some tough choices if they want to stay in business. The little known Exalead search offers a tiny fraction of the GOOG’s coverage. And Yandex? Well, Mr. Putin may make it difficult for that outfit to remain in business without picking up and heading to a new campground.

One click? Nope. As users shift to mobile devices, the information access mode shifts to applications or apps. Maybe News Corp. can tackle Google in this new space. I am not sure, however, if those who know how to do intercepts focus much on cutting Google off at the knees with an appealing online ad platform.

You can work through the rest of the arguments. Remember, you are one click away from finding a new search engine.

Stephen E Arnold, September 26, 2014

A Detailed Look at SharePoint 2013

September 25, 2014

If you’re looking to pull back the curtain on SharePoint, check out “Deep-Dive of Search in SharePoint 2013, Office 365 and SharePoint Online ‘From the Trenches’” at the EPCGroup’s blog. That company has been implementing SharePoint & Office 365 hybrids for years, and is highly regarded by many SharePoint analysts. The introduction to the detailed article tells us:

“In this blog post, EPC Group’s Sr. Search Architects will cover the key service applications and services that power SharePoint 2013, Office 365 and SharePoint Online’s search to enable your organization’s data to easily be found on-demand as well to enable the accuracy of your search results.”

The first section lists SharePoint’s search applications and related services, and notes some things to keep in mind. For example, both “federated search” and “scopes” are now known as “result sources.” Also, a default crawl account must be established; the post explains:

“In order for search to properly work, the SharePoint 2013 Search service must configure a default crawl account which is also referred to as the default content access account. This account must be an active, Active Directory Domain Services domain account. This account should not be setup as an individual or a specific person in IT as EPC Group has seen SharePoint search issues caused by this account being deactivated and an entire organization’s SharePoint search cease to work until the account issue was resolved.”

The article delves into detail on the platform’s components: Search, Crawl, Content Processing, Analytics Processing, Search Administration, Search Index, Search Query, and Search Diagnostics. The flow charts and bulleted lists make this an easy resource to reference; I’d recommend bookmarking to anyone who has a SharePoint system to maintain.

Cynthia Murrell, September 25, 2014

Sponsored by ArnoldIT.com, developer of Augmentext

Elasticsearch Optimization Tips

September 25, 2014

One of the Elasticsearch experts at Found shares some of his wisdom in “Optimizing Elasticsearch Searches.” Writer and open source enthusiast Alex Brasetvik emphasizes that Elasticsearch often offers several ways to approach a problem, and that his suggestions can lead to improved performance. The post begins with a look at the way the platform’s filters work:

“Understanding how filters work is essential to making searches faster. A lot of search optimization is really about how to use filters, where to place them and when to (not) cache them….

“This is the key property of filters: the result will be the same for all searches, hence the result of a filter can be cached and reused for subsequent searches. Caching them is quite cheap, as you can store them as a compact bitmap. When you search with filters that have been cached, you are essentially manipulating in-memory bitmaps – which is just about as fast as it can possibly get.

“A rule of thumb is to use filters when you can and queries when you must: when you need the actual scoring from the queries.”
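The bitmap idea in the quoted passage can be pictured with a toy example. The Python below is not how Elasticsearch is implemented. It is a minimal sketch, with an assumed class name and document layout, in which each filter result is stored as an integer whose bit i is set when document i matches. Repeated filters are answered from the cache, and combining filters is a bitwise AND.

```python
class FilterCache:
    """Cache filter results as integer bitmaps: bit i is set when document i matches."""

    def __init__(self, docs):
        self.docs = docs      # list of dicts; list position is the document id
        self.bitmaps = {}     # (field, value) -> cached integer bitmap
        self.computed = 0     # number of filters actually evaluated against docs

    def bitmap(self, field, value):
        """Return the bitmap for field == value, computing it only on first use."""
        key = (field, value)
        if key not in self.bitmaps:
            bits = 0
            for doc_id, doc in enumerate(self.docs):
                if doc.get(field) == value:
                    bits |= 1 << doc_id
            self.bitmaps[key] = bits
            self.computed += 1
        return self.bitmaps[key]

    def search(self, *filters):
        """AND together the bitmaps for all filters and return matching document ids."""
        result = (1 << len(self.docs)) - 1  # start with every document matching
        for field, value in filters:
            result &= self.bitmap(field, value)
        return [i for i in range(len(self.docs)) if (result >> i) & 1]


if __name__ == "__main__":
    docs = [
        {"lang": "en", "type": "post"},
        {"lang": "fr", "type": "post"},
        {"lang": "en", "type": "page"},
    ]
    cache = FilterCache(docs)
    print(cache.search(("lang", "en"), ("type", "post")))
    print(cache.search(("lang", "en")))  # reuses the cached bitmap
    print(cache.computed)                # filters evaluated so far
```

Note what the sketch leaves out: there is no scoring anywhere. That is the trade the rule of thumb describes. A filter only answers yes or no, so its answer can be reused, while a query that needs relevance scores has to do the work each time.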

Brasetvik goes on to elaborate on points such as effective filter usage, combining filters, acceleration filters, aggregation issues, scoring, and important Things to Avoid. The helpful post concludes with a list of further resources.

Cynthia Murrell, September 25, 2014

Sponsored by ArnoldIT.com, developer of Augmentext

Watson and Its API

September 24, 2014

Short honk: Attention, Watson fans. Check out the documentation “Example Post for Answers with Evidence.” Put your code hat on.

Stephen E Arnold, September 25, 2014

Lucid Works: Pando Daily Sets the Record Straight

September 23, 2014

On LinkedIn I learned about this Pando Daily write up: “How Disgruntled Ex-Employees and Bad Reporting Hung LucidWorks Out to Dry.” I noted the Venture Beat analysis of Lucid Works in my post on September 6, 2014. My focus was the wild and crazy information from an “expert” about various factoids. You can read my reaction to the “Trouble at LucidWorks” story here.

The Pando Daily story comes at the issue in a different way. I was delighted to see that Pando found the “expert’s” comments a bit wobbly. There was an interesting run down about Lucid Works that seems to have come from a different point of view. In a way, the two stories—Venture Beat’s and Pando Daily’s—are a bit like the he said, she said information provided to police investigating a married couple’s disturbing the peace incident. I am no cop, so I can’t figure out who is correct and who is incorrect.

Pando takes this tack:

More accurately: It’s [Lucid Works] a startup, and this shit is hard.

I understand that search is hard, but is an eight year old company a start up? That time span baffled me. Coveo asserts that it too is a start up. Other search vendors dating from the implosion of the Big Five in 2006 also use the start up moniker.

The article points out that there are happy employees and positive investors. More money is likely to be needed. Pando Daily quotes a backer as saying:

We won’t start looking for an expansion round until early next year.

ElasticSearch has amassed about $90 million in funding. So LucidWorks may be thinking it needs the same scale of investment to take wing.

With regard to management, Pando Daily reports that the new top dog is the type of CEO who can deliver revenues. The new president, Will Hayes, is described in this context:

On this point, VentureBeat seems oddly hung up on the idea that Hayes is a first-time CEO, perhaps failing to realize that Silicon Valley was (and continues to be) literally built on the success of first-time CEOs. Not to over egg the point, but Mark Zuckerberg and Steve Jobs were first-time CEOs.

Pando Daily added:

As an early member of the Splunk team, Hayes is certainly more qualified for this job than 99 percent of the candidates out there, and more importantly, given that he didn’t found the company, he appears excited about the category.

Pando Daily reminded me that good start ups fire people. I understand the difference between the Silicon Valley approach to management and that practiced at Halliburton and Booz, Allen & Hamilton where I worked for many years. The idea of stability is not always congruent with the needs of a fast moving, pivoting technology company.

Pando Daily also takes issue with Venture Beat’s report that Lucid Works fumbled deals with some real big companies. Pando Daily asserted:

These accounts may or may not have any basis in reality, but they hardly indicate a failing company. The very nature of sales and business development is that deals fall apart all the time. Sometimes those are big deals, sometimes not. The facts are that LucidWorks counts Apple, Sears, Verizon, ADP, Raytheon, Zappos, Qualcomm, Ford, eHarmony, Cisco, and others among current customers.

My reaction to this is okay, but won’t naming these firms give ElasticSearch and other firms a target at which to shoot? Some content processing vendors like Palantir and Recorded Future don’t provide too much information about their customers.

On the all important revenue front, Pando Daily quoted the new top dog at Lucid Works as saying:

“$12 million in services revenue isn’t worth shit,” Hayes says. “But $12 million in product sales on subscription? That’s a $100 million business.”

I agree. Unless the subscriber terminates the subscription. As the competition among content processing vendors heats up, some firms will be quite aggressive in their attempts to take away business. Amazon, for example, seems to be struggling with search, but it could get its act together and offer a good enough solution at very competitive prices. Amazon is not the only sharp toothed outfit in the pond.

Pando Daily tracked down its own search wizard. That poobah said:

Not everyone agrees that enterprise search is quite this sexy. One enterprise analyst, speaking to Pando on the condition of anonymity, describes it as “not that big of an end market.” But at the same time, it’s one that’s still out there for the taking. “There isn’t really a single company or set of companies that have dominant products in the space,” this analyst says. Google and Microsoft have entered the market (the latter via acquisition) with low-cost offerings that would seem to make the competitive environment more challenging for LucidWorks and other upstarts. But according to the company’s supporters, these products are targeting different, less big data-centric applications and are thus not a valid comparison.

If you have ever listened to opposing expert witnesses in a legal dispute, the same factoid gets very different treatment by each expert. That’s what makes subjective expertise difficult to interpret. My view is that enterprise search is struggling for credibility. Some of the value for information retrieval has been exhausted by vendors now out of business. These include Convera, Delphes, Entopia, Siderean, and others. Some credibility has been eroded as a result of the Fast Search & Transfer matter. The CEO was hit with a jail term and a ban on working in search for a couple of years. Then there is the on going dispute between Hewlett Packard and Autonomy. IDOL is an aging technology like Endeca. But the mud slinging about search and content processing does not improve the image of those working in this sector.

Consequently information retrieval companies are working overtime to explain their solutions in terms that do not invoke memories of Convera or Fast Search. Palantir is a data mining company. Recorded Future does predictive analytics. Coveo is eDiscovery and customer support. Search vendors are using a wide range of jargon to describe findability. Lucid Works is brave in using enterprise search with a dash of Big Data in its marketing.

Pando Daily said:

Journalism is tough, particularly in the technology sector. Reporters in this industry asked to cover complex and rapidly evolving companies that often take on hordes of venture cash and set outrageous performance expectations. Unseemly as it may be, stories of failure and calamity make for good scoops, and in these cases ex-employees and competitors often make the best sources. Unfortunately, they also can be the most biased sources and are often are in the best position to credibly lead a journalist astray. LucidWorks certainly has its warts and its scars. But that doesn’t make it trouble, that only makes it a startup.

One question remains: When does a company cease to be a start up and start to be a viable company? Is it one year, four years, or eight years? I just don’t know, but I think that companies that have been in business for almost a decade may not be start ups. Management with a start up mentality may not want to face the cold realities expected of established, stable firms. With Lucid’s technology originating with a community, management may be the issue to watch at Lucid Works. Good management can produce revenue, happy employees, and contented customers. Its absence is often evidenced by a lack of harmony.

Stephen E Arnold, September 23, 2014
