Ghost Sites Generate Traffic But Not Much Else

April 1, 2013

Thanks to low turnover and clickthrough rates, and to an often unreliable world of ad exchanges and ad technology that promises sky-high viewership it simply cannot deliver, digital tricksters are on the rise.

You’ve heard of a ghostwriter. Well, “Meet the Most Suspect Publishers On the Web: The Rise of Ghost Sites, Where Traffic is Huge but People are Few.”

“Increasingly, digital agencies and buy-side technology firms are seeing massive traffic and audience spikes from groups of Web publishers few people have ever heard of. These sites—billed as legitimate media properties—are built to look authentic on the surface, with generic, nonalarm-sounding content. But after digging deeper, it becomes evident that very little of these sites’ audiences are real people. Yet big name advertisers are spending millions trying to reach engaged users on these properties.”

That is right: companies like DigiMogul and Alphabird are getting advertisers to pay for impressions served to viewers who may or may not exist. The problem with this is that the sites deliver pretty lousy search results, since few actual humans visit or work on them. But with bots driving up traffic, big names like BMW, Pillsbury, and JetBlue are clamoring to throw their money at these companies in an effort to reach “consumers.”

Sounds a little backward to us.
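For the curious, the arithmetic that exposes a ghost site is not exotic. Here is a minimal sketch of the kind of sanity check a media buyer could run, assuming hypothetical log fields (pageviews, unique visitors, average session length) and arbitrary thresholds; real ad-verification systems use far richer signals than these.

```python
# Minimal sketch: flag suspiciously bot-like traffic in hypothetical log data.
# The field names and thresholds are illustrative assumptions, not any real
# ad-verification API.

from dataclasses import dataclass

@dataclass
class SiteStats:
    name: str
    monthly_pageviews: int
    unique_visitors: int
    avg_session_seconds: float

def looks_like_ghost_site(s: SiteStats) -> bool:
    # Humans rarely average dozens of pages per visitor on a generic content site.
    pages_per_visitor = s.monthly_pageviews / max(s.unique_visitors, 1)
    # Bots also tend to produce extremely short "sessions".
    return pages_per_visitor > 50 or s.avg_session_seconds < 2.0

sites = [
    SiteStats("legit-news.example", 2_000_000, 400_000, 95.0),
    SiteStats("ghost-property.example", 30_000_000, 150_000, 1.2),
]

for site in sites:
    verdict = "suspect" if looks_like_ghost_site(site) else "plausible"
    print(f"{site.name}: {verdict}")
```

The intuition: bot farms produce pageview counts and session patterns that real humans on a generic content site almost never do.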

Leslie Radcliff, April 1, 2013

Sponsored by ArnoldIT.com, developer of Augmentext

Soutron and EBSCO Join Forces

April 1, 2013

Could the library be a gold mine just waiting to be tapped for its financial resources? The Examiner article “Soutron and EBSCO Enter Partnership Agreement” describes the technology partnership that Soutron Global and EBSCO have forged. With this new partnership, Soutron Global will begin to integrate EBSCO Discovery Service with Soutron’s Library and Knowledge Management system. The collaboration will provide clients with a single integrated search environment for research and information resources. Tony Saadat, President and CEO of Soutron Global, made the following statement:

“This partnership means that libraries, knowledge management centers, and information resource portals can ensure optimal access to knowledge assets, physical resources, and digital resources, thus ensuring optimal exploitation of resources.”

EBSCO Publishing is the company behind EBSCOhost, a fee-based online research service used by a variety of libraries, including educational, medical, and public institutions. EBSCO Discovery Service (EDS) is billed as providing better indexing and full-text searching than any other discovery service. Graham Beastall, Managing Director, UK, had the following to say regarding the collaboration:

“Soutron is very excited to be working with EBSCO on what we regard as a key initiative to develop access to digital and physical resources in an organization. It will allow us to offer customers using Soutron additional opportunities to maximize use of their collection through EDS single search indexing technologies. Our goal is to make life easier for end users and for library managers.”

I never really thought of library catalogs as a path to financial security, but could they be the next technology gold mine? Looking at the big picture, I think the answer is no. Most libraries already work on a limited budget, and it is unlikely that they will suddenly get additional funds. With its proven technology, EBSCO should focus on acquiring library cataloging and services companies for an extra boost. “Might as well be all or nothing.”

April Holmes, April 01, 2013

Sponsored by ArnoldIT.com, developer of Augmentext

Promise Best Practices: Encouraging Theoretical Innovation in Search

March 29, 2013

The photo below shows the goodies I got for giving my talk at Cebit in March 2013. I was hoping for a fat honorarium, expenses, and a dinner. I got a blue bag, a pen, a notepad, a 3.72 gigabyte thumb drive, and numerous long walks. The questionable hotel in which I stayed had no shuttle. Hitchhiking looked quite dangerous. Taxis were as rare as an educated person in Harrod’s Creek, and I was in the same city as Leibniz Universität. Despite my precarious health, I hoofed it to the venue, which was eerily deserted. I think only 40 percent of the available space was used by Cebit this year. The hall in which I found myself reminded me of an abandoned Manhattan subway stop, only with fewer signs.

[Image]

The PPromise goodies. Stuffed in my bag were hard copies of various PPromise documents. The bulkiest of these in terms of paper were also on the 3.72 gigabyte thumb drive. Redundancy is a virtue, I think.

Finally, on March 23, 2013, I got around to snapping the photo of the freebies from the PPromise session and reading a monograph with this moniker:

Promise Participative Research Laboratory for Multimedia and Multilingual Information Systems Evaluation. FP7 ICT 20094.3, Intelligent Information Management. Deliverable 2.3 Best Practices Report.

The acronym should be “PPromise,” not “Promise.” The double “P” makes searching for the group’s information much easier in my opinion.

If one takes the first letters of “Promise Participative Research Laboratory for Multimedia and Multilingual Information Systems Evaluation,” one gets the double “P” of PPromise. I suppose the single “P” was an editorial decision. I personally like “PP,” but I live in a rural backwater where my neighbors shoot squirrels with automatic weapons and some folks manufacture and drink moonshine. Some people in other places shoot knowledge blanks and talk about moonshine. That’s what makes search experts and their analyses so darned interesting.

To illustrate the vagaries of information retrieval, my search for a publicly accessible version of the PPromise document returned a somewhat surprising result.

[Image]

A couple more queries did the trick. You can get a copy of the document without the blue bag, the pen, the notepad, the 3.72 gigabyte thumb drive, and the long walk at http://www.promise-noe.eu/documents/10156/086010bb-0d3f-46ef-946f-f0bbeef305e8.

So what’s in the Best Practices Report? Straightaway you might not know that the focus of the whole PPromise project is search and retrieval. Indexing, anyone?

Let me explain what PPromise is or was, dive into the best practices report, and then wrap up with some observations about governments in general and enterprise search in particular.


Search Evaluation in the Wild

March 26, 2013

If you are struggling with search, you may be calling your search engine optimization advisor. I responded to a query from an SEO expert who needed information about enterprise search. His clients, as I understood the question, were seeking guidance from a person with expertise in spoofing the indexing and relevance algorithms used by public Web search vendors. (The discussion appeared in the Search-Based Applications (SBA) and Enterprise Search group on LinkedIn. Note that you may need to be a member of LinkedIn to view the archived discussion.)

The whole notion of turning search into marketing has interested me for a number of years. Our modern technology environment creates a need for faux information. The idea, as Jacques Ellul pointed out in Propaganda, is that modern man needs something to fill a void.

How can search deliver easy, comfortable, and good enough results? Easy. Don’t let the user formulate a query. A happy quack to Resistance Quotes.

It, therefore, makes perfect sense that a customer who is buying relevance in a page of free Web results would expect an SEO expert to provide similar functionality for enterprise search. Not surprisingly, the notion of controlling search results through an externality like keyword stuffing or content flooding is a logical way to approach enterprise search.

Precision, recall, hard metrics about indexing time, and the other impedimenta of the traditional information retrieval expert are secondary to results. Like metrics about Web traffic, a number is better than no number; even a number whose flaws are not understood beats nothing. In fact, the entire approach to search as marketing is based on results which are good enough. One can see the consequences of this thinking when one runs a query on Bing or on systems which permit users’ comments to influence relevancy. Vivisimo activated this type of value adding years ago, and it remains a good example of trying to make search useful. The laundry list of results which forces the user to work through the documents and determine what is useful is gone. If a document has internal votes of excellence, that document becomes the “right” one. Instead of precision and recall, modern systems deliver “good enough” results. The user sees one top hit and assumes the system has made more informed decisions.
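To make the mechanics concrete, here is a minimal sketch of vote-influenced ranking. It assumes a precomputed base relevance score for each document and a per-document vote tally; the 70/30 blend is an arbitrary assumption of mine, not Vivisimo’s or any other vendor’s actual formula.

```python
# Minimal sketch of vote-influenced ranking: blend a base relevance score
# with user endorsements. The 0.7/0.3 weights are arbitrary assumptions,
# not any vendor's actual formula.

def rerank(results, vote_weight=0.3):
    """results: list of (doc_id, base_score in [0, 1], votes)."""
    if not results:
        return []
    max_votes = max(votes for _, _, votes in results) or 1
    scored = [
        (doc_id, (1 - vote_weight) * base + vote_weight * (votes / max_votes))
        for doc_id, base, votes in results
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

hits = [("doc-a", 0.82, 0), ("doc-b", 0.74, 41), ("doc-c", 0.90, 2)]
for doc_id, score in rerank(hits):
    print(doc_id, round(score, 3))
```

Running this, the heavily endorsed doc-b overtakes doc-c despite a lower base relevance score, which is exactly the “good enough” trade described above.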

There are some downsides to the good enough approach to search, which delivers a concrete result that, like Web traffic statistics, looks so solid, so meaningful. The main downside is that the user consumes information which may not be accurate, germane, or timely. In the quest for better search, good enough trumps the mentally exhausting methods of the traditional precision and recall crowd.

To get a better feel for the implications of this “good enough” line of thinking, you may find useful the September 2012 “deliverable” from Promise (whose acronym should be spelled PPromise, in my opinion), “Tutorial on Evaluation in the Wild.” The abstract for the document does not emphasize the “good enough” angle, stating:

The methodology estimates the user perception based on a wide range of criteria that cover four categories, namely indexing, document matching, the quality of the search results and the user interface of the system. The criteria are established best practices in the information retrieval domain as well as advancements for user search experience. For each criterion a test script has been defined that contains step-by-step instructions, a scoring schema and adaptations for the three PROMISE use case domains.

The idea is that by running what strikes me as a subjective data collection exercise with users of systems, an organization can gain insight into the search system’s “performance” and “all aspects of his or her behavior.” (The “all” is a bit problematic to me.)
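Translated into pseudo-operational terms, the methodology reads something like the sketch below. The criteria names, the zero-to-two scale, and the normalization are my assumptions for illustration; the deliverable defines its own scripts and schema.

```python
# Minimal sketch of criteria-based scoring across the four categories the
# abstract names. Criteria and the 0-2 scale are illustrative assumptions,
# not PROMISE's actual schema.

CRITERIA = {
    "indexing": ["fresh content appears in index", "common file formats covered"],
    "document matching": ["synonyms honored", "phrase queries honored"],
    "result quality": ["top hit answers test query", "duplicates suppressed"],
    "user interface": ["query suggestions shown", "facets usable"],
}

def score_system(observations):
    """observations: criterion -> score from 0 (fail) to 2 (fully met)."""
    report = {}
    for category, criteria in CRITERIA.items():
        scores = [observations.get(criterion, 0) for criterion in criteria]
        report[category] = sum(scores) / (2 * len(criteria))  # normalize to [0, 1]
    return report

# A tester walks the test scripts and records subjective scores:
observed = {
    "fresh content appears in index": 2,
    "synonyms honored": 1,
    "top hit answers test query": 1,
    "query suggestions shown": 2,
}
for category, value in score_system(observed).items():
    print(f"{category}: {value:.2f}")
```

What the sketch makes plain is the point above: every input is a human judgment, so the tidy aggregate numbers inherit that subjectivity.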


Taming Unstructured Information

March 25, 2013

Right now, as you read this, your company’s data are piling up. Scarier yet, most companies don’t have a way to structure all that precious information, so it goes to waste. Thankfully, clarity is on the way, as we found in a recent Paradigma Labs story, “Unstructured Information Extraction: A Sample Case with a Unitex-Manager.”

The article lays out the problem:

There is a lot of information in today’s companies flowing from one computer to another like e-mails, documents, many kinds of files and, of course, the webs the employees surf through. These electronic documents probably contain part of the core knowledge of the company or, at least, very useful information which besides of being easily readable by humans is unstructured and impossible to be processed automatically using computers. The amount of unstructured information in enterprises is around 80% [1] to 85% [2] nowadays, and such a situation is a disadvantage…

This has been an elephant in the room for many companies preparing to squeeze value from their data. Unstructured data can derail good intentions because it is so hard to sort out. Thankfully, there are companies with experience in structuring the unstructured and then drawing useful analytic insights from that information. One of our favorites is the international firm Sinequa, which boasts more than two decades in the business.
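To make the extraction problem concrete, here is a minimal sketch of pattern-based extraction over free text. Unitex itself works with dictionaries and local grammars (finite-state transducers); the regular expressions below are a simplified stand-in chosen for brevity, not the Unitex approach.

```python
# Minimal sketch of turning unstructured text into structured records.
# Regex patterns are a simplified stand-in for dictionary- and
# grammar-based extraction; all examples are invented.

import re

PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "money": re.compile(r"\$\d[\d,]*(?:\.\d{2})?"),
}

def extract(text):
    """Return a dict mapping entity type -> list of matches found in the text."""
    return {label: pattern.findall(text) for label, pattern in PATTERNS.items()}

memo = ("Per our 2013-03-25 call, invoice $4,500.00 goes to "
        "accounts@example.com before 2013-04-01.")
print(extract(memo))
```

Real systems add dictionaries, grammar cascades, and disambiguation, but the goal is the same: structured records a computer can actually process.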

Patrick Roland, March 25, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search.

Government Initiatives and Search: A Make-Work Project or Innovation Driver?

March 25, 2013

I don’t want to pick on government funding of research into search and retrieval. My goodness, pointing out that the payoffs from government funded research into information retrieval have been modest would bring down the wrath of the Greek gods. Canada, the European Community, the US government, Japan, and dozens of other nation states have poured funds into search.

In the US, a look at the projects underway at the Center for Intelligent Information Retrieval reveals a wide range of investigations. Three of the projects have National Science Foundation support: Connecting the ephemeral and archival information networks, Transforming long queries, and Mining a million scanned books. These are interesting topics and the activity is paralleled in other agencies and in other countries.

Is fundamental research into search high level busy work? Researchers are busy, but the results are not having a significant impact on most users, who struggle with modern systems’ usability, relevance, and accuracy.

In 2007 I read “Meeting of the MINDS: An Information Retrieval Research Agenda.” The report was sponsored by various US government agencies. The points made in the report, like the University of Massachusetts’ current research run down, were excellent. The influences the report identified in 2007 remain timely six years later. The questions about commercial search engines, if anything, remain unanswered. The challenges of heterogeneous data also remain. The section on information analysis and organization, which today would be associated with analytics and visualization-centric systems, could be reprinted with virtually no changes. I cite one example, now 72 months young, for your consideration:

We believe the next generation of IR systems will have to provide specific tools for information transformation and user-information manipulation. Tools for information transformation in real time in response to a query will include, for example, (a) clustering of documents or document passages to identify both an information group and also the document or set of passages that is representative of the group; (b) linking retrieved items in timelines that reflect the precedence or pseudo-causal relations among related items; (c) highlighting the implicit social networks among the entities (individuals) in retrieved material; and (d) summarizing and arranging the responses in useful rhetorical presentations, such as giving the gist of the “for” vs. the “against” arguments in a set of responses on the question of whether surgery is recommended for very early-stage breast cancer. Tools for information manipulation will include, for example, interfaces that help a person visualize and explore the information that is thematically related to the query. In general, the system will have to support the user both actively, as when the user designates a specific information transformation (e.g., an arrangement of data along a timeline), and also passively, as when the system recognizes that the user is engaged in a particular task (e.g., writing a report on a competing business). The selection of information to retrieve, the organization of results, and how the results are displayed to the user all are part of the new model of relevance.
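Point (a) in the quoted passage is concrete enough to illustrate with a few lines of code. Below is a minimal sketch, assuming a toy corpus and scikit-learn’s off-the-shelf TF-IDF and k-means; it clusters retrieved passages and picks the member nearest each centroid as the group’s representative, which is the spirit, if hardly the letter, of what the report proposes.

```python
# Minimal sketch of clustering retrieved passages and choosing a
# representative per group. Corpus and cluster count are toy assumptions.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "surgery recommended for early-stage breast cancer",
    "early-stage breast cancer surgery outcomes study",
    "watchful waiting instead of surgery for some patients",
    "patients choosing monitoring over immediate surgery",
]

X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

for cluster in range(2):
    members = np.where(km.labels_ == cluster)[0]
    # Representative = the member closest to the cluster centroid.
    dists = np.linalg.norm(
        X[members].toarray() - km.cluster_centers_[cluster], axis=1)
    rep = members[np.argmin(dists)]
    print(f"cluster {cluster}: {docs[rep]}")
```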

In Europe, there are similar programs. Examples range from Europa’s sprawling ambitions to Future Internet activities. There is Promise. There are data forums, health competence initiatives, and “impact”. See, for example, Impact. I documented Japan’s activities in the 1990s in my monograph Investing in an Information Infrastructure, which is now out of print. A quick look at Japan’s economic situation and its role in search and retrieval reveals that modest progress has been made.

Stepping back, the larger question is, “What has been the direct benefit of these government initiatives in search and retrieval?”

On one hand, a number of projects and companies have been kept afloat due to the funds injected into them. In-Q-Tel has supported dozens of commercial enterprises, and most of them remain somewhat narrowly focused solution providers. Their work has been suggestive, but none has achieved the breathtaking heights of Facebook or Twitter. (Search is a tiny part of these two firms, of course, but the government funding has not produced a comparable winner in my opinion.) The benefit has been employment, publications like the one cited above, and opportunities for researchers to work in a community.

On the other hand, the tangible benefits have been modest. As the economic situation in the US, Europe, and Japan has worsened, search has not kept pace. The success story is Google, which has used search to sell advertising. I suppose that’s an innovation, but it is not one which is a result of government funding. The Autonomy, Endeca, Fast Search-type of payoff has been surprising. Money has been made by individuals, but the technology has created a number of waves. The Hewlett Packard Autonomy dust-up is an example. Endeca is a unit of Oracle and is becoming more of a utility than a technology game changer. Fast Search has largely contracted and has, like Endeca, become a component.

Some observations are warranted.

First, search and retrieval is a subject of intense interest. However, progress in information retrieval is slow, in my opinion. I think there are fundamental issues which researchers have not been able to resolve. If anything, search is more complicated today than it was when the MINDS agenda cited above was published. The question is, “Is search perhaps more difficult than finding the Higgs boson?” If so, more funding for search and retrieval investigations is needed. The problem is that the US, Europe, and Japan are operating at a deficit. Priorities must come into play.

Second, the narrow focus of research, while useful, may generate insights which affect only the margins of the larger information retrieval questions. For example, modern systems can be spoofed. Modern systems generate strong user antipathy more than half the time because they are too hard to use or do not answer the user’s question. The problem is that the systems output information which is quite likely incorrect or not useful. Search may contribute to poor decisions, not improve them. The notion that one is better off using more traditional methods of research is something not discussed by some of the professionals engaged in inventing, studying, or selling search technology.

Third, search has fragmented into a mind boggling number of disciplines and sub-disciplines. Examples range from Coveo (a company which has ingested millions in venture funding and support from the province of Québec) which is sometimes a customer support system and sometimes a search system to Palantir (a recipient of venture funding and US government funding) which outputs charts and graphs, relegating search to a utility function.

Net net: I am not advocating the position that search is unimportant. Information retrieval is very important. In many cases, one cannot perform work today unless one can locate a specific digital item.

The point is that money is being spent, energies invested, and initiatives launched without accountability. When programs go off the rails, these programs need to be redirected or, in some cases, terminated.

What’s going on is that information about search produced in 2007 is as fresh today as it was 72 months ago. That’s not a sign of progress; it is a sign that very little progress is evident. The government initiatives have benefits in terms of creating jobs and funding some start-ups. I am not sure those benefits reach a broader base of people.

With deficit financing the new normal, I think accountability is needed. Do we need some conferences? Do we need giveaways like pens and bags? Do we need academic research projects running without oversight? Do we need to fund initiatives which generate Hollywood-type outputs? Do we need more search systems which cannot detect semantically shaped or incorrect outputs?

Time for change is upon us.

Stephen E Arnold, March 25, 2013

It Is Movie Search Time

March 25, 2013

Google, Bing, and DuckDuckGo are the primary search engines users turn to for locating information. One of the problems, even with advanced search options, is sifting through the results. Any search expert will tell you that if the desired information is not on the first or second page of results, users move on. Does this call for specialization in search engines? It just might for a subject as all-encompassing as movies. MoreFlicks searches the popular video streaming Web sites: Hulu, Netflix, Vudu, Fox, Crackle, and BBC iPlayer for movies and TV shows.

It takes a page out of Google’s book by displaying basic facts about a movie or show: summary, genre, and release date, along with where it can be viewed online. Search results can be sorted by genre, most popular, new arrivals, and what is soon expiring. It will come in handy when you are searching for an obscure title. One downside is that it only browses legal channels; YouTube has been given the boot from these results. MoreFlicks is a niche search engine, possibly the lovechild of Google and IMDb, but how long it stays around depends on content relevance, or on how long it takes Google to snap it up. Zeus eating Athena, anyone?

Whitney Grace, March 25, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

The Search and Retrieval News Feed

March 24, 2013

Dear, dear Twitter was not able to reclaim my newsfeed. The lads and lasses did try. In grade school, that effort would earn a high mark. In the real world, well, no comment. But everyone was really, really nice. We received many, many assurances that action was underway. We really, really think this episode makes clear the risks of relying on a free service. Super experience. I assume a couple thousand people now know more about pop stars than they ever thought possible.

If you want to sign up for the feed of headlines for ArnoldIT’s Beyond Search Blog, the new handle is BeyondSearchNow. The RSS link has been updated. Wow, I love Twitter. Wow, I never knew how fascinating a pop star’s secret life could be.

Anyway, the dull, old Beyond Search news stream is at BeyondSearchNow. Here you go: http://twitter.com/beyondsearchnow

Stephen E Arnold, March 24, 2013

Search Technology at Funnelback Morphs Into Compliance Auditor

March 24, 2013

According to the new post on Funnelback’s Web page, the search technology and services company is morphing into a compliance auditor. The article, “Funnelback WCAG Compliance Auditor Version 2 Is Out Now,” shares that the newest release of the auditor includes the ability to configure checking runs, on-demand checking of HTML, and comprehensive reporting.

The article comments on the additional features:

“Matthew Sheppard, Manager of Research and Development at Funnelback said, ‘Customers have been making amazing progress on their website accessibility with version 1.5, but we wanted version 2 of the tool to make the process easier than ever. WCAG Compliance Auditor version 2 is flexible, highly configurable and accommodates the last minute changes that web content editors often face whilst maintaining their website’s content accessibility at the same time.’”

I think the new capabilities are intriguing, but there is no information on the Web page about which countries’ accessibility requirements the auditor addresses. However, the global company has a demo available at http://www.funnelback.com/our-products/demo. Definitely worth a look.
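For readers curious about what a compliance check actually does under the hood, here is a minimal sketch of one WCAG-style test (text alternatives for images, success criterion 1.1.1) using only Python’s standard library. This is my illustration of the general idea, not Funnelback’s code or API.

```python
# Minimal sketch of a single WCAG-style check: flag <img> elements that
# lack alt text (WCAG 2.0 success criterion 1.1.1). Illustrative only;
# a real auditor covers many more criteria and crawls whole sites.

from html.parser import HTMLParser

class AltTextChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attr_map = dict(attrs)
            if not attr_map.get("alt"):
                self.problems.append(
                    f"img src={attr_map.get('src', '?')} lacks alt text")

page = ('<html><body><img src="logo.png">'
        '<img src="chart.png" alt="Q1 sales chart"></body></html>')
checker = AltTextChecker()
checker.feed(page)
print(checker.problems or "no issues found")
```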

One question: “How can a single consulting firm know so much about so many search and retrieval systems?” Hyperbole, exceptional intelligence, or marketing?

Andrea Hayden, March 24, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Are There Lessons for Enterprise Search in the Pew Publishing Study 2013?

March 19, 2013

If you have not looked at the Pew report, you will want to check out the basic information in “The State of the News Media 2013.” The principal surprise in the report is that the situation seems to be less positive than I assumed.

Here’s the snippet which I tucked in my notebook:

Estimates for newspaper newsroom cutbacks in 2012 put the industry down 30% since its peak in 2000 and below 40,000 full-time professional employees for the first time since 1978. In local TV, our special content report reveals, sports, weather and traffic now account on average for 40% of the content produced on the newscasts studied while story lengths shrink. On CNN, the cable channel that has branded itself around deep reporting, produced story packages were cut nearly in half from 2007 to 2012. Across the three cable channels, coverage of live events during the day, which often require a crew and correspondent, fell 30% from 2007 to 2012 while interview segments, which tend to take fewer resources and can be scheduled in advance, were up 31%. Time magazine, the only major print news weekly left standing, cut roughly 5% of its staff in early 2013 as a part of broader company layoffs.  And in African-American news media, the Chicago Defender has winnowed its editorial staff to just four while The Afro cut back the number of pages in its papers from 28-32 in 2008 to 16-20 in 2012. A growing list of media outlets, such as Forbes magazine, use technology by a company called Narrative Science to produce content by way of algorithm, no human reporting necessary. And some of the newer nonprofit entrants into the industry, such as the Chicago News Cooperative, have, after launching with much fanfare, shut their doors.

Professional publishing companies like Ebsco, Elsevier, ProQuest, Thomson Reuters, and Wolters Kluwer are going to be affected too. If the content streams on which these companies depend “go away,” the firms will have to demonstrate that they too can act in an agile manner. Since the database-centric crowd has crowed about its technical acumen for years, I think the agility trick might be a tough one to pull off.

But what about specialist software vendors of search, content processing, and indexing? Are there lessons in the Pew report which provide some hints about the future of these information-centric businesses?

My view is that there are three signals in the Pew data which seem to be germane to search and related service vendors.

First, the drop-off which the Pew report documents has been quicker than I, and probably some senior publishing executives, expected. These folks were cruising along with belt tightening and minor adjustments. Now revenue and expenses are colliding, and quickly. How will these companies react as the time for figuring out a course correction slips away? My view is that there will be some wild and crazy decisions coming down the runway, and soon. Vendors in the search and content processing sector face a similar situation. A run through my Overflight service reveals quite a few vendors who have gone quiet or simply turned out the lights.

Second, the problem is not a lack of information, and the problem is not unique to publishing. Organizations have quite a lot of data. The difficulty is making use of the data in a way that enhances revenue. There are quite a few companies pitching fancy analytics, but the vendors are facing long buying cycles and price pressure. Sure, there are billions of bits, but there is neither the money, the expertise, nor the time to cope with the winnowing and selecting work. In short, there are some big hopes but little evidence that the marketing hyperbole translates into revenue and profits.

Third, traditional publishing is on the outside looking in when it comes to new business models. Google and a handful of other companies seem to be in a commanding position for online advertising. Enterprise search and content processing vendors have not been able to find a business model beyond license fees and consulting. Just as in the traditional publishing sector, the statement “We can’t do that” seems to be a self-fulfilling prophecy. In search, I think there will be some business model innovation, and it will take place at the expense of the vendors who stick to the “tried and true” approach to revenue generation.

My take is that the decline of traditional publishing may be a glimpse of the future for search and content processing vendors.

Stephen E Arnold, March 19, 2013
