May 9, 2015
My team and I are working on a new project. With our Overflight system, we have an archive of memorable and not so memorable factoids about search and content processing. One of the goslings who was actually working yesterday asked me, “Do you recall this presentation?”
The presentation was “Implementing Semantic Search in the Enterprise,” created in 2009, which works out to six years ago. I did not recall the presentation. But the title evoked an image in my mind like this:
I asked, “How is this germane to our present project?”
The reply the gosling quacked was, “Semantic search means taxonomy.” The gosling enjoined me to examine this impressive looking diagram:
I don’t want a document. I don’t want formatted content. I don’t want unformatted content. I want on point results I can use. To illustrate the gap between dumping a document on my lap and presenting something useful, look at this visualization from Geofeedia:
The idea is that a person can draw a shape on a map, see the real time content flowing via mobile devices, and look at a particular object. There are search tools and other utilities. The user of this Geofeedia technology examines information in a manner that does not produce a document to read. Sure, a user can read a tweet, but the focus is on understanding information, regardless of type, in a particular context in real time. There is a classification system operating in the plumbing of this system, but the key point is the functionality, not the fact that a consulting firm specializing in taxonomies is making a taxonomy the Alpha and the Omega of an information access system.
The deck starts with the premise that semantic search pivots on a taxonomy. The idea is that a “categorization scheme” makes it possible to index a document even though the words in the document may not be the words in the taxonomy.
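To make the deck's premise concrete, here is a minimal sketch of indexing against a controlled term list. The categories, the variant phrases, and the function name `assign_categories` are all hypothetical, invented for illustration; nothing here comes from the slide deck itself.

```python
import re

# Hypothetical controlled term list: category -> variant phrases.
# A real list would be built from the organization's own content.
TERM_LIST = {
    "Laboratory equipment": ["centrifuge", "pipette", "lab gear"],
    "Legal": ["contract", "non-disclosure agreement", "nda"],
}

def assign_categories(text, term_list=TERM_LIST):
    """Return the sorted categories whose variant phrases appear in the text."""
    lowered = text.lower()
    hits = set()
    for category, variants in term_list.items():
        for variant in variants:
            # Whole-word match so "nda" does not fire inside "agenda".
            if re.search(r"\b" + re.escape(variant) + r"\b", lowered):
                hits.add(category)
                break
    return sorted(hits)

print(assign_categories("The signed NDA covers the new centrifuge design."))
```

The document never uses the words “Legal” or “Laboratory equipment,” yet it is indexed under both. The catch is that the variant lists have to reflect what actually flows through the organization.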
For me, the slide deck’s argument was off kilter. The mixing up of a term list and semantic search is the evidence of a Rube Goldberg approach to a quite important task: Accessing needed information in a useful, actionable way. Frankly, I think that dumping buzzwords into slide decks creates more confusion when focus and accuracy are essential.
At lunch the goslings and I flipped through the PowerPoint deck which is available via LinkedIn Slideshare. You may have to register to view the PowerPoint deck. I am never clear about what is viewable, what’s downloadable, and what’s on Slideshare. LinkedIn has its real estate, publishing, and personnel businesses to which to attend, so search and retrieval is obviously not a priority. The entire experience was superficially amusing but on a more profound level quite disturbing. No wonder enterprise search implementations careen in a swamp of cost overruns and angry users.
Now creating taxonomies or what I call controlled term lists can be a darned exciting process. If one goes the human route, there are discussions about what term maps to what word or phrase. Think buzz group and discussion group and online collaboration. What terms go with what other terms? In the good old days, these term lists were crafted by subject matter and indexing specialists. For example, the guts of the ABI/INFORM classification coding terms originated in the 1981-1982 period and were the product of more than 14 individuals, one advisor (the now deceased Betty Eddison), and the begrudging assistance of the Courier Journal’s information technology department, which performed analyses of the index terms and key words in the ABI/INFORM database. The classification system was reasonably good, and it was licensed by the Royal Bank of Canada, IBM, and some other savvy outfits for their own indexing projects.
As you might know, investing two years in human and some machine inputs was an expensive proposition. It was the initial step in the reindexing of the ABI/INFORM database, which at the time was one of the go to sources of high value business and management information culled from more than 800 publications worldwide.
The only problem I have with the slide deck’s making a taxonomy a key concept is that one cannot craft a taxonomy without knowing what one is indexing. For example, you have a flow of content through and into an organization. In a business engaged in the manufacture of laboratory equipment, there will be a wide range of information. There will be unstructured information like Word documents prepared by wild eyed marketing associates. There will be legal documents artfully copied and pasted together from boiler plate. There will be images of the products themselves. There will be databases containing the names of customers, prospects, suppliers, and consultants. There will be information that employees download from the Internet or tote into the organization on a storage device.
The key concept of a taxonomy has to be anchored in reality, not an external term list like those which used to be provided by Oracle for certain vertical markets. In short, the time and cost of processing these items of information so that confidentiality is not breached is likely to make the organization’s accountant sit up and take notice.
Today many vendors assert that their systems can intelligently, automatically, and rapidly develop a taxonomy for an organization. I suggest you read the fine print. Even the whizziest taxonomy generator is going to require some baby sitting. To get a sense of what is required, track down an experienced licensee of the Autonomy IDOL system. There is a training period which requires a cohesive corpus of representative source material. Sorry, no images or videos accepted but the existing image and video metadata can be processed. Once the system is trained, then it is run against a test set of content. The results are examined by a human who knows what he or she is doing, and then the system is tuned. After the smart system runs for a few days, the human inspects and calibrates. The idea is that as content flows through the system and periodic tweaks are made, the system becomes smarter. In reality, indexing drift creeps in. In effect, the smart software never strays too far from the human subject matter experts riding herd on algorithms.
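The inspect-and-calibrate loop can be sketched as a simple agreement check between the system's index terms and a human reviewer's terms on a sample of documents. This is an illustration under stated assumptions, not how Autonomy IDOL or any other product works: the function names, the 0.8 threshold, and the sample data are all made up.

```python
def agreement_rate(machine_tags, human_tags):
    """Fraction of sampled documents where machine and human index terms match exactly."""
    if len(machine_tags) != len(human_tags):
        raise ValueError("samples must be the same length")
    if not machine_tags:
        raise ValueError("samples must not be empty")
    matches = sum(1 for m, h in zip(machine_tags, human_tags) if set(m) == set(h))
    return matches / len(machine_tags)

def needs_retuning(machine_tags, human_tags, threshold=0.8):
    """Flag the system for human calibration when agreement drops below the threshold."""
    return agreement_rate(machine_tags, human_tags) < threshold

# Hypothetical review samples: (machine terms, human terms) for four documents.
week_1 = ([["legal"], ["marketing"], ["product"], ["legal"]],
          [["legal"], ["marketing"], ["product"], ["legal"]])
week_6 = ([["legal"], ["product"], ["product"], ["marketing"]],
          [["legal"], ["marketing"], ["product"], ["legal"]])

print(agreement_rate(*week_1), needs_retuning(*week_1))
print(agreement_rate(*week_6), needs_retuning(*week_6))
```

The arithmetic is trivial. The expense is the human who produces the second list every week.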
The problem exists even when there is a relatively stable core of technical terminology. The content problem of a lab gear manufacturer is many times greater than the problem of a company focusing on a specific branch of engineering, science, technology, or medicine. Indexing Halliburton nuclear energy information is trivial when compared to indexing more generalized business content like that found in ABI/INFORM or the typical services organization today.
I agree that a controlled term list is important. One cannot easily resolve entities unless there is a combination of automated processes and look up lists. An example is figuring out if a reference to I.B.M., Big Blue, or Armonk is a reference to the much loved marketers of Watson. Now handle a transliterated name like Anwar al-Awlaki and its variants. This type of indexing is quite important. Get it wrong and one cannot find information germane to a query. When one is investigating aliases used by bad actors, an error can become a bad day for some folks.
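A minimal sketch of the look up list plus automated process combination, using only the Python standard library. The alias table, the `resolve` function, and the 0.8 similarity cutoff are assumptions for illustration; a production system would need far larger lists and human review of fuzzy matches.

```python
import difflib
import re
import unicodedata

# Hypothetical look up list: normalized alias -> canonical entity.
ALIASES = {
    "ibm": "IBM",
    "big blue": "IBM",
    "armonk": "IBM",
    "anwar al-awlaki": "Anwar al-Awlaki",
}

def normalize(mention):
    """Strip accents, lowercase, drop periods, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", mention)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower().replace(".", "")
    return re.sub(r"\s+", " ", text).strip()

def resolve(mention, aliases=ALIASES, cutoff=0.8):
    """Return the canonical entity for a mention, or None if nothing is close enough."""
    key = normalize(mention)
    if key in aliases:
        return aliases[key]  # exact hit in the look up list
    # Fall back to string similarity for transliteration variants.
    close = difflib.get_close_matches(key, list(aliases), n=1, cutoff=cutoff)
    return aliases[close[0]] if close else None

for mention in ["I.B.M.", "Big Blue", "Anwar al-Aulaqi", "Microsoft"]:
    print(mention, "->", resolve(mention))
```

The similarity fallback cuts both ways. Set the cutoff too low and two different people collapse into one, which is exactly the kind of error that becomes a bad day for some folks.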
The remainder of the slide deck rides the taxonomy pony into the sunset. When one looks at the information created 72 months ago, it is easy for me to understand why enterprise search and content processing have become an “oh, my goodness” problem in many organizations. I think that a mid sized company would grind to a halt if it needed a controlled vocabulary which matched today’s content flows.
My take away from the slide deck is easy to summarize: The lesson is that putting the cart before the horse won’t get enterprise search where it must go to retain credibility and deliver utility.
Stephen E Arnold, May 9, 2015
November 30, 2014
I seem to run into references to the write up by an “expert”. I know the person is an expert because the author says:
As an Enterprise Search expert, I get a lot of questions about Search and Information Architecture (IA).
The source of this remarkable personal characterization is “Prevent Enterprise Search from going to the Weeds.” Spoiler alert: I am on record as documenting that enterprise search is at a dead end, unpainted, unloved, and stuck on the margins of big time enterprise information applications. For details, read the free vendor profiles at www.xenky.com/vendor-profiles or, if you can find them, read one of my books such as The New Landscape of Search.
Okay. Let’s assume the person writing the Weeds’ article is an “expert”. The write up is about misconcepts [sic]; specifically, crazy ideas about what a 50 year plus old technology can do. The solution to misconceptions is “information architecture.” Now I am not sure what “search” means. But I have no solid hooks on which to hang the notion of “information architecture” in this era of cloud based services. Well, the explanation of information architecture is presented via a metaphor:
The key is to understand: IA and search are business processes, rather than one-time IT projects. They’re like gardening: It’s up to you if you want a nice and tidy garden — or an overgrown jungle.
Gentle reader, the fact that enterprise search has been confused with search engine optimization is one thing. The fact that there are a number of companies happily leapfrogging the purveyors of utilities to make SharePoint better or improve automatic indexing is another.
Let’s look at each of the “misconceptions” and ask, “Is search going to the weeds or is search itself weeds?”
The starting line for the write up is that no one needs to worry about information architecture because search “will do everything for us.” How are thoughts about plumbing and a utility function equivalent? The issue is not whether a system runs on premises, from the cloud, or in some hybrid set up. The question is, “What has to be provided to allow a person to do his or her job?” In most cases, delivering something that addresses the employee’s need is overlooked. The reason is that the problem is one that requires the attention of individuals who know budgets, know goals, and know technology options. The confluence of these three characteristics is quite rare in my experience. Many of the “experts” working enterprise search are either frustrated and somewhat insecure academics or individuals who bounced into a niche where the barriers to entry are a millimeter or two high.
Next there is a perception, asserts the “expert”, that search and information architecture are one time jobs. If one wants to win the confidence of a potential customer, explaining that the bills will just keep on coming is a tactic I have not used. I suppose it works, but the incredible turnover in organizations makes it easy for an unscrupulous person to just keep on billing. The high levels of dissatisfaction result from a number of problems. Pumping money into a failure is what prompted one French engineering company to buy a search system and sideline the incumbent. Endless meetings about how to set up enterprise systems are ones to which search “experts” are not invited. The information technology professionals have learned that search is not exactly a career building discipline. Furthermore, search “experts” are left out of meetings because information technology professionals have learned that a search system will consume every available resource and produce a steady flow of calls to the help desk. Figuring out what to build still occupies Google and Amazon. Few organizations are able to do much more than embrace the status quo and wait until a mid tier consultant, a cost consultant, or a competitor provides the stimulus to move. Search “experts” are, in my experience, on the outside of serious engineering work at many information access challenged organizations. That’s a good thing in my view.
The middle example is what the expert calls “one size fits all.” Yep, that was the pitch of some of the early search vendors. These folks packaged keyword search and promised that it would slice, dice, and chop. The reality is that even the next generation information access companies with which I work focus on making customization as painless as possible. In fact, these outfits provide some ready-to-roll components, but where the rubber meets the road is providing information tailored to each team or individual user. At Target last night, my wife and I bought Christmas gifts for needy people. One of the gifts was a 3X sweater. We had a heck of a time figuring out if the store offered such a product. Customization is necessary for more and more every day situations. In organizations, customization is the name of the game. The companies pitching enterprise search today lag behind next generation information access providers in this very important functionality. The reason is that the companies lack the resources and insight needed to deliver. But what about information architecture? How does one cloud based search service differ from another? Can you explain the technical and cost and performance differences between SearchBlox and Datastax?
The penultimate point is just plain humorous: Search is easy. I agree that search is a difficult task. The point is that no one cares how hard it is. What users want are systems that facilitate their decision making or work. In this blog I reproduced a diagram showing one firm’s vision for indexing. Suffice it to say that few organizations know why that complexity is important. The vendor has to deliver a solution that fits the technical profile, the budget, and the needs of an organization. Here is the diagram. Draw your own conclusion:
The final point is poignant. Search, the “expert” says, can be a security leak. No, people are the security leak. There are systems that process open source intelligence and take predictive, automatic action to secure networks. If an individual wants to leak information, even today’s most robust predictive systems struggle to prevent that action. The most advanced systems from Centripetal Networks and Zerofox offer robust systems, but a determined individual can allow information to escape. What is wrong with search has to do with the way in which provided security components are implemented. Again we are back to people. Information architecture can play a role, but it is unlikely that an organization will treat search differently from legal information or employee pay data. There are classes of information to which individuals have access. The notion that a search system provides access to “all information” is laughable.
I want to step back from this “expert’s” analysis. Search has a long history. If we go back and look at what Fulcrum Technologies or Verity set out to do, the journeys of the two companies are quite instructive. Both moved quickly to wrap keyword search with a wide range of other functions. The reason for this was that customers needed more than search. Fulcrum is now part of OpenText, and you can buy nubbins of Fulcrum’s 30 year old technology today, but it is wrapped in huge wads of wool that comprise OpenText’s products and services. Verity offered some nifty security features and what happened? The company chewed through CEOs, became hugely bloated, struggled for revenues, and ended up as part of Autonomy. And what about Autonomy? HP is trying to answer that question.
Net net: This weeds write up seems to have a life of its own. For me, search is just weeds, clogging the garden of 21st century information access. The challenges are beyond search. Experts who conflate odd bits of jargon are the folks who contribute to confusion about why Lucene is just good enough so those in an organization concerned with results can focus on next generation information access providers.
Stephen E Arnold, November 30, 2014
March 25, 2013
I don’t want to pick on government funding of research into search and retrieval. My goodness, pointing out the payoffs from government funded research into information retrieval would bring down the wrath of the Greek gods. Canada, the European Community, the US government, Japan, and dozens of other nation states have poured funds into search.
In the US, a look at the projects underway at the Center for Intelligent Information Retrieval reveals a wide range of investigations. Three of the projects have National Science Foundation support: Connecting the ephemeral and archival information networks, Transforming long queries, and Mining a million scanned books. These are interesting topics and the activity is paralleled in other agencies and in other countries.
Is fundamental research into search high level busy work? Researchers are busy but the results are not having a significant impact on most users who struggle with modern systems’ usability, relevance, and accuracy.
In 2007 I read “Meeting of the MINDS: An Information Retrieval Research Agenda.” The report was sponsored by various US government agencies. The points made in the report, like the University of Massachusetts’ current research run down, were excellent. The 2007 recent influences are timely six years later. The questions about commercial search engines, if anything, are unanswered. The challenges of heterogeneous data also remain. Information analysis and organization which is today associated with analytics and visualization-centric systems could be reprinted with virtually no changes. I cite one example, now 72 months young, for your consideration:
We believe the next generation of IR systems will have to provide specific tools for information transformation and user-information manipulation. Tools for information transformation in real time in response to a query will include, for example, (a) clustering of documents or document passages to identify both an information group and also the document or set of passages that is representative of the group; (b) linking retrieved items in timelines that reflect the precedence or pseudo-causal relations among related items; (c) highlighting the implicit social networks among the entities (individuals) in retrieved material;
and (d) summarizing and arranging the responses in useful rhetorical presentations, such as giving the gist of the “for” vs. the “against” arguments in a set of responses on the question of whether surgery is recommended for very early-stage breast cancer. Tools for information manipulation will include, for example, interfaces that help a person visualize and explore the information that is thematically related to the query. In general, the system will have to support the user both actively, as when the user designates a specific information transformation (e.g., an arrangement of data along a timeline), and also passively, as when the system recognizes that the user is engaged in a particular task (e.g., writing a report on a competing business). The selection of information to retrieve, the organization of results, and how the results are displayed to the user all are part of the new model of relevance.
In Europe, there are similar programs. Examples range from Europa’s sprawling ambitions to Future Internet activities. There is Promise. There are data forums, health competence initiatives, and “impact”. See, for example, Impact. I documented Japan’s activities in the 1990s in my monograph Investing in an Information Infrastructure, which is now out of print. A quick look at Japan’s economic situation and its role in search and retrieval reveals that modest progress has been made.
Stepping back, the larger question is, “What has been the direct benefit of these government initiatives in search and retrieval?”
On one hand, a number of projects and companies have been kept afloat due to the funds injected into them. In-Q-Tel has supported dozens of commercial enterprises, and most of them remain somewhat narrowly focused solution providers. Their work has been suggestive, but none has achieved the breathtaking heights of Facebook or Twitter. (Search is a tiny part of these two firms, of course, but the government funding has not had a comparable winner in my opinion.) The benefit has been employment, publications like the one cited above, and opportunities for researchers to work in a community.
On the other hand, the fungible benefits have been modest. As the economic situation in the US, Europe, and Japan has worsened, search has not kept pace. The success story is Google, which has used search to sell advertising. I suppose that’s an innovation, but it is not one which is a result of government funding. The Autonomy, Endeca, Fast Search-type of payoff has been surprising. Money has been made by individuals, but the technology has created a number of waves. The Hewlett Packard Autonomy dust up is an example. Endeca is a unit of Oracle and is becoming more of a utility than a technology game changer. Fast Search has largely contracted and has, like Endeca, become a component.
Some observations are warranted.
First, search and retrieval is a subject of intense interest. However, progress in information retrieval is advancing slowly in my opinion. I think there are fundamental issues which researchers have not been able to resolve. If anything, search is more complicated today than it was when the Minds Agenda cited above was published. The question is, “Maybe search is more difficult than finding the Higgs Boson?” If so, more funding for search and retrieval investigations is needed. The problem is that the US, Europe, and Japan are operating at a deficit. Priorities must come into play.
Second, the narrow focus of research, while useful, may generate insights which affect the margins of larger information retrieval questions. For example, modern systems can be spoofed. Modern systems generate strong user antipathy more than half the time because they are too hard to use or don’t answer the user’s question. The problem is that the systems output information which is quite likely incorrect or not useful. Search may contribute to poor decisions, not improve decisions. The notion that one is better off using more traditional methods of research is something not discussed by some of the professionals engaged in inventing, studying, or selling search technology.
Third, search has fragmented into a mind boggling number of disciplines and sub-disciplines. Examples range from Coveo (a company which has ingested millions in venture funding and support from the province of Québec) which is sometimes a customer support system and sometimes a search system to Palantir (a recipient of venture funding and US government funding) which outputs charts and graphs, relegating search to a utility function.
Net net: I am not advocating the position that search is unimportant. Information retrieval is very important. One cannot perform some work today unless one can locate a specific digital item in many cases.
The point is that money is being spent, energies invested, and initiatives launched without accountability. When programs go off the rails, these programs need to be redirected or, in some cases, terminated.
What’s going on is that information about search produced in 2007 is as fresh today as it was 72 months ago. That’s not a sign of progress. That’s a sign that very little progress is evident. The government initiatives have benefits in terms of making jobs and funding some start ups. I am not sure that the benefits affect a broader base of people.
With deficit financing the new normal, I think accountability is needed. Do we need some conferences? Do we need giveaways like pens and bags? Do we need academic research projects running without oversight? Do we need to fund initiatives which generate Hollywood type outputs? Do we need more search systems which cannot detect semantically shaped or incorrect outputs?
Time for change is upon us.
Stephen E Arnold, March 25, 2013
January 27, 2013
I don’t know zip about public relations. First, I don’t do much “public” work. The connotation of “relations” remains mildly distasteful to me. I suppose that is a consequence of a high school English teacher who did not permit certain words to be used in class. If a student were to encounter a word on the banned list, he or she had to skip it when reading aloud. The notion of “public relations” gives me the willies.
You can check out the best in PR and real journalism in the scary “Microsoft: Google Blames Us for All Its Problems.” I thought I was jaded with corporate slickness. One is never too old to learn how the big guys handle communications.
I had a client ask me about a company which could post messages to LinkedIn and other social media. I mentioned that the work was getting difficult. For example, Instagram wants a person who posts a picture to register with a government issued ID card. Now that is interesting because I use a passport for identification, and I am not too keen on having that information in the hands of a 20 something engineer working from a drafty apartment in a country to which the data processing has been outsourced. Also, LinkedIn has a number of groups which are managed by those who start the groups. LinkedIn wants anyone who found the group interesting to participate or the “member” is kicked out of the group. Some groups are lax about advertising. Other groups are not. LinkedIn has turned into a job hunting and marketing service, so its utility to me has declined. I find the “expert” commentary sent to me by LinkedIn employees annoying too. Facebook is a wild and crazy place. I am not sure how the new Facebook search will work when a person posting can be linked to “interesting” topics and “friends.” The Google Plus thing is mandatory with each post linked to a “real” person. Maybe Google will just issue official ID cards and skip the government angle. Google’s mission to North Korea was fascinating, and I hope no one draws a connection between the Google visit and the increasingly hostile rhetoric from that country toward the United States.
So what about public relations?
I did a quick check online and found that a consulting and publishing company called O’Dwyer Company, Inc. publishes a list of the PR firms ranked by revenue. After all, what could be more important than revenue in today’s economic climate? (Do I hear a tiny voice saying, “Quality and integrity”? No, not here in Harrod’s Creek.)
The list exists in a couple of different forms. The dates covered by the list are not clear to me. But the PR league table I reviewed contained 118 firms. Of these 118, the total revenue reported by O’Dwyer was $1,776,859,523, slightly more than the revenues for the enterprise search market which I wrote about here. The top 10 firms generated $1,120,706,215 or 63 percent of the total revenue in the O’Dwyer report. What’s interesting is that this concentration of money is similar to the concentration of revenues in enterprise search prior to the consolidation craze which peaked in 2012. Once a search vendor is absorbed into a giant new owner like Microsoft or Oracle, the revenues from search related deals disappear into the accounting miasma. Become too open about enterprise search revenues and an Autonomy type of situation may unfold.
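For anyone who wants to check the concentration arithmetic, here is a quick sketch using only the three figures quoted above. The average for the remaining firms is my own back-of-the-envelope derivation, not a number from the O’Dwyer list.

```python
# Figures as quoted from the O'Dwyer league table.
total_revenue = 1_776_859_523
top_ten_revenue = 1_120_706_215
firm_count = 118

# Share of total revenue captured by the ten largest firms.
share = top_ten_revenue / total_revenue
print(f"Top 10 share of revenue: {share:.1%}")

# What is left over for everyone else, and the average per remaining firm.
remaining_revenue = total_revenue - top_ten_revenue
average_remaining = remaining_revenue / (firm_count - 10)
print(f"Average revenue of the other {firm_count - 10} firms: ${average_remaining:,.0f}")
```

An average hides the spread, but it gives a rough sense of how thin the tail is once the ten biggest firms take their cut.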
What I found interesting was that of the top ten firms, two were flat with no significant increase in revenue and one new entrant was able to pump out $21 million quickly. Whoa, Nelly.
Another point I found interesting is that I recognized the “name” of these firms of the 118:
- Edelman, not sure why
- Waggener Ekstrom, the Microsoft PR outfit
- Ruder Finn, not sure why.
- PR seems to be a low profile business. I am confident that the big dogs know how to market, but I am quite certain that most of the firms do not build a “brand” nor do they play a role in my world as “thought leaders.” I presume the reason is that the PR firms are so focused on their clients that any visibility for the PR firm would be a big no no.
- The revenues for PR are almost identical to those reported for enterprise search by Forrester. Does this mean that PR is a better business from a revenue point of view than search or content processing? Presumably the search vendors hire PR firms so the cash available for search marketing helps pump up the PR revenues. Interesting, particularly at a time when it is difficult to track sales to PR. (After all, if PR worked, wouldn’t the firms showing flat and declining revenue use their own tools to get those sales going?)
- PR, like enterprise search, generates one of those nifty long tail graphs which are so popular in today’s learned discussions about “concentration,” “oligopolies,” and “market forces.”
I told the client to take the O’Dwyer list and pick a firm close to home. The challenge is that the biggest firms are in the big cities; for example, Manhattan boasts 31 firms on the list, more if I include New Jersey and Connecticut. A quick check of Louisville, Kentucky’s PR density revealed 18 firms. More were listed if I tossed in marketing communications, social media, and similarly nebulous terms. PR advisors are as plentiful as consultants it seems. The swelling ranks of the unemployed create a fertile ground for advisors, wizards, mavens, and poobahs in search, business consulting, and public relations.
My big finding is that the vast majority of public relations firms are likely to be struggling to generate revenue. What’s new in today’s economy? Is PR a discipline? Don’t know. Don’t care. I do know I tell those who write me PR spam that I am not a journalist. I get pretty frisky when people ignore my about page and assume I am, at age 69, a real journalist. Heaven forbid that I should be confused with a real journalist, a PR person, or an effective marketer. I am none of those things. Never will be.
Stephen E Arnold, January 26, 2013
June 19, 2012
Let’s start off with a recommendation. Snag a copy of the Wall Street Journal and read the hard copy front page story in the Marketplace section, “Computers Carry Water of Pretrial Legal Work.” In theory, you can read the story online if you don’t have Sections A-1, A-10 of the June 18, 2012, newspaper. A variant of the story appears as “Why Hire a Lawyer? Computers Are Cheaper.”
Now let me offer a possibly shocking observation: The costs of litigation are not going down for certain legal matters. Neither bargain basement human attorneys nor Fancy Dan content processing systems make the legal bills smaller. Your mileage may vary, but for those snared in some legal traffic jams, costs are tough to control. In fact, search and content processing can impact costs, just not in the way some of the licensees of next generation systems expect. That is one of the mysteries of online that few can penetrate.
The main idea of the Wall Street Journal story is that “predictive coding” can do work that human lawyers do for a higher cost but sometimes with much less precision. That’s the hint about costs in my opinion. But the article is traditional journalistic gold. Coming from the Murdoch organization, what did I expect? i2 Group has been chugging along with relationship maps for case analyses of important matters since 1990. Big alert: i2 Ltd. was a client of mine. Let’s see. That was more than a couple of weeks ago that basic discovery functions were available.
The write up quotes published analyses which indicate that when humans review documents, those humans get tired and do a lousy job. The article cites “experts” from Thomson Reuters, a firm steeped in legal and digital expertise, who point out that predictive coding is going to be an even bigger business. Here’s the passage I underlined: “Greg McPolin, an executive at the legal outsourcing firm Pangea3 which is owned by Thomson Reuters Corp., says about one third of the company’s clients are considering using predictive coding in their matters.” This factoid is likely to spawn a swarm of azure chip consultants who will explain how big the market for predictive coding will be. Good news for the firms engaged in this content processing activity.
What goes faster? The costs of a legal matter or the costs of a legal matter that requires automation and trained attorneys? Why do companies embrace automation plus human attorneys? Risk certainly is a turbo charger?
The article also explains how predictive coding works, offers some cost estimates for various actions related to a document, and adds some cautionary points about predictive coding proving itself in court. In short, we have a touchstone document about this niche in search and content processing.
My thoughts about predictive coding are related to the broader trends in the use of systems and methods to figure out what is in a corpus and what a document is about.
First, the driver for most content processing is related to two quite human needs. First, the cost of coping with large volumes of information is high and going up fast. Second, there is the need to reduce risk. Most professionals find quips about orange jump suits, sharing a cell with Mr. Madoff, and the iconic “perp walk” downright depressing. When a legal matter surfaces, the need to know what’s in a collection of content like corporate email is high. The need for speed is driven by executive urgency. The cost factor clicks in when the chief financial officer has to figure out the costs of determining what’s in those documents. Predictive coding to the rescue. One firm used the phrase “rocket docket” to communicate speed. Other firms promise optimized statistical routines. The big idea is that automation is faster and cheaper than having lots of attorneys sifting through documents in printed or digital form. The Wall Street Journal is right. Automated content processing is going to be a big business. I just hit the two key drivers. Why dance around what is fueling this sector?
June 14, 2012
Several PR mavens have sent me today multiple unsolicited emails about their clients’ predictive statistical methods. I don’t like spam email. I don’t like PR advisories that promise wild and crazy benefits for predictive analytics applied to big data, indexing content, or figuring out what stocks to buy.
March Communications was pitching Lavastorm and Kabel Deutschland. The subject: analytics that are real time, predictive, and discovery driven.
Predictive analytics can be helpful in many business and technical processes. Examples range from figuring out where to sell an off lease mint green Ford Mustang convertible to planning when to ramp up outputs from a power generation station. Where predictive analytics are not yet ready for prime time is identifying which horse will win the Kentucky Derby and determining where the next Hollywood starlet will crash a sports car. Predictive methods can suggest how many cancer cells will die under certain conditions and assumptions, but the methods cannot identify which cancer cells will die.
Can predictive analytics make you a big winner at the race track? If firms with rock solid predictive analytics could predict a horse race, would these firms be selling software or would these firms be betting on horse races?
That’s an important point. Marketers promise magic. Predictive methods deliver results that provide some insight but rarely rock solid outputs. Prediction is fuzzy. Good enough is often the best a method can provide.
In between is where hopes and dreams rise and fall with less clear cut results. I am, of course, referring to the use by marketers of lingo like this:
The idea behind these buzzwords is that numerical recipes can process information or data and assign probabilities to outputs. When one ranks the outputs from highest probability to lowest probability, an analyst or another script can pluck the top five outputs. These outputs are the most likely to occur. The approach works for certain Google-type caching methods, providing feedback to consumer health searchers, and figuring out how much bandwidth is needed for a new office building when it is fully occupied. Picking numbers at the casino? Not so much.
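The rank-and-pluck step described above can be sketched in a few lines. This is a minimal illustration, not any vendor's method: the function name `top_k` and the probabilities in the example are hypothetical, and the hard part (a numerical recipe that assigns sensible probabilities) is assumed to have happened already.

```python
def top_k(outcomes, k=5):
    """Rank outcomes from highest to lowest probability and keep the first k.

    `outcomes` maps an outcome label to the probability some upstream
    numerical recipe assigned to it. That recipe is assumed, not shown.
    """
    ranked = sorted(outcomes.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical probabilities for illustration only.
    bandwidth_tiers = {
        "100 Mbps": 0.05,
        "250 Mbps": 0.15,
        "500 Mbps": 0.40,
        "1 Gbps": 0.30,
        "2 Gbps": 0.07,
        "10 Gbps": 0.03,
    }
    for label, probability in top_k(bandwidth_tiers):
        print(f"{label}: {probability:.2f}")
```

The sort is the easy part. The output is only as good as the probabilities fed in, which is why the approach suits bandwidth planning better than the casino.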
June 8, 2012
You will want to read the Wall Street Journal hard copy edition’s story “Google Monopoly and Internet Freedom.” (You may be able to access the online version at this link, but no promises where News Corp.’s business model is in action.) The print version is important. The article (more accurately, the “essay,” “op-ed,” or “gentrified blog post”) has pride of place. Perched at the top of the “Opinion” page A-15, the four-column item comes with a beefy headline and a color picture. The author is Jeffrey Katz, who is “the CEO of Nextag, and a former CEO of Orbitz Inc., Swissair, and LeapFrog Enterprises.”
Is distortion inevitable or is it a part of decision making?
I was not familiar with Mr. Katz. A biography appears on the Nextag Web site. He is a Stanford graduate, and he flew from the airline industry to learning products to Nextag. That company loves shopping. The company says:
Expert deal-hunters since 1999, we make it surprisingly easy for you to find everything from tech to travel to tiki torches all at the price, place and moment that’s right for you. Browse, review, share, get the 411, get the deal: with Nextag, you’ll love the way you shop. 30+ million people consult us each month to make their online purchases, and we use our best-in-class search technology and proven expertise to ensure that each and every one of those shoppers is a happy one. This focus and commitment benefits our partners as well, delivering impressive sales volume and ROI for merchants and a streamlined user experience for search providers. (Source: http://www.nextag.com/about/main)
The background helps because I understand that online ticket agencies and online shopping comparison sites need utility services to allow these enterprises to do business without having to build a global infrastructure, attract and cultivate large numbers of users, and have a business model based on advertising.
Point of view is important.
In the News Corp. essay, Mr. Katz points out that Google is powerful. Well, that’s not much of a surprise. The company is more than a decade old, has an enviable business model, and online technology which works. I enjoy comparing Google’s ability to deliver online services when I sit in an airport waiting for United Airlines to cope with the 300 people stranded in London Heathrow on Friday June 1, 2012. Have you had an experience similar to mine with an airline? I also recall fondly turning up at a hotel with my Orbitz reservation in hand to hear, “Sir, we have no record of your reservation.” I also enjoy the many messages which induce me to compare prices at Nextag.com. In 2009 Nextag filled my Yahoo page with Nextag ads. (See this Yahoo Answers response.) Nextag has implemented an “advertising cookie opt out.” You can learn more here. I, therefore, find the suggestions Mr. Katz offers to Google fascinating.
First, Mr. Katz asserts that “Google needs to be transparent about how its search engine operates.” He believes that Google “hides behind forked-tongue gobbledygook that is meant to obfuscate.” I don’t agree. I have written three monographs based on open source information provided by Google to anyone who takes the time to read it. The disconnect is that Google is a deeply technical company, and it does a very good job of explaining its systems and methods. However, if a person is an expert because he or she can use a browser to surf the Web, that type of knowledge is not going to be particularly helpful. For example, one of the systems and methods in use at Google involves populating missing cells in a database. The approach is clearly explained again and again and again. Most recently Dr. Alon Halevy gave yet another repetitive presentation about this method at the EDBT/ICDT 2012 Joint Conference on March 26 to 30, 2012 in Berlin, Germany. Of the major information retrieval companies with which I am familiar, Google does one of the best jobs making crystal clear exactly what it does, when, and under what circumstances. The problem is that if one lacks the motivation, resources, or sticktoitivity, the Google information is tough to parse. Want to know how Google search works? Read U.S. Patent 628599. There it is. English. Clear. Equations. Background. Functions. What exactly does Mr. Katz want Google to do that it is not doing? Believe me, my relative Vladimir Ivanovich Arnold would have had zero trouble figuring out what Google does, and he would have been able to replicate it. The problem is that some folks are less sharp than Googlers and my uncle. If one does not take time to learn from what is publicly available, why should Google invest time and money in what amounts to remedial education?
Second, Mr. Katz opines, “Google should provide consumers with access to the unbiased search results it was once known for—regardless of which company or organization owns the service. It should also allow users to reduce the number of ads shown or incorporate a user’s preferred services in search results.” First, no set of search results from any vendor or any system at any time has delivered unbiased search results. The decision to use a specific relevancy method, what stop words to use, how to implement a default Boolean AND or OR, or any of hundreds of other key decisions introduces variants in search results. Research itself is not unbiased. As soon as sampling is used within any online system, objectivity is sacrificed. Hey, ask two advisors what to do about a personnel issue and you get non-objective results. Google is upfront and clear about the systems and methods used to determine what gets shown under what circumstances. Pick one of Google’s public disclosures—say, for example, US8065311. Google has dozens of open source publications that explain the exact system and method used to perform a specific task. What Mr. Katz wants is for Google to explain something that most Googlers could not figure out in a month of Sundays. Google uses “smart” software. When inputs change, then the selection of a particular method occurs. Not every method gets selected for every input. As a result, the outputs adapt to inputs. With millions of these decisions made in an interdependent system, exactly what does Mr. Katz want Google to explain? My suggestion: Read what Google has written. The cloud of unknowing is not caused by Google. But asking for an explanation of a particular action within a massively parallel intelligent system is what I would describe as “uninformed.”
Third, Mr. Katz wants one of those categorical affirmatives which I find logically uncomfortable. He says:
Google should grant all companies equal access to advertising opportunities regardless of whether they are considered a competitor. Given its market share and public commitment to providing users with the most relevant, helpful information, Google has an obligation to provide a level playing field.
My hunch is that in Mr. Katz’s own business operations, there are business processes which are of great interest to consumers; for example, when I run a query on Nextag.com, “Why do I see eBay results at the top of a results list with a big logo?” I don’t want eBay results. How does Mr. Katz implement this specific function? Does it apply to “all” result sets? You don’t need me to write down trade secret type of questions because no executive is going to reveal these unless there are quite specific circumstances and safeguards in place. Why should a company which has an obligation to its shareholders do anything other than focus on delivering value to those shareholders as long as those actions are within the letter and spirit of applicable regulations? I don’t own shares in Google, but if I did, I would expect Google to take appropriate steps to grow the company’s revenue and profits. The reason is anchored in how capitalism works. Is Mr. Katz uncomfortable with capitalism when practiced with considerable skill and finesse?
The final point is an interesting one. Mr. Katz offers:
But mostly, Google should take a good, hard look at its philosophy and business model, and ask if this is the company Sergey Brin and Larry Page set out to build when they chose as their motto: “Don’t be evil.”
Ah, the chestnut “Don’t be evil.” In my research, the phrase originated with another Googler and it ended up becoming the shibboleth waved in front of the bulls running after Messrs. Brin and Page. The current business environment is easy to explain: If you can generate revenue by an appropriate business model, do it. One does not need to flip through Schumpeter’s or Austrian school economists’ writings for an explanation. Good and evil have zero to do with business. I have experienced the pragmatism of changing a flight using Orbitz. I have to pay. I have experienced the thrill of contacting a merchant, ordering a product identified by Nextag, and then receiving a bait-and-switch in a week. I had to live with the trickery because neither the online service nor the delivery company was “responsible.” Hmmm. Why not do some local investigation into business practices, Mr. Katz?
Now what this News Corp. write up is “about” in my opinion is:
- Nextag wants more traffic and preferential listings for its Web pages. I understand the desire to get more from Google’s free service, but why should Google do any more or any less than it is now doing? Google is tweaking its systems, methods, and business models. Are these actions not permitted? “Compete more effectively. Complain less.” might be a starting point.
- I believe News Corp. wants to advance agendas. I hope that the Wall Street Journal is above the alleged criminal behavior associated with some News Corp. properties. But there is Fox News, and it seems to advance an agenda. When I read Mr. Katz-type opinion pieces, I wonder, “Is the Wall Street Journal looking for clicks or just poking Google in the ribs because it is thriving and the Wall Street Journal is dogpaddling in terms of advertising revenue?” Just a question. Nothing concrete. But there is potential for bias when making decisions about what action to take, what story to feature, what numerical recipe to employ.
- Writing about Google serves the needs of the readers. I think that the Wall Street Journal is adopting some of the methods which have made Mr. Murdoch’s properties successful for many years. Hard business reporting is expensive and Google is important. I would like to see more analysis of Google’s enterprise strategy as articulated by the most recent vice president responsible for what seems to me a most disappointing market initiative. I would like to see less of the Monday morning quarterbacking.
I don’t have any direct involvement with Google. In fact, I spend less and less of ArnoldIT’s research resources chasing down the company’s innovations. The reason warrants an in-depth article in a newspaper like the Wall Street Journal. Why has Google’s ability to innovate internally become such a problem? What are the management methods Google will use to integrate its recent spate of acquisitions into the firm’s existing service line? How will Google’s dataspace and semantic technology contribute to predictive search outputs; that is, search without search? I am 68, and I think I will go gently into that good night without reading substantive business analyses about an important company in a Murdoch publication. I will have ample opportunities to read baloney about Google. That’s too bad. Who’s being “evil”? Am I? Google? The Wall Street Journal?
Stephen E Arnold, June 8, 2012
Freebie from ArnoldIT.com
May 13, 2012
In 1981, I joined the Courier Journal and Louisville Times. That was 31 years ago. I am not sure how I made the decision to leave the Washington, DC, area to journey to a city whose zip code and telephone area code were unknown to me. I am a 212, 202, and 301 type of person.
I recall meeting Barry Bingham Jr. He asked me what I did in my spare time. I was thunderstruck. My former employers—Halliburton Nuclear Utility Services and Booz, Allen & Hamilton—never asked me those questions. Those high powered, hard charging outfits wanted to know how much revenue I had generated and how much money I had saved the company, when the next meeting with the Joint Committee on Atomic Energy was, and how the Cleveland Design & Development man trip vehicle was rolling along. The personal stuff floored me.
I did not have an answer. As a Type A, Midwestern, over-achieving, no-brothers-and-no-sisters worker bee, fun was not a big part of my personal repertoire.
I asked him, “Why?”
I recall to this day his answer, “I want our officers and employees to have time with their families, get involved in the community, and do great work without getting into that New York City thing.”
Interesting. The Courier Journal had a very good reputation. The newspaper was profitable, operated a wide range of businesses, printed the New York Times’s magazine for the Gray Lady, and operated a commercial database company. In fact, in 1980 the Courier Journal was one of the leaders in commercial online information, competing with a handful of other companies in the delivery of information via digital channels, not the dead-tree, ruin-the-environment, and dump-chemicals approach of most publishing companies.
In 1986, Gannett bought the Courier Journal. The commercial database unit was of zero interest to Gannett, so it and I were sold to Bell+Howell. After a short stint at a company entrenched in 16 mm motion film projectors, I headed back to New York City.
I retained my residence in Louisville, and I have watched the trajectory of the Courier Journal as it moved forward.
I have to be blunt. The Courier Journal is not the newspaper, the company, or the community force it was when I joined Mr. Bingham and a surprisingly diverse, bright, forward-looking team 31 years ago. The 1981 management approach of the Courier Journal was a culture shock to me. Think of the difference between Dick Cheney and Mr. Rogers. The 2012 approach saddens me.
This morning I read “Answering Your Questions on CJ Changes,” written by a person whom I do not know. The author of the article is Wesley Jackson, publisher of the Courier Journal. (I never liked the acronym CJ and still do not.)
The main point of the article is that the Courier Journal has to raise its prices. Last week, Mr. Jackson wrote a short article in the Courier Journal informing subscribers a letter would arrive explaining the new services that would be available. We received our letter on Wednesday, May 9, 2012. We called on Thursday, May 10, 2012, and cancelled our subscription. I am not sure how many other subscribers took this action, but a sufficient number of Courier Journal readers called to kill the phone system at the newspaper.
Mr. Jackson wrote this morning:
Unfortunately our Customer Service Center’s phone system had technical problems, and many of you had long wait times or could not get through to get your questions answered. That I know was frustrating.
I bet. I would love to see the data about the number of calls and the number of cancellations that the paper received when it announced the rate hike, a free iPad application for subscribers, and an email copy of the newspaper sent each day to paying customers.
The write up troubled me for several other reasons:
- Some of the word choices were of the touchy-feely school of communication. There are 19 “we’s”. The word “value” appears twice, there are seven categoricals: six all’s and one never; and the word “conversation” appears twice.
- There is at least one split infinitive: “to personally apologize.”
- An absolutely amazing promise expressed in this statement: “For those of you who would like to ask questions directly, please email me at email@example.com or send a letter to Publisher, Courier-Journal Media, 525 W. Broadway, Louisville, KY 40202. I promise you will each receive a response.”
“Promise,” “all,” and “never”—yep, I believe those assertions.
I would have included an image of Wesley Jackson but I had to pay for it. Not today, sorry.
My view is that I hear a death rattle from the Courier Journal. The reality of the newspaper is that it runs more and more syndicated content. The type of local coverage for which the paper was known when I joined in 1981 has decreased over the years. When I want news, I look at online services. What I have noticed is that what appears in the Courier Journal has been mentioned on Facebook, Twitter, or headline aggregation services two or three days before the information appears in either the Courier Journal’s hard copy edition or its online site, www.courier-journal.com.
Dave Kellogg, the former president of MarkLogic, used to chide me that I should not refer to major publishing operations as “dead tree publishers.” My view was and is that I am entitled to my opinion. Traditional publishing companies have failed to respond to new opportunities to disseminate and profit from information opportunities.
The list of mistakes includes:
- Belief that an app will generate new revenue. Unfortunately apps are not automatic money machines. (Print-centric apps are not the go-to medium for many digital device users.)
- Assumptions about a person’s appetite for paying for “nice to have content.” (One pays for “must have” content, not “nice to have” content.)
- Failure to control costs. (Print margins continue to narrow as traditional publishers try to regain the glory of the pre digital business models.)
- Firing staff who then go on to compete by generating content funded by a different business model. (This blog is an example. We do online advertising and inclusions and sell technical services. For some reason, this works for me thanks to my team which includes some former “real” journalists.)
- Assuming that new technology for printing color on newsprint equips an information technology department to handle other information technologies in an effective manner. (Skill in one technical area does not automatically transfer to another technical field.)
I can hear the labored breathing of a local newspaper struggling to stay alive. What do you hear?
Stephen E Arnold, May 13, 2012
March 31, 2012
I find the notion of pundits fascinating. The US in 2012 pivots on a news hook, the Warhol fame thing, and a desire to share viewpoints with Flipboard and Pulse users.
This morning I was listening to the crackle of small arms fire in rural Kentucky. Dawn had not yet extended its crepuscular reach to my hollow but two write ups did. Neither is one of those magnum loads squirrel hunters desire here in the Commonwealth. Nope, these were birdshot, but each write up is interesting nonetheless.
Both indirectly concern search and retrieval. Both found their way into my “gems of the poobahs” folder.
First, I noted the digital Atlantic’s write up “The Advertising Industry’s Definition of ‘Do Not Track’ Doesn’t Make Sense.” What caught my attention was the juxtaposition of the word “advertising” with the phrase “doesn’t make sense.” Advertising making sense? The Atlantic “real” journalist has not watched television with a 67 year old. More than half of the TV commercials which I find embedded in basketball games every four minutes don’t make sense. Advertising is about creating a demand for must-have products. Advertising is part of the popular culture and an engine of growth for companies unable to generate sales without the craft and skill of psychological tactics. Check out an advertisement for Kentucky bourbon. Does this headline make sense?
“Honk if you’re proud to be a redneck?”
As a resident of Kentucky, I am not sure I know what a redneck is, but I bet those folks in Boston do. But what’s the “making sense” part? What advertising does is tickle the brain to make some folks want to drink. And we all know how important it is to imbibe whiskey, engage in “real” journalism, and ferry children to soccer practice. Yep, makes “sense” to me.
But here’s the passage which caught my attention:
Stanford’s Aleecia McDonald found that 61 percent of people expect that clicking a Do Not Track button should shut off *all* data collection. Only 7 percent of people expected that websites could collect the same data before and after clicking a ‘Do Not Track’ button. That is to say, 93 percent of people do not understand the industry’s definition of DNT. Which totally makes sense! Who would ever think saying, “Do not track me,” actually means, “It’s fine to collect data on me, but don’t show me any signs that you’re doing so.” Simply because the industry itself has defined ‘Do Not Track’ in an idiosyncratic way doesn’t mean their self-serving decision should be the basis for all policy and practice in this field.
Almost any redneck would understand this passage, the implications of persistent cookies, and the distinction between various types of tracking, including my favorite, the iFrames-based method.
Second, I read “Debunking Senator Al Franken On Google, The Internet & Privacy.” This screed is from a “real” journalist and favorite source of juicy quotes on the subject of search and retrieval. The point of the write up is that despite the author’s affection for a US senator as a comedian, the US senator does not know beans about tracking, Google, and, by extension, search and retrieval. Now “search” does not mean find. Search, I believe, means to the “real” journalist using methods to generate traffic to a Web site. I define “search” differently, but the good part in my opinion is this passage:
Ya think? But I mean, Facebook kind of does sell my friends. I can export all of them out to Yahoo and Bing, because Facebook and Yahoo and Bing all have deals. I can’t export them to Google, because, you know, they aren’t friends. Would you call that selling to the highest bidder? When I go over to search on Bing, by default, all my Facebook friends are being used to personalize my search results. Oh, I can opt-out, but you know how hard that is. Since that’s part of a Bing-Facebook deal, is that a line that’s crossed?
Please, read the entire “real” journalistic analysis of a talk by a US senator. I must admit I don’t relate to the questions and analytic points in this paragraph. I recognize the names of the companies mentioned, but “the deal” baffles me.
Why do I care? Three points:
- I sense the emotion in these write ups. Passion is good for advertising and good for capturing attention. However, I am struggling to figure out what the problem is. Advertising seems to be what America is. Untangling the warp and woof of this fabric is difficult for me.
- The ad hominem method and charged language cause me to think that the lingo of advertising has become the common parlance of “real” journalists.
- I struggle to unravel the meaning of certain parts of these two write ups. Am I alone?
Net net: technology and advertising are an interesting compound. Now “real” journalism is quite similar. To quote one “real” journalist, “Ya think?” Well, not much.
Stephen E Arnold, March 31, 2012
Sponsored by Pandia.com
March 31, 2012
David Bamman, Brendan O’Connor, and Noah A. Smith present some interesting facts based on a study they wrote about in their article, Censorship and Deletion Practices in Chinese Social Media. Their study touches on a variety of different aspects regarding how China allegedly controls the intake and outflow of information.
The Chinese government methods are far different from the United States’ approach. My understanding of the situation is that China takes censorship to extremes and infringes on the freedom of its citizens using the GFW (Great Firewall of China), which filters key phrases and words, preventing access to sites like America’s Facebook and Google. However, Sina Weibo is the Chinese equivalent of Facebook where bloggers post and pass information presumably in a way the officials perceive as more suitable for the Middle Kingdom.
Sina Weibo is monitored and as long as members stay within the boundaries or disguise their information, posts go unnoticed. If any of the outlawed phrases are entered, the user’s post is deleted and anyone searching for the information is met with the phrase ‘Target weibo does not exist’. If the user properly masks the phrase or words used, the information will get through, showing that there is the possibility of future change regarding the censorship practices in China.
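To make the point about masking concrete, here is a minimal sketch of phrase-based filtering. It assumes plain case-insensitive substring matching against a blocklist, which is a simplification for illustration and not a description of how Sina Weibo or the GFW actually works. The names `BLOCKED_TERMS` and `is_deleted` and the sample posts are hypothetical.

```python
# Hypothetical blocklist; the real list of sensitive terms is not public.
BLOCKED_TERMS = {"jiang zemin"}


def is_deleted(post, blocked=BLOCKED_TERMS):
    """Return True if the post contains any blocked phrase.

    Assumes simple case-insensitive substring matching, which is only
    an illustration of the idea of filtering on key phrases.
    """
    text = post.lower()
    return any(term in text for term in blocked)


if __name__ == "__main__":
    plain = "Rumor going around about Jiang Zemin"
    masked = "Rumor going around about J1ang Zem1n"
    print(is_deleted(plain))   # the literal phrase matches the blocklist
    print(is_deleted(masked))  # the disguised spelling does not match
```

A filter that matches literal strings catches the plain post and misses the disguised one, which is the gap users exploit when they mask a phrase.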
The GFW will catch obvious outgoing information such as the names of political figures, which were monitored during the study. The article asserted:
In late June/early July 2011, rumors began circulating in the Chinese media that Jiang Zemin, general secretary of the Communist Party of China from 1989 to 2002, had died. These rumors reached their height on 6 July, with reports in the Wall Street Journal, Guardian and other Western media sources that Jiang’s name had been blocked in searches on Sina Weibo (Chin, 2011; Branigan, 2011). If we look at all 532 messages published during this time period that contain the name Jiang Zemin, we note a striking pattern of deletion: on 6 July, the height of the rumor, 64 of the 83 messages containing that name were deleted (77.1 percent); on 7 July, 29 of 31 (93.5 percent) were deleted.
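The percentages in the quoted passage can be checked from the message counts it reports. The sketch below is only arithmetic on those quoted figures; the helper name `deletion_rate` is hypothetical and is not from the study.

```python
def deletion_rate(deleted, total):
    """Percentage of messages deleted, rounded to one decimal place."""
    return round(100.0 * deleted / total, 1)


if __name__ == "__main__":
    # Counts as reported in the quoted passage.
    print(deletion_rate(64, 83))  # 6 July
    print(deletion_rate(29, 31))  # 7 July
```

Both results agree with the 77.1 percent and 93.5 percent figures in the quotation.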
No firewall is perfect, but according to the studies done on searches, blogs and texts containing prohibited information, China has a pretty impressive figure. It may not seem reasonable by American standards, but by filtering anything it deems politically sensitive, China protects the privacy of the country, preventing global rumors and interference.
On one level, censorship makes sense, in particular regarding the business world. The Chinese government makes its corporations responsible for their employees, meaning if an employee is blogging instead of working and puts in illegal information, the company itself is fined, or worst case scenario, shut down. Thus Chinese factories have a high rate of productivity because their workers are actually doing their job.
How is China’s alleged position relevant to the US? There may be little relevance, but to officials in other countries, the article’s information may be just what one needs to check into a Holiday Inn of censorship.
Jennifer Shockley, March 31, 2012
Sponsored by Pandia.com