Internet Business: Slightly Different Points of View

September 29, 2014

First, navigate to “Another Top Investor Sounds the Alarm: When the Market Turns, a Bunch of Startups Are Going to Vaporize.” No big surprise here. The main idea is, in my opinion:

Over the past few years, it’s been relatively easy for startups to raise money from venture capitalists. In some cases, they’re raising hundreds of millions of dollars to keep their companies afloat. But behind the scenes, they’re plowing through that money either on marketing, overhead, or some other expense, which results in high burn rates. These bloated companies are using their millions to hide serious flaws in their business models.

At some point, those who provide the bucks to the venture firms will want a return. Many of the Fancy Dan outfits are not among the world’s most liquid operations. To raise cash, MBAs and accountants can cook up some quite remarkable solutions. The actions cascade down the line and end up pushing technology companies like those that pitch wild and crazy content technology into an Iron Maiden. This is essentially a casket with spikes protruding inward from its walls and from the inside of its lid.

Ta da.

The individual is placed into the Iron Maiden and the door is shut. Ouch.

Now navigate to either the Google book itself or the concepts Web site at http://bit.ly/1mr9OvS. Eric Schmidt argues that businesses should be like Google. You know: moon shots, trying stuff and failing fast (I am not sure how fast Google has failed at social networking, but I don’t want to be argumentative), and valuing numbers and data over any humanoid subjectivity.

For many search and content processing companies, the senior managers have been failing for years in some cases. I could make a list of these would-be start ups and provide their dates of inception, but heck, why embarrass outfits like Attivio, Coveo, Digital Reasoning, Lucid Imagination (now Lucid Works, to which I am tempted to add “Really?” but I will not), and quite a few others.

The point is that we have two somewhat conflicting interpretations of the present business climate. The tweets that inspired the Business Insider write up take a hard look at what happens when the money goes away. No money means that affected firms fire people, raise prices, and pivot, along with a half dozen or so other MBA maneuvers, before shutting the doors as Convera, Delphes, and Entopia did. A few lucky outfits will sell out, as Endeca, Exalead, and iPhrase did. A few will struggle along, sort of open and sort of closed, like a number of French search and content processing firms.

On one hand, these outfits are toast if more money is not “found.” On the other hand, forget money. In Google’s world view, these companies need to be more like Google or out Google Google.

The reality is that the contraction of search and content processing has already begun. Some outfits are going to have to find a way to deliver a solution that solves an actual problem and generates sustainable revenue. Companies in this spot include IBM with its Watson project, Hewlett Packard with its Autonomy IDOL technology, and Palantir, a billion dollar baby of considerable note.

My view is that the doom and gloom expressed in the Business Insider write up is more likely to occur than a Google style entity arising from the Google Moon shot and allied suggestions. I am not sure the Google recommendations apply to Google. A company that is 15 years old and has one revenue stream may be a success that fulfills Steve Ballmer’s one trick pony observation.

For search and content processing vendors, there is no easy way out unless money remains plentiful and Google’s advice actually works for an information retrieval company.

Stephen E Arnold, September 29, 2014

Spy Tools for Investors

September 26, 2014

Never let it be said that financiers don’t leverage all the useful technology they can find. The Silicon Valley Business Journal reports that “Addepar’s Palantir Veterans Use Spy Tools to Map Investment Risk.” Hmm, I wonder whether the company will want to work the phrase “spy tools” into its advertising. Writer Jason McCormick, citing a New York Times article, summarizes:

“Addepar’s software, which launched five years ago, maps investors’ holdings to determine risk and portfolio sustainability. The company, whose leadership did turns at Palantir Technologies, last year raised significant capital to bring its big-data platform to market.

The company was founded by Palantir veteran Jason Mirra and Joe Lonsdale, who was a co-founder at Palantir. Addepar’s current CEO, Eric Poirier, also worked at Palantir.

The Times reported that Addepar’s users include family offices, banks and wealth managers, such as Iconiq Capital, which oversees a part of Mark Zuckerberg’s portfolio.”

McCormick goes on to point out that Addepar’s services can run from $50,000 to “well over” $1 million, depending on the amount of data involved. These companies must be pretty convinced of Addepar’s abilities.

Much is (rightly) made of Addepar’s roots in Palantir, an outfit we’ve been following with interest (though I’d add that CEO Poirier also spent time at the financial powerhouse Lehman Brothers). I think it interesting, though, that the team pulled in a former Oracle executive, who happens to have experience leading a private equity firm’s software investing team, to be COO: Karen White. So far, that seems to have been a wise choice.

Cynthia Murrell, September 26, 2014

Sponsored by ArnoldIT.com, developer of Augmentext

Lucid Works: Pando Daily Sets the Record Straight

September 23, 2014

On LinkedIn I learned about this Pando Daily write up: “How Disgruntled Ex-Employees and Bad Reporting Hung LucidWorks Out to Dry.” I noted the Venture Beat analysis of Lucid Works in my post on September 6, 2014. My focus was the wild and crazy information from an “expert” about various factoids. You can read my reaction to the “Trouble at LucidWorks” story here.

The Pando Daily story comes at the issue in a different way. I was delighted to see that Pando found the “expert’s” comments a bit wobbly. There was an interesting rundown about Lucid Works that seems to have come from a different point of view. In a way, the two stories—Venture Beat’s and Pando Daily’s—are a bit like the he said, she said information provided to police investigating a married couple’s disturbing-the-peace incident. I am no cop, so I can’t figure out who is correct and who is incorrect.

Pando takes this tack:

More accurately: It’s [Lucid Works] a startup, and this shit is hard.

I understand that search is hard, but is an eight year old company a start up? That time span baffled me. Coveo asserts that it too is a start up. Other search vendors dating from the implosion of the Big Five in 2006 also use the start up moniker.

The article points out that there are happy employees and positive investors. More money is likely to be needed. Pando Daily quotes a backer as saying:

We won’t start looking for an expansion round until early next year.

ElasticSearch has amassed about $90 million in funding. So LucidWorks may be thinking it needs the same scale of investment to take wing.

With regard to management, Pando Daily reports that the new top dog is the type of CEO who can deliver revenues. The new president—Will Hayes—is described in this context:

On this point, VentureBeat seems oddly hung up on the idea that Hayes is a first-time CEO, perhaps failing to realize that Silicon Valley was (and continues to be) literally built on the success of first-time CEOs. Not to over egg the point, but Mark Zuckerberg and Steve Jobs were first-time CEOs.

Pando Daily added:

As an early member of the Splunk team, Hayes is certainly more qualified for this job than 99 percent of the candidates out there, and more importantly, given that he didn’t found the company, he appears excited about the category.

Pando Daily reminded me that good start ups fire people. I understand the difference between the Silicon Valley approach to management and that practiced at Halliburton and Booz, Allen & Hamilton where I worked for many years. The idea of stability is not always congruent with the needs of a fast moving, pivoting technology company.

Pando Daily also takes issue with Venture Beat’s report that Lucid Works fumbled deals with some real big companies. Pando Daily asserted:

These accounts may or may not have any basis in reality, but they hardly indicate a failing company. The very nature of sales and business development is that deals fall apart all the time. Sometimes those are big deals, sometimes not. The facts are that LucidWorks counts Apple, Sears, Verizon, ADP, Raytheon, Zappos, Qualcomm, Ford, eHarmony, Cisco, and others among current customers.

My reaction to this is okay, but won’t naming these firms give ElasticSearch and other competitors a target at which to shoot? Some content processing vendors like Palantir and Recorded Future don’t provide much information about their customers.

On the all important revenue front, Pando Daily quoted the new top dog at Lucid Works as saying:

“$12 million in services revenue isn’t worth shit,” Hayes says. “But $12 million in product sales on subscription? That’s a $100 million business.”

I agree. Unless the subscriber terminates the subscription. As the competition among content processing vendors heats up, some firms will be quite aggressive in their attempts to take away business. Amazon, for example, seems to be struggling with search, but it could get its act together and offer a good enough solution at very competitive prices. Amazon is not the only sharp toothed outfit in the pond.

Pando Daily tracked down its own search wizard. That poobah said:

Not everyone agrees that enterprise search is quite this sexy. One enterprise analyst, speaking to Pando on the condition of anonymity, describes it as “not that big of an end market.” But at the same time, it’s one that’s still out there for the taking. “There isn’t really a single company or set of companies that have dominant products in the space,” this analyst says. Google and Microsoft have entered the market (the latter via acquisition) with low-cost offerings that would seem to make the competitive environment more challenging for LucidWorks and other upstarts. But according to the company’s supporters, these products are targeting different, less big data-centric applications and are thus not a valid comparison.

If you have ever listened to opposing expert witnesses in a legal dispute, you know that the same factoid gets very different treatment from each expert. That’s what makes subjective expertise difficult to interpret. My view is that enterprise search is struggling for credibility. Some of the value of information retrieval has been exhausted by vendors now out of business. These include Convera, Delphes, Entopia, Siderean, and others. Some credibility has been eroded as a result of the Fast Search & Transfer matter. The CEO was hit with a jail term and a ban on working in search for a couple of years. Then there is the ongoing dispute between Hewlett Packard and Autonomy. IDOL is an aging technology, like Endeca. But the mudslinging about search and content processing does not improve the image of those working in this sector.

Consequently, information retrieval companies are working overtime to explain their solutions in terms that do not invoke memories of Convera or Fast Search. Palantir is a data mining company. Recorded Future does predictive analytics. Coveo is eDiscovery and customer support. Search vendors are using a wide range of jargon to describe findability. Lucid Works is brave in using enterprise search with a dash of Big Data in its marketing.

Pando Daily said:

Journalism is tough, particularly in the technology sector. Reporters in this industry are asked to cover complex and rapidly evolving companies that often take on hordes of venture cash and set outrageous performance expectations. Unseemly as it may be, stories of failure and calamity make for good scoops, and in these cases ex-employees and competitors often make the best sources. Unfortunately, they also can be the most biased sources and are often in the best position to credibly lead a journalist astray. LucidWorks certainly has its warts and its scars. But that doesn’t make it trouble, that only makes it a startup.

One question remains: When does a company cease to be a start up and start to be a viable company? Is it one year, four years, or eight years? I just don’t know, but I think that companies that have been in business for almost a decade may not be start ups. Management with a start up mentality may not want to face the cold realities expected of established, stable firms. With Lucid’s technology originating with a community, management may be the issue to watch at Lucid Works. Good management can produce revenue, happy employees, and contented customers. Its absence is often evidenced by a lack of harmony.

Stephen E Arnold, September 23, 2014

Lucid Works: Really?

September 21, 2014

Editor’s Note: This amusing open letter to Chrissy Lee at Launchsquad Public Relations points out some of the challenges Lucid Imagination (now Lucid Works) faces. Significant competition exists from numerous findability vendors. The market leader in open source search is, in Beyond Search’s view, ElasticSearch.

Dear Ms. Lee,

I sent you an email on September 18, 2014, referring you to my response to Stacy Wechsler at Hired Gun public relations. I told you I would create a prize for the news release you sent me. I am retired, but I do not have much time to write for PR “professionals” who send me spam, fail to do any research about my background, and do not understand the topic addressed in their own emails.

Some history: I recall the first contact I had from Lucid Imagination in 2008. A fellow named Anil Uberoi sent me an email. He and I had a mutual connection, Mark Krellenstein, who was the CTO of Northern Light when it was a search vendor.

I wrote a for-fee report for Mr. Uberoi, who shortly thereafter left Lucid for an outfit called Kitana. His replacement was a fellow named David. He left and migrated to another company as well. Then a person named Nancy took over marketing and quickly left for another outfit. My recollection is that in a span of 24 months, Lucid Imagination churned through technical professionals, marketers, and presidents. Open source search, it seemed, was beyond the management expertise of the professionals at Lucid.

When co-founder Mark Krellenstein cut his ties with the firm, I wondered how Mr. Krellenstein could deliver the innovative folders function for Northern Light and flop at Lucid. Odd.

Recently I have been the recipient of several emails sent to my two major email accounts. For me, this is an indication of spam. I knew about the appointment of another president. I read “Trouble at Lucid Works: Lawsuits, Lost Deals, and Layoffs Plague the Search Startup Despite Funding.” Like other pundit-fueled articles, it probably contains some truth, some exaggeration, and some errors. The overall impression the write up left on me is that Lucid Works seems to be struggling.

Your emails to me indicate that you perceive me as a “real” journalist. Call me quirky, but I do not like it when a chipper young person writes me, uses my first name, and then shovels baloney at me. You are the purveyor of search silliness for your employer Launchsquad, which seems to be Lucid Works’ biggest fan and current content marketing agent. Not surprisingly, the new Lucid Fusion product is the Popeil pocket fisherman of search. Fusion slices, dices, chops, and grates. Here’s what Lucid Works allegedly delivers via Lucene/Solr and proprietary code:

  • Modular integration. Sorry, Ms. Lee, I don’t know what this means.
  • Big Data Discovery Engine. Ms. Lee, Lucid has a search and retrieval system, not a Cybertap, Palantir, or Recorded Future type system.
  • Connector Framework. Ms. Lee, licensees want connectors included. Salesforce bought Entropy Soft to meet this need. Oracle bought Outside In for the same reason. Even Microsoft includes some connectors with the quite fragile Delve system for Office 365.
  • Intelligent Search Services. Ms. Lee, I suggest you read my forthcoming article in KMWorld about smart software. Today, most search services use the word intelligent when the technology in use has been available for decades.
  • Signals Processing. Ms. Lee, I suggest you provide some facts for signals processing. I think in terms of SIGINT, not crude click log file data.
  • Advanced Analytics. Ms. Lee, I lecture at several intelligence and law enforcement conferences about “analytics.” The notion of “advanced” analytics is at odds with the standard numerical recipes that most vendors use. The reason “advanced” is not a good word is that there are mathematical methods that can deliver significant returns. Unfortunately, today’s computer systems cannot get around the computational barriers that bring x86 architectures to their knees.
  • Natural Language Search. Ms. Lee, I have been hearing about NLP for many years. Perhaps you have not experimented with the voice search functions on Apple and Android devices? You should. Software does a miserable job of figuring out what a human “means.”

So what?

Frankly, I am not confident that Lucid Works can close the gap between your client and ElasticSearch. Furthermore, I don’t think Lucid Works can deliver the type of performance available from Searchdaimon or ElasticSearch. The indexing and query processing gap between Lucid Works and Blossom Software spans orders of magnitude. How do I know? Well, my team tested Lucid Works’ performance against these systems. Why don’t you know this when you write directly to the person who ran the tests? I sent a copy of the test results to one of Lucid Works’ many presidents.

Do I care about Ms. Lee, the new management team, the investors, or the “new” Lucid?

Nope.

The sun has begun to set on vendors and their agents who employ meaningless jargon to generate interest from potential licensees.

What’s my recommendation? I suggest a person interested in Lucid navigate to my Search Wizards Speak series and read the Lucid Imagination and Lucid Works interviews. Notice how the story drifts. You can find these interviews at www.arnoldit.com/search-wizards-speak.

Why does Lucid illustrate “pivoting”? It is easy to sit around and dream about what software could do. It is another task to deliver software that matches products and services from industry leaders and consistent innovators.

For open source search, I suggest you pay attention to www.Flax.co.uk, www.Searchdaimon.com, www.sphinxsearch.com, and www.elasticsearch.com for starters. Keep in mind that other competitors like IBM and Attivio use open source search technology too.

You will never have the opportunity to work directly for me. I can offer one small piece of advice: Do your homework before writing about search to me.

Your pal,

Stephen E Arnold, September 21, 2014

Nowcasting: Lots of Behind the Scenes Human Work Necessary

September 10, 2014

Some outfits surf on the work of others. A good example is the Schubmehl-Arnold tie up. Get some color and details here.

Other outfits have plenty of big thinkers and rely on nameless specialists to perform behind the scenes work.

A good example of this approach is revealed in “Predicting the Present with Bayesian Structural Time Series.” The scholarly write up explains a procedure for performing “nowcasting.” The idea is that one can use real time information to help estimate other things that are happening right now.

Instead of doing the wild and crazy Palantir/Recorded Future forward predicting, these Googlers focus on the now.

I am okay with whatever outputs predictive systems generate. What’s important about this paper is that the authors document when humans have to get involved in the processes constructed from numerical recipes known to many advanced math and statistics whizzes.

Here are several I noted:

  1. The modeler has to “choose components for the modeling trend.” No problem, but it is tedious and important work. Get this step wrong and the outputs can be misleading.
  2. Selecting sampling algorithms, page 6. Get this wrong and the outputs can be misleading.
  3. Simplify by making assumptions, page 7. “Another strategy one could pursue (but we have not) is to subjectively segment predictors into groups based on how likely they would be to enter the model.”
  4. Breaking with Bayesian orthodoxy, page 8. “Scaling by s_y^2 is a minor violation of the Bayesian paradigm because it means our prior is data determined.” (See the note at the end of this post about the mathematical typography.)

There are other examples. These range from selecting what outputs from Google Trends and Correlate to use to the sequence of numerical recipes implemented in the model.
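To see where this manual work lives in practice, here is a minimal sketch of a structural time series nowcast in Python. This is not the Google authors’ code; it is a rough analogue built on statsmodels’ UnobservedComponents with synthetic data, and the two predictor columns stand in for Google Trends style signals. The component specification, the predictor selection, and the estimation settings are exactly the kinds of human choices itemized above.

```python
# A rough analogue of a structural time series nowcast (synthetic data).
# The component spec, the predictors, and the fit settings are human choices.
import numpy as np
from statsmodels.tsa.statespace.structural import UnobservedComponents

rng = np.random.default_rng(42)
n = 120  # ten years of monthly observations

# Stand-ins for Google Trends / Correlate query series.
latent = np.cumsum(rng.normal(0, 1, n))
exog = np.column_stack([
    latent + rng.normal(0, 0.5, n),   # query_a: tracks the latent trend
    rng.normal(0, 1, n),              # query_b: mostly noise
])
y = 0.8 * latent + 0.3 * exog[:, 1] + rng.normal(0, 0.5, n)

# Choice 1: which structural components to include ("local linear trend" here).
# Choice 2: which predictors to feed in as regression terms.
model = UnobservedComponents(y[:-1], level="local linear trend", exog=exog[:-1])
fitted = model.fit(disp=False)        # Choice 3: estimation settings.

# Nowcast the held-out most recent point from the latest predictor values.
nowcast = fitted.forecast(steps=1, exog=exog[-1:])
print("nowcast:", float(nowcast[0]), "actual:", float(y[-1]))
```

Swap in a different level specification or drop a predictor and the nowcast shifts. That is the point: the outputs ride on choices no end user ever sees.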

My point is that Google is being upfront about the need for considerable manual work in order to make its nowcasting predictive model “work.”

Analytics deployed in organizations depend on similar human behind the scenes work. Get the wrong thresholds, put the procedures in a different order, or use bad judgment about what method to use and guess what?

The outputs are useless. As managers depend on analytics to aid their decision making and planners rely on models to predict the future, it is helpful to keep in mind that an end user may lack the expertise to figure out if the outputs are useful. And if they are useful, how much confidence should a harried MBA put in predictive models?

Just a reminder that ISIS caught some folks by surprise, analytics vendor HP seemed to flub its predictions about Autonomy sales, and the outfits monitoring Ebola seem to be wrestling with underestimations.

Maybe enterprise search vendors can address these issues? I doubt it.

Note: my blog editor will not render mathematical typography. Check the original Google paper on page 8, line 4 for the correct representation.

Stephen E Arnold, September 10, 2014

Autumn Approaches: Time for Realism about Search

September 1, 2014

Last week I had a conversation with a publisher who has a keen interest in software that “knows” what content means. Armed with that knowledge, a system can then answer questions.

The conversation was interesting. I mentioned my presentations for law enforcement and intelligence professionals about the limitations of modern and computationally expensive systems.

Several points crystallized in my mind. One of these is addressed, in part, in a diagram created by a person interested in machine learning methods. Here’s the diagram created by SciKit:

[Image: SciKit machine learning estimator selection flowchart]

The diagram is designed to help a developer select from different methods of performing estimation operations. The author states:

Often the hardest part of solving a machine learning problem can be finding the right estimator for the job. Different estimators are better suited for different types of data and different problems. The flowchart below is designed to give users a bit of a rough guide on how to approach problems with regard to which estimators to try on your data.

First, notice that there is a selection process for choosing a particular numerical recipe. Now who determines which recipe is the right one? The answer is the coding chef. A human exercises judgment about the particular sequence of operations that will be used to fuel machine learning. Is that sequence of actions the best one, the expedient one, or the one that seems to work for the test data? The answer to these questions determines a key threshold for the resulting “learning system.” Stated another way, “Does the person licensing the system know if the numerical recipe is the most appropriate for the licensee’s data?” Nah. Does a mid tier consulting firm like Gartner, IDC, or Forrester dig into this plumbing? Nah. Does it matter? Oh, yeah. As I point out in my lectures, the “accuracy” of a system’s output depends on this type of plumbing decision. Unlike a backed-up drain, flaws in smart systems may never be discerned. For certain operational decisions, financial shortfalls or the loss of an operations team in a war theater can be attributed to one of many variables. As decision makers chase the Silver Bullet of smart, thinking software, who really questions the output in a slick graphic? In my experience, darned few people. That includes cheerleaders for smart software, azure chip consultants, and former middle school teachers looking for a job as a search consultant.

Second, notice the reference to a “rough guide.” The real guide is an understanding of how specific numerical recipes work on a set of data that allegedly represents what the system will process when operational. Furthermore, there are plenty of mathematical methods available. The problem is that some of the more interesting procedures lead to increased computational cost. In a worst case, the more interesting procedures cannot be computed on available resources. Some developers know about P=NP and Big O. Others know to use the same nine or ten mathematical procedures taught in computer science classes. After all, why worry about math based on mereology if the machine resources cannot handle the computations within time and budget parameters? This means that most modern systems are based on a set of procedures that are computationally affordable, familiar, and convenient. Does this similarity of procedures matter? Yep. The generally squirrely outputs from many very popular systems are perceived as completely reliable. Unfortunately, the systems are performing within a narrow range of statistical confidence. Stated more harshly, the outputs are just not particularly helpful.
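As a toy illustration of the selection problem (mine, not the flowchart author’s), the same synthetic data scored with two stock scikit-learn estimators can yield noticeably different accuracy, and nothing in the output tells a licensee which recipe was the right one:

```python
# Same data, two different numerical recipes: the scores differ, and the
# system itself offers no hint about which choice was appropriate.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=40, n_informative=5,
                           n_redundant=20, random_state=0)

for estimator in (SGDClassifier(random_state=0), GaussianNB()):
    scores = cross_val_score(estimator, X, y, cv=5)
    print(f"{type(estimator).__name__}: {scores.mean():.3f}")
```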

In my conversation with the publisher, I asked several questions:

  1. Is there a smart system like Watson that you would rely upon to treat your teenaged daughter’s cancer? Or, would you prefer the human specialist at the Mayo Clinic or comparable institution?
  2. Is there a smart system that you want directing your only son in an operational mission in a conflict in a city under ISIS control? Or, would you prefer the human-guided decision near the theater about the mission?
  3. Is there a smart system you want managing your retirement funds in today’s uncertain economy? Or, would you prefer the recommendations of a certified financial planner relying on a variety of inputs, including analyses from specialists in whom your analyst has confidence?

When I asked these questions, the publisher looked uncomfortable. The reason is that the massive hyperbole and marketing craziness about fancy new systems creates what I call the Star Trek phenomenon. People watch Captain Kirk talking to devices, transporting himself from danger, and traveling between far flung galaxies. Because a mobile phone performs some of the functions of the fictional communicator, it sure seems as if many other flashy sci-fi services should be available.

Well, this Star Trek phenomenon does help direct some research. But in terms of products that can be used in high risk environments, the sci-fi remains a fiction.

Believing and expecting are different from working with products that are limited by computational resources, expertise, and informed understanding of key factors.

Humans, particularly those who need money to pay the mortgage, ignore reality. The objective is to close a deal. When it comes to information retrieval and content processing, today’s systems are marginally better than those available five or ten years ago. In some cases, today’s systems are less useful.


The Guardian Explores HP Autonomy

August 16, 2014

I read “Hewlett-Packard Allegations: Autonomy Founder Mike Lynch Tries to Clear Name.” The British “real” newspaper focuses on Mike Lynch, the founder of Autonomy. I am convinced that Autonomy pitched the value of its company to a number of firms. I know that Hewlett Packard bought Autonomy. I assume that spending $11 billion was not a K Mart blue light special impulse purchase. I know that HP has had what the MBAs call “governance challenges.” These range from allegations of getting frisky with folks to management churn. I know that for me, the HP of electronic devices yielded to the HP of the ink cartridges.

Here’s a point I highlighted in the Guardian’s write up:

Meanwhile, lawyers on all sides are using legal privilege to sling mud. Lynch says it is not only his name that has been stained, but that of the British technology industry. Autonomy’s accounting and marketing methods had attracted criticism before the HP acquisition, but Lynch was also a poster child for the achievements of Cambridge’s Silicon Fen. The Autonomy affair casts a shadow, and a conclusion from the SFO is overdue.

I have a slightly different view of the dust up. Folks want to believe that information retrieval will generate another Google. Because of those expectations, executives whose expertise in search extends to running a Google search on a mobile device assume they know about content processing.

When buyers get excited about a purchase, some people buy Bugatti Veyrons and spring for gold iPhones. Others snap up search companies and expect the money to roll in like the oohs and aahs at the golf club when the Veyron rolls up.

Wrong. The dust up between HP and Autonomy is an illustration of what happens when folks without too much understanding of content processing’s complexities covet a home run. The impact does affect Mike Lynch, a Cambridge PhD and real live inventor.

The collateral damage is on the buyers of search companies who toss millions at a sector without understanding how difficult it is to create a search company that is not selling ads or living exclusively on Department of Defense largesse.

HP bought a company with a strong brand, customers, and technology that when properly resourced works. HP did not buy a Google scale money stream, a Palantir clinging to the US government, or a break even metasearch system.

The impact on the reputation of Autonomy professionals is significant. What does this dispute do to other search and content processing companies? Search is tough enough without having a megaton dispute played out in the datasphere.

HP did not have to buy Autonomy. Microsoft passed. Oracle passed. HP bought. HP had time and resources to dig through Autonomy. If it did not, then HP created its own problem. If it did, HP created its own problem. Autonomy, with 15 years of history, was looking for a buyer. My hunch is that HP was looking for a Google and bought a different business because HP convinced itself it could generate more money than Autonomy could. HP found out that it could not match Autonomy’s revenues. Whom does any self respecting MBA or lawyer blame? The other guy.

This hassle says much about HP. Sadly it affects other search and content processing companies as well.

Stephen E Arnold, August 16, 2014

More Knowledge Quotient Silliness: The Florida Gar of Search Marketing

August 1, 2014

I must be starved for intellectual Florida Gar. Nibble on this fish’s lateral line and get nauseous or dead. Knowledge quotient as a concept applied to search and retrieval is like a largish Florida gar. Maybe a Florida gar left too long in the sun.

[Image: a Florida gar]

Lookin’ yummy. Looks can be deceiving in fish and fishing for information. A happy quack to https://www.flmnh.ufl.edu/fish/Gallery/Descript/FloridaGar/FloridaGar.html

I ran a query on one of the search systems that I profile in my lectures for the police and intelligence community. With a bit of clicking, I unearthed some interesting uses of the phrase “knowledge quotient.”

What surprised me is that the phrase is a favorite of some educators. The use of the term as a synonym for plain old search seems to be one of those marketing moments of magic. A group of “experts” with degrees in home economics, early childhood education, or political science sit around and try to figure out how to sell a technology that is decades old. Sure, the search vendors make “improvements” with ever increasing speed. As costs rise and sales fail to keep pace, the search “experts” gobble a cinnamon latte and innovate.

In Dubai earlier this year, I saw a reference to a company engaged in human resource development. I think this means “body shop,” “lower cost labor,” or “mercenary registry,” but I could be off base. The company is called Knowledge Quotient FZ LLC. If one tries to search for the company, the task becomes onerous. Google is giving some love to the recent IDC study by an “expert” named Dave Schubmehl. As you may know, this is the “professional” who used my information and then sold it on Amazon until July 2014 without paying me for my semi-valuable name. For more on this remarkable approach to professional publishing, see http://wp.me/pf6p2-auy.

Also, in Dubai is a tutoring outfit called Knowledge Quotient which delivers home tutoring to the children of parents with disposable income. The company explains that it operates a place where learning makes sense.

Companies in India seem to be taken with the phrase “knowledge quotient.” Consider Chessy Knowledge Quotient Private Limited. In West Bengal, one can find one’s way to Mukherjee Road and engage the founders with regard to an “effective business solution.” See http://chessygroup.co.in. Please, do not confuse Chessy with KnowledgeQ, the company operating as Knowledge Quotient Education Services India Pvt Ltd. in Bangalore. See http://www.knowledgeq.org.

What’s the relationship between these companies operating as “knowledge quotient” vendors and search? For me, the appropriation of names and applying them to enterprise search contributes to the low esteem in which many search vendors are held.

Why is Autonomy IDOL such a problem for Hewlett Packard? This is a company that bought a mobile operating system and stepped away from it. This is a company that brought out a tablet and abandoned it in a few months. This is a company that wrote off billions and then blamed the seller for not explaining how the business worked. In short, Autonomy, which offers a suite of technology that performs as well as or better than any other search system, has become a bit of Florida gar in my view. Autonomy is not a fish. Autonomy is a search and content processing system. When properly configured and resourced, it works as well as any other late 1990s search system. I don’t need meaningless descriptions like “knowledge quotient” to understand that the “problem” with IDOL is little more than HP’s expectations exceeding what a decades old technology can deliver.

Why is Fast Search & Transfer an embarrassment to many who work in the search sector? Perhaps the reason has to do with the financial dealings of the company. In addition to fines and jail terms, the Fast Search system drifted from its roots in Web search into publishing, smart software, and automatic functions. The problem was that when customers did not pay, the company did not suck it up, fix the software, and renew its efforts to deliver effective search. Nah, Fast Search became associated with a quick sale to Microsoft, subsequent investigations by Norwegian law enforcement, and the culminating decision to ban one executive from working in search. Yep, that is a story that few want to analyze. Search marketers promised, and the technology did not deliver; it could not deliver, given Fast Search’s circumstances.

What about Excalibur/Convera? This company managed to sell advanced search and retrieval to Intel and the NBA. In a short time, both of these companies stepped away from Convera. The company then focused on a confection called “vertical search” based on indexing the Internet for customers who wanted narrow applications. Not even the financial stroking of Allen & Co. could save Convera. In an interesting twist, Fast Search purchased some of Convera’s assets in an effort to capture more US government business. Who digs into the story of Excalibur/Convera? Answer: No one.

What passes for analysis in enterprise search, information retrieval, and content processing is the substitution of baloney for fact-centric analysis. What is the reason that so many search vendors need multiple injections of capital to stay in business? My hunch is that companies like Antidot, Attivio, BA Insight, Coveo, Sinequa, and Palantir, among others, are in the business of raising money, spending it in an increasingly intense effort to generate sustainable revenue, and then going once again to capital markets for more money. When the funding sources dry up or just cut off the company, what happens to these firms? They fail. A few are rescued like Autonomy, Exalead, and Vivisimo. Others just vaporize as Delphes, Entopia, and Siderean did.

When I read a report from a mid tier consulting firm, I often react as if I had swallowed a chunk of Florida gar. An example in my search file is basic information about “The Knowledge Quotient: Unlocking the Hidden Value of Information.” You can buy this outstanding example of ahistorical analysis from IDC.com, the employer of Dave Schubmehl. (Yep, the same professional who used my research without bothering to issue me a contract or get permission from me to fish with my identity. My attorney, if I understand his mumbo jumbo, says this action was not identity theft, but Schubmehl’s actions between May 2012 and July 2014 strike me as untoward.)

Net net: I wonder if any of the companies using the phrase “knowledge quotient” are aware of brand encroachment. Probably not. That may be due to the low profile search enjoys in some geographic regions where business appears to be more healthy than in the US.

Can search marketing be compared to Florida gar? I want to think more about this.

Stephen E Arnold, August 1, 2014

The IHS Invention Machine: US 8,666,730

July 31, 2014

I am not an attorney. I consider this a positive. I am not a PhD with credentials as impressive as those of Vladimir Igorevich Arnold, my distant relative. He worked with Andrey Kolmogorov, who was able to hike in some bare essentials AND do math at the same time. Kolmogorov and Arnold—both interesting, if idiosyncratic, guys. Hiking in the wilderness with some students, anyone?

Now to the matter at hand. Last night I sat down with a copy of US 8,666,730 B2 (hereinafter I will use the shortcut 730 for this patent), filed in an early form in 2009, long before Information Handling Services wrote a check to the owners of The Invention Machine.

The title of the system and method is “Question Answering System and Method Based on Semantic Labeling of Text Documents and User Questions.” You can get your very own copy at www.uspto.gov. (Be sure to check out the search tips; otherwise, you might get a migraine dealing with the search system. I heard that technology was provided by a Canadian vendor, which seems oddly appropriate if true. The US government moves in elegant, sophisticated ways.)

Well, 730 contains some interesting information. If you want to ferret out more details, I suggest you track down a friendly patent attorney and work through the 23 page document word by word.

My analysis is that of a curious old person residing in rural Kentucky. My advisors are the old fellows who hang out at the local bistro, Chez Mine Drainage. You will want to keep this in mind as I comment on this invention by James Todhunter (Framingham, Mass.), Igor Sovpel (Minsk, Belarus), and Dzianis Pastanohau (Minsk, Belarus). Mr. Todhunter is described as “a seasoned innovator and inventor.” He was the Executive Vice President and Chief Technology Officer for Invention Machine. See http://bit.ly/1o8fmiJ, LinkedIn at (if you are lucky) http://linkd.in/1ACEhR0, and this YouTube video at http://bit.ly/1k94RMy. Igor Sovpel, co-inventor of 730, has racked up some interesting inventions. See http://bit.ly/1qrTvkL. Mr. Pastanohau was on the 730 team, and he also helped invent US 8,583,422 B2, “System and Method for Automatic Semantic Labeling of Natural Language Texts.”

The question answering invention is explained this way:

A question-answering system for searching exact answers in text documents provided in the electronic or digital form to questions formulated by user in the natural language is based on automatic semantic labeling of text documents and user questions. The system performs semantic labeling with the help of markers in terms of basic knowledge types, their components and attributes, in terms of question types from the predefined classifier for target words, and in terms of components of possible answers. A matching procedure makes use of mentioned types of semantic labels to determine exact answers to questions and present them to the user in the form of fragments of sentences or a newly synthesized phrase in the natural language. Users can independently add new types of questions to the system classifier and develop required linguistic patterns for the system linguistic knowledge base.

The idea, as I understand it, is that I can craft a question without worrying about special operators like AND or field labels like CC=. Presumably I can submit this type of question to a search system based on 730 and its related inventions like the automatic indexing in 422.

The references cited for this 2009 or earlier invention are impressive. I recognized Mr. Todhunter’s name, that of a person from Carnegie Mellon, and one of the wizards behind the tagging system in use at SAS, the statistics outfit loved by graduate students everywhere. There were also a number of references to Dr. Liz Liddy, Syracuse University. I associated her with the mid to late 1990s system marketed then as DR LINK (Document Retrieval Linguistic Knowledge). I have never been comfortable with the notion of “knowledge” because it seems to require that subject matter experts and other specialists update, edit, and perform various processes to keep the “knowledge” from degrading into a ball of statistical fuzz. When someone complains that a search system using Bayesian methods returns off point results, I look for the humans who are supposed to perform “training,” updates, remapping, and other synonyms for “fixing up the dictionaries.” You may have other experiences which I assume are positive and have garnered you rapid promotion for your search system competence. For me, maintaining knowledge bases usually leads to lots of hard work, unanticipated expenses, and the customary termination of a scapegoat responsible for the search system.

I am never sure how to interpret extensive listings of prior art. Since I am not qualified to figure out if a citation is germane, I will leave it to you to wade through the full page of US patents, foreign patent documents, and other publications. Who wants to question the work of the primary examiner and the Faegre Baker Daniels “attorney, agent, or firm” tackling 730?

On to the claims. The patent lists 28 claims. Many of them refer to operations within the world of what the inventors call expanded Subject-Action-Object or eSAO. The idea is that the system figures out parts of speech, looks up stuff in various knowledge bases and automatically generated indexes, and presents the answer to the user’s question. The lingo of the patent is sufficiently broad to allow the system to accommodate an automated query in a way that reminded me of Ramanathan Guha’s massive semantic system. I cover some of Dr. Guha’s work in my now out of print monograph, Google Version 2.0, published by one of the specialist publishers that perform Schubmehl-like maneuvers.
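To make the Subject-Action-Object idea concrete, here is a crude dependency-parse sketch of my own. It is emphatically not the patented eSAO method, just an illustration of the who-does-what-to-whom extraction the claims describe; it assumes spaCy and its small English model are installed.

```python
# A crude subject-action-object extraction (illustration only, not eSAO).
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def crude_sao(text):
    """Pull rough (subject, action, object) triples out of a sentence."""
    triples = []
    for token in nlp(text):
        if token.pos_ == "VERB":
            subjects = [c.text for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c.text for c in token.children if c.dep_ in ("dobj", "attr")]
            triples.extend((s, token.lemma_, o) for s in subjects for o in objects)
    return triples

print(crude_sao("The filter removes particulates from the exhaust stream."))
# Something like [('filter', 'remove', 'particulates')]
```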

My first pass through 730’s claims left me with a sense of déjà vu, which is obviously not correct. The invention has been awarded the status of a “patent”; therefore, the invention is novel. Nevertheless, these concepts pecked away at me with the repetitiveness of the woodpecker outside my window this morning:

  1. Automatic semantic labeling, which I interpreted as automatic indexing
  2. Natural language processing, which I understand suggests the user takes the time to write a question that is neither too broad nor too narrow. Like the children’s story, the query is “just right.”
  3. Assembly of bits and chunks of indexed documents into an answer. For me the idea is that the system does not generate a list of hits that are probably germane to the query. The Holy Grail of search is delivering to the often lazy, busy, or clueless user an answer. Google does this for mobile users by looking at a particular user’s behavior and the clusters to which the user belongs in the eyes of Google math, and just displaying the location of the pizza joint or the fact that a parking garage at the airport has an empty space.
  4. The system figures out parts of speech, various relationships, and who-does-what-to-whom. Parts of speech tagging has been around for a while, and it works as long as the text processed is not in the argot of a specialist group plotting some activity in a favela in Rio.
  5. The system performs the “e” function. I interpreted the “e” to mean a variant of synonym expansion. DR LINK, for example, was able in 1998 to process the phrase white house and display content relevant to presidential activities. I don’t recall how this expansion went from bound phrase to presidential to Clinton. I do recall that DR LINK had what might be characterized as a healthy appetite for computing resources to perform its expansions during indexing and during query processing. This stuff is symmetrical: what happens to source content has to happen during query processing in some way. (A toy version of this symmetry appears in the sketch after this list.)
  6. Relevance ranking takes place. Various methods are in use by search and content processing vendors. Some are based on statistical methods. Others are based on numerical recipes that the developer knows can be computed within the limits of the computer systems available today. No P=NP, please. This is search.
  7. There are linguistic patterns. When I read about linguistic patterns, I recall the wild and crazy linguistic methods of Delphes, for example. Linguistics are in demand today, and specialist vendors like Bitext in Madrid, Spain, are in demand. English, Chinese, and Russian are widely used languages. But darned useful information is available in other languages. Many of these are kept fresh via neologisms and slang. I often asked my intelligence community audiences, “What does teddy bear mean?” The answer is NOT a child’s toy. The clue is the price tag suggested on sites like eBay auctions.
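Here is a toy version of the symmetry point in item 5. The synonym table is invented for illustration; the point is that whatever expansion touches documents at index time must also touch queries, or the two sides stop matching.

```python
# Toy index-time / query-time synonym expansion. The table is invented;
# real systems maintain much larger, hand-tended knowledge bases.
SYNONYMS = {
    "white house": {"president", "executive branch"},
    "president": {"white house", "executive branch"},
}

def expand(terms):
    """Apply the same expansion whether the terms come from a document or a query."""
    expanded = set(terms)
    for term in terms:
        expanded |= SYNONYMS.get(term, set())
    return expanded

# Index time: store each document with its expanded term set.
index = {
    "doc1": expand({"white house", "statement"}),
    "doc2": expand({"budget", "congress"}),
}

# Query time: the query gets the identical treatment, so "president" finds doc1.
query = expand({"president"})
print([doc for doc, terms in index.items() if query & terms])  # ['doc1']
```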

The interesting angle in 730 is the causal relationship. When applied to processes in the knowledge bases, I can see how a group of patents can be searched for a process. The result list could display ways to accomplish a task. NOTting out patents for which a royalty is required leaves the searcher with systems and methods that can be used, ideally without any hassles from attorneys or licensing agents.

Several questions popped into my mind as I reviewed the claims. Let me highlight three of these:

First, consider computational load when large numbers of new documents and changed content have to be processed. The indexes have to be updated. For small domains of content like 50,000 technical reports created by an engineering company, I think the system will zip along like a 2014 Volkswagen Golf.

[Image: Figure 1 from US 8,666,730]

When terabytes of content arrive every minute, the functions set forth in the block diagram for 730 have to be appropriately resourced. (For me, “appropriately resourced” means lots of bandwidth, storage, and computational horsepower.)

Second, the knowledge base, as I thought about it when I first read the patent, has to be kept in tip-top shape. For scientific, technical, and medical content, this is a more manageable task. However, when processing intercepts in slang filled Pashto, there is a bit more work required. In general, high volumes of non technical lingo become a bottleneck. The bottleneck can be resolved, but none of the solutions are likely to make a budget conscious senior manager enjoy his lunch. In fact, the problem of processing large flows of textual content is acute. Shortcuts are put in place, and few of those in the know understand the impact of trimming on the results of a query. Don’t ask. Don’t tell. Good advice when digging into certain types of content processing systems.

Third, the reference to databases raises this question, “What is the amount of storage required to reduce index latency to less than 10 seconds for new and changed content?” Another question, “What is the gap that exists for a user asking a mission critical question between new and changed content and the indexes against which the mission critical query is passed?” This is not system response time, which as I recall for DR LINK era systems was measured in minutes. The user sends a query to the system. The new or changed information is not yet in the index. The user makes a decision (big or small, significant or insignificant) based on incomplete, incorrect, or stale information. No big problem if one is researching a competitor’s new product. Big problem when trying to figure out what missile capability exists now in a region of conflict.

My interest is enterprise search. IHS, a professional publishing company that is in the business of licensing access to its for fee data, seems to be moving into the enterprise search market. (See http://bit.ly/1o4FyL3.) My researchers (an unreliable bunch of goslings) and I will be monitoring the success of IHS. Questions of interest to me include:

  1. What is the fully loaded first year cost of the IHS enterprise search solution? For on premises installations? For cloud based deployment? For content acquisition? For optimization? For training?
  2. How will the IHS system handle flows of real time content into its content processing system? What is the load time for 100 terabytes of text content with an average document size of 50 Kb? What happens to attachments, images, engineering drawings, and videos embedded in the stream as native files or as links to external servers?
  3. What is the response time for a user’s query? How does the user modify a query in a manner so that result sets are brought more in line with what the user thought he was requesting?
  4. How do answers make use of visual outputs which are becoming increasingly popular in search systems from Palantir, Recorded Future, and similar providers?
  5. How easy is it to scale content processing and index refreshing to keep pace with the doubling of content every six to eight weeks that is becoming increasingly commonplace for industrial strength enterprise search systems? How much reengineering is required for log scale jumps in content flows and user queries?

Take a look at 730 and others in the Invention Machine (IHS) patent family. My hunch is that if IHS is looking for a big bucks return from enterprise search sales, IHS may find that its narrow margins will be subjected to increased stress. Enterprise search has never been, nor is it now, a license to print money. When a search system does pump out hundreds of millions in revenue, it seems that some folks are skeptical. Autonomy and Fast Search & Transfer are companies with some useful lessons for those who want a digital Klondike.

Is New Math Really New Yet?

July 21, 2014

I read “Scientific Data Has Become So Complex, We Have to Invent New Math to Deal With It.” My hunch is that this article will become Google spider food with a protein punch.

In my lectures for the police and intelligence community, I review research findings from journals and my work that reveal a little appreciated factoid; to wit: The majority of today’s content processing systems use a fairly narrow suite of numerical recipes that have been embraced for decades by vendors, scientists, mathematicians, and entrepreneurs. Due to computational constraints and limitations of even the slickest of today’s modern computers, processing certain data sets is a very difficult and expensive job in terms of human effort, programming, and machine time.

Thus, the similarity among systems comes from several factors.

  1. The familiar is preferred to the onerous task of finding a slick new way to compute k-means or perform one of the other go-to functions in information processing
  2. Systems have to deliver certain types of functions in order to make it easy for a procurement team or venture oriented investor to ask, “Does your system cluster?” Answer: Yes. Venture oriented investor responds, “Check.” The procedure accounts for the sameness of the feature lists between Palantir, Recorded Future, and similar systems. When the similarities make companies nervous, litigation results. Example: Palantir versus i2 Ltd. (now a unit of IBM).
  3. Alternative methods of addressing tasks in content processing exist, but they are tough to implement in today’s computing systems. The technical reason for the reluctance to use some fancy math from my uncle Vladimir Igorevich Arnold’s mentor Andrey Kolmogorov is that in many applications the computing system cannot complete the computation. The buzzword for this is P=NP? Here’s MIT’s 2009 explanation.
  4. Savvy researchers have to find a way to get from A to B that works within the constraints of time, confidence level required, and funding.

The Wired article identifies other hurdles; for example, the need for constant updating. A system might be able to compute a solution using fancy math on a right sized data set. But toss in constantly updating information, and the computing resources often just keep getting hungrier for more storage, bandwidth, and computational power. Then there is the matter of movement: the bigger the data, the more of it the computing system has to shove around. As fast as an iPad or modern Dell notebook seems, the friction adds latency to a system. For some analyses, delays can have significant repercussions. Most Big Data systems are not the fleetest of foot.
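One common compromise for the constant-update hurdle (my example, not one from the article) is to give up the exact recompute and settle for an incremental approximation. Scikit-learn’s MiniBatchKMeans, for instance, can absorb fresh batches of data without re-clustering everything from scratch:

```python
# Coping with constantly arriving data: update cluster centroids batch by
# batch instead of re-running full k-means on the entire, growing data set.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
model = MiniBatchKMeans(n_clusters=5, random_state=0)

for day in range(10):                      # simulate ten daily arrivals
    batch = rng.normal(size=(1000, 50))    # 1,000 new items, 50 features each
    model.partial_fit(batch)               # incremental update, no full recompute

print(model.cluster_centers_.shape)        # (5, 50)
```

The price is approximation: the centroids drift toward whatever the recent batches look like, which is precisely the trade-off between exact methods and methods that today’s machines can actually afford.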

The Wired article explains how fancy math folks cope with these challenges:

Vespignani uses a wide range of mathematical tools and techniques to make sense of his data, including text recognition. He sifts through millions of tweets looking for the most relevant words to whatever system he is trying to model. DeDeo adopted a similar approach for the Old Bailey archives project. His solution was to reduce his initial data set of 100,000 words by grouping them into 1,000 categories, using key words and their synonyms. “Now you’ve turned the trial into a point in a 1,000-dimensional space that tells you how much the trial is about friendship, or trust, or clothing,” he explained.

Wired labels this approach as “piecemeal.”
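A stripped-down version of that reduction, again my own sketch rather than DeDeo’s code: collapse a large vocabulary into a few hand-built categories and represent each document as counts over those categories instead of over raw words.

```python
# Collapse a big vocabulary into a few hand-built categories (a miniature
# version of the 100,000-words-to-1,000-categories reduction described above).
from collections import Counter

CATEGORIES = {           # invented category -> keywords mapping
    "friendship": {"friend", "companion", "acquaintance"},
    "trust": {"trust", "honest", "faith"},
    "clothing": {"coat", "gown", "handkerchief"},
}

def to_category_vector(text):
    """Turn raw text into counts over categories, discarding everything else."""
    words = text.lower().split()
    counts = Counter()
    for word in words:
        for category, keywords in CATEGORIES.items():
            if word in keywords:
                counts[category] += 1
    return [counts[c] for c in CATEGORIES]

print(to_category_vector("The prisoner stole a coat and a gown from his friend"))
# -> [1, 0, 2]  (one friendship word, no trust words, two clothing words)
```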

The fix? Wired reports:

the big data equivalent of a Newtonian revolution, on par with the 17th century invention of calculus, which he [Yalie mathematician Ronald Coifman] believes is already underway.

Topological analyses and sparsity may offer a path forward.

The kicker in the Wired story is the use of the phrase “tractable computational techniques.” The notion of “new math” is an appealing one.

For the near future, the focus will be on optimization of methods that can be computed on today’s gizmos. One widely used method in Autonomy, Recommind, and many other systems originates with the Reverend Thomas Bayes, who died in 1761. My relative died in 2010. I understand there were some promising methods developed after Kolmogorov died in 1987.

Inventing new math is underway. The question is, “When will computing systems become available to use these methods without severe sampling limitations?” In the meantime, Big Data keeps rolling in, possibly mis-analyzed and contributing to decisions with unacceptable levels of risk.

Stephen E Arnold, July 21, 2014
