Free Stopword List
January 29, 2014
A happy quack to the reader who alerted me to www.libertypages.com. The site provides a downloadable list of stopwords. You can find the link at http://bit.ly/1fnubsY. It appears that this original list was generated by Dr. Gerald Salton. A quick scan of the list suggests that some updating may be needed. The Liberty Pages Web site redirects to Lextek, developers of Onix. I have a profile of the Onix system. Once the Autonomy IDOL and TeraText profiles are on the Xenky site, I will hunt around for my Lextek analysis. The company is still in business, operating out of a home in Provo, Utah.
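Stopword lists like this one are usually applied as a simple set-membership filter over a token stream before indexing. A minimal sketch in Python, assuming a plain-text file with one stopword per line (the file path and the tiny inline list are illustrative, not the Salton list itself):

```python
# Minimal stopword filter: load a one-word-per-line list, then drop
# those words from a token stream before indexing.
def load_stopwords(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def remove_stopwords(tokens, stopwords):
    return [t for t in tokens if t.lower() not in stopwords]

if __name__ == "__main__":
    # Stand-in for the downloaded list; a real list runs to hundreds of words.
    stopwords = {"the", "a", "of", "and"}
    tokens = "the cost of the Watson system".split()
    print(remove_stopwords(tokens, stopwords))  # ['cost', 'Watson', 'system']
```

This is also where the "updating may be needed" point bites: a 1970s-era list will not reflect today's Web vocabulary, so the loaded set typically needs manual additions.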
Stephen E Arnold, January 29, 2014
IBM Wrestling with Watson
January 8, 2014
“IBM Struggles to Turn Watson into Big Business” warrants a USA Today treatment. You can find the story in the hard copy of the newspaper on pages A1 and A2. I saw a link to the item online at http://on.wsj.com/1iShfOG but you may have to pay to read it or chase down a Penguin-friendly instance of the article.
The main point is that IBM targeted $10 billion in Watson revenue by 2023. Watson has generated less than $100 million in revenue, I presume, since the system “won” the Jeopardy game show.
The Wall Street Journal article is interesting because it contains a number of semantic signals, for example:
- The use of the phrase “in a ditch” in reference to a project at the University of Texas M.D. Anderson Cancer Center
- The statement “Watson is having more trouble solving real-life problems”
- The revelation that “Watson doesn’t work with standard hardware”
- An allegedly accurate quote from a client that says “Watson initially took too long to learn”
- The assertion that “IBM reworked Watson’s training regimen”
- The sprinkling of “coulds” and “ifs”
I came away from the story with a sense of déjà vu. I realized that over the last 25 years I have heard similar information about other “smart” search systems. The themes run through time the way a bituminous coal seam threads through the crust of the earth. When one of these seams catches fire, there are few inexpensive and quick ways to put out the fire. Applied to Watson, my hunch is that the cost of getting Watson to generate $10 billion in revenue is going to be a very big number.
The Wall Street Journal story references the need for humans to learn and then to train Watson about the topic. When Watson goes off track, more humans have to correct Watson. I want to point out that training a smart system on a specific corpus of content is tricky. Algorithms can be quite sensitive to small errors in initial settings. Over time, the algorithms do their thing and wander. This translates to humans who have to monitor the smart system to make sure it does not output information in which it has generated confidence scores that are wrong or undifferentiated. The Wall Street Journal nudges this state of affairs in this passage:
In a recent visit to his office, [a Sloan Kettering oncologist] pulled out an iPad and showed a screen from Watson that listed three potential treatments. Watson was less than 32% confident that any of them were [sic] correct.
Then the Wall Street Journal reported that tweaking Watson was tough, saying:
The project initially ran awry because IBM’s engineers and Anderson’s doctors didn’t understand each other.
No surprise, but the fix just adds to the costs of the system. The article revealed:
IBM developers now meet with doctors several times a week.
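The “less than 32% confident” detail above illustrates a general problem with smart systems: when a classifier spreads probability thinly across its options, the top score carries little information, and a human must step in. A minimal sketch of such a triage check, where the threshold values are my own illustrative assumptions, not anything IBM documents:

```python
# Flag predictions whose top confidence score is too low, or too close to
# the runner-up, to act on without human review. The thresholds are
# illustrative assumptions, not values from any vendor's system.
def needs_human_review(scores, min_confidence=0.5, min_margin=0.1):
    ranked = sorted(scores.values(), reverse=True)
    top = ranked[0]
    margin = top - ranked[1] if len(ranked) > 1 else top
    return top < min_confidence or margin < min_margin

# Three options, none above 32% confidence, as in the article's anecdote.
watson_like = {"treatment_a": 0.31, "treatment_b": 0.29, "treatment_c": 0.28}
print(needs_human_review(watson_like))  # True: low and undifferentiated
```

The point of the sketch is that “undifferentiated” scores are as useless as low ones: three treatments at 31, 29, and 28 percent tell the oncologist almost nothing.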
Why is this Watson write up intriguing to me? There are four reasons:
First, the Wall Street Journal makes clear that dreams about dollars from search and content processing are easy to inflate and tough to deliver. Most search vendors and their stakeholders discover the difference between marketing hyperbole and reality.
Second, the Watson system is essentially dependent on human involvement. The objective of certain types of smart software is to reduce the need for human involvement. Watching Star Trek and Spock is not the same as delivering advanced systems that work and are affordable.
Third, the revenue generated by Watson is actually pretty good. Endeca hit $100 million between 1998 and 2011 when it was acquired by Oracle. Autonomy achieved $800 million between 1996 and 2011 when it was purchased by Hewlett Packard. Watson has been available for a couple of years. The problem is that the goal is, it appears, out of reach even for a company with IBM’s need for a hot new product and the resources to sell almost anything to large organizations.
Fourth, Watson is walking down the same path that STAIRS III, an early IBM search system, followed. IBM embraced open source to help reduce the cost of delivering basic search. Now IBM is finding that the value-adds are more difficult than keyword matching and Boolean-centric information retrieval. When a company does not learn from its own prior experiences in content processing, the voyage of discovery becomes more risky.
Net net: IBM has its hands full. I am confident that an azure chip consultant and a couple of 20 somethings can fix up Watson in a nonce. But if remediation is not possible, IBM may vie with Hewlett Packard as the pre-eminent example of the perils of the search and content processing business.
Stephen E Arnold, January 8, 2014
HP and Its New IDOL Categorizer
January 1, 2014
I read “Analytics for Human Information: Optimize Information Categorization with HP IDOL.” I noticed that HP did not cite the original 1998 categorization technology in its write-up. From my point of view, news about something developed 15 years ago and referenced in subsequent Autonomy collateral is not fresh. In fact, presenting the categorizer as something “amazing” suggests a superficial grasp of the history of IDOL technology, which dates from the late 1980s and early 1990s. It is fascinating how some “experts” in content processing reinvent the wheel and display their intellectual process in such an amusing way. Is it possible to fool oneself and others? Remarkable.
Update, January 1, 2014, 11 am Eastern:
Hewlett Packard is publicizing IDOL’s automatic categorization capability. As a point of fact, this function has been available for 15 years. Here’s a description from a 2001 Autonomy IDOL Server Technical Brief:
IDOL server can automatically categorize data with no requirement for manual input whatsoever. The flexibility of Autonomy’s Categorization feature allows you to precisely derive categories using concepts found within unstructured text. This ensures that all data is classified in the correct context with the utmost accuracy. Autonomy’s Categorization feature is a completely scalable solution capable of handling
high volumes of information with extreme accuracy and total consistency. Rather than relying on rigid rule based category definitions such as Legacy Keyword and Boolean Operators, Autonomy’s infrastructure relies on an elegant pattern matching process based on concepts to categorize documents and automatically insert tag data sets, route content or alert users to highly relevant information pertinent to the users profile. This highly efficient process means that Autonomy is able to categorize upwards of four million documents in 24 hours per CPU instance, that’s approximately one document, every 25 milliseconds. Autonomy hooks into virtually all repositories and data formats respecting all security and access entitlements, delivering complete reliability. IDOL server accepts a category or piece of content and returns categories ranked by conceptual similarity. This determines for which categories the piece of content is most appropriate, so that the piece of content can subsequently be tagged, routed or filed accordingly.
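The quoted throughput claim is easy to sanity-check: four million documents in 24 hours on one CPU works out to about 21.6 milliseconds per document, so the brief’s “approximately one document every 25 milliseconds” is in the right ballpark. The arithmetic:

```python
# Sanity check on the quoted IDOL throughput figure:
# 4 million documents per 24 hours per CPU instance.
docs_per_day = 4_000_000
ms_per_day = 24 * 60 * 60 * 1000  # 86,400,000 ms in a day

ms_per_doc = ms_per_day / docs_per_day
print(round(ms_per_doc, 1))  # 21.6 ms per document
```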
Stephen E Arnold, January 1, 2014
A Non Search Person Explains Why Search Is a Lost Cause
December 16, 2013
The author of “2013: the Year ‘the Stream’ Crested” is focused on tapping into flows of data. Twitter and real time “Big Data” streams are the subtext for the essay. I liked the analysis. In one 2,500 word write up, the severe weaknesses of enterprise and Web search systems are exposed.
The main point of the article is that “the stream”—that is, flows of information and data—is what people want. The flow is of sufficient volume that making sense of it is difficult. Therefore, an opportunity exists for outfits like The Atlantic to provide curation, perspective, and editorial filtering. The write up’s code for this higher-value type of content process is “the stock.”
The article asserts:
This is the strange circumstance that obtained in 2013, given the volume of the stream. Regular Internet users only had three options: 1) be overwhelmed 2) hire a computer to deploy its logic to help sort things 3) get out of the water.
The take away for me is that the article makes clear that search and retrieval just don’t work. Something “new” is needed. Perhaps this frustration with search is the trigger behind the interest in “artificial intelligence” and “machine learning”? Predictive analytics may have a shot at solving the problem of finding and identifying needed information, but from what I have seen, there is a lot of talk about fancy math and little evidence that it works at low cost in a manner that makes sense to the average person. Data scientists are not a dime a dozen. Average folks are.
Will the search and content processing vendors step forward and provide concrete facts that show a particular system can solve a Big Data problem for Everyman and Everywoman? We know Google is shifting to an approach to search that yields revenue. Money, not precision and recall, is increasingly important. The search and content vendors who toss around the word “all” have not been able to deliver unless the content corpus is tightly defined and constrained.
Isn’t it obvious that processing infinite flows and changes to “old” content is likely to cost a lot of money? Google, Bing, and Yandex search are not particularly “good.” Each is becoming a system designed to support other functions. In fact, looking for information that is only five or six years “old” is an exercise in frustration. Where has that document “gone”? What other data are not in the index? The vendors are not talking.
In the enterprise, the problem is almost as hopeless. Vendors invent new words to describe a function that seems to convey high value. Do you remember this catchphrase: “One step to ROI”? How do you think that company performed? The founders were able to sell the company and some of the technology lives on today, but the limitations of the system remain painfully evident.
Search and retrieval is complex, expensive to implement in an effective manner, and stuck in a rut. Giving away a search system seems to reduce costs? But are license fees the major expense? Embracing fancy math seems to deliver high value answers? But are the outputs accurate? Users just assume these systems work.
Kudos to The Atlantic for helping to make clear that in today’s data world, something new is needed. Changing the words used to describe such out-of-favor functions as “editorial policy,” controlled terms, scheduled updates, and the like is more popular than innovation.
Stephen E Arnold, December 16, 2013
Business Intelligence: Free Pressures For Fee Solutions
December 14, 2013
I read “KB Crawl sort la tête de l’eau” (“KB Crawl gets its head above water”), published by 01Business. The hook for the article is that KB Crawl, a company harvesting Internet content for business intelligence analyses, has emerged from bankruptcy. Good news for KB Crawl, whose parent company is reported to be KB Intelligence.
The write up contained related interesting information.
First, the article points out that business intelligence services like KB Crawl are perceived as costs, not revenue producers. If this is accurate, the same problem may be holding back once promising US vendors like Digital Reasoning and Ikanow, among others.
Second, the article seems to suggest that for-fee business intelligence services are in direct competition with free services like Google. Although Google’s focus on ads continues to have an impact on the relevance of the Google results, users may be comfortable with information provided by free services. Will the same preference for free impact the US business intelligence sector?
Third, the article identifies a vendor (Ixxo) as facing some financial headwinds, writing:
Other vendors in the sector are experiencing difficulties, such as Ixxo, publisher of the Squido solution.
But the most useful information in the story is the list of companies that compete with KB Crawl. Some of the firms are:
- AMI Software. www.amisw.com. This company has roots in enterprise search and touts 1500 customers
- Data Observer. www.data-observer.com. The company is a tie up between Asapspot and Data-Deliver. The firm offers “an all-encompassing Internet monitoring and e-reputation services company.”
- Digimind. www.digimind.com. The firm makes sense of social media.
- Eplica. A possible reference to a San Diego employment services firm.
- iScop. Unknown.
- Ixxo. www.ixxo.fr. The firm “develops innovative software applications to boost business responsiveness when faced with unstructured data.”
- Pikko. www.pikko-software.com. A visualization company.
- Qwam. www.qwamci.com. Another “content intelligence” company.
- SindUp. www.sindup.fr. The company offers a monitoring platform for strategic and e-reputation information.
- Spotter. www.spotter.com. A company that provides the “power to understand.”
- Synthesio. www.synthesio.com. The company says, “We help brands and agencies find valuable social insights to drive real business value.”
- TrendyBuzz. www.trendybuzz.com. The company lets a client measure “Internet visibility units.”
My view is that 01Business may be identifying a fundamental problem in the for-fee business intelligence, open source harvesting, and competitive intelligence sector.
Information about business and competitive intelligence that I see in my TRAX Overflight service is mostly of the “power of positive thinking” variety. Companies like Palantir capture attention because the firms are able to raise astounding amounts of funding. Less visible are the financial pressures on the companies trying to generate revenue with systems aimed at commercial enterprises.
If the 01Business article is on the money, what US vendors are likely to have their heads under water in 2014? Use the comments section of this blog to identify the stragglers in the North American market.
Stephen E Arnold, December 14, 2013
Semantria and Diffbot: Clever Way to Forge a Tie Up
December 12, 2013
Short honk. I came across an interesting marketing concept in “Diffbot and Semantria Join to Find and Parse the Important Text on the ‘Net (Exclusive).”
Semantria (a company that offers sentiment analysis as a service) participated in a hackathon in San Francisco. The article explains:
To make the Semantria service work quickly, even for text-mining novices, Rogynskyy’s team decided to build a plugin for Microsoft’s popular Excel spreadsheet program. The data in a spreadsheet goes to the cloud for processing, and Semantria sends back analysis in Excel format.
Semantria sponsored a prize for the best app. Diffbot won:
A Diffbot developer built a simple plugin for Google’s Chrome browser that changes the background color of messages on Facebook and Twitter based on sentiment — red for negative, green for positive. The concept won a prize from Semantria, Rogynskyy said. A Diffbot executive was on hand at the hackathon, and Rogynskyy started talking with him about how the two companies could work together.
I like the “sponsor”, “winner” and “team up” approach. The pay off, according to the article, is “While Semantria and Diffbot technologies continue to be available separately, they can now be used together.”
Sentiment analysis is one of the search submarkets that caught fire and then, based on the churning at some firms like Attensity, may be losing some momentum. Marketing innovation may be a goal for other firms offering this functionality in 2014.
Stephen E Arnold, December 12, 2013
Quote to Note: NLP and Recipes for Success and Failure
December 11, 2013
I read “Natural language Processing in the Kitchen.” The post was particularly relevant because I had worked through “The Main Trick in Machine Learning.” The essay does an excellent job of explaining coefficients (what I call, for ease of recall, “thresholds”). The idea is that machine learning requires a human to make certain judgments. Autonomy IDOL uses Bayesian methods, and the company has for many years urged licensees to “train” the IDOL system. Not only that, successful Bayesian systems, like a young child, have to be prodded or retrained. How much and how often depends on the child. For Bayesian-like systems, the “how often” and “how much” vary by the licensees’ content contexts.
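The “train and retrain” point applies to any Bayesian-style classifier: its output is only as good as the labeled examples a human supplies, and drift in the content means the counts go stale. A toy sketch of the idea, using word counts with add-one smoothing; the categories and training sentences are invented for illustration, and IDOL’s actual model is not public:

```python
import math
from collections import Counter

# Toy Bayesian text classifier: per-class word counts with add-one
# (Laplace) smoothing. Training data here is invented for illustration.
class TinyBayes:
    def __init__(self):
        self.counts = {}   # label -> Counter of words seen for that label
        self.totals = {}   # label -> total word count for that label

    def train(self, label, text):
        words = text.lower().split()
        self.counts.setdefault(label, Counter()).update(words)
        self.totals[label] = self.totals.get(label, 0) + len(words)

    def score(self, text):
        vocab = {w for c in self.counts.values() for w in c}
        out = {}
        for label, c in self.counts.items():
            logp = 0.0
            for w in text.lower().split():
                logp += math.log((c[w] + 1) / (self.totals[label] + len(vocab)))
            out[label] = logp
        return out

    def classify(self, text):
        scores = self.score(text)
        return max(scores, key=scores.get)

clf = TinyBayes()
clf.train("recipes", "whisk the eggs and fold in the flour")
clf.train("search", "index the documents and rank the results")
print(clf.classify("fold the flour"))  # recipes
```

The human judgment lives in the training calls: feed the model unrepresentative examples, or stop retraining as the content shifts, and the “confidence” it reports quietly degrades, which is exactly the monitoring burden the two write-ups describe.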
Now back to the Los Angeles Times’ excellent article about indexing and classifying a small set of recipes. Here’s the quote to note:
Computers can really only do so much.
When one jots down the programming and tuning work required to index recipes, keep in mind the “The Main Trick in Machine Learning.” There are three important lessons I draw from the boundary between these two write ups:
- Smart software requires programming and fiddling. At the present time (December 2013), this reality is as it has been for the last 50 years, maybe more.
- The humans fiddling with or setting up the content processing system have to be pretty darned clever. The notion of “user friendliness” is thoroughly dispelled by these two articles. Flashy graphics and marketers’ cooing are not going to cut the mustard or the sirloin steak.
- A properly set up system, processing filtered information with some human intervention, can hit 98 percent accuracy. The main point is that relevance is a result of humans, software, and consistent, on-point content.
How many enterprise search and content processing vendors explain that a failure to put appropriate resources toward the search or content processing implementation guarantees some interesting issues? Among them: systems will routinely deliver results that are not germane to the user’s query.
The roots of dissatisfaction with incumbent search and retrieval systems are not the systems themselves. In my opinion, most are quite similar, differing only in relatively minor details. (For examples of the similarity, review the reports at Xenky’s Vendor Profiles page.)
How many vendors have been excoriated because their customers failed to provide the cash, time, and support necessary to deliver a high-performance system? My hunch is that the vendors are held responsible for failures that are predestined by licensees’ desire to get the best deal possible and believe that magic just happens without the difficult, human-centric work that is absolutely essential for success.
Stephen E Arnold, December 11, 2013
Palantir: What Is the Main Business of the Company?
December 11, 2013
I read about Palantir and its successful funding campaign in “Palantir’s Latest Round Valuing It at $9B Swells to $107.8M in New Funding.” Compared to the funding for ordinary search and content processing companies, Palantir is obviously able to attract investors better than most of the other companies that make sense out of data.
If you run a query for “Palantir” on Beyond Search, you will get links to articles about the company’s previous funding and to a couple of stories about the company’s interaction with IBM i2 related to an allegation about Palantir’s business methods.
Image from the Louisiana Lottery.
I find Palantir interesting for three reasons.
First, it is able to generate significant buzz in police and intelligence entities in a number of countries. Based on what I have heard at conferences, the Palantir visualizations knock the socks off highly placed officials who want killer graphics in their personal slide presentations.
Second, the company has been nosing into certain financial markets. The idea is that the Palantir methods will give some of the investment outfits a better way to figure out what’s going up and what’s going down. The visuals are good, I have heard, but the Palantir analytics are perceived, if my sources are accurate, as better than those from companies like IBM SPSS, Digital Reasoning, Recorded Future, and similar analytics firms.
Third, the company may have moved into a new business sector. The firm’s success in fund raising raises the question, “Is Palantir becoming a vehicle to raise more and more cash?”
Palantir is worth monitoring. The visualizations and the math are not really a secret sauce. The magic ingredient at Palantir may be its ability to sell its upside to investors. Is Palantir introducing a new approach to search and content processing? The main business of the company could be raising more and more money.
Stephen E Arnold, December 11, 2013
Exclusive Silobreaker Interview: Mats Bjore, Silobreaker
November 25, 2013
With Google becoming more difficult to use, many professionals need a way to locate, filter, and obtain high value information that works. Silobreaker is an online service and system that delivers actionable information.
The co-founder of Silobreaker said in an exclusive interview for Search Wizards Speak:
I learned that in most of the organizations, information was locked in separate silos. The information in those silos was usually kept under close control by the silo manager. My insight was that if software could make available to employees the information in different silos, the organization would reap an enormous gain in productivity. So the idea was to “break” down the information and knowledge silos that exist within companies, organizations and mindsets.
And knock down barriers the system does. Silobreaker’s popularity is surging. The most enthusiastic supporters of the system come from the intelligence community, law enforcement, analysts, and business intelligence professionals. A user’s query retrieves up-to-the-minute information from Web sources, commercial services, and open source content. The results are available as a series of summaries, full text documents, relationship maps among entities, and other report formats. The user does not have to figure out which item is an advertisement. The Silobreaker system delivers muscle, not fatty tissue.
Mr. Bjore, a former intelligence officer, adds:
Silobreaker is an Internet and a technology company that offers products and services which aggregate, analyze, contextualize and bring meaning to the ever-increasing amount of digital information.
Underscoring the difference between Silobreaker and other online systems, Mr. Bjore points out:
What sets us apart is not only the Silobreaker technology and our commitment to constant innovation. Silobreaker embodies the long term and active experience of having a team of users and developers who can understand the end user environment and challenges. Also, I want to emphasize that our technology is one integrated technology that combines access, content, and actionable outputs.
The ArnoldIT team uses Silobreaker in our intelligence-related work. We include a profile of the system in our lectures about next-generation information gathering and processing systems.
You can get more information about Silobreaker at www.silobreaker.com. A 2008 interview with Mr. Bjore is located on the Search Wizards Speak site at http://goo.gl/f7niAH.
Stephen E Arnold, November 25, 2013
Search Boundaries. Explode.
November 14, 2013
I read a quite remarkable news release. The title? Grab your blood pressure medicine because you may “explode.”
Expertmaker: Artificial Intelligence (AI) Explodes the Boundaries of Enterprise Search
I expected a sign to warn me off. Was it safe to read about such a potentially powerful technology?
Expertmaker Info
Straightaway I poked through my information about search vendors. I did not recall the name “Expertmaker.” I think it is catchy, echoing the Italian outfit Expert System.
Expertmaker is located at www.expertmaker.com. The company offers the following products:
- Consulting
- Products that are “an online solution and/or mobile solution.”
- Big Data Anti Churn. I am not exactly sure what this means, and I did not want to contact Expertmaker to learn more.
- Flow, a virtual assistant platform.
The technology is positioned as “artificial intelligence.” The description of the company’s technology is located at this link. I scanned the information on the Expertmaker Web site. I noted some points that struck me as interesting, particularly in relation to the news release that triggered my interest. (Who says news releases are irrelevant? Expertmaker has my attention. I suppose that is a good thing, but there are other possible viewpoints too. My attention can be annoying, but, hey, this is a free blog about going “beyond search.”)
First, the label “artificial intelligence” is visible in the description. The AI angle is “machine learning and evolutionary computing.” The point is that the system performs functions that would be difficult using an old fashioned database like DB2, Oracle, or SQL Server. (I assume that the owners of these traditional databases will have some counter arguments to offer.)
Second, the system makes it possible to build search-based applications. (Dassault Exalead has been beating this tom tom for six or seven years. I presume that the Cloud 360 technology is relegated to the used car lot because Expertmaker has rolled into the search dealership.)
Third, a development environment is available, including a “Desktop Artificial Intelligence Toolkit.” There are “solvers.” There are various AI technologies. There is knowledge discovery. There is a “published solution.” And there is this component:
Semantic, value based, meta-data structures allow high precision understanding and value based searches. With the solution you can create your own semantic structures for handling complex solutions.
Okay, this is pretty standard fare for search start ups. I am not sure what the system does, but I looked at examples, including screenshots.