April 25, 2015
Need patent information? Lots of folks believed that making sense of the public documents available from the USPTO was the road to riches. Before I kicked back to enjoy the sylvan life in rural Kentucky, I did some work on Fancy Dan patent systems. There was a brush with the IBM Intelligent Patent Miner system. For those who do not recall their search history, you can find a chunk of information in “Information Mining with the IBM Intelligent Miner Family.” Keep in mind that the write up is about 20 years old. (Please note that the LexisNexis system discussed below uses many of the same, time worn techniques.)
Patented dog coat.
Then there was the Manning & Napier “smart” patent analysis system with its analyses’ output displayed in 3-D visualizations. I bumped into Derwent (now Intellectual Property & Science) and other Thomson Corp. solutions as well. And, of course, there was my work for an unnamed, mostly clueless multi billion dollar outfit related to Google’s patent documents. I summarized the results of this analysis in my Google Version 2.0 monograph, portions of which were published by Bear Stearns before it met its thrilling end seven years ago. (Was my boss the fellow carrying a box out of the Midtown Bear Stearns’ building?)
Why the history?
Well, patents are expensive to litigate. For some companies, intellectual property is a revenue stream.
There is a knot in the headphone cable. Law firms are not the go go business they were 15 or 20 years ago. Law school grads are running gyms; some are Uber drivers. Like many modern post Reagan businesses, concentration is the name of the game. For the big firms with the big buck clients, money is no object.
The problem in the legal information business is that smaller shops, including the one and two person outfits operating in Dixie Highway type real estate, do not want to pay the $200 and up per search that commercial online services charge. Even when I was working for some high rollers, the notion of a five or six figure online charge elicited what I would diplomatically describe as gentle push back.
I read “LexisNexis TotalPatent Keeps Patent Research out of the Black Box with Improved Version of Semantic Search.” For those out of touch with online history, I worked for a company in the 1980s which provided commercial databases to LexisNexis. I knew one of the founders (Don Wilson). I even had reasonably functional working relationships with Dan Prickett and people named “Jim” and “Sharon.” In one bizarre incident, a big wheel from LexisNexis wanted to meet with me in the Cherry Hill Mall’s parking lot across from the old Bell Labs’ facility where I was a consultant at the time. Err, no thanks. I was okay with the wonky environs of Bell Labs. I was not okay with the lash up of a Dutch and British company.
Snippet of code from a Ramanathan Guha invention. Guha used to be at IBM Almaden and he is a bright fellow. See US7593939 B2.
What does LexisNexis TotalPatent deliver for a fee? According to the write up:
TotalPatent, a web-based patent research, retrieval and analysis solution powered by the world’s largest collection of searchable full-text and bibliographic patent authorities, allows researchers to enter up to 32,000 characters (equivalent to more than 10 pages of text), more than enough for an entire patent abstract, into its search field. The newly enhanced semantic engine, pioneered by LexisNexis in 2009 and continually improved using contextual information supplied by the valuable patent data fed to the system, presents results in the form of a user-adjustable term cloud, where the weighting and positioning of terms may be managed for more precise results. In addition to millions of full-text patent documents, TotalPatent also uses scientific, technical, and non-patent literature to return the deepest, most comprehensive search results.
April 15, 2015
I have a view of Yahoo. Sure, it was formed when I was part of the team that developed The Point (Top 5% of the Internet). Yahoo had a directory. We had a content processing system. We spoke with Yahoo’s David Filo. Yahoo had a vision, he said. We said, No problem.
The Point became part of Lycos, embracing Fuzzy and his round ball chair. Yahoo, well, Yahoo just got bigger and generally went the way of general purpose portals. CEOs came and went. Stakeholders howled and then sulked.
I read or rather looked at “Yahoo. Semantic Search From Document Retrieval to Virtual Assistants.” You can find the PowerPoint “essay” or “revisionist report” on SlideShare. The deck was assembled by the director of research at Yahoo Labs. I don’t think this outfit is into balloons, self driving automobiles, and dealing with complainers at the European Commission. Here’s the link. Keep in mind you may have to sign up with the LinkedIn service in order to do anything nifty with the content.
The premise of the slide deck is that Yahoo is into semantic search. After some stumbles, semantic search started to become a big deal with Google and rich snippets, Bing and its tiles, and Facebook with its Like button and the magical Open Graph Protocol. The OGP has some fascinating uses. My book CyberOSINT can illuminate some of these uses.
And where was Yahoo in the 2008 to 2010 interval when semantic search was abloom? Patience, grasshopper.
Yahoo was chugging along with its Knowledge Graph. If this does not ring a bell, here’s the illustration used in the deck:
The date is 2013, so Yahoo has been busy since Facebook, Google, and Microsoft were semanticizing their worlds. Yahoo has a process in place. Again from the slide deck:
I was reminded of the diagrams created by other search vendors. These particular diagrams echo the descriptions of the now defunct Siderean Software server’s setup. But most content processing systems are more alike than different.
April 8, 2015
When a bean counter tallies up the cost of an enterprise search system, the reaction, in my experience, is, “How did we get to this number?” The question is most frequently raised in larger organizations, and it is one to which enterprise search staff and their consultants often have no acceptable answer.
Search-splainers position the cost overruns, diminish the importance of the employees’ dissatisfaction with the enterprise search system, and unload glittering generalities to get a consulting deal. Meanwhile, enterprise search remains a challenged software application.
Consulting engineers, upgrades, weekend crash recoveries, optimizing, and infrastructure hassles balloon the cost of an enterprise search system. At some point, someone has to investigate why employees are complaining, implementing workarounds, and not using the system. When the answers are not satisfying, financial meltdowns put search vendors out of business. Examples range from Convera and the Intel and NBA matters to the unnoticed deaths of Delphes, Entopia, Siderean, et al.
Search to most professionals, regardless of occupation, means Google. Bang in a word or two and Google delivers the bacon or the soy bean paste substitute. Most folks do not know the difference, nor, in my view, do they care. Google is how one finds information.
The question is, “Why can’t enterprise search be like Google?”
Another question: “How can a person with a dog in the fight search-splain; that is, ‘prove’ how important search is to kith and kin, truth and honor, sales and profit?”
For most professionals, search Google style is “free.” The perception is fueled by the fog of ignorance. Google is providing objective information. Google is good. Google is the yardstick by which enterprise search is measured. Enterprise search comes up short. Implement a Google Search Appliance, and the employees don’t like that solution either.
Inside an organization, finding information is an essential part of a job. One cannot work on a report unless that person can locate information about the topic. Most of the data are housed in emails, PowerPoints, multiple drafts of Word documents stuffed with change tracking emendations, and maybe some paper notes. In some cases, a professional will have to speak face to face or via the phone to a colleague. The information then requires massaging, analysis, and reformatting.
Ah, the corporate life is little more than one more undergraduate writing assignment with some Excel tossed in.
March 31, 2015
I read an article from the outfit that relies on folks like Dave Schubmehl for expertise. The write up is “HP Links Vertica and IDOL Seeking Better Unstructured Data Analysis.” But I quite like the subtitle because it provides a timeline; to wit:
The company built a connector server for the products, which it acquired separately in 2011.
Let’s see that is just about three years plus a few months. The story reminded me of Rip Van Winkle who woke to a different world when he emerged from his slumber. The Sleepy Hollow could be a large technology company in the act of performing mitosis in order to generate [a] excitement, [b] money, and [c] the appearance of progress. I wonder if the digital Sleepy Hollow is located near Hanover Street? I will have to investigate that parallel.
What’s a few years of intellectual effort in a research “cave” when you are integrating software that is expected to generate billions of dollars in sales? Existing Vertica and Autonomy licensees are probably dancing in the streets.
The write up states:
Promising more thorough and timelier data analysis, Hewlett-Packard has released a software package that combines the company’s Vertica database with its IDOL data analysis platform. The HP Haven Connector Framework Server may allow organizations to study data sets that were too large or unwieldy to analyze before. The package provides “a mixture of statistical and contextual understanding,” of data, said Jeff Veis, HP vice president of marketing for big data. “You can pull in any form of data, and then do real-time high performance analysis.”
Hmm. “Promising” and “may allow” are interesting words and phrases. It seems as if the employer of Mr. Schubmehl is hedging on the HP assertions. I wonder, “Why?”
March 19, 2015
I review a couple of times a week a free digital “newspaper” called Paper.li. I learned about this Paper.li “newspaper” when Vivisimo sent me its version of “search news.” The enterprise search newspaper I receive is assembled under the firm hand of Edwin Stauthamer. The stories are automatically assembled into “The Enterprise Search Daily.”
The publication includes a wide range of information. The referrer’s name appears with each article. The title page for the March 18, 2015, issue looks like this.
In the last week or so, I have noticed a stridency in the articles about search and the disciplines the umbrella term protects from would-be encroachers. Search is customer support, but from the enterprise search vendors’ viewpoint, enterprise search is the secret sauce for a great customer support soufflé. Enterprise search also does Big Data, business intelligence, and dozens of other activities.
The reason for the primacy of search, as I understand the assertions of the search companies and the self appointed search “experts,” is that information retrieval makes the business work. Improve search. It follows, according to the logic, that revenues will increase, profits will rise, and employee and customer satisfaction will skyrocket.
Unfortunately enterprise search is difficult to position as the alpha and omega of enterprise software. Consider this article from the March 18 edition of The Enterprise Search Daily.
The article begins:
Enterprise search has notoriously been a problem in the content management equation. Various content and document management systems have made it possible to store files. But the ability to categorize that information intuitively and in a user-friendly way, and make that information easy to retrieve later, has been one of several missing pieces in the ECM market. When will enterprise search be as easy to use and insightful as Google’s external search engine? If enterprise search worked anywhere near as effectively as Google, it might be the versatile new item in our content management wardrobes, piecing content together with a clean sophistication that would appeal to users by making everything findable, accessible and easy to organize.
I am not sure how beginning with the general perception that enterprise search has been, is, and may well be a failure flips to a “must have” product. My view is that keyword search is a utility. For organizations with cash to invest, automated indexing and tagging systems can add some additional findability hooks. The caveat is that the licensee of these systems must be prepared to spend money on a professional who can ride herd on the automated system. The indexing strays have to be rounded up and meshed with the herd. But the title’s assertion is a dream, a wish. I don’t think enterprise content management is particularly buttoned up in most organizations. Even primitive search systems struggle to figure out what version is the one the user needs to find. Indexing by machine or human often leads to manual inspection of documents in order to locate the one the user requires. Google wanders into the scene because most employees give Google.com a whirl before undertaking a manual inspection job. If the needed document is on the Web somewhere, Google may surface it if the user is lucky enough to enter the secret combination of keywords. Google is deeply flawed, but for many employees, it is better than whatever their employer provides.
February 28, 2015
For years, I have posted a public indexing Overflight. You can examine the selected outputs at this Overflight link. (My non public system is more robust, but the public service is a useful temperature gauge for a slice of the content processing sector.)
When it comes to indexing, most vendors provide keyword, concept tagging, and entity extraction. But are these tags spot on? No, most are good enough.
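Most “good enough” tagging boils down to keyword matching plus pattern-based entity spotting. Here is a minimal sketch in Python; the concept vocabulary is invented for illustration, and a naive capitalized-phrase rule stands in for a real entity extractor:

```python
import re

# Toy concept vocabulary: real systems use curated taxonomies or statistics.
CONCEPTS = {"indexing": ["index", "taxonomy", "tagging"],
            "search": ["query", "retrieval", "keyword"]}

# Naive entity rule: two or more adjacent capitalized words.
ENTITY_PATTERN = re.compile(r"\b([A-Z][a-z]+(?:\s[A-Z][a-z]+)+)\b")

def tag(text):
    """Return 'good enough' concept tags and extracted entities for a passage."""
    lowered = text.lower()
    concepts = [c for c, cues in CONCEPTS.items()
                if any(cue in lowered for cue in cues)]
    entities = ENTITY_PATTERN.findall(text)
    return {"concepts": concepts, "entities": entities}

result = tag("Lucidea acquired Cuadra Associates to expand its taxonomy tools.")
# result["concepts"] -> ["indexing"]; result["entities"] -> ["Cuadra Associates"]
```

The sketch illustrates why “good enough” is the operative phrase: the rules fire on surface cues, not meaning, which is exactly where fatigued humans or subject matter specialists would otherwise come in.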
A happy quack to Jackson Taylor for this “good enough” cartoon. The salesman makes it clear that good enough is indeed good enough in today’s marketing enabled world.
I chose about 50 companies that asserted their systems performed some type of indexing or taxonomy function. I learned that the taxonomy business is “about to explode.” I find that to be either an interesting investment tip or a statement that is characteristic of content processing optimists.
Like search and retrieval, plugging in “concepts” or other index terms is a utility function. For example, if one indexes each word in an article appearing in this blog, the index terms might suggest the article is about another subject. In this post, I am talking about Overflight, but the real topic is the broader use of metadata in information retrieval systems. I could assign the term “faceted navigation” to this article as a way to mark it as germane to point and click navigation systems.
If you examine the “reports” Overflight outputs for each of the companies, you will discover several interesting things as I did on February 28, 2015 when I assembled this short article.
- Mergers and purchases of failed vendors at fire sale prices are taking place. Examples include Lucidea’s purchase of Cuadra and InMagic. Both of these firms are anchored in traditional indexing methods and seemed to be within a revenue envelope until their sell out. Business Objects acquired Inxight, and then SAP acquired Business Objects. Bouvet acquired Ontopia. Teradata acquired Revelytix.
- Moving indexing into open source. Thomson Reuters acquired ClearForest and made most of the technology available as OpenCalais. OpenText, a rollup outfit, acquired Nstein. SAS acquired Teragram. Smartlogic acquired Schemalogic. (A free report about Schemalogic is available at www.xenky.com/vendor-profiles.)
- A number of companies just failed, shut down, or went quiet. These include Active Classification, Arikus, Arity, Forth ICA, MaxThink, Millennium Engineering, Navigo, Progris, Protege, punkt.net, Questans, Quiver, Reuse Company, and Sandpiper, among others.
- The indexing sector includes a number of companies my non public system monitors; for example, the little known Data Harmony with six figure revenues after decades of selling really hard to traditional publishers. Conclusion: Indexing is a tough business to keep afloat.
There are numerous vendors who assert their systems perform indexing, entity, and metadata extraction. More than 18 of these companies are profiled in CyberOSINT, my new monograph. Oracle owns Triple Hop, RightNow, and Endeca. Each of these acquired companies performs indexing and metadata operations. Even the mashed potatoes search solution from Microsoft includes indexing tools. The proprietary XML data management vendor MarkLogic asserts that it performs indexing operations on content stored in its repository. Conclusion: More cyber oriented firms are likely to capture the juicy deals.
So what’s going on in the world of taxonomies? Several observations strike me as warranted:
First, none of the taxonomy vendors are huge outfits. I suppose one could argue that IBM’s Lucene based system is a billion dollar baby, but that’s marketing peyote, not reality. Perhaps MarkLogic, which is struggling toward $100 million in revenue, is the largest of this group. But the majority of the companies in the indexing business are small. Think in terms of a few hundred thousand dollars in annual revenue to $10 million with generous accounting assumptions.
What’s clear to me is that indexing, like search, is a utility function. If a good enough search system delivers good enough indexing, then why spend for humans to slog through the content and make human judgments? Why not let Google funded Recorded Future identify entities, assign geo codes, and extract meaningful signals? Why not rely on Haystax or RedOwl or any one of the more agile firms to deliver higher value operations?
I would assert that taxonomies and indexing are important to those who desire the accuracy of a human indexed system. This assumes that the humans are subject matter specialists, the humans are not fatigued, and the humans can keep pace with the flow of changed and new content.
The reality is that companies focused on delivering old school solutions to today’s problems are likely to lose contracts to companies that deliver what the customer perceives as a higher value content processing solution.
What can a taxonomy company do to ignite its engines of growth? Based on the research we performed for CyberOSINT, the future belongs to those who embrace automated collection, analysis, and output methods. Users may, if the user so chooses, provide guidance to the system. But the days of yore, when monks with varying degrees of accuracy created catalog sheets for the scriptoria have been washed to the margin of the data stream by today’s content flows.
What’s this mean for the folks who continue to pump money into taxonomy centric companies? Unless the cyber OSINT drum beat is heeded, the failure rate of the Overflight sample is a wake up call.
Buying Apple bonds might be a more prudent financial choice. On the other hand, there is an opportunity for taxonomy executives to become “experts” in content processing.
Stephen E Arnold, February 28, 2015
February 11, 2015
Download an open source enterprise search system or license a proprietary system. Once the system has been installed, the content crawled, the index built, the interfaces set up, and the system optimized, the job is complete, right?
Not quite. Retrofitting a keyword search system to meet today’s security requirements is a complex, time consuming, and expensive task. That’s why “experts” who write about search facets, search as a Big Data system, and search as a business intelligence solution ignore security or reassure their customers that it is no big deal. Security is a big deal, and it is becoming a bigger deal with each passing day.
There are a number of security issues to address. The easiest of these is figuring out how to piggyback on access controls provided by a system like Microsoft SharePoint. Other organizations use different enterprise software. As I said, using access controls already in place and diligently monitored by a skilled security administrator is the easy part.
A number of sticky wickets remain; for example:
- Some units of the organization may do work for law enforcement or intelligence entities. There may be different requirements. Some are explicit and promulgated by government agencies. Others may be implicit, acknowledged as standard operating procedure by those with the appropriate clearance and the need to know.
- Specific administrative content must be sequestered. Examples include information assembled for employee health matters and compliance requirements for pharma products or controlled substances.
- Legal units may require that content be contained in a managed system and that administrative controls be put in place to ensure that no changes are introduced into a content set, that access is provided only to those with specific credentials, or that content is kept “off the radar” as the in house legal team tries to figure out how to respond to a discovery activity.
- Some research units may be “black”; that is, no one in the company, including most information technology and security professionals, is supposed to know where an activity is taking place, what information is of interest to the research team, and that specialized security steps are enforced. These can include dongles, air gaps, and unknown locations and staff.
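The “easy part,” piggybacking on existing access controls, usually amounts to security trimming: filtering hits against the user’s group memberships. A toy sketch, with invented field names and a simplified group model (no real product’s schema):

```python
# Each indexed document carries an access-control list (ACL) of group names.
# The records and group names below are invented for illustration.
DOCUMENTS = [
    {"id": 1, "title": "Quarterly sales", "acl": {"sales", "exec"}},
    {"id": 2, "title": "Compliance file", "acl": {"legal"}},
    {"id": 3, "title": "Press release", "acl": {"everyone"}},
]

def security_trim(user_groups, results):
    """Return only the hits the user's groups are cleared to see."""
    allowed = set(user_groups) | {"everyone"}
    return [doc for doc in results if doc["acl"] & allowed]

visible = security_trim({"sales"}, DOCUMENTS)
# A user in "sales" sees documents 1 and 3, never the legal unit's file.
```

Note that the sketch filters at query time. The sequestered and “black” content described above typically must be kept out of the index entirely, because even a hit count or a snippet can leak the existence of a document.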
An enterprise search system without NGIA security functions is like a 1960s Chevrolet project car. Buy it ready to rebuild for $4,500 and invest $100,000 or more to make it conform to 2015’s standards. Source: http://car.mitula.us/impala-project
How do enterprise search systems deal with these access issues? Are not most modern systems positioned to index “all” content? Are the procedures for each of these four examples part of the enterprise search systems’ administrative tool kit?
Based on the research I conducted for CyberOSINT: Next Generation Information Access and my other studies of enterprise search, the answer is, “No.”
February 5, 2015
One of the content challenges traditional enterprise search trips over is geographic functions. When an employee looks for content, the implicit assumption is that keywords will locate a list of documents in which the information may be located. The user then scans the results list—whether in Google style laundry lists or in the graphic display popularized by Grokker and Kartoo which have gone dark. (Quick aside: Both of these outfits reflect the influence of French information retrieval wizards. I think of these as emulators of Datops “balls” displays.)
A results list displayed by the Grokker system. The idea is that the user explores the circular areas. These contain links to content germane to the user’s keyword query.
The Kartoo interface displays sources connected to related sources. Once again the user clicks and goes through the scan, open, read, extract, and analyze process.
In a broad view, both of these visualizations are maps of information. Do today’s users want these types of hard to understand maps?
In CyberOSINT I explore the role of “maps,” or more properly geographic intelligence (geoint), geo-tagging, and geographic outputs, from automatically collected and analyzed data.
The idea is that a next generation information access system recognizes geographic data and displays those data in maps. Think in terms of overlays on the eye popping maps available from commercial imagery vendors.
What do these outputs look like? Let me draw one example from the discussion in CyberOSINT about this important approach to enterprise related information. Keep in mind that an NGIA system can process any information made available to it; for example, enterprise accounting systems or database content along with text documents.
In response to either a task, a routine update when new information becomes available, or a request generated by a user with a mobile device, the output looks like this on a laptop:
Source: ClearTerra, 2014
The approach that ClearTerra offers allows a person looking for information about customers, prospects, or other types of geo-coded data to see those items appear on a dynamic map. The map can be displayed on the user’s device; for example, a mobile phone. In some implementations, the map is a dynamic PDF file which displays locations of items of interest as the items move. Think of a person driving a delivery truck or an RFID tagged package.
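Under the hood, this kind of geographic output is just records carrying latitude and longitude pairs, selected and pushed to a map layer. A rough sketch of the selection step, with invented coordinates and record layout (this is not ClearTerra’s API):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def items_near(records, lat, lon, radius_km):
    """Select geo-tagged items within radius_km of a point of interest."""
    return [rec for rec in records
            if haversine_km(lat, lon, rec["lat"], rec["lon"]) <= radius_km]

# Hypothetical delivery trucks: one near Louisville, one in New York.
trucks = [{"id": "A", "lat": 38.25, "lon": -85.76},
          {"id": "B", "lat": 40.71, "lon": -74.01}]
nearby = items_near(trucks, 38.20, -85.70, 25.0)  # only truck "A" qualifies
```

A dynamic map, whether in a PDF or on a mobile phone, repeats this selection each time the tagged items report new positions, then redraws the overlay.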
February 4, 2015
I have been following the “blast from the past” articles that appear on certain content management oriented blogs and news services. I find the articles about federated search, governance, and knowledge related topics oddly out of step with the more forward looking developments in information access.
I am puzzled because the keyword search sector has been stuck in a rut for many years. The innovations touted in the consulting-jargon of some failed webmasters, terminated in house specialists, and frustrated academics are old, hoary with age, and deeply problematic.
There are some facts that cheerleaders for the solutions of the 1970s, 1980s, and 1990s choose to overlook:
- Enterprise search typically means a subset of the content required by an employee to perform work in today’s fluid and mobile work environment. The mix of employees and part timers translates to serious access control work. Enterprise search vendors “support” an organization’s security systems in the manner of a consulting physician at heart surgery: inputs, but no responsibility.
- The costs of configuring, testing, and optimizing an old school system are usually higher than the vendor suggests. When the actual costs collide with the budget costs, the customer gets frisky. Fast Search & Transfer’s infamous revenue challenges came about in part because customers refused to pay when the system was not running and working as the marketers suggested it would.
- Employees cannot locate needed information and don’t like the interfaces. The information is often “in” the system but not in the indexes. And if it is in the indexes, the users cannot figure out which combination of keywords unlocks what’s needed. The response is, “Who has time for this?” When a satisfaction measure is required, somewhere between 55 and 75 percent of the search system’s users don’t like it very much.
Obviously organizations are looking for alternatives. Some use open source solutions, which are good enough. Other organizations put up with Windows’ search tools, which are also good enough. More important software systems like enterprise resource planning or accounting systems come with basic search functions. Again: these are good enough.
The focus of information access has shifted from indexing a limited corpus of content using a traditional solution to a more comprehensive, automated approach. No software is without its weaknesses. But compared to keyword search, there are vendors pointing customers toward a different approach.
Who are these vendors? In this short write up, I want to highlight the type of information about next generation information access vendors in my new monograph, CyberOSINT: Next Generation Information Access.
I want to highlight one vendor profiled in the monograph and mention three other vendors in the NGIA space which are not included in the first edition of the report but for whom I have reports available for a fee.
I want to direct your attention to Knowlesys, an NGIA vendor operating in Hong Kong and the Nanshan District, Shenzhen. On the surface, the company processes Web content. The firm also provides a free download of scraping software, which is beginning to show its age.
Dig a bit deeper, and Knowlesys provides a range of custom services. These include deploying, maintaining, and operating next generation information access systems for clients. The company’s system can automatically process and make available content from internal, external, and third party providers. Access is available via standard desktop computers and mobile devices:
Source: Knowlesys, 2014.
The system handles both structured and unstructured content in English and a number of other languages.
The company does not reveal its clients and the firm routinely ignores communications sent via the online “contact us” mail form and faxed letters.
How sophisticated is the Knowlesys system? Compared to the other 20 systems analyzed for the CyberOSINT monograph, my assessment is that the company’s technology is on a par with that of other vendors offering NGIA systems. The plus of the Knowlesys system, if one can obtain a license, is that it will handle Chinese and other ideographic languages as well as the Romance languages. The downside is that for some applications, the company’s location in China may be a consideration.
February 2, 2015
I find the complaints about Google’s inability to handle time amusing. On the surface, Google seems to demote, ignore, or just not understand the concept of time. For the vast majority of Google service users, Google is no substitute for the users’ investment of time and effort into dating items. But for the wide, wide Google audience, ads, not time, are more important.
Does Google really get an F in time? The answer is, “Nope.”
In CyberOSINT: Next Generation Information Access I explain that Google’s time sense is well developed and of considerable importance to next generation solutions the company hopes to offer. Why the crawfishing? Well, Apple could just buy Google and make the bitter taste of the Apple Board of Directors’ experience a thing of the past.
Now to temporal matters in the here and now.
CyberOSINT relies on automated collection, analysis, and report generation. In order to make sense of data and information crunched by an NGIA system, time is a key metatag item. To figure out time, a system has to understand:
- The date and time stamp
- Versioning (previous, current, and future document, data items, and fact iterations)
- Times and dates contained in a structured data table
- Times and dates embedded in content objects themselves; for example, a reference to “last week” or, in some cases, optical character recognition of the date on a surveillance tape image.
For the average query, this type of time detail is overkill. The “time and date” of an event, therefore, requires disambiguation, determination and tagging of specific time types, and then capturing the date and time data with markers for document or data versions.
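The “last week” case is the tricky one: a relative expression only becomes a usable tag once it is anchored to the document’s own date and time stamp. A toy normalizer sketches the idea; the phrase list is deliberately tiny, and production systems handle far more cases:

```python
from datetime import date, timedelta

def resolve(expression, doc_date):
    """Map a relative time phrase to an absolute date, anchored on the
    document's date stamp. Returns None for phrases the toy rules miss."""
    rules = {
        "yesterday": timedelta(days=1),
        "last week": timedelta(weeks=1),
        "last month": timedelta(days=30),  # rough approximation
    }
    delta = rules.get(expression.lower())
    return doc_date - delta if delta else None

stamp = date(2015, 2, 2)                # the document's date-and-time stamp
resolved = resolve("last week", stamp)  # -> date(2015, 1, 26)
```

Once the phrase is resolved to an absolute date, the system can tag the time type (publication date versus event date versus version date) and attach version markers, which is where the real computational work described above begins.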
A simplification of Recorded Future’s handling of unstructured data. The system can also handle structured data and a range of other data management content types. Image copyright Recorded Future 2014.
Sounds like a lot of computational and technical work.
In CyberOSINT, I describe Google’s and In-Q-Tel’s investments in Recorded Future, one of the data forward NGIA companies. Recorded Future has wizards who developed the Spotfire system, which is now part of the Tibco service. There are Xooglers like Jason Hines. There are assorted wizards from Sweden, countries most US high school students cannot locate on a map, and assorted veterans of high technology start ups.
An NGIA system delivers actionable information to a human or to another system. Conversely, a licensee can build and integrate new solutions on top of the Recorded Future technology. One of the company’s key inventions is a set of numerical recipes that deal effectively with the notion of “time.” Recorded Future uses the name “Tempora” as shorthand for the advanced technology that makes time, along with predictive algorithms, part of the Recorded Future solution.