What Most Search Vendors Cannot Pull Off

July 19, 2014

I recently submitted an Information Today column that reported on Antidot’s tactical play to enter the US market. One of the fact checkers for the write up alerted me that most of the companies I identified were unknown to US readers. Test yourself. How many of these firms do you recognize? How many of them provide information retrieval services?

  • A2ia
  • Albert (originally AMI Albert and AMI does not mean friend)
  • Dassault Exalead
  • Datops
  • EZ2Find
  • Kartoo
  • Lingway
  • LUT Technologies
  • Pertimm
  • Polyspot
  • Quaero
  • Questel
  • Sinequa

How did you do? The point is that French vendors of information retrieval and content processing technology find themselves in a crowded boat. Most of the enterprise search vendors have flamed out or resigned themselves to pitching to venture capitalists that their technology is the Next Big Thing. A lucky few sell out and cash in; for example, Datops. Others are ignored or forgotten.

The same situation exists for vendors of search technology in other countries. Search is a tough business. Former Googler Marissa Mayer was the boss when Yahoo’s share of the Web search market sagged below 10 percent. In the same time period, Microsoft increased Bing’s share to about 14 percent. Google dogpaddled and held steady. Other Web search providers make up the balance of the market players. Business Insider reported:

This is a big problem for Yahoo since its search business is lucrative. While Yahoo’s display ad business fell 7% last quarter, revenue from search was up 6% on a year-over-year basis. Revenue from search was $428 million compared to $436 million from its display ad business.

Now enterprise search vendors have been trying to use verbal magic to unlock consistently growing revenue. So far only a couple of vendors have been able to open the revenue vault’s lock. Autonomy tallied more than $800 million in revenue at the time of its sale to Hewlett Packard. The outcome of that deal was a multi-billion dollar write off and many legal accusations. One thing is clear through the murky rhetoric the deal produced: Hewlett Packard had zero understanding of search and has been looking for a scapegoat to slaughter for its corporate decision. This is not helping the search vendors chasing deals.

Google converted Web search into a $60 billion revenue stream. The core idea for online advertising originated with the pay-to-play company GoTo, which morphed into Overture, which was THEN acquired by Yahoo. Think of the irony. Yahoo has the technology that makes Google a one trick, but very lucrative, revenue pony. But, to be fair, Google Web search is not the enterprise search needed to locate a factoid for a marketing assistant. Feed the query “show me the versions of the marketing VP’s last product road map” to a Google appliance and check the results. The human has to do some old fashioned human-type work. Finding this information with a Google Search Appliance, or any other information retrieval engine for that matter, is tricky. Basic indexing cannot do the job, so most marketing assistants hunt manually through files, folders, and hard copies looking for the Easter egg.

Many of the pioneering search engines tried explaining their products and services using euphemisms. There was question answering, content intelligence, smart content, predictive retrieval, entity extraction, and dozens and dozens of phrases that sound fine but are very difficult to define; for example, knowledge management and the phrase “enterprise search” itself or “image recognition” or “predictive analytics”, among others.

I had a hearty chuckle when I read “Don’t Sell a Product, Sell a Whole New Way of Thinking.” Search has been available for at least 50 years. Think RECON, Orbit, Fulcrum Technologies, BASIS, Teratext, and other artifacts of search and retrieval. Smart folks cooked up even the computationally challenged Delphes system, the metasearch system Vivisimo, and the essentially unknown Quertle.

A romp through these firms’ marketing collateral, PowerPoints, and PDFs makes clear that no buzzword has been left untried. Buyers did not and do not know what the systems actually delivered. This is evidence that search vendors have not been able to “sell a whole new way of thinking.”

No kidding. The synonyms search marketers have used in order to generate interest and hopefully a sale are a catalog of information technology jargon. Here is a short list of some of the terms from the 1990s:

  • Business intelligence
  • Competitive intelligence
  • Content governance
  • Content management
  • Customer support, then customer relationship management
  • Knowledge management
  • Neurodynamics
  • Text analytics

If I accept the Harvard analysis, the failing of enterprise search is not financial fiddling and jargon. As you may recall, Microsoft paid $1.2 billion for Fast Search & Transfer. The investigation into allegations of financial fancy dancing was resolved recently with one executive facing a possible jail term and employment restrictions. There are other companies that tried to blend search with content only to find that the combination was not quite like peanut butter and jelly. Do you use Factiva or Ebsco? Did I hear a “what?” Other companies embraced slick visualizations to communicate key information at a glance. Do you remember Grokker? There was semantic search. Do you recollect Siderean Software?

One success story was Oingo, renamed Applied Semantics. Google understood the value of mapping words to ads and purchased the company to further its non search goals of generating ad revenue.

According to the HBR:

To find the shift, ask yourself a few questions. What was the original insight that led to the innovation? Where do you feel people “don’t get it” about your solution? What is the “aha” moment when someone turns from disinterested to enthusiastic?

Those who code up search systems are quite bright. Is this pat formula of shifting thinking the solution to the business challenges these firms face:

Attivio. Founded by Fast Search & Transfer alums, the company has ingested more than $35 million in venture funding. The company’s positioning is “an actionable 360 degree view of anything you need.” Okay. Dassault Exalead used the same line several years ago.

Coveo. The company has tapped venture firms for more than $30 million since the firm’s founding in 2004. Coveo uses the phrase “enterprise search” and wraps it in knowledge workers, customer service, engineering, and CRM. The idea is that Coveo delivers solutions tailored to specific business functions and employee roles.

SRCH2. This is a Xoogler founded company that, like Perfect Search before it, emphasizes speed. The pitch is that the alternative is better than open source search solutions.

Lucid Works. Like Vivisimo, Lucid Works has embraced Big Data and the cloud. The only slow downs Lucid has encountered have been turnover in CEOs, marketing, and engineering professionals. The most recent hurdle to trip up Lucid is the interest in ElasticSearch, fat with almost $100 million in venture funding and developers from the open source community.

IBM Watson. Based on open source and home grown technology, IBM’s marketers have showcased Watson on Jeopardy and garnered headlines for the $1 billion investment IBM is making in its “smart” information processing system. The most recent demonstration of Watson was producing a recipe for Bon Appetit readers.

Amazon’s search approach is to provide it as a service to those using Amazon Web Services. Search is, in my mind, just a utility for Amazon. Amazon’s search system on its eCommerce site is not particularly good. Want to NOT out books not yet available on the system? Well, good luck with that query.

After I stopped chuckling, I realized that the Harvard article is less concerned with precision and recall than advocating deception, maybe cleverness. No enterprise search vendor has approached Autonomy’s revenues with the sole exception of Google’s licensing of the wildly expensive Google Search Appliance. At the time of its sale to Oracle, Endeca was chugging along at an estimated $150 million in revenue. Oracle paid about $1 billion for Endeca. With that benchmark, name another enterprise search vendor or eCommerce search vendor that has raced past Endeca. For the majority of enterprise search vendors, revenues of $3 to $10 million represent very significant achievements.

An MBA who takes over an enterprise search company may believe that wordsmithing will make sales. Sure, some sales may result, but will the revenue be sustainable? Most enterprise search sales are a knee jerk reaction to problems with the incumbent search system.

Without concrete positive case studies, talking about search is sophistry. There are comparatively few specific return on investment analyses for enterprise search installations. I provided a link to a struggling LinkedIn person about an Italian library’s shift from the 1960s BASIS system to a Google Search Appliance.

Is enterprise search an anomaly in business software? Will the investment firms get their money back from their investments in search and retrieval?

Ask a Harvard MBA steeped in the lore of selling a whole new way of thinking. Ignore 50 years of search history. Success in search is difficult to achieve. Duplicity won’t do the job.

Stephen E Arnold, July 19, 2014

Jepsen-Testing Elasticsearch for Safety and Data Loss

July 18, 2014

The article titled Call Me Maybe: Elasticsearch on Aphyr explores potential issues with Elasticsearch. Jepsen is a series on Aphyr that tests how different software behaves under various types of network failure. Elasticsearch is built on the solid Java indexing library Apache Lucene. The article begins with an overview of how Elasticsearch scales through sharding and replication.

“The document space is sharded–sliced up–into many disjoint chunks, and each chunk allocated to different nodes. Adding more nodes allows Elasticsearch to store a document space larger than any single node could handle, and offers quasilinear increases in throughput and capacity with additional nodes. For fault-tolerance, each shard is replicated to multiple nodes. If one node fails or becomes unavailable, another can take over…Because index construction is a somewhat expensive process, Elasticsearch provides a faster database backed by a write-ahead log.”
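The routing scheme the quote describes can be sketched in a few lines of Python. This is a toy illustration, not Elasticsearch’s implementation: the real system hashes a routing value (the document id by default) with murmur3 and takes it modulo the primary shard count; md5 here is a stand-in with the same mod-N idea.

```python
import hashlib

NUM_PRIMARY_SHARDS = 5   # fixed when the index is created
NUM_REPLICAS = 1         # copies of each primary shard, for fault tolerance

def shard_for(doc_id):
    """Deterministically route a document id to a primary shard.

    Every node computes the same shard from the same id, so any
    node can forward an index or get request to the right place.
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PRIMARY_SHARDS

placement = {d: shard_for(d) for d in ["doc-1", "doc-2", "doc-3"]}
```

Because the shard count sits in the modulus, changing it would re-route every document, which is why the number of primary shards is fixed up front.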

Over a series of tests (with results summarized by delightful Barbie and Ken doll memes), the article concludes that while version control may be considered a “lost cause,” Elasticsearch handles inserts superbly. For more information on how Elasticsearch behaved through speed bumps, building a nemesis, nontransitive partitions, needless data loss, and random and fixed transitive partitions, read the full article. It ends with recommendations for Elasticsearch and for users, and concedes that the post provides far more information on Elasticsearch than anyone would ever desire.

Chelsea Kerwin, July 18, 2014

Sponsored by ArnoldIT.com, developer of Augmentext

Will Germany Scrutinize Google Web Search More Closely?

July 14, 2014

Several years ago, I learned a hard-to-believe factoid. In Denmark, 99 percent of referrals to a major financial service firm’s Web site came via Google. Figuring prominently was Google.de. My contact mentioned that the same traffic flow characterized the company’s German affiliate; that is, if an organization wanted Web traffic, Google was then the only game in town.

I no longer follow the flips and flops of Euro-centric Google killers like Quaero. I have little or no interest in assorted German search revolutions whether from the likes of the Weitkämper Clustering Engine or the Intrafind open source play or the Transinsight Enterprise Semantic Intelligence system. Although promising at one time, none of these companies offers an information retrieval system that could supplant Google for German language search. Toss in English and the other languages Google supports, and the likelihood of a German Google killer decreases.

I read “Germany Is Looking to Regulate Google and Other Technology Giants.” I found the write up interesting and thought provoking. I spend some time each day contemplating the search and content processing sectors. I don’t pay much attention to the wider world of business and technology.

The article states:

German officials are planning to clip the wings of technology giants such as Google through heavier regulation.

That seems cut and dried. I also noted this statement:

The German government has always been militant in matters of data protection. In 2013, it warned consumers against using Microsoft’s Windows 8 operating system due to perceived security risks, suggesting that it provided a back door for the US National Security Agency (NSA). Of course, this might have had something to do with the fact that German chancellor Angela Merkel was one of the first high-profile victims of NSA surveillance, with some reports saying that the NSA hacked her mobile phone for over a decade.

My view is that search and content processing may be of particular interest. After all, who wants to sit and listen to a person’s telephone calls? I would convert the speech to text and hit the output with one of the many tools available to attach metadata, generate relationship maps, and tug out entities like code words and proper names. Then I would browse the information using an old fashioned tabular report. I am not too keen on the 1959 Cadillac tail fin visualizations that 20 somethings find helpful, but to each his or her own, I say.

Scrutiny of Google’s indexing might reveal some interesting things to the team assigned to ponder Google from macro and micro levels. The notion of timed crawls, the depth of crawls, the content parsed and converted to a Guha type semantic store, the Alon Halevy dataspace, and other fascinating methods of generating meta-information might be of interest to the German investigate-the-US-vendors team.

My hunch is that scrutiny of Google is likely to lead to increased concern about Web indexing in general. That means even the somewhat tame Bing crawler and the other Web indexing systems churning away at “public” sites’ content may be of interest.

When it comes to search and retrieval, ignorance and bliss are bedfellows. Once a person understands the utility of the archives, the caches, and the various “representations” of the spidered and parsed source content, bliss may become FUD (a version of IBM’s fear, uncertainty and doubt method). FUD may create some opportunities for German search and retrieval vendors. Will these outfits be able to respond or will the German systems remain in the province of Ivory Tower thinking?

In the short term, life will be good for the law firms representing some of the non German Web indexing companies. I wonder, “Is the Google Germany intercept matter included in the young attorneys’ legal education in Germany?”

Stephen E Arnold, July 14, 2014

Search, Not Just Sentiment Analysis, Needs Customization

July 11, 2014

One of the most widespread misperceptions in enterprise search and content processing is “install and search.” Anyone who has tried to get a desktop search system like X1 or dtSearch to do what the user wants with his or her files and network shares knows that fiddling is part of the desktop search game. Even a basic system like SowSoft’s Effective File Search requires configuring the targets to query for every search in multi-drive systems. The work arounds are not for the casual user. Just try making a Google Search Appliance walk, talk, and roll over without the ministrations of an expert like Adhere Solutions. Don’t take my word for it. Get your hands dirty with information processing’s moving parts.

Does it not make sense that a search system destined for serving a Fortune 1000 company requires some additional effort? How much more time and money will an enterprise class information retrieval and content processing system require than a desktop system or a plug-and-play appliance?

How much effort is required for these tasks? There is work to get the access controls working as the ever alert security manager expects. Then there is the work needed to get the system to access, normalize, and process content for the basic index. Then there is work to get the system to recognize, acquire, index, and allow a user to access old, new, and changed content. Then one has to figure out what to tell management about rich media, content for which additional connectors are required, and the method for locating versions of PowerPoints, Excels, and Word files. Then one has to deal with latencies, flawed indexes, and dependencies among the various subsystems that a search and content processing system includes. There are other tasks as well, like interfaces and work flow for alerts, yadda yadda. You get the idea of the almost unending stream of dependent, serial “thens.”
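To make the “recognize, acquire, and index changed content” step concrete, here is a minimal change-detection sketch for a file share crawl. The function names are mine, not any vendor’s; a production system layers access controls, format connectors, and deletion handling on top of this skeleton.

```python
import hashlib
from pathlib import Path

def fingerprint(path):
    """Hash file bytes so edited content is re-indexed, not just new files."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def incremental_index(root, seen):
    """Return files under root that are new or changed since the last pass.

    'seen' maps path string -> fingerprint from the previous crawl and is
    updated in place, so calling this in a loop only surfaces deltas.
    """
    changed = []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        fp = fingerprint(path)
        if seen.get(str(path)) != fp:
            changed.append(path)
            seen[str(path)] = fp
    return changed
```

Even this toy version hints at the cost drivers: every pass re-reads content, and the “seen” state must survive restarts.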

When I read “Why Sentiment Analysis Engines need Customization”, I felt sad for licensees fooled by marketers of search and content processing systems. Yep, sad as in sorrow.

Is it not obvious that enterprise search and content processing is primarily about customization?

Many of the so called experts, advisors, and vendors illustrate these common search blind spots:

ITEM: Consulting firms that sell my information under another person’s name, ensuring that clients are likely to get a wild and woolly view of reality. Example: Check out IDC’s $3,500 version of information based on my team’s work. Here’s the link for those who find that big outfits help themselves to expertise and then identify a person with a fascinating employment and educational history as the AUTHOR.

image

See  http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=idc%20attivio

In this example from http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=idc%20attivio, notice that my work is priced at seven times that of a former IDC professional. Presumably Mr. Schubmehl recognized that my value was greater than that of an IDC sole author and priced my work accordingly. Fascinating because I do not have a signed agreement giving IDC, Mr. Schubmehl, or IDC’s parent company the right to sell my work on Amazon.

This screen shot makes it clear that my work is identified as that of a former IDC professional, a fellow from upstate New York, an MLS on my team, and a Ph.D. on my team.

image

See http://amzn.to/1ner8mG.

I assume that IDC’s expertise embraces the level of expertise evident in the TechRadar article. Should I trust a company that sells my content without a formal contract? Oh, maybe I should ask this question: “Should you trust a high profile consulting firm that vends another person’s work as its own?” Keep that $3,500 price in mind, please.

ITEM: The TechRadar article is written by a vendor of sentiment analysis software. His employer is Lexalytics / Semantria (once a unit of Infonics). He writes:

High quality NLP engines will let you customize your sentiment analysis settings. “Nasty” is negative by default. If you’re processing slang where “nasty” is considered a positive term, you would access your engine’s sentiment customization function, and assign a positive score to the word. The better NLP engines out there will make this entire process a piece of cake. Without this kind of customization, the machine could very well be useless in your work. When you choose a sentiment analysis engine, make sure it allows for customization. Otherwise, you’ll be stuck with a machine that interprets everything literally, and you’ll never get accurate results.

When a vendor describes “natural language processing” with the phrase “high quality” I laugh. NLP is a work in progress. But the stunning statement in this quoted passage is:

Otherwise, you’ll be stuck with a machine that interprets everything literally, and you’ll never get accurate results.

Amazing, a vendor wrote this sentence. Unless a licensee of a “high quality” NLP system invests in customizing, the system will “never get accurate results.” I quite like that categorical never.
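The customization the vendor describes is easy to picture with a toy lexicon-based scorer. The words and weights below are invented for illustration; they have nothing to do with Lexalytics’ actual engine.

```python
DEFAULT_LEXICON = {"nasty": -1.0, "awful": -1.0, "great": 1.0}

def score(text, overrides=None):
    """Sum per-word sentiment weights over whitespace-split tokens.

    'overrides' is the customization hook: a domain that uses
    "nasty" as praise remaps it without touching anything else.
    """
    lexicon = dict(DEFAULT_LEXICON, **(overrides or {}))
    return sum(lexicon.get(w, 0.0) for w in text.lower().split())

print(score("that beat is nasty"))                  # -1.0 by default
print(score("that beat is nasty", {"nasty": 1.0}))  # 1.0 after the override
```

Note what the override does not fix: any word absent from the lexicon scores zero, literally the “interprets everything literally” problem.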

ITEM: Sentiment analysis is a single, usually complex component of a search or content processing system. A person on the LinkedIn enterprise search group asked the few hundred “experts” in the discussion group for examples of successful enterprise search systems. If you are a member in good standing of LinkedIn, you can view the original query at this link. [If the link won’t work, talk to LinkedIn. I have no idea how to make references to my content on the system work consistently over time.] I pointed out that enterprise search success stories are harder to find than reports of failures. Whether the flop is at the scale of the HP/Autonomy acquisition or a more modest termination like Overstock’s dumping of a big name system, the “customizing” issue is often present. Enterprise search and content processing is usually:

  • A box of puzzle pieces that requires time, expertise, and money to assemble in a way that attracts and satisfies users and the CFO
  • A work in progress to make work so users are happy and in a manner that does not force another search procurement cycle, the firing of the person responsible for the search and content processing system, and the legal fees related to the invoices submitted by the vendor whose system does not work. (Slow or no payment of licensee and consulting fees to a search vendor can be fatal to the search firm’s health.)
  • A source of friction among those contending for infrastructure resources. What I am driving at is that a misconfigured search system makes some computing work S-L-O-W. Note: the performance issue must be addressed for appliance-based, cloud, or on premises enterprise search.
  • Money. Don’t forget money, please. Remember the CFO’s birthday. Take her to lunch. Be really nice. The cost overruns that plague enterprise search and content processing deployments and operations will need all the goodwill you can generate.

If sentiment analysis requires customizing and money, take out your pencil and estimate how much it will cost to make NLP and sentiment analysis work. Now do the same calculation for relevancy tuning, index tuning, optimizing indexing and query processing, etc.

The point is that folks who get a basic key word search and retrieval system working pile on the features and functions. Vendors whip up some wrapper code that makes it possible to do a demo of customer support search, eCommerce search, voice search, and predictive search. Once the licensee inks the deal, the fun begins. The reason one major Norwegian search vendor crashed and burned is that licensees balked at paying bills for a next generation system that was not what the PowerPoint slides described. Why has IBM embraced open source search? Is one reason to trim the cost of keeping the basic plumbing working reasonably well? Why are search vendors embracing every buzzword that comes along? I think that search as an enterprise function has become a very difficult thing to sell, make work, and turn into an evergreen revenue stream.

The TechRadar article underscores the danger for licensees of over hyped systems. The consultants often surf on the expertise of others. The vendors dance around the costs and complexities of their systems. The buzzwords obfuscate.

What makes this article by the Lexalytics’ professional almost as painful as IDC’s unauthorized sale of my search content is this statement:

You’ll be stuck with a machine that interprets everything literally, and you’ll never get accurate results.

I agree with this statement.

Stephen E Arnold, July 11, 2014

Information Manipulation: Accountability Pipe Dream

July 5, 2014

I read an article with what I think is the original title: “What does the Facebook Experiment Teach us? Growing Anxiety About Data Manipulation.” I noted that the title presented on Techmeme was “We Need to Hold All Companies Accountable, Not Just Facebook, for How They Manipulate People.” In my view, this mismatch of titles is a great illustration of information manipulation. I doubt that the writer of the improved headline is aware of the irony.

The ubiquity of information manipulation is far broader than Facebook twirling the dials of its often breathless users. Navigate to Google and run this query:

cloud word processing

Note anything interesting in the results list displayed for me on my desktop computer:

image

The number one ad is for Google. In the first page of results, Google’s cloud word processing system is listed three more times. I did not spot Microsoft Office in the cloud except in item eight: “Is Google Docs Making Microsoft Word Redundant?”

For most Google search users, the results are objective. No distortion evident.

Here’s what Yandex displays for the same query:

image

No Google word processing and no Microsoft word processing whether in the cloud or elsewhere.

When it comes to searching for information, the notion that a Web indexing outfit is displaying objective results is silly. The Web indexing companies are in the forefront of distorting information and manipulating users.

Flash back to the first year of the Bush administration when Richard Cheney was vice president. I was in a meeting where a request was considered to make sure that the vice president’s office Web site would appear in FirstGov.gov hits in a prominent position. This, gentle reader, is a request that calls for hit boosting. The idea is to write a script or configure the indexing plumbing to make darned sure a specific url or series of documents appears when and where required. No problem, of course. We created a stored query for the Fast Search & Transfer search system and delivered what the vice president wanted.

This type of results manipulation is more common than most people accept. Fiddling Web search, like shaping the flow of content on a particular semantic vector, is trivial. Search engine optimization is a fools’ game compared with the tried and true methods of weighting or just buying real estate on a search results page, a Web site from a “real” company.
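For anyone who doubts how little machinery hit boosting requires, here is a sketch of the re-ranking stage. The url and weight are invented placeholders, not the actual FirstGov configuration.

```python
# Hypothetical boost table: url -> multiplier applied to the engine's score.
BOOSTED_URLS = {"https://www.example.gov/vp/": 10.0}

def rank(results):
    """Re-rank (url, relevance_score) pairs, boosting listed urls.

    A stored query or an indexing-plumbing tweak amounts to the same
    thing: a deterministic thumb on the relevance scale.
    """
    boosted = [(url, rel * BOOSTED_URLS.get(url, 1.0))
               for url, rel in results]
    return sorted(boosted, key=lambda pair: pair[1], reverse=True)
```

A page the engine scored at 0.2 outranks a 0.9 result the moment its multiplier is large enough, and the user sees nothing unusual.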

The notion that disinformation, reformation, and misinformation will be identifiable, rectified, and used to hold companies accountable is not just impossible. The notion itself reveals how little awareness there is of how the actual methods of digital content injection work.

How much of the content on Facebook, Twitter, and other widely used social networks is generated by intelligence professionals, public relations “professionals,” and folks who want to be perceived as intellectual luminaries? Whatever your answer, what data do you have to back up your number? At a recent intelligence conference in Dubai, one specialist estimated that half of the traffic on social networks is shaped or generated by law enforcement and intelligence entities. Do you believe that? Probably not. So good for you.

Amusing, but as someone once told me, “Ignorance is bliss.” So, hello, happy idealists. The job is identifying, interpreting, and filtering. Tough, time consuming work. Most of the experts prefer to follow the path of least resistance and express shock that Facebook would toy with its users. Be outraged. Call for action. Invent an algorithm to detect information manipulation. Let me know how that works out when you look for a restaurant and it is not findable from your mobile device.

Stephen E Arnold, July 5, 2014

AeroText: A New Breakthrough in Entity Extraction

June 30, 2014

I returned from a brief visit to Europe to an email asking about Rocket Software’s breakthrough technology AeroText. I poked around in my archive and found a handful of nuggets about the General Electric Laboratories’ technology that migrated to Martin Marietta, then to Lockheed Martin, and finally in 2008 to the low profile Rocket Software, an IBM partner.

When did the text extraction software emerge? Is Rocket Software AeroText a “new kid on the block”? The short answer is that AeroText is pushing 30, maybe 35 years young.

Digging into My Archive of Search Info

As far as my archive goes, it looks as though the roots of AeroText are anchored in the 1980s. Yep, that works out to an innovation about the same age as the long in the tooth ISYS Search system, now owned by Lexmark. Over the years, the AeroText “product” has evolved, often in response to US government funding opportunities. The precursor to AeroText was an academic exercise at General Electric. Keep in mind that GE makes jet engines, so GE at one time had a keen interest in anything its aerospace customers in the US government thought was a hot tamale.

1_interface

The AeroText interface circa mid 2000. On the left is the extraction window. On the right is the document window. From “Information Extraction Tools: Deciphering Human Language,” IT Pro, November/December 2004, page 28.

The GE project, according to my notes, appeared as NLToolset, although my files contained references to different descriptions such as Shogun. GE’s team of academics and “real” employees developed a bundle of tools for its aerospace activities and in response to Tipster. (As a side note, in 2001, there were a number of Tipster related documents in the www.firstgov.gov system. But the new www.usa.gov index does not include that information. You will have to do your own searching to unearth these text processing jump start documents.)

The aerospace connection is important because the Department of Defense in the 1980s was trying to standardize on markup for documents. Part of this effort was processing content like technical manuals and various types of unstructured content to figure out who was named, what part was what, and what people, places, events, and things were mentioned in digital content. The utility of NLToolset type software was for cost reduction associated with documents and the intelligence value of processed information.

The need for a markup system that worked without 100 percent human indexing was important. GE got with the program and appears to have assigned some then-young folks to the project. The government speak for this type of content processing involves terms like “message understanding” or MU, “entity extraction,” and “relationship mapping.” The outputs of an NLToolset system were intended for use in other software subsystems that could count, process, and perform other operations on the tagged content. Today, this class of software would be packaged under a broad term like “text mining.” GE exited the business, which ended up in the hands of Martin Marietta. When the technology landed at Martin Marietta, the suite of tools was used in what was called in the late 1980s and early 1990s the Louella Parsing System. When Lockheed and Martin merged to form the giant Lockheed Martin, Louella was renamed AeroText.

Over the years, the AeroText system competed with LingPipe, SRA’s NetOwl, and Inxight’s tools. In the heyday of natural language processing, there were dozens and dozens of universities and start ups competing for Federal funding. I have mentioned in other articles the importance of the US government in jump starting the craziness in search and content processing.

In 2005, I recall, Lockheed Martin released AeroText 5.1 for Linux, but I have lost track of the open source versions of the system. The point is that AeroText is not particularly new, and as far as I know, the last major upgrade took place in 2007 before Lockheed Martin sold the property to Rocket Software. At the time of the sale, AeroText incorporated a number of subsystems, including a useful time plotting feature. A user could see tagged events on a timeline, a function long associated with the original version of i2’s Analyst’s Notebook. A US government buyer can obtain AeroText via the GSA because Lockheed Martin seems to be a reseller of the technology. Before the sale to Rocket, Lockheed Martin followed SAIC’s push into Australia. Lockheed signed up NetMap Analytics to handle Australia’s appetite for US government accepted systems.

AeroText Functionality

What does AeroText purport to do that caused the person who contacted me to see a 1980s technology as the next best thing to sliced bread?

AeroText is an extraction tool; that is, it has capabilities to identify and tag entities at somewhere between 50 percent and 80 percent accuracy. (See NIST 2007 Automatic Content Extraction Evaluation Official Results for more detail.)

The AeroText approach uses knowledgebases, rules, and patterns to identify and tag pre-specified types of information. AeroText references patterns and templates, both of which assume the licensee knows beforehand what is needed and what will happen to processed content.
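AeroText’s internals are proprietary, but the general rules-and-patterns approach described above can be sketched in a few lines. The entity types and regular expressions below are hypothetical stand-ins for a knowledgebase, not AeroText’s actual rule format; a real system uses far richer grammars and curated lexicons.

```python
import re

# Hypothetical patterns standing in for a rule-based knowledgebase.
PATTERNS = {
    "PERSON": re.compile(r"\b(?:Mr|Ms|Dr)\.\s+[A-Z][a-z]+"),
    "ORG": re.compile(r"\b[A-Z][A-Za-z]+\s+(?:Inc|Corp|Ltd)\.?"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}

def tag_entities(text):
    """Return (entity_type, matched_text, offset) triples for each rule hit."""
    hits = []
    for etype, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((etype, match.group(), match.start()))
    return sorted(hits, key=lambda hit: hit[2])

sample = "Dr. Smith joined Acme Corp. on 12/01/2006."
print(tag_entities(sample))
```

The sketch also shows why the licensee must know in advance what is wanted: a person, organization, or date that does not match a pre-specified pattern is simply never tagged.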

In my view, the licensee has to know what he or she is looking for in order to find it. This is a problem captured in the famous snippet, “You don’t know what you don’t know,” and the “unknown unknowns” variation popularized by Donald Rumsfeld. Obviously, without prior knowledge, an AeroText-type system has to be matched carefully to mission requirements. AeroText pounded the drum for the semantic Web revolution. One of AeroText’s key functions was its ability to perform the type of XML markup the Department of Defense required. The US DoD used a variant called DAML, or DARPA Agent Markup Language. Natural language processing, Louella, and AeroText collected the dust of SPARQL, unifying logic, RDF, OWL, ontologies, and other semantic baggage as the system evolved through time.

Also, staff (headcount) and on-going services are required to keep a Louella/AeroText-type system generating relevant and usable outputs. AeroText can find entities, figure out relationships like person to person and person to organization, and tag events like a merger or an arrest “event.” In one briefing about AeroText I attended, I recall that the presenter emphasized that AeroText did not require training. (The subtext for those in the know was that Autonomy required training to deliver actionable outputs.) The presenter did not dwell on the need for manual fiddling with AeroText’s knowledgebases, and I did not raise the issue.


Possibilities for Solving the Problem of Dimensionality in Classification

June 5, 2014

VisionDummy’s overview of why classification is hard, titled The Curse of Dimensionality in Classification, provides a surprisingly readable explanation built around an example of sorting images of cats and dogs. The first step is creating features that assign values to the images (such as color or texture). From there, the article states,

“We now have 5 features that, in combination, could possibly be used by a classification algorithm to distinguish cats from dogs. To obtain an even more accurate classification, we could add more features, based on color or texture histograms, statistical moments, etc. Maybe we can obtain a perfect classification by carefully defining a few hundred of these features? The answer to this question might sound a bit counter-intuitive: no we can not!”

Instead, simply adding more and more features, that is, increasing dimensionality, lessens the performance of the classifier. A graph shows a sharply descending line after a point labeled the “optimal number of features.” At that point a three-dimensional feature space makes it possible to fully separate the classes (still dogs and cats). When more features are added past the optimal amount, overfitting occurs and finding a general decision boundary without exceptions becomes difficult. The article goes on to suggest remedies such as cross-validation and feature extraction.

Chelsea Kerwin, June 05, 2014

Sponsored by ArnoldIT.com, developer of Augmentext

Free Trial of X1 Enterprise Client

June 3, 2014

X1 is offering a free fourteen-day trial of their desktop search engine, X1 Enterprise Client. Read more in the sneak preview:

“X1 Enterprise Client is a desktop search engine that automatically indexes files, email messages and contacts on your computer and returns instant results for your keyword searches. The results are organized in a tabbed interface, sorted by file type and provide a quick preview for most common file types including images, PDF files, Office files, ZIP files and many other formats. You can directly interact with the results by replying to emails, sending messages to contacts, opening files, playing music and also send any file as email attachment with the click of a button.”

This product could be a good investment for those who are not exactly careful as they label, name, and store files. Effective keyword search is the most useful tool in light of bad or nonexistent indexing. If you need a little more search in your workflow, and you do not want to be the one to impose the order, a solution like X1 Enterprise Client might be worth considering.
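Desktop search tools of this kind generally rest on an inverted index: a map from each term to the documents that contain it, consulted at query time instead of scanning files. A toy sketch of the idea, assuming plain-text “files” held as strings (this is an illustration, not X1’s actual implementation):

```python
import re
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index

def search(index, query):
    """AND-style keyword search: documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

docs = {
    "report.txt": "Quarterly search revenue was up six percent",
    "memo.txt": "Search team meeting moved to Friday",
    "notes.txt": "Revenue projections pending review",
}
index = build_index(docs)
print(search(index, "search revenue"))
```

Because the index is built once up front, each query is a cheap set intersection, which is why tools like X1 can return “instant results” regardless of how carelessly files were named and filed.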

Emily Rae Aldridge, June 03, 2014


News for the Non-Reader

May 23, 2014

Zeef is a new video publishing and distribution service that is still being developed and improved. Videos are becoming more and more popular, as users are inundated with a deluge of daily information. Zeef explains more about who they are and what they do on their “About” page.

It explains:

“With so much information online, finding the right product or service can become a time consuming and difficult task. ZEEF combines human (expert) knowledge, performance and customer ratings to help consumers find the best products and services online. We are still working hard on developing our product ZEEF.com.”

One area in which video struggles and continues to fall behind is search. Visit YouTube and try to find something specific; you will quickly run into weak indexing and poor findability. So while Zeef looks like a great resource for those who want to put video out onto the market, there is still no relief for those who need to search through existing content and pull video out.

Emily Rae Aldridge, May 23, 2014


TopSEOs: Relevance, Precision or Visibility?

May 22, 2014

I have a couple of alerts running for the phrase “enterprise search.” The information gathered is not particularly useful. Potentially interesting items like the rather amazing “Future of Search” are not snagged by either Google or Yahoo (Bing). I have noticed a surprising number of alerts about a company doing business as TopSEOS.com. The url is often presented as www.topseos.co.uk and there may be other variants.

Here’s a typical hit in a Google alert. This one appeared on May 22, 2014:

topseos

The link leads to a “story” in DigitalJournal.com, a “global media network.” The site is notable because it combines a wide range of topics, tweets, links, categories, and ads. If you want to know more about the service, you can read the About page and get precious little information about this Canadian company. The site appears to be a typical news aggregation service. The “story” is a news release distributed by Google-friendly PRWeb, located in San Francisco.

What is the TopSEOs’ story that appeared as an alert this morning?

The story is a news release about an independent team that evaluates search engine optimization companies. Here’s how the story in my alert looked to me on May 22, 2014:

topseos story

Several things jumped out at me about the story. First, it lacks substance. The key point is that TopSEOS.co.uk “analyzes market and industry trends in order to remain information of the most important developments which affect the performance of competing companies.” I am not sure exactly what this means, but it sounds sort of important. The link to www.topseos.co.uk redirects to www.uk-topseos.com/rankings-of-best-seo-companies:

