Possibilities for Solving the Problem of Dimensionality in Classification

June 5, 2014

VisionDummy’s overview of why indexing is hard is titled The Curse of Dimensionality in Classification. The article provides a surprisingly readable explanation, using the example of sorting images of cats and dogs. The first step would be creating features that assign values to the images (such as color or texture). From there, the article states,

“We now have 5 features that, in combination, could possibly be used by a classification algorithm to distinguish cats from dogs. To obtain an even more accurate classification, we could add more features, based on color or texture histograms, statistical moments, etc. Maybe we can obtain a perfect classification by carefully defining a few hundred of these features? The answer to this question might sound a bit counter-intuitive: no we can not!.”

Instead, simply adding more and more features, that is, increasing dimensionality, would lessen the performance of the classifier. A graph is provided with a sharply descending line after the point called the “optimal number of features.” At this point there would exist a three-dimensional feature space, making it possible to fully separate the classes (still dogs and cats). When more features are added past the optimal number, overfitting occurs and finding a general separation that holds without exceptions becomes difficult. The article goes on to suggest some remedies such as cross-validation and feature extraction.
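
One way to see why piling on features hurts is to count how thinly a fixed set of training images is spread as the feature space grows. The sketch below is an illustration only: the function names, the 10 training samples, and the 5 intervals per feature are assumptions chosen for the example, not figures reported in this post.

```python
def sample_density(n_samples: int, bins_per_feature: int, n_features: int) -> float:
    """Average number of training samples per cell when every feature
    axis is cut into `bins_per_feature` equal intervals."""
    n_cells = bins_per_feature ** n_features
    return n_samples / n_cells


def samples_needed(target_density: float, bins_per_feature: int, n_features: int) -> float:
    """Training samples required to keep `target_density` samples per cell."""
    return target_density * bins_per_feature ** n_features


if __name__ == "__main__":
    # Hypothetical setup: 10 labeled images, 5 intervals per feature.
    for n_features in (1, 2, 3, 10):
        print(
            n_features,
            sample_density(10, 5, n_features),
            samples_needed(2.0, 5, n_features),
        )
```

With the sample count held at 10, the density falls from 2 samples per cell with one feature to 0.08 with three. Keeping the original density would take 250 samples at three features, which is why a classifier given more features but no more data tends to fit noise.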

Chelsea Kerwin, June 05, 2014

Sponsored by ArnoldIT.com, developer of Augmentext

Free Trial of X1 Enterprise Client

June 3, 2014

X1 is offering a free fourteen-day trial of their desktop search engine, X1 Enterprise Client. Read more in the sneak preview:

“X1 Enterprise Client is a desktop search engine that automatically indexes files, email messages and contacts on your computer and returns instant results for your keyword searches. The results are organized in a tabbed interface, sorted by file type and provide a quick preview for most common file types including images, PDF files, Office files, ZIP files and many other formats. You can directly interact with the results by replying to emails, sending messages to contacts, opening files, playing music and also send any file as email attachment with the click of a button.”

This product could be a good investment for those who are not exactly careful as they label, name, and store files. Effective keyword search is the most useful tool in light of bad or nonexistent indexing. If you need a little more search in your workflow, and you do not want to be the one to impose the order, a solution like X1 Enterprise Client might be worth considering.

Emily Rae Aldridge, June 03, 2014


News for the Non-Reader

May 23, 2014

Zeef is a new video publishing and distribution service that is still being developed and improved. Videos are becoming more and more popular, as users are inundated with a deluge of daily information. Zeef explains more about who they are and what they do on their “About” page.

It explains:

“With so much information online, finding the right product or service can become a time consuming and difficult task. ZEEF combines human (expert) knowledge, performance and customer ratings to help consumers find the best products and services online. We are still working hard on developing our product ZEEF.com.”

One area in which video struggles and continues to fall behind is search. All you have to do is visit YouTube and try to find something specific in order to be faced with a lack of successful indexing and findability. So while Zeef looks like a great resource for those who want to put video out onto the market, there’s still no relief for those who need to search through existing content and pull video out.

Emily Rae Aldridge, May 23, 2014


TopSEOs: Relevance, Precision or Visibility?

May 22, 2014

I have a couple of alerts running for the phrase “enterprise search.” The information gathered is not particularly useful. Potentially interesting items like the rather amazing “Future of Search” are not snagged by either Google or Yahoo (Bing). I have noticed a surprising number of alerts about a company doing business as TopSEOS.com. The URL is often presented as www.topseos.co.uk and there may be other variants.

Here’s a typical hit in a Google alert. This one appeared on May 22, 2014:

[Image: topseos]

The link leads to a “story” in DigitalJournal.com, a “global media network.” The site is notable because it combines a wide range of topics, tweets, links, categories, and ads. If you want to know more about the service, you can read the about page and get precious little information about this Canadian company. This site appears to be a typical news aggregation service. The “story” is a news release distributed by Google-friendly PRWeb, located in San Francisco.

What is the TopSEOs’ story that appeared as an alert this morning?

The story is a news release about an independent team that evaluates search engine optimization companies. Here’s how the story in my alert looked to me on May 22, 2014:

[Image: topseos story]

Several things jumped out at me about the story. First, it lacks substance. The key point is that TopSEOS.co.uk “analyzes market and industry trends in order to remain information of the most important developments which affect the performance of competing companies.” I am not sure exactly what this means, but it sounds sort of important. The link to www.topseos.co.uk redirects to www.uk-topseos.com/rankings-of-best-seo-companies:

Read more

Ravn Amps Up Its Search Prowess

May 9, 2014

I read “RAVN Systems Revolutionises COWI’s SharePoint 2013 Search.” I learned several things. First, COWI means “a leading international consulting group with 50 remote locations.”

Next, RAVN delivers some performance assertions; for example:

In representative tests across their estate COWI have achieved a 57% reduction in indexing time of remote content, over 90% reduction in bandwidth usage during indexing and 70% reduction in time to preview compared with opening content. They have also estimated a saving of 12 physical servers.

Unfortunately there were no data about life before RAVN, the system’s throughput, etc. But the assertion is interesting.

Finally, the article states:

“RAVN Connect revolutionises SharePoint Search in distributed environments”.

I have heard this before from Fulcrum Technologies decades ago. I assume this time the nail in SharePoint’s findability coffin is hammered tight. No word from the legions of other SharePoint indexing systems, however.

Stephen E Arnold, May 9, 2014

RSuite Incorporates Temis into Content Management Platform

May 8, 2014

RSuite content management users can now tap into TEMIS, we learn from “RSuite CMS Leverages TEMIS’s Content Enrichment Capabilities to Deliver a Powerful Semantic Solution.” The partnership makes TEMIS’s semantic enrichment capabilities available to RSuite’s customers in the publishing, government, and corporate arenas. The deal was announced at this year’s MarkLogic World conference, held April seventh in San Francisco; both companies are MarkLogic partners.

The press release elaborates:

“RSuite CMS provides an intuitive user interface that minimizes actions required to execute complex searches across an entire set of content. The solution can globally apply metadata, dynamically organize massive amounts of documents into collections, package and distribute content to licensing partners, and enables customers to meet their multi-channel publishing goals.

“By leveraging TEMIS’s Luxid® Content Enrichment Platform, RSuite CMS can enable customers to automatically enrich their content with domain-specific metadata directly within their publishing workflows. This enables faster and more scalable content indexing, improved metadata consistency and governance, more efficient authoring, and more powerful search and discovery features within customer applications and portals.”

With its focus on publishing and media, RSuite strives to meet today’s ever-evolving publication challenges. The company serves such big names as HarperCollins, Audible, and Oxford University Press. RSuite was launched in 2000 and is located in Audubon, Pennsylvania.

With its collaborative platform, TEMIS adds domain-specific metadata to clients’ data, allowing publishers to supply more relevant information to their own audiences. TEMIS maintains several offices across Europe and North America.

Cynthia Murrell, May 08, 2014


OpenText: Poetry Is Better than Its Search Systems

May 1, 2014

OpenText has a special place in the Overflight archive. The company once sort of supported the Autonomy IDOL engine in something called RedDot. Then OpenText sells mainframey search systems like Information Dimensions’ now really old BASIS system and the BRS/Search system. Love those green screens! Somewhere inside the company is Dr. Tim Bray’s SGML search and data management system. And for the history buffs, can you name the 1983 technology that continues to influence Hummingbird, another OpenText information system? Now I am sure I have notes on the Nstein technology, a once much hyped search, indexing, and management system. I grow weary.

I just read “OpenText Launches Discovery Suite to Capture and Create Value in Big Content.” The write up announces something that OpenText has been selling for years. The buzzwordage is notable, and you can find my view of content processing jargon in this six minute video.

What I noted was the probably unintentional inclusion of some Latinate sentence structures and a near miss on a type of poetry not practiced since William Carlos Williams riffed on red wheelbarrows. Here’s the melodious sequence I noted:

OpenText can integrate the unintegrated, structure the unstructured, and manage the unmanaged.

I am sorely tempted to add some lines like “support the unsupported,” but I will not.

Stephen E Arnold, May 1, 2014

Yandex Profit Goes Up

April 24, 2014

Bloomberg’s real journalists reported some Web search news I found interesting. Navigate to “Yandex Profit Rises 19% on Russia Internet Advertising Demand.” Google gets the spotlight. Yandex warrants more attention. The English language search service at www.yandex.com is okay. The gem is the Yandex Russian service at www.yandex.ru. Content in this index is not easily available via US Web indexing services without the searcher’s performing some acrobatics. Yandex, however, is doing the me too thing. My hunch is that its usefulness will erode as the advertising revenue gains more traction. Precision, recall—just a distant memory for Bing and Google. Yandex’s utility may decline as the money rolls in. By the way, what happened to the Yandex search appliance?

Stephen E Arnold, April 24, 2014

Litigation Software dtSearch Demo

April 16, 2014

The dtSearch Desktop Demonstration Video on nlsblog.org shows how to set up and search with dtSearch for Windows. The 12-minute video begins with an introduction to dtSearch, which is able to “recognize text in over 200 common file types.” By indexing the locations of words in different files, dtSearch is able to build an almost limitless index of documents. The demo walks through the setup of dtSearch. After naming the index, the demo notes,

“It is important to keep in mind that when we add items here, dtSearch is not creating copies… but links to those files. A good practice is to put the files and folder that we want to run searches on into a single centralized location, before we create the index… all we need to do is add this discovery folder, and the subfolders and files will be automatically included…dtSearch reads the text in the linked files and creates a searchable words list.”

Then you are able to select which index to search through, and limit it to one case, or all cases. Each word appears with a number showing how often it appears in the index. Then you can add the keyword to the search request to find the documents in which the word appears. You are able to preview a document, copy a file, and create a search report. The demo goes into great detail about all of the search options, and should certainly be viewed in full to learn the best methods, but it does not provide metrics for the time required to build the initial index or update it. These metrics are useful.
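
The word list the demo describes behaves like an inverted index: a map from each word to the files and positions where it occurs. The following is a minimal sketch of that general idea, not dtSearch’s actual implementation; the function names, the sample file names, and the simple lowercase tokenizer are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def build_index(documents: dict) -> dict:
    """Map each word to {document name: [word positions in that document]}."""
    index = defaultdict(dict)
    for name, text in documents.items():
        # Assumed tokenizer: lowercase runs of letters, digits, apostrophes.
        words = re.findall(r"[a-z0-9']+", text.lower())
        for position, word in enumerate(words):
            index[word].setdefault(name, []).append(position)
    return dict(index)


def word_count(index: dict, word: str) -> int:
    """Total occurrences of `word` across every indexed document."""
    postings = index.get(word.lower(), {})
    return sum(len(positions) for positions in postings.values())


def documents_containing(index: dict, word: str) -> list:
    """Sorted names of the documents in which `word` appears."""
    return sorted(index.get(word.lower(), {}))


# Hypothetical discovery folder contents, invented for this example.
docs = {
    "deposition.txt": "The contract was signed. The contract was later amended.",
    "email.txt": "Please review the amended contract.",
}
index = build_index(docs)
print(word_count(index, "contract"))
print(documents_containing(index, "amended"))
```

The index stores only names and positions, not the text itself, which mirrors the demo’s point that the files are linked rather than copied. Timing `build_index` on a real folder would supply the kind of build and update metrics the video leaves out.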

Chelsea Kerwin, April 16, 2014


Content Management: A $12 Billion Market in 2019!

April 8, 2014

Now I enjoy crazy numbers. I recall that someone at Yahoo allegedly said to a New York Times reporter:

Yahoo estimates that it would cost $300 million to build a search service from scratch. [See New York Times, July 10, 2008, page C5. My story about this estimate is at http://wp.me/pf6p2-e9.]

Crazy number. Three hundred million would not buy a Web search system in 2008. Today it may cover the cost of jet fuel for Google’s fleet of airplanes.

But crazy numbers get traction and create “real news.”

I read “Enterprise Content Management Market worth $12.32 Billion by 2019.” Now that is an interesting estimate. The calculation surprised me for three reasons:

  1. The outfit promulgating the good “news” is selling a report, presumably to those in the content management sector who need reassurance.
  2. There was no mention of WordPress- and SquareSpace-type outfits, which seem to be moving ahead of the pack of name brand vendors.
  3. The assumption that I actually know what content management or CMS means.

Like search, the CMS vendors have been looking for a way to become more relevant. The implementations of Broadvision, Documentum, Interwoven, Vignette, and other well known CMS systems have had some successes and failures.

The “real” news about this report mentions some aspects of CMS that are similar to the scope creep visible in enterprise search. Here are some examples of what CMS embraces:

enterprise document management, enterprise document imaging and capture, enterprise web content management, enterprise records management, enterprise document collaboration, enterprise digital rights management, content analytics, rich media management, advanced case management, enterprise document output management, enterprise workflow management, and other solutions; by type of emerging applications: social content management, mobile content management, big data management, and cloud content management; by type of deployments: hosted and on-premises; by verticals: academia and education, banking, financial services and insurance (BFSI), consumer goods and retail, energy and power, government and defense, life science and healthcare, manufacturing, media and entertainment, telecom and IT, transportation, tourism, and hospitality, and other verticals; and by regions: North America (NA), Asia Pacific including Japan (APAC), Europe (EU), Middle East and Africa (MEA), and Latin America (LA).

This list is not helpful to me. I think the collection of jargon, buzzwords, and impressive sounding concepts is designed for Web indexing systems and to give a marginalized type of software some strap on muscles.

If information about the magnitude of the CMS market requires this type of verbal legerdemain, how credible is the report, the estimate, and maybe content management itself?

My personal view is that the buzzword content management, like knowledge management, is tough to define and may ultimately lack relevance in today’s business environment. The notion that a specious estimate adds value to those laboring in the CMS sector is amusing. The puffery, apologias, and jargon generated by those trying to sell systems that “manage” content cause me to chortle. Estimates of the volume of Big Data seem to fly in the face of “content management.” Even Google’s robots are struggling to keep pace with content proliferation based on my test queries.

At a time when organizations struggle to figure out what information is in their possession, CMS seems to have failed in its “mission”: Managing content.

CMS’ weakness is the notion of management itself. Since “management” is tough to define, content management sounds like a discipline cooked up by MBA hopefuls in an innovation study group.

Stephen E Arnold, April 7, 2014
