Another Content Management Company, Another Day

August 12, 2013

Content management companies are springing up and gaining attention due to the Big Data boom. One of the companies our content wranglers pulled out of an Internet search is Applied Relevance. The firm works across several parts of the content management spectrum, but its Web site prominently promotes its taxonomy services. Applied Relevance offers the AR-Classifier tagging engine, which runs on a variety of platforms. AR-Semantics is the flagship organization and categorization software, AR-Taxonomy is the tool for editing and managing taxonomies, and AR-Navigator is available if you want to search your taxonomies.

All this talk about Applied Relevance’s taxonomy software is informative, but what is interesting is the company’s description on the main page:

“Applied Relevance produces software and services to help enterprise users find the information they need. Our solutions augment traditional search engines by providing context for the search results. The AR toolset and our partners provide cost effective technology for the full spectrum of enterprise content management and search applications. With our tools, a search term and a few clicks, users can zero-in past ambiguities and come up with the right answer in the right context. Applied Relevance is located on the west coast of the east coast of North America.”

Descriptive, but not a word on taxonomy or what exactly the company does. The tagline at the end about Applied Relevance’s location is even more ambiguous.

Whitney Grace, August 12, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Ten Big Data Best Practices

August 11, 2013

Having to manage large quantities of unstructured data is a uniquely contemporary challenge, and many folks are still unsure how to approach the issue. A slideshow from eWeek offers some clarity in, “Managing Massive Unstructured Data Troves: 10 Best Practices.” Writer Chris Preimesberger introduces his list:

“Managing unstructured data is extremely important to reduce storage and compliance costs while minimizing corporate risk. This task has been painfully difficult due to the time, resources and overhead required to collect the immense volume of metadata and digest it into actionable business intelligence. The job isn’t going to get any easier all by itself. Gartner analysts predict unstructured data will grow a whopping 800 percent over the next five years, and that 80 (or more) percent of that new data will also be unstructured. If enterprises don’t have the right software to prepare for these forthcoming storage issues, they had better start planning to do something about it.”

The slideshow is designed to help organizations do just that. Its creators consulted a number of sources: research from analytic firms, tips from industry insiders, and an IBM survey of 1,500 CEOs [PDF]. The presentation begins with the assessment of an enterprise’s needs and current setup; encourages the elimination of outdated, inefficient approaches; and describes how to tackle the issue with today’s technology. It is a good starting point for anyone who foresees a big-data implementation (or upgrade) in their future.

Cynthia Murrell, August 11, 2013

Sponsored by ArnoldIT.com, developer of Augmentext

Information Labeled As Black and White

August 9, 2013

Before the dawn of the Information Age, there was perhaps more general awareness of the ethics of information and its use. The Digimind article “Different Shades of Big Data” reminds us that this is a topic everyone needs an occasional refresher on.

The article draws a distinction between two types of data, calling public and shared data white information and referring to information not legally available to the public as black information.

Of course, there are some types of information that fall into grey categories as well:

“Apart from the obvious black and white information, there are various shades of grey, often referred to as the invisible web. This kind of information is available on the internet, but not easily accessible from a basic search engine, such as Google, Yahoo!, Bing, etc. Some examples are:

-Newspapers and journals with a subscription, usually protected by a login and password

-Information indexed in search engines embedded on other sites

-Blogs and forums not listed by major search engines”

What label would the author choose for information found on the Deep Web? Also, it is interesting that the author picked the descriptors black and white when, thanks to sponsored results, the objectivity of search results is anything but black and white.

Megan Feil, August 09, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

First Super Scalable LDAP Directory Driven by Big Data

August 9, 2013

We read a rather lengthy summary of a new solution billed as a “world’s first.” Yahoo Finance published the press release from Radiant Logic about its commercial solution for distributed storage and processing for enterprise identity management. “Radiant Logic Introduces HDAP: The World’s First Super Scalable LDAP Directory Driven by Big Data and Search Technology” supplies the details.

Radiant Logic announced the solution at the Cloud Identity Summit in Napa. Based on Hadoop, the new version allows enterprises to channel the power of large-cluster and “elastic” computing in their identity infrastructure.

The article tells us:

“With HDAP as part of the upcoming RadiantOne 7.0 virtualization release, companies can radically scale their access and throughput, using the first highly scalable and secure directory that’s based on big data and search technology. A diverse array of forces, from federation and the cloud to an increasingly mobile workforce, is putting escalating pressure on the enterprise identity system. To keep up with authentication and authorization demands, while tapping into greater use of personalization and recommendation engines, companies need a richer view of their identity, along with better performance and greater flexibility.”

We may not typically share articles that use terminology like LDAP (Lightweight Directory Access Protocol); however, we do understand those magical words “business value” that appear so close by on the page.
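
For readers who have not met the protocol, the sketch below shows what an ordinary LDAP lookup looks like from client code; identity systems fire queries like this on every authentication or authorization check, which is where the scaling pressure comes from. This is a minimal, generic illustration using Python’s ldap3 library; the server address, credentials, and directory tree are hypothetical and have nothing to do with Radiant Logic’s HDAP internals.

    import ldap3  # third-party library: pip install ldap3

    # Connect to a (hypothetical) directory server and authenticate.
    server = ldap3.Server("ldap://directory.example.com", port=389)
    conn = ldap3.Connection(server,
                            user="cn=admin,dc=example,dc=com",
                            password="secret",
                            auto_bind=True)

    # Look up one user's entry and group memberships.
    conn.search(search_base="ou=people,dc=example,dc=com",
                search_filter="(uid=jdoe)",
                attributes=["cn", "mail", "memberOf"])

    for entry in conn.entries:
        print(entry)

    conn.unbind()

The point of HDAP, per the press release, is to serve this kind of lookup at much larger scale; the protocol on the wire stays the same.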

Megan Feil, August 09, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Red Hat Partners with MongoDB

August 8, 2013

Red Hat is a major leader in the world of open source. Founded in 1993, the company is considered one of the forerunners of the present-day open source boom. So the latest Red Hat news is usually a harbinger and is worth following. Read the latest from Red Hat in the PC World article, “Red Hat Enterprise Linux Gets Cozy with MongoDB.”

The article describes the recent Red Hat partnership with MongoDB:

“Easing the path for organizations to launch big data-styled services, Red Hat has coupled the 10gen MongoDB data store to its new identity management package for the Red Hat Enterprise Linux (RHEL) distribution . . . Although it already has been fairly easy to set up a copy of MongoDB on RHEL — by using Red Hat’s package installation tools — the new integration minimizes a lot of work of initializing new user and administrator accounts in the data store software.”
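
To make the “initializing new user and administrator accounts” point concrete, here is a minimal sketch of the sort of setup chore the integration reportedly automates. It uses Python’s pymongo driver; the host, account names, passwords, and role choices are our own illustrative assumptions, not part of the Red Hat package.

    from pymongo import MongoClient  # pip install pymongo

    # Connect to a (hypothetical) freshly installed MongoDB instance.
    client = MongoClient("mongodb://localhost:27017")

    # Create an administrator account, one of the manual steps the
    # RHEL identity-management integration is said to handle for you.
    client.admin.command(
        "createUser", "siteAdmin",
        pwd="change-me",
        roles=[{"role": "userAdminAnyDatabase", "db": "admin"}],
    )

    # Create an ordinary application user on a single database.
    client["inventory"].command(
        "createUser", "appUser",
        pwd="also-change-me",
        roles=[{"role": "readWrite", "db": "inventory"}],
    )

Multiply those few lines by every data store, user, and environment in a large shop, and the appeal of wiring MongoDB accounts into a central identity system becomes obvious.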

The partnership between Red Hat and MongoDB can only mean good things for the open source community. In fact, we have been seeing more and more of these like-minded partnerships over the last several months. LucidWorks announced a partnership with MapR to strengthen its LucidWorks Big Data offering. LucidWorks is worth keeping an eye on, as the company is constantly seeking innovation and advancement for the open source community.

Emily Rae Aldridge, August 8, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Spotter Makes its Name with Sarcasm

August 5, 2013

While we are generally cheerleaders for all things big data and analytics, we are not blind to the field’s weaknesses. One major weakness of most big data platforms would give them a devil of a time parsing much from, say, an episode of Seinfeld. That’s right: we’re talking about the inability to detect sarcasm. However, a recent Slashdot piece, “Tech Companies Looking into Sarcasm Detection,” suggests one company might have the answer.

According to the story:

Spotter’s platform scans social media and other sources to create reputation reports for clients such as the EU Commission and Air France. As with most analytics packages that determine popular sentiment, the software parses semantics, heuristics and linguistics. However, automated data-analytics systems often have a difficult time with some of the more nuanced elements of human speech, such as sarcasm and irony — an issue that Spotter has apparently overcome to some degree, although company executives admit that their solution isn’t perfect. (Duh.)
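
To see why sarcasm trips up sentiment engines, consider a toy lexicon-based scorer, the crudest version of the “semantics, heuristics and linguistics” parsing mentioned above. The sketch below is our own illustration of the general failure mode, not Spotter’s method:

    # A toy polarity scorer: sum word-level sentiment weights.
    LEXICON = {"love": 2, "great": 2, "wonderful": 2,
               "hate": -2, "terrible": -2, "delay": -1}

    def polarity(text):
        return sum(LEXICON.get(word.strip(".,!"), 0)
                   for word in text.lower().split())

    # A sarcastic complaint about an airline scores as positive,
    # because every cue ("love", "wonderful") is literally positive.
    print(polarity("I just love a wonderful six-hour delay."))  # 3

Recovering the intended (negative) meaning requires context that the bag-of-words model throws away, which is exactly the gap Spotter claims to have narrowed.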

Spotter is really making a name for itself. We fell in love with the company a long while ago, after an ArnoldIT interview set the tone. This is a sharp company, and if its sarcasm detection comes through, it will be an industry leader.

Patrick Roland, August 05, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Crowdsourcing Helps Keep Big Data Companies Straight

August 4, 2013

As big data analytics picks up steam, we are seeing more and more interesting outlets for learning about the different platforms on offer; not just catalogs and boastful corporate sites, but insightful criticism. One recent stop was the “About” page of Bamboo DiRT.

According to the site:

Bamboo DiRT is a tool, service, and collection registry of digital research tools for scholarly use. Developed by Project Bamboo, Bamboo DiRT is an evolution of Lisa Spiro’s DiRT wiki and makes it easy for digital humanists and others conducting digital research to find and compare resources ranging from content management systems to music OCR, statistical analysis packages to mindmapping software.

One look at its tips for analyzing data and we were sold. Here we were turned on to such intriguing tools as 140kit and Dataverse. The user-supplied recommendations were the best part. About Dataverse, it said: “Researchers and data authors get credit, publishers and distributors get credit, affiliated institutions get credit.” Concise and covering all the needed vitals, this type of crowdsourced recommendation site could really catch on as the world of big data analytics keeps growing beyond most users’ capacity to keep up.

Patrick Roland, August 04, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Search and Null: Not Good News for Some

August 3, 2013

I read “How Can I Pass the String ‘Null’ through WSDL (SOAP)…” My hunch is that only a handful of folks will dig into this issue. Most senior managers buy the baloney generated by search and content processing vendors. Yesterday, for one of the outfits publishing my “real” (for-fee) columns, I reviewed a slide deck stuffed full of “alls” and “everys.” The message was that this particular modern system, which boasted a hefty price tag, could do just about anything one wanted with flows of content.

Happily overlooked was the problem of a person with a wonky name. Case in point: “Null”. The link from Hacker News to the Stackoverflow item gathered a couple of hundred comments. You can find these here. If you are involved in one of the next-generation, super-wonderful content processing systems, you may find a few minutes with the comments interesting and possibly helpful.
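
The underlying glitch is easy to reproduce. Many serialization and parsing layers conflate the token “Null” with “no value,” so a person actually named Null simply vanishes. Here is a minimal sketch of the failure mode in Python, standing in for the SOAP toolchain discussed in the Stackoverflow thread:

    def naive_deserialize(field):
        # Many hand-rolled parsers treat the token "null" as "no value",
        # which is exactly the conflation behind the Stackoverflow question.
        if field is None or field.strip().lower() == "null":
            return None
        return field

    names = [naive_deserialize(n) for n in ["Smith", "Null", "Jones"]]
    print(names)  # ['Smith', None, 'Jones']: Mr. Null has disappeared

    # Downstream, a search for the surname "Null" matches nothing,
    # because the value was destroyed before it reached the index.
    print([n for n in names if n == "Null"])  # []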

My scan of the comments plus the code in the “How Can I” post underscored the disconnect between what people believe a system can do and what a here-and-now system can actually do. Marketers say one thing, buyers believe another, and the installed software does something completely different.

Examples:

  1. A person’s name, in this case “Null,” cannot be located in a search system. With all the hoo-hah about Fancy Dan systems, is this issue with a named entity important? I think it is, because it means that certain entities may not be findable without expensive, time-consuming human curation and indexing. Oh, oh.
  2. Non-English names pose additional problems. Migrating a name in one language into a string that a native speaker of a different language can understand introduces errors. Instead of finding one person, the system finds multiple people (see the sketch after this list). Looking for a batch of 50 people, each incorrectly identified during processing, generates a lot of names, which guarantees more work for expensive humans or many, many false drops. Operate this type of entity extraction system a number of times and one generates so much work there is not enough money or people to figure out what’s what. Oh, oh.
  3. Validating named entities requires considerable work. Knowledgebases today are built automatically and on the fly. Rules are no longer created by humans; rules, like some of Google’s “janitor” technology, figure themselves out, and then “workers” modify those rules on the fly. So what happens when errors are introduced via “rules”? The system keeps on truckin’. Anyone who has worked through fixing up the known tags from a smart system like Autonomy IDOL knows that degradation can set in when the training set does not represent the actual content flow. Any wonder why precision and recall scores have not improved much in the last 20 years? Oh, oh.
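
The transliteration problem in point two is easy to demonstrate. In the hypothetical sketch below, distinct names collapse onto one normalized string, so a query for either spelling returns multiple people, a classic source of false drops. The normalization step is illustrative, not drawn from any particular vendor’s system:

    import unicodedata

    def normalize(name):
        # Strip accents and fold case, a common lossy step in
        # cross-language name matching.
        decomposed = unicodedata.normalize("NFKD", name)
        ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
        return ascii_only.lower()

    people = ["Müller", "Muller", "Nuñez", "Nunez"]
    index = {}
    for person in people:
        index.setdefault(normalize(person), []).append(person)

    print(index)
    # {'muller': ['Müller', 'Muller'], 'nunez': ['Nuñez', 'Nunez']}
    # A search for "Muller" now returns two distinct individuals;
    # multiply by 50 names per batch and the disambiguation work
    # lands on expensive humans.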

I think this item about “Null” highlights the very real and important problems with assumptions about automated content processing, whether the corpus is a telephone directory with a handful of names or the mind-boggling flows which stream from various content channels.

Buying a system does not solve long-standing, complicated problems in text processing. Fast talk like that which appears in some of the Search Wizards Speak interviews does not change the false drop problem.

So what does this mean for vendors of Fancy Dan systems? Ignorance on the part of buyers is one reason deals may close. What does this mean for users of systems which generate false drops and, in turn, reports which are off base? Ignorance on the part of users makes it easy to use “good enough” information to make important decisions.

Interesting, Null?

Stephen E Arnold, August 3, 2013

Sponsored by Xenky

Treparel Makes Big Data Waves Overseas

August 3, 2013

Belgium is not a country we instantly associate with big data dominance. But the small nation has recently proven that it has an excellent eye for analytics and for who does a good job. We discovered just how from a Treparel article, “Treparel Wins LT-Innovate Award 2013.”

Treparel just recently announced its new strategy to collaborate with other software and solution vendors to enhance their solutions with advanced content analytics and visualizations using the KMX API. Winning the LT-Innovate Award 2013 is affirmation from colleagues in the language technology and text analytics industry in Europe that we are in the right place, at the right time, on the right track. And it’s a salute to the commitment and hard work of our growing team!

Frankly, this company has been on not just the Belgian radar but ours as well. We fell head over heels after surfing its website and discovering a powerful vision: “Our solutions use the SVM algorithm in our unique methodology that dramatically changes the way people obtain information from data by means of text mining and visualization.” Keep this company in mind; we suspect this Belgian award is not the last big trophy it will snag.
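
For readers wondering what “the SVM algorithm… by means of text mining” looks like in practice, here is a minimal, generic sketch of SVM-based text classification using Python’s scikit-learn. It illustrates the technique in general; it is not Treparel’s KMX implementation, and the tiny corpus is invented for the example:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    # Tiny labeled corpus: documents about two topics.
    docs = ["stock markets rallied on strong earnings",
            "central bank raises interest rates",
            "team wins the championship final",
            "star striker scores twice in the derby"]
    labels = ["finance", "finance", "sports", "sports"]

    # Turn text into TF-IDF vectors, then fit a linear SVM.
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)
    classifier = LinearSVC()
    classifier.fit(X, labels)

    # Classify an unseen document built from familiar vocabulary.
    new_doc = vectorizer.transform(["striker scores in championship final"])
    print(classifier.predict(new_doc))  # ['sports']

Distances from the fitted decision boundary (via classifier.decision_function) are one common basis for the kind of visual document mapping the company describes.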

Patrick Roland, August 03, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Big Data and Its Less-Than-Gentle Lessons

August 1, 2013

I read “9 Big Data Lessons Learned.” The write-up is interesting because it explores the buzzword that every azure chip consultant has used in marketing pitches over the last year. Some true believers have the words Big Data tattooed on their arms like those mixed martial arts fighters sporting the names of casinos. Very attractive, I say.

Because “big data” has sucked up search, content processing, and analytics, the term is usually not defined. The “problems” of Big Data are ignored. Since not much works when it comes to search and content processing, use of another undefined term is not particularly surprising. What caught my attention is that Datamation reports about some “lessons” its real journalists have tracked down and verified.

Please, read the entire original write up to get the full nine lessons. I want to highlight three of them:

First, Datamation points out that getting data from Point A to Point B can be tricky. I think that once the data has arrived at Point B, the next task is to get the data into a “Big Data” system. Datamation does not provide any cost information in its statement “Don’t underestimate the data integration challenges.” I would point out that the migration task can be expensive. Real expensive.

Second, Datamation states, “Big Data success requires scale and speed.” I agree that scale and speed are important. Once again, Datamation does not bring these generalizations down to an accounting person’s desktop. Scale and speed cost money. Often a lot of money. In the analysis I did of “real time” a year or two ago, chopping latency down to a millisecond or two exponentiates the cost of scale and speed. Bandwidth and low-latency storage are not sporting WalMart price tags.

Third, Datamation warns (maybe threatens) those with children in school and mortgages with, “If you’re not in the Big Data pool now, the lifespan of your career is shrinking by the day.” A couple of years ago this sentence would have said, “If you’re not in the social media pool now, the lifespan of your career is shrinking by the day.” How long will these all-too-frequent “next big things” sweep through information technology? I just learned that “CIO” means chief innovation officer. I also learned that the future of computing rests with synthetic biology.

The Big Data revolution is here. The problem is that the tools, the expertise, and the computational environment are inadequate for most Big Data problems. Companies with resources, like Google and Microsoft, are trimming the data in order to get a handle on what today’s algorithms assert is important. Is it reasonable to think that most organizations can tackle Big Data when large organizations struggle to locate attachments in intra-organization email?

Reality has not hampered efforts to surf on the next big thing. Some waves are more challenging than others, however. I do like the fear angle. Nice touch at a time when senior managers are struggling to keep revenues and profits from drifting down. The hope is that Big Data will shore up products and services which are difficult to sell.

Catch the wave I suppose.

Stephen E Arnold, August 1, 2013

Sponsored by Xenky
