Temis and MarkLogic: Timid? Not on the Semantic Highway

April 12, 2013

My inbox overfloweth. Temis has rolled out a number of announcements in the last 10 days. The company is one of the many firms offering “semantic” technology. Due to the vagaries of language, Temis is in the “content enrichment” business. The idea is that the technology indexes keywords and concepts even when a concept is not expressed explicitly in a text document. I call this indexing, but “enrichment” is certainly okay.

The first announcement which caught my attention was a news release I saw on Marketwatch, the for-fee distribution service. The title of the article was “TEMIS Completes Successful Wide Scale Semantic Content Enrichment Test in Windows Azure.” A news release about a test struck me as unusual. The key point for me was that Temis is positioning itself to go after the SharePoint add-in market.

The second announcement was a news story distributed by EurekAlert called “Wiley Selects Temis for Semantic Big Data Initiative.” The key point is that a venerable publishing company, which until recently stuck with traditional methods and products, has licensed software to do what humans used to do. Will Temis propel John Wiley to the top of the leader board of professional publishers? Hopefully some information will become available quickly.

The third announcement which I noted was “Temis and MarkLogic Strengthen Strategic Alliance.” The write up hits the concepts of semantics and big data. Here’s the passage which intrigued me:

MarkLogic® Server is the only enterprise NoSQL database designed for building reliable, scalable and secure search, analytics and information applications quickly and easily. The platform includes tools for fast application development, powerful analytics and visualization widgets for greater insight, and the ability to create user-defined functions for fast and flexible analysis of huge volumes of data.

I am uncomfortable with the notion of “only.” MarkLogic is an XML-centric data management system. Software wrappers can use the XML back end for a range of applications, from something as exotic as a Web site for the US Army to more sophisticated applications for publishing technical documents for an aircraft manufacturer. However, there are a number of ways to accomplish these tasks, and some of the options make use of somewhat similar technology; for example, eXist-db. While not perfect, the fact that an alternative exists only increases my discomfort with an “only.”

So what’s up? My hunch is that both MarkLogic and Temis are in flat-out marketing mode. Clusters of announcements are, in my experience, an indication that the pipeline needs to be filled. Equally surprising is that MarkLogic has morphed into a big data player and an enterprise search system, not a publishing system. Most vendors are morphing. The tie-up suggests that Temis’ back end needs some beefing up. The MarkLogic positioning is that it is now a player in semantics and big data. I think that partnering is a quick way to fill gaps.

Will MarkLogic blast through the $100 million revenue ceiling? Will Temis emerge as a giant slayer in semantic big data? MarkLogic recently raised $25 million to become a player in big data. (See “Big Data Boon: MarkLogic Pulls In $25 Million In VC Funding”.) Converting $25 million into high margin revenue could tax the likes of Jack Welch in his prime.

My hunch is that both firms’ management teams have this as a 2013 goal. With the patience of investors wearing thin for many search and content processing vendors, closed deals are a must. The economy may be improving for analysts on CNBC, but for search vendors, making Autonomy-scale or Endeca-scale revenues may be difficult, if not impossible.

In my opinion, the labels “big data” and “semantics” do not by themselves deliver revenue the way Google delivers AdWords. As more search firms chase additional funding, has the world of search switched from finding information for customers to getting money to stay in business?

No timidity visible as these two firms race down the semantic interstate.

Stephen E Arnold, April 12, 2013

All About Solr

April 8, 2013

Apache Solr has already claimed a place as one of the most popular and sought-after search applications currently on the market. The Apache Solr platform uses Lucene to power its indexing and querying abilities. The Eventbrite posting “Solr Unleashed SC,” which we translated using Google Translate, gives details about the upcoming Solr Unleashed training class on June 13, 2013 in Brazil.

“Solr Unleashed is a complete, hands-on training covering Solr 4 and SolrCloud. SolrCloud is a complete restructuring of Solr to facilitate Big Data installations. It allows distributed indexing as well as distributed search, eliminating the need for a master-slave configuration.”
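
For readers who have not touched Solr, the index-and-query loop the class will drill is short. Below is a minimal sketch using Python and the requests library against a hypothetical local Solr instance with a core named collection1; the host, core, and field names are assumptions for illustration and would need to exist in the schema.

import requests

SOLR = "http://localhost:8983/solr/collection1"  # assumed local core

# Index one document via the JSON update handler; commit so it is searchable at once.
doc = [{"id": "doc-1", "title": "Solr Unleashed", "text": "distributed indexing with SolrCloud"}]
requests.post(f"{SOLR}/update/json?commit=true", json=doc).raise_for_status()

# Query it back with the standard select handler.
resp = requests.get(f"{SOLR}/select", params={"q": "title:unleashed", "wt": "json"})
print(resp.json()["response"]["numFound"])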

The course will be spread out over two 8-hour days. Students will need to bring their own computers and will get the chance to develop a complete application. This application will be a real search prototype, and students will learn it well enough that it can potentially be used for future projects. In addition, students will receive an official LucidWorks certification and a digital copy of all the course material. The material will be in English, but the course will be taught in Portuguese. Semantix, a LucidWorks partner company, will be giving the class. During the class students will get not only an in-depth introduction to Solr but also an up close and personal look at the new open source release, Solr 4. It’s great to see Solr growing and spreading into other languages. Looks like regardless of the language, search is where it’s at.

April Holmes, April 08, 2013

Sponsored by ArnoldIT.com, developer of Augmentext

Newest Version of MongoDB Includes Text Search

April 6, 2013

Some welcome enhancements to MongoDB are included in the open source database’s latest release, we learn from “MongoDB 2.4 Can Now Search Text,” posted at The H Open. The ability to search text indexes has been one of the most requested features, and the indexing supports 14 languages (or no language at all). The write-up supplies a handy link to a discussion of techniques for creating and searching text indexes.
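
As a concrete illustration, here is a minimal pymongo sketch, assuming a hypothetical database named newsdb and a collection named articles with a text field named body. Note that MongoDB 2.4 ships with text search disabled by default and exposes it through the text database command; the $text query operator arrived in a later release.

import pymongo

client = pymongo.MongoClient("localhost", 27017)
db = client["newsdb"]  # hypothetical database name

# Create a text index on the body field. In 2.4 the server must be started
# with --setParameter textSearchEnabled=true for this to succeed.
db.articles.create_index([("body", pymongo.TEXT)])

# MongoDB 2.4 runs text searches through the "text" database command;
# later releases use the $text query operator instead.
result = db.command("text", "articles", search="semantic enrichment", limit=10)
for hit in result.get("results", []):
    print(hit["score"], hit["obj"].get("title"))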

The post describes a second set of features in MongoDB 2.4, hash-based sharding and new geospatial indexes:

“Hash-based sharding allows data and CPU load to be spread well between distributed database nodes in a simple to implement way. The developers recommend it for cases of randomly accessed documents or unpredictable access patterns. New Geospatial indexes with support for GeoJSON and spherical geometry allow for 2dsphere indexing; this, in turn, offers better spherical queries and can store points, lines and polygons.”
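
For reference, both of the features described in the quote are one-line declarations from a driver. A hedged sketch, again with pymongo and invented database and collection names, assuming a mongos router is running locally:

import pymongo

# Connect to a mongos router (assumed to be on the default port) for the sharding commands.
mongos = pymongo.MongoClient("localhost", 27017)

# Hash-based sharding: distribute documents across shards by a hash of _id.
mongos.admin.command("enableSharding", "newsdb")
mongos.admin.command("shardCollection", "newsdb.articles", key={"_id": "hashed"})

# 2dsphere index for GeoJSON points, lines, and polygons stored in a "loc" field.
mongos["newsdb"]["articles"].create_index([("loc", pymongo.GEOSPHERE)])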

There is also a new modular authentication system, though its availability is limited so far. The project has also added support for fixed-size arrays in documents, optimized counting performance in the execution engine, and added a working set size analyzer. See the article for more details, or see the release notes, which include upgrade instructions. The newest version can be downloaded from the MongoDB Web site.

Cynthia Murrell, April 06, 2013

Sponsored by ArnoldIT.com, developer of Augmentext

A Cause for Celebration

April 1, 2013

The best way to mark the successful completion of a project is with a celebration, and no celebration is complete without a cake. Synaptica definitely knows how to throw a celebration party. According to the Synaptica Central piece “Elsevier Celebrates New Installation,” Synaptica and Elsevier recently celebrated the successful completion of their software development project with a tasty cake.

“It is a pleasure when one of our customers has a specially decorated cake made to celebrate the successful deployment of their customized Synaptica taxonomy management software. The project, completed this month, was a collaboration between Synaptica and the content management team at Elsevier, Netherlands.”

Elsevier got its start with journal and book publishing but is also known for providing scientific, technical and medical information as well as various other products. Synaptica was started in 1995 and is owned by Trish Yancey and Dave Clarke. It is an industry leader in taxonomy management and ontology software. The software gives users several key benefits, such as increased relevance thanks to a synonym-rich indexing vocabulary and the ability to visualize taxonomies in a variety of textual and graphical formats. Synaptica software can work in the enterprise world and has been integrated with several different third-party applications. In addition, Synaptica is user friendly and can be set up in a matter of minutes. Synaptica taxonomy software is used by a variety of organizations for their metadata management and information access applications. The company even received the “100 Companies that Matter” recognition. Looks like they definitely have a reason to celebrate.

April Holmes, April 01, 2013

Sponsored by ArnoldIT.com, developer of Augmentext

Soutron and EBSCO Join Forces

April 1, 2013

Could the library be a gold mine just waiting to be tapped for its financial resources? The Examiner article “Soutron and EBSCO Enter Partnership Agreement” talks about the technology partnership that Soutron Global and EBSCO forged. With this new partnership, Soutron Global will begin to integrate EBSCO Discovery Service with Soutron’s Library and Knowledge Management system. This collaboration will provide clients with a single integrated search environment that they can use for research and information resources. Tony Saadat, President and CEO of Soutron Global, made the following statement.

“This partnership means that libraries, knowledge management centers, and information resource portals can ensure optimal access to knowledge assets, physical resources, and digital resources, thus ensuring optimal exploitation of resources.”

EBSCO Publishing is the company behind EBSCOhost, a fee-based online research service. A variety of libraries, including educational, medical, and public institutions, use EBSCO services. The company asserts that EBSCO Discovery Service (EDS) provides better indexing and full-text searching than any other discovery service. Graham Beastall, Managing Director, UK, had the following to say regarding the collaboration.

“Soutron is very excited to be working with EBSCO on what we regard as a key initiative to develop access to digital and physical resources in an organization. It will allow us to offer customers using Soutron additional opportunities to maximize use of their collection through EDS single search indexing technologies. Our goal is to make life easier for end users and for library managers.”

Never really thought of library catalogs as a path to financial security, but could they be the next technology gold mine? Looking at the big picture, I think the answer is no. Most libraries already work on a limited budget, and it’s unlikely that they will suddenly get additional funds. With its proven technology, EBSCO should focus on acquiring library cataloging and services companies for an extra boost. “Might as well be all or nothing.”

April Holmes, April 01, 2013

Sponsored by ArnoldIT.com, developer of Augmentext

Promise Best Practices: Encouraging Theoretical Innovation in Search

March 29, 2013

The photo below shows the goodies I got for giving my talk at Cebit in March 2013. I was hoping for a fat honorarium, expenses, and a dinner. I got a blue bag, a pen, a notepad, a 3.72 gigabyte thumb drive, and numerous long walks. The questionable hotel in which I stayed had no shuttle. Hitchhiking looked quite dangerous. Taxis were as rare as an educated person in Harrod’s Creek, and I was in the same city as Leibniz Universität. Despite my precarious health, I hoofed it to the venue, which was eerily deserted. I think only 40 percent of the available space was used by Cebit this year. The hall in which I found myself reminded me of an abandoned subway stop in Manhattan with fewer signs.


The PPromise goodies. Stuffed in my bag were hard copies of various PPromise documents. The bulkiest of these in terms of paper were also on the 3.72 gigabyte thumb drive. Redundancy is a virtue, I think.

Finally, on March 23, 2013, I got around to snapping the photo of the freebies from the PPromise session and reading a monograph with this moniker:

Promise Participative Research Laboratory for Multimedia and Multilingual Information Systems Evaluation. FP7 ICT 20094.3, Intelligent Information Management. Deliverable 2.3 Best Practices Report.

The acronym should be “PPromise,” not “Promise.” The double “P” makes searching for the group’s information much easier in my opinion.

If one takes the first letter of “Promise Participative Research Laboratory for Multimedia and Multilingual Information Systems Evaluation” one gets PPromise. I suppose the single “P” was an editorial decision. I personally like “PP” but I live in a rural backwater where my neighbors shoot squirrels with automatic weapons and some folks manufacture and drink moonshine. Some people in other places shoot knowledge blanks and talk about moonshine. That’s what makes search experts and their analyses so darned interesting.

To point out the vagaries of information retrieval, my search for a publicly accessible version of the PPromise document returned a somewhat surprising result.


A couple more queries did the trick. You can get a copy of the document without the blue bag, the pen, the notepad, the 3.72 gigabyte thumb drive, and the long walk at http://www.promise-noe.eu/documents/10156/086010bb-0d3f-46ef-946f-f0bbeef305e8.

So what’s in the Best Practices Report? Straightaway you might not know that the focus of the whole PPromise project is search and retrieval. Indexing, anyone?

Let me explain what PPromise is or was, dive into the best practices report, and then wrap up with some observations about governments in general and enterprise search in particular.

Read more

Search Evaluation in the Wild

March 26, 2013

If you are struggling with search, you may be calling your search engine optimization advisor. I responded to a query from an SEO expert who needed information about enterprise search. His clients, as I understood the question, were seeking guidance from a person with expertise in spoofing the indexing and relevance algorithms used by public Web search vendors. (The discussion appeared in the Search-Based Applications (SBA) and Enterprise Search group on LinkedIn. Note that you may need to be a member of LinkedIn to view the archived discussion.)

The whole notion of turning search into marketing has interested me for a number of years. Our modern technology environment creates a need for faux information. The idea, as Jacques Ellul pointed out in Propaganda, is that modern man needs something to fill a void.

How can search deliver easy, comfortable, and good enough results? Easy. Don’t let the user formulate a query. A happy quack to Resistance Quotes.

It, therefore, makes perfect sense that a customer who is buying relevance in a page of free Web results would expect an SEO expert to provide similar functionality for enterprise search. Not surprisingly, the notion of controlling search results based on an externality like keyword stuffing or content flooding is a logical way to approach enterprise search.

Precision, recall, hard metrics about indexing time, and the other impedimenta of the traditional information retrieval expert are secondary to results. Like the metrics about Web traffic, a number is better than no number. Even if the number’s flaws are not understood, the number is still better than nothing. In fact, the entire approach to search as marketing is based on results which are good enough. One can see the consequences of this thinking when one runs a query on Bing or on systems which permit users’ comments to influence relevancy. Vivisimo activated this type of value adding years ago, and it remains a good example of trying to make search useful. The result list which forces the user to work through a laundry list of documents and determine what is useful is gone. If a document has internal votes of excellence, that document is the “right” one. Instead of precision and recall, modern systems deliver “good enough” results. The user sees one top hit and assumes that the system has made more informed decisions than he or she could.

There is a downside to the good enough approach to search, which delivers a concrete result that, like Web traffic statistics, looks so solid, so meaningful. That downside is that the user consumes information which may not be accurate, germane, or timely. In the quest for better search, good enough trumps the mentally exhausting methods of the traditional precision and recall crowd.
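
For readers who have not wrestled with the traditional metrics, precision and recall are simple ratios computed against a set of relevance judgments. A minimal sketch with made-up judgments, just to anchor the terms:

def precision_recall(retrieved, relevant):
    """Precision: share of retrieved documents that are relevant.
    Recall: share of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result list and judged-relevant set for a single query.
print(precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"]))  # (0.5, 0.666...)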

To get a better feel for the implications of this “good enough” line of thinking, take a look at the September 2012 “deliverable” from Promise (whose acronym, in my opinion, should be spelled PPromise), “Tutorial on Evaluation in the Wild.” The abstract for the document does not emphasize the “good enough” angle, stating:

The methodology estimates the user perception based on a wide range of criteria that cover four categories, namely indexing, document matching, the quality of the search results and the user interface of the system. The criteria are established best practices in the information retrieval domain as well as advancements for user search experience. For each criterion a test script has been defined that contains step-by-step instructions, a scoring schema and adaptations for the three PROMISE use case domains.

The idea is that by running what strikes me as subjective data collection from users of systems, an organization can gain insight into the search system’s “performance” and “all aspects of his or her behavior.” (The “all” is a bit problematic to me.)
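
Reduced to arithmetic, the methodology in the abstract amounts to per-criterion scores rolled up into the four category scores. A hedged sketch of that roll-up, with invented criterion names and scores (the real deliverable defines its own test scripts and scoring schema):

from statistics import mean

# Hypothetical per-criterion scores (0-5) gathered by running the test scripts.
scores = {
    "indexing":          {"coverage": 4, "freshness": 3},
    "document matching": {"stemming": 5, "synonym handling": 2},
    "result quality":    {"precision at 10": 3, "snippet quality": 4},
    "user interface":    {"query assistance": 2, "faceting": 4},
}

# Roll each category up to one number, then to an overall figure for the system.
category_scores = {cat: mean(crit.values()) for cat, crit in scores.items()}
overall = mean(category_scores.values())
print(category_scores, round(overall, 2))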

Read more

Are There Lessons for Enterprise Search in the Pew Publishing Study 2013?

March 19, 2013

If you have not looked at the Pew report, you will want to check out the basic information in “The State of the News Media 2013.” The principal surprise in the report is that the situation seems to be less positive than I assumed.

Here’s the snippet which I tucked in my notebook:

Estimates for newspaper newsroom cutbacks in 2012 put the industry down 30% since its peak in 2000 and below 40,000 full-time professional employees for the first time since 1978. In local TV, our special content report reveals, sports, weather and traffic now account on average for 40% of the content produced on the newscasts studied while story lengths shrink. On CNN, the cable channel that has branded itself around deep reporting, produced story packages were cut nearly in half from 2007 to 2012. Across the three cable channels, coverage of live events during the day, which often require a crew and correspondent, fell 30% from 2007 to 2012 while interview segments, which tend to take fewer resources and can be scheduled in advance, were up 31%. Time magazine, the only major print news weekly left standing, cut roughly 5% of its staff in early 2013 as a part of broader company layoffs.  And in African-American news media, the Chicago Defender has winnowed its editorial staff to just four while The Afro cut back the number of pages in its papers from 28-32 in 2008 to 16-20 in 2012. A growing list of media outlets, such as Forbes magazine, use technology by a company called Narrative Science to produce content by way of algorithm, no human reporting necessary. And some of the newer nonprofit entrants into the industry, such as the Chicago News Cooperative, have, after launching with much fanfare, shut their doors.

Professional publishing companies like Ebsco, Elsevier, ProQuest, Thomson Reuters, and Wolters Kluwer are going to be affected too. If the content streams on which these companies depend “go away,” the firms will have to demonstrate that they too can act in an agile manner. Since the database-centric crowd has crowed about its technical acumen for years, I think the agility trick might be a tough one to pull off.

But what about specialist software vendors of search, content processing, and indexing? Are there lessons in the Pew report which provide some hints about the future of these information-centric businesses?

My view is that there are three signals in the Pew data which seem to be germane to search and related service vendors.

First, the drop off which the Pew report documents has been quicker than I, and probably some senior publishing executives, expected. These folks were cruising along with belt tightening and minor adjustments. Now revenue and expenses are colliding, and quickly. How will these companies react as the time for figuring out a course correction slips away? My view is that there will be some wild and crazy decisions coming down the runway, and soon. Search and content processing vendors are facing a similar situation. A run through my Overflight service reveals quite a few vendors who have gone quiet or simply turned out the lights.

Second, the problem is not a lack of information, and it is not unique to publishing. Organizations have quite a lot of data. The problem is that making use of the data in a way that enhances revenue seems to be difficult. There are quite a few companies pitching fancy analytics, but the vendors are facing long buying cycles and price pressure. Sure, there are billions of bits, but there is not the money, expertise, or time to cope with the winnowing and selecting work. In short, there are some big hopes but little evidence that the marketing hyperbole translates into revenue and profits.

Third, traditional publishing is on the outside looking in when it comes to new business models. Google and a handful of other companies seem to be in a commanding position for online advertising. Enterprise search and content processing vendors have not been able to find a business model beyond license fees and consulting. Just as in the traditional publishing sector, the statement “We can’t do that” seems to be a self-fulfilling prophecy. In search, I think there will be some business model innovation, and it will take place at the expense of the vendors who are sticking to the “tried and true” approach to revenue generation.

My take is that the decline of traditional publishing may be a glimpse of the future for search and content processing vendors.

Stephen E Arnold, March 20, 2013

Navigation Misses the Point of Search and Retrieval

March 18, 2013

How does one become a sheeple? One answer is, “Accept search outputs without critical thinking.”

I don’t want to get into a squabble with the thinkers at Nielsen Norman Group. I suggest you read “Converting Search into Navigation” and then reflect on the fact that this was the basic premise of Endeca and then almost every other search vendor on the planet since the late 1990s. The idea is that users prefer to click rather than type queries or, better yet, have the system just tell the user what he or she wants without having to do so much as make a click.

Humans want information, and most humans don’t want to expend much, if any, effort getting “answers.” In the late 1970s, I worked on a Booz, Allen & Hamilton study which revealed that managers in that pre-Internet Dark Age got information by asking the first person encountered in the hall or a person whom an executive could get on the phone, or by flipping through the old-school trade magazines which once flowed into inboxes.

A happy quack to http://red-pill.org/are-you-one-of-the-sheeple-take-the-quiz/

What’s different today? According to the write up, as I understand it, not too much. The article asserts:

Users are incredibly bad at finding and researching things on the web. A few years ago, I characterized users’ research skills as “incompetent,” and they’ve only gotten worse over time. “Pathetic” and “useless” are words that come to mind after this year’s user testing.

There you go. When the top quality minds that Booz, Allen & Hamilton tried to hire took the path of least resistance decades ago, is it a big surprise that people are clueless when it comes to finding information?

The point of the article is that people who make interfaces have to design for mediocre searchers. Mediocre? How about terrible, clueless, inept, or naive? The article says:

… you should redirect users from a normal SERP to a category page only when their query is unambiguous and exactly matches the category. A search for “3D TV” could go to the subcategory page for these products, but a search for “3D” should generate a regular SERP. (Costco does this correctly, including both 3D televisions and other products relevant to the query.) Until people begin to grasp the complexities of search and develop skills accordingly, businesses that take such extra steps to help users find what they need will improve customer success — and the bottom line.

My view is just a little bit different and not parental like the preceding paragraph.
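
Mechanically, the recommendation in the quoted passage is little more than a whitelist lookup. A minimal sketch in Python, with hypothetical category names and paths, of the redirect-only-on-an-unambiguous-exact-match rule:

# Normalized queries that exactly match a category, mapped to hypothetical category pages.
CATEGORY_PAGES = {
    "3d tv": "/categories/3d-televisions",
    "laptops": "/categories/laptops",
}

def route_query(query):
    """Send an exact, unambiguous category match to its category page;
    everything else (for example, just "3d") gets a regular results page."""
    key = " ".join(query.lower().split())
    if key in CATEGORY_PAGES:
        return ("redirect", CATEGORY_PAGES[key])
    return ("serp", "/search?q=" + key.replace(" ", "+"))

print(route_query("3D TV"))  # ('redirect', '/categories/3d-televisions')
print(route_query("3D"))     # ('serp', '/search?q=3d')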

Read more

New Updates to Solr and Lucene

March 18, 2013

Apache Solr and Lucene are notable for good maintenance and frequent updates. These updates are one of the many reasons why Solr and Lucene are considered top choices in open source software. Another upgrade, version 4.2, has just been announced, and it once again changes the default codec. Read all the details in the article, “Apache Solr and Lucene 4.2 Update Default Codec Again.”

The article sums up some of the improvements:

“The Solr search platform now has a REST API which allows developers to read the schema; support for writing the schema is coming. DocValues are now integrated with Solr and as they allow faster loading and can use different compression algorithms, the integration offers a wide range of feature possibilities and performance benefits. Collections now support aliasing allowing for reindexing and swapping while in production, and the Collections API has now been improved to make it easier to ‘see how things turned out.’ It is also now possible to interact with a collection in a node even if it doesn’t have a replica on that node.”
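
The read-only schema API mentioned in the quote is reachable over plain HTTP. A minimal sketch using Python and the requests library, assuming a local Solr 4.2 instance with the example collection1 core; the response shape may vary slightly across versions:

import requests

SOLR = "http://localhost:8983/solr/collection1"  # assumed local core

# Read the field definitions through the schema REST API (read-only in Solr 4.2).
fields = requests.get(f"{SOLR}/schema/fields", params={"wt": "json"}).json()
for field in fields.get("fields", []):
    print(field.get("name"), field.get("type"))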

The full details of the changes can be read in the Lucene 4.2 and Solr 4.2 release notes. When foundational software is improved, the value-added software attached to it gets an automatic upgrade as well. This is the case with LucidWorks and its suite of search offerings built upon the open source strength of Lucene and Solr. Interestingly, LucidWorks has been criticized for not having a RESTful API, but with the newest upgrade to Solr, that claim no longer holds. LucidWorks will no doubt remain on top.

Emily Rae Aldridge, March 18, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search
