November 19, 2012
We’ve turned up a useful summary of Endeca’s Information Discovery system; the description occurs within a post about using integration platform CloverETL with the Endeca product. “Oracle Endeca Information Discovery—CloverETL” is posted at Saichand Varanasi’s OBIEE, Endeca and ODI Blog. After referring readers to his Endeca overview, the blogger dives into the Clover. He writes:
“Today we will see how to create Clover ETL graph and populating data which will be used by MDEX engine for reporting (Studio). Endeca Information discovery helps organization to answer quickly on relevant data of both structured and Un structured. It helps to search and discover and analysis. Information is loaded from multiple data source systems and stored in a faceted data model that dynamically supports changing data. Information discovery enables an iterative approach. Integration features a new ETL tool, The integrator (Clover ETL) that lets you extract source records from a variety of source types flat files to databases.”
Next, Varanasi walks us through an example project. Along the way, he also explains how Endeca Information Discovery functions. A happy side effect, if you will. See the post for details.
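For readers new to the idea, the faceted data model mentioned in the quotation can be pictured with a few lines of Python. This is only a rough sketch and not how the MDEX engine works internally; the record attributes and the function names `facet_counts` and `refine` are invented for illustration. It shows how records with differing attributes can still be counted and narrowed by facet.

```python
from collections import Counter, defaultdict

def facet_counts(records):
    """Count how often each value occurs for every attribute seen in the records."""
    counts = defaultdict(Counter)
    for record in records:
        for attribute, value in record.items():
            counts[attribute][value] += 1
    return counts

def refine(records, attribute, value):
    """Keep only the records whose attribute equals the chosen facet value."""
    return [r for r in records if r.get(attribute) == value]

# Hypothetical records; note that they do not all share the same attributes.
records = [
    {"type": "book", "language": "English"},
    {"type": "book", "language": "French", "format": "ebook"},
    {"type": "journal", "language": "English"},
]

facets = facet_counts(records)
print(dict(facets["type"]))
print(len(refine(records, "language", "English")))
```

Because the counts are built from whatever attributes each record happens to carry, a new attribute such as `format` appears as a facet without any schema change.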
Founded in 1999 and based in Cambridge, MA, Endeca was acquired by Oracle just over a year ago. The company has been at the forefront of faceted search technology, particularly for large e-commerce and online library systems.
Cynthia Murrell, November 19, 2012
July 3, 2012
Once again, the technology road behind enterprise search is being questioned and some are mapping out a new route for a company road trip. According to Norm.al’s article, “Search vs. Findability vs. Information Retrieval,” findability is today’s new buzzword, but relying on a back seat driver seems questionable.
The self-appointed tour guides have determined:
“What Findability should be, and what the Semantic Web promises is a new approach. Order first and then the rest will be easy. By using Faceted Search or other Information Retrieval interfaces findability is achieved. Computer Search is based on indexing a junk of data, while Findability should be a process defined at the moment when the data are created.”
“If we could note the order, is Junk of Data, to Order by a third party who analyzes your content based on keywords, NLP and some other great metrics.”
No one really likes a back seat driver and now they are trying to hop in and bark out directions. Sometimes the search engine road may get a little bumpy, but utilizing the right landmarks will get you where you need to go without the interference of detours.
The pavement on this new road seems to still be a bit wet, so one might yet find themselves spattered with debris. Will these distinctions stick? We think not. Search is dead. Long live the next set of buzzwords from self-appointed experts, “real” analysts, and failed Webmasters.
Jennifer Shockley, July 3, 2012
January 31, 2012
I am now using the phrase “exogenous complexity” to describe systems, methods, processes, and procedures which are likely to fail due to outside factors. This initial post focuses on indexing, but I will extend the concept to other content centric applications in the future. Disagree with me? Use the comments section of this blog, please.
What is an outside factor?
Let’s think about value adding indexing, content enrichment, or metatagging. The idea is that unstructured text contains entities, facts, bound phrases, and other identifiable elements. A key word search system is mostly blind to the meaning of a number in the form nnn nn nnnn, which in the United States is the pattern for a Social Security Number. Similar patterns occur in Federal Express, financial, and other types of sequences. The idea is that a system will recognize these strings and tag them appropriately; for example:
nnn nn nnnn Social Security Number
Thus, a query for Social Security Numbers will return strings of digits matching the pattern. The same logic can be applied to certain entities: with the help of a knowledge base, Bayesian numerical recipes, and other techniques such as synonym expansion, a system can determine that a query for Obama residence should return White House, or that a query for the White House should return links to the Obama residence.
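A minimal sketch of the pattern tagging idea, assuming the nnn nn nnnn form with either a space or a hyphen as separator. The function name `tag_entities` and the sample sentence are invented for illustration; a production system would need validation rules and many more patterns.

```python
import re

# Assumed pattern: three digits, two digits, four digits,
# separated by a single space or hyphen.
SSN_PATTERN = re.compile(r"\b\d{3}[ -]\d{2}[ -]\d{4}\b")

def tag_entities(text):
    """Return (matched_string, tag) pairs for every SSN-shaped string in text."""
    return [
        (match.group(0), "Social Security Number")
        for match in SSN_PATTERN.finditer(text)
    ]

print(tag_entities("Employee 123 45 6789 filed the form; call 555 1212."))
```

The phone-style string in the sample is left untagged because it does not fit the three-two-four shape. A pattern like this also shows where the trouble starts: any other nine-digit sequence written the same way gets the same tag.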
One wishes that value added indexing systems were as predictable as a kabuki drama. What vendors of next generation content processing systems participate in is a kabuki which leads to failure two thirds of the time. A tragedy? It depends on whom one asks.
The problem is that companies offering automated solutions to value adding indexing, content enrichment, or metatagging are likely to fail for three reasons:
First, there is the issue of humans who use language in unexpected or what some poets call “fresh” or “metaphoric” ways. English is synthetic in that any string of sounds can be used in quite unexpected ways. Whether it is the use of the fruit name “mango” as a code name for software or the conversion of a noun like information into a verb like informationize, which appears in Japanese government English language documents, the automated system may miss the boat. When the boat is missed, continued iterations try to arrive at the correct linkage, but as anyone who has used fully automated systems or who paid attention in math class knows, the recovery from an initial error can be time consuming and sometimes difficult. Therefore, an automated system, no matter how clever, may find itself fooled by the stream of content flowing through its content processing work flow. The user pays the price because false drops mean more work and suggestions which are not just off the mark but also difficult for a human to figure out. You can get the inside dope on why poor suggestions are an issue in Thinking, Fast and Slow.
January 7, 2011
“Has Lucid Imagination Found the Open Source Solution for Enterprise Search?” asks if, like the Star Trek Enterprise, Lucid Imagination has done what no other open source search engine has done before and created a product worth paying for. Why not just use Apache Solr/Lucene?
The article points out that without Lucid you can’t just index and search a set of documents: you have to create each connection type and, most importantly, there is no security. It’s also easy to change over to Lucid since it’s built on top of the Apache engine without any significant alterations. To sum up:
Lucid Imagination reduces the technical complexity of leveraging Solr by providing an automated installer, configurable data connectors and a web-based administration interface. In addition to the add-on to Solr/Lucene users can easily observe, multiple enhancements were made to make the solution easier to deploy to the cloud.
If you’re interested in trying it out, the annual fees are straightforward too.
Alice Wasielewski, January 7, 2011
December 17, 2010
We’ve unearthed a slideshare.net document worth mentioning: the Funnelback Enterprise Search Features list.
Acquired by the open source software services company Squiz in 2009, Funnelback is an Australian-based enterprise search engine and services company with a client list including universities, government agencies and large corporations spanning three continents. In Funnelback’s own words:
“Our technology is used to search information across the breadth of an organization. We offer externally hosted search solutions as well as in-house server installed solutions and consultancy services. We search across websites, intranets, portals, databases, fileshares and many other data sources. Our feature rich, high powered, customizable, search engine allows organizations to find accurate information quickly and easily.”
For a concise overview of what Funnelback offers, visit the link above to the four page features list. Whether you are interested in the particulars of its search features, query language, results & reporting or security, amongst even more categories, it’s all organized and detailed right there.
Sarah Rogers, December 17, 2010
October 25, 2010
“Open Source Search with Lucene & Solr” provides a useful overview of information similar to that presented at the Lucene Revolution in Boston, October 7 and 8, 2010. I found the information useful. Even though I poked my head into most sessions and met a number of speakers, Igvita.com has assembled a number of useful factoids. Here’s a selection of four.
First, the Salesforce.com implementation of Lucene “consists of roughly 16 machines, which in turn contain many small and sharded Lucene indexes. Currently, [Salesforce.com] handles 4,000 queries per second (qps) and provides an incremental indexing model where the new user data is searchable within ~ three minutes.”
Second, iTunes is a Lucene user “said to be handling up to 800 queries per second.” I thought Apple was drinking Google Kool-Aid or was before the friction between the two companies entered into a marital separation without counseling.
Third, I found this description of Lucene/Solr interesting:
If Lucene is a low-level IR toolkit, then Solr is the fully-featured HTTP search server which wraps the Lucene library and adds a number of additional features: additional query parsers, HTTP caching, search faceting, highlighting, and many others. Best of all, once you bring up the Solr server, you can speak to it directly via REST XML/JSON API’s. No need to write any Java code or use Java clients to access your Lucene indexes. Solr and Lucene began as independent projects, but just this past year both teams have decided to merge their efforts – all around, great news for both communities. If you haven’t already, definitely take Solr for a spin.
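To make the “speak to it directly” point concrete, here is a small sketch that builds a Solr select request as a plain URL. The parameters `q`, `wt`, `rows`, `facet`, and `facet.field` are standard Solr query parameters; the host, port, base path, and the field names `title` and `category` are assumptions for illustration, and the helper `solr_select_url` is hypothetical. Newer Solr installations typically place a core or collection name in the path.

```python
from urllib.parse import urlencode

def solr_select_url(base_url, query, facet_field=None, rows=10):
    """Build a Solr /select URL asking for JSON results, optionally with one facet."""
    params = [("q", query), ("wt", "json"), ("rows", str(rows))]
    if facet_field:
        params += [("facet", "true"), ("facet.field", facet_field)]
    return f"{base_url}/select?{urlencode(params)}"

# Assumed local server and field names; adjust to your own installation.
url = solr_select_url(
    "http://localhost:8983/solr", "title:lucene", facet_field="category"
)
print(url)
```

Any HTTP client, or a browser, can fetch the resulting URL, which is the sense in which no Java code is needed.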
Finally, this passage opened my eyes to some interesting opportunities.
Instead of running Lucene or Solr in standalone mode, both are also easily integrated within other applications. For example, Lucandra is aiming to implement a distributed Lucene index directly on top of Cassandra. Jake Luciani, the lead developer of the project, has recently joined the Riptano team as a full-time developer, so do not be surprised if Cassandra will soon support a Lucene powered IR toolkit as one of its features! At the same time, Lily is aiming to transparently integrate Solr with HBase to allow for a much more flexible query and indexing model of your HBase datasets. Unlike Lucandra, Lily is not leveraging HBase as an index store (see HBasene for that), but runs standalone, albeit tightly integrated Solr servers for flexible indexing and query support.
Navigate to the Igvita Web site and get the full scoop, not a baby cup of goodness.
Stephen E Arnold, October 25, 2010
October 25, 2010
On a phone call last week, the participants were annoyed at the baked in enterprise database. Each upgrade cycle, several of the participants reported that their companies just “paid the bill.” Habit, not critical thinking, keeps some of those giant IBM DB2, Microsoft SQL Server, and Oracle RDBMS installations pumping cash from clients into the corporate coffers.
On the call, I learned that DBSight, a J2EE search platform, is now at Version 4.x. I first wrote about the system in April 2008 in “DBSight Search: Worth a Closer Look”. The system offers full text search for information stuffed into relational databases. The system can integrate with other languages via XML, JSON, and HTML. The description was that DBSight included a built in database crawler. The system provided a number of knobs and dials. I noted down such functions as faceted search, support for word lists, multi-threaded searching, and some other goodies. In order to handle big data, the system supports multiple indexes and sharded search as well as a number of other speed up methods.
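The general idea behind sharded search can be sketched in a few lines: each shard returns its own top hits and a coordinator merges them. This is a generic illustration and not DBSight’s implementation; the scoring (raw term frequency), the shard contents, and the function names `search_shard` and `search_all` are all assumptions.

```python
import heapq

def search_shard(shard, term, k):
    """Score documents in one shard by raw term frequency and return its top k hits."""
    hits = []
    for doc_id, text in shard.items():
        score = text.lower().split().count(term.lower())
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(k, hits)

def search_all(shards, term, k=3):
    """Query every shard, then merge the partial results into a global top k."""
    partial = []
    for shard in shards:
        partial.extend(search_shard(shard, term, k))
    return heapq.nlargest(k, partial)

# Hypothetical shards, each mapping a document id to its text.
shards = [
    {"a1": "search search engine", "a2": "database crawler"},
    {"b1": "faceted search", "b2": "search search search index"},
]
print(search_all(shards, "search", k=2))
```

Because each shard only has to return its own best k hits, the per-shard loop can run in parallel on separate threads or machines, which is where the speed up comes from.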
The company coding DBSight has been around since 2004. DBSight has been engineered as a “re-usable search platform.” License fees begin at about $200 but there is a community edition available from this page. If you want an enterprise license, DBSight will provide a custom price quote. I did some poking around and located a link for a free download at http://www.dbsight.net/index.php?q=node/47.
Stephen E Arnold, October 25, 2010