The SIREn Call for Semi-Structured Data
June 7, 2011
SIREn, a new patch for Lucene, has, according to its Web site, found a better way to handle semi-structured data at large scale. Searching graph-structured data such as RDF has commonly been handled by dedicated triplestores. However, triplestores don’t scale as well as the new SIREn patch, and they fail to offer the more consumer-friendly features that a typical Web search engine would provide.
Triplestores are inefficient when searching across fields, and they cannot handle multi-valued fields properly. They can’t keep track of which field a given term belongs to. We learned:
The content query operators are the only ones that access the term content of the table, and are orthogonal to the structure operators. They include extended Boolean operations such as Boolean operators (intersection, union, difference), proximity operators (phrase, near, before, after, etc.) and fuzzy or wildcard operators. These operations allow to express complex keyword queries for each cell of the table. Interestingly, it is possible to apply these operators not only on literals, but also on URIs (subject, predicate and object).
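To make the idea of per-cell content operators concrete, here is a minimal sketch in plain Python. It is not SIREn's or Lucene's API: the `TRIPLES` data, the `example.org` URIs, and every function name (`has_all`, `has_any`, `difference`, `phrase`, `near`, `wildcard`, `search`) are hypothetical and chosen only for illustration. Each triple is treated as one row of a table with subject, predicate and object cells, and each operator is a keyword test applied to a single cell, whether that cell holds a literal or a URI. Fuzzy matching and the before/after operators are left out.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical toy data: one row per triple (subject, predicate, object).
TRIPLES = [
    ("http://example.org/doc/1", "http://example.org/vocab/title",
     "Searching semi-structured data at web scale"),
    ("http://example.org/doc/1", "http://example.org/vocab/author",
     "http://example.org/people/alice"),
    ("http://example.org/doc/2", "http://example.org/vocab/title",
     "Structured data for search engines"),
    ("http://example.org/doc/2", "http://example.org/vocab/author",
     "http://example.org/people/bob"),
]


def tokenize(cell):
    """Split a literal or a URI into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", cell.lower())


def has_all(*terms):
    """Boolean intersection: every term must occur in the cell."""
    return lambda tokens: all(t in tokens for t in terms)


def has_any(*terms):
    """Boolean union: at least one term must occur in the cell."""
    return lambda tokens: any(t in tokens for t in terms)


def difference(keep, drop):
    """Boolean difference: `keep` must match and `drop` must not."""
    return lambda tokens: keep(tokens) and not drop(tokens)


def phrase(*terms):
    """Proximity: the terms must occur consecutively, in order."""
    n = len(terms)
    return lambda tokens: any(
        tuple(tokens[i:i + n]) == terms for i in range(len(tokens) - n + 1)
    )


def near(a, b, slop):
    """Proximity: a and b must occur within `slop` positions of each other."""
    def match(tokens):
        pos_a = [i for i, t in enumerate(tokens) if t == a]
        pos_b = [i for i, t in enumerate(tokens) if t == b]
        return any(abs(i - j) <= slop for i in pos_a for j in pos_b)
    return match


def wildcard(pattern):
    """Wildcard: some term in the cell must match a shell-style pattern."""
    return lambda tokens: any(fnmatchcase(t, pattern) for t in tokens)


def search(subject=None, predicate=None, obj=None):
    """Return the rows whose cells satisfy every operator that was given."""
    filters = (subject, predicate, obj)
    hits = []
    for row in TRIPLES:
        cells = [tokenize(cell) for cell in row]
        if all(f is None or f(toks) for f, toks in zip(filters, cells)):
            hits.append(row)
    return hits
```

With these helpers, `search(predicate=has_all("title"), obj=phrase("structured", "data"))` asks for rows whose predicate URI contains the term "title" and whose object contains the phrase "structured data". The same operators work on the subject and predicate cells because URIs are tokenized the same way as literals, which mirrors the point made in the quoted passage.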
SIREn offers the capability to search large semi-structured content collections, such as those with different schemas, something the original Lucene retrieval system failed to do. With the patch, Lucene can now index and search RDF and text-based documents with less confusion and better results.
Leslie Radcliff, June 7, 2011
Sponsored by ArnoldIT.com, the resource for enterprise search information and current news about data fusion