
Featured

Language Computer: Why Now for Swingly and Extractiv

I did some fooling around on the Language Computer Corp. Web site. The PR blitz is on for Swingly, the question-answering service that was featured in blogs and on the quite remarkable podcast hosted by Jason Calacanis. I listened to the Swingly segment but exited once that interview concluded. Instead of wallowing in the “ask a question, get an answer” just like Ask.com, Yahoo Answers, Mahalo, Quora, Aardvark, and others, I thought I would navigate to the Overflight archive and check out the Web site. The first thing I noted was that a click on the WebFerret button now renamed “Ferret” returned a 404 error. Okay. So much for that. I then punched the entity recognition demo which I had also examined a while ago. More luck there, but I had to dismiss an “invalid security certificate,” which I supposed would have been a deal breaker for the Steve Gibson types visiting the Language Computer Web site.

I uploaded one of my for-fee columns to CiceroLite ML. The system accepted the file, stripped out the Word craziness, and invited me to process the file. I punched the “process” button. The system highlighted the different entities. What’s important is that Language Computer has for at least eight or nine years performed at or near the top of the heap on various US government tests of content processing systems. Here’s what the marked up text looked like. Each color represents a different type of entity. For example, red is an organization, blue a person, etc.

lcc entity

In operational use, the tagged entities are written to a file, not embedded in a document. But for demo purposes, the inline markup makes it easy to see that Language Computer did a pretty good job. Entity extraction is a big deal for some types of content activities. I find a tally of how many times an entity appears in a document quite useful. The big chunk of work, in my opinion, is mapping entities to synonyms and then to people and places. It’s great to know the entities in a document, but it is even better to have these items hooked together. I quite like the ability to click and see the entities in the source document.
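
As a rough illustration of the tally idea, here is a minimal Python sketch. It assumes the extraction step has already produced a list of (entity text, entity type) pairs. The sample data, the type labels, and the function name `tally_entities` are hypothetical; none of this reflects Language Computer's actual output format or API.

```python
from collections import Counter


def tally_entities(entities):
    """Count how often each (text, type) pair appears, ignoring case and padding."""
    counts = Counter()
    for text, entity_type in entities:
        # Normalizing the text is an assumption; a real system would also
        # need to map synonyms and aliases to a single entity.
        key = (text.strip().lower(), entity_type)
        counts[key] += 1
    return counts


# Hypothetical extractor output for one document.
sample = [
    ("Google", "ORGANIZATION"),
    ("Oracle", "ORGANIZATION"),
    ("google", "ORGANIZATION"),
    ("Jason Calacanis", "PERSON"),
]

tally = tally_entities(sample)
for (text, entity_type), count in tally.most_common():
    print(f"{entity_type:<14}{text:<18}{count}")
```

The harder job mentioned above, hooking synonyms to one person or place, would show up here as a smarter normalization step before counting.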

Language Computer Corporation has been around since 1995. It has an excellent reputation, and, like other next generation content processing systems, has been used by specialists in quite specific niche markets. I won’t name these, but you can figure out what outfits are interested in:

  • Entity recognition
  • Event time stamping
  • Sentiment tracking
  • Document summarization.

The plumbing for these industrial-strength applications is what makes Swingly.com work. Swingly.com is a demo of the Language Computer question answering function. In my opinion, I am not likely to do much typing or speaking of questions into a search box or device. I type queries and I shout into a phone, often with considerable enthusiasm. (I hate phones.)

If you want to explore the Language Computer function to turn Web content (heterogeneous and semi-structured content) into structured data, navigate to www.extractiv.com. You will need to register. In order to use the service you have to create a content job, perform some steps, and then know what the heck you are looking at. The system works.

The larger issue to consider is, “Why are companies like Language Computer, Fetch Technologies, JackBe, and others from the niche government markets suddenly bursting into the broader enterprise and consumer sector?”

The pundits have not tackled this question. Most of the Swingly.com write ups are content to beat on the Q&A drum. I don’t think question answering is a mass market service except on devices that allow me to talk. In short, the Web angle is silly. So I am at odds with the azurini. I don’t care too much about English majors and journalists who are experts in search and content processing. Feel free to fall in love. Just brush up on your Shakespeare because the plumbing in systems like Language Computer’s will mean zero to this crowd.


Interviews

Exclusive Interview: Charlie Hull, FLAX

Flax, based in Cambridge, England, has been among the leaders in the open source search sector. The firm offers the FLAX search system along with a range of professional services for clients and those who wish to use FLAX. Mr. Hull will be one of the speakers at the upcoming Lucene Revolution Conference, and I sought his views about open source search.


Charlie Hull, FLAX

Two years ago, Mr. Hull participated in a spirited discussion about the future of enterprise search. I learned about the firm’s clients, which include Cambridge University, IBuildings, and MyDeco, among others. After our “debate,” I learned that Mr. Hull worked with the team behind Muscat, a search system which provided access to a wide range of European content in English and other languages. Dr. Martin Porter’s Muscat system was forward looking and broke new ground in my opinion. With the surge of interest in open source search, I found his comments quite interesting. The full text of the interview appears below:

Why are you interested in open source search?

I first became interested in search over a decade ago, while working on next-generation user interfaces for a Bayesian web search tool. Search is increasingly becoming a pervasive, ubiquitous feature – but it’s still being done so badly in many cases. I want to help change that. With open source, I firmly believe we’re seeing a truly disruptive approach to the search market, and a coming of age of some excellent technologies. I’m also pretty sure that open source search can match and even surpass commercial solutions in terms of accuracy, scalability and performance. It’s an exciting time!

What is your take on the community aspect of open source search?

On the positive side, a collaborative, community-based development method can work very well and lead to stable, secure and high-performing software with excellent support. However, it all depends on the ‘shape’ of the community, and the ability of those within it to work together in a constructive way – luckily the open source search engines I’m familiar with have healthy and vibrant communities.

Commercial companies are playing what I call the “open source card.” Won’t that confuse people?

There are some companies who have added a drop of open source to their largely closed source offering – for example, they might have an open source version with far fewer features as tempting bait. I think customers are cleverer than this and will usually realize what defines ‘true’ open source – the source code is available, all of it, for free.

Those who have done their research will have realized true open source can give unmatched freedom and flexibility, and will have found companies like ourselves and Lucid Imagination who can help with development and ongoing support, to give a solid commercial backing to the open source community. They’ll also find that companies like ourselves regularly contribute code we develop back to the community.

What’s your take on the Oracle Google Java legal matter with regards to open source search?

Well, the Lucene engine is of course based on Java, but I can’t see any great risk to Lucene from this spat between Oracle and Google, which seems mainly to be about Oracle wanting a slice of Google’s Android operating system. I suspect that (as ever) the only real beneficiaries will be the lawyers…

What are the primary benefits of using open source search?

Freedom is the key one – freedom to choose how your search project is built, how it works and its future. Flexibility is important, as every project will need some amount of customization. The lack of ongoing license fees is an important economic consideration, although open source shouldn’t be seen as a ‘cheap and basic’ solution – these are solid, scalable and high performing technologies based on decades of experience. They’re mature and ready for action as well – we have implemented complete search solutions for our customers, scaling to millions of documents, in a matter of days.

When someone asks you why you don’t use a commercial search solution, what do you tell them?

The key tasks for any search solution are indexing the original data, providing search results and providing management tools. All of these will require custom development work in most cases, even with a closed source technology. So why pay license fees on top? The other thing to remember is anything could happen to the closed source technology – it could be bought up by another company, stuck on a shelf and you could be forced to ‘upgrade’ to something else, or a vital feature or supported platform could be discontinued…there’s too much risk. With open source you get the code, forever, to do what you want with. You can either develop it yourself, or engage experts like us to help.

What about integration? That’s a killer for many vendors in my experience.

Why so? Integrating search engines is what we do at Flax day-to-day – and since we’ve chosen highly flexible and adaptable open source technology, we can do this in a fraction of the time and cost. We don’t dictate to our customers how their systems will have to adapt to our search solution – we make our technology work for them. Whatever platform, programming language or framework you’re using, we can work with it.

How do people reach you?

Via our Web site at http://www.flax.co.uk – we’re based in Cambridge, England but we have customers worldwide. We’re always happy to speak to anyone with a search-related project or problem. You’ll also find me in Boston in October of course!

Thank you.

Stephen E Arnold, September 1, 2010

Freebie

Profiles

Sophia Search Lands Venture Funding

Wisdom is a good name for a search and content processing system. If you live in rural Kentucky, the Greek becomes “Sophia”, which denotes wisdom. (Gentle reader, “wisdom” is not highly prized in Harrod’s Creek.)

The news that Sophia Search (founded in 2007) landed $1.2 million in seed money reached me via Marketwire. The investors include Volcano, based in Belfast, and Javelin Ventures in London. The story’s title was effective in arresting my attention: “Sophia Search Secures Largest Angel Investment in Northern Ireland to Address Global Demand for Next-Generation Enterprise Search and Discovery.” The news item said:

Sophia’s technology is purpose built on the company’s unique, patented, Contextual Discovery Engine (CDE) based on the linguistical model of Semiotics, the science behind how humans understand the meaning of information in context. The CDE platform automatically detects relationships and themes in unstructured content to enable organizations to seamlessly search, extract, deduplicate and eliminate redundancy of content to minimize risk and reduce the cost of retrieving, storing and managing enterprise information.

The news story revealed that Sophia is built on a patented, next-generation search engine platform. The system can “automatically discover relationships and themes in unstructured content.”

The company, according to my notes, is a spin out from University of Ulster and Saint Petersburg State University. Sophia Search was one of the companies recognized by the PricewaterhouseCoopers entrepreneur competition. (Keep in mind that I do work for the outfit that helps PricewaterhouseCoopers conduct these entrepreneur competitions.)

A quick trip to our Overflight system yielded some useful nuggets about this company. The Sophia Search white paper, dated January 2009, pointed out that the method is “fundamentally different to [sic] any other search tool.” The white paper continued:

These tools are based on ideas & principles drawn from disciplines such as Signal Processing or Mathematics. These ideas are ‘borrowed’ from these disciplines and applied to text retrieval to provide search. In Sophia we believe that in order to retrieve useful information for users we must first understand its meaning and as such we build Sophia upon the recognised linguistical model of Semiotics.

The system “understands” the context in which a word or phrase is used. The white paper said: “In order to understand the meaning of a word it must be taken within the context of other words around it.” We agree. Key word indexing is one reason why most search systems drive users to distraction.
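
The white paper’s point about context can be shown with a toy example. The sketch below is not Sophia’s method, which the white paper does not spell out. It is a plain context-window lookup in Python, and the function name `context_windows`, the window size, and the sample sentences are all assumptions made for illustration.

```python
import re


def context_windows(text, target, window=3):
    """Return the words within `window` positions of each occurrence of `target`."""
    words = re.findall(r"[a-z']+", text.lower())
    contexts = []
    for i, word in enumerate(words):
        if word == target:
            left = words[max(0, i - window):i]
            right = words[i + 1:i + 1 + window]
            contexts.append(left + right)
    return contexts


# Hypothetical sample text: the same key word used in two senses.
text = (
    "She opened a savings account at the bank on Monday. "
    "We had a picnic on the grassy bank of the river."
)

for context in context_windows(text, "bank"):
    print(context)
```

A key word index treats both uses of “bank” as the same hit. The neighboring words (“account” in one case, “river” in the other) are what separate the two meanings.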

The white paper introduces the idea of “intertextuality”. Here’s what the Sophia white paper says:

All texts are rehashes of previously existing ones and in order to understand them properly they must be read within the context of all information available that is related to them.

Many search engines remain ignorant of what has been previously processed. Google’s programmable search engine includes a context server which addresses this problem in the context of Ramanathan Guha’s method. But Google does not as far as I know offer its context server technology to third parties. Sophia’s engineers are heading down an interesting path in my opinion.

The system processes content, picks out key themes, and then clusters the pointers into “themes”. The idea is that a search returns content which is “topically similar”. According to the write up in the University of Ulster’s U2B newsletter (Winter 2007), Dr. David Patterson, one of the founders of the company, revealed:

Sophia doesn’t just find relevant information for customers, it also empowers them with an understanding of the meaning of the information returned. Using conventional search is akin to using a torch in a dark room (the torch represents the search engine and the room, an organisation’s information). Only the parts of the room that have the beam of light focussed on them can be seen at any one time, with limited understanding of the information in view. Using SOPHIA is like flicking the switch for a bright ceiling light. The whole room can be seen and all information understood at once.

If you are into technical papers, you can get a feel for the system’s method in “Sophia: An Interactive Cluster-Based Retrieval System for the OHSUMED Collection,” published in 2005.

With some search systems fading, new entrants often find eager audiences. Will Sophia become a break out solution? We wish the Sophia team the best.

Stephen E Arnold, July 9, 2010

Freebie

Latest News

Dragon Search Now Available for Medical Professionals

One of my “real”, for-fee columns is about voice search. However, I wanted to capture this news item because it shows the niche influenza that is infecting search... Read more »

September 2, 2010

Mango Thrives in the Warmth of Solr

Mango library catalog helps to search libraries for the particular book, video, CD, or an ISBN, ISSN, and call number using criteria like keywords, title, author,... Read more »

September 2, 2010

Click Hunting without Crazy Business Analyses

I wrote about the shotgun marriage of a consulting firm getting a competitor to critique another firm’s products. Now I want to highlight Business Week’s essay... Read more »

September 2, 2010

Could Oracle Tap into Android Revenue?

Google has been sued by Oracle, as reported by the siliconvalley.com blog post “’Mo Money Mo Problems’ for Google”, for the heavy use of Java by Google in... Read more »

September 2, 2010

Microsoft Seeks Rare Search Panda

I was not too surprised when I read “MS Seeks Net Search Partner.” The story said that Microsoft wanted to find a search partner for the China market. I thought... Read more »

September 2, 2010

Google and the Unexpected Consequences of a Hot Property

I don’t know much beyond what I have read in “Google Making Extraordinary Counteroffers To Stop Flow Of Employees To Facebook.” The idea is a good one if you... Read more »

September 1, 2010

Are There Two Threats to Google Editions?

Google continues to have a less-than-relaxing summer. Maybe life will improve when the leaves begin to fall? Almost lost in the buzz about Google’s new approach... Read more »

September 1, 2010

