Resource Links: Text Extraction From HTML Documents

March 28, 2011

We found another nifty links page to add to your software utility file.  The list comes from Tomaž Kovačič’s Tech Blog.  He gathered resource links about text extraction from HTML documents to aid the wayward IT worker.

He first highlights articles that cover the basics of text extraction.  By reading these articles, you gain a general knowledge about text extraction and the best way to approach it for your needs.  He also mentions how to eliminate content “noise” (i.e. content farms).

He’s also collected a comprehensive list of links related to software about text extraction.  He says, “There is only a small amount of competition when it comes to software capable of [removing boilerplate text / extracting article text / cleaning web pages / predicting informative content blocks] or whatever terms authors are using to describe the capabilities of their product.”

Extracting text from an HTML document is relatively simple.  The type of software you use makes it more complex.  He ends with information about APIs and other miscellaneous links that will be helpful. Stash it away for future use.
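To make the "relatively simple" case concrete, here is a minimal sketch using only Python's standard library. It is not taken from the linked resources, and the names `TextExtractor` and `extract_text` are hypothetical. It only strips tags and skips script and style content; it makes no attempt at the harder job of removing boilerplate or finding the main article block.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the visible text of an HTML string as one line (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = "<html><body><h1>Title</h1><p>Hello &amp; <b>world</b></p></body></html>"
    print(extract_text(sample))
```

Navigation menus, ads, and footers all survive this kind of extraction, which is why the software on Kovačič’s list exists.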

Whitney Grace, March 28, 2011

ISYS Search Tags the Equivio Account

March 21, 2011

Equivio, a highly-rated software firm whose focus is redundant data management, has chosen ISYS Search’s document connectors to move information from one file form to another. Equivio will pair the ISYS Search filters with Equivio’s own series of eDiscovery tools. The full release can be viewed on the ISYS site.

The arrangement is projected to reinforce ISYS’ commanding rank as a pioneer in rooted search technologies, which offer customers the capacity to mine text from a diverse array of formats. The multinational company continues to collect both accolades and associates.

In the announcement, Equivio CEO Amir Milo states that expeditious retrieval and review of information is a primary component in the success of any business utilizing eDiscovery software.

Micheal Cory, March 21, 2011

Freebie

Inforbix Poised to Shake Up Engineering Design Search

November 3, 2010

In an exclusive interview with ArnoldIT.com, Oleg Shilovitsky, co-founder and CEO of Inforbix, provides an in-depth look at his information retrieval system for engineering and product design. His firm Inforbix has been operating in a low profile and is now beginning to attract the attention of engineering professionals struggling with conventional data management tools for parts, components, assemblies, and other engineered pieces.

Most search systems are blind to the data locked in engineering design tools and systems. For example, in a typical manufacturing company, a traditional search system would index the content on an Exchange server, email, proposals in Word files, and maybe some of the content residing in specialized systems used for accounts payable or inventory. When these items are indexed, most are displayed in a hit list like a Google results page or in a point-and-click interface with hot links to documents that may or may not be related to the user’s immediate business information need.

But what about the specific part needed for a motor assembly? How does one locate the drawing? Where are the data about the item’s mean time before failure? The semantic relationships between bits of product data located in multiple silos are missing. The context of information related to components in a manufacturing and production process is either ignored, not indexable, or presented as a meaningless item number and a numerical value.

That’s the problem Mr. Shilovitsky and his team of engineers have solved. With basic key word retrieval now a commodity, specialized problems exist. As Mr. Shilovitsky told me, “I think maybe we have solved a problem for the first time. We make manufacturing and production related data available in context.”

In the interview conducted on November 1, 2010, Mr. Shilovitsky said:

In my view, the most valuable characteristics of future systems will be “flexibility” and “granularity”. The diversity of data in manufacturing organization is huge. You need to be flexible to be able to crack the information retrieval. On the other side, businesses are driven by values and ROI. So, to be able to have a granular solution (don’t boil the ocean) in order to address a particular business problem is a second important thing.

He added:

Our system foundation combines flexibility and granularity with a deep understanding of product data in engineering and manufacturing. One of the problems of product development is a uniqueness of organizational processes. Every organization runs their engineering and development shop differently. They are using the same tools (CAD, CAM, CAE, data management tools, or an ERP system), but the combination is unique.

To read the full text of this exclusive interview, navigate to this link. For more information about this ground-breaking approach to a tough information problem, point your browser to www.inforbix.com.

Stephen E Arnold, November 3, 2010

Freebie

Coveo Connects

November 1, 2010

Knowledge and information are directly related to a company’s success. Coveo taps into this as a leading provider of enterprise search and customer information access solutions. The PR-USA.net article “Coveo Announces New Information Indexing Connectors Including Support for Microsoft SharePoint 2010” tells the story of how “Coveo offers a richer, more integrated view of enterprise knowledge and information compared to what’s available with Microsoft’s native search.”

The article further states that, through its Enterprise Search 2.0 approach, Coveo can “bring the benefits of unified information access to customers faster, and less expensively, than is possible with traditional solutions including SharePoint Search or Microsoft FAST.” Because Coveo dynamically indexes the data and presents it in a unified view, organizations can get immediate value from the information and knowledge stored as structured and unstructured data across the enterprise, in any system, without moving data. Thus, the extended Coveo offers superior functionality and integration. Our recommendation: connect with Coveo.

Harleena Singh, November 1, 2010

Informatica Aims toward $800 Million in 2011

October 28, 2010

Informatica, a data integration software provider, reported a 40 percent jump in license revenue in its third quarter, which ended on September 30, 2010. Sohaib Abbasi, the CEO and Chairman of the firm, said, “Our third quarter record results are further evidence of our sustained growth opportunity.” He asserted that the rise in customer demand and the company’s product portfolio positioned the firm as a leader in the business of data integration. With data transformation becoming increasingly important, Informatica may well benefit. Our research indicates that some data transformation tasks are underbudgeted. When costs rise, data transformation expenses can be difficult to control. Informatica’s tools are mature, which may give the company a competitive advantage. For more information about Informatica, navigate to the firm’s Web site at www.informatica.com.

Stephen E Arnold, October 27, 2010

Freebie

Vamosa Acquired by T-Systems

October 27, 2010

Update: The goose is easily confused. T-Systems, not T-Mobile, purchased Vamosa. I think that Deutsche Telekom owns both of these companies. I see a similarity between the T-Systems’ Web site and the T-Mobile Web sites. The clue is the weird color and the dotted lines. I also heard from an ever-so-polite person who enjoined me in several emails to point out that T-Mobile (owned by Deutsche Telekom) did not acquire Vamosa. T-Systems (owned by Deutsche Telekom) did buy Vamosa. Interesting because this sort of input attracts my attention; it does not diminish it. My question, “Why such a convoluted structure made more confusing with logos, color, and branding?” Worth poking around perhaps?

And here’s an alleged official explanation from a person representing himself as affiliated with Kelso PR:

The problem is that in the UK, T-Systems and T-Mobile are different companies, owned by the same overall company, Deutsche Telekom.  T-Mobile is a partnership between France Telecom & Deutsche Telekomm [sic], whereas T-Systems is wholly owned by Deutsche Telekom. Indeed, in the UK T-Mobile isn’t called T-Mobile anymore, and is now called “everything everywhere”.  We are fine with you describing the purchaser as Deutsche Telekom (the overall owner), or as T-Systems (the actual buyer of Vamosa), but we would prefer if you don’t refer to the purchaser of Vamosa as “T-mobile”, which is a different company altogether. The Vamosa website has the “T-systems” branding running across the top of it. http://www.vamosa.com/ It’s just a simple issue of accuracy of the information.  If you have a look here:  http://www.heraldscotland.com/business/corporate-sme/t-systems-acquires-ip-and-trademarks-from-collapsed-vamosa-1.1063831 it should be clear how this is being reported in the UK.  As I say, thanks so much for responding to this.

A number of questions are swirling through my mind. Got that?

Short honk: T-Mobile (TSystems) has acquired Vamosa. I think of T-Mobile as a third string player in the US mobile market and a reliable wireless provider in the parts of Europe I visit. I was near the arctic circle a couple of years ago and I got a T-Mobile signal. T-Mobile’s purchase of Vamosa interested me. Vamosa embraced the notion of content governance, but I think of the company as having software that transforms content. In addition to connectors, the company’s strength was moving a big chunk of content from one system into a form that another system could use. Instead of a human slogging through sample documents, Vamosa offered software to analyze, normalize, and migrate content. A person at KelsoPR.com sent me a news release that said:

The acquisition supports T-Systems’ strategic focus fuelling growth by enabling collaboration and mobility. “Executives are looking for innovative technologies that help them reduce the complexity of managing multiple e-channels, which they rely on to drive knowledge sharing and customer transactions. An increasing number of critical business processes depend on the implementation of a secure and consistent governance structure that ensures employees, partners and customers have access to reliable content at all times and across all screens,” said Peter Row, Vice President of T-Systems UK Systems Integration who led the acquisition. “By expanding our portfolio to target this business issue we will be offering a unique end to end solution for customers in the marketplace.” The market-leading suite of products previously developed by Vamosa Limited, automatically tags digital content, cleans legacy data and seamlessly migrates content into content management systems.  On an ongoing basis the software technology ensures corporate standards are adhered to and auto-fixes any breaches it uncovers.

I had heard that T-Mobile was thrashing around in search, content processing, and information services. Maybe this acquisition adds some credence to those rumors. I am not sure about the Vamosa connectors. As you know, I am watching the i2 Ltd / Palantir legal matter which seems to be about reverse engineering connectors in order to hook into proprietary file stores. Connectors and data transformation are emerging as interesting functions which warrant observation.

Stephen E Arnold, October 27, 2010

Freebie

Coveo Adds Connectors

October 16, 2010

Coveo has announced new information indexing connectors. Among the new connectors are those for Jive SBS Versions 3.0 to 4.5, support for Microsoft SharePoint 2010, and Microsoft Exchange 2010. Coveo updated its connector for Lotus Notes. In the news release, we learned that Coveo is working with Netezza. Earlier this year we heard that Netezza was hooked into Attivio. Netezza, as you may know, is now part of IBM, a company which has been on a mini-spending spree.

One of the interesting comments in the news story was:

Out of the box, Coveo Information Indexing Connectors seamlessly and securely index enterprise-wide systems and data repositories. Coveo-developed connectors offer superior functionality and integration, including with the native security model of each system. Coveo Connectors feature live monitoring and dynamically index new, deleted and modified documents, ensuring just-in-time access to the timeliest information.
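The claim about indexing "new, deleted and modified documents" can be illustrated with a rough sketch. This is not Coveo's implementation; it is one common way a connector might detect changes, by comparing content fingerprints against what the index already holds. The function names and the dictionary-based "repository" are assumptions made for illustration.

```python
import hashlib


def fingerprint(content):
    """Return a stable hash of a document body (hypothetical helper)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def diff_repository(indexed, current):
    """Compare the index state with the repository state.

    indexed: dict mapping document id to the fingerprint stored at last crawl.
    current: dict mapping document id to the document body as it is now.
    Returns three sorted lists of ids: (new, deleted, modified).
    """
    current_fp = {doc_id: fingerprint(body) for doc_id, body in current.items()}
    indexed_ids = set(indexed)
    current_ids = set(current_fp)

    new = sorted(current_ids - indexed_ids)
    deleted = sorted(indexed_ids - current_ids)
    modified = sorted(
        doc_id
        for doc_id in indexed_ids & current_ids
        if indexed[doc_id] != current_fp[doc_id]
    )
    return new, deleted, modified


if __name__ == "__main__":
    indexed = {"a": fingerprint("x"), "b": fingerprint("y"), "c": fingerprint("z")}
    current = {"a": "x", "b": "changed", "d": "brand new"}
    print(diff_repository(indexed, current))
```

A production connector would also have to carry over each source system's security model, which the quoted release mentions and this sketch ignores.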

Connectors continue to have a pipeline to our in box. The i2 – Palantir legal matter is about connectors. With the green light turned on for this dust up, connectors are edging from back stage to center stage.

More information about Coveo is available at www.coveo.com.

Stephen E Arnold, October 16, 2010

Freebie

Content Conversion: A Subset of Connectors

October 4, 2010

There is certainly some excitement in the technical backwater of content conversion. I wrote a post for someone’s blog about the legal matter involving i2.co.uk and www.palantir.com. You can do some poking around on this issue or wait until my Information Today column on the subject becomes available.

Over the years, I have had to take content from System A and convert it to a type of content or form of content that System B could process. I wanted to call attention to two outfits who provide these services.

The first is an outfit on Long Island that we used at Ziff Communications lo those many moons ago. The company is called Data Conversion Labs. I visited the firm a couple of times and sat through some demos. The take away for me was these folks can do the System A to System B work quite well. You can read about this company at

The second is an outfit one of my UK clients used. Stilo does the A to B thing, and its output worked well as memory serves me. Stilo has added an on demand service, which I thought was quite nifty.

Why do I mention these two companies? I think there is a mid tier consulting firm and a search vendor suggesting that file conversion is some sort of cabal. In fact, file conversion is widely available from lots of people. The suggestion that file conversion is anything other than a widespread service, available from vendors throughout the world, is just plain wrong. Marketing is one thing. Ignoring the vendors who perform conversion, code custom filters, and perform transformation on premises, via appliances, or using proprietary methods is one more example of search marketing distortion.

Ah, young people. So eager to become important and hit their numbers.

Stephen E Arnold, October 4, 2010

Freebie

Connector Craziness: The Next Search Battleground

September 29, 2010

A reader sent me a link to a blog post from one of the mid tier consulting firms. The article is “Document Filters as a Search Proxy War.” I really don’t have much to say about the write up. So I will pretty much ignore it. I do this with quite a bit of blog content as I flap past 66.

However, I would like to add some information that I think those involved in search and content processing may want to have at their fingertips. I am reasonably familiar with the number of connectors available from Autonomy and Oracle. However, the connector world is not limited to two vendors, nor is it likely that most of those in search of connectors are aware that the outcome of a legal matter could – and I emphasize could – have a significant impact on the market. You can read more about this matter in my Information Today column and in a series of posts I am doing for a new Web log that will be available in mid-October. The announcement of the new Web log will appear in Beyond Search and I believe there will be a news release if I remember to alert one of my goslings to the task.

First, EntropySoft is a vendor that offers document connectors. You can get information about that firm’s offerings at www.entropysoft.net.

Second, there is a major dust up in the document connector world, and it is one that is the subject of my October Information Today column. The issue is an allegation by i2 Ltd., a company based in England. The core of the allegation is that improper actions were used to reverse engineer a document connector by Palantir. Depending on the outcome of this legal matter, there may be some modifications in the connector world. The issue is a connector for file type ANB. I have done work for i2.

Third, there are some open source connector initiatives underway. If you have not explored this side of the connector world you can begin with a Google search, a Black Duck search, or navigate to http://openconnector.org. The open source software movement, particularly in light of the Oracle litigation with Google, may have an impact on open source connectors.

There are also connector vendors in Hungary and India, but I won’t list these. When the mid tier consultants recycle my work, I want them to have something to do.

With the financial vice closing on many keyword search firms, one has to be vigilant for partial or edited information. Hiving off connectors is a way to generate cash from “must have” code widgets. A serious connector business is a relatively large undertaking. That is one reason why certain firms eschew connectors completely; others code their own with varying degrees of success; and most firms turn to third parties for a bundle that handles the most common file types.

The goose may be old, but he makes an effort to identify as many sides of an issue as possible. What we have, therefore, is a potential instability in the shift from basic search to more sophisticated information fusion.

Stephen E Arnold, September 29, 2010

Freebie, unlike information from English majors, former journalists, and the azurini of the world

Is ISYS Search Software Shifting Its Focus?

August 19, 2010

There are enough economically-fostered partnerships in today’s tempestuous technology market to make Donald Trump salivate. ISYS Search Software may have found a port in the storm. In an article entitled “Sybase Extends Leadership in Advanced Analytics to Customers Around The World,” ISYS announced that Sybase has selected ISYS Document Filtering System for use in its IQ text analytics server. The write up said, “ISYS is aggressively addressing needs that are not being served by competitors Oracle and Autonomy.”

So, is ISYS finding the search waters too deep? If so, partnering with Sybase could be a smart move. We’re not sure if this will level the search playing field just yet, but we’re going to keep an eye on this interesting development. What’s clear is that search vendors are scrambling to squeeze revenue from a lousy economic turnip. Connector licensing is one interesting angle.

Bret Quinn, August 19, 2010

Freebie
