Media Cloud: Foggy Payoff

March 12, 2009

I wrote about Calais in 2008. You can find that article here. Calais makes use of ClearForest technology to perform semantic tagging. I am cautious when large companies make services available at low or no cost. Now, Calais has been pegged to a project at Harvard University. You can read the ReadWriteWeb.com story here. The Media Cloud project delivers some of the Google Trends or Compete.com type outputs from content processed with Calais. For me, the most interesting comment in the write up was:

we see this as an example of how the Internet is driving traditional media to change and respond in new ways. We are excited by the scope and potential that Media Cloud brings to anyone interested in following news and media trends.

I have a different view. A university demo project is just that: a demo with an academic spin. Traditional media need to do more than a demo before the money in the checking and savings accounts runs dry.

Stephen Arnold, March 12, 2009

Database Content: Take or Use

March 12, 2009

You may want to read Out-Law.com’s “Database Infringements Depend on Taking, Not Usage of Data” here. The article tackles an issue that has triggered a European Court of Justice ruling. For me the key statement in the Out-Law.com synopsis of the ruling was:

The Directive protects against “extraction and/or re-utilisation of the whole or of a substantial part…of the contents of that database”. The ECJ said that infringement was independent of the use to which someone wants to put the information.

Does this ruling matter in the US or elsewhere?

In my opinion, the ruling underscores the difference between how the person who compiles and provides access to a specific compilation of data perceives its value and how the person who wants to repurpose some of the data in that database perceives it. I am no lawyer, but I do work with clients who can click to a Web site and find useful information; for example, the data available from a government Web site or the patent information I have compiled for my Google patent search service.

Software can now slice and dice data. A programmer can make many information “meals” with these amazing software tools.

There are different ways to view structured data, such as airline flight information or condos for sale in Baltimore, Maryland, and loosely structured data, such as an RSS feed or well formed XML documents.

An innovator / entrepreneur can see these data as raw material for something new. The idea is that individual data items may gain utility when assembled or organized in a way different from the way the information appears on a specific Web site. Because the information is viewable in a browser, it seems to the innovator / entrepreneur that the data or their constituent elements, like a phone number, are like molecules in a mixture. These can be combined without losing their original chemical structure. The data are publicly available, so the data are meant to be used.
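To make the slice-and-dice idea concrete, here is a minimal sketch that pulls the constituent elements out of an RSS feed and reassembles them into a new structure. The feed text, field names, and listings are invented for illustration; a real remixer would fetch someone else's published feed, which is exactly where the legal question above begins.

```python
# Minimal sketch: pull individual elements out of an RSS feed and
# reassemble them into a new structure. The feed below is a made-up
# example; a real remixer would fetch someone else's published feed.
import xml.etree.ElementTree as ET

SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>Condos for Sale - Baltimore</title>
  <item><title>2BR Harbor East</title><link>http://example.com/1</link>
        <description>Asking $450,000. Call 410-555-0101.</description></item>
  <item><title>1BR Fells Point</title><link>http://example.com/2</link>
        <description>Asking $280,000. Call 410-555-0102.</description></item>
</channel></rss>"""

def extract_items(rss_text):
    """Slice the feed into its constituent 'molecules': title, link, description."""
    root = ET.fromstring(rss_text)
    for item in root.iter("item"):
        yield {
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "description": item.findtext("description"),
        }

# Reassemble the pieces into a new "information meal": a simple listing
# sorted by title, detached from the original page's presentation.
listings = sorted(extract_items(SAMPLE_RSS), key=lambda d: d["title"])
for entry in listings:
    print(entry["title"], "->", entry["link"])
```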

Read more

Search May Not Mean Search

March 11, 2009

Last week, I had a disturbing conversation with a very confident 30 something. I learned that, after more than a year of planning, the company had decided to deploy a keyword search system from a big name vendor. I asked, “What do the employees need? Keyword retrieval? Reports? Alerts?”

The answer was, “We have that information from informal discussions. Keyword search.”

I thanked the person for lunch and walked away shaking my head. Businesses are struggling for revenue, and employees in the organizations I have visited since October 2008 strike me as wanting to make their companies successful. Employees are savvy and know that if their employer goes down the drain, finding another job might not be easy. For some, there will be increased competition. Darwinism is an abstract concept until a person can’t find work.

The 30 something had a job. An important job. The information technology unit at this services firm had search systems, but employees did not use them. The IT budget was getting scrutiny, so the manager and tech staff decided it was time to get a “new” search system.

The problem was that I had conducted interviews with a number of senior managers at this organization in 2003 and 2004. I even knew the president of one of the operating units socially. Although my lunch took place in 2009, I realized that the IT department was going to make the same errors it had with its previous search procurements. Every two or three years, the company licensed another system. After a honeymoon of six months, the results were predictable in my opinion: grousing and declining usage.

Vendors have a tough time breaking the cycle. Some search companies pitch a “simple solution” that is like a One a Day vitamin. Others deliver a toolkit that is far too complicated for the IT team to get working, and scarce budget dollars cannot be pumped into what amounts to a customized search system.

If this scenario resonates, you may want to navigate to LLRX and read the article, “Knowledge Discovery Resources 2009: An Internet MiniGuide Annotated Link Compilation” here. The listing was compiled by the prolific Marcus P. Zillman, Internet expert. What I liked about the listing was that it made one point clear to me: search does not mean keyword retrieval. The list provided me with a meaty link burger. I discovered a number of useful resources. You will want to download it and do some exploration.

I did not send the list to my lunch pal, the 30 something who knows what his users want without bothering with surveys, interviews, focus groups, and observation of users in action. As long as organizations hire information technology professionals who already “know” what search means, a list won’t make much difference.

You might have a more open mind. I hope so. Search defined as keyword retrieval is about as relevant today as a bronze surgical instrument in an emergency room in a big city hospital. Access to information in a way that meets the needs of individual users is, in my opinion, what search means.

Stephen Arnold, March 11, 2009

Search: Still in Its Infancy

March 9, 2009

Click here and read the job postings for intelligence professionals. Notice that the skills are those that require an ability to manipulate information, not just in English but in other languages. Here’s a portion of one posting:

Core Collector-certified Collection Management Officers (CMO’s) oversee and facilitate the collection, evaluation, classification, and dissemination of foreign intelligence developed from clandestine sources. CMO’s play a critical role in ensuring that foreign intelligence collected by clandestine sources is relevant,

I keep reading that search is stable and that search is simple. I don’t think so. Because language is complex, the challenge for search and content processing vendors is significant. With more than 150 systems available to manipulate information, one would think that software could handle basic collection and analysis, right? Software helps, but search is still in its infancy. The source of the jobs? The US Central Intelligence Agency, which is reasonably well equipped with search, text processing, and content analysis systems. Too bad the reality of search is complex, but some find it easy to say the problem is solved and move on in a fog of wackiness.

Stephen Arnold, March 9, 2009

ODNI Data Mining Report Available

March 8, 2009

If you want to keep a scorecard for data mining projects in some US government agencies, you may find the “Data Mining Report” (unclassified) interesting. You can download a copy here. You will need an acronym knowledgebase to make sense of some of the jargon.

Robert Steele, recovering spy who served on the top intelligence committees for information handling and advanced analysis and processing, says this: “If I were on the Hill, this report would trouble me. If the DNI is to lead, this report should encompass all IC elements, not just the Office of the DNI. For the DNI to allow ‘shot-gun’ responses to a Congressional requirement is not helpful at best, disingenuous at worst.”

For me, there were two interesting points:

  • Video is a sticky wicket: lots of data, and the tools are still evolving.
  • Coordination remains a challenge.

Enjoy.

Stephen Arnold, March 8, 2009

MyRoar: NLP Financial Information Centric Service

March 6, 2009

A happy quack to the reader who alerted me to MyRoar.com. This is a vertical search service that relies on natural language processing. I did some sleuthing and learned that François Schiettecatte joined the company earlier this year. Mr. Schiettecatte has a distinguished track record in search, natural language processing, and content processing. French by birth, he went to university in the UK and has lived and worked in the US for many years. Here’s what the company says about MyRoar.com:

In today’s current political and economic environment people have never had more questions. MyRoar helps people sort through the hype to find just the answers they are looking for. Extraneous information is eliminated, while saving hours of time or abandonment of search. We provide a fun new interface that keeps users up to date on current news, which helps them formulate the best questions to ask. MyRoar is a Natural Language Processing Question Answering Search Engine. Using integrated technologies we are able to offer high precision allowing users to ask questions relating to finance and news. MyRoar integrates proprietary Question Answer matching techniques with the best English NLP tools that span the globe.

You can use the system here. The system performed quite well on my test queries; for example, “What are the current financials for Parker Hannifin?” returned two results with the data I wanted. I will try to get Mr. Schiettecatte to participate in the Search Wizards Speak interview series. Give the system a whirl.

Stephen Arnold, March 6, 2009

Metadata Perp Walk

March 5, 2009

I mentioned the problems of eDiscovery in a briefing I did last year for a content processing company. I have not published that information. Maybe some day. The point that drew a chuckle from the client was my mentioning the legal risk associated with metadata. I was reporting what I learned in one of my expert witness projects. Short take: bad metadata could mean a perp walk. Mike Fernandes’ “Think You’re Compliant? Corrupt Metadata Could Land You in Jail” here tackled this subject in a more informed way than my anecdote. He does a good job of explaining why metadata are important. Then he hits the marrow of this info bone:

Data recovery cannot be treated as the ugly stepsister of enterprise backup, and the special needs that ECM systems place on backup must not be ignored. Regulatory authorities and industry experts are beginning to demand more ECM- and compliance-savvy recovery management strategies, thereby setting new industry-wide legal precedents. One misstep can lead to disaster; however, there are approaches and ECM solutions that help avoid noncompliance, downtime and other incidents.

If you are floating through life assuming that your metadata are shipshape, you will want to make a copy of Mr. Fernandes’ excellent write up. Oh, and why the perp walk? Bad metadata can annoy a judge. More to the point, bad metadata in the hands of the attorney from the other side can land you in jail. You might not have an Enron problem, but from the inside of a cell, the view is the same.

Stephen Arnold, March 5, 2009

Storage Rages

March 5, 2009

ComputerWorld’s “Virtualization the Top Trend over the Next 5 Years” here underscores a potential opportunity that most traditional search and content processing vendors won’t be able to handle with their here-and-now solutions.

“Storage technology is similar to insurance in the financial services industry. In times of a recession, you have to manage your risk. Storage protects what you have and reduces risk,” said Steve Ingledew, managing director of Millward Brown Research’s Technology Practice.

What is interesting about this quote from the ComputerWorld article is that storage itself becomes a risk. Are most search and content processing systems up to the task of managing massive repositories of digital information? The answer, in my opinion, is, “Sort of.” Autonomy moved to buy Interwoven to bolster its enterprise information and eDiscovery footprint. Specialists such as Clearwell Systems and Stratify (Iron Mountain) are farther along than most search and content processing companies. But when the volume of data gets into the tera and peta range, the here-and-now systems may not be up to the task.

With storage booming, there are some major opportunities for companies such as Aster Data, InfoBright, and Perfect Search. Unfamiliar with these companies? One may become the next big thing in data management. Google was on my short list, but the company seems to have lost some zip in the last 12 months. Amazon? At its core it is still an ecommerce vendor and not set up to handle the rigors of spoliation. Storage rages forward.

Stephen Arnold, March 5, 2009

Attivio’s Sid Probstein: An Exclusive Interview

February 25, 2009

I caught up with Sid Probstein, Attivio’s engaging chief technologist, on February 23, 2009. Attivio is a new breed of information company. The company combines a number of technologies to allow its licensees to extract more value from structured and unstructured information. Mr. Probstein is one of the speakers at the Boston Search Engine Meeting, a show that is now recognized as one of the most important venues for those serious about search, information retrieval, and content processing. You can register to attend this year’s conference here. Too many conferences feature confusing multi-track programs, cavernous exhibit halls, and annoyed attendees who find that the substance of the program does not match the marketing hyperbole. When you attend the Boston Search Engine Meeting, you have opportunities to talk directly to influential experts like Mr. Probstein. The full text of the interview appears below.

Will you describe briefly your company and its search / content processing technology? If you are not a company, please, describe your research in search / content processing.

Attivio’s Active Intelligence Engine (AIE) is powering today’s critical business solutions with a completely new approach to unifying information access. AIE supports querying with the precision of SQL and the fuzziness of full-text search. Our patent-applied-for query-side JOIN() operator allows relational data to be manipulated as it would be in a database, but in combination with full-text operations like fuzzy search, fielded search, Boolean search, etc. Finally, our ability to save any query as an alert and thereafter have new data trigger a workflow that may notify a user or update another system brings a sorely needed “active” component to information access.

By extending enterprise search capabilities across documents, data and media, AIE brings deeper insight to business applications and Web sites. AIE’s flexible design enables business and technology leaders to speed innovation through rapid prototyping and deployment, which dramatically lowers risk – an important consideration in today’s economy. Systems integrators, independent software vendors, corporations and government agencies partner with Attivio to automate information-driven processes and gain competitive advantage.
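As an illustration of the general idea behind combining relational JOIN semantics with full-text matching, here is a small, self-contained sketch. This is my own toy example, not Attivio’s AIE, its AQL, or its JOIN() implementation; the tables, field names, and query are hypothetical.

```python
# Toy illustration of joining structured rows to full-text matches.
# Not Attivio's syntax or engine -- just the general idea: keep the
# relational side intact and combine it with a text-match predicate.

orders = [  # structured side: relational rows with keys and fields
    {"order_id": 1, "customer": "Acme Corp", "amount": 12500},
    {"order_id": 2, "customer": "Globex", "amount": 310},
]

tickets = [  # unstructured side: free-text support tickets
    {"ticket_id": "T-77", "text": "Acme Corp reports a billing discrepancy"},
    {"ticket_id": "T-78", "text": "Globex asks about shipping dates"},
]

def full_text_match(text, terms):
    """Crude full-text predicate: every query term appears in the text."""
    lowered = text.lower()
    return all(term.lower() in lowered for term in terms)

def join_orders_to_tickets(query_terms):
    """Query-side 'join': pair each order with tickets whose text matches
    the query terms AND mentions the order's customer field."""
    for order in orders:
        for ticket in tickets:
            if full_text_match(ticket["text"], query_terms + [order["customer"]]):
                yield order["order_id"], ticket["ticket_id"]

# Example: which orders have tickets mentioning a billing problem?
print(list(join_orders_to_tickets(["billing"])))   # [(1, 'T-77')]
```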

What are the three major challenges you see in search / content processing in 2009?

May I offer three plus a bonus challenge?

First, understanding structured and unstructured data; currently most search engines don’t deal with structured data as it exists; they remove or require removal of the relationships. Retaining these relationships is the key challenge and a core value of information access.

Second, switching from the “pull” model in which end-users consume information, to the “push” model in which end-users and information systems are fed a stream of relevant information and analysis.

Third, being able to easily and rapidly construct information access applications. The year-long implementation cycle simply won’t cut it in the current climate; after all, that was the status quo for the past five years – long, challenging implementations, as search was still nascent. In 2009 what took months should take weeks. Also, the model has to change. Instead of trying to determine exactly how to build your information access strategy – the classic “aim, fire” approach – which often misses! – the new model is to “fire” and then “aim, aim, aim” – correct your course and learn as you go so that you ultimately produce an application you are delighted with.

I also want to mention supporting complex analysis and enrichment of many different forms of content. For example: identifying important fields, from a search perspective; detecting relationships between pieces of content, or entire silos of content. This is key to breaking down silos – something leading analysts agree will be a major focus in enterprise IT starting in 2011.

With search / content processing decades old, what have been the principal barriers to resolving these challenges in the past?

There are several hurdles. First, the inverted index structure has not traditionally been able to deal with relationships; just terms and documents. Second, the lack of tools to move data around, as opposed to simply obtaining content, has been a barrier for enterprise search in particular. There has not been an analog to “ETL” in the unstructured world. (The “connector” standard is about getting data, not moving it.) Finally, the lack of a truly dynamic architecture has meant having to re-index when changing configuration or adding new types of data to the index; a lack of support for rapid updates has also led to a proliferation of paired search engines and databases.
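To make the first hurdle concrete, here is a minimal, generic sketch of an inverted index: the structure maps terms to document identifiers and nothing more, so any relationship between rows or records is flattened away. This is a textbook illustration, not any vendor’s index format; the documents are invented.

```python
# Minimal inverted index: term -> set of document ids. Once two related
# database rows are serialized into the same "document", the index keeps
# only the terms; the row-to-row relationship is gone.
from collections import defaultdict

docs = {
    "doc1": "invoice 1001 customer acme overdue",
    "doc2": "invoice 1002 customer globex paid",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(term):
    """Classic term lookup: returns documents, not relationships."""
    return index.get(term, set())

print(search("acme"))  # {'doc1'} -- the term is found, but the
                       # invoice-to-customer relationship is no longer queryable
```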

With the rapid change in the business climate, how will the increasing financial pressure on information technology affect search / content processing?

Information access is critically important during a recession. Every interaction with the customer has the potential to cause churn. Reducing churn is less costly by far than acquiring new customers. Good service is one of the keys to retaining customers, and a typical cause of poor service is … poor information access. A real life example: I recently rolled over my 401K. I had 30 days to do it, and did so on the 28th day via phone. On the 29th day someone else from my financial services firm called back and asked me if I wanted to roll my 401K over. This was quite surprising. When I asked why the representative didn’t know I had done it the day before, the answer was, “I don’t have access to that information.” The cost of that information access problem was two phone calls: the second rollover call, and then another call back from me to verify that I had, in fact, rolled over my 401K.

From the internal perspective of IT, demand to turn around information access solutions will be higher than ever. The need to show progress quickly has never been greater, so selecting tools that support rapid development via iteration and prototyping is critically important.

Search / content processing systems have been integrated into such diverse functions as business intelligence and customer support. Do you see search / content processing becoming increasingly integrated into enterprise applications?

Search is an essential feature in most every application used to create, manage or even analyze content. However, in this mode search is both a commodity and a de-facto silo of data. Standalone search and content processing will still be important as it is the best way to build applications using data across these silos. A good example here is what we call the Agile Content Network (ACN). Every content management system (CMS) has at least minimal search facilities. But how can a content provider create new channels and micro-sites of content across many incompatible CMSs? Standalone information access that can cut across silos is the answer.

Google has disrupted certain enterprise search markets with its appliance solution. The Google brand creates the idea in the minds of some procurement teams and purchasing agents that Google is the only or preferred search solution. What can a vendor do to adapt to this Google effect?

It is certainly true that Google has a powerful brand. However, vendors must promote transparency and help educate buyers so that they realize, on their own, the fit or non-fit of the GSA. It is also important to explain how your product differs from what Google does and how those differences apply to the customers’ needs for accessing information. Buyers are smart, and the challenge for vendors is to be sure to communicate and educate about needs, goals and the most effective way to attain them.

A good example of the Google brand blinding customers to their own needs is detailed in the following blog entry: http://www.attivio.com/attivio/blog/317-report-from-gilbane-2008-our-take-on-open-source-search.html

As you look forward, what are some new features / issues that you think will become more important in 2009? Where do you see a major break-through over the next 36 months?

I think that there continue to be no real standards around information access. We believe that older standards like SQL need to be updated with full-text capabilities. Legacy enterprise search vendors have traditionally focused on proprietary interfaces or driving their own standards. This will not be the case for the next wave of information access companies. Google and others are showing how powerful language modeling can be. I believe machine translation and various multi-word applications will all become part of the landscape in the next 36 months.

Mobile search is emerging as an important branch of search / content processing. Mobile search, however, imposes some limitations on presentation and query submission. What are your views of mobile search’s impact on more traditional enterprise search / content processing?

Mobile information access is definitely emerging in the enterprise. In the short term, it needs to become the instrument by which some updates are delivered – as alerts – and in other cases it is simply a notification that a more complex update – perhaps requiring a laptop – is available. In time mobile devices will be able to enrich results on their own. The iPhone, for example, could filter results using GPS location. The iPhone also shows that complex presentations are increasingly possible.

Ultimately, mobile devices, like the desktop, the call center, the digital home, and the brick and mortar store kiosk, are all access and delivery channels. Getting the information flow for each to work consistently while taking advantage of the intimacy of the medium (e.g. GPS information for mobile) is the future.
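As a concrete, hypothetical illustration of the GPS point above, a handset could post-filter a result list by distance from the device’s current coordinates. The haversine math below is standard; the result records, coordinates, and 5 km radius are assumptions made for the sketch.

```python
# Sketch: filter search results by distance from the device's GPS fix.
# The result records and the 5 km radius are made up for illustration.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

results = [
    {"title": "Branch office A", "lat": 42.36, "lon": -71.06},
    {"title": "Branch office B", "lat": 40.71, "lon": -74.01},
]

device = (42.35, -71.05)  # hypothetical GPS fix (Boston area)

nearby = [r for r in results
          if haversine_km(device[0], device[1], r["lat"], r["lon"]) <= 5.0]
print([r["title"] for r in nearby])   # ['Branch office A']
```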

Where can I find more information about your products, services, and research?

The best place is our Web site: www.attivio.com.

Stephen Arnold, February 25, 2009

Mysteries of Online 8: Duplicates

February 24, 2009

In print, duplicates are the province of scholars and obsessives. In the good old days, I would sit in a library with two books. I would then look at the data in one book and then hunt through the other book until I located the same or similar information. Then I would examine each entry to see if I could find differences. Once I located a major difference such as a number, a quotation, or an argument of some type, I would write down that information on a 5×8 note card. I had a forensics scholarship along with some other cash for guessing accurately on objective tests. To get the forensics grant, I had to participate in cross examination debate, extemporaneous speaking, and just about any other crazy Saturday time waster my “coaches” demanded.

Not surprisingly, mistakes or variances in books, journals, and scholarly publications were not of much concern to some of the students who attended the party school that accepted an addled goose with thick glasses. There were rewards for spending hours looking for information and then chasing down variances. I recall that our debate team, which was reasonably good if you liked goose arguments, was putting up with a team from Dartmouth College. I was listening when I heard a statement that did not match what I had located in a government reference document and in another source. The opponent from Dartmouth had presented the information erroneously. I gave a short rebuttal. I still remember the look of nausea that crossed our opponent’s face when she realized that I presented what I found in my hours of manual checking and reminded the judges that distorting information suggests an issue with the argument. We won.

For most people, two individuals working from the same source is an example of duplicate information. Upon closer inspection, duplication does not mean identical in gross features. Duplication drills down to the details of the information and to the need to determine which item of information is at variance, then figuring out why and which version of the duplicate is most likely correct.

That’s when the fun begins in traditional research. An addled goose can do this type of analysis. Brains are less important than persistence and a toleration for dull, tedious work. As a result, finding duplicative information and then figuring out variances was not something that the typical college sophomore spent much time doing.

Enter computer systems.
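What took the goose hours of note cards now takes milliseconds. Here is a minimal sketch, using Python’s standard difflib module, that flags where two near-duplicate passages diverge so a human can decide which version to trust; the sample sentences are invented for illustration.

```python
# Sketch: find where two near-duplicate records diverge.
# difflib is in the standard library; the sample text is invented.
import difflib

version_a = "The agency reported 4,200 incidents in fiscal year 2007."
version_b = "The agency reported 4,800 incidents in fiscal year 2007."

matcher = difflib.SequenceMatcher(None, version_a.split(), version_b.split())
print("similarity:", round(matcher.ratio(), 3))

# List the word-level variances so a human can decide which version to trust.
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":
        print(tag, version_a.split()[i1:i2], "->", version_b.split()[j1:j2])
```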

Read more
