The Semantic Web as it Stands
April 16, 2011
Semantic search for the enterprise is here, but the semantic web remains the elusive holy grail. “Semantic Web: Tools you can use” surveys the current state of semantic technology and what it still needs in order to become a true web-scale capability.
Tim Berners-Lee was the first to articulate what the semantic web would look like, and his vision of federated search is still sorely missing from reality. Federated search queries several disparate resources simultaneously (as when you search several different library databases at once). Windows 7 supports federated search, but the capability is still not common across the web. The W3C (World Wide Web Consortium) has developed standards to support semantic web infrastructure, including SPARQL, RDF, and OWL, and Google, Yahoo, and Bing are starting to use semantic metadata and support W3C standards like RDF.
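As a concrete illustration (mine, not the article’s): the W3C stack can already be exercised from a few lines of code. Here is a minimal Python sketch, assuming the SPARQLWrapper library and the public DBpedia endpoint, that runs a SPARQL query against one remote RDF store; the SPARQL federation extensions (the SERVICE keyword) aim to let a single query span several such endpoints at once.

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Ask DBpedia's public SPARQL endpoint for a handful of programming languages.
    endpoint = SPARQLWrapper("http://dbpedia.org/sparql")
    endpoint.setQuery("""
        PREFIX dbo:  <http://dbpedia.org/ontology/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?label WHERE {
            ?lang a dbo:ProgrammingLanguage ;
                  rdfs:label ?label .
            FILTER (lang(?label) = "en")
        } LIMIT 5
    """)
    endpoint.setReturnFormat(JSON)

    for row in endpoint.query().convert()["results"]["bindings"]:
        print(row["label"]["value"])

The point is not the particular query but that the endpoint, the query language, and the result format are all open standards, which is exactly what a federated search layer needs.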
Semantic software is able to analyze and describe the meaning of data objects and their inter-relationships, resolving language ambiguities such as homonyms and synonyms, as long as standards are followed. This has practical applications in areas like shopping comparison. If standards are followed and merchants themselves provide semantic metadata, online shoppers can compare products without the inaccuracies and out-of-date information currently plaguing third-party shopping comparison sites.
There are some tools, platforms, prewritten components, and services currently available to make semantic deployment easier and somewhat less expensive. Jena is an open-source Java framework for building semantic web applications, and Sesame is an open-source framework for storing, inferencing over, and querying RDF data. Lexalytics produces a semantic platform that contains general ontologies which service provider partners can fine-tune for specific business domains and applications. Revelytix sells a knowledge-modeling tool called Knoodl.com, a wiki-based framework that helps a wide variety of users collaboratively develop a semantic vocabulary for domain-specific information residing on different web sites. Sinequa’s semantic platform, Context Engine, provides semantic infrastructure that includes a generic semantic dictionary that can translate between various languages and can also be customized with business-specific terms. Thomson Reuters provides Machine Readable News, which collects, analyzes, and scores online news for sentiment (public opinion), relevance, and novelty, and OpenCalais, which creates open metadata for submitted content.
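Jena and Sesame are Java frameworks, but the pattern they support is easy to sketch. Below is a rough Python analogue using the rdflib library (my sketch, not any vendor’s API): build a small in-memory RDF graph, then query it with SPARQL.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, RDFS

    EX = Namespace("http://example.org/")

    g = Graph()
    # Two triples: a product resource and its human-readable label.
    g.add((EX.widget, RDF.type, EX.Product))
    g.add((EX.widget, RDFS.label, Literal("Widget, 3-inch")))

    # SPARQL over the in-memory store -- the same workflow Jena and Sesame
    # offer in Java, usually backed by a persistent triple store.
    query = "SELECT ?p ?name WHERE { ?p a ex:Product ; rdfs:label ?name }"
    for row in g.query(query, initNs={"ex": EX, "rdfs": RDFS}):
        print(row.p, row.name)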
Despite all these advances for the use of the semantic web in the enterprise, general, widespread use of the semantic web remains elusive, and no one can predict exactly when that will change:
“In a 2010 Pew Research survey of about 895 semantic technology experts and stakeholders, 47% of the respondents agreed that Berners-Lee’s vision of a semantic Web won’t be realized or make a significant difference to end users by the year 2020. On the other hand, 41% of those polled predicted that it would. The remainder did not answer that query.”
Semantic technology for the enterprise is not only here today, but is growing by about 20% a year, according to IDC. That kind of semantic technology is a much smaller beast to tame. When it comes to the World Wide Web, there is still no widespread support for W3C standards and common vocabularies, which is why more respondents said no than yes in the survey mentioned above. Generalized web searches are difficult because each site has its own largely proprietary ontology instead of a shared and open taxonomy.
Sometimes even within an enterprise it is difficult to reconcile the vocabularies of different sectors of the same business.
However, certain industries have started to come under pressure from customers or industry groups and have responded by creating standardized ontologies. GoodRelations is one such e-commerce ontology, used by BestBuy.com, Overstock.com, and Google. The technique has not become widespread because of the costs and slow payoff involved. It is a catch-22: businesses don’t want to jump on the bandwagon because there is no critical mass yet, but the real benefits won’t arrive until a large number of businesses participate. Things like product categories are often unique to a business, and universal standardization is a nightmare to achieve, but there still needs to be consensus on using some type of W3C-standard categorization to satisfy customers. And with more and more bogus information proliferating on the web, semantics become not only convenient but essential for finding the right information.
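To make the GoodRelations idea concrete (my sketch, not drawn from the article): a merchant publishes an offer in the shared vocabulary, and any crawler that knows that vocabulary can read the price without screen-scraping. The namespace below is GoodRelations’ real one; the shop and product are invented, and the snippet is parsed with Python’s rdflib.

    from rdflib import Graph

    # A merchant's offer in the GoodRelations vocabulary (Turtle syntax).
    # The vocabulary is real; the shop and product are invented.
    offer = """
    @prefix gr: <http://purl.org/goodrelations/v1#> .
    @prefix ex: <http://example.org/shop#> .

    ex:offer1 a gr:Offering ;
        gr:name "Acme Blender, 500W" ;
        gr:hasPriceSpecification [
            a gr:UnitPriceSpecification ;
            gr:hasCurrency "USD" ;
            gr:hasCurrencyValue "39.99"
        ] .
    """

    g = Graph()
    g.parse(data=offer, format="turtle")
    q = """PREFIX gr: <http://purl.org/goodrelations/v1#>
           SELECT ?name ?price WHERE {
               ?o gr:name ?name ;
                  gr:hasPriceSpecification [ gr:hasCurrencyValue ?price ] .
           }"""
    for row in g.query(q):
        print(row.name, row.price)

A shopping comparison engine that understood this vocabulary could aggregate such offers from thousands of merchants without writing a custom scraper for each.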
I think the fundamental question this article leaves us with is whether we have the standards we need or whether the current standards are the jumping-off point to something new. SGML was fine in its day, but it didn’t get very far. HTML cherry-picked some of the basic ideas of SGML, added linking, and the World Wide Web was born. Now HTML 5 is re-introducing some of the ideas of SGML that were lost. Maybe HTML can continue to evolve, or maybe someone will cherry-pick its best ideas and create something (almost) entirely new. Another issue is the work it takes to create all the metadata, no matter what the standards. Flickr and Facebook have made user tagging into a fun activity, but for the semantic web to really function, machines need to do most of the work. Will this all be figured out by 2020? Survey says no, but who knows?
Alice Wasielewski
April 16, 2011
Twitter Firehose News
April 15, 2011
There is a tweak to the Twitter and Mediasift partnership. You can read about it in the DataSift write-up “Twitter Partnership”.
Mediasift and Twitter have agreed to a partnership that has the potential to change how marketers and companies understand conversations about their products, as well as how they choose to market to target audiences. Using the DataSift software, they can break “tweets” down into a form that is easily understood and searched, while remaining cost effective with its “pay per use” subscription. The article said:
As a company we have been very fortunate to have access to the Twitter Firehose for quite some time. This has enabled us over the past two years to refine our thinking, leading to the incarnation of DataSift.
DataSift compiles multiple social media feeds and additional data sets into a common abstraction layer which provides meaningful insight into much of the chaotic, unstructured data from those outlets. It took nearly 18 months to complete the DataSift platform, but it has already seen a huge outpouring of company and marketing support, with more than a billion requests per month.
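What does such an abstraction layer look like to a client? A rough sketch (the endpoint, parameters, and field names below are hypothetical, not DataSift’s actual API): subscribe to a filtered stream and receive normalized records regardless of which network a post came from.

    import requests

    # Hypothetical endpoint and fields -- a sketch of the abstraction-layer
    # idea, not DataSift's real API.
    resp = requests.get(
        "https://api.example.com/stream",
        params={"filter": 'content contains "acme"'},
        timeout=10,
    )
    for item in resp.json()["items"]:
        # The layer normalizes posts from different networks into one shape,
        # so consumers never parse network-specific formats.
        print(item["source"], item["author"], item["content"])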
Important stuff for the real time crowd.
Leslie Radcliff, April 15, 2011
Freebie
Improving Health via Analytics and a Competition
April 14, 2011
We have been poking around in health care information for about eight months. We have an exclusive briefing that covers, among other things, what we call the “shadow FBI.” If you are curious about this shadow FBI angle, shoot us a note at seaky2000 at yahoo dot com. One of the goslings will respond. While you wait for our return quack, consider the notion of a competition to improve health care information in order to make health care better.
The article “Competition promises better health care” stated:
The goal of the prize is to develop a predictive algorithm that can identify patients who will be admitted to the hospital within the next year, using historical claims data.
According to the latest survey from the American Hospital Association, more than 70 million people in the United States alone will be admitted to a hospital this year. The Heritage Provider Network believes it can change that. HPN will hold a two-year competition awarding $3 million to the team that creates an algorithm that accurately predicts how many days a person will spend in the hospital over the next year.
An algorithm that can predict how many days a person will spend in the hospital can help doctors create new, more effective care plans and “nip it in the bud” when there is cause for concern. If it works, the algorithm could lower the cost of care while reducing the number of hospitalizations.
This will result in increasing the health of patients while decreasing the cost of care. In short, a winning solution will change health care delivery as we know it – from an emphasis on caring for the individual after they get sick to a true health care system.
HPN believes that an incentive-based competition is the way to achieve the big breakthroughs needed to begin redeveloping America’s health care system.
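For flavor, here is a minimal sketch, with invented feature names and synthetic data, of the kind of supervised model a competing team might start from: regress next year’s days in hospital on features derived from historical claims.

    from sklearn.ensemble import RandomForestRegressor

    # Synthetic stand-ins for claims-derived features:
    # [age, claims last year, chronic conditions, ER visits]
    X = [
        [34, 2, 0, 0],
        [67, 9, 3, 2],
        [45, 4, 1, 0],
        [72, 12, 4, 3],
    ]
    y = [0, 5, 1, 9]  # days in hospital the following year

    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X, y)

    # Score a new patient record.
    print(model.predict([[58, 7, 2, 1]]))

A real entry would need far richer features, careful validation, and attention to the privacy constraints of claims data, but the shape of the problem is exactly this.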
Leslie Radcliff, April 14, 2011
Freebie
Recorded Future in the Spotlight: An Interview with Christopher Ahlberg
April 5, 2011
It is big news when In-Q-Tel, the investment arm of the US intelligence community, funds a company. It is really big news when Google funds a company. But when both of these tech-savvy organizations fund a company, Beyond Search has to take notice.
After some floundering around, ArnoldIT was able to secure a one-on-one interview with the founder of Recorded Future. The company is one of the next-generation cloud-centric analytics firms. What sets the company apart technically is, of course, the magnetism that pulled In-Q-Tel and Google to the Boston-based firm.
Mr. Ahlberg, one of the founders of Spotfire, which was acquired by the hyper-smart TIBCO organization, has turned his attention to Web content and predictions. Using sophisticated numerical recipes, Recorded Future can make observations about trends. This is not fortune telling, but mathematics talking.
In my interview with Mr. Ahlberg, he said:
We set out to organize unstructured information at very large scale by events and time. A query might return a link to a document that says something like “Hu Jintao will tomorrow land in Paris for talks with Sarkozy” or “Apple will next week hold a product launch event in San Francisco.” We wanted to take this information and make insights available through a stunning user experience and application programming interfaces. Our idea was that an API would allow others to tap into the richness and potential of Internet content in a new way.
When I probed for an example, he told me:
What we do is to tag information very, very carefully. For example, we tag when we find an item of data, when that datum was published, when we analyzed it, and the actual time point (past, present, future) to which the datum refers. The time precision is quite important. Time makes it possible for end users and modelers to deal with this important attribute. At this stage in our technology’s capabilities, we’re not trying to claim that we can beat someone like Reuters or Bloomberg at delivering a piece of news the fastest. But if you’re interested in monitoring, for example, the co-incidence of an insider trade with a product recall, we can probably beat most at that.
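To restate the idea in code (my sketch; the field names are hypothetical, not Recorded Future’s schema): each datum carries several distinct timestamps, and the interesting one is the time point the text itself refers to, which may lie in the future.

    from dataclasses import dataclass
    from datetime import date

    # Hypothetical record illustrating multi-time tagging; not Recorded
    # Future's actual schema.
    @dataclass
    class Datum:
        text: str
        found: date       # when the harvester located the item
        published: date   # when the source published it
        analyzed: date    # when the system processed it
        refers_to: date   # the time point the statement is about

    d = Datum(
        text="Apple will next week hold a product launch event",
        found=date(2011, 3, 28),
        published=date(2011, 3, 27),
        analyzed=date(2011, 3, 28),
        refers_to=date(2011, 4, 4),  # a future event -- the key attribute
    )
    print(d.refers_to > d.published)  # True: a forward-looking statement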
To read the full text of the interview with Mr. Ahlberg, click here. The interview is part of the Search Wizards Speak collection of first-person narratives about search and content processing. Available without charge on the ArnoldIT.com Web site, the more than 50 interviews comprise the largest repository of first-hand explanations of “findability” available.
If you want your search or content processing company featured in this interview series, write seaky2000 at yahoo dot com.
Stephen E Arnold, April 5, 2011
Freebie
Linguamatics Takes to the Cloud
March 22, 2011
One of the leaders in enterprise text mining, Linguamatics, recently announced its newest software creation in “I2E OnDemand – Cloud (Online) Text Mining”. The company’s flagship product, I2E, is an enterprise version of NLP-based text mining software, largely implemented in the medical and pharmaceutical industries. Now Linguamatics adds I2E OnDemand to its offerings menu, matching the popular I2E capabilities with cloud computing for those companies with fewer resources stacked in their corners.
The write-up boasts:
“I2E OnDemand provides a cost-effective, accessible, high performance text mining capability to rapidly extract facts and relationships from the MEDLINE biomedical literature database, supporting business-critical decision making within your projects. MEDLINE is one of the most commonly accessed resources for research by the pharmaceutical and biotech industries.”
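MEDLINE itself is publicly searchable by program. For comparison only (this uses NCBI’s public E-utilities interface, not Linguamatics’ I2E), here is a minimal Python sketch that pulls matching PubMed identifiers:

    import requests

    # Search PubMed/MEDLINE via NCBI's public E-utilities (esearch).
    resp = requests.get(
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
        params={"db": "pubmed", "term": "text mining",
                "retmax": 5, "retmode": "json"},
        timeout=10,
    )
    print(resp.json()["esearchresult"]["idlist"])

The difference, of course, is that a keyword search returns documents, while I2E’s NLP-based approach extracts facts and relationships from them.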
Of course, if additional data sources must be searched, it is possible to move up to the enterprise version of I2E. A trial version for evaluation is available by request from the website. Linguamatics has been diversifying over the last 12 months. In 2009, I characterized Linguamatics as a vendor with a product tailored to the needs of the pharma and medical sectors. Now Linguamatics appears to be making moves outside of these vertical sectors.
Sarah Rogers, March 22, 2011
Freebie
Rosette Linguistics Platform Releases Latest Version
March 10, 2011
Basis Technology has announced the most recent release of its Rosette Linguistics Platform. Rosette is the firm’s multilingual text analytics software. Among the features of the new release is the addition of Finnish, Hebrew, Thai, and Turkish to the system’s roster of 24 languages. One point we noted is that this release of Rosette sports an interesting mix of compatible search engines. According to the Basis Tech announcement:
“Bundled connectors enable applications built with Apache Lucene, Apache Solr, dtSearch Text Retrieval Engine, and LucidWorks Enterprise to incorporate advanced linguistic capabilities, including document language identification, multilingual search, entity extraction, and entity resolution.”
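To unpack what a bundled connector does: language identification typically runs as an early stage of the indexing pipeline, so that tokenization, stemming, and entity extraction downstream can apply the right rules per document. A toy sketch of that staging (the function and field names are mine, not Basis Tech’s API):

    from collections import Counter

    # Toy language identifier; real systems use character n-gram statistics
    # or trained models, not a three-word lookup table.
    MARKERS = {"the": "en", "der": "de", "le": "fr"}

    def identify_language(text: str) -> str:
        hits = Counter(w for w in text.lower().split() if w in MARKERS)
        return MARKERS[hits.most_common(1)[0][0]] if hits else "unknown"

    def index_document(doc: dict) -> dict:
        # Tag the language before tokenization so later stages can pick
        # the right linguistic rules for this document.
        doc["lang"] = identify_language(doc["body"])
        return doc

    print(index_document({"id": 1, "body": "Der Hund und der Ball"}))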
Several observations seem warranted. First, Basis Tech is moving beyond providing linguistic functionality; the company is pushing into text analytics and search. Second, Basis Tech is supporting commercial and open source search systems alike; namely, the SharePoint-centric dtSearch and Lucid Imagination’s open source solution.
The question becomes, “What is the business trajectory of Basis Tech? Will it become a competitor to the vendors with which the company has worked for many years? Will it morph into a new type of linguistic-centric analytics firm?” Stay tuned.
Cynthia Murrell, March 10, 2011
Freebie
Automated Understanding: Digital Reasoning Cracks the Information Maze
March 4, 2011
I learned from one reader that the presentation by Tim Estes, the founder of Digital Reasoning, caused some positive buzz at a recent conference on the west coast. According to my source, this was a US government sponsored event focused on where content processing is going. The surprise was that while other presenters talked about the future, a company called Digital Reasoning demonstrated a next-generation system. Keep in mind that i2 Ltd. offers a solid analyst’s tool with technology roots that stretch back 15 years. (I did some work for the founder of i2 a few years ago and have a great appreciation for the value of the system in law enforcement casework.) Palantir has some useful visualization tools, but the company continues to attract attention for litigation and brushes with outfits that have some interesting sales practices. Beyond Search covered this story here and here.
ArnoldIT.com sees Digital Reasoning’s Synthesys as solving difficult information puzzles quickly and efficiently because it eliminates most of the false paths and trial-and-error of traditional systems. Solving the information maze of real-world flows is now possible, in our view.
The shift was from semi-useful predictive numerical recipes and overlays or augmented outputs to something quite new and different. The Digital Reasoning presentation focused on real data and what the company called “automated understanding.”
For a few bucks last year, one of my colleagues and I got a look at the automated understanding approach of the Synthesys 3 platform. Tim Estes explained that real data poses major challenges to systems that lack an ability to process large flows, discern nuances, and apply what Mr. Estes described as “entity oriented analytics.”
Our take at ArnoldIT.com is that Digital Reasoning moves “beyond search” in a meaningful way. The key points we recall from our briefing were that a modular approach eliminates the need for a massive infrastructure build and that the analytics reflect what is happening in a real-time flow of unstructured information. My personal view is that historical research is best served by key word systems. The more advanced methods deliver actionable information and better decisions by focusing on the vast amounts of “now” data. A single Twitter message can be important. A meaningful analysis of a flow of Twitter messages moves insight to the next level.
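A toy illustration of what entity-oriented analytics over a message flow can mean (my sketch, not Digital Reasoning’s method): track which extracted entities co-occur across a stream, so that a sudden spike in a pairing can be flagged for an analyst.

    from collections import Counter
    from itertools import combinations

    # Entities already extracted from each message in the flow.
    messages = [
        ["Acme", "recall"],
        ["Acme", "lawsuit"],
        ["Acme", "recall"],
    ]

    # Count co-occurring entity pairs across the stream.
    pairs = Counter()
    for entities in messages:
        pairs.update(combinations(sorted(set(entities)), 2))

    for (a, b), n in pairs.most_common():
        print(a, b, n)  # ("Acme", "recall") surfaces first

A single message contributes little; the aggregate pattern is where the insight lives, which is the point made above about Twitter flows.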
Attivio Unveils Maturity Model
March 4, 2011
Our aggregators returned this interesting piece to us from PR Newswire: “Attivio Releases Maturity Model for Unified Information Access.” Attivio has released a series of white papers detailing the benefits of unified information access (UIA). The purpose of UIA is to help businesses see how information access technologies can increase revenue, cut costs, and improve customer satisfaction for long-term strategic planning. Using the UIA model, businesses can learn new approaches to data integration. Attivio said:
“The objective of the model is to help organizations establish, benchmark, and improve information access and management strategies. According to the report, the first step in developing a plan for implementing UIA is to conduct a self-assessment of current capabilities and needs, then determine the urgency and importance of solving each issue identified. As an organization moves into the next stage, the incremental capabilities and benefits are measured across two vectors – business criticality and information management integration and process improvements.”
The UIA model can be used by any business to improve its information assets and overall practices.
Attivio is a technology firm that offers functions and systems that push beyond keyword search and retrieval.
Whitney Grace, March 4, 2011
Freebie
Attensity Goes on a Social Offensive
March 3, 2011
Remember the pigeons from Psych 101?
From the discoveries made by Pavlov and his dogs to the direct application of the science by Ogilvy and his Madison Avenue minions, psychology has long played a part in shaping us as consumers.
Now it seems the growing worldwide embrace of social media has altered one more aspect of our lives: how we are marketed to or, to phrase it more accurately, how we have begun to market ourselves.
Attensity’s “Customer Segmentation for the Social Media Age” (which the Attensity writer admits was inspired by a series of tweets) delves into new media’s ramifications for conventional segmentation practices.
Attensity explains that before the technological advances of the last three decades, gathering the information necessary to construct effective marketing campaigns consumed substantial amounts of both time and capital. Despite these costs,
” … Segmentation was the best attempt that we as marketers had to give our customers what they needed, …”
What has changed?
The buyer’s willingness, nay seeming compulsion, to share every fleeting thought and scrap of personal information with anyone clever enough to operate one of the many devices that link us to the web. Instead of sorting through pounds of trial results and customer surveys, the new breed of admen can now, as Attensity states:
“… scour the social web to find mentions of our brands, our competitors’ brands and product categories.”
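In practice, “scouring the social web” starts as a mention scan plus aggregation. A toy sketch (invented data, not Attensity’s system):

    from collections import Counter

    # Toy brand-mention tally over social posts.
    BRANDS = {"acme", "globex", "initech"}
    posts = [
        "just tried the new acme detergent, love it",
        "globex support is hopeless",
        "acme > globex, fight me",
    ]

    mentions = Counter(
        b for post in posts for b in BRANDS if b in post.lower()
    )
    print(mentions.most_common())

The hard parts, and presumably where vendors like Attensity earn their keep, are disambiguation, sentiment, and tying mentions back to customer segments.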
An interesting read and something to think about the next time you feel the urge to “friend” your laundry detergent.
In a related post on the site of Capgemini, the Parisian consulting and technology firm, Senior Consultant Jude Umeh discusses the melding of social media surveillance with the review, application, and management of the collected data. His perspective is informed by the hands-on experience he gained at a partner training session organized by Attensity.
Attensity is collaborating with Pega, a firm offering business process management and customer relationship management software. BPM and CRM are factors in the new math of modern marketing, and Attensity seems to have discovered the formula that will position the collective at the head of the pack.
Layering their respective technologies, the group appears poised to revolutionize the way information is gleaned from new media. Can Attensity pull off a home run with every search and content processing vendor “discovering” these market sectors? We do not know.
Michael Cory, March 3, 2011
Freebie
Meaning in Casual Wording
March 3, 2011
I love science. Paired with my increasing passion for language and grammar, a sweeter cocktail could hardly be imagined. “Do Casual Words Betray Warlike Intent?” was a fascinating read.
At the recent American Association for the Advancement of Science (AAAS) meeting, James Pennebaker, a University of Texas at Austin psychologist, spoke about the study he and assorted colleagues, along with the Department of Homeland Security, have recently been engaged in. The focus of the research is four similar Islamic groups and the relationship between the speech they employ and the actions that follow. The collective hope is that the study’s findings can be used to forecast aggressive activity.
Isolating pronouns, determiners, adjectives, and prepositions, the group mines them for what Pennebaker calls “linguistic shifts.” To date they have determined that, of the four, the two groups that have committed acts of violence telegraphed that destructiveness with the use of “more personal pronouns, and words with social meaning or which convey positive or negative emotions.” Aside from differentiating between various stylistic elements of expression, Pennebaker has also scrutinized statements made by warmongers from our past, including George W. Bush, with interesting results.
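The mechanics behind spotting a “linguistic shift” are simple to sketch (my illustration, not Pennebaker’s actual method or word lists): profile each text by the relative frequency of function words, then compare profiles over time.

    from collections import Counter

    # Toy function-word list; the real studies use much larger,
    # carefully validated lexicons.
    FUNCTION_WORDS = {"i", "we", "you", "the", "a", "of", "in", "on"}

    def profile(text: str) -> dict:
        words = text.lower().split()
        counts = Counter(w for w in words if w in FUNCTION_WORDS)
        total = len(words) or 1
        return {w: counts[w] / total for w in FUNCTION_WORDS}

    early = profile("we will act in the name of the people")
    late = profile("i will act and i will not wait")
    shift = {w: late[w] - early[w] for w in FUNCTION_WORDS}
    # The largest movers are the candidate "shifts".
    print(sorted(shift.items(), key=lambda kv: -abs(kv[1]))[:3])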
Skepticism has always fueled scientific endeavors, and we must continue to ask questions, especially those that breed discomfort. This science deals with a very grey area and Pennebaker himself labels the results as only “modest probabilistic predictions”. There is no question that this information must be used responsibly, but my aforementioned appreciation for the field keeps me from seeing this as a negative.
If one can discern an opponent’s intent in a fight or a game of cards by careful observation, why is it so strange to think the same could be done from listening to what they say?
Sarah Rogers, March 3, 2011
Freebie