Microsoft on Semantic Search
October 25, 2011
We were interested to learn that semantic search is alive and kicking. A helping hand may be needed, but semantic search is not on life support.
Microsoft is taking baby steps toward more user-friendly services, particularly in the realm of semantic search. MSDN Library offers information and assistance for developers using Microsoft products and services. While browsing the site, I found one reference article particularly useful.
“Semantic Search (SQL Server)” is a write up that is still in its “preview” stage, so it is short and has a few empty links, but it provides insight and examples that are useful for anyone attempting to integrate Statistical Semantic Search into SQL Server databases. This process, we learn, extracts and indexes statistically relevant key phrases and uses these phrases to identify and index documents that are similar or related. A user queries these semantic indexes by using Transact-SQL rowset functions.
The document tells us:
Semantic search builds upon the existing full-text search feature in SQL Server, but enables new scenarios that extend beyond keyword searches. While full-text search lets you query the words in a document, semantic search lets you query the meaning of the document. Solutions that are now possible include automatic tag extraction, related content discovery, and hierarchical navigation across similar content. For example, you can query the index of key phrases to build the taxonomy for an organization, or for a corpus of documents.
The article goes on to explain various features of semantic search, such as finding key phrases in a document, finding similar or related documents, or even finding the key phrases that make documents similar or related. Add in storage, installation, and indexing, and we have a solid “how-to” from Microsoft. With Powerset, Fast Search, and Cognition Technologies, Microsoft should be one of the aces in semantic search.
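To make the idea of querying semantic indexes through Transact-SQL rowset functions more concrete, here is a minimal Python sketch that only builds the query text. SEMANTICKEYPHRASETABLE and SEMANTICSIMILARITYTABLE are the rowset function names used in the SQL Server documentation; the table name Documents, the column name Body, and the @doc_key parameter are assumptions made up for illustration, and the sketch does not connect to a database.

```python
import re

# Only plain identifiers are accepted, because the names are pasted into SQL text.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _checked(name):
    """Return the identifier unchanged, or raise if it is not a plain name."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def key_phrase_query(table, column, top_n=10):
    """Build T-SQL that lists the highest scoring key phrases of one document."""
    table, column = _checked(table), _checked(column)
    return (
        f"SELECT TOP ({int(top_n)}) keyphrase, score\n"
        f"FROM SEMANTICKEYPHRASETABLE({table}, {column}, @doc_key)\n"
        f"ORDER BY score DESC;"
    )


def similar_documents_query(table, column, top_n=10):
    """Build T-SQL that lists the documents most similar to one document."""
    table, column = _checked(table), _checked(column)
    return (
        f"SELECT TOP ({int(top_n)}) matched_document_key, score\n"
        f"FROM SEMANTICSIMILARITYTABLE({table}, {column}, @doc_key)\n"
        f"ORDER BY score DESC;"
    )


# Hypothetical table and column names, used only to show the shape of the queries.
print(key_phrase_query("Documents", "Body", top_n=5))
print(similar_documents_query("Documents", "Body", top_n=5))
```

The first query corresponds to automatic tag extraction and the second to related content discovery. A real deployment would also need a full-text index with semantic indexing enabled on the column, which this sketch does not cover.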
Andrea Hayden, October 25, 2011
Sponsored by Pandia.com
Startup Kwaga Adopts Sinequa
October 17, 2011
The French outlet Presse-Citron covers Sinequa’s newest client in a weekly column, “The French Start-Up of the Week: Kwaga.” Kwaga develops tools to provide semantic management for email and contacts. Kwaga is using Sinequa to incorporate language semantics.
As the article explains:
To speak of the history of Kwaga, start by talking about Sinequa. Sinequa is a company that offers a solution for business search engines using semantics. And if the name Sinequa does not evoke anything, use a search engine like LeMonde.fr or LeFigaro.fr and you will find the words ‘Search results provided by Sinequa.’
Kwaga’s venture into applications is still young, having only begun this year. Sinequa for a number of years focused on the enterprise. The firm has asserted that it is a leader in search. We will monitor both Sinequa and Kwaga.
Emily Rae Aldridge, October 17, 2011
Observations about Content Shaping
October 3, 2011
Writer’s Note: Stephen E Arnold can be surprising. He asked me to review the text of his keynote speech at ISS World Americas October 2011 conference, which is described as “America’s premier intelligence gathering and high technology criminal investigation conference.” Mr. Arnold has moved from government work to a life of semi retirement in Harrod’s Creek. I am one of the 20 somethings against whom he rails in his Web log posts and columns. Nevertheless, he continues to rely on my editorial skills, and I have to admit I find his approach to topics interesting and thought provoking. He asked me to summarize his keynote, which I attempted to do. If you have questions about the issues he addresses, he has asked me to invite you to write him at seaky2000 at yahoo dot com. Prepare to find a different approach to the content mechanisms he touches upon. (Yes, you can believe this write up.) If you want to register, point your browser at www.issworldtraining.com.— Andrea Hayden
Research results manipulation is not a topic that is new in the era of the Internet. Information has been manipulated by individuals in record keeping and researching for ages. People want to (and can) affect how and what information is presented. Information can also be manipulated not just by people, but by the accidents of numerical recipes.
However, even though this is not a new issue, information manipulation in this age is much more frequent than many believe, and the information we are trying to gather is much more accessible. I want to answer the question, “What do information analysts need to know about this interesting variant of disinformation?”
The volume of data in a digital environment means that algorithms or numerical recipes process content in digital form. The search and content processing vendors can acquire as much or as little content as the system administrator wishes.
In addition to this, most people don’t know that all of the leading search engines specify what content to acquire, how much content to process, and when to look for new content. This is where search engine optimization comes in. Boosting a ranking in a search result is believed to be an important factor for many projects, businesses, and agencies.
Intelligence professionals should realize that conforming to the Webmaster guidelines set forth by Web indexing services will result in a grade much like the scoring of an essay with a set rubric. Documents that conform to these guidelines earn a higher search result ranking. This works because most researchers rely on the relevance ranking to provide the starting point for research. Well-written content which conforms to the guidelines will then frame the research on what is or is not important. Such content can be shaped in a number of ways.
Traditional Entity Extraction’s Six Weaknesses
September 26, 2011
Editor’s Note: This is an article written by Tim Estes, founder of Digital Reasoning, one of the world’s leading providers of technology for entity based analytics. You can learn more about Digital Reasoning at www.digitalreasoning.com.
Most university programming courses ignore entity extraction. Some professors talk about the challenges of identifying people, places, things, events, Social Security Numbers and leave the rest to the students. Other professors may have an assignment related to parsing text and detecting anomalies or bound phrases. But most of those emerging with a degree in computer science consign the challenge of entity extraction to the Miscellaneous file.
Entity extraction means processing text to identify, tag, and properly account for those elements that are the names of persons, numbers, organizations, locations, and expressions such as a telephone number, among other items. An entity can consist of a single word like Cher or a bound sequence of words like White House. The challenge of figuring out names is a tough one for several reasons. Many names exist in richly varied forms. You can find interesting naming conventions in street addresses in Madrid, Spain, and for the owner of a falafel shop in Tripoli.
Entities, as information retrieval experts have learned since the first DARPA conference on the subject in 1987, are quite important to certain types of content analysis. Digital Reasoning has been working for more than 11 years on entity extraction and related content processing problems. Entity oriented analytics have become a very important issue these days as companies deal with too much data, the need to understand the meaning and not just the statistics of the data, and finally the need to understand entities in context, which is critical to understanding code terms and the like.
I want to highlight the six weaknesses of traditional entity extraction and highlight Digital Reasoning’s patented, fully automated method. Let’s look at the weaknesses.
1 Prior Knowledge
Traditional entity extraction systems assume that the system will “know” about the entities. This information has been obtained via training or specialized knowledge bases. The idea is that a system processes content similar to that which the system will process when fully operational. When the system is able to locate or a human “helps” the system locate an entity, the software will “remember” the entity. In effect, entity extraction assumes that the system either has a list of entities to identify and tag or a human will interact with various parsing methods to “teach” the system about the entities. The obvious problem is that when a new entity becomes available and is mentioned one time, the system may not identify the entity.
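The prior knowledge weakness can be illustrated with a minimal Python sketch of a list-driven extractor. The entity list, the labels, and the sample sentence are assumptions invented for this illustration; the sketch does not represent any vendor’s system or Digital Reasoning’s method.

```python
import re

# Hypothetical knowledge base: the only entities this toy system "knows" about.
KNOWN_ENTITIES = {
    "cher": "PERSON",
    "white house": "LOCATION",
    "madrid": "LOCATION",
    "tripoli": "LOCATION",
}


def extract_known_entities(text, gazetteer=KNOWN_ENTITIES):
    """Return (surface form, label) pairs for listed entities, in text order."""
    lowered = text.lower()
    found = []
    for name, label in gazetteer.items():
        # Whole-word, case-insensitive match of each listed name.
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", lowered):
            surface = text[match.start():match.end()]
            found.append((match.start(), surface, label))
    return [(surface, label) for _position, surface, label in sorted(found)]


sentence = "Cher visited the White House before meeting Kngine in Madrid."
entities = extract_known_entities(sentence)
print(entities)

# "Kngine" is a new name that is absent from the list, so it is silently missed.
print("Kngine" in [surface for surface, _label in entities])
```

The extractor finds Cher, White House, and Madrid because they are on the list, and it has no way to notice Kngine until a person or a training run adds the name.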
2 Human Inputs
I have already mentioned the need for a human to interact with the system. The approach is widely used, even in the sophisticated systems associated with firms such as Hewlett Packard Autonomy and Microsoft Fast Search. The problem with relying on humans is a time and cost equation. As the volume of data to be processed goes up, more human time is needed to make sure the system is identifying and tagging correctly. In our era of data doubling every four months, the cost of coping with massive data flows makes human intermediated entity identification impractical.
3 Slow Throughput
Most content processing systems talk about high performance, scalability, and massively parallel computing. The reality is that most of the subsystems required to manipulate content for the purpose of identifying, tagging, and performing other operations on entities are bottlenecks. What is the solution? Most vendors of entity extraction solutions push the problem back to the client. Most information technology managers solve performance problems by adding hardware to either an on premises or cloud-based solution. The problem is that adding hardware is at best a temporary fix. In the present era of big data, content volume will increase. The appetite for adding hardware lessens in a business climate characterized by financial constraints. Not surprisingly entity extraction systems are often “turned off” because the client cannot afford the infrastructure required to deal with the volume of data to be processed. A great system that is too expensive introduces some flaws in the analytic process.
Semantic Technology: Coming for Everyone?
September 23, 2011
Some strong assertions have been made regarding the importance of semantic search and many companies are delving into the process of understanding and utilizing the data.
Hewlett-Packard’s confirmation that it would acquire Autonomy, a vendor of semantic-based software used to extract information from unstructured data, supports the trend toward semantic search and data extraction.
MediaPost’s blog post, “Semantic Search and Raw Data on Rise,” tells us more about the purchase:
Autonomy makes software that searches and keeps track of unstructured data in databases and on Web sites such as Google-like searches through hospital databases and records. Unstructured raw data could increasingly become the next diamond in the rough, allowing brands to target ads based on information extracted from text and images. Some companies already do this.
Google, Bing, and Yahoo are just three of the companies that are expanding their technology to understand semantic search on the web. Vertical Search Works is even launching mobile voice search for the iPhone. There is much promise in this venture, yet many challenges lie ahead because semantic technology may be one of the technologies best left as an enabler, not something the user must think about doing.
Andrea Hayden, September 23, 2011
Foodchannel Vertical Search: More Stickiness?
September 23, 2011
Foodchannel.com has become one of the first consumer food sites to deploy a new semantic search bar technology.
Vertical Search Works announced the launch of VSW Search, a new search bar that publishers can use free of charge. The search bar will direct visitors to a publisher-branded results page rather than immediately being directed away from the publishers’ site. PR Newswire’s article, “Vertical Search Works Launches VSW Search™, a Semantic Search Platform for Web Publishers” details the release:
‘We believe VSW Search™ is the “killer app” for search,’ said Colin Jeavons, CEO of VSW. ‘Publishers on the Web are thirsty for page views, and by delivering a semantic-powered search, we can help them better engage consumers by offering relevant, actionable search results.’
The technology understands a search term as a concept instead of as a keyword. Because the system interprets the searcher’s intent with semantic technology and keeps visitors on a publisher’s results page, I wonder how much a user’s search is going to be dictated by these results. It is an interesting approach to say the least.
For more information, visit www.verticalsearchworks.com.
Andrea Hayden, September 23, 2011
Text Processing for Gender Info
August 27, 2011
Apparently researchers are proving what we have known all along: men and women communicate differently. In all seriousness, the Mitre Corporation is studying the language patterns of tweets to determine whether gender can be accurately assigned. Read more in “Study shows how some tweeters can identify their gender without even trying.”
As the Mitre team shows in their report, there are certain “buzzwords” that can often be found by analyzing the output of female tweeters. Phrases such as “chocolate” and “shopping” are among the most repeated for women tweeters. The most popular phrases for men, you ask? “Http” and “Google”…hey we never said either gender was more interesting than the other.
The team determined that the female/male ratio on Twitter is 55/45, so a guess of “female” would prove correct 55% of the time. However, the team guessed correctly 75% of the time through analysis of certain phrases, like those mentioned above. Perhaps such research could lead to targeted gender-specific advertising. It is interesting regardless, and the full report could be worth a look.
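The comparison between always guessing “female” and the reported phrase-based result is simple arithmetic, sketched below in Python. The 55/45 split and the 75% figure come from the write up; the function name and variable names are made up for this illustration.

```python
def majority_baseline(class_shares):
    """Return the most common class and the accuracy of always guessing it."""
    label = max(class_shares, key=class_shares.get)
    return label, class_shares[label]


# Figures quoted in the write up.
shares = {"female": 0.55, "male": 0.45}
reported_accuracy = 0.75

label, baseline = majority_baseline(shares)

# Improvement over the always-guess-the-majority strategy, in percentage points.
gain_points = round((reported_accuracy - baseline) * 100)

print(label, baseline, gain_points)
```

In other words, the phrase analysis is worth about 20 percentage points over the naive guess, which is the number to watch when judging how much the language patterns actually contribute.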
Emily Rae Aldridge, August 27, 2011
Linguamatics Revealed
July 25, 2011
David Milward, CTO of Linguamatics, sat down with The Inquirer for an in-depth look at the 10-year-old British company and its founder. Dr. Milward insists that it’s not hard to explain what Linguamatics is all about. The write up reported Dr. Milward as saying:
“Its software extracts knowledge from unstructured text. What’s difficult is to explain why it’s different. Isn’t that what a search engine does?”
Linguamatics stands out because traditional searches are not very ‘agile’: you have to program specifically what you want. With his system, you can ask any question and get relevant returns.
Milward and partner Roger Hale have taken text mining to another level with the development of the Linguamatics company. Dr. Milward said:
“Organizations are becoming more and more knowledge-driven,” he says. “Similarly to scientific discovery, they build new things based on existing knowledge.”
Automation is important in the fast paced world of enterprise. Pharmaceutical companies are just one of the knowledge driven arenas that have adopted Milward’s approach to business intelligence. He demonstrated the advancements of his technology in the last election when he mined Twitter reactions. We learned:
“We found that although people don’t use fully grammatical sentences, they do use grammatical constructions.” The relatively few linguistic patterns enabled them to identify what was being said.
Linguistic structure varies with the various operations and fields humans are involved in, as do the words we use.
Milward said his system can see the relationship between them all. For example, his system can take the words carcinoma, tumor, and neoplasm and equate them with “cancer.” He said:
“The result is the ability to ask a question like, “What genes are associated with breast cancer?” and get back a list of genes rather than a list of documents.”
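The combination of synonym mapping and a small set of linguistic patterns can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Linguamatics’ software: the synonym table, the single sentence pattern, and the three sample sentences were written for this example.

```python
import re

# Hypothetical synonym table: several surface terms map to one concept.
CONCEPTS = {
    "carcinoma": "cancer",
    "tumor": "cancer",
    "neoplasm": "cancer",
    "cancer": "cancer",
}

# One simple grammatical construction: "<GENE> is associated with <site> <term>".
PATTERN = re.compile(
    r"(?P<gene>[A-Z][A-Z0-9]+) is associated with (?P<site>\w+) (?P<term>\w+)"
)

# Sample sentences written for this sketch.
SENTENCES = [
    "BRCA1 is associated with breast carcinoma.",
    "BRCA2 is associated with breast neoplasm.",
    "TP53 is associated with lung tumor.",
]


def genes_associated_with(site, concept, sentences):
    """Return gene names linked to the given site and concept, in text order."""
    genes = []
    for sentence in sentences:
        match = PATTERN.search(sentence)
        if match is None:
            continue
        term_concept = CONCEPTS.get(match.group("term").lower())
        if match.group("site").lower() == site and term_concept == concept:
            genes.append(match.group("gene"))
    return genes


print(genes_associated_with("breast", "cancer", SENTENCES))
```

The answer is a list of genes rather than a list of documents, even though none of the sample sentences contains the word “cancer.”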
That’s pretty cool, for a system that doesn’t have a human’s rationality or ability to grow and think. Linguamatics maintains that it’s not trying to replace the human element within the process. They are simply trying to aid in the development so that a job can be done more effectively and in a shorter amount of time.
What this means to the business world is that you will be able to find companies and concepts that are linked in documents without having to pore over the results for hours on end. It will save time and, in turn, will save money. Another key point was:
“There are 20 million relevant articles in the biological domain,” says Milward. “And if you’re going into social media, for example, there are one billion tweets a week. It’s huge amounts of information and what we’re trying to do typically is pull out bits of information from that.”
While in theory Linguamatics has the ability to be a useful tool that can be utilized for the greater good, there are some barriers that it will have to overcome first. The challenge of accessibility is a big one. They have yet to find a graphical interface that can create queries that all computers understand. Let’s face it, even in this age of technology, not everyone is a programmer and knows ‘techspeak.’ All in all, it’s a promising technology and something to keep an eye on. The start-up is only ten years old and has plenty of room to grow this into something big.
Stephen E Arnold, July 25, 2011
Sponsored by Pandia.com, publishers of The New Landscape of Enterprise Search.
Latent Semantic Indexing: Just What Madison Avenue Needs
June 29, 2011
Ontosearch examines “The Use of Latent Semantic Indexing in Internet Marketing.” Going beyond the traditional use of simple keywords, Latent Semantic Indexing (LSI) puts words into context. On the assumption that words used in the same context are similar in meaning, the method uses a mathematical technique known as Singular Value Decomposition to find patterns within text. The word “latent” refers to correlations that are just sitting there within the text sample, waiting to provide important clues to the reader (either human or software).
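The role of Singular Value Decomposition can be shown with a minimal Python sketch that uses NumPy. The three tiny documents and the choice of two latent dimensions are assumptions made for illustration; production systems work with far larger matrices and usually weight the term counts first.

```python
import numpy as np


def term_document_matrix(docs):
    """Count how often each vocabulary term appears in each document."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    index = {word: row for row, word in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for col, doc in enumerate(docs):
        for word in doc.split():
            matrix[index[word], col] += 1.0
    return vocab, matrix


def latent_term_vectors(matrix, k):
    """Project terms into the k strongest latent dimensions found by SVD."""
    u, s, _vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


# "car" and "automobile" never share a document, but both appear with "engine".
docs = ["car engine", "automobile engine", "banana fruit"]
vocab, counts = term_document_matrix(docs)
latent = latent_term_vectors(counts, k=2)
row = {word: i for i, word in enumerate(vocab)}

raw_similarity = cosine(counts[row["car"]], counts[row["automobile"]])
latent_similarity = cosine(latent[row["car"]], latent[row["automobile"]])
unrelated_similarity = cosine(latent[row["car"]], latent[row["banana"]])
print(raw_similarity, latent_similarity, unrelated_similarity)
```

On raw counts the two words look unrelated because they never co-occur. After the decomposition they land on the same latent dimension because they share the context word “engine,” while “banana” stays on a separate one.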
When used by a search engine to determine ranking, LSI is a huge advance in establishing relevance to the user’s query. It also helps to lower the rank of duplicate websites. A company’s marketing department must keep this process in mind, and refuse to rely on keywords alone.
Google recently made headlines by revamping their search engine to increase the relevancy of their search results. Enhanced LSI was at the root of that change. Many users have been happy with the results, but a lot of businesses found themselves scrambling to recover their coveted high rankings. Adjustments had to be made.
Ontosearch’s post examines the response to this technique in the marketing world:
Latent Semantic system, is known to enhance or compliment the traditional net marketing keyword analysis technique rather than replacing or competing with them. One drawback of the LSI system is that it is based on a mathematical set of rules, which means that it can be justified mathematically but in the natural term, it has hardly any meaning to the users. The use of Latent Semantic System does not mean that you get rid of the standard use of keywords for search reference, instead it is suggested that you maintain a good density of specific keywords along with a good number of related keywords for appropriate Web marketing of the sites.
That technique allows marketing departments to maximize their search rankings. Wow, the marketers are moving to the future! I guess they know what’s good for them. Any company that refuses to embrace the newest techniques risks being left in the dust, especially these days.
But what happens if the Latent Semantic interpretation is incorrect? It can’t guess correctly every time. Check up on search engines’ interpretation of your site’s text to be sure you appear where you think you should.
During a quick Web search (no, the irony is not lost on me), I found that the method has been used to filter spam. That’s welcome. It’s also been applied to education. It’s also been applied to the study of human memory. Interesting. (That reminds me, have I taken my Ginkgo biloba today?)
Our view is that semantic methods have been in use in the plumbing of Google-like systems for years. The buzz about semantic technology is one of the search methods that surf on Kondratieff waves. This has been a long surf board ride. The shore is in sight.
Cynthia Murrell, June 29, 2011
You can read more about enterprise search and retrieval in The New Landscape of Enterprise Search, published by Pandia in Oslo, Norway, in June 2011.
Egyptian Startup Kngine Bets on Semantic Search
June 27, 2011
The Next Web has a couple of interesting recent articles regarding startups in Egypt. First, they announce that “Four Egyptian Startups Are US-Bound for Funding.” The fledgling companies include a couple of mobile services providers, a hardware accelerator enterprise, and semantic search engine Kngine.
According to the write up:
Sawari Ventures, an international venture capital firm, is behind the concept, and is supporting the four Egyptian startup companies as part of its efforts ‘to identify, serve, and provide capital for extraordinary entrepreneurs who are determined to change the MENA [Middle East/ North Africa] region.’
We applaud the effort and wish all the startups luck; nothing boosts stability like successful businesses.
We, however, are particularly interested in Kngine, a semantic search provider, said to have already attracted an international following.
We keep asking, “Is semantic search the next big thing?”
Investors and influential blogs like Next Web track the space closely; for example, see the excellent write up “Semantic Web, Meet Middle East. Middle East, Meet Kngine!”
Revealing that Kngine is the first Middle Eastern semantic search engine, the article voices confidence in the product:
“The engine, while Middle Eastern focused, also works great on various global and international topics and can provide on-the-spot suggestions, related results and even calculates the average city weather per month. While it’s no WolframAlpha, Kngine has been entirely created by a two-person team. It could be a great Google/Wiki search alternative if you’re looking for quick and fast information, especially if it’s Middle Eastern related.”
Especially impressive are the robust support of complex queries and the ability to recognize Arabic. Though the results won’t be displayed in that language for another six months, the engine can connect Arabic words with their English equivalents.
You can take a tour of the service here.
Cynthia Murrell, June 26, 2011