March 14, 2017
Real chicken or fake news? You decide. I read “Google, What the H&%)? Search Giant Wrongly Said Shop Closed Down, Refused to List the Truth.” The write up reports that a chicken restaurant is clucking mad about how Google references the eatery. The Google, according to the article, thinks the fowl peddler is out of business. The purveyor of poultry disagrees.
The write up reports:
Kaie Wellman says that her rotisserie chicken outlet Arrosto, in Portland, Oregon, US, was showing up as “permanently closed” on Google’s mobile search results.
Ms Wellman contacted the Google and allegedly learned that Google would not change the listing. The fix seems to be that the bird roaster has to get humans to input data via Google Maps. The smart Google system will recognize the inputs and make the fix.
The write up reports that the Google listing is now correct. The fowl mix up is now resolved.
Yes, the Google. Relevance, precision, recall, and accuracy. Well, maybe not so much of these ingredients when one is making fried mobile outputs.
Stephen E Arnold, March 14, 2017
March 8, 2017
I read “Ontologies: Practical Applications.” The main idea in the write up is that indexing is important. Indexing goes by different labels today; for example, metadata, entity extraction, concepts, etc. I agree that indexing is important, but the challenge is that most people are happy with tags, keywords, or systems which return a result that has made a high percentage of users happy. Maybe semi-happy. Who really knows? Asking about search and content processing system satisfaction returns the same grim news year after year; that is, most users (roughly two thirds) are not thrilled with the tools available to locate information. Not much progress in 50 years, it seems.
The write up informs me:
Ontologies are a critical component of the enterprise information architecture. Organizations must be capable of rapidly gathering and interpreting data that provides them with insights, which in turn will give their organization an operational advantage. This is accomplished by developing ontologies that conceptualize the domain clearly, and allows transfer of knowledge between systems.
This seems to mean a classification system which makes sense to those who work in an organization. The challenge we have encountered over the last half century is that the content and data flowing into an organization change, often rapidly, over time. At any one point in time, the information needed may not yet be in the system. The organization sucks in what’s needed and hopes the information access system indexes the new content right away and makes it findable and usable in other software.
That’s the hope anyway.
The reality is that a gap exists between what’s accessible to a person in an organization and what information is being acquired and used by others in the organization. Search fails for most system users because what’s needed now is not indexed or if indexed, the information is not findable.
An ontology is a fancy way of saying that a consultant and software can cook up a classification system and use those terms to index content. Nifty idea, but what about that gap?
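In its simplest form, such a consultant-built classification can be sketched as a flat term list keyed by category. This is a toy illustration with invented category names and terms, not any vendor's method, and it shows the gap in action: content outside the term list simply falls through.

```python
# Toy "ontology": categories mapped to hand-picked terms. All names
# here are invented for illustration.
ONTOLOGY = {
    "poultry": {"chicken", "rotisserie", "fowl"},
    "search": {"index", "query", "recall", "precision"},
}

def tag(text: str) -> set:
    """Return the categories whose terms appear in the text."""
    words = set(text.lower().split())
    return {cat for cat, terms in ONTOLOGY.items() if words & terms}

print(tag("the rotisserie chicken query failed"))  # both categories match
print(tag("quarterly revenue fell sharply"))       # empty set: the gap
```

New content that uses none of the consultant's terms gets no tags at all, which is exactly the gap the sales brochure glosses over.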
This is the killer for most indexing outfits. They make a sale because people are dissatisfied with the current methods of information access. An ontology or some other jazzed up indexing component is sold as the next big thing.
When an ontology, taxonomy, or other solution does not solve the problem, the company grouses about search and content processing again.
Is there a fix? Who knows. But after 50 years in the information access sector, I know that jargon is not an effective way to solve very real problems. Money, know-how, and old school methods are needed to make certain technologies deliver useful applications.
Ontologies. Great. Silver bullet. Nah. Practical applications? Nifty concept. Reality is different.
Stephen E Arnold, March 8, 2017
February 27, 2017
Let’s create a scenario. You are a person trying to figure out how to index a chunk of content. You are working with cancer information sucked down from PubMed or a similar source. You run an extraction process and push the text through an indexing system. You use a system like Leximancer and look at the results. Hmmm.
Next you take a corpus of blog posts dealing with medical information. You suck down the content and run it through your extractor, your indexing system, and your Leximancer set up. You look at the results. Hmmm.
How do you figure out what terms are going to be important for your next batch of mixed content?
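One crude way to attack that question is to compare term frequencies across the two collections and flag terms that dominate one but not the other. This is emphatically not Leximancer's method, just a sketch with invented mini-corpora:

```python
from collections import Counter

def term_counts(docs):
    """Count whitespace-delimited terms across a list of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts

# Invented stand-ins for the two corpora described above.
clinical = ["tumor growth inhibition study", "tumor biopsy results"]
blogs = ["my biopsy day was scary", "waiting for biopsy results"]

a, b = term_counts(clinical), term_counts(blogs)
# Terms present in one corpus and absent from the other are candidates
# for corpus-specific index terms in the next batch.
clinical_only = sorted(t for t in a if t not in b)
print(clinical_only)  # ['growth', 'inhibition', 'study', 'tumor']
```

Real mixed-content indexing needs stop lists, stemming, and weighting (TF-IDF or better); this toy version only illustrates why the two corpora demand different vocabularies.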
You might navigate to “Selecting Forecasting Methods in Data Science.” The write up does a good job of outlining some of the numerical recipes taught in university courses and discussed in textbooks. For example, you can get an overview in this nifty graphic:
And you can review outputs from the different methods identified like this:
What’s missing? For the person floundering away like an employee at one government agency where I worked years ago, you pick the trend line you want. Then you try to plug in the numbers and generate some useful data. If that is too tough, you hire your friendly GSA schedule consultant to do the work for you. Yep, that’s how I ended up looking at:
- Manually selected data
- Lousy controls
- Outputs from different systems
- Misindexed text
- Entities which were not really entities
- A confused government employee.
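For what it is worth, the “pick a trend line and plug in the numbers” step is the easy part. A toy least-squares fit, with invented data points, shows how little arithmetic is involved; the items listed above are where projects actually go wrong:

```python
# Toy sketch: "pick a trend line and plug in the numbers" as a
# least-squares straight-line fit. The data points are invented.
def linear_fit(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx  # slope, intercept

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = linear_fit(xs, ys)
forecast = slope * 6 + intercept  # "forecast" for the next period
```

Ten lines of arithmetic; none of it tells you whether the inputs were manually selected, misindexed, or drawn from incompatible systems.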
Here’s the takeaway. Just because software is available to output stuff in a log file and Excel makes it easy to wrangle most of the data into rows and columns, none of the information may be useful, valid, or even in the same ball game.
When one then applies different forecasting methods without understanding them, we have an example of how an individual can create a pretty exciting data analysis.
Descriptions of algorithms do not correlate with high value outputs. Data quality, sampling, understanding why curves are “different”, and other annoying details don’t fit into some busy professionals’ work lives.
Stephen E Arnold, February 27, 2017
February 24, 2017
Intellisophic identifies itself as a Linkapedia company. Poking around Linkapedia’s ownership revealed some interesting factoids:
- Linkapedia is funded in part by GITP Ventures and SEMMX (possibly a Semper fund)
- The company operates in Hawaii and Pennsylvania
- One of the founders is a monk / Zen master. (Calm is a useful characteristic when trying to spin money from a search machine.)
First, Intellisophic. The company describes itself this way at this link:
Intellisophic is the world’s largest provider of taxonomic content. Unlike other methods for taxonomy development that are limited by the expense of corporate librarians and subject matter experts, Intellisophic content is machine developed, leveraging knowledge from respected reference works. The taxonomies are unbounded by subject coverage and cost significantly less to create. The taxonomy library covers five million topic areas defined by hundreds of millions of terms. Our taxonomy library is constantly growing with the addition of new titles and publishing partners.
In addition, Intellisophic’s technology—Orthogonal Corpus Indexing—can identify concepts in large collections of text. The system can be used to enrich existing technology, business intelligence, and search. One angle Intellisophic exploits is its use of reference and educational books. The company is in the “content intelligence” market.
The company is described this way in Crunchbase:
Linkapedia is an interest based advertising platform that enables publishers and advertisers to monetize their traffic, and distribute their content to engaged audiences. As opposed to a plain search engine which delivers what users already know, Linkapedia’s AI algorithms understand the interests of users and helps them discover something new they may like even if they don’t already know to look for it. With Linkapedia content marketers can now add Discovery as a new powerful marketing channel like Search and Social.
Like other search related services, Linkapedia uses smart software. Crunchbase states:
What makes Linkapedia stand out is its AI discovery engine that understands every facet of human knowledge. “There’s always something for you on Linkapedia”. The way the platform works is simple: people discover information by exploring a knowledge directory (map) to find what interests them. Our algorithms show content and native ads precisely tailored to their interests. Linkapedia currently has hundreds of million interest headlines or posts from the worlds most popular sources. The significance of a post is that “someone thought something related to your interest was good enough to be saved or shared at a later time.” The potential of a post is that it is extremely specific to user interests and has been extracted from recognized authorities on millions of topics.
Interesting. Search positioned as indexing, discovery, social, and advertising.
Stephen E Arnold, February 24, 2017
February 22, 2017
One of the Beyond Search goslings noticed a repositioning of the taxonomy capabilities of Mondeca. Instead of pitching indexing, the company has embraced Elasticsearch (based on Lucene) and Solr. The idea is that if an organization is using either of these systems for search and retrieval, Mondeca can provide “augmented” indexing. Keywords, the pitch goes, are not enough; Mondeca can index the content using concepts.
Of course, the approach is semantic, permits exploration, and enables content discovery. Mondeca’s Web site describes search as “find” and explains:
Initial results are refined, annotated and easy to explore. Sorted by relevancy, important terms are highlighted: easy to decide which one are relevant. Sophisticated facet based filters. Refining results set: more like this, this one, statistical and semantic methods, more like these: graph based activation ranking. Suggestions to help refine results set: new queries based on inferred or combined tags. Related searches and queries.
This is a similar marketing move to the one that Intrafind, a German search vendor, implemented several years ago. Mondeca continues to offer its taxonomy management system. Human subject matter experts do have a role in the world of indexing. Like other taxonomy systems and services vendors, the hook is that content indexed with concepts is smart. I love it when indexing makes content intelligent.
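Stripped of marketing, the concept hook amounts to attaching extra index fields to each document before it reaches Elasticsearch or Solr. Here is a hypothetical sketch; the field names and term-to-concept mapping are invented, and this is not Mondeca's actual pipeline:

```python
# Invented term-to-concept mapping; a real system would draw on a
# curated taxonomy maintained by subject matter experts.
TERM_TO_CONCEPT = {
    "invoice": "finance/accounts-payable",
    "purchase order": "finance/procurement",
}

def enrich(doc):
    """Add a 'concepts' field so search can match concepts, not just keywords."""
    text = doc["body"].lower()
    doc["concepts"] = sorted(c for t, c in TERM_TO_CONCEPT.items() if t in text)
    return doc

doc = enrich({"id": "42", "body": "Please approve the purchase order."})
print(doc["concepts"])  # ['finance/procurement']
```

The enriched document is then indexed as usual; queries can filter or facet on the extra field. The hard, expensive part is building and maintaining the mapping, not the plumbing.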
The buzzword is used by outfits ranging from MarkLogic’s merry band of XML and XQuery professionals to the library-centric outfits like Smartlogic. Isn’t smart logic better than logic?
Stephen E Arnold, February 22, 2017
February 15, 2017
The article on Smartlogic titled “The Future Is Happening Now” puts forth the Semaphore platform as the technology filling the gap between NLP and AI when it comes to conversation. The article posits that in spite of the great strides in AI in the past 20 years, human speech is one area where AI still falls short.
The reason for this, according to the article, is that “words often have meaning based on context and the appearance of the letters and words.” It is not enough to be able to identify a concept represented by a bunch of letters strung together. Many rules need to be put in place that affect the meaning of a word: its placement in a sentence, the grammar, and the words around it. All of these things are important.
Advocating human developed rules for indexing is certainly interesting, and the author compares this logic to the process of raising her children to be multilingual. Semaphore is a model-driven, rules-based platform that auto-generates usage rules in order to expand the guidelines for a machine as it learns. The issue here is cost. Indexing large amounts of data is extremely cost-prohibitive, and that is before the maintenance of the rules even becomes part of the equation. In sum, this is a very old school approach to AI that may make many people uncomfortable.
Chelsea Kerwin, February 15, 2017
February 8, 2017
I love the Phoenix like behavior of search and content processing subsystems. Consider semantics or figuring out what something is about and assigning an index term to that aboutness. Semantics is not new, and it is not an end in itself. Semantic functions are one of the many Lego blocks which make up a functioning and hopefully semi accurate content processing and information accessing system.
I read “With Better Scaling, Semantic Technology Knocks on Enterprise’s Door.” The headline encapsulates decades of frustration for the champions of semantic solutions. The early bird vendors fouled the nest for later arrivals. As a result, nifty semantic technology makes a sales call and finds that those who bother to attend the presentation are [a] skeptical, [b] indifferent, [c] clueless, [d] unwilling to spend money for another career killer. Pick your answer.
For decades, yes, decades, enterprise search and content processing vendors have said whatever was necessary to close a deal. The operative concept was that the vendor could whip up a solution and everything would come up roses. Well, fly that idea by those who licensed Convera for video search, Fast Search for an intelligent system, or any of the other train wrecks that lie along the information railroad tracks.
This write up happily ignores the past and bets that “better” technology will make semantic functions accurate, easy, low cost, and just plain wonderful. Yep, the Garden of Semantics exists as long as the licensee has the knowledge, money, time, and personnel to deliver the farm fresh produce.
I noted this passage:
… semantics standards came out 15 or more years ago, but scalability has been an inhibitor. Now, the graph technology has taken off. Most of what people have been looking at it for is [online transactional processing]. Our focus has been on [online analytical processing] — using graph technology for analytics. What held graph technology back from doing analytics was the scaling problem. There was promise and hype over those years, but, at every turn, the scale just wasn’t there. You could see amazing things in miniature, but enterprises couldn’t see them at scale. In effect, we have taken our query technology and applied MPP technology to it. Now, we are seeing tremendous scales of data.
Yep, how much does it cost to shove Big Data through a computationally intensive semantic system? Ask a company licensing one of the industrial strength systems like Gotham or NetReveal.
Make sure you have a checkbook with a SPARQL enhanced cover and a matching pen with which to write checks appropriate to semantic processing of large flows of content. Some outfits can do this work and do it well. In my experience, most outfits cannot afford to tackle the job.
That’s why semantic chatter is interesting but often disappointing to those who chomp the semantic apple from the hot house of great search stuff. Don’t forget to gobble some cognitive chilies too.
Stephen E Arnold, February 8, 2017
January 23, 2017
Yep, indexing is back. The cacaphone “ontology” is the next big thing yet again. Folks, an ontology is a form of metadata. There are key words, categories, and classifications. Whipping these puppies into shape has been the thankless task of specialists for hundreds if not thousands of years. “What Is an Ontology and Why Do I Want One?” tries to make indexing more alluring. When an enterprise search system delivers results which are off the user’s information need or just plain wrong, it is time for indexing. The problem is that machine based indexing requires some well informed humans to keep the system on point. Consider Palantir Gotham. Content finds its way into the system when a human performs certain tasks. Some of these tasks are riding herd on the indexing of the content object. IBM Analyst’s Notebook and many other next generation information access systems work hand in glove with expensive humans. Why? Smart software is still only sort of smart.
The write up dances around the need for spending money on indexing. The write up prefers to confuse a person who just wants to locate the answer to a business related question without pointing, clicking, and doing high school research paper dog work. I noted this passage:
Think of an ontology as another way to classify content (like a taxonomy) that allows you to identify what the content is about and how it relates to other types of content.
Okay, but enterprise search generally falls short of the mark for 55 to 70 percent of a search system’s users. This is a downer. What makes enterprise search better? An ontology. But without the cost and time metrics, the yap about better indexing ends up with “smart content” companies looking confused when their licenses are not renewed.
What I found amusing about the write up is the claim that use of an ontology improves search engine optimization. How about some hard data? Generalities are presented instead of numbers one can examine and attempt to verify.
SEO means getting found when a user runs a query. That does not work too well for general purpose Web search systems like Google. SEO is struggling to deal with declining traffic to many Web sites and the problem mobile search presents.
But in an organization, SEO is not what the user wants. The user needs the purchase order for a client and easy access to related data. Will an ontology deliver an actionable output? To be fair, different types of metadata are needed. An ontology is one such type, but there are others. Some can be extracted without too high an error rate when the content is processed; for example, telephone numbers. Other types of data require different processes, which can require knitting together different systems.
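Telephone numbers are a good example of low-error extraction because the pattern is rigid. Here is a toy sketch for North American style numbers; a production extractor needs many more patterns plus validation:

```python
import re

# Toy pattern for North American style phone numbers; far from exhaustive.
PHONE = re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b")

text = "Call 502-555-1212 or (212) 555-0100 about the purchase order."
print(PHONE.findall(text))  # ['502-555-1212', '(212) 555-0100']
```

Contrast this with extracting, say, a client name or a project code: no fixed pattern exists, which is why those entity types drag in the expensive humans and the knitted-together systems.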
To build a bubble gum card, one needs to parse a range of data, including images and content from a range of sources. In most organizations, silos of data persist and will continue to persist. Money is tight. Few commercial enterprises can afford to do the computationally intensive content processing under the watchful eye and informed mind of an indexing professional.
Cacaphones like “ontology” exacerbate the confusion about indexing and delivering useful outputs to users who don’t know a Boolean operator from a SQL expression.
Indexing is a useful term. Why not use it?
Stephen E Arnold, January 23, 2017
January 23, 2017
Recently I was speaking with someone and the conversation turned to libraries. I complimented the library collection in his hometown, and he asked, “You mean they still have a library?” This response told me a couple of things: one, that this person was not a reader, and two, that he did not know the value of a library. The Lucidea blog asked “Do The Original 5 Laws Of Library Science Hold Up In A Digital World?” and apparently they still do.
S.R. Ranganathan wrote his five principles of library science in 1931, before computers dominated information and research. The post examines how the laws are still relevant. The first law states that books are meant to be used, meaning that information is meant to be used and shared. The biggest point of this rule is accessibility, which is extremely relevant. The second law states, “Every reader his/her book,” meaning that libraries serve diverse groups and deliver non-biased services. That still fits considering the expansion of knowledge dissemination and how many people access it.
The third law is also still important:
Dr. Ranganathan believed that a library system must devise and offer many methods to “ensure that each item finds its appropriate reader”. The third law, “every book his/her reader,” can be interpreted to mean that every knowledge resource is useful to an individual or individuals, no matter how specialized and no matter how small the audience may be. Library science was, and arguably still is, at the forefront of using computers to make information accessible.
The fourth law is “save time for the reader” and it refers to being able to find and access information quickly and easily. Search engines anyone? Finally, the fifth law states that “the library is a growing organism.” It is easy to interpret this law. As technology and information access changes, the library must constantly evolve to serve people and help them harness the information.
The wording is a little outdated, but the five laws are still important. However, we also need to consider how people have changed in the way they use the library.
Whitney Grace, January 23, 2017
December 4, 2016
I read “Google, Democracy and the Truth about Internet Search.” One more example of a person who thinks he or she is an excellent information hunter and gatherer. Let’s be candid. A hunter gatherer flailing away for 15 or so years using online research tools, libraries, and conversations with actual humans should be able to differentiate a bunny rabbit from a female wolf with baby wolves at her feet.
Natural selection works differently in the hunting and gathering world of online. The intrepid knowledge warrior can make basic mistakes, use assumptions without consequence, and accept whatever a FREE online service delivers. No natural selection operates.
A “real” journalist discovers the basics of online search’s power. Great insight, just 50 years from the time online search became available to this moment of insight in December 2016. Slow on the trigger or just clueless?
That’s scary. When the 21st century hunter gatherer seems to have a moment of inspiration and realizes that online services—particularly ad supported free services—crank out baloney, it’s frightening. The write up makes clear that a “real” journalist seems to have figured out that online outputs are not exactly the same as sitting at a table with several experts and discussing an issue. Online is not the same as going to a library, reading books and journal articles, and thinking about what each source presents as actual factoids.
Here’s an example of the “understanding” one “real” journalist has about online information:
Google is knowledge. It’s where you go to find things out.
There you go. Reliance on one service to provide “knowledge.” From an ad supported. Free. Convenient. Ubiquitous. Online service.
Yep, that’s the way to keep track of “knowledge.”