Forecasting Methods: Detail without Informed Guidance

February 27, 2017

Let’s create a scenario. You are a person trying to figure out how to index a chunk of content. You are working with cancer information sucked down from PubMed or a similar source. You run an extraction process and push the text through an indexing system. You use a system like Leximancer and look at the results. Hmmm.

Next you take a corpus of blog posts dealing with medical information. You suck down the content and run it through your extractor, your indexing system, and your Leximancer set up. You look at the results. Hmmm.

How do you figure out what terms are going to be important for your next batch of mixed content?

You might navigate to “Selecting Forecasting Methods in Data Science.” The write up does a good job of outlining some of the numerical recipes taught in university courses and discussed in textbooks. For example, you can get an overview in this nifty graphic:

image

And you can review outputs from the different methods identified like this:

image

Useful.

What’s missing? If you are floundering away, like the employee at one government agency where I worked years ago, you pick the trend line you want. Then you try to plug in the numbers and generate some useful data. If that is too tough, you hire your friendly GSA schedule consultant to do the work for you. Yep, that’s how I ended up looking at:

  • Manually selected data
  • Lousy controls
  • Outputs from different systems
  • Misindexed text
  • Entities which were not really entities
  • A confused government employee.

Here’s the takeaway. Software is available to output stuff in a log file, and Excel makes it easy to wrangle most of the data into rows and columns. That does not mean any of the information is useful, valid, or even in the same ball game.

When one then applies different forecasting methods without understanding them, we have an example of how an individual can create a pretty exciting data analysis.

Descriptions of algorithms do not correlate with high value outputs. Data quality, sampling, understanding why curves are “different”, and other annoying details don’t fit into some busy work lives.
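To make the point concrete, here is a minimal sketch of the kind of control that was missing: score two forecasting methods against data held back from the fit before trusting either one. The series is invented for illustration, and the two methods (a least squares trend line and a "repeat the last value" baseline) are my assumptions, not recipes taken from the write up.

```python
from statistics import mean

def linear_trend_forecast(history, steps):
    """Fit y = a + b*t by ordinary least squares, then extrapolate."""
    n = len(history)
    t_mean = (n - 1) / 2
    y_mean = mean(history)
    slope_num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(history))
    slope_den = sum((t - t_mean) ** 2 for t in range(n))
    slope = slope_num / slope_den
    intercept = y_mean - slope * t_mean
    return [intercept + slope * (n + k) for k in range(steps)]

def naive_forecast(history, steps):
    """Baseline: repeat the last observed value."""
    return [history[-1]] * steps

def mean_absolute_error(actual, predicted):
    return mean(abs(a - p) for a, p in zip(actual, predicted))

# Hypothetical monthly counts for one index term (invented data).
series = [12, 15, 14, 18, 21, 20, 24, 27, 26, 30]
train, holdout = series[:7], series[7:]

# Score each method only on values it never saw during fitting.
scores = {
    "linear trend": mean_absolute_error(
        holdout, linear_trend_forecast(train, len(holdout))),
    "naive": mean_absolute_error(
        holdout, naive_forecast(train, len(holdout))),
}

for name, error in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: mean absolute error on holdout = {error:.2f}")
```

A lower holdout error says only that a method fit this sample better. If the sample was hand picked or misindexed, the number is still garbage.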

Stephen E Arnold, February 27, 2017

Intellisophic / Linkapedia

February 24, 2017

Intellisophic identifies itself as a Linkapedia company. Poking around Linkapedia’s ownership revealed some interesting factoids:

  • Linkapedia is funded in part by GITP Ventures and SEMMX (possibly a Semper fund)
  • The company operates in Hawaii and Pennsylvania
  • One of the founders is a monk / Zen master. (Calm is a useful characteristic when trying to spin money from a search machine.)

First, Intellisophic. The company describes itself this way at this link:

Intellisophic is the world’s largest provider of taxonomic content. Unlike other methods for taxonomy development that are limited by the expense of corporate librarians and subject matter experts, Intellisophic content is machine developed, leveraging knowledge from respected reference works. The taxonomies are unbounded by subject coverage and cost significantly less to create. The taxonomy library covers five million topic areas defined by hundreds of millions of terms. Our taxonomy library is constantly growing with the addition of new titles and publishing partners.

In addition, Intellisophic’s technology, Orthogonal Corpus Indexing, can identify concepts in large collections of text. The system can be used to enrich an existing technology, business intelligence, and search. One angle Intellisophic exploits is its use of reference and educational books. The company is in the “content intelligence” market.

Second, the “parent” of Intellisophic is Linkapedia. This public facing Web site allows a user to run a query and see factoids and links about a topic. Plus, Linkapedia has specialist collections of content bundles; for example, lifestyle, pets, and spirituality. I did some clicking around and found that certain topics were not populated; for instance, Lifestyle, Cars, and Brands. No brand information appeared for me. I stumbled into a lengthy explanation of the privacy policy related to a mathematics discussion group. I backtracked, trying to get access to the actual group, and failed. I think the idea is an interesting one, but more work is needed. My test query for “enterprise search” presented links to Convera and a number of obscure search related Web sites.

The company is described this way in Crunchbase:

Linkapedia is an interest based advertising platform that enables publishers and advertisers to monetize their traffic, and distribute their content to engaged audiences. As opposed to a plain search engine which delivers what users already know, Linkapedia’s AI algorithms understand the interests of users and helps them discover something new they may like even if they don’t already know to look for it. With Linkapedia content marketers can now add Discovery as a new powerful marketing channel like Search and Social.

Like other search related services, Linkapedia uses smart software. Crunchbase states:

What makes Linkapedia stand out is its AI discovery engine that understands every facet of human knowledge. “There’s always something for you on Linkapedia”. The way the platform works is simple: people discover information by exploring a knowledge directory (map) to find what interests them. Our algorithms show content and native ads precisely tailored to their interests. Linkapedia currently has hundreds of million interest headlines or posts from the worlds most popular sources. The significance of a post is that “someone thought something related to your interest was good enough to be saved or shared at a later time.” The potential of a post is that it is extremely specific to user interests and has been extracted from recognized authorities on millions of topics.

Interesting. Search positioned as indexing, discovery, social, and advertising.

Stephen E Arnold, February 24, 2017

Mondeca: Tweaking Its Market Position

February 22, 2017

One of the Beyond Search goslings noticed a repositioning of the taxonomy capabilities of Mondeca. Instead of pitching indexing, the company has embraced Elasticsearch (based on Lucene) and Solr. The idea is that if an organization is using either of these systems for search and retrieval, Mondeca can provide “augmented” indexing. Keywords are not enough, the pitch goes; Mondeca can index the content using concepts.
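A minimal sketch of what concept indexing adds to keyword indexing follows. This is not Mondeca's method or API. The tiny taxonomy, the `augment` function, and the sample sentences are all hypothetical, invented to show two documents that share no keywords but do share concepts.

```python
import re

# Hypothetical mini-taxonomy: surface term -> concept label (invented).
TAXONOMY = {
    "tumor": "Neoplasm",
    "tumour": "Neoplasm",
    "carcinoma": "Neoplasm",
    "chemo": "Chemotherapy",
    "chemotherapy": "Chemotherapy",
}

def augment(doc_text):
    """Return a record holding plain keyword tokens plus concept tags."""
    tokens = re.findall(r"[a-z]+", doc_text.lower())
    concepts = sorted({TAXONOMY[t] for t in tokens if t in TAXONOMY})
    return {"text": doc_text, "keywords": tokens, "concepts": concepts}

doc_a = augment("Patient began chemo after the tumour was found.")
doc_b = augment("Carcinoma treated with chemotherapy.")

# Keyword overlap is empty, so a keyword query for one misses the other.
shared_keywords = set(doc_a["keywords"]) & set(doc_b["keywords"])
# Concept overlap is not empty, so a concept query finds both.
shared_concepts = set(doc_a["concepts"]) & set(doc_b["concepts"])

print("shared keywords:", sorted(shared_keywords))
print("shared concepts:", sorted(shared_concepts))
```

In a real deployment the `concepts` list would presumably go into an extra field of the Elasticsearch or Solr document. The hard part is building and maintaining the taxonomy, which is where the humans and the money come in.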

Of course, the approach is semantic, permits exploration, and enables content discovery. Mondeca’s Web site describes search as “find” and explains:

Initial results are refined, annotated and easy to explore. Sorted by relevancy, important terms are highlighted: easy to decide which one are relevant. Sophisticated facet based filters. Refining results set: more like this, this one, statistical and semantic methods, more like these: graph based activation ranking. Suggestions to help refine results set: new queries based on inferred or combined tags. Related searches and queries.

This is a similar marketing move to the one that Intrafind, a German search vendor, implemented several years ago. Mondeca continues to offer its taxonomy management system. Human subject matter experts do have a role in the world of indexing. Like other taxonomy systems and services vendors, the hook is that content indexed with concepts is smart. I love it when indexing makes content intelligent.

The buzzword is used by outfits ranging from MarkLogic’s merry band of XML and XQuery professionals to the library-centric outfits like Smartlogic. Isn’t smart logic better than logic?

Stephen E Arnold, February 22, 2017

The Pros and Cons of Human Developed Rules for Indexing Metadata

February 15, 2017

The article on Smartlogic titled The Future Is Happening Now puts forth the Semaphore platform as the technology filling the gap between NLP and AI when it comes to conversation. The article posits that in spite of the great strides in AI in the past 20 years, human speech is one area where AI still falls short.

The reason for this, according to the article, is that “words often have meaning based on context and the appearance of the letters and words.” It’s not enough to be able to identify a concept represented by a bunch of letters strung together. There are many rules that need to be put in place that affect the meaning of the word; from its placement in a sentence, to grammar and to the words around – all of these things are important.

Advocating human developed rules for indexing is certainly interesting, and the author compares this logic to the process of raising her children to be multi-lingual. Semaphore is a model-driven, rules-based platform that allows us to auto-generate usage rules in order to expand the guidelines for a machine as it learns. The issue here is cost. Indexing large amounts of data is extremely cost-prohibitive, and that is before the maintenance of the rules even becomes part of the equation. In sum, this is a very old school approach to AI that may make many people uncomfortable.

Chelsea Kerwin, February 15, 2017

Semantics: Biting the Semantic Apple in the Garden of Search Subsystems

February 8, 2017

I love the Phoenix like behavior of search and content processing subsystems. Consider semantics or figuring out what something is about and assigning an index term to that aboutness. Semantics is not new, and it is not an end in itself. Semantic functions are one of the many Lego blocks which make up a functioning and hopefully semi accurate content processing and information accessing system.

I read “With Better Scaling, Semantic Technology Knocks on Enterprise’s Door.” The headline encapsulates decades of frustration for the champions of semantic solutions. The early bird vendors fouled the nest for later arrivals. As a result, nifty semantic technology makes a sales call and finds that those who bother to attend the presentation are [a] skeptical, [b] indifferent, [c] clueless, [d] unwilling to spend money for another career killer. Pick your answer.

For decades, yes, decades, enterprise search and content processing vendors have said whatever was necessary to close a deal. The operative concept was that the vendor could whip up a solution and everything would come up roses. Well, fly that idea by those who licensed Convera for video search, Fast Search for an intelligent system, or any of the other train wrecks that lie along the information railroad tracks.

This write up happily ignores the past and bets that “better” technology will make semantic functions accurate, easy, low cost, and just plain wonderful. Yep, the Garden of Semantics exists as long as the licensee has the knowledge, money, time, and personnel to deliver the farm fresh produce.

I noted this passage:

… semantics standards came out 15 or more years ago, but scalability has been an inhibitor. Now, the graph technology has taken off. Most of what people have been looking at it for is [online transactional processing]. Our focus has been on [online analytical processing] — using graph technology for analytics. What held graph technology back from doing analytics was the scaling problem. There was promise and hype over those years, but, at every turn, the scale just wasn’t there. You could see amazing things in miniature, but enterprises couldn’t see them at scale. In effect, we have taken our query technology and applied MPP technology to it. Now, we are seeing tremendous scales of data.

Yep, how much does it cost to shove Big Data through a computationally intensive semantic system? Ask a company licensing one of the industrial strength systems like Gotham or NetReveal.

Make sure you have a checkbook with a SPARQL enhanced cover and a matching pen with which to write checks appropriate to semantic processing of large flows of content. Some outfits can do this work and do it well. In my experience, most outfits cannot afford to tackle the job.

That’s why semantic chatter is interesting but often disappointing to those who chomp the semantic apple from the hot house of great search stuff. Don’t forget to gobble some cognitive chilies too.

Stephen E Arnold, February 8, 2017

Indexing: The Big Wheel Keeps on Turning

January 23, 2017

Yep, indexing is back. The cacaphone “ontology” is the next big thing yet again. Folks, an ontology is a form of metadata. There are key words, categories, and classifications. Whipping these puppies into shape has been the thankless task of specialists for hundreds if not thousands of years. “What Is an Ontology and Why Do I Want One?” tries to make indexing more alluring. When an enterprise search system delivers results which are off the user’s information need or just plain wrong, it is time for indexing. The problem is that machine based indexing requires some well informed humans to keep the system on point. Consider Palantir Gotham. Content finds its way into the system when a human performs certain tasks. Some of these tasks are riding herd on the indexing of the content object. IBM Analyst’s Notebook and many other next generation information access systems work hand in glove with expensive humans. Why? Smart software is still only sort of smart.

The write up dances around the need for spending money on indexing. The write up prefers to confuse a person who just wants to locate the answer to a business related question without pointing, clicking, and doing high school research paper dog work. I noted this passage:

Think of an ontology as another way to classify content (like a taxonomy) that allows you to identify what the content is about and how it relates to other types of content.

Okay, but enterprise search generally falls short of the mark for 55 to 70 percent of a search system’s users. This is a downer. What makes enterprise search better? An ontology. But without the cost and time metrics, the yap about better indexing ends up with “smart content” companies looking confused when their licenses are not renewed.

What I found amusing about the write up is the claim that use of an ontology improves search engine optimization. How about some hard data? Generalities are presented instead of numbers one can examine and attempt to verify.

SEO means getting found when a user runs a query. That does not work too well for general purpose Web search systems like Google. SEO is struggling to deal with declining traffic to many Web sites and the problem mobile search presents.

But in an organization, SEO is not what the user wants. The user needs the purchase order for a client and easy access to related data. Will an ontology deliver an actionable output? To be fair, different types of metadata are needed. An ontology is one such type, but there are others. Some of these can be extracted without too high an error rate when the content is processed; for example, telephone numbers. Other types of data require different processes which can require knitting together different systems.
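Telephone numbers are the easy case because a pattern does most of the work. Here is a minimal sketch, assuming North American style numbers only; the pattern, the `extract_phones` helper, and the sample sentence (with fictional 555 numbers) are mine, not drawn from any product mentioned here.

```python
import re

# Assumption: 3-3-4 digit North American numbers, optional parentheses
# around the area code, with "-", "." or a space as separators.
PHONE = re.compile(r"\(?\b(\d{3})\)?[-. ]?(\d{3})[-. ](\d{4})\b")

def extract_phones(text):
    """Return phone numbers found in text, normalized to NNN-NNN-NNNN."""
    return ["-".join(match.groups()) for match in PHONE.finditer(text)]

sample = "Call (502) 555-0147 or 502.555.0199 about PO 2017-1234."
print(extract_phones(sample))
```

The purchase order number in the sample is skipped because it does not fit the 3-3-4 shape. International formats, extensions, and typos would need more rules, which is how the error rate creeps up.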

To build a bubble gum card, one needs to parse a range of data, including images and content from a range of sources. In most organizations, silos of data persist and will continue to persist. Money is tight. Few commercial enterprises can afford to do the computationally intensive content processing under the watchful eye and informed mind of an indexing professional.

Cacaphones like “ontology” exacerbate the confusion about indexing and delivering useful outputs to users who don’t know a Boolean operator from a SQL expression.

Indexing is a useful term. Why not use it?

Stephen E Arnold, January 23, 2017

Obey the Almighty Library Laws

January 23, 2017

Recently I was speaking with someone and the conversation turned to libraries. I complimented the library’s collection in his hometown and he asked, “You mean they still have a library?” This response told me a couple of things: one, that this person was not a reader, and two, that he did not know the value of a library. The Lucidea blog asked, “Do The Original 5 Laws Of Library Science Hold Up In A Digital World?” Apparently they still do.

S.R. Ranganathan wrote five principles of library science in 1931, before computers dominated information and research. The post examines how the laws are still relevant. The first law states that books are meant to be used, meaning that information is meant to be used and shared. The biggest point of this rule is accessibility, which is extremely relevant. The second law states, “Every reader his/her book,” meaning that libraries serve diverse groups and deliver non-biased services. That still fits considering the expansion of knowledge dissemination and how many people access it.

The third law is also still important:

Dr. Ranganathan believed that a library system must devise and offer many methods to “ensure that each item finds its appropriate reader”. The third law, “every book his/her reader,” can be interpreted to mean that every knowledge resource is useful to an individual or individuals, no matter how specialized and no matter how small the audience may be. Library science was, and arguably still is, at the forefront of using computers to make information accessible.

The fourth law is “save time for the reader” and it refers to being able to find and access information quickly and easily. Search engines anyone? Finally, the fifth law states that “the library is a growing organism.” It is easy to interpret this law. As technology and information access change, the library must constantly evolve to serve people and help them harness the information.

The wording is a little outdated, but the five laws are still important. However, we also need to consider how people have changed in the way they use the library.

Whitney Grace, January 23, 2017

The Google: A Real Newspaper Discovers Modern Research

December 4, 2016

I read “Google, Democracy and the Truth about Internet Search.” One more example of a person who thinks he or she is an excellent information hunter and gatherer. Let’s be candid. A hunter gatherer flailing away for 15 or so years using online research tools, libraries, and conversations with actual humans should be able to differentiate a bunny rabbit from a female wolf with baby wolves at her feet.

Natural selection works differently in the hunting and gathering world of online. The intrepid knowledge warrior can make basic mistakes, use assumptions without consequence, and accept whatever a FREE online service delivers. No natural selection operates.

image

A “real” journalist discovers the basics of online search’s power. Great insight, just 50 years from the time online search became available to this moment of insight in December 2016. Slow on the trigger or just clueless?

That’s scary. When the 21st century hunter gatherer seems to have a moment of inspiration and realizes that online services, particularly ad supported free services, crank out baloney, it’s frightening. The write up makes clear that a “real” journalist seems to have figured out that online outputs are not exactly the same as sitting at a table with several experts and discussing an issue. Online is not the same as going to a library and reading books and journal articles, thinking about what each source presents as actual factoids.

Here’s an example of the “understanding” one “real” journalist has about online information:

Google is knowledge. It’s where you go to find things out.

There you go. Reliance on one service to provide “knowledge.” From an ad supported. Free. Convenient. Ubiquitous. Online service.

Yep, that’s the way to keep track of “knowledge.”

Google and Its Search Results: Objective or Subjective

December 1, 2016

I love the Alphabet Google thing. The information I obtain via a Google query is spot on, accurate, perfect, and highly credible. Run the query “dancing with the stars” and what do you get? Substance. Rock solid factoids.

I read “Google Search Results Tend to Have Liberal Bias That Could Influence Public Opinion.” The write up informed me:

After analyzing nearly 2,000 pages, a panel rated 31% pages as liberal as opposed to only 22% that were conservative; the remaining 47% pages were neutral that included government or mainstream news websites.

And the source of this information? An outfit called CanIRank.com. That sounds like a company that would make Ian Sharp sit up and take notice. Don’t remember Ian Sharp? Well, too bad. He founded IP Sharp Associates and had some useful insights about the subjective/objective issues in algorithms.

The methodology is interesting too:

The study conducted by online search marketer CanIRank.com found that 50 most recent searches for political terms on the search engine showed more liberal-leaning Web pages rather than conservative ones.

But the Google insists that its results are objective. Yet Google keeps its ranking method secret. The write up quotes a computer science professor as saying:

“No one really knows what Google’s search engine is doing,” said Christo Wilson, a Northeastern University computer science professor. “This is a big, complex system that’s been evolving for 15 years.”

Hmm. Evolving. I thought that the Google wraps its 1998 methods and just keeps on trucking. My hunch is that the wrappers which have been added by those trying to deal with the new content and new uses to which the mobile and desktop Web search systems are put are add ons. Think of the customization of a celebrity’s SUV. That’s how Google relevance has evolved. Cool, right?

The write up points out:

Google denies results are politically slanted and says its algorithms use several factors.

My hunch is that CanIRank.com is well meaning, but it may have some biases baked into its study. CanIRank.com, like the Google, is based on human choices. When humans fiddle, subjectivity enters the arena. For real objectivity, check out Google’s translation system which may have created its own inter-lingua. That’s objective as long as one does not try to translate a colloquial code phrase from a group of individuals seeking to secure their communications.

Subjective humans are needed for that task. Humans are subjective. So how does the logic flow? Oh, right. Google must be subjective. This is news? Ask Foundem.

Stephen E Arnold, December 1, 2016

Partnership Aims to Establish AI Conventions

October 24, 2016

Artificial intelligence research has been booming, and it is easy to see why: recent advances in the field have opened some exciting possibilities, both for business and society as a whole. Still, it is important to proceed carefully, given the potential dangers of relying too much on the judgement of algorithms. The Philadelphia Inquirer reports on a joint effort to develop some AI principles and best practices in its article, “Why This AI Partnership Could Bring Profits to These Tech Titans.” Writer Chiradeep BasuMallick explains:

Given this backdrop, the grandly named Partnership on AI to Benefit People and Society is a bold move by Alphabet, Facebook, IBM and Microsoft. These globally revered companies are literally creating a technology Justice League on a quest to shape public/government opinion on AI and to push for friendly policies regarding its application and general audience acceptability. And it should reward investors along the way.

The job at hand is very simple: Create a wave of goodwill for AI, talk about the best practices and then indirectly push for change. Remember, global laws are still obscure when it comes to AI and its impact.

Curiously enough, this elite team is missing two major heavyweights. Apple and Tesla Motors are notably absent. Apple Chief Executive Tim Cook, always secretive about AI work, though we know about the estimated $200 million  Turi project, is probably waiting for a more opportune moment. And Elon Musk, co-founder, chief executive and product architect of Tesla Motors, has his own platform to promote technology, called OpenAI.

Along with representatives of each participating company, the partnership also includes some independent experts in the AI field. To say that technology is advancing faster than the law can keep up with is a vast understatement. This ongoing imbalance underscores the urgency of this group’s mission to develop best practices for companies and recommendations for legislators. Their work could do a lot to shape the future of AI and, by extension, society itself. Stay tuned.

Cynthia Murrell, October 24, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
