Indexing: The Big Wheel Keeps on Turning

January 23, 2017

Yep, indexing is back. The cacaphone “ontology” is the next big thing yet again. Folks, an ontology is a form of metadata. There are key words, categories, and classifications. Whipping these puppies into shape has been the thankless task of specialists for hundreds if not thousands of years. “What Is an Ontology and Why Do I Want One?” tries to make indexing more alluring. When an enterprise search system delivers results which miss the user’s information need or are just plain wrong, it is time for indexing. The problem is that machine-based indexing requires some well-informed humans to keep the system on point. Consider Palantir Gotham. Content finds its way into the system when a human performs certain tasks. Some of these tasks are riding herd on the indexing of the content object. IBM Analyst’s Notebook and many other next-generation information access systems work hand in glove with expensive humans. Why? Smart software is still only sort of smart.

The write up dances around the need for spending money on indexing. It prefers to confuse a person who just wants to locate the answer to a business-related question without pointing, clicking, and doing high school research paper dog work. I noted this passage:

Think of an ontology as another way to classify content (like a taxonomy) that allows you to identify what the content is about and how it relates to other types of content.

Okay, but enterprise search generally falls short of the mark for 55 to 70 percent of a search system’s users. This is a downer. What makes enterprise search better? An ontology. But without the cost and time metrics, the yap about better indexing ends up with “smart content” companies looking confused when their licenses are not renewed.
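
To make the quoted distinction concrete, here is a minimal, purely illustrative Python sketch of my own, not from the write up: a taxonomy is little more than a parent-child hierarchy, while an ontology layers typed relationships on top, which is where the extra indexing work creeps in. The concept names are invented.

    # A taxonomy: a plain hierarchy of categories (child -> parent).
    taxonomy = {
        "invoice": "financial document",
        "purchase order": "financial document",
        "financial document": "document",
    }

    # An ontology: typed relationships between concepts, expressed here as
    # (subject, relationship, object) triples rather than a simple hierarchy.
    ontology = [
        ("purchase order", "is_a", "financial document"),
        ("purchase order", "references", "vendor"),
        ("purchase order", "authorized_by", "employee"),
        ("invoice", "settles", "purchase order"),
    ]

    def related(concept, triples):
        """Return every relationship in which a concept participates."""
        return [t for t in triples if concept in (t[0], t[2])]

    print(related("invoice", ontology))  # [('invoice', 'settles', 'purchase order')]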

What I found amusing about the write up is the claim that use of an ontology improves search engine optimization. How about some hard data? Generalities are presented instead of numbers one can examine and attempt to verify.

SEO means getting found when a user runs a query. That does not work too well for general-purpose Web search systems like Google. SEO is struggling to deal with declining traffic to many Web sites and the problem mobile search presents.

But in an organization, SEO is not what the user wants. The user needs the purchase order for a client and easy access to related data. Will an ontology deliver an actionable output? To be fair, different types of metadata are needed. An ontology is one such type, but there are others. Some of these can be extracted without too high an error rate when the content is processed; for example, telephone numbers. Other types of data require different processes, which may mean knitting together different systems.
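
As a concrete illustration of the telephone number example, here is a minimal sketch of my own, not from the write up, that pulls US-style phone numbers out of processed text with a regular expression. This is the easy kind of metadata; an ontology, or anything else requiring judgment, is where the expensive humans come back in. The pattern and sample text are invented for illustration.

    import re

    # Rough pattern for US-style numbers such as (502) 228-1966 or 502.228.1967.
    # A production pipeline needs locale-aware rules plus human review of the misses.
    PHONE_PATTERN = re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b")

    def extract_phone_numbers(text):
        """Return telephone number strings found in a processed content object."""
        return PHONE_PATTERN.findall(text)

    sample = "Call the client at (502) 228-1966 or fax 502.228.1967 before Friday."
    print(extract_phone_numbers(sample))  # ['(502) 228-1966', '502.228.1967']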

To build a bubble gum card, one needs to parse a range of data, including images and content from many different sources. In most organizations, silos of data persist and will continue to persist. Money is tight. Few commercial enterprises can afford to do the computationally intensive content processing under the watchful eye and informed mind of an indexing professional.

Cacaphones like “ontology” exacerbate the confusion about indexing and delivering useful outputs to users who don’t know a Boolean operator from a SQL expression.

Indexing is a useful term. Why not use it?

Stephen E Arnold, January 23, 2017

Obey the Almighty Library Laws

January 23, 2017

Recently I was speaking with someone and the conversation turned to libraries.  I complimented the library collection in his hometown, and he asked, “You mean they still have a library?” This response told me a couple of things: one, that this person was not a reader and two, that he did not know the value of a library.  The Lucidea blog asks “Do The Original 5 Laws Of Library Science Hold Up In A Digital World?” and, apparently, they still do.

S.R. Ranganathan wrote his five principles of library science in 1931, before computers dominated information and research.  The post examines how the laws are still relevant.  The first law states that books are meant to be used, meaning that information is meant to be used and shared.  The biggest point of this rule is accessibility, which is extremely relevant.  The second law states, “Every reader his/her book,” meaning that libraries serve diverse groups and deliver non-biased services.  That still fits, considering the expansion of knowledge dissemination and how many people access it.

The third law is also still important:

Dr. Ranganathan believed that a library system must devise and offer many methods to “ensure that each item finds its appropriate reader”. The third law, “every book his/her reader,” can be interpreted to mean that every knowledge resource is useful to an individual or individuals, no matter how specialized and no matter how small the audience may be. Library science was, and arguably still is, at the forefront of using computers to make information accessible.

The fourth law is “save time for the reader” and it refers to being able to find and access information quickly and easily.  Search engines anyone?  Finally, the fifth law states that “the library is a growing organism.”  It is easy to interpret this law.  As technology and information access changes, the library must constantly evolve to serve people and help them harness the information.

The wording is a little outdated, but the five laws are still important.  However, we also need to consider how people have changed with regard to using the library.

Whitney Grace, January 23, 2017

The Google: A Real Newspaper Discovers Modern Research

December 4, 2016

I read “Google, Democracy and the Truth about Internet Search.” One more example of a person who thinks he or she is an excellent information hunter and gatherer. Let’s be candid. A hunter-gatherer flailing away for 15 or so years using online research tools, libraries, and conversations with actual humans should be able to differentiate a bunny rabbit from a female wolf with baby wolves at her feet.

Natural selection works differently in the hunting and gathering world of online. The intrepid knowledge warrior can make basic mistakes, use assumptions without consequence, and accept whatever a FREE online service delivers. No natural selection operates.

[Image caption] A “real” journalist discovers the basics of online search’s power. Great insight, just 50 years from the time online search became available to this moment of insight in December 2016. Slow on the trigger or just clueless?

That’s scary. When the 21st-century hunter-gatherer seems to have a moment of inspiration and realizes that online services—particularly ad-supported free services—crank out baloney, it’s frightening. The write up makes clear that a “real” journalist seems to have figured out that online outputs are not exactly the same as sitting at a table with several experts and discussing an issue. Online is not the same as going to a library, reading books and journal articles, and thinking about what each source presents as actual factoids.

Here’s an example of the “understanding” one “real” journalist has about online information:

Google is knowledge. It’s where you go to find things out.

There you go. Reliance on one service to provide “knowledge.” From an ad supported. Free. Convenient. Ubiquitous. Online service.

Yep, that’s the way to keep track of “knowledge.”


Google and Its Search Results: Objective or Subjective

December 1, 2016

I love the Alphabet Google thing. The information I obtain via a Google query is spot on, accurate, perfect, and highly credible. Run the query “dancing with the stars” and what do you get? Substance. Rock solid factoids.

I read “Google Search Results Tend to Have Liberal Bias That Could Influence Public Opinion.” The write up informed me:

After analyzing nearly 2,000 pages, a panel rated 31% pages as liberal as opposed to only 22% that were conservative; the remaining 47% pages were neutral that included government or mainstream news websites.

And the source of this information? An outfit called CanIRank.com. That sounds like a company that would make Ian Sharp sit up and take notice. Don’t remember Ian Sharp? Well, too bad. He founded I.P. Sharp Associates and had some useful insights about the subjective/objective issues in algorithms.

The methodology is interesting too:

The study conducted by online search marketer CanIRank.com found that 50 most recent searches for political terms on the search engine showed more liberal-leaning Web pages rather than conservative ones.

But the Google insists that its results are objective. Google, however, keeps its ranking method secret. The write up quotes a computer science professor as saying:

“No one really knows what Google’s search engine is doing,” said Christo Wilson, a Northeastern University computer science professor. “This is a big, complex system that’s been evolving for 15 years.”

Hmm. Evolving. I thought that the Google wraps its 1998 methods and just keeps on trucking. My hunch is that the wrappers added by those trying to cope with new content and the new uses to which the mobile and desktop Web search systems are put are just add-ons. Think of the customization of a celebrity’s SUV. That’s how Google relevance has evolved. Cool, right?

The write up points out:

Google denies results are politically slanted and says its algorithms use several factors.

My hunch is that CanIRank.com is well meaning, but it may have some biases baked into its study. CanIRank.com, like the Google, is based on human choices. When humans fiddle, subjectivity enters the arena. For real objectivity, check out Google’s translation system, which may have created its own interlingua. That’s objective as long as one does not try to translate colloquial code phrases from a group of individuals seeking to secure their communications.

Subjective humans are needed for that task. Humans are subjective. So how does the logic flow? Oh, right. Google must be subjective. This is news? Ask Foundem.

Stephen E Arnold, December 1, 2016

Partnership Aims to Establish AI Conventions

October 24, 2016

Artificial intelligence research has been booming, and it is easy to see why: recent advances in the field have opened some exciting possibilities, both for business and for society as a whole. Still, it is important to proceed carefully, given the potential dangers of relying too much on the judgment of algorithms. The Philadelphia Inquirer reports on a joint effort to develop some AI principles and best practices in its article, “Why This AI Partnership Could Bring Profits to These Tech Titans.” Writer Chiradeep BasuMallick explains:

Given this backdrop, the grandly named Partnership on AI to Benefit People and Society is a bold move by Alphabet, Facebook, IBM and Microsoft. These globally revered companies are literally creating a technology Justice League on a quest to shape public/government opinion on AI and to push for friendly policies regarding its application and general audience acceptability. And it should reward investors along the way.

The job at hand is very simple: Create a wave of goodwill for AI, talk about the best practices and then indirectly push for change. Remember, global laws are still obscure when it comes to AI and its impact.

Curiously enough, this elite team is missing two major heavyweights. Apple and Tesla Motors are notably absent. Apple Chief Executive Tim Cook, always secretive about AI work, though we know about the estimated $200 million  Turi project, is probably waiting for a more opportune moment. And Elon Musk, co-founder, chief executive and product architect of Tesla Motors, has his own platform to promote technology, called OpenAI.

Along with representatives of each participating company, the partnership also includes some independent experts in the AI field. To say that technology is advancing faster than the law can keep up with is a vast understatement. This ongoing imbalance underscores the urgency of this group’s mission to develop best practices for companies and recommendations for legislators. Their work could do a lot to shape the future of AI and, by extension, society itself. Stay tuned.

Cynthia Murrell, October 24, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Semantiro and Ontocuro Basic

October 20, 2016

Quick update from the Australian content processing vendor SSAP, or Semantic Software Asia Pacific Limited. The company’s Semantiro platform now supports the new Ontocuro tool.

Semantiro is a platform which “promises the ability to enrich the semantics of data collected from disparate data sources, and enables a computer to understand its context and meaning,” according to “Semantic Software Announces Artificial Intelligence Offering.”

I learned:

Ontocuro is the first suite of core components to be released under the Semantiro platform. These bespoke components will allow users to safely prune unwanted concepts and axioms; validate existing, new or refined ontologies; and import, store and share these ontologies via the Library.

The company’s approach is to leapfrog the complex interfaces other indexing and data tagging tools impose on the user. The company’s Web site for Ontocuro is at this link.
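
SSAP has not published Ontocuro’s interface, so the snippet below is only a generic sketch of what “pruning unwanted concepts and axioms” from an ontology can look like, written against the open source rdflib library with invented class names; it is not the company’s API.

    from rdflib import Graph, Namespace
    from rdflib.namespace import RDF, RDFS, OWL

    # Invented namespace and classes; Ontocuro's actual interface is not public.
    EX = Namespace("http://example.org/ontology#")

    g = Graph()
    g.add((EX.PurchaseOrder, RDF.type, OWL.Class))
    g.add((EX.Invoice, RDF.type, OWL.Class))
    g.add((EX.Invoice, RDFS.subClassOf, EX.PurchaseOrder))
    g.add((EX.ObsoleteForm, RDF.type, OWL.Class))

    def prune_concept(graph, concept):
        """Remove a concept and every axiom that mentions it."""
        for triple in list(graph):
            if concept in triple:
                graph.remove(triple)

    prune_concept(g, EX.ObsoleteForm)
    print(len(g))  # 3 axioms remain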

Stephen E Arnold, October 20, 2016

Online and without Oomph: Social Content

October 15, 2016

I am surprised when Scientific American Magazine runs a story somewhat related to online information access. Navigate to read “The Bright Side of Internet Shaming.” The main point is that shaming has “become so common that it might soon begin to lose its impact.” Careful wording, of course. It is Scientific American, and the write up has few facts of the scientific ilk.

I highlighted this passage:

…these days public shamings are increasingly frequent. They’ve become a new kind of grisly entertainment, like a national reality show.

Yep, another opinion from Scientific American.

I then circled this passage in Hawthorne Scarlet A red:

there’s a certain kind of hope in the increasing regularity of shamings. As they become commonplace, maybe they’ll lose their ability to shock. The same kinds of ugly tweets have been repeated so many times, they’re starting to become boilerplate.

I don’t pay much attention to social media unless the data are part of a project. I have a tough time distinguishing misinformation, disinformation, and run of the mill information.

What’s the relationship to search? Locating “shaming”-type messages is difficult. Social media search engines don’t work particularly well. The half-hearted attempts at indexing are not consistent. No surprise in that because user-generated input is often uninformed input, particularly when it comes to indexing.

My thought is that Scientific American reflects shaming. The write up is not scientific. I would have found the article more interesting if it had included:

  • Data from analyses of tweets or Facebook posts built around negative or “shaming” words (a rough sketch of this kind of count appears after this list)
  • Facts about the increase or decrease in “shaming” language for some “boilerplate” words
  • A Palantir-type link analysis illustrating the centroids for one solid shaming example
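
For what it is worth, the first item on that list is not hard to rough out. The snippet below is a toy example of mine, with made-up posts and an arbitrary word list; a real analysis would need real data, a vetted lexicon, and a human to decide what the counts actually mean.

    import re
    from collections import Counter

    # Arbitrary, illustrative word list; a real study would use a vetted lexicon.
    SHAMING_WORDS = {"disgrace", "pathetic", "shameful", "loser", "embarrassing"}

    def shaming_counts(posts):
        """Count occurrences of 'shaming' vocabulary across a batch of posts."""
        counts = Counter()
        for post in posts:
            for word in re.findall(r"[a-z']+", post.lower()):
                if word in SHAMING_WORDS:
                    counts[word] += 1
        return counts

    posts = [
        "What a pathetic, shameful stunt.",
        "Total disgrace. Embarrassing for everyone involved.",
    ]
    print(shaming_counts(posts))  # Counter({'pathetic': 1, 'shameful': 1, ...})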

Scientific American has redefined science, it seems. Thus, a search for science might return a false drop for the magazine. I will skip the logic of the write up because the argument strikes me as subjective American thought.

Stephen E Arnold, October 15, 2016

Deindexing: A Thing?

October 12, 2016

There was the right to be forgotten. There were reputation management companies promising to scrub unwanted information from indexes using humans, lawyers (a different species, of course), and software agents.

Now I have learned that “dozens of suspicious court cases, with missing defendants, aim at getting web pages taken down or deindexed.” The write up asserts:

Google and various other Internet platforms have a policy: They won’t take down material (or, in Google’s case, remove it from Google indexes) just because someone says it’s defamatory. Understandable — why would these companies want to adjudicate such factual disputes? But if they see a court order that declares that some material is defamatory, they tend to take down or deindex the material, relying on the court’s decision.

Two thoughts:

  1. Have reputation management experts cooked up some new broth?
  2. How long will the lovely word “deindex” survive in the maelstrom of the information flow?

I love the idea of indexing content. Perhaps there is a new opportunity for innovation with the deindexing thing? Semantic deindexing? Structured deindexing? And my fave, unstructured deindexing in federated cloud-based data lakes. I wish I were 21 years old again. A new career beckons with declassification codes, delanguage processing, and even desmart software.

Stephen E Arnold, October 12, 2016

Hacking Federal Agencies Now Child’s Play

October 12, 2016

GovRat, a potentially dangerous piece of malware effective for cyber-espionage, is available on the Dark Web for as little as $1,000.

IBTimes recently published an article, “Malware used to target US Government and military being sold on Dark Web,” in which the author states:

The evolved version of GovRat, which builds on a piece of malware first exposed in November last year, can be used by hackers to infiltrate a victim’s computer, remotely steal files, upload malware, or compromise usernames and passwords.

The second version of this malware has already caused significant damage. Along with it, the seller is also willing to give away credentials to access US government servers and military groups.

Though the exact identity of the creator of GovRat 2.0 is unknown, the article states:

“Several of these individuals are known as professional hackers for hire,” Komarov explained. He cited one name as ROR [RG] – a notorious hacker who previously targeted Ashley Madison, AdultFriendFinder and the Turkish General Directorate of Security (EGM).

Data belonging to a large number of federal employees have already been compromised, and details such as email addresses, home addresses, login IDs, and hashed passwords are available to anyone who can pay the price.

InfoArmor, a cybersecurity and identity protection firm, unearthed this information while scanning Dark Web forums and has already passed the details on to the affected parties. Although the extent of the damage is unknown, the stolen information can be used to cause further harm.

Vishal Ingole, October 12, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Google: Algorithms Are Objective

July 17, 2016

I know that Google’s algorithms are tireless, objective numerical recipes. However, “Google: Downranking Online Piracy Sites in Search Results Has Led to a 89% Decrease in Traffic” sparked in my mind the notion that human intervention may be influencing some search result rankings. I highlighted these statements in the write up:

“Google does not proactively remove hyperlinks to any content unless first notified by copyright holders, but the tech giant says that it is now processing copyright removal notices in less than six hours on average…” I assume this work is performed by objective algorithms.

“…it is happy to demote links to pages that explicitly contain or link to content that infringes copyright.” Again, a machine process and, therefore, objective?

Human intervention in high-volume flows of information is often difficult. If Google is not using machine processes, perhaps the company is forced to group sites and then have humans make decisions.
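
Google does not disclose how its demotion signal works, so the following is only a sketch with invented domains, scores, and a made-up penalty factor. It illustrates the mechanical half of the process: applying a penalty is trivial once a site is flagged; deciding which sites get flagged is where the copyright holders’ notices, and the humans behind them, come in.

    # Invented domains, scores, and penalty; Google's actual signal is not public.
    FLAGGED_DOMAINS = {"pirate-example.com"}
    DEMOTION_FACTOR = 0.1  # a flagged site keeps only 10% of its relevance score

    def demote(results):
        """Re-rank (score, domain) pairs, penalizing any flagged domain."""
        rescored = [
            (round(score * DEMOTION_FACTOR, 3) if domain in FLAGGED_DOMAINS else score, domain)
            for score, domain in results
        ]
        return sorted(rescored, reverse=True)

    results = [(0.92, "pirate-example.com"), (0.85, "label-example.com"), (0.40, "blog-example.org")]
    print(demote(results))
    # [(0.85, 'label-example.com'), (0.4, 'blog-example.org'), (0.092, 'pirate-example.com')]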

Artificial intelligence, are you not up to the task?

Stephen E Arnold, July 17, 2016
