Info-Distortion: Suddenly People Understand

November 16, 2016

I have watched the flood of stories about misinformation, false news, popular online services’ statements about dealing with the issue, and denials that disinformation influences anything. Sigh.

I have refrained from commenting after reading write ups in the New York Times, assorted blogs, and wild and crazy posts on Reddit.

A handful of observations/factoids from rural Kentucky:

  • Detection of weaponized information is a non-trivial task
  • Online systems can be manipulated by exploiting the tendencies of very popular algorithms; most online search systems rely on workhorse algorithms that know their way to the barn, and their predictability makes manipulation easy
  • Textual information with certain specific attributes will usually pass undetected by humans, who then have to figure out a way to interrelate a sequence of messages distributed via different outlets (a minimal sketch of one such interrelation follows this list)
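
Here is a minimal sketch, assuming hypothetical outlets, messages, and a similarity threshold, of what interrelating messages across outlets might look like: it flags pairs of outlets carrying near-duplicate text. Real weaponized content is engineered to defeat exactly this sort of naive check, so treat it as an illustration, not a detector.

    from itertools import combinations

    def shingles(text, k=5):
        """Character k-grams of a normalized message; a crude textual fingerprint."""
        t = " ".join(text.lower().split())
        return {t[i:i + k] for i in range(max(len(t) - k + 1, 1))}

    def jaccard(a, b):
        """Overlap between two shingle sets: 0.0 (unrelated) to 1.0 (identical)."""
        return len(a & b) / len(a | b) if a | b else 0.0

    # Hypothetical messages pushed through different outlets.
    messages = {
        "outlet_a": "Officials quietly admit the shortage was planned months ago.",
        "outlet_b": "Officials have quietly admitted the shortage was planned months ago!",
        "outlet_c": "Local fair draws record crowds despite the rain.",
    }

    # Flag outlet pairs carrying suspiciously similar text.
    THRESHOLD = 0.5  # arbitrary; tuning this is part of the non-trivial detection task
    for (o1, m1), (o2, m2) in combinations(messages.items(), 2):
        score = jaccard(shingles(m1), shingles(m2))
        if score >= THRESHOLD:
            print(f"possible coordination: {o1} <-> {o2} (similarity {score:.2f})")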

There is some information about the method at my www.augmentext.com site. The flaws in “smart” indexing systems have been known for years and have been exploited by individual actors as well as nation states. Identifying and eliminating weaponized information will be an interesting challenge. Yep, I know a team of whiz kids figured out how to solve Facebook’s problem in a short period of time. I just don’t believe the approach applies to some of the methods in use by certain government actors. How do you know an “authority” is not a legend?

Stephen E Arnold, November 16, 2016

How Real Journalists Do Research

November 8, 2016

I read “Search & Owned Media Most Used by Journalists.” The highlight of the write up was a table created by Businesswire. The “Media Survey” revealed “Where the Media Look When Researching an Organization.” Businesswire is a news release outfit. Organizations pay to have a write up sent to “real” journalists.

Let’s look at the data in the write up.

The top five ways “real” journalists obtain information is summarized in the table below. I don’t know the sample size, the methodology, or the method of selecting the sample. My hunch is that the people responding have signed up for Businesswire information or have some other connection with the company.

Most Used Method                   Percent Using
Google                             89%
Organization Web site              88%
Organization’s online newsroom     75%
Social media postings              54%
Government records                 53%

Now what about the five least used methods for research:

Least Used Method                  Percent Using
Organization PR spokesperson       39%
News release boilerplate           33%
Bing                               8%
Yahoo                              7%
Other (sorry but no details)       6%

Now what about the research methods in between these two extremes of most and least used:

No Man’s Land Methods              Percent Using
Talk to humans                     51%
Trade publication Web sites        44%
Local newspapers                   43%
Wikipedia                          40%
Organization’s blog                39%

Several observations flapped across the minds of the goslings in Harrod’s Creek.

  1. Yahoo and Bing may want to reach out to “real” journalists and explain how darned good their search systems are for “real” information. If the data are accurate, Google is THE source for “real” journalists’ core or baseline information
  2. The popularity of social media and government information is a dead heat. I am not sure whether this means social media information is wonderful or if government information is not up to the standards of social media like Facebook or Twitter
  3. Talking to humans, which I assume was once the go-to method for gathering information, is useful to half the “real” journalists. This suggests that half of the “real” news churned out by “real” journalists may be second hand, recycled and transformed, or tough to verify. The notion of “good enough” enters at this point
  4. Love that Wikipedia because 40 percent of “real” journalists rely on it for some or maybe a significant portion of the information in a “real” news story.

It comes as no surprise that news releases creep into the results list via Google’s indexing of “real” news, the organization’s online newsroom, the organization’s tweets and Facebook posts, trade publications which are first class recyclers of news releases, and the organization’s blog.

Interesting. Echo chamber, filter bubble, disinformation—Do any of these terms resonate with you?

Stephen E Arnold, November 8, 2016

Entity Extraction: No Slam Dunk

November 7, 2016

There are differences among these three use cases for entity extraction:

  1. Operatives reviewing content for information about watched entities prior to an operation
  2. Identifying people, places, and things for a marketing analysis by a PowerPoint ranger
  3. Indexing Web content to add concepts to keyword indexing.

Regardless of your experience with software which identifies “proper nouns,” events, meaningful digits like license plate numbers, organizations, people, and locations (accepted and colloquial)—you will find the information in “Performance Comparison of 10 Linguistic APIs for Entity Recognition” thought provoking.

The write up identifies the systems which perform the best and the worst.

Here are the systems and the number of errors each generated in a test corpus. The “scores” are based on a test which contained 150 targets. The “best” system got more correct than incorrect. I find the results interesting but not definitive.

The five best performing systems on the test corpus were:

The five worst performing systems on the test corpus were:

There are some caveats to consider:

  1. Entity identification works quite well when the entities and their synonyms are part of the training set (a minimal gazetteer sketch follows this list)
  2. Multi-language entity extraction requires additional training set preparation. “Learn as you go” is often problematic when dealing with social messages, certain intercepted content, and colloquialisms
  3. Identification of content used as a code—for example, Harrod’s teddy bear for contraband—is difficult even for smart software operating with subject matter experts’ input. (Bad guys are often not stupid and understand the concept of using one word to refer to another thing based on context or previous interactions).
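
For caveat 1, here is a minimal sketch, assuming a hypothetical gazetteer, synonym list, and gold answer set, of dictionary style entity extraction and the kind of error counting such comparison tests rely on. Production systems layer statistical models on top of this sort of lookup.

    import re

    # Hypothetical gazetteer: canonical entities and their synonyms or aliases.
    GAZETTEER = {
        "International Business Machines": ["IBM", "Big Blue"],
        "New York City": ["NYC", "New York"],
    }

    def extract_entities(text):
        """Return the canonical name for every gazetteer term or synonym found in text."""
        found = set()
        for canonical, aliases in GAZETTEER.items():
            for term in [canonical] + aliases:
                if re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE):
                    found.add(canonical)
                    break
        return found

    def score(predicted, gold):
        """Count misses and false hits the way a simple comparison test might."""
        return {"missed": len(gold - predicted), "spurious": len(predicted - gold)}

    text = "Big Blue opened a new office in NYC last week."
    gold = {"International Business Machines", "New York City"}
    predicted = extract_entities(text)
    print(predicted, score(predicted, gold))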

Net net: Automated systems are essential. The error rates may be fine for some use cases and potentially dangerous for others.

Stephen E Arnold, November 7, 2016

BA Insight and Its Ideas for Enterprise Search Success

October 25, 2016

I read “Success Factors for Enterprise Search.” The write up spells out a checklist to make certain that an enterprise search system delivers what the users want—on point answers to their business information needs. The reason a checklist is necessary after more than 50 years of enterprise search adventures is a disconnect between what software can deliver and what the licensee and the users expect. Imagine figuring out how to get across the Grand Canyon only to encounter the Iguazu Falls.

The preamble states:

I’ll start with what absolutely does not work. The “dump it in the index and hope for the best” approach that I’ve seen some companies try, which just makes the problem worse. Increasing the size of the haystack won’t help you find a needle.

I think I agree, but the challenge is multiple piles of data. Some data are in haystacks; some are in odd ball piles from the AS/400 that the old guy in accounting uses for an inventory report.

Now the check list items:

  1. Metadata. To me, that’s indexing. Lousy indexing produces lousy search results in many cases. But “good indexing,” like the best pie at the state fair, is a matter of opinion. When the licensee, users, and the search vendor talk about indexing, some parties in the conversation don’t know indexing from oatmeal. The cost of indexing can be high. Improving the indexing requires more money. The magic of metadata often leads back to a discussion of why the system delivers off point results. Then there is talk about improving the indexing and its cost. The cycle can be more repetitive than a Kenmore 28132’s.
  2. Provide the content the user requires. Yep, that’s easy to say. Yep, and if it’s on a distributed network, content disappears or does not get fed into the search system. Putting the content into a repository creates another opportunity for spending money. Enterprise search which “federates” is easy to say, but users quickly discover what is missing from the index or stale. (A rough sketch of federation follows this list.)
  3. Deliver off point results. The results create work by not answering the user’s question. From the days of STAIRS III to the latest whiz kid solution from Sillycon Valley, users find that search and retrieval systems provide an opportunity to go back to traditional research tools such as asking the person in the next cube, calling a self-appointed expert, guessing, digging through paper documents, or hiring an information or intelligence professional to gather the needed information.
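
As a rough sketch of the federation point in item 2, assuming made-up connectors and result shapes rather than any vendor’s actual API: the query fans out to each source and the hits are merged by score. Security trimming, score normalization, and the repositories nobody connected are exactly what this glosses over.

    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical connectors; real ones would call SharePoint, a file share, the AS/400, etc.
    def search_intranet(query):
        return [{"source": "intranet", "title": "Holiday policy", "score": 0.8}]

    def search_file_share(query):
        return [{"source": "file_share", "title": "Inventory report (stale?)", "score": 0.6}]

    CONNECTORS = [search_intranet, search_file_share]

    def federated_search(query):
        """Fan the query out to every connector, then merge the hits by score.

        Anything a connector cannot reach simply never appears, which is how
        users discover what is missing from the index or stale.
        """
        results = []
        with ThreadPoolExecutor() as pool:
            for hits in pool.map(lambda connector: connector(query), CONNECTORS):
                results.extend(hits)
        return sorted(results, key=lambda hit: hit["score"], reverse=True)

    print(federated_search("inventory report"))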

The check list concludes with a good question, “Why is this happening?” The answer does not reside in the check list. The answer does not reside in my Enterprise Search Report, The Landscape of Search, or any of the journal and news articles I have written in the last 35 years.

The answer is that vendors directly or indirectly assure prospects that their software will provide the information a user needs. That’s an easy hook to plant in a customer who behaves like a tuna. The customer has a search system or experience with a search system that does not work. Pitching a better, faster, cheaper solution can close the deal.

The reality is that even the most sophisticated search and content processing systems end up in trouble. Search remains a very difficult problem. Today’s solutions do a few things better than STAIRS III did. But in the end, search software crashes and burns when it has to:

  • Work within a budget
  • Deal with structured and unstructured data
  • Meet user expectations for timeliness, precision, recall, and accuracy
  • Work without specialized training for users
  • Deliver zippy response time
  • Avoid crashes and downtime due to maintenance
  • Output usable, actionable reports without involving a programmer
  • Provide an answer to a question.

Smart software can solve some of these problems for specific types of queries. Enterprise search will benefit incrementally. For now, baloney about enterprise search continues to create churn. The incumbent loses the contract, and a new search vendor inks a deal. Months later, that vendor, now the incumbent, loses the contract, and the next round of vendors compete for it. This cycle has eroded the credibility of search and content processing vendors.

A check list with three items won’t do much to change the credibility gap between what vendors say, what licensees hope will occur, and what users expect. The Grand Canyon is a big hole to fill. The Iguazu Falls can be tough to cross. Same with enterprise search.

Stephen E Arnold, October 25, 2016

Semantiro and Ontocuro Basic

October 20, 2016

A quick update from the Australian content processing vendor SSAP, or Semantic Software Asia Pacific Limited: the company’s Semantiro platform now supports the new Ontocuro tool.

Semantiro is a platform which “promises the ability to enrich the semantics of data collected from disparate data sources, and enables a computer to understand its context and meaning,” according to “Semantic Software Announces Artificial Intelligence Offering.”

I learned:

Ontocuro is the first suite of core components to be released under the Semantiro platform. These bespoke components will allow users to safely prune unwanted concepts and axioms; validate existing, new or refined ontologies; and import, store and share these ontologies via the Library.

The company’s approach is to leapfrog the complex interfaces other indexing and data tagging tools impose on the user. The company’s Web site for Ontocuro is at this link.

Stephen E Arnold, October 20, 2016

Online and without Ooomph: Social Content

October 15, 2016

I am surprised when Scientific American Magazine runs a story somewhat related to online information access. Navigate to read “The Bright Side of Internet Shaming.” The main point is that shaming has “become so common that it might soon begin to lose its impact.” Careful wording, of course. It is Scientific American, and the write up has few facts of the scientific ilk.

I highlighted this passage:

…these days public shamings are increasingly frequent. They’ve become a new kind of grisly entertainment, like a national reality show.

Yep, another opinion from Scientific American.

I then circled this passage in Hawthorne scarlet A red:

there’s a certain kind of hope in the increasing regularity of shamings. As they become commonplace, maybe they’ll lose their ability to shock. The same kinds of ugly tweets have been repeated so many times, they’re starting to become boilerplate.

I don’t pay much attention to social media unless the data are part of a project. I have a tough time distinguishing misinformation, disinformation, and run of the mill information.

What’s the relationship to search? Locating “shaming” type messages is difficult. Social media search engines don’t work particularly well. The half-hearted attempts at indexing are not consistent. No surprise in that because user generated input is often uninformed input, particularly when it comes to indexing.

My thought is that Scientific American reflects shaming. The write up is not scientific. I would have found the article more interesting if it had included:

  • Data from analyses of tweets or Facebook posts containing negative or “shaming” words (a toy sketch of such a count follows this list)
  • Facts about the increase or decrease in “shaming” language for some “boilerplate” phrases
  • A Palantir-type link analysis illustrating the centroids for one solid shaming example.
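
A toy sketch of the first bullet, assuming a made-up lexicon of “shaming” terms and a few invented posts standing in for real tweet or Facebook data:

    import re
    from collections import Counter

    # Hypothetical lexicon of "shaming" terms and a few made-up posts.
    SHAMING_TERMS = {"disgrace", "shameful", "pathetic", "cancel"}

    posts = [
        "What a disgrace. Absolutely shameful behavior.",
        "Great turnout at the county fair this weekend!",
        "Pathetic response. Time to cancel them.",
    ]

    def shaming_counts(texts):
        """Tally shaming-lexicon hits per post and overall."""
        per_post, total = [], Counter()
        for text in texts:
            words = re.findall(r"[a-z']+", text.lower())
            hits = Counter(w for w in words if w in SHAMING_TERMS)
            per_post.append(sum(hits.values()))
            total.update(hits)
        return per_post, total

    print(shaming_counts(posts))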

Scientific American has redefined science, it seems. Thus, a search for science might return a false drop for the magazine. I will skip the logic of the write up because the argument strikes me as subjective American thought.

Stephen E Arnold, October 15, 2016

Definitions of Search to Die For. Maybe With?

October 13, 2016

I read “Search Terminology. Web Search, Enterprise Search, Real Time Search, Semantic Search.” I have included glossaries in some of my books about search. I did not realize that I could pluck out four definitions and present them as a stand alone article. Ah, the wonders of content marketing.

If you want to read the definitions with which one can die, either for or with, have at it. May I suggest that you consider these questions before perusing the content marketing write up thing:

Web search

  • What’s the method for indexing password protected sites and encrypted sites which exist under current Web technology?
  • Which Web search systems build their own indexes, and which send a query to multiple search systems and aggregate the results? Does the approach matter?
  • What is the freshness or staleness of Web indexes? Does it matter that one index may be a few minutes “old” and another index several weeks “old”?

Enterprise search

  • How does an enterprise search system deliver internal content pointers and external content pointers?
  • What is the consequence of an enterprise search user who accesses content which is incomplete or stale?
  • What does the enterprise search system do with third party content such as consultants’ reports which someone in the organization has purchased? Ignore? Re-license? Index the content and worry later?
  • What is the refresh cycle for changed and new content?
  • What is the search function for locating database content or rich media residing on the organization’s systems?

Real time search

  • What is real time? The indexing of content in the millisecond world of Wall Street? Indexing content when machine resources and network bandwidth permit?
  • How does a user determine the latency of the search system, given that marketers can write “real time” while programmers implement whatever index update options the search administrator selects? (A crude latency probe sketch follows this list.)
  • What search system indexes videos in real time? Even YouTube struggles with latency of 10 minutes or longer, and some videos require hours before the index points to them.
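
One crude way to answer the latency question above: submit a uniquely tagged marker document and poll until a query finds it. The submit and search hooks below are hypothetical placeholders, not any particular product’s API; the in-memory demo only shows the plumbing.

    import time
    import uuid

    def measure_index_latency(submit, search, poll_seconds=5, timeout_seconds=3600):
        """Submit a tagged document, then poll until a query finds it.

        submit(doc) and search(query) are hooks onto whatever system is being
        tested; the return value is seconds from submission to first hit.
        """
        marker = f"latency-probe-{uuid.uuid4()}"
        submit({"title": marker, "body": "timing probe"})
        start = time.time()
        while time.time() - start < timeout_seconds:
            if any(marker in hit.get("title", "") for hit in search(marker)):
                return time.time() - start
            time.sleep(poll_seconds)
        return None  # never showed up within the timeout; "real time" indeed

    # Tiny in-memory stand-in so the sketch runs end to end.
    fake_index = []
    latency = measure_index_latency(
        submit=fake_index.append,
        search=lambda q: [d for d in fake_index if q in d["title"]],
        poll_seconds=0,
    )
    print(f"latency: {latency:.3f} seconds")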

Semantic search

  • What is the role of human subject matter experts in semantic search?
  • What is the benefit of human-intermediated systems versus person-machine or automated smart indexing?
  • How does one address concept drift as a system “learns” from its indexing of information?
  • What happens to taxonomies, dictionary lists of entities, and other artifacts of concept indexing? (A minimal tagging sketch follows this list.)
  • What does a system do when encountering documents, audio, and videos in a language different from the language of the majority of a system’s users?
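
For the taxonomy question above, a minimal sketch assuming a hand-built taxonomy of concepts and trigger terms; it shows why dictionary lists of entities are both useful and fragile as vocabulary drifts.

    import re

    # Hypothetical taxonomy: concept -> trigger terms curated by subject matter experts.
    TAXONOMY = {
        "cybersecurity": {"malware", "phishing", "ransomware"},
        "logistics": {"freight", "warehouse", "shipment"},
    }

    def tag_concepts(text):
        """Assign every concept whose trigger terms appear in the document."""
        words = set(re.findall(r"[a-z]+", text.lower()))
        return sorted(concept for concept, terms in TAXONOMY.items() if words & terms)

    doc = "The shipment was delayed after a ransomware incident at the warehouse."
    print(tag_concepts(doc))  # ['cybersecurity', 'logistics']

    # Concept drift: once users start writing "smishing" or "drop shipping",
    # these hand-built term lists go stale unless someone keeps maintaining them.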

Get the idea that zippy, brief definitions cannot deliver Gatorade to the college football players studying in the dorm the night before a big game?

Stephen E Arnold, October 13, 2016

Five Years in Enterprise Search: 2011 to 2016

October 4, 2016

Before I shifted from worker bee to Kentucky dirt farmer, I attended a presentation in which a wizard from Findwise explained enterprise search in 2011. In my notes, I jotted down the companies the maven mentioned (love that alliteration) in his remarks:

  • Attivio
  • Autonomy
  • Coveo
  • Endeca
  • Exalead
  • Fabasoft
  • Google
  • IBM
  • ISYS Search
  • Microsoft
  • Sinequa
  • Vivisimo.

There were nodding heads as the guru listed the key functions of enterprise search systems in 2011. My notes contained these items:

  • Federation model
  • Indexing and connectivity
  • Interface flexibility
  • Management and analysis
  • Mobile support
  • Platform readiness
  • Relevance model
  • Security
  • Semantics and text analytics
  • Social and collaborative features

I recall that I was confused about the source of the information in the analysis. At the time, the murky family tree seemed important. Five years later, I am less interested in who sired what child than in the interesting historical nuggets in this simple list and collection of pretty fuzzy and downright crazy characteristics of search. I am not too sure what “analysis” and “analytics” mean. The notion that an index is required is okay, but the blending of indexing and “connectivity” seems a wonky way of referencing file filters or a network connection. With the Harvard Business Review pointing out that collaboration is a bit of a problem, it is an interesting footnote to acknowledge that a buzzword can grow into a time sink.


There are some notable omissions; for example, open source search options do not appear in the list. That’s interesting because Attivio was, I heard, poking its toe into open source search at that time. IBM was a fan of Lucene five years ago. Today the IBM marketing machine beats the Watson drum, but inside the Big Blue system resides that free and open source Lucene. I assume that the gurus and the mavens working on this list ignored open source because what consulting revenue results from free stuff? What happened to Oracle? In 2011, Oracle still believed in Secure Enterprise Search only to recant with purchases of Endeca, InQuira, and RightNow. There are other glitches in the list, but let’s move on.


Google and the Future of Search Engine Optimization

September 30, 2016

Regular readers know that we are not big fans of SEO (Search Engine Optimization) or its champions, so you will understand our tentative glee at the Fox News headline, “Is Google Trying to Kill SEO?” The article centers on a Florida court case whose plaintiff is e.ventures Worldwide LLC, accused by Google of engaging in “search-engine manipulation.” As it turns out, that term is a little murky. That did not stop Google from unilaterally de-indexing “hundreds” of e.ventures’ websites. Writer Dan Blacharski observes:

The larger question here is chilling to virtually any small business which seeks a higher ranking, since Google’s own definition of search engine manipulation is vague and unpredictable. According to a brief filed by e-ventures’ attorney Alexis Arena at Flaster Greenberg PC, ‘Under Google’s definition, any website owner that attempts to cause its website to rank higher, in any manner, could be guilty of ‘pure spam’ and blocked from Google’s search results, without explanation or redress. …

We cannot share Blacharski’s alarm at this turn of events. In our humble opinion, if websites focus on providing quality content, the rest will follow. The article goes on to examine Google’s first-amendment based stance, and considers whether SEO is even a legitimate strategy. See the article for its take on these considerations.

Cynthia Murrell, September 30, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Key Words and Semantic Annotation

September 27, 2016

I read “Exploiting Semantic Annotation of Content with Linked Data to Improve Searching Performance in Web Repositories.” The nub of the paper is, “Better together.” The idea is that key words work if one knows the subject and the terminology required to snag the desired information.


If not, then semantic indexing provides another path. If the conclusion seems obvious, consider that two paths are better than one for users. The researchers used Elasticsearch. (A rough sketch of combining the two paths appears below.) However, the real world issue is the cost of expertise and the computational cost and time required to add another path. You can download the journal paper at this link.
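
Since the researchers used Elasticsearch, here is a rough sketch of what “better together” can look like as a query: a bool clause that matches free-text keywords and boosts documents whose semantic annotations also match. The field names, boost value, and the concepts field are assumptions for illustration, not the paper’s actual setup.

    import json

    def better_together_query(user_terms, concept_uris):
        """Combine keyword matching with a boost for semantic annotations.

        user_terms is the raw query string; concept_uris are linked data
        identifiers that a concept-mapping step resolved from those terms.
        """
        return {
            "query": {
                "bool": {
                    "should": [
                        # Path 1: plain keyword match against the document text.
                        {"match": {"body": {"query": user_terms}}},
                        # Path 2: exact match on pre-assigned semantic annotations,
                        # weighted higher because annotations are less ambiguous.
                        {"terms": {"concepts": concept_uris, "boost": 2.0}},
                    ],
                    "minimum_should_match": 1,
                }
            }
        }

    # This JSON would be POSTed to the repository index's _search endpoint.
    print(json.dumps(
        better_together_query("jaguar habitat", ["http://dbpedia.org/resource/Jaguar"]),
        indent=2,
    ))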

Stephen E Arnold, September 27, 2016
