Why Enterprise Search Fails

July 12, 2016

I participated in a telephone call before the US holiday break. The subject was the likelihood that a potential investment in an enterprise search technology would be a winner. I listened for most of the 60-minute call. I offered a brief example of the over-promise and under-deliver problems which plagued Convera and Fast Search & Transfer, and several of the people on the call asked, “What’s a Convera?” I knew then that today’s whiz kids are essentially reinventing the wheel.

I wanted to capture three ideas which I jotted down during that call. My thought is that at some future time, a person wanting to understand the incredible failures that enterprise search vendors have tallied will have three observations to consider.

No background is necessary. You don’t need to read about throwing rocks at the Google bus, search engine optimization, or any of the craziness about search making Big Data a little pussycat.

Enterprise Search: Does a Couple of Things Well When Users Expect Much More

Enterprise search systems ship with filters or widgets which convert source text into a format that the content processing module can index. The problem is that images, videos, audio files, content from wonky legacy systems, and proprietary file formats like IBM i2’s ANB files do not lend themselves to indexing by a standard enterprise search system. The buyers or licensees of the enterprise search system do not understand the one-trick-pony nature of text retrieval. When the system is deployed, therefore, consternation follows confusion when content is not “in” the enterprise search system and so cannot be found. There are systems which can deal with a wide range of content, but these systems are marketed in a different way and often cost millions of dollars a year to set up, maintain, and operate.
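To make the one-trick-pony point concrete, here is a minimal sketch of the filter dispatch at the heart of a typical indexing pipeline. The handlers and formats are invented for illustration; real products ship dozens of filters, but the failure mode is the same:

```python
def read_plain(path):
    """Stand-in for a real format filter which extracts indexable text."""
    with open(path, errors="ignore") as f:
        return f.read()

# Known text formats get a filter; everything else never enters the index.
TEXT_FILTERS = {
    ".txt": read_plain,
    ".csv": read_plain,
    # No entries for .mp4 video, .wav audio, or IBM i2 .anb chart files.
}

def index_document(path, index):
    ext = path[path.rfind("."):].lower() if "." in path else ""
    handler = TEXT_FILTERS.get(ext)
    if handler is None:
        return False  # the document simply is not "in" the system
    index[path] = handler(path)
    return True
```

No filter, no text; no text, no findability. That is the gap between what the licensee expects and what the system delivers.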


Net net: Vendors do not explain the limitations of text search. Licensees do not take the time or have the desire to understand what an enterprise search system can actually do. Marketers obfuscate in order to close the deal. Failure is a natural consequence.

Data Management Needed

The disconnect boils down to what digital information the licensee wants to search. Once that universe is defined, the system into which the data will be placed must be resolved. No data management, no enterprise search. The reason is that licensees and the users of an enterprise search system assume that “all” or “everything” – web content, email, even outputs from an AS/400 Ironside – is available any time. Baloney. Few organizations have the expertise or the appetite to figure out what is where, how much there is, how frequently each type of data changes, and the formats used. I can hear you saying, “Hey, we know what we have and what we need. We don’t need a stupid, time consuming, expensive inventory.” There you go. Failure is a distinct possibility.
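For those who do want the stupid, time consuming, expensive inventory, the first pass can be as unglamorous as the sketch below: walk the file store and tally what is where, how much, and how fresh. (A real inventory also has to cover databases, email systems, and cloud repositories, which this toy does not.)

```python
import os
from collections import Counter
from datetime import datetime, timezone

def inventory(root):
    """Tally formats, total bytes, and the newest modification time
    under a directory tree: the raw material for a data inventory."""
    formats, bytes_total, newest = Counter(), 0, None
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            ext = os.path.splitext(name)[1].lower() or "(none)"
            try:
                st = os.stat(path)
            except OSError:
                continue  # unreadable files are themselves a finding
            formats[ext] += 1
            bytes_total += st.st_size
            mtime = datetime.fromtimestamp(st.st_mtime, tz=timezone.utc)
            newest = mtime if newest is None else max(newest, mtime)
    return formats, bytes_total, newest
```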


Net net: Hope springs eternal. When problems arise, few know what’s where, who’s on first, and why I Don’t Know is on third.


Enterprise Search Is Stuck in the Past

July 4, 2016

Enterprise search is one of the driving forces behind an enterprise system because the entire purpose of the system is to encourage collaboration and help users find information quickly. While enterprise search is an essential tool, according to Computer Weekly’s article “Beyond Keywords: Bringing Initiative To Enterprise Search,” the feature is stuck in the past.

Enterprise search is due for an upgrade. The amount of enterprise data has increased, but the underlying information management system remains the same. Structured data is easy to fit into the standard information management system; it is the unstructured data, however, that holds the most valuable information. Unstructured information is hard to categorize, but natural language processing is being used to add context. Ontotext combined natural language processing with a graph database, allowing the content indexing to make more nuanced decisions.

We need to level up the basic keyword searching to something more in-depth:

“Search for most organisations is limited: enterprises are forced to play ‘keyword bingo’, rephrasing their question multiple times until they land on what gets them to their answer. The technologies we’ve been exploring can alleviate this problem by not stopping at capturing the keywords, but by capturing the meaning behind the keywords, labeling the keywords into different categories, entities or types, and linking them together and inferring new relationships.”

In other words, enterprise search needs the addition of semantic search in order to add context to the keywords. A basic keyword search returns every result that matches the keyword phrase, but a context-driven search infers the intent behind the keyword phrases. This is really not anything new when it comes to enterprise or any other kind of search. Semantic search is context-driven search.
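To illustrate the difference, consider a toy example (the concept table is invented): a semantic layer maps surface terms to canonical concepts before matching, so a query and a document can meet even when they share no keywords.

```python
# Invented micro-vocabulary mapping surface terms to canonical concepts.
CONCEPTS = {
    "car": "vehicle", "automobile": "vehicle", "sedan": "vehicle",
    "attorney": "lawyer", "counsel": "lawyer",
}

def keyword_match(doc, query):
    """Literal matching: the query string must appear in the document."""
    return query.lower() in doc.lower()

def semantic_match(doc, query):
    """Concept matching: terms are canonicalized before comparison."""
    doc_concepts = {CONCEPTS.get(t.strip(".,"), t.strip(".,"))
                    for t in doc.lower().split()}
    return CONCEPTS.get(query.lower(), query.lower()) in doc_concepts

doc = "The automobile was recovered by counsel."
print(keyword_match(doc, "car"))   # False: no literal match
print(semantic_match(doc, "car"))  # True: both map to "vehicle"
```

Production systems like Ontotext’s do far more, labeling entities and inferring relationships in a graph, but the principle is the same: match meaning, not strings.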

 

Whitney Grace,  July 4, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Amazon AWS Jungle Snares Some Elasticsearch Functions

July 1, 2016

Elastic’s Elasticsearch has become one of the go-to open source search and retrieval solutions. Based on Lucene, the system has put the heat on some of the other open source-centric search vendors. However, search is a tricky beastie.

Navigate to “AWS Elasticsearch Service Woes” to get a glimpse of some of the snags which can poke holes in one’s ripstop hiking garb. The problems are not surprising. One does not know what issues will arise until a search system is deployed and the lucky users are banging away with their queries or a happy administrator discovers that Button A no longer works.

The write up states:

We kept coming across OOM issues due the JVMMemoryPresure spiking and inturn the ES service kept crapping out. Aside from some optimization work, we’d more than likely have to add more boxes/resources to the cluster which then means more things to manage. This is when we thought, “Hey, AWS have a service for this right? Let’s give that a crack?!”. As great as having it as a service is, it certainly comes with some fairly irritating pitfalls which then causes you to approach the situation from a different angle.

One approach is to use index templates to deal with shard management in AWS Elasticsearch. Sample templates are provided in the write up. The fix does not address every issue. The article also provides a link to a reindexing tool called es-tool.
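For readers who have not wrangled one, an index template pins down settings such as shard and replica counts for any new index whose name matches a pattern. Here is a minimal sketch using Python’s requests library; the endpoint, pattern, and sizing are placeholders (and the domain’s access policy must allow the request), so crib from the samples in the write up rather than from me:

```python
import requests

# Placeholder index pattern and sizing; tune to your own data volumes.
template = {
    "template": "logs-*",        # applies to any new index matching the pattern
    "settings": {
        "number_of_shards": 3,   # fixed at index creation, so set it up front
        "number_of_replicas": 1,
    },
}

# Placeholder AWS Elasticsearch domain endpoint.
endpoint = "https://search-mydomain.us-east-1.es.amazonaws.com"
resp = requests.put(endpoint + "/_template/logs_template", json=template)
print(resp.status_code, resp.text)
```

Because shard counts cannot be changed on a live index, getting them right in a template is far cheaper than reindexing after the JVM starts gasping.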

The most interesting comment in the article in my opinion is:

In hindsight I think it may have been worth potentially sticking with and fleshing out the old implementation of Elasticsearch, instead of having to fudge various things with the AWS ES service. On the other hand it has relieved some of the operational overhead, and in terms of scaling I am literally a couple of clicks away. If you have large amounts of data you pump into Elasticsearch and you require granular control, AWS ES is not the solution for you. However if you need a quick and simple Elasticsearch and Kibana solution, then look no further.

My takeaway is to do some thinking about the strengths and weaknesses of Amazon AWS before chopping through the Bezos cloud jungle.

Stephen E Arnold, July 1, 2016

Google Search: Retrievers Lose. Smart Software Wins

June 28, 2016

I scanned a number of write ups about Google’s embrace of machine learning and smart software. I supplement my Google queries with the results of other systems. Some of these have their own index; for example, Yandex.ru and Exalead. Others are metasearch engines which suck in results and do some post processing to help answer the users’ questions. Others are disappointing, and I check them out when I have a client who is willing to pay for stone flipping; for example, DuckDuckGo, iSeek, or the estimable Qwant. (I love quirky spelling too.)

I read “RankBrain Third Most Important Factor Determining Google Search Results.” Here’s the quote I noted:

Google is characteristically fuzzy on exactly how it improves search (something to do with the long tail? Better interpretation of ambiguous requests?) but Jeff Dean [former AltaVista wizard] says that RankBrain is “involved in every query,” and affects the actual rankings “probably not in every query but in a lot of queries.” What’s more, it’s hugely effective. Of the hundreds of “signals” Google search uses when it calculates its rankings (a signal might be the user’s geographical location, or whether the headline on a page matches the text in the query), RankBrain is now rated as the third most useful. “It was significant to the company that we were successful in making search better with machine learning,” says John Giannandrea. “That caused a lot of people to pay attention.” Pedro Domingos, the University of Washington professor who wrote The Master Algorithm, puts it a different way: “There was always this battle between the retrievers and the machine learning people,” he says. “The machine learners have finally won the battle.”

I have noticed in the last year that I am unable to locate certain documents when I use the words and phrases which had served me well before smart software became the cat’s pajamas.

One recent example was my need to locate a case example about a German policeman’s trials and tribulations with the Dark Web. When I first located this document, I was trying to verify an anecdote shared with me after one of my intelligence community lectures.

I had the document in my file, and I pulled it up on my monitor. The document in question is the work of an outfit and person labeled “Lars Hilse.” The title of the write up is “Dark Web & Bitcoin: Global Terrorism Threat Assessment.” The document was published in April 2013 with an update issued in November 2013. (That document was the source of, or maybe confirmed, the anecdote about the German policeman and his Dark Web research.)

For my amusement, I wondered if I could use the new and improved Google Web search to locate the document. I displayed section 4.8 on my screen. The heading of the section is “Extortion (of Law Enforcement Personnel).”

I entered the phrase into Google without quotes. Here’s the first page of results:

[Screenshot: first page of Google results for the unquoted phrase]

None of the hits points to the document containing the five-word phrase.
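A plausible, much simplified explanation (the toy code below is mine, not Google’s method): an unquoted query is treated as a bag of terms to be matched and reweighted by signals such as RankBrain, not as a contiguous phrase, so pages which never contain the exact wording can crowd out the one that does.

```python
def bag_of_words_match(doc, query):
    """Loose matching: every query term appears somewhere in the document."""
    doc_terms = set(doc.lower().replace(".", " ").split())
    return all(term in doc_terms for term in query.lower().split())

def phrase_match(doc, query):
    """Strict matching: the exact phrase appears contiguously."""
    return query.lower() in doc.lower()

doc = "Reports of extortion now worry law enforcement personnel in Germany."
query = "extortion of law enforcement personnel"

print(bag_of_words_match(doc, query))  # True: all five terms occur somewhere
print(phrase_match(doc, query))        # False: the phrase never appears intact
```

Wrapping the phrase in quotation marks pushes Google toward the strict behavior; my experiment, as noted, ran without quotes.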


Enterprise Search Vendor Sinequa Partners with MapR

June 8, 2016

In the world of enterprise search and analytics, everyone wants in on the clients who have flocked to Hadoop for data storage. Virtual Strategy shared an article announcing “Sinequa Collaborates With MapR to Power Real-Time Big Data Search and Analytics on Hadoop.” Sinequa, a firm specializing in big data, has been certified on the MapR Converged Data Platform. The interoperation of Sinequa’s solutions with MapR will enable actionable information to be gleaned from data stored in Hadoop. We learned:

“By leveraging advanced natural language processing along with universal structured and unstructured data indexing, Sinequa’s platform enables customers to embark on ambitious Big Data projects, achieve critical in-depth content analytics and establish an extremely agile development environment for Search Based Applications (SBA). Global enterprises, including Airbus, AstraZeneca, Atos, Biogen, ENGIE, Total and Siemens have all trusted Sinequa for the guidance and collaboration to harness Big Data to find relevant insight to move business forward.”

Beyond all the enterprise search jargon in this article, the collaboration between Sinequa and MapR appears to offer an upgraded service to customers. As we all know at this point, unstructured data indexing is key to data intake. When it comes to output, however, the solutions that matter will be those that support informed business decisions.

 

Megan Feil, June 8, 2016

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

 

Speculation About Beyond Search

June 2, 2016

If you are curious to learn more about the purveyor of the Beyond Search blog, you should check out Singularity’s interview “Stephen E Arnold On Search Engine And Intelligence Gathering.” By way of background, Arnold is a specialist in content processing, indexing, and online search, as well as the author of seven books and monographs. His past employment includes Booz, Allen & Hamilton (Edward Snowden was a contractor for this company), the Courier Journal & Louisville Times, and Halliburton Nuclear. He worked on the US government’s Threat Open Source Intelligence Service and developed the cost analysis, technical infrastructure, and security for FirstGov.gov.

Singularity’s interview covers a variety of topics and, of course, includes Arnold’s direct sense of humor:

“During our 90 min discussion with Stephen E. Arnold we cover a variety of interesting topics such as: why he calls himself lucky; how he got interested in computers in general and search engines in particular; his path from college to Halliburton Nuclear and Booze, Allen & Hamilton; content and web indexing; his who’s who list of clients; Beyond Search and the core of intelligence; his Google Trilogy – The Google Legacy (2005), Google Version 2.0 (2007), and Google: The Digital Gutenberg (2009); CyberOSINT and the Dark Web Notebook; the less-known but major players in search such as Recorded Future and Palantir; Big Brother and surveillance; personal ethics and Edward Snowden.”

When you listen to experts in certain fields, you always get a different perspective than the one the popular news outlets give. Arnold offers a unique take on search as well as the future of Internet security, especially the future of the Dark Web.

 

Whitney Grace, June 2, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Search Sink Hole Identified and Allegedly Paved and Converted to a Data Convenience Store

May 20, 2016

I try to avoid reading more than one write up a day about alleged revolutions in content processing and information analytics. My addled goose brain cannot cope with the endlessly recycled algorithms dressed up in Project Runway finery.

I read “Ryft: Bringing High Performance Analytics to Every Enterprise,” and I was pleased to see a couple of statements which resonated with my dim view of information access systems. There is an accompanying video in the write up. I, as you may know, gentle reader, am not into video. I prefer reading, which is the old-fashioned way to suck up useful factoids.

Here’s the first passage I highlighted:

Any search tool can match an exact query to structured data—but only after all of the data is indexed. What happens when there are variations? What if the data is unstructured and there’s no time for indexing? [Emphasis added]

The answer to the question is increasing costs for sales and marketing. The early warnings of amped-up baloney are the presentations given at conferences and pumped out via public relations firms. (No, Buffy, no, Trent, I am not interested in speaking with the visionary CEO who hired you.)

I also highlighted:

With the power to complete fuzzy search 600X faster at scale, Ryft has opened up tremendous new possibilities for data-driven advances in every industry.

I circled the 600X. Gentle reader, I struggle to comprehend a 600X increase in content processing speed. Dear Mother Google has invested in creating a new chip to get around the limitations of our friend von Neumann’s approach to executing instructions. I am not sure Mother Google has this nailed because Mother Google, like IBM, announces innovations without too much real world demonstration of the nifty “new” things.
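For context on why the claim raises my eyebrows, here is the naive baseline (my own toy code, not Ryft’s method): without an index, fuzzy matching means scanning and scoring every document on every query, so the cost grows with the corpus.

```python
from difflib import SequenceMatcher

def fuzzy_scan(corpus, query, threshold=0.8):
    """Brute force fuzzy matching: nothing to build up front, but every
    query pays to score every document in the corpus."""
    hits = []
    for doc_id, text in corpus.items():
        score = SequenceMatcher(None, query.lower(), text.lower()).ratio()
        if score >= threshold:
            hits.append((doc_id, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

corpus = {1: "Fuzzy serch at scale", 2: "Exact search after indexing"}
print(fuzzy_scan(corpus, "fuzzy search at scale"))  # the typo still matches
```

An inverted index turns exact lookup into a cheap probe after an expensive build. Ryft’s pitch is fuzzy matching at probe-like speed with no build. That is the part I would want demonstrated on real data flows.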

I noted this statement too:

For the first time, you can conduct the most accurate fuzzy search and matching at the same speed as exact search without spending days or weeks indexing data.

Okay, this strikes me as a capability I would embrace if I could get over or around my skepticism. I was able to take a look at the “solution” which delivers the astounding performance and information access capability. Here’s an image from Ryft’s engineering professionals:

[Ryft architecture diagram]

Notice that we have Spark and pre-built components. I assume there are myriad other innovations at work.

The hitch in the git-along is that, in order to deal with certain real world information processing challenges, the inputs come from disparate systems, each generating substantial data flows in real time.

Here’s an example of a real world information access and understanding challenge, which, as far as I know, has not been solved in a cost effective, reliable, or usable manner.


Image source: Plugfest 2016 Unclassified.

This unclassified illustration makes clear that the little things in the sky pump out lots of data into operational theaters. Each stream of data must be normalized and then converted to actionable intelligence.

The assertion about 600X sounds tempting, but my hunch is that the latency in normalizing, transferring, and processing will not meet the need for real time, actionable, accurate outputs when someone is shooting at a person with a hardened laptop in a threat environment.

In short, perhaps the spark will ignite a fire of performance. But I have my doubts. Hey, that’s why I spend my time in rural Kentucky where reasonable people shoot squirrels with high power surplus military equipment.

Stephen E Arnold, May 20, 2016

Big Data and Value

May 19, 2016

I read “The Real Lesson for Data Science That is Demonstrated by Palantir’s Struggles · Simply Statistics.” I love write ups that plunk the word statistics near simple.

Here’s the passage I highlighted in money green:

… What is the value of data analysis?, and secondarily, how do you communicate that value?

I want to step away from the Palantir Technologies example and consider a broader spectrum of outfits tossing around the jargon “big data,” “analytics,” and synonyms for smart software. One doesn’t communicate value. One finds a person who needs a solution and crafts the message to close the deal.

When a company and its perceived technology catch the attention of allegedly informed buyers, a bandwagon effect kicks in. Talk inside an organization leads to mentions in internal meetings. The vendor whose products and services are the subject of these comments begins to hint at bigger and better things at conferences. Then a real journalist may catch a scent of “something happening” and write an article. Technical talks at niche conferences generate wonky articles, usually without the dates or footnotes which would make sense to someone without access to commercial databases. If a social media breeze whips up the smoldering interest, then a fire breaks out.

A start up should be so clever, lucky, or tactically gifted as to pull off this type of wildfire. But when it happens, big money chases the outfit. Once money flows, the company and its products and services become real.

The problem with companies processing a range of data is that there are some friction-inducing processes that are tough to coat with Teflon. These include:

  1. Taking different types of data, normalizing it, indexing it in a meaningful manner, and creating metadata which is accurate and timely (a sketch of this step follows the list).
  2. Converting numerical recipes, many with built-in threshold settings and chains of calculations, into marching band order able to produce recognizable outputs.
  3. Figuring out how to provide an infrastructure that can sort of keep pace with the flows of new data and the updates/corrections to the already processed data.
  4. Generating outputs that people in a hurry or in a hot zone can use to positive effect; for example, in a war zone, not getting killed when the visualization is not spot on.
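Item 1 alone hides most of the friction. Here is a minimal sketch of what “normalizing” means before any analytics can run; the source formats and field names are invented for illustration:

```python
from datetime import datetime, timezone

# Invented examples of two disparate source records.
email_record = {"From": "analyst@example.com",
                "Date": "19 May 2016 14:02:11 +0000", "Body": "..."}
sensor_record = {"src": "uav-7", "ts": 1463666531, "payload": "..."}

def normalize(record, kind):
    """Map disparate source records onto one schema and attach metadata."""
    if kind == "email":
        when = datetime.strptime(record["Date"], "%d %b %Y %H:%M:%S %z")
        origin, text = record["From"], record["Body"]
    elif kind == "sensor":
        when = datetime.fromtimestamp(record["ts"], tz=timezone.utc)
        origin, text = record["src"], record["payload"]
    else:
        raise ValueError("no normalizer for " + kind)  # the exception pile
    return {
        "origin": origin,
        "timestamp": when.isoformat(),
        "text": text,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }

print(normalize(email_record, "email")["timestamp"])
```

Every new source type adds another branch and another chance for the metadata to be neither accurate nor timely.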

The write up focuses on a single company and its alleged problems. That’s okay, but it understates the problem. Most content processing companies run out of revenue steam. The reason is that the licensees or customers want the systems to work better, faster, and more cheaply than predecessor or incumbent systems.

The vast majority of search and content processing systems are flawed, expensive to set up and maintain, and really difficult to use in a way that produces high reliability outputs over time. I would suggest that the problem bedevils a number of companies.

Some of those struggling with these issues are big names. Others are much smaller firms. What’s interesting to me is that the trajectory content processing companies follow is a well worn path. One can read about Autonomy, Convera, Endeca, Fast Search & Transfer, Verity, and dozens of other outfits and discern what’s going to happen. Here’s a summary for those who don’t want to work through the case studies on my Xenky intel site:

Stage 1: Early struggles and wild and crazy efforts to get big name clients

Stage 2: Making promises that are difficult to implement but which are essential to capture customers looking actively for a silver bullet

Stage 3: Frantic building and deployment accompanied with heroic exertions to keep the customers happy

Stage 4: Closing as many deals as possible either for additional financing or for licensing/consulting deals

Stage 5: The early customers start grousing and the momentum slows

Stage 6: Sell off the company or shut down like Delphes, Entopia, Siderean Software and dozens of others.

The problem is not technology, math, or Big Data. The force which undermines these types of outfits is the difficulty of making sense out of words and numbers. In my experience, the task is a very difficult one for humans and for software. Humans want to golf, cruise Facebook, emulate Amazon Echo, or, like water, find the path of least resistance.

Making sense out of information when someone is lobbing mortars at one is a problem which technology can only solve in a haphazard manner. Hope springs eternal and managers are known to buy or license a solution in the hopes that my view of the content processing world is dead wrong.

So far I am on the beam. Content processing requires time, humans, and a range of flawed tools which must be used by a person with old-fashioned human thought processes and procedures.

Value is in the eye of the beholder, not in zeros and ones.

Stephen E Arnold, May 19, 2016

Enterprise Search: The Valiant Fight On

May 17, 2016

I read “VirtualWorks and Language Tools Announce Merger.” I ran across Language Tools several years ago. The company was working to create components for Elasticsearch’s burgeoning user base. The firm espoused natural language processing as a core technology. NLP is useful, but it imposes some computational burdens on some content processing functions. Elasticsearch works pretty well, and there are a number of companies optimizing, integrating, and creating widgets to make life with Elasticsearch better, faster, and presumably more impressive than the open source system is.

This news release highlights the fact that VirtualWorks and Language Tools have merged. The financial details are not explicit, and it appears that the company founded by a wizard from Citrix will make Language Tools the R&D hub for the Florida-based VirtualWorks operation.

According to the story:

The combined organization brings together best of breed core technologies in the areas of enterprise search, data management, text analytics, discovery techniques and analytics to enable the development of new and exciting next generation applications in the business intelligence space.

VirtualWorks is, or was, a SharePoint-centric solution. Like other search vendors, the company uses connectors to suck data into a central indexing point. Users then search the content and have access to it without having to query separate systems.
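A minimal sketch (connector names and documents invented) of the pattern the news release describes: one index, one query, many repositories.

```python
# Each connector yields documents from one repository; a single index
# receives them all so users issue one query instead of several.
def sharepoint_connector():
    yield {"id": "sp-1", "source": "sharepoint", "text": "Q2 planning deck"}

def fileshare_connector():
    yield {"id": "fs-9", "source": "fileshare", "text": "Q2 budget export"}

central_index = {}
for connector in (sharepoint_connector, fileshare_connector):
    for doc in connector():
        central_index[doc["id"]] = doc  # one indexing point for everything

# One query spans every connected system.
hits = [d for d in central_index.values() if "q2" in d["text"].lower()]
print([(d["source"], d["id"]) for d in hits])
```

The loop is the easy part; writing and maintaining a connector for every wonky repository behind the firewall is not.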

This idea has fueled enterprise search since the days of Verity, Autonomy, Fast Search, Convera, et al. The real money today seems to be in the consulting and engineering services required to make enterprise search useful.

SharePoint is certainly widely used, and it is fraught with interesting challenges. Will the lash up of these two firms generate the type of revenue once associated with Autonomy and Fast Search & Transfer?

My hunch is that enterprise search continues to be a tough market. There are functional solutions to locating information available as open source or at comparatively modest license fees. I am thinking of dtSearch and Maxxcat. Both of these work well within Microsoft-centric environments.

Stephen E Arnold, May 17, 2016

Facebook and Humans: Reality Is Not Marketing

May 16, 2016

I read “Facebook News Selection Is in Hands of Editors Not Algorithms, Documents Show.” The main point of the story is that Facebook uses humans to do work. The idea is that algorithms do not seem to be a big part of picking out what’s important.

The write up comes from a “real” journalism outfit. The article points out:

The boilerplate about its [Facebook’s]  news operations provided to customers by the company suggests that much of its news gathering is determined by machines: “The topics you see are based on a number of factors including engagement, timeliness, Pages you’ve liked and your location,” says a page devoted to the question “How does Facebook determine what topics are trending?”

After reading this, I thought of Google’s poetry created by its artificial intelligence system. Here’s the line which came to mind:

I started to cry. (Source: Quartz)

I vibrate with the annoyance bubbling under the surface of the newspaper article. Imagine. Facebook has great artificial intelligence. Facebook uses smart software. Facebook open sources its systems and methods. The company says it is at the cutting edge of replacing humans with objective procedures.

The article’s belief in baloney is fried and served cold on stale bread. Facebook uses humans. The folks at real journalism outfits may want to work through articles like “Different Loci of Semantic Interference in Picture Naming vs. Word-Picture Matching Tasks” to get a sense of why smart systems go wandering.

So what’s new? Palantir Technologies uses humans to index content. Without that human input, the “smart” software does some useful work, but humans are part of the workflow.

Other companies use humans too. But the marketing collateral and the fizzy presentations at fancy conferences paint a picture of a world in which cognitive, artificially intelligent, smart systems do the work that subject matter experts used to do. Humans, like indexers and editors, are no longer needed.

Now reality pokes its rose-tinted fingertips into the real world.

Let me be clear. My unhappiness with the verbiage generated about smart software comes down to one simple fact.

Most of the smart software systems require humans to fiddle at the beginning when a system is set up, while the system operates to deal with exceptions, and after an output is produced to figure out what’s what. In short, smart software is not that smart yet.

There are many reasons, but the primary one is that the math and procedures underpinning many of the systems with which I am familiar are immature. Smart software works well when certain caveats are accepted. For example, the vaunted Watson must be trained. Watson, therefore, is not that much different from the training Autonomy baked into its IDOL system in the mid 1990s. Palantir uses humans for one simple reason. Figuring out what’s important to a team under fire works much better if the humans with skin in the game provide indexing terms and identify important points, like local names for stretches of highway where bombs can be placed without too much hassle.

Dig into any of the search and content processing systems and you find expenditures for human work. Companies licensing smart systems which index automatically face significant budget overruns, operational problems because of lousy outputs, and piles of exceptions to either ignore or deal with. The result is that the smoke and mirrors of marketers speaking to people who want a silver bullet are not exactly able to perform like the carefully crafted demonstrations.

IBM i2 Analyst’s Notebook requires humans. Fast Search (now an earlobe in SharePoint) requires humans. Coveo’s system requires humans. Attivio’s system requires humans. OpenText’s suite of search and content processing requires humans. Even Maxxcat benefits from informed set up and deployment. Out of the box, dtSearch can index, but one needs to know how to set it up and make it work in a specific Microsoft environment. Every search and content processing system that asserts that it is automatic is spackling flawed wallboard.

For years, I have given a lecture about the essential sameness of search and content processing systems. These systems use the same well known and widely taught mathematical procedures. The great breakthroughs at SRCH2 and similar firms amount to optimization of certain operations. But the whiziest system is pretty much like other systems. As a result, these systems perform in a similar manner. These systems require humans to create term lists, lookup tables of aliases for persons of interest, hand-crafted taxonomies to represent the chunk of reality the system is supposed to know about, and other “libraries” and “knowledgebases.”
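What does such a lookup table buy? A toy example (the entries are invented for illustration): without the human-maintained aliases, a query on one surface form misses every document using the other.

```python
# A hand-built alias table of the sort human subject matter experts maintain.
ALIASES = {
    "the colonel": "Ivan Petrov",           # alias for a person of interest
    "i. petrov": "Ivan Petrov",
    "route irish": "Baghdad Airport Road",  # local name for a stretch of road
}

def expand_query(query):
    """Expand a query with its canonical form so documents using either
    surface form can match. No algorithm supplies these pairs; people do."""
    terms = {query}
    canonical = ALIASES.get(query.lower())
    if canonical:
        terms.add(canonical)
    return terms

print(expand_query("Route Irish"))
# {'Route Irish', 'Baghdad Airport Road'}
```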

Watson is a source of amusement to me precisely because the human effort required to make a smart system work is never converted into cost and time statements. People assume Watson won Jeopardy because it was smart. People assume Google knows what ads to present because Google’s software is so darned smart. People assume Facebook mines its data to select news for an individual. Sure, there is automation of certain processes, but humans are needed. Omit the humans and you get the crazy Microsoft Tay system, which humans taught to be crazier than some US politicians.

For decades I have reminded those who listened to my lectures not to confuse what they see in science fiction films with reality. Progress in smart software is evident. But the progress is very slow, hampered by the computational limits of today’s hardware and infrastructure. Just like real time, the concept is easy to say but quite expensive and difficult to implement in a meaningful way. There’s a reason millisecond access to trading data costs so much that only certain financial operations can afford the bill. Smart software is the same.

How about less outrage from those covering smart software and more critical thinking about what’s required to get a system to produce a useful output? In short, more info and less puffery, more critical thinking and less sawdust. Maybe I imagined it, but both the Google and Tesla self-driving vehicles have crashed, right? Humans are essential because smart software is not as smart as those who believe in unicorns assume. Demos, like TV game shows, require pre and post production, gentle reader.

What happens when humans are involved? Isn’t bias part of the territory?

Stephen E Arnold, May 16, 2016

