Google Search and Hot News: Sensitivity and Relevance

November 10, 2017

I read “Google Is Surfacing Texas Shooter Misinformation in Search Results — Thanks Also to Twitter.” What struck me about the article was the headline; specifically, the implication for me was that Google was not merely responding to user queries. Google is actively “surfacing,” or fetching and displaying, information about the event. Twitter is also involved. I don’t think of Twitter as much more than a party line. One can look up keywords or see a stream of content containing a keyword or, to use Twitter speak, “hash tags.”

The write up explains:

Users of Google’s search engine who conduct internet searches for queries such as “who is Devin Patrick Kelley?” — or just do a simple search for his name — can be exposed to tweets claiming the shooter was a Muslim convert; or a member of Antifa; or a Democrat supporter…

I think I understand. A user inputs a term and Google’s system matches the user’s query to the content in the Google index. Google maintains many indexes, despite its assertion that it is a “universal search engine.” One has to search across different Google services and their indexes to build up a mosaic of what Google has indexed about a topic; for example, blogs, news, the general index, maps, finance, etc.

Developing a composite view of what Google has indexed takes time and patience. The results may vary depending on whether the user is logged in, searching from a particular geographic location, or has enabled or disabled certain behind the scenes functions for the Google system.

The write up contains this statement:

Safe to say, the algorithmic architecture that underpins so much of the content internet users are exposed to via tech giants’ mega platforms continues to enable lies to run far faster than truth online by favoring flaming nonsense (and/or flagrant calumny) over more robustly sourced information.

From my point of view, the ability to figure out what influences Google’s search results requires significant effort, numerous test queries, and recognition that Google search now balances on two pogo sticks. One “pogo stick” is blunt force keyword search. When content is indexed, terms are plucked from source documents. The system may or may not assign additional index terms to the document; for example, geographic or time stamps.

The other “pogo stick” is discovery and assignment of metadata. I have explained some of the optional tags which Google may or may not include when processing a content object; for example, see the work of Dr. Alon Halevy and Dr. Ramanathan Guha.
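
The two “pogo sticks” can be illustrated with a toy sketch: blunt force keyword indexing plus optional metadata assignment. The documents, tags, and field names below are invented for illustration only.

```python
# Toy sketch of the two "pogo sticks": keyword terms plucked from source
# documents, plus optional metadata tags the system may (or may not) assign.
from collections import defaultdict

def index_documents(docs):
    """Build a simple inverted index over text terms and metadata tags."""
    inverted = defaultdict(set)
    for doc_id, doc in docs.items():
        # Pogo stick one: blunt force keyword indexing.
        for term in doc["text"].lower().split():
            inverted[term].add(doc_id)
        # Pogo stick two: assigned metadata, e.g. geographic or time stamps.
        for tag in doc.get("metadata", []):
            inverted[tag].add(doc_id)
    return inverted

docs = {
    "d1": {"text": "shooting in Texas", "metadata": ["geo:texas", "date:2017-11-05"]},
    "d2": {"text": "Texas chicken restaurant"},
}
index = index_documents(docs)
print(sorted(index["texas"]))      # both documents mention the keyword
print(sorted(index["geo:texas"]))  # only d1 carries the assigned metadata tag
```

The point of the sketch: a query on a raw keyword and a query on an assigned tag can return different result sets for the same corpus, which is one reason building a composite view of an index takes test queries and patience.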

But Google, like other smart content processing systems today, has a certain sensitivity. This means the system reacts when the streams of content it processes contain certain keywords.

When “news” takes place, the flood of content allows smart indexing systems to identify a “hot topic.” The test queries we ran for my monographs “The Google Legacy” and “Google Version 2.0” suggest that Google is sensitive to certain “triggers” in content. Feedback can be useful; it can also cause smart software to wobble a bit.

[image: “the impossible takes a little longer” T shirt]

T shirts are easy; search is hard.

I believe that the challenge Google faces is similar to the problem Bing and Yandex are exploring as well; that is, certain numerical recipes can overreact to certain inputs. These overreactions may increase the difficulty of determining what content object is “correct,” “factual,” or “verifiable.”

Expecting a free search system, regardless of its owner, to know what’s true and what’s false is understandable. In my opinion, making this type of determination with today’s technology, system limitations, and content analysis methods is impossible.

In short, the burden of figuring out what’s right and what’s not correct falls on the user, not exclusively on the search engine. Users, on the other hand, may not want the “objective” reality. Search vendors want traffic and want to generate revenue. Algorithms want nothing.

Mix these three elements and one takes a step closer to understanding that search and retrieval is not the slam dunk some folks would have me believe. In fact, the sensitivity of content processing systems to comparatively small inputs requires more discussion. Perhaps that type of information will come out of discussions about how best to deal with fake news and related topics in the context of today’s information retrieval environment.

Free search? Think about that too.

Stephen E Arnold, November 10, 2017

Smartlogic: A Buzzword Blizzard

August 2, 2017

I read “Semantic Enhancement Server.” Interesting stuff. The technology struck me as a cross between indexing, good old enterprise search, and assorted technologies. Individuals who are shopping for an automatic indexing system (either one with expensive, time-consuming hand-coded rules or a more Autonomy-like automatic approach) will want to kick the tires of the Smartlogic system. In addition to the echoes of the SchemaLogic approach, I noted a Thomson submachine gun firing buzzwords; for example:

best bets (I’m feeling lucky?)
dynamic summaries (like Island Software’s approach in the 1990s)
faceted search (hello, Endeca?)
model
navigator (like the Siderean “navigator”?)
real time
related topics (clustering like Vivisimo’s)
semantic (of course)
taxonomy
topic maps
topic pages (a Google report as described in US29970198481)
topic path browser (aka breadcrumbs?)
visualization

What struck me after I compiled this list about a system that “drives exceptional user search experiences” was that Smartlogic is repeating the marketing approach of traditional vendors of enterprise search. The marketing lingo and “one size fits all” triggered thoughts of Convera, Delphes, Entopia, Fast Search & Transfer, and Siderean Software, among others.

I asked myself:

Is it possible for one company’s software to perform such a remarkable array of functions in a way that is easy to implement, affordable, and scalable? There are industrial strength systems which perform many of these functions. Examples range from BAE’s intelligence system to the Palantir Gotham platform.

My hypothesis is that Smartlogic might struggle to process a real time flow of WhatsApp messages, YouTube content, and mobile phone intercept voice calls. Toss in the multi-language content which is becoming increasingly important to enterprises, and the notional balloon I am floating says, “Generating buzzwords and associated overinflated expectations is really easy. Delivering high accuracy, affordable, and scalable content processing is a bit more difficult.”

Perhaps Smartlogic has cracked the content processing equivalent of the Voynich manuscript.


Will buzzwords crack the Voynich manuscript’s inscrutable text? What if Voynich is a fake? How will modern content processing systems deal with this type of content? Running some content processing tests might provide some insight into systems which possess Watson-esque capabilities.

What happened to those vendors like Convera, Delphes, Entopia, Fast Search & Transfer, and Siderean Software, among others? (Free profiles of these companies are available at www.xenky.com/vendor-profiles.) Oh, that’s right. The reality of the marketplace did not match the companies’ assertions about technology. Investors and licensees of some of these systems were able to survive the buzzword blizzard. Some became the digital equivalent of Ötzi, the 5,300-year-old iceman.

Stephen E Arnold, August 2, 2017

AI Not to Replace Lawyers, Not Yet

May 9, 2017

Robot or AI lawyers may be effective at locating relevant cases for reference, but they are far from replacing human lawyers, who still need to go to court and represent clients.

ReadWrite, in a recently published analytical article titled “Look at All the Amazing Things AI Can (and Can’t Yet) Do for Lawyers,” says:

Even if AI can scan documents and predict which ones will be relevant to a legal case, other tasks such as actually advising a client or appearing in court cannot currently be performed by computers.

The author further explains what the present generation of AI tools or robots actually does: they merely find relevant cases based on indexing and keywords, a process which used to be time-consuming and cumbersome. Thus, what robots do is eliminate the tedious work that was performed by interns or lower level employees. Lawyers still need to collect evidence, prepare the case, and argue in court to win. The robots are coming, but only to do the lower level jobs, not to take the lawyers’ own.

Vishol Ingole, May 9, 2017

Palantir Technologies: A Beatdown Buzz Ringing in My Ears

April 27, 2017

I have zero contacts at Palantir Technologies. The one time I valiantly contacted the company about a speaking opportunity at one of my wonky DC invitation-only conferences, a lawyer from Palantir referred my inquiry to a millennial who had a one word vocabulary, “No.”

There you go.

I have written about Palantir Technologies because I used to be an adviser to the pre-IBM incarnation of i2 and its widely used investigation tool, Analyst’s Notebook. I did write about a misadventure between i2 Group and Palantir Technologies, but no one paid much attention to my commentary.

An outfit called Buzzfeed, however, does pay attention to Palantir Technologies. My hunch is that the online real news outfit believes there is a story in the low profile, Peter Thiel-supported company. The technology Palantir has crafted is not that different from the Analyst’s Notebook, Centrifuge Systems’ solution, and quite a few other companies which provide industrial-strength software and systems to law enforcement, security firms, and the intelligence community. (I list about 15 of these companies in my forthcoming “Dark Web Notebook.” No, I won’t provide that list in this free blog. I may be retired, but I am not giving away high value information.)

So what’s caught my attention. I read the article “Palantir’s Relationship with the Intelligence Community Has Been Worse Than You Think.” The main idea is that the procurement of Palantir’s Gotham and supporting services provided by outfits specializing in Palantir systems has not been sliding on President Reagan’s type of Teflon. The story has been picked up and recycled by several “real” news outfits; for example, Brainsock. The story meshes like matryoshkas with other write ups; for example, “Inside Palantir, Silicon Valley’s Most Secretive Company” and “Palantir Struggles to Retain Clients and Staff, BuzzFeed Reports.” Palantir, it seems to me in Harrod’s Creek, is a newsy magnet.

The write up about Palantir’s lousy relationship with the intelligence community pivots on a two year old video. I learned that the Big Dog at Palantir, Alex Karp, said, in a non-public meeting which some clever Hobbit type videoed on a smartphone, words presented this way by the real news outfit:

The private remarks, made during a staff meeting, are at odds with a carefully crafted public image that has helped Palantir secure a $20 billion valuation and win business from a long list of corporations, nonprofits, and governments around the world. “As many of you know, the SSDA’s recalcitrant,” Karp, using a Palantir codename for the CIA, said in the August 2015 meeting. “And we’ve walked away, or they walked away from us, at the NSA. Either way, I’m happy about that.” The CIA, he said, “may not like us. Well, when the whole world is using Palantir they can still not like us. They’ll have no choice.” Suggesting that the Federal Bureau of Investigation had also had friction with Palantir, he continued, “That’s de facto how we got the FBI, and every other recalcitrant place.”

Okay, I don’t know the context of the remarks. It does strike me that 2015 was more than a year ago. In the zippy doo world of Sillycon Valley, quite a bit can change in one year.

I don’t know if you recall Paul Doscher, who was the CEO of Exalead USA and Lucid Imagination (before the company asserted that its technology actually “works”). Mr. Doscher is a good speaker, but he delivered a talk in 2009, captured on video, during which he was interviewed by a fellow in a blue sport coat and shirt. Mr. Doscher wore a baseball cap in gangsta style, a crinkled unbuttoned shirt, and evidenced a hipster approach to discussing travel. Now if you know Mr. Doscher, he is not a manager influenced by gangsta style. My hunch is that he responded to an occasion, and he elected to approach travel with a bit of insouciance.

Could Mr. Karp, the focal point of the lousy relationship article, have been responding to an occasion? Could Mr. Karp have adopted a particular tone and style to express frustration with US government procurement? Keep in mind that a year later, Palantir sued the US Army. My hunch is that views expressed in front of a group of employees may not be news of the moment. Interesting? Sure.

What I find interesting is that the coverage of Palantir Technologies does not dig into the parts of the company which I find most significant. To illustrate: Palantir has a system and method for an authorized user to add new content to the Gotham system. The approach makes it possible to generate an audit trail to make it easy (maybe trivial) to answer these questions:

  1. What data were added?
  2. When were the data added?
  3. What person added the data?
  4. What index terms were added to the data?
  5. What entities were added to the metadata?
  6. What special terms or geographic locations were added to the data?
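
To illustrate the kind of audit trail those questions imply, here is a minimal sketch of an audit record in Python. The field names are hypothetical; Palantir’s actual Gotham schema is not public.

```python
# Hedged sketch of an audit-trail record for data added to an analysis
# system. Every field name here is invented for illustration.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AuditRecord:
    data_id: str
    added_by: str                  # what person added the data
    added_at: datetime             # when the data were added
    index_terms: list = field(default_factory=list)    # index terms assigned
    entities: list = field(default_factory=list)       # entities added to metadata
    special_terms: list = field(default_factory=list)  # e.g. geographic locations

log = []
log.append(AuditRecord(
    data_id="doc-001",
    added_by="analyst_17",
    added_at=datetime(2017, 4, 27, tzinfo=timezone.utc),
    index_terms=["procurement"],
    entities=["US Army"],
    special_terms=["geo:washington_dc"],
))

# "Where did this information come from?" becomes a lookup, not a shrug.
who = [r.added_by for r in log if r.data_id == "doc-001"]
print(who)
```

The design point is simply that every addition carries its provenance with it, so the six questions above are answerable by query rather than by institutional memory.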

You get the idea. Palantir’s Gotham brings to intelligence analysis the type of audit trail I found so compelling in the Clearwell system and other legal-oriented systems. Instead of a person in information technology responding to a question like “Where did this information come from?” with “Duh. I don’t know,”

Gotham gets me an answer.

For me, explaining the reasoning behind Palantir’s approach warrants a write up. I think quite a few people struggling with problems of data quality and what is called by the horrid term “governance” would find Palantir’s approach of some interest.

Now do I care about Palantir? Nah.

Do I care about bashing Palantir? Nah.

What I do care about is tabloidism taking precedence over substantive technical approaches. From my hollow in rural Kentucky, I see folks looking for “sort of” information.

How about more substantive information? I am fed up with podcasts which recycle old information with fake good cheer. I am weary of leaks. I want to know about Palantir’s approach to search and content processing and have its systems and methods compared to what its direct competitors purport to do.

Yeah, I know this is difficult to do. But nothing worthwhile comes easy, right?

I can hear the millennials shouting, “Wrong, you dinosaur.” Hey, no problem. I own a house. I don’t need tabloidism. I have picked out a rest home, and I own 60 cemetery plots.

Do your thing, dudes and dudettes of “real” journalism.

Stephen E Arnold, April 27, 2017

Palantir Technologies: 9000 Words about a Secretive Company

April 3, 2017

Palantir Technologies is a search and content processing company. The technology is pretty good. The company’s marketing is pretty good. Its public profile is now darned good. I don’t have much to say about Palantir’s wheel interface, its patents, or its usefulness to “operators.” If you are not familiar with the company, you may want to read or at least skim the weirdo Fortune Magazine Web article “Donald Trump, Palantir, and the Crazy Battle to Clean Up a Multibillion Dollar Military Procurement Swamp.” The subtitle is a helpful statement:

Peter Thiel’s software company says it has a product that will save soldiers’ lives—and hundreds of millions in taxpayer funds. The Army, which has spent billions on a failed alternative, isn’t interested. Will the president and his generals ride to the rescue?

The article, minus the pull quotes, is more than 9000 words long. The net net of the write up is that the US government’s method of purchasing goods and services may be tough to change. I used to work at a Beltway Bandit outfit. Legend has it that my employer helped set up the US Department of the Navy and many of the business processes so many contractors know and love.

One has to change elected officials, government professionals who operate procurement processes, outfits like Beltway Bandits, and assorted legal eagles.

Why take 9000 words to reach this conclusion? My hunch is that the journey was fun: fun for the Fortune Magazine staff, fun for the author, and fun for the ad sales person who peppered the infinite page with ads.

Will Palantir Technologies enjoy the write up? I suppose it depends on whom one asks. Perhaps a reader connected to IBM could ask Watson about the Analyst’s Notebook team. What are their views of Palantir? For most folks, my thought is that the Palantir connection to President Trump may provide a viewshed from which to assess the impact of this real journalism essay thing.

Stephen E Arnold, April 3, 2017

Is Google Plucking a Chicken Joint?

March 14, 2017

Real chicken or fake news? You decide. I read “Google, What the H&%)? Search Giant Wrongly Said Shop Closed Down, Refused to List the Truth.” The write up reports that a chicken restaurant is clucking mad about how Google references the eatery. The Google, according to the article, thinks the fowl peddler is out of business. The purveyor of poultry disagrees.

The write up reports:

Kaie Wellman says that her rotisserie chicken outlet Arrosto, in Portland, Oregon, US, was showing up as “permanently closed” on Google’s mobile search results.

Ms Wellman contacted the Google and allegedly learned that Google would not change the listing. The fix seems to be that the bird roaster has to get humans to input data via Google Maps. The smart Google system will recognize the inputs and make the fix.

The write up reports that the Google listing is now correct. The fowl mix up is now resolved.

Yes, the Google. Relevance, precision, recall, and accuracy. Well, maybe not so much of these ingredients when one is making fried mobile outputs.

Stephen E Arnold, March 14, 2017

Index Is Important. Yes, Indexing.

March 8, 2017

I read “Ontologies: Practical Applications.” The main idea in the write up is that indexing is important. Now indexing is labeled in different ways today; for example, metadata, entity extraction, concepts, etc. I agree that indexing is important, but the challenge is that most people are happy with tags, keywords, or systems which return a result that has made a high percentage of users happy. Maybe semi-happy. Who really knows? Asking about search and content processing system satisfaction returns the same grim news year after year; that is, most users (roughly two thirds) are not thrilled with the tools available to locate information. Not much progress in 50 years it seems.

The write up informs me:

Ontologies are a critical component of the enterprise information architecture. Organizations must be capable of rapidly gathering and interpreting data that provides them with insights, which in turn will give their organization an operational advantage.  This is accomplished by developing ontologies that conceptualize the domain clearly, and allows transfer of knowledge between systems.

This seems to mean a classification system which makes sense to those who work in an organization. The challenge which we have encountered over the last half century is that the content and data flowing into an organization change, often rapidly, over time. At any one point, the information needed today may not yet be available. The organization sucks in what’s needed and hopes the information access system indexes the new content right away and makes it findable and usable in other software.

That’s the hope anyway.

The reality is that a gap exists between what’s accessible to a person in an organization and what information is being acquired and used by others in the organization. Search fails for most system users because what’s needed now is not indexed or if indexed, the information is not findable.

An ontology is a fancy way of saying that a consultant and software can cook up a classification system and use those terms to index content. Nifty idea, but what about that gap?
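
As a toy illustration of that idea, here is a hand-cooked “ontology” used to assign concept terms to content. The terms are invented, and the gap shows up as soon as incoming content falls outside the mapping.

```python
# A consultant-and-software "ontology": a hand-built mapping from surface
# keywords to broader concept terms. All terms here are invented examples.
ONTOLOGY = {
    "chemotherapy": ["oncology", "treatment"],
    "biopsy": ["oncology", "diagnostics"],
}

def tag_with_concepts(text):
    """Index content with concept terms instead of raw keywords."""
    tags = set()
    for word in text.lower().split():
        tags.update(ONTOLOGY.get(word, []))
    return sorted(tags)

tags = tag_with_concepts("The biopsy preceded chemotherapy")
print(tags)                                     # concept terms, not keywords
print(tag_with_concepts("cryoablation trial"))  # the gap: unmapped terms yield nothing
```

The second call is the problem in miniature: new content using terms the classification scheme has never seen is indexed with nothing, and is therefore not findable by concept.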

This is the killer for most indexing outfits. They make a sale because people are dissatisfied with the current methods of information access. An ontology or some other jazzed up indexing component is sold as the next big thing.

When an ontology, taxonomy, or other solution does not solve the problem, the company grouses about search and content processing again.

Is there a fix? Who knows. But after 50 years in the information access sector, I know that jargon is not an effective way to solve very real problems. Money, know how, and old school methods are needed to make certain technologies deliver useful applications.

Ontologies. Great. Silver bullet. Nah. Practical applications? Nifty concept. Reality is different.

Stephen E Arnold, March 8, 2017

Forecasting Methods: Detail without Informed Guidance

February 27, 2017

Let’s create a scenario. You are a person trying to figure out how to index a chunk of content. You are working with cancer information sucked down from PubMed or a similar source. You run an extraction process and push the text through an indexing system. You use a system like Leximancer and look at the results. Hmmm.

Next you take a corpus of blog posts dealing with medical information. You suck down the content and run it through your extractor, your indexing system, and your Leximancer set up. You look at the results. Hmmm.

How do you figure out what terms are going to be important for your next batch of mixed content?
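
One simple way to probe that question is to compare which terms the two corpora share. A toy sketch, with invented snippets standing in for PubMed abstracts and medical blog posts:

```python
# Compare term overlap between two small corpora to guess which terms
# will matter for a mixed batch. The snippets are invented stand-ins.
from collections import Counter

def term_freq(texts):
    """Crude term frequencies for a small corpus."""
    counts = Counter()
    for t in texts:
        counts.update(t.lower().split())
    return counts

pubmed = ["tumor growth in cancer patients", "cancer therapy outcomes"]
blogs = ["my cancer journey", "alternative therapy blog"]

shared = sorted(set(term_freq(pubmed)) & set(term_freq(blogs)))
print(shared)  # terms appearing in both corpora
```

Terms appearing in both corpora are reasonable candidates for a shared index vocabulary; terms unique to one corpus flag where the extractor and indexer will behave differently on mixed content.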

You might navigate to “Selecting Forecasting Methods in Data Science.” The write up does a good job of outlining some of the numerical recipes taught in university courses and discussed in textbooks. For example, you can get an overview in this nifty graphic:

[image: overview graphic of forecasting methods]

And you can review outputs from the different methods identified like this:

[image: sample outputs from the different forecasting methods]

Useful.
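
As a small illustration of why method choice matters, here is a pure-Python sketch in which a naive (last-value) forecast and a linear-trend forecast disagree on the same clean series. The series is invented.

```python
# Two textbook forecasting methods applied to the same toy series.
def naive_forecast(series):
    """Tomorrow looks like today: just repeat the last observation."""
    return series[-1]

def linear_trend_forecast(series):
    """Fit y = slope*x + intercept by least squares, extrapolate one step."""
    n = len(series)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(series) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, series)) \
            / sum((x - x_mean) ** 2 for x in xs)
    intercept = y_mean - slope * x_mean
    return slope * n + intercept

series = [10, 12, 14, 16]            # a clean upward trend
print(naive_forecast(series))        # ignores the trend entirely
print(linear_trend_forecast(series)) # extrapolates it
```

On a trending series the two methods give different answers, and neither is “right” without informed guidance about what the data actually are; picking the trend line you want is exactly the failure mode described below.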

What’s missing? For the person floundering away like one employee at a government agency where I worked years ago, you pick the trend line you want. Then you try to plug in the numbers and generate some useful data. If that is too tough, you hire your friendly GSA schedule consultant to do the work for you. Yep, that’s how I ended up looking at:

  • Manually selected data
  • Lousy controls
  • Outputs from different systems
  • Misindexed text
  • Entities which were not really entities
  • A confused government employee.

Here’s the takeaway. Just because software is available to output stuff in a log file and Excel makes it easy to wrangle most of the data into rows and columns, none of the information may be useful, valid, or even in the same ball game.

When one then applies different forecasting methods without understanding them, we have an example of how an individual can create a pretty exciting data analysis.

Descriptions of algorithms do not correlate with high value outputs. Data quality, sampling, understanding why curves are “different”, and other annoying details don’t fit into some busy work lives.

Stephen E Arnold, February 27, 2017

Intellisophic / Linkapedia

February 24, 2017

Intellisophic identifies itself as a Linkapedia company. Poking around Linkapedia’s ownership revealed some interesting factoids:

  • Linkapedia is funded in part by GITP Ventures and SEMMX (possibly a Semper fund)
  • The company operates in Hawaii and Pennsylvania
  • One of the founders is a monk / Zen master. (Calm is a useful characteristic when trying to spin money from a search machine.)

First, Intellisophic. The company describes itself this way:

Intellisophic is the world’s largest provider of taxonomic content. Unlike other methods for taxonomy development that are limited by the expense of corporate librarians and subject matter experts, Intellisophic content is machine developed, leveraging knowledge from respected reference works. The taxonomies are unbounded by subject coverage and cost significantly less to create. The taxonomy library covers five million topic areas defined by hundreds of millions of terms. Our taxonomy library is constantly growing with the addition of new titles and publishing partners.

In addition, Intellisophic’s technology—Orthogonal Corpus Indexing—can identify concepts in large collections of text. The system can be used to enrich existing technology, business intelligence, and search. One angle Intellisophic exploits is its use of reference and educational books. The company is in the “content intelligence” market.

Second, the “parent” of Intellisophic is Linkapedia. This public facing Web site allows a user to run a query and see factoids and links about a topic. Plus, Linkapedia has specialist collections of content bundles; for example, lifestyle, pets, and spirituality. I did some clicking around and found that certain topics were not populated; for instance, Lifestyle, Cars, and Brands. No brand information appeared for me. I stumbled into a lengthy explanation of the privacy policy related to a mathematics discussion group. I backtracked, trying to access the actual group, and failed. I think the idea is an interesting one, but more work is needed. My test query for “enterprise search” presented links to Convera and a number of obscure search related Web sites.

The company is described this way in Crunchbase:

Linkapedia is an interest based advertising platform that enables publishers and advertisers to monetize their traffic, and distribute their content to engaged audiences. As opposed to a plain search engine which delivers what users already know, Linkapedia’s AI algorithms understand the interests of users and helps them discover something new they may like even if they don’t already know to look for it. With Linkapedia content marketers can now add Discovery as a new powerful marketing channel like Search and Social.

Like other search related services, Linkapedia uses smart software. Crunchbase states:

What makes Linkapedia stand out is its AI discovery engine that understands every facet of human knowledge. “There’s always something for you on Linkapedia”. The way the platform works is simple: people discover information by exploring a knowledge directory (map) to find what interests them. Our algorithms show content and native ads precisely tailored to their interests. Linkapedia currently has hundreds of million interest headlines or posts from the worlds most popular sources. The significance of a post is that “someone thought something related to your interest was good enough to be saved or shared at a later time.” The potential of a post is that it is extremely specific to user interests and has been extracted from recognized authorities on millions of topics.

Interesting. Search positioned as indexing, discovery, social, and advertising.

Stephen E Arnold, February 24, 2017

Mondeca: Tweaking Its Market Position

February 22, 2017

One of the Beyond Search goslings noticed a repositioning of the taxonomy capabilities of Mondeca. Instead of pitching indexing, the company has embraced ElasticSearch (based on Lucene) and Solr. The idea is that if an organization is using either of these systems for search and retrieval, Mondeca can provide “augmented” indexing. The idea is that keywords are not enough. Mondeca can index the content using concepts.

Of course, the approach is semantic, permits exploration, and enables content discovery. Mondeca’s Web site describes search as “find” and explains:

Initial results are refined, annotated and easy to explore. Sorted by relevancy, important terms are highlighted: easy to decide which one are relevant. Sophisticated facet based filters. Refining results set: more like this, this one, statistical and semantic methods, more like these: graph based activation ranking. Suggestions to help refine results set: new queries based on inferred or combined tags. Related searches and queries.

This is a similar marketing move to the one that Intrafind, a German search vendor, implemented several years ago. Mondeca continues to offer its taxonomy management system. Human subject matter experts do have a role in the world of indexing. Like other taxonomy systems and services vendors, the hook is that content indexed with concepts is smart. I love it when indexing makes content intelligent.
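
A minimal sketch of what “augmented” indexing of this sort might look like before a document reaches Elasticsearch or Solr. The taxonomy and field names below are invented for illustration; they are not Mondeca’s.

```python
# Concept-augmented indexing: before a document is handed to the search
# engine, taxonomy concepts are added alongside its raw keywords.
TAXONOMY = {
    "merger": "corporate_finance",
    "acquisition": "corporate_finance",
    "patent": "intellectual_property",
}

def augment(doc):
    """Return a copy of the document enriched with concept tags."""
    concepts = {TAXONOMY[w] for w in doc["body"].lower().split() if w in TAXONOMY}
    return {**doc, "concepts": sorted(concepts)}

doc = {"id": 1, "body": "The merger hinged on a patent dispute"}
enriched = augment(doc)
print(enriched["concepts"])
```

Once the `concepts` field is in the index, a query on a concept retrieves documents that never contain the concept term itself, which is the pitch behind calling indexed content “smart.”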

The buzzword is used by outfits ranging from MarkLogic’s merry band of XML and XQuery professionals to the library-centric outfits like Smartlogic. Isn’t smart logic better than logic?

Stephen E Arnold, February 22, 2017
