January 1, 2017
I read a write up which, like 99 percent of the information available for free via the Internet, is 100 percent accurate.
The write up’s title tells the tale: “Google Does a Better Job with Fake News Than Facebook, but There’s a Big Loophole It Hasn’t Fixed.” What’s the loophole? The write up reports:
…the “newsy” modules that sit at the top of many Google searches (the “In the news” section on desktop, and the “Top stories” section on mobile) don’t pull content straight from Google News. They pull from all sorts of content available across the web, and can include sites not approved by Google News. This is particularly confusing for users on the desktop version of Google’s site, where the “In the news” section lives. Not only does the “In the news” section literally have the word “news” in its name, but the link at the bottom of the module, which says “More news for…,” takes you to the separate Google News page, which is comprised only of articles that Google’s editorial system has approved.
So why isn’t the “In the news” section just the top three Google News results?
The short answer is because Google sees Google Search and Google News as separate products.
The word “news” obviously does not mean news. We reported last week about Google’s effort to define “monopoly” for the European Commission investigating allegations of Google’s being frisky with its search results. News simply needs to be understood in the Google contextual lexicon.
Logical? From Google’s point of view, absolutely crystal clear.
The write up amplifies the matter:
Google does, however, seem to want to wipe fake news from its platform. “From our perspective, there should just be no situation where fake news gets distributed, so we are all for doing better here,” Google CEO Sundar Pichai said recently. After the issue of fake news entered the spotlight after the election, Google announced it would ban fake-news sites from its ad network, choking off their revenue. But even if Google’s goal is to kick fake-news sites out of its search engine, most Google users probably understand that Google search results don’t carry the editorial stamp of approval from Google.
Fake news, therefore, is mostly under control. The Google users just have to bone up on how Google works to make information available.
What about mobile?
Google AMP is not news; content labeled as “news” there is part of the AMP technical standard, which speeds up mobile page display.
Google, like Facebook, may tweak its approach to news.
Beyond Search would like to point out that wild and crazy news releases from big time PR dissemination outfits can propagate a range of information (some mostly accurate, some off the wall). The handling of high value sources allows some questionable content to flow. Oh, there are other ways to inject questionable content into the Web indexing systems.
There is not one loophole. There are others. Who wants to nibble into revenue? Not Beyond Search.
Stephen E Arnold, January 1, 2017
December 28, 2016
I spotted a tweet about making smart content smarter. It seems that if content is smarter, then intelligence becomes contentier. I loved my logic class in 1962.
Here’s the diagram from this tweet. Hey, if the link is wonky, just attend the conference and imbibe the intelligence directly, gentle reader.
The diagram carries the identifier Data Ninja, which echoes Palantir’s use of the word ninja for some of its Hobbits. Data Ninja’s diagram has three parts. I want to focus on the middle part:
What I found interesting is that instead of a single block labeled “content processing,” the content processing function is broken into several parts. These are:
A Data Ninja API
A third component in the top box is the statement “analyze unstructured text.” This may refer to indexing and such goodies as entity extraction.
The second box performs “text analysis.” Obviously this process is different from the “analyze unstructured text” step; otherwise, why run the same analyses again? The second box performs what may be clustering of content into specific domains. This is important because a “terminal” in transportation may be different from a “terminal” in a cloud hosting facility. Disambiguation is important because the terminal may be part of a diversified transportation company’s computing infrastructure. I assume Data Ninja’s methods handle this parsing of “concepts” without many errors.
Once the selection of a domain area has been performed, the system appears to perform four specific types of operations as the Data Ninja practice their katas. These are the smart components:
- Smart sentiment; that is, is the content object weighted “positive” or “negative”, “happy” or “sad”, or green light or red light, etc.
- Smart data; that is, I am not sure what this means
- Smart content; that is, maybe a misclassification because the end result should be smart content, but the diagram shows smart content as a subcomponent within the collection of procedures/assertions in the middle part of the diagram
- Smart learning; that is, the Data Ninja system is infused with artificial intelligence, smart software, or machine learning (perhaps the three buzzwords are combined in practice, not just in diagram labeling?)
The end result is an iPhrase-type representation of data. (Note that this approach recalls TeraText, MarkLogic, and other systems which transform unstructured data to metadata tagged structured information.)
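The multi-stage processing the diagram implies can be sketched in a few lines. This is a toy illustration under stated assumptions, not Data Ninja’s actual method: the entity spotting, domain vocabularies, and sentiment word lists below are all invented for the example.

```python
# Toy sketch of a three-stage content processing pipeline: analyze
# unstructured text (entity spotting), disambiguate the domain, and
# weight sentiment. All vocabularies here are hypothetical examples.

DOMAIN_VOCAB = {
    "transportation": {"cargo", "freight", "airport", "shipping"},
    "computing": {"server", "cloud", "hosting", "login"},
}

def extract_entities(text):
    """Naive 'analyze unstructured text' step: capitalized tokens as entities."""
    return [t.strip(".,") for t in text.split() if t[:1].isupper()]

def disambiguate(text):
    """Assign the text to the domain whose vocabulary overlaps it most."""
    tokens = {t.strip(".,").lower() for t in text.split()}
    return max(DOMAIN_VOCAB, key=lambda d: len(DOMAIN_VOCAB[d] & tokens))

def sentiment(text):
    """Crude positive/negative weighting from tiny word lists."""
    pos, neg = {"good", "fast", "reliable"}, {"bad", "slow", "broken"}
    tokens = {t.strip(".,").lower() for t in text.split()}
    return "positive" if len(pos & tokens) >= len(neg & tokens) else "negative"

def process(text):
    return {"entities": extract_entities(text),
            "domain": disambiguate(text),
            "sentiment": sentiment(text)}

doc = "The cloud hosting terminal at Acme was fast and reliable."
result = process(doc)
```

In this sketch the word “terminal” lands in the computing domain because the surrounding vocabulary (“cloud,” “hosting”) outweighs the transportation vocabulary, which is the disambiguation idea described above.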
The diagram then shows a range of services “plugging” into the box performing the functions referenced in my description of the middle box.
If the system works as depicted, Data Ninjas may have the solution to the federation challenge which many organizations face. Smarter content should deliver contentier intelligence or something along that line.
Stephen E Arnold, December 28, 2016
December 2, 2016
I read “This Startup Helps You Deep Snoop Competitor Email Marketing.” I like that “deep snoop” thing. That works pretty well until one loses access to the content to analyze. Just ask Geofeedia, which has been scrambling since it lost access to Twitter and other social media content.
The outfit Rival Explorer offers:
a tool designed to help users improve their email marketing strategy and product pricing and promotion through comprehensive monitoring of their competitor’s email newsletters. After creating a free account, users can browse through a database of marketing emails from over 50,000 brands. Rival Explorer offers access to a number of different email types, including newsletters, cart abandonment emails, welcome emails, and other transactional messages.
In terms of information access, the Rival Explorer customers:
can search by brand, subject, message body, date, day of week, industry, category, and custom tags and keywords. When users select a message, they’re able to view the sender email, subject line, and timestamp of the messages. In addition to those details, users can view the emails as they appear on tablets and smartphones, plus they also can toggle images to get a better idea of design and copy strategy.
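The faceted search described above can be sketched simply: each stored message carries fields such as brand, subject, and type, and a query narrows the collection by any combination of them. The sample records and field names below are invented for illustration, not Rival Explorer’s actual schema.

```python
# Toy sketch of faceted filtering over a collection of marketing emails.
# Records and field names are hypothetical examples.

emails = [
    {"brand": "Acme", "subject": "Welcome to Acme", "type": "welcome"},
    {"brand": "Acme", "subject": "Your cart misses you", "type": "cart"},
    {"brand": "Globex", "subject": "March newsletter", "type": "newsletter"},
]

def search(records, **facets):
    """Keep only records matching every supplied facet exactly."""
    return [r for r in records
            if all(r.get(k) == v for k, v in facets.items())]

acme_welcome = search(emails, brand="Acme", type="welcome")
```

A real service would add full-text matching on the message body and date ranges, but the principle (intersecting facet filters) is the same.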
You can get more information at this link. Public content and marketing information can be useful, it seems.
Stephen E Arnold, December 2, 2016
November 25, 2016
I read “Shedding Light on Dark Data: How to Get Started.” Okay, Dark Data. Like Big Data, the phrase is the fruit of the nomads at Gartner Group. The outfit embracing this sort of old concept is OdinText. Spoiler: I thought the write up was going to identify outfits like BAE Systems, Centrifuge Systems, IBM Analyst’s Notebook, Palantir Technologies, and Recorded Future (an In-Q-Tel and Google backed outfit). Was I wrong? Yes.
The write up explains that a company has to tackle a range of information in order to be aware, informed, or insightful. Pick one. Here’s the list of Dark Data types, which the aforementioned companies have been working to capture, analyze, and make sense of for almost 20 years in the case of NetReveal (Detica) and Analyst’s Notebook. The other companies are comparative spring chickens with an average of seven years’ experience in this effort.
- Customer relationship management data
- Data warehouse information
- Enterprise resource planning information
- Log files
- Machine data
- Mainframe data
- Semi structured information
- Social media content
- Unstructured data
- Web content.
I think the company or nonprofit which tries to suck in these data types and process them may run into some cost and legal issues. Analyzing tweets and Facebook posts can be useful, but there are costs and license fees required. Frankly, not even law enforcement and intelligence entities are able to do a Cracker Jack job with these content streams due to their volume, cryptic nature, and pesky quirks related to metadata tagging. But let’s move on to this statement:
Phone transcripts, chat logs and email are often dark data that text analytics can help illuminate. Would it be helpful to understand how personnel deal with incoming customer questions? Which of your products are discussed with which of your other products or competitors’ products more often? What problems or opportunities are mentioned in conjunction with them? Are there any patterns over time?
Yep, that will work really well in many legal environments. Phone transcripts are particularly exciting.
How does one think about Dark Data? Easy. Here’s a visualization from the OdinText folks:
Notice that there are data types in this diagram NOT included in the listing above. I can’t figure out if this is just carelessness or an insight which escapes me.
How does one deal with Dark Data? OdinText, of course. Yep, of course. Easy.
Stephen E Arnold, November 25, 2016
November 14, 2016
I am preparing three seven-minute videos. One video will appear each week starting on December 20, 2016. The subject is my Google Trilogy, published by an antique outfit which has drowned in the River Avon. The first video is about the 2004 monograph, The Google Legacy. I coined the term “Googzilla” in that 230-page discussion of how Google became baby Google. The second video summarizes several of the takeaways from Google: The Calculating Predator, published in 2007. The key to the monograph is the bound phrase “calculating predator.” Yep, not the happy little search outfit most know and love. The third video hits the main points of Google: The Digital Gutenberg, published in 2009. The idea is that Google spits out more digital content than almost anyone. Few think of the GOOG as the content generator the company has become. Yep, a map is a digital artifact.
Now to the curiosity. I wanted to reference the work of Dr. Alon Halevy, a former University of Washington professor and founder of Nimble and Transformic. I had a stack of links I used when I was doing the research for my predator book. Just out of curiosity I started following the links. I do have PDF versions of most of the open source Halevy-centric content I located.
But guess what?
Dr. Alon Halevy has disappeared. I could not locate the open source version of his talk about dataspaces. I could not locate the Wayback Machine’s archived version of the Transformic.com Web site. The links returned these weird 404 errors. My assumption was that Wayback’s Web pages resided happily on the outfit’s servers. I was incorrect. Here’s what I saw:
I explored the bound phrase “Alon Halevy” with various other terms only to learn that the bulk of the information has disappeared. No PowerPoints, not much substantive information. There were a few “information objects” which have not yet disappeared; for example:
- An ACM blog post which references “the structured data team” and Nimble and Transformic
- A Google research paper which will not make those who buy into David Gelernter’s The Tides of Mind thesis happy
- A YouTube video of a lecture given at Technion.
I found the gap between the research I gathered from 2005 to 2007 and what remains online today interesting. I asked myself, “How did I end up with so many dead links about a technology I have described as one of the most important in database, data management, data analysis, and information retrieval?”
Here are the answers I formulated:
- The Web is a lousy source of information. Stuff just disappears like the Darpa listing of open source Dark Web software, blogs, and Web sites
- I did really terrible research and even worse librarian type behavior. Yep, mea culpa.
- Some filtering procedures became a bit too aggressive and the information has been swept from assorted indexes
- The Wayback Machine ran off the rails and pointed to an actual 2005 Web site which its system failed to copy when the original spidering was completed.
- Gremlins. Hey, they really do exist. Just ask Grace Hopper. Yikes, she’s not available.
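The link rot described above is easy to audit mechanically: feed a list of saved URLs to a checker and record which ones now return errors. A minimal sketch follows; the URLs are placeholders, and the status fetcher is injected (it would normally wrap `urllib.request`) so the sketch runs offline.

```python
# Minimal link audit: classify each saved URL as alive or dead based on
# its HTTP status. `fetch_status` is injected here with simulated codes;
# a real run would issue HEAD/GET requests instead.

def audit_links(urls, fetch_status):
    """Return a dict mapping each URL to 'alive' or 'dead' (4xx/5xx or error)."""
    report = {}
    for url in urls:
        try:
            code = fetch_status(url)
            report[url] = "alive" if code < 400 else "dead"
        except OSError:
            # Network failures count as dead links for this audit.
            report[url] = "dead"
    return report

# Simulated statuses standing in for real HTTP responses.
saved = {"https://example.com/dataspaces": 200,
         "https://example.com/transformic": 404}
report = audit_links(saved, lambda u: saved[u])
```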
I wanted to mention this apparent or erroneous scrubbing. The story in this week’s HonkinNews video points out that 89 percent of journalists do their research via Google. Now if information is not in Google, what does that imply for a “real” journalist trying to do an objective, comprehensive story? I leave it up to you, gentle reader, to penetrate this curiosity.
Watch for the Google Trilogy seven-minute videos on December 20, 2016, December 27, 2016, and January 3, 2017. Free. No pay wall. No Patreon.com pleading. No registration form. Just honkin’ news seven days a week and some video shot on an old Bell+Howell camera in a log cabin in rural Kentucky.
Stephen E Arnold, November 14, 2016
November 11, 2016
What’s next in search? My answer is, “No search at all. The system thinks for you.” Sounds like Utopia for the intellectual couch potato to me.
I read “The Latest in Search: New Services in the Content Discovery Marketplace.” The main point of the write up is to highlight three “discovery” services. A discovery service is one which offers “information users new avenues to the research literature.”
See, no search needed.
The three services highlighted are:
- Yewno, which is powered by an inference engine. (Does anyone remember the Inference search engine from days gone by?). The Yewno system uses “computational analysis and a concept map.” The problem is that it “supplements institutional discovery.” I don’t know what “institutional discovery” means, and my hunch is that folks living outside of rural Kentucky know what “institutional discovery” means. Sorry to be so ignorant.
- ScienceOpen, which delivers a service which “complements open Web discovery.” Okay. I assume that this means I run an old fashioned query and ScienceOpen helps me out.
- TrendMD, which “serves as a classic ‘onward journey tool’ that aims to generate relevant recommendations serendipitously.”
I am okay with the notion of having tools to make it easier to locate information germane to a specific query. I am definitely happy with tools which can illustrate connections via concept maps, link analysis, and similar outputs. I understand that lawyers want to type in a phrase like “Panama deal” and get a set of documents related to this term so the mass of data can be chopped down by sender, recipient, time, etc.
But setting up discovery as a separate operation from keyword or entity based search seems a bit forced to me. The write up spins its lawn mower blades over the TrendMD service. That’s fine, but there are a number of ways to explore scientific, technical, and medical literature. Some are or were delightful like Grateful Med; others are less well known; for example, Mednar and Quertle.
Discovery means one thing to lawyers. It means another thing to me: A search add on.
Stephen E Arnold, November 11, 2016
November 9, 2016
I read “Peter Thiel Explains Why His Company’s Defense Contracts Could Lead to Less War.” I noted that the write up appeared in the Washington Post, a favorite of Jeff Bezos, I believe. The write up referenced a refrain which I have heard before:
Washington “insiders” currently leading the government have “squandered” money, time and human lives on international conflicts.
What I highlighted as an interesting passage was this one:
a spokesman for Thiel explained that the technology allows the military to have a more targeted response to threats, which could render unnecessary the wide-scale conflicts that Thiel sharply criticized.
I also put a star by this statement from the write up:
“If we can pinpoint real security threats, we can defend ourselves without resorting to the crude tactic of invading other countries,” Thiel said in a statement sent to The Post.
The write up pointed out that Palantir booked about $350 million in business between 2007 and 2016 and added:
The total value of the contracts awarded to Palantir is actually higher. Many contracts are paid in a series of installments as work is completed or funds are allocated, meaning the total value of the contract may be reflected over several years. In May, for example, Palantir was awarded a contract worth $222.1 million from the Defense Department to provide software and technical support to the U.S. Special Operations Command. The initial amount paid was $5 million with the remainder to come in installments over four years.
I was surprised at the Washington Post’s write up. No ads for Alexa and no Beltway snarkiness. That too was interesting to me. And I don’t have a dog in the fight. For those with dogs in the fight, there may be some billability worries ahead. I wonder if the traffic jam at 355 and Quince Orchard will now abate when IBM folks do their daily commute.
Stephen E Arnold, November 9, 2016
November 9, 2016
Relationships among metadata, words, and other “information” are important. Google’s Dr. Alon Halevy, founder of Transformic which Google acquired in 2006, has been beavering away in this field for a number of years. His work on “dataspaces” is important for Google and germane to the “intelligence-oriented” systems which knit together disparate factoids about a person, event, or organization. I recall one of his presentations (specifically the PODS 2006 keynote) in which he reproduced a “colleague’s” diagram of a flow chart which made it easy to see who received the document, who edited it and what changes were made, and to whom recipients forwarded it.
Here’s the diagram from Dr. Halevy’s lecture:
Principles of Dataspace Systems, slide 4, by Dr. Alon Halevy, delivered on June 26, 2006, at PODS. Note that “PODS” is an annual ACM database-centric conference.
I found the Halevy discussion interesting.
November 7, 2016
There are differences among these three use cases for entity extraction:
- Operatives reviewing content for information about watched entities prior to an operation
- Identifying people, places, and things for a marketing analysis by a PowerPoint ranger
- Indexing Web content to add concepts to keyword indexing.
Regardless of your experience with software which identifies “proper nouns,” events, meaningful digits like license plate numbers, organizations, people, and locations (accepted and colloquial)—you will find the information in “Performance Comparison of 10 Linguistic APIs for Entity Recognition” thought provoking.
The write up identifies the systems which perform the best and the worst.
Here are the systems and the number of errors each generated in a test corpus. The “scores” are based on a test which contained 150 targets. The “best” system got more correct answers than incorrect ones. I find the results interesting but not definitive.
The five best performing systems on the test corpus were:
- Intellexer API (best)
- Lexalytics (better)
- AlchemyLanguage IBM (good)
- Indico (less good)
- Google Natural Language.
The five worst performing systems on the test corpus were:
- Microsoft Cognitive Services (dead last)
- Hewlett Packard Enterprise Haven (penultimate last)
- Text Razor (antepenultimate)
- Meaning Cloud
- Aylien (apparently misspelled in the source article).
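Rankings like these can be produced by checking each system’s extracted entities against a gold set of targets and counting errors as misses plus spurious hits. The sketch below illustrates that scoring scheme; the mini gold set and system outputs are invented for the example (the article’s corpus had 150 targets).

```python
# Sketch of scoring entity extraction systems against a gold set.
# Errors = entities missed + entities wrongly extracted.

def score(gold, extracted):
    """Return (misses, spurious, total_errors) for one system."""
    gold, extracted = set(gold), set(extracted)
    misses = len(gold - extracted)
    spurious = len(extracted - gold)
    return misses, spurious, misses + spurious

gold = {"Google", "Palantir", "Kentucky"}
outputs = {
    "system_a": {"Google", "Palantir", "Kentucky"},  # perfect recall, no noise
    "system_b": {"Google", "Kentucky", "Harrod"},    # one miss, one spurious hit
}

# Rank systems from fewest to most errors, as the article's lists do.
ranked = sorted(outputs, key=lambda s: score(gold, outputs[s])[2])
```

Note that this simple count treats a miss and a spurious hit as equally bad; precision/recall weighting is a common refinement.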
There are some caveats to consider:
- Entity identification works quite well when the training set includes the entities and their synonyms
- Multi-language entity extraction requires additional training set preparation. “Learn as you go” is often problematic when dealing with social messages, certain intercepted content, and colloquialisms
- Identification of content used as a code—for example, Harrod’s teddy bear for contraband—is difficult even for smart software operating with subject matter experts’ input. (Bad guys are often not stupid and understand the concept of using one word to refer to another thing based on context or previous interactions).
Net net: Automated systems are essential. The error rates may be fine for some use cases and potentially dangerous for others.
Stephen E Arnold, November 7, 2016
October 21, 2016
Have you ever visited a Web site and then lost the address or could not find a particular section on it? You know that the page exists, but no matter how often you use an advanced search feature or scour your browser history, it cannot be found. If you use Google Chrome as your main browser, then there is a solution, says GHacks in the article “Falcon: Full-Text History Search for Chrome.”
Falcon is a Google Chrome extension that adds full-text history search to the browser. Chrome normally matches what you type in the address bar only against the titles and URLs of previously visited sites. The Falcon extension augments the default behavior to match text found on previously visited Web sites.
Falcon is a search option within a search feature:
The main advantage of Falcon over Chrome’s default way of returning results is that it may provide you with better results. If the title or URL of a page don’t contain the keyword you entered in the address bar, it won’t be displayed by Chrome as a suggestion even if the page is full of that keyword. With Falcon, that page may be returned as well in the suggestions.
The new Chrome extension acts as a full-text filter over recorded Web history and improves a user’s search experience so they do not have to sift through results individually.
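The idea behind full-text history matching can be sketched as a small inverted index: index the words on each visited page, then intersect the postings for the words typed into the address bar. This is a toy illustration of the technique, not Falcon’s implementation; the page contents are invented examples.

```python
# Toy inverted index over browsing history: map each word to the set of
# URLs whose page body contains it, then suggest pages matching every
# query word. Page bodies here are hypothetical examples.
from collections import defaultdict

def build_index(pages):
    """Map each lowercased word to the set of URLs containing it."""
    index = defaultdict(set)
    for url, body in pages.items():
        for word in body.lower().split():
            index[word.strip(".,")].add(url)
    return index

def suggest(index, query):
    """Return URLs whose body contains every query word."""
    sets = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*sets) if sets else set()

history = {
    "https://example.com/recipes": "best sourdough bread recipe",
    "https://example.com/news": "search engine news roundup",
}
hits = suggest(build_index(history), "sourdough recipe")
```

This is why Falcon can surface a page whose title and URL never mention the keyword: the match is made against the indexed body text instead.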