DuckDuckGo: One Expert Thinks It Is Better Than Google Search
November 8, 2016
I love the stories about Google’s lousy search system. The GOOG is trying to improve search with smart software while providing more third party sponsored links in search results. In my research, I have learned that most Google users focus on getting answers to their questions. The fact that these users ask mostly repetitive questions means that the GOOG can optimize for what works to handle the majority of the queries. That is efficiency for the user, for Google’s network resources, and for Google’s paying customers. Some experts don’t like the direction Google, powered by its data analysis, is moving.
One example is spelled out in “How I Quit Using Google Search and Saved a Lot of Time.” I noted:
Now, DDG isn’t an exact replacement for Google, but they’re close. I almost always find what I’m looking for with them [I think the “them” refers to DuckDuckGo], but it [I think this means searching] can be more work. The biggest feature I miss is that you can’t specify a search period, such as the last week or year, or a date range. But only a few times in the last year have I been forced to go to Google for a difficult search.
Okay, but Google does offer Google Advanced Search and some old fashioned search box operators. These are not perfect. I agree that Google has some time deficiencies. That lack of “time” functionality may be one contributing reason behind Google’s investment in Recorded Future, an analytics platform designed to perform a range of time centric functions; for example, spider the Dark Web and array events on a timeline, with additional analytic reports available with a mouse click.
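For the record, the “time” function is reachable without the Advanced Search form. A minimal sketch in Python, assuming the unofficial and undocumented tbs URL parameter (cdr signals a custom date range) still behaves as it has; the query itself is just an example:

```python
from urllib.parse import urlencode

# Build a Google query restricted to a custom date range via the
# undocumented "tbs" parameter; values may change without notice.
params = {
    "q": "recorded future timeline analysis",
    "tbs": "cdr:1,cd_min:1/1/2016,cd_max:11/8/2016",
}
url = "https://www.google.com/search?" + urlencode(params)
print(url)
```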
The write up does not share these “advanced” factoids. I highlighted this statement:
Given the advantages of a Google-free existence, I have to wonder what Google is costing the world economy. If interesting ads cause each Internet user to spend an extra five minutes a day on non-productive shopping, with almost 3 billion Internet users today, that’s 15 billion minutes or over 28,000 person years of productivity every day.
Yes, an example of what I call mid tier consultant reasoning. Make assumptions about “time”. Assign a value. Calculate the “cost.” Works every time; for example, the now cherished IDC estimate of how much time a worker spends looking for information. The idea is that a better search system unleashes value, productivity, and other types of wonderfulness. The problem is that this type of logic is often misleading because the assumptions are specious and the analysis is something even a sixth grade baseball statistics fan would doubt. How about them Cubbies?
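For what it is worth, the arithmetic in the quoted passage does hold together; a quick back of the envelope check (the five minute figure is the author’s assumption, not mine):

```python
users = 3_000_000_000              # "almost 3 billion Internet users"
extra_minutes_per_day = 5          # assumed non-productive shopping time

total_minutes = users * extra_minutes_per_day    # 15,000,000,000 minutes per day
person_years = total_minutes / 60 / 24 / 365     # roughly 28,500 person years per day
print(f"{total_minutes:,} minutes, about {person_years:,.0f} person years")
```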
But the point of the write up is that DuckDuckGo does not traffic in human user data. There are ads, but these ads are different from Google ads. Okay. Fine.
The write up reveals three things about experts doing online search analysis:
- Ads, regardless of who shows them, pump data back to the source of the ad. The outfit may choose to ignore what works and what doesn’t at its peril. Failed ads do not generate revenue for the advertiser. Hence the advertiser will go elsewhere.
- Running queries which return on point information is time consuming and often difficult. The reasons range from the mysterious removal of information from indexes to the vagaries of human language. Do you know the exact term to use to locate malware which can be used to compromise an iPhone, and the name of the vendor who sells this type of program? Give that a whirl on a single free Web search system.
- The merging of imprecise information about the “cost” of a search is not convincing. Perhaps the expert should consider the impact of the shift from desktop search to mobile device search. That change will force most free Web search systems to turn some cartwheels and find ways to generate revenue. Fancy functionality is simply not used by 97 percent of online search users. Good enough search is the new normal. Thus, search today is not what search yesterday was perceived to be.
Who cares about alternative free Web search systems? The truth is that those who long for the good old days of Google may have to wake up and check out the new dawn. Misinformation, disinformation, and filtered information are the norm. No advanced search system on the planet can provide pointers to high value, accurate content on a consistent basis.
Stephen E Arnold, November 8, 2016
How Real Journalists Do Research
November 8, 2016
I read “Search & Owned Media Most Used by Journalists.” The highlight of the write up was a table created by Businesswire. The “Media Survey” revealed “Where the Media Look When Researching an Organization.” Businesswire is a news release outfit. Organizations pay to have a write up sent to “real” journalists.
Let’s look at the data in the write up.
The top five ways “real” journalists obtain information is summarized in the table below. I don’t know the sample size, the methodology, or the method of selecting the sample. My hunch is that the people responding have signed up for Businesswire information or have some other connection with the company.
| Most Used Method | Percent Using |
| --- | --- |
| Google | 89% |
| Organization Web site | 88% |
| Organization’s online newsroom | 75% |
| Social media postings | 54% |
| Government records | 53% |
Now what about the five least used methods for research:
| Least Used Method | Percent Using |
| --- | --- |
| Organization PR spokesperson | 39% |
| News release boilerplate | 33% |
| Bing | 8% |
| Yahoo | 7% |
| Other (sorry but no details) | 6% |
Now what about the research methods in between these two extremes of most and least used:
| No Man’s Land Methods | Percent Using |
| --- | --- |
| Talk to humans | 51% |
| Trade publication Web sites | 44% |
| Local newspapers | 43% |
| Wikipedia | 40% |
| Organization’s blog | 39% |
Several observations flapped across the minds of the goslings in Harrod’s Creek.
- Yahoo and Bing may want to reach out to “real” journalists and explain how darned good their search systems are for “real” information. If the data are accurate, Google is THE source for “real” journalists’ core or baseline information
- The popularity of social media and government information is a dead heat. I am not sure whether this means social media information is wonderful or if government information is not up to the standards of social media like Facebook or Twitter
- Talking to humans, which I assume was the go to method for information, is useful to half the “real” journalists. This suggests that half of the “real” news churned out by “real” journalists may be second hand, recycled and transformed, or tough to verify. The notion of “good enough” enters at this point
- Love that Wikipedia because 40 percent of “real” journalists rely on it for some or maybe a significant portion of the information in a “real” news story.
It comes as no surprise that news releases creep into the results list via Google’s indexing of “real” news, the organization’s online newsroom, the organization’s tweets and Facebook posts, trade publications which are first class recyclers of news releases, and the organization’s blog.
Interesting. Echo chamber, filter bubble, disinformation—Do any of these terms resonate with you?
Stephen E Arnold, November 8, 2016
Iceland Offers the First Human Search Engine
November 8, 2016
Iceland is a northern country that one does not think about much. It is cold, has a high literacy rate, and did we mention it is cold? Despite its frigid temperatures, Iceland is a beautiful country with a rich culture and friendly people. A recent write up shares just how friendly the Icelanders are with their new endeavor: “Iceland Launches ‘Ask Guðmundur,’ The World’s First Human Search Engine.”
Here is what the country is doing:
The decidedly Icelandic and truly personable service will see special representatives from each of Iceland’s seven regions offer their insider knowledge to the world via Inspired By Iceland’s social media platforms (Twitter, Facebook and YouTube). Each representative shares the name Guðmundur or Guðmunda, currently one of the most popular forenames in the country with over 4,000 men and women claiming it as their own.
Visitors to the site can submit their questions and have them answered by an expert. Each of the seven Guðmundurs is an Icelandic regional expert. Iceland’s goal with the human search engine is to answer the world’s questions about the country, but to answer them in the most human way possible: with actual humans.
A human search engine is an awesome marketing campaign for Iceland. One of the best ways to encourage tourism is to introduce foreigners to the local people and customs; the more welcoming, quirky, and interesting, the better for Iceland. So go ahead, ask Guðmundur.
Whitney Grace, November 8, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
Good Old Sleuthing Can Still Beat Dark Web
November 8, 2016
Undercover investigative work by different agencies in Bergen County, New Jersey, resulted in the arrest of an 18-year-old man who was offering hitman services over the Dark Net.
As reported by Patch.com in a news report titled “Hitman Who Drove To Mahwah For Meeting Arrested: Prosecutor”:
The Mahwah Police Department, Homeland Security Investigations, and the Bergen County Prosecutor’s Office Cyber Crimes Unit investigated Rowling, a Richmondville, New York resident. Rowling allegedly used the dark web to offer his services as a hitman.
Tracking Dark Web participants is extremely difficult, so undercover agents posing as buyers were scouting for hitmen in New York. Rowling, suspecting nothing, offered his services in return for some cash and a gun. The meeting was arranged at the Mason Jar in Mahwah, where he was subsequently arrested and remanded to the Bergen County Jail.
As per the report, Rowling faces multiple charges:
In addition to conspiracy to murder, Rowling was charged with possession of a weapon for an unlawful purpose, unlawful possession of a weapon, and possession of silencer, Grewal said.
Drug traffickers, hackers, and smugglers of contraband goods and narcotics are increasingly using the Dark Web to sell their goods and services. Under such circumstances, authorities have no option but to use old fashioned investigative techniques to put the criminals behind bars. However, most of the Dark Net and its participants remain out of reach of law enforcement agencies.
Vishal Ingole, November 8, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
Demand for Palantir Shares Has Allegedly Gone Poof
November 7, 2016
I read “Ex-Palantir Employees Are Struggling To Sell Their Shares.” Let’s assume that the information in the write up is spot on. The main idea is that one of the most visible of Silicon Valley’s secretive companies has created a problem for some of its former employees. I learned:
“Demand has evaporated” for the shares that make up the bulk of Palantir’s pay packages, and the company’s CEO seems aware of financial angst among his staff.
The softening of the market for stock options suggests that the company’s hassles with investors and the legal dust up with the US government are having an effect. Couple the buzz with the prices in Silicon Valley, and it is easy to understand why some people want to convert options into cash money. I highlighted this passage:
Some said they needed the cash to buy a house or pay down debt, while another said they took out a loan to fund the process of turning the options into shares. One said it was “infuriating” trying to sell their shares in a “crap” market.
I found this statement from a broker, who was not named, suggestive:
This person then quoted an unidentified broker as saying, “There is absolutely nothing moving in Palantir. People who have bought through us are trying to sell now. I don’t see it changing without the company changing their tone on an IPO.”
With the apparent decision relating to the US Army and its procurement position with regard to Palantir going the way of the Hobbits, perhaps the negativism will go away.
One thought: Buzzfeed continues to peck away at Palantir Technologies. Palantir Technologies has a relationship with Peter Thiel. The intersection of online publications and Peter Thiel has been interesting. Worth watching.
Stephen E Arnold, November 7, 2016
Entity Extraction: No Slam Dunk
November 7, 2016
There are differences among these three use cases for entity extraction:
- Operatives reviewing content for information about watched entities prior to an operation
- Identifying people, places, and things for a marketing analysis by a PowerPoint ranger
- Indexing Web content to add concepts to keyword indexing.
Regardless of your experience with software which identifies “proper nouns,” events, meaningful digits like license plate numbers, organizations, people, and locations (accepted and colloquial)—you will find the information in “Performance Comparison of 10 Linguistic APIs for Entity Recognition” thought provoking.
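None of the ten APIs from the comparison is exercised here, but a minimal sketch using the open source spaCy library (not one of the systems tested) shows the kind of output this class of software produces: labeled spans for people, organizations, locations, dates, and similar items. The sample sentence is invented.

```python
# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Palantir Technologies, founded by Peter Thiel, opened an office "
          "in New York on November 7, 2016.")

# Print each recognized entity and its label (e.g., ORG, PERSON, GPE, DATE).
for ent in doc.ents:
    print(ent.text, ent.label_)
```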
The write up identifies the systems which perform the best and the worst.
Here is how the systems ranked on a test corpus. The “scores” are based on a test which contained 150 targets, with systems ordered by the number of errors each generated. The “best” system got more correct than incorrect. I find the results interesting but not definitive.
The five best performing systems on the test corpus were:
- Intellexer API (best)
- Lexalytics (better)
- AlchemyLanguage IBM (good)
- Indico (less good)
- Google Natural Language.
The five worst performing systems on the test corpus were:
- Microsoft Cognitive Services (dead last)
- Hewlett Packard Enterprise Haven (penultimate last)
- Text Razor (antepenultimate)
- Meaning Cloud
- Aylien (apparently misspelled in the source article).
There are some caveats to consider:
- Entity identification works quite well when the entities and their synonyms are included in the training set (a sketch of this idea appears after this list)
- Multi-language entity extraction requires additional training set preparation. “Learn as you go” is often problematic when dealing with social messages, certain intercepted content, and colloquialisms
- Identification of content used as a code—for example, Harrod’s teddy bear for contraband—is difficult even for smart software operating with subject matter experts’ input. (Bad guys are often not stupid and understand the concept of using one word to refer to another thing based on context or previous interactions).
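As the first caveat suggests, results improve when known entities and their synonyms are handed to the system explicitly rather than left for the statistical model to guess. A minimal sketch of that idea, again with spaCy; the names, labels, and sample sentence are made up for illustration:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Add a rule based pass so known entities and their synonyms are caught
# even when the statistical model misses them.
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Hewlett Packard Enterprise Haven"},
    {"label": "ORG", "pattern": "HPE Haven"},         # synonym / short form
    {"label": "PRODUCT", "pattern": "teddy bear"},     # a code word, per the caveat above
])

doc = nlp("The analyst asked HPE Haven to flag any mention of a teddy bear.")
print([(ent.text, ent.label_) for ent in doc.ents])
```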
Net net: Automated systems are essential. The error rates may be fine for some use cases and potentially dangerous for others.
Stephen E Arnold, November 7, 2016
Lucidworks Hires Watson
November 7, 2016
One of our favorite companies to track is Lucidworks, due to its commitment to open source technology and enterprise business systems. The San Diego Times shares that “Lucidworks Integrates IBM Watson To Fusion Enterprise Discovery Platform.” This means that Lucidworks has integrated IBM’s Watson technology into its Fusion platform to help developers create discovery applications that capture data and surface insights. In short, the company has added a powerful set of big data capabilities.
While Lucidworks is built on open source software, adding proprietary Watson technology will only benefit its clients. Watson has proven itself an invaluable big data tool and, paired with the Fusion platform, will do wonders for enterprise systems. Data is a key component of every industry, but understanding and implementing it is difficult:
Lucidworks’ Fusion is an application framework for creating powerful enterprise discovery apps that help organizations access all their information to make better, data-driven decisions. Fusion can process massive amounts of structured and multi-structured data in context, including voice, text, numerical, and spatial data. By integrating Watson’s ability to read 800 million pages per second, Fusion can deliver insights within seconds. Developers benefit from this platform by cutting down the work and time it takes to create enterprise discovery apps from months to weeks.
With the Watson upgrade to Lucidworks’ Fusion platform, users gain natural language processing and machine learning. It makes the Fusion platform act more like a Star Trek computer that can provide data analysis and even interpret results.
Whitney Grace, November 7, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
Hackers Having Field Day with Mirai Botnet
November 7, 2016
The massive cyber-attack that crippled major websites across the US on October 21 was executed using an extensive network of infected computers and smart devices. The same botnet is now on sale on the Dark Web, which will enable hackers to launch similar or even larger attacks in the future.
As reported by Cyberscoop in an article titled “You can now buy a Mirai-powered botnet on the dark web”:
A botnet of this size could be used to launch DDoS attacks in addition to automated spam and ransomware campaigns. The price tag was $7,500, payable in bitcoin. The anonymous vendor claimed it could generate a massive 1 terabit per second worth of internet traffic.
The devices in the botnet used in the Dyn attack are all infected with Mirai malware. Though the source code of the malware is freely available on hacker forums, a vendor on the Dark Net is offering a ready-to-use Mirai-powered botnet for $7,500. This enables any hacker to launch a DDoS attack of almost any scale on any network across the globe.
As the article points out:
With the rise of Mirai, experts say the underground DDoS market is shifting as vendors now have the ability to supercharge all of their offerings; giving them an avenue to potentially find new profits and to sell more destructive DDoS cannons.
Though the botnet is for sale at present, prices may soon drop, or the code may even become free, enabling a teenager sitting at home to bring down a major network with a few clicks. Things have already been set in motion; it only remains to be seen when and where the next attack occurs.
Vishal Ingole, November 7, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
DataSift and Its Getting the Most from Facebook Series
November 6, 2016
There’s been some chatter about Facebook’s approach to news. For some researchers, Facebook is a high value source of information and intelligence. If you want to get a sense of what one can do with Facebook, you may find the DataSift series “Getting the Most from Facebook” helpful.
At this time there are six blog posts on this topic; you can locate the articles via the links below. Each write up contains a DataSift commercial:
- Types of social networks
- What data analytics can be used on Facebook data
- Facebook topic data
- Topic data use cases and drawbacks
- Why use filters
- Pylon specific tips, though these apply to other analytics systems as well.
The write ups illustrate why law enforcement and intelligence professionals find some Facebook information helpful. Marketers are probably aware of the utility of Facebook information, but to get optimum results, discipline must be applied to the content Facebookers generate at a remarkable rate.
Stephen E Arnold, November 6, 2016
Model Based Search: Latent Dirichlet Allocation
November 5, 2016
I worked through a presentation by Thomas Levi, a wizard at Unbounce, a landing page company. You can download the presentation at this link, but you will need to log in to access the information. There is also a video and an MP3 available. The idea is that concepts plus tailored procedures in models provide high value outputs. I noted this passage:
utilizing concepts in topic modeling can be used to build a highly effective model to categorize and find similar pages.
I noted the acronym LDA, or Latent Dirichlet Allocation, because that struck me as the core of the method. For those familiar with the original Autonomy Dynamic Reasoning Engine, there will be some similar chords. Unbounce’s approach provides another example of the influence and value of the methods pioneered by Autonomy in the mid 1990s.
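The Unbounce pipeline itself is not reproduced in the write up, but a minimal sketch of the LDA idea, using scikit-learn and an invented toy corpus, shows how topic distributions can be used to categorize pages and find similar ones:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

# Toy documents standing in for landing page text.
pages = [
    "sign up for the webinar on conversion rate optimization",
    "improve landing page conversion with a b testing",
    "recipes for slow cooker chicken dinners",
    "easy weeknight dinner recipes for the family",
]

counts = CountVectorizer(stop_words="english").fit_transform(pages)

# Fit LDA: each page becomes a distribution over latent topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_mix = lda.fit_transform(counts)

# Pages with similar topic mixes are candidates for "similar pages."
print(cosine_similarity(topic_mix).round(2))
```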
Stephen E Arnold, November 5, 2016