CyberOSINT banner

Baidu May Mislead via Search Results

May 10, 2016

Shocker. If the information in “Baidu Found Guilty, Hit with New Restrictions. Will It Go Far Enough?”, the Chinese information access outfit has fiddled its search results. Oh, my. How can search and retrieval companies ignore objectivity in pursuit of other, presumably more lofty, goals?

I learned:

According to state news agency Xinhua, the CAC ruled that a Baidu search result page “did influence the medical choice” of Wei Zexi, a 21-year-old college student who died in April from an ineffective cancer therapy he discovered via a Baidu-promoted link. The company pledged to limit the number of ads to no more than 30% of each search result page in response to the ruling.

I know that this monopoly approach is much loved by MBAs and some financial mavens. However, fiddling search results is an idea which never crossed this addled goose’s mind.

I believed and still do believe that when I run a query on a “free” Web search engine, I am getting rock solid, “take it to the bank” information.

Baidu, I assume, is simply a nail which sticks up and must be pounded down into old fashioned precision and recall.

Stephen E Arnold, May 10, 2016

Update from Lucene

May 10, 2016

It has been awhile since we heard about our old friend Apache Lucene, but the open source search engine has something new, says Open Source Connections in the article, “BM25 The Next Generation Of Lucene Relevance.”  Lucene is added BM25 to its search software and it just might improve search results.

“BM25 improves upon TF*IDF. BM25 stands for “Best Match 25”. Released in 1994, it’s the 25th iteration of tweaking the relevance computation. BM25 has its roots in probabilistic information retrieval. Probabilistic information retrieval is a fascinating field unto itself. Basically, it casts relevance as a probability problem. A relevance score, according to probabilistic information retrieval, ought to reflect the probability a user will consider the result relevant.”

Apache Lucene formerly relied on TF*IDF, a way to rank how users value a text match relevance.  It relied on two factors: term frequency-how often a term appeared in a document and inverse document frequency aka idf-how many documents the term appears and determines how “special” it is.  BM25 improves on the old TF*IDF, because it gives negative scores for terms that have high document frequency.  IDF in BM25 solves this problem by adding a 1 value, therefore making it impossible to deliver a negative value.

BM25 will have a big impact on Solr and Elasticsearch, not only improving search results and accuracy with term frequency saturation.


Whitney Grace, May 10, 2016
Sponsored by, publisher of the CyberOSINT monograph

Wikipedia Relies on Crowdsourcing Once More

May 9, 2016

As a non-profit organization, the Wikimedia Foundation relies on charitable donations to fund many of its projects, including Wikipedia.  It is why every few months, when you are browsing the Wiki pages you will see a donation bar pop to send them money.  Wikimedia uses the funds to keep the online encyclopedia running, but also to start new projects.   Engadget reports that Wikipedia is interested in taking natural language processing and applying it to the Wikipedia search engine, “Wikipedia Is Developing A Crowdsourced Speech Engine.”

Working with Sweden’s KTH Royal Institute of Technology, Wikimedia researchers are building a speech engine to enable people with reading or visual impairments to access the plethora of information housed in the encyclopedia.  In order to fund the speech engine, the researchers turned to crowdsourcing.  It is estimated that twenty-five percent, 125 million monthly users, will benefit from the speech engine.

” ‘Initially, our focus will be on the Swedish language, where we will make use of our own language resources,’ KTH speech technology professor Joakim Gustafson, said in a statement. ‘Then we will do a basic English voice, which we expect to be quite good, given the large amount of open source linguistic resources. And finally, we will do a rudimentary Arabic voice that will be more a proof of concept.’”

Wikimedia wants to have a speech engine in Arabic, English, and Swedish by the end of 2016, then they will focus on the other 280 languages they support with their projects.  Usually, you have to pay to have an accurate and decent natural language processing machine, but if Wikimedia develops a decent speech engine it might not be much longer before speech commands are more commonplace.


Whitney Grace, May 9, 2016
Sponsored by, publisher of the CyberOSINT monograph

How Hackers Hire

May 7, 2016

Ever wonder how hackers fill job openings, search-related or otherwise? A discussion at the forum tehPARADOX.COM considers, “How Hackers Recruit New Talent.” Poster MorningLightMountain cites a recent study by cybersecurity firm Digital Shadows, which reportedly examined around 100 million websites, both on the surface web and on the dark web, for recruiting practices. We learn:

“The researchers found that the process hackers use to recruit new hires mirrors the one most job-seekers are used to. (The interview, for example, isn’t gone—it just might involve some anonymizing technology.) Just like in any other industry, hackers looking for fresh talent start by exploring their network, says Rick Holland, the vice president of strategy at Digital Shadows. ‘Reputation is really, really key,’ Holland says, so a candidate who comes highly recommended from a trusted peer is off to a great start. When hiring criminals, reputation isn’t just about who gets the job done best: There’s an omnipresent danger that the particularly eager candidate on the other end of the line is actually an undercover FBI agent. A few well-placed references can help allay those fears.”

Recruiters, we’re told, frequently advertise on hacker forums. These groups reach many potential recruits and are often password-protected. However, it is pretty easy to trace anyone who logs into one without bothering to anonymize their traffic. Another option is to advertise on the dark web— researchers say they even found a “sort of for cybercrime” there.

The post goes on to discuss job requirements, interviews, and probationary periods. We’re reminded that, no matter how many advanced cybersecurity tools get pushed to market, most attack are pretty basic; they involve approaches like denial-of-service and SQL injection. So, MorningLightMountain advises, any job-seeking hackers should be good to go if they just keep up those skills.


Cynthia Murrell, May 7, 2016

Sponsored by, publisher of the CyberOSINT monograph

A Not-For-Profit Search Engine? That’s So Crazy It Just Might Work

May 4, 2016

The Common Search Project has a simple and straightforward mission statement. They want a nonprofit search engine, an alternative to the companies currently running the Internet (ahem, Google.) They are extremely polite in their venture, but also firmly invested in three qualities for the search engine that they intend to build and run: openness, transparency, and independence. The core values include,

“Radical transparency. Our search results must be explainable and reproducible. All our code is open source and results are generated only using publicly available data. Transparency also extends to our governance, finances and day-to-day operations. Independence. No single person, company or special interest must be able to influence the order of our search results to their benefit. … Public service. We want to build and operate a free service targeted at a large, mainstream audience.”

Common Search currently offers a Demo version for searching homepages only. They are an exciting development compared to the other David’s who have swung at Google’s Goliath. Common Search makes DuckDuckGo, the search engine focused on ensuring user privacy, look downright half-assed. They are calling for, and creating, a real alternative with a completely fresh perspective that isn’t solely about meeting user needs, but insisting on user standards related to privacy, control, and clarity of results.


Chelsea Kerwin, May 4, 2016

Sponsored by, publisher of the CyberOSINT monograph


Do Businesses Have a Collective Intelligence?

May 4, 2016

After working in corporate America for several years, I was amazed by the sheer audacity of its stupidity.  I came to the conclusion that many people in corporate America lack intelligence and are slowly skirting insanity’s edge, so when I read Xconomy’s article, “Brainspace Aims To Harness ‘Collective Intelligence’ Of Businesses” made me giggle.   I digress.  Intelligence really does run rampant in businesses, especially in IT departments the keep modern companies up and running. The digital workspace has created a collective intelligence within a company’s enterprise system and the information is either accessed directly from the file hierarchy or through (the usually quicker) search box.

Keywords within the correct context pertaining to a company are extremely important to semantic search, which is why Brainspace invented a search software that creates a search ontology for individual companies.  Brainspace says that all companies create collective intelligence within their systems and their software takes the digitized “brain” and produces a navigable map that organizes the key items into clusters.

“As the collection of digital data on how we work and live continues to grow, software companies like Brainspace are working on making the data more useful through analytics, artificial intelligence, and machine-learning techniques. For example, in 2014 Google acquired London-based Deep Mind Technologies, while Facebook runs a program called FAIR—Facebook AI Research. IBM Watson’s cognitive computing program has a significant presence in Austin, TX, where a small artificial intelligence cluster is growing.”

Building a search ontology by incorporating artificial intelligence into semantic search is a fantastic idea.  Big data relies on deciphering information housed in the “collective intelligence,” but it can lack human reasoning to understanding context.  An intelligent semantic search engine could do wonders that Google has not even built a startup for yet.


Whitney Grace, May 4, 2016
Sponsored by, publisher of the CyberOSINT monograph

Google Relies on Freebase Machine ID Numbers to Label Images in Knowledge Graph

May 3, 2016

The article on Seo by the Sea titled Image Search and Trends in Google Search Using FreeBase Entity Numbers explains the transformation occurring at Google around Freebase Machine ID numbers. Image searching is a complicated business when it comes to differentiating labels. Instead of text strings, Google’s Knowledge Graph is based in Freebase entities, which are able to uniquely evaluate images- without language. The article explains with a quote from Chuck Rosenberg,

An entity is a way to uniquely identify something in a language-independent way. In English when we encounter the word “jaguar”, it is hard to determine if it represents the animal or the car manufacturer. Entities assign a unique ID to each, removing that ambiguity, in this case “/m/0449p” for the former and “/m/012×34” for the latter.”

Metadata is wonderful stuff, isn’t it? The article concludes by crediting Barbara Starr, a co-administrator of the Lotico San Diego Semantic Web Meetup, with noticing that the Machine ID numbers assigned to Freebase entities now appear in Google Trend’s URLs. Google Trends is a public web facility that enables an exploration of the hive mind by showing what people are currently searching. The Wednesday that President Obama nominated a new Supreme Court Justice, for example, had the top search as Merrick Garland.


Chelsea Kerwin, May 3, 2016

Sponsored by, publisher of the CyberOSINT monograph

An Open Source Search Engine to Experiment With

May 1, 2016

Apache Lucene receives the most headlines when it comes to discussion about open source search software.  My RSS feed pulled up another open source search engine that shows promise in being a decent piece of software.  Open Semantic Search is free software that cane be uses for text mining, analytics, a search engine, data explorer, and other research tools.  It is based on Elasticsearch/Apache Solrs’ open source enterprise search.  It was designed with open standards and with a robust semantic search.

As with any open source search, it can be programmed with numerous features based on the user’s preference.  These include, tagging, annotation, varying file format support, multiple data sources support, data visualization, newsfeeds, automatic text recognition, faceted search, interactive filters, and more.  It has the benefit that it can be programmed for mobile platforms, metadata management, and file system monitoring.

Open Semantic Search is described as

“Research tools for easier searching, analytics, data enrichment & text mining of heterogeneous and large document sets with free software on your own computer or server.”

While its base code is derived from Apache Lucene, it takes the original product and builds something better.  Proprietary software is an expense dubbed a necessary evil if you work in a large company.  If, however, you are a programmer and have the time to develop your own search engine and analytics software, do it.  It could be even turn out better than the proprietary stuff.


Whitney Grace, May 1, 2016
Sponsored by, publisher of the CyberOSINT monograph

Search without Indexing

April 27, 2016

I read “Outsmarting Google Search: Making Fuzzy Search Fast and Easy Without Indexing.”

Here’s a passage I highlighted:

It’s clear the “Google way” of indexing data to enable fuzzy search isn’t always the best way. It’s also clear that limiting the fuzzy search to an edit distance of two won’t give you the answers you need or the most comprehensive view of your data. To get real-time fuzzy searches that return all relevant results you must use a data analytics platform that is not constrained by the underlying sequential processing architectures that make up software parallelism. The key is hardware parallelism, not software parallelism, made possible by the hybrid FPGA/x86 compute engine at the heart of the Ryft ONE.

I also circled:

By combining massively parallel FPGA processing with an x86-powered Linux front-end, 48 TB of storage, a library of algorithmic components and open APIs in a small 1U device, Ryft has created the first easy-to-use appliance to accelerate fuzzy search to match exact search speeds without indexing.

An outfit called InsideBigData published “Ryft Makes Real-time Fuzzy Search a Reality.” Alas, that link is now dead.

Perhaps a real time fuzzy search will reveal the quickly deleted content?

Sounds promising. How does one retrieve information within videos, audio streams, and images? How does one hook together or link a reference to an entity (discovered without controlled term lists) with a phone number?

My hunch is that the methods disclosed in the article have promise, the future of search seems to be lurching toward applications that solve real world, real time problems. Ryft may be heading in that direction in a search climate which presents formidable headwinds.

Stephen E Arnold, April 27, 2016

Duck Duck Go as a Privacy Conscious Google Alternative

April 26, 2016

Those frustrated with Google may have an alternative. Going over to the duck side: A week with Duck Duck Go from Search Engine Watch shares a thorough first-hand account of using Duck Duck Go for a week. User privacy protection seems to be the hallmark of the search service and there is even an option to enable Tor in its mobile app. Features are comparable, such as one designed to compete with Google’s Knowledge Graph called Instant Answers. As an open source product, Instant Answers is built up by community contributions. As far as seamless, intuitive search, the post concludes,

“The question is, am I indignant enough about Google’s knowledge of my browsing habits (and everyone else’s that feed its all-knowing algorithms) to trade the convenience of instantly finding what I’m after for that extra measure of privacy online? My assessment of DuckDuckGo after spending a week in the pond is that it’s a search engine for the long term. To get the most out of using it, you have to make a conscious change in your online habits, rather than just expecting to switch one search engine for another and get the same results.”

Will a majority of users replace “Googling” with “Ducking” anytime soon? Time will tell, and it will be an interesting saga to see unfold. I suppose we could track the evolution on Knowledge Graph and Instant Answers to see the competing narratives unfold.


Megan Feil, April 26, 2016

Sponsored by, publisher of the CyberOSINT monograph


« Previous PageNext Page »