
Enterprise Search: The Valiant Fight On

May 17, 2016

I read “VirtualWorks and Language Tools Announce Merger.” I ran across Language Tools several years ago. The company was working to create components for ElasticSearch’s burgeoning user base. The firm espoused natural language processing as a core technology. NLP is useful, but it imposes some computational burdens on some content processing functions. ElasticSearch works pretty well, and there are a number of companies optimizing, integrating, and creating widgets to make life with ElasticSearch better, faster, and presumably more impressive than the open source system is.

This news release highlights the fact that VirtualWorks and Language Tools have merged. The financial details are not explicit, and it appears that a company founded by a wizard from Citrix will make Language Tools the R&D hub for the Florida-based VirtualWorks operation.

According to the story:

The combined organization brings together best of breed core technologies in the areas of enterprise search, data management, text analytics, discovery techniques and analytics to enable the development of new and exciting next generation applications in the business intelligence space.

VirtualWorks is or was a SharePoint centric solution. Like other search vendors, the company uses connectors to suck data into a central indexing point. Users then search the content and have access to the content without having to query separate systems.

This idea has fueled enterprise search since the days of Verity, Autonomy, Fast Search, Convera, et al. The real money today seems to be in the consulting and engineering services required to make enterprise search useful.

SharePoint is certainly widely used, and it is fraught with interesting challenges. Will the lash up of these two firms generate the type of revenue once associated with Autonomy and Fast Search & Transfer?

My hunch is that enterprise search continues to be a tough market. There are functional solutions to locating information available as open source or at comparatively modest license fees. I am thinking of dtSearch and Maxxcat. Both of these work well within Microsoft centric environments.

Stephen E Arnold, May 17, 2016

Google Moonshot Targets Disease Management, but Might Face Obstacle with Google Management Methods

May 17, 2016

The article on STAT titled “Google’s Bold Bid to Transform Medicine Hits Turbulence Under a Divisive CEO” explores Google management methods for one of its “moonshot” projects. Namely, the massive company has directed its considerable resources toward overhauling medicine. Verily Life Sciences is the three-year-old startup with a mysterious mission and a controversial leader in Andrew Conrad. So far, roughly a dozen Verily players have abandoned the project.

“But “if they are getting off the roller coaster before it gets to the first dip,” something looks seriously wrong, said Rob Enderle, a technology analyst who has tracked Google since its inception. Those who depart well-financed startups usually forsake potential financial windfalls down the line, which further suggests that the people leaving Verily “are losing confidence in the leadership,” he said. No similar brain drain has occurred at Calico, another ambitious Google spinoff, which is focused on increasing the human lifespan.”

Given the scope of the Verily project, which Google co-founder Sergey Brin said he hoped would significantly change the way we identify, avoid, and handle illness, perhaps Conrad is cracking under the stress. He has maintained complete radio silence, and rumors abound that his employees operate under threat of termination for speaking to a reporter.

Chelsea Kerwin, May 17, 2016

Sponsored by, publisher of the CyberOSINT monograph

Extensive Cultural Resources Available at Europeana Collections

May 17, 2016

Check out this valuable cultural archive, highlighted by Open Culture in the piece, “Discover Europeana Collections, a Portal of 48 Million Free Artworks, Books, Videos, Artifacts & Sounds from across Europe.” Writer Josh Jones is clearly excited about the Internet’s ability to place information and artifacts at our fingertips, and he cites the Europeana Collections as the most extensive archive he’s discovered yet. He tells us the works are:

“… sourced from well over 100 institutions such as The European Library, Europhoto, the National Library of Finland, University College Dublin, Museo Galileo, and many, many more, including contributions from the public at large. Where does one begin?

“In such an enormous warehouse of cultural history, one could begin anywhere and in an instant come across something of interest, such as the stunning collection of Art Nouveau posters like that fine example at the top, ‘Cercle Artistique de Schaerbeek,’ by Henri Privat-Livemont (from the Plandiura Collection, courtesy of Museu Nacional d’Art de Catalunya, Barcelona). One might enter any one of the available interactive lessons and courses on the history of World War I or visit some of the many exhibits on the period, with letters, diaries, photographs, films, official documents, and war propaganda. One might stop by the virtual exhibit, ‘Photography on a Silver Plate,’ a fascinating history of the medium from 1839-1860, or ‘Recording and Playing Machines,’ a history of exactly what it sounds like, or a gallery of the work of Swiss painter Jean Antoine Linck. All of the artifacts have source and licensing information clearly indicated.”

Jones mentions the archive might be considered “endless,” since content is being added faster than anyone could hope to keep up with.  While such a wealth of information and images could easily overwhelm a visitor, he advises us to look at it as an opportunity for discovery. We concur.


Cynthia Murrell, May 17, 2016


Excite and Ask: Where Are They Now?

May 14, 2016

I learned a factoid from “Yahoo Stock: Analyzing 5 Key Suppliers.” Here’s the passage with the items I noted in bold face:

Excite Japan Co., Ltd. was established in 1997 as a joint venture with Excite, Inc., which is wholly owned by IAC/InterActiveCorp. At the time, Excite, Inc., which is known in 2016 as, was among the largest and most popular Web portals offering personalized home pages for searching content. In 2015, Excite Japan generated 9.91% of its revenues from Yahoo through a revenue-sharing agreement for ad-clicks going through Yahoo’s search engine. In 2015, the company had revenue of $66.47 million in U.S. dollars and a market capitalization of $3.77 billion.

Interesting about Excite. About Yahoo? Not so much.

Stephen E Arnold, May 14, 2016

Searching the Panama Papers

May 11, 2016

Curious about the money laundering information improperly obtained from a law firm in Panama? You can search for the names of people whom you know by navigating to this link:

I ran a number of queries. The system works okay, but considerable effort is required to wrangle on-point results.

Sad to say, none of the people and outfits I queried seemed to be high fliers. To make sense out of the data, one would need the corpus, some normalization, and an industrial strength tool or two.

Stephen E Arnold, May 11, 2016

The Office of Personnel Management Hack Is Very Bad

May 11, 2016

Hackers had access to the US Office of Personnel Management (OPM) for more than a year before the breach was discovered in April 2015. The personal information of 21 million current and former government employees was stolen, including their Social Security numbers and home addresses. The hack may not seem that important unless you were or are a government employee, but the Lawfare Blog explains otherwise in “Why The OPM Hack Is Far Worse Than You Imagine.”

The security breach is much worse than simple identity theft, because background checks were stolen as well. It might seem that a background check is not that serious (so the hackers discovered a person got a speeding ticket?), but in reality these background checks were far more extensive than usual because they were used to vet people for access to restricted government areas. The security clearance files included information about family, sexual behavior, and risk of foreign exploitation. If that was not bad enough,

“Along with the aforementioned databases, the OPM systems are linked electronically to other agencies and databases, and it stored much of this data alongside the security clearance files. According to a 2007 White House report on OPM security clearance performance, checks of State Passport records and searches of military service records are now conducted electronically. According to this report, then, there are electronic linkages between the OPM Security Clearance files, Department of Defense service records, and State Department Passport records.”

OPM took measures to improve future security, but those measures either expose who the victims of the breach are or would allow private contractors access to sensitive data in the name of mitigating future attacks. OPM is not willing to acknowledge these deficiencies, and would rather continue to expose the victims (and future victims) to further danger.


Whitney Grace, May 11, 2016

Baidu May Mislead via Search Results

May 10, 2016

Shocker. If the information in “Baidu Found Guilty, Hit with New Restrictions. Will It Go Far Enough?” is accurate, the Chinese information access outfit has fiddled its search results. Oh, my. How can search and retrieval companies ignore objectivity in pursuit of other, presumably more lofty, goals?

I learned:

According to state news agency Xinhua, the CAC ruled that a Baidu search result page “did influence the medical choice” of Wei Zexi, a 21-year-old college student who died in April from an ineffective cancer therapy he discovered via a Baidu-promoted link. The company pledged to limit the number of ads to no more than 30% of each search result page in response to the ruling.

I know that this monopoly approach is much loved by MBAs and some financial mavens. However, fiddling search results is an idea which never crossed this addled goose’s mind.

I believed and still do believe that when I run a query on a “free” Web search engine, I am getting rock solid, “take it to the bank” information.

Baidu, I assume, is simply a nail which sticks up and must be pounded down into old fashioned precision and recall.

Stephen E Arnold, May 10, 2016

Update from Lucene

May 10, 2016

It has been a while since we heard about our old friend Apache Lucene, but the open source search engine has something new, says Open Source Connections in the article, “BM25 The Next Generation Of Lucene Relevance.” Lucene has added BM25 to its search software, and it just might improve search results.

“BM25 improves upon TF*IDF. BM25 stands for “Best Match 25”. Released in 1994, it’s the 25th iteration of tweaking the relevance computation. BM25 has its roots in probabilistic information retrieval. Probabilistic information retrieval is a fascinating field unto itself. Basically, it casts relevance as a probability problem. A relevance score, according to probabilistic information retrieval, ought to reflect the probability a user will consider the result relevant.”

Apache Lucene formerly relied on TF*IDF, a way to score how relevant a text match is. It relied on two factors: term frequency (how often a term appears in a document) and inverse document frequency, aka IDF (how many documents contain the term, which determines how “special” the term is). BM25 improves on the old TF*IDF. One wrinkle: the classic BM25 IDF can give negative scores for terms that have very high document frequency. The IDF in Lucene’s BM25 solves this problem by adding 1 before taking the logarithm, which makes a negative value impossible.
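
For the curious, here is a minimal Python sketch of the IDF point. It is an illustration only, not Lucene's code: the function names and the sample numbers are made up for this example, and the formulas are the commonly published classic BM25 IDF and the "add 1 inside the log" variant.

```python
import math


def classic_bm25_idf(doc_freq: int, doc_count: int) -> float:
    """Classic BM25 IDF. Goes negative once a term appears in more than
    half of the documents, because the ratio inside the log drops below 1."""
    return math.log((doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def plus_one_bm25_idf(doc_freq: int, doc_count: int) -> float:
    """Variant that adds 1 before taking the log. The argument of the log
    is always greater than 1, so the result can never be negative."""
    return math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


if __name__ == "__main__":
    # Hypothetical corpus of 1,000 documents.
    for doc_freq in (5, 500, 900):
        print(
            doc_freq,
            round(classic_bm25_idf(doc_freq, 1000), 3),
            round(plus_one_bm25_idf(doc_freq, 1000), 3),
        )
```

A term found in 900 of 1,000 documents gets a negative weight from the classic formula and a small positive weight from the variant, while rare terms still score highest under both.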

BM25 will have a big impact on Solr and Elasticsearch, improving search results and accuracy, in part through term frequency saturation.
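
Term frequency saturation is easier to see with numbers. The sketch below is an assumption-laden illustration, not Lucene's implementation: the function names are invented, k1=1.2 and b=0.75 are the commonly cited default parameters, and the document lengths are placeholders. It compares the BM25 term frequency component with the square root of term frequency used by classic TF*IDF scoring.

```python
import math


def bm25_tf(term_freq: float, k1: float = 1.2, b: float = 0.75,
            doc_len: float = 100.0, avg_doc_len: float = 100.0) -> float:
    """BM25 term frequency component. It rises quickly for the first few
    occurrences and then flattens, never exceeding k1 + 1."""
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return term_freq * (k1 + 1.0) / (term_freq + length_norm)


def classic_tf(term_freq: float) -> float:
    """Square-root term frequency, which keeps growing without a ceiling."""
    return math.sqrt(term_freq)


if __name__ == "__main__":
    for tf in (1, 2, 5, 20, 100):
        print(tf, round(bm25_tf(tf), 3), round(classic_tf(tf), 3))
```

With these assumed parameters, the BM25 component can never pass 2.2 no matter how often a term is repeated, so stuffing a document with a keyword stops paying off after a handful of mentions.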


Whitney Grace, May 10, 2016

Wikipedia Relies on Crowdsourcing Once More

May 9, 2016

As a non-profit organization, the Wikimedia Foundation relies on charitable donations to fund many of its projects, including Wikipedia. That is why, every few months, a donation bar pops up while you are browsing the Wiki pages and asks you to send money. Wikimedia uses the funds to keep the online encyclopedia running, but also to start new projects. Engadget reports in “Wikipedia Is Developing A Crowdsourced Speech Engine” that Wikipedia is interested in taking natural language processing and applying it to the Wikipedia search engine.

Working with Sweden’s KTH Royal Institute of Technology, Wikimedia researchers are building a speech engine to enable people with reading or visual impairments to access the plethora of information housed in the encyclopedia. In order to fund the speech engine, the researchers turned to crowdsourcing. It is estimated that twenty-five percent of Wikipedia’s users, about 125 million people per month, will benefit from the speech engine.

” ‘Initially, our focus will be on the Swedish language, where we will make use of our own language resources,’ KTH speech technology professor Joakim Gustafson, said in a statement. ‘Then we will do a basic English voice, which we expect to be quite good, given the large amount of open source linguistic resources. And finally, we will do a rudimentary Arabic voice that will be more a proof of concept.’”

Wikimedia wants to have a speech engine in Arabic, English, and Swedish by the end of 2016; then it will focus on the other 280 languages it supports with its projects. Usually, you have to pay for an accurate and decent natural language processing engine, but if Wikimedia develops a decent speech engine, it might not be much longer before speech commands are more commonplace.


Whitney Grace, May 9, 2016

How Hackers Hire

May 7, 2016

Ever wonder how hackers fill job openings, search-related or otherwise? A discussion at the forum tehPARADOX.COM considers, “How Hackers Recruit New Talent.” Poster MorningLightMountain cites a recent study by cybersecurity firm Digital Shadows, which reportedly examined around 100 million websites, both on the surface web and on the dark web, for recruiting practices. We learn:

“The researchers found that the process hackers use to recruit new hires mirrors the one most job-seekers are used to. (The interview, for example, isn’t gone—it just might involve some anonymizing technology.) Just like in any other industry, hackers looking for fresh talent start by exploring their network, says Rick Holland, the vice president of strategy at Digital Shadows. ‘Reputation is really, really key,’ Holland says, so a candidate who comes highly recommended from a trusted peer is off to a great start. When hiring criminals, reputation isn’t just about who gets the job done best: There’s an omnipresent danger that the particularly eager candidate on the other end of the line is actually an undercover FBI agent. A few well-placed references can help allay those fears.”

Recruiters, we’re told, frequently advertise on hacker forums. These groups reach many potential recruits and are often password-protected. However, it is pretty easy to trace anyone who logs into one without bothering to anonymize their traffic. Another option is to advertise on the dark web— researchers say they even found a “sort of for cybercrime” there.

The post goes on to discuss job requirements, interviews, and probationary periods. We’re reminded that, no matter how many advanced cybersecurity tools get pushed to market, most attacks are pretty basic; they involve approaches like denial-of-service and SQL injection. So, MorningLightMountain advises, any job-seeking hackers should be good to go if they just keep up those skills.


Cynthia Murrell, May 7, 2016

