Honkin’ News: Beyond Search Video News Program Available Now
August 2, 2016
Honkin’ News is now online via YouTube at https://youtu.be/hf93zTSixgo. The weekly program tries to separate the giblets from the goose feathers in online search and content processing. Each program draws upon articles and opinion appearing in the Beyond Search blog.
The Beyond Search program is presented by Stephen E Arnold, who resides in rural Kentucky. The five-minute program highlights stories appearing in the daily Beyond Search blog and includes observations not appearing in the printed versions of the stories. No registration is required to view the free video.
Arnold told Beyond Search:
Online search and content processing generate modest excitement. Honkin’ News comments on some of the more interesting and unusual aspects of information retrieval, natural language processing, and the activities of those working to make software understand digital content. The inaugural program highlights Verizon’s Yahoo AOL integration strategy, explores why search fails, and how manufacturing binders and fishing lures might boost an open source information access strategy.
The video is created using high tech found in the hollows of rural Kentucky; for example, eight millimeter black-and-white film and two coal-fired computing devices. One surprising aspect of the video is the view of the vista outside the window of the Beyond Search facility. The pond filled with mine drainage is not visible, however.
Kenny Toth, August 2, 2016
More Data to Fuel Debate About Malice on Tor
June 9, 2016
The debate about malicious content on Tor continues. Ars Technica published an article continuing the conversation about Tor and the claim made by a web security company that 94 percent of the requests coming through the network are at least loosely malicious. The article “CloudFlare: 94 Percent of the Tor Traffic We See Is ‘Per Se Malicious’” reveals how CloudFlare is currently handling Tor traffic. The article states,
“Starting last month, CloudFlare began treating Tor users as their own “country” and now gives its customers four options of how to handle traffic coming from Tor. They can whitelist them, test Tor users using CAPTCHA or a JavaScript challenge, or blacklist Tor traffic. The blacklist option is only available for enterprise customers. As more websites react to the massive amount of harmful Web traffic coming through Tor, the challenge of balancing security with the needs of legitimate anonymous users will grow. The same network being used so effectively by those seeking to avoid censorship or repression has become a favorite of fraudsters and spammers.”
Even though the jury may still be out on the reported statistics about the volume of malicious traffic, several companies appear to want action sooner rather than later. Amazon Web Services, Best Buy, and Macy’s are among the sites blocking a majority of Tor exit nodes. While much remains unclear, we cannot expect organizations to delay action.
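CloudFlare’s four options boil down to a routing decision made per request. Here is a minimal Python sketch of that idea, assuming the Tor Project’s published bulk exit list as the test for Tor traffic; the policy names and sample IP are illustrative assumptions, and this is not CloudFlare’s actual implementation.

```python
import urllib.request

# The Tor Project publishes a plain-text list of exit node IP addresses.
EXIT_LIST_URL = "https://check.torproject.org/torbulkexitlist"

def load_tor_exit_nodes(url=EXIT_LIST_URL):
    """Fetch the current set of Tor exit node IPs."""
    with urllib.request.urlopen(url, timeout=10) as response:
        lines = response.read().decode().splitlines()
    return {line.strip() for line in lines if line.strip()}

def handle_request(client_ip, exit_nodes, policy="captcha"):
    """Apply one of the four described policies to a request.

    policy is one of: "whitelist", "captcha", "js_challenge", "blacklist"
    (hypothetical names for the options the article lists).
    """
    if client_ip not in exit_nodes:
        return "allow"  # not Tor traffic; handle normally
    if policy == "whitelist":
        return "allow"
    if policy in ("captcha", "js_challenge"):
        return "challenge:" + policy  # present a challenge before serving content
    return "block"  # blacklist; per the article, an enterprise-only option

if __name__ == "__main__":
    nodes = load_tor_exit_nodes()
    print(handle_request("203.0.113.7", nodes))  # sample IP for illustration
```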
Megan Feil, June 9, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
Wikipedia Relies on Crowdsourcing Once More
May 9, 2016
As a non-profit organization, the Wikimedia Foundation relies on charitable donations to fund many of its projects, including Wikipedia. That is why, every few months when you are browsing the Wiki pages, a donation banner pops up asking you to send money. Wikimedia uses the funds to keep the online encyclopedia running, but also to start new projects. Engadget reports that Wikipedia is interested in taking natural language processing and applying it to a speech engine for the encyclopedia: “Wikipedia Is Developing A Crowdsourced Speech Engine.”
Working with Sweden’s KTH Royal Institute of Technology, Wikimedia researchers are building a speech engine to enable people with reading or visual impairments to access the plethora of information housed in the encyclopedia. To fund the speech engine, the researchers turned to crowdsourcing. An estimated twenty-five percent of users, some 125 million people a month, will benefit from the speech engine.
“Initially, our focus will be on the Swedish language, where we will make use of our own language resources,” KTH speech technology professor Joakim Gustafson said in a statement. “Then we will do a basic English voice, which we expect to be quite good, given the large amount of open source linguistic resources. And finally, we will do a rudimentary Arabic voice that will be more a proof of concept.”
Wikimedia wants to have a speech engine in Arabic, English, and Swedish by the end of 2016; then it will focus on the other 280 languages its projects support. Usually you have to pay for an accurate natural language processing system, but if Wikimedia develops a decent speech engine, it might not be much longer before speech interfaces are commonplace.
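As a rough illustration of what such a pipeline involves, the sketch below fetches an article introduction from the public MediaWiki extracts API and reads it aloud with the pyttsx3 library. Both pieces are stand-ins chosen for this example; the article does not describe the KTH project’s actual stack.

```python
import json
import urllib.parse
import urllib.request

import pyttsx3  # off-the-shelf offline TTS, a stand-in for the planned engine

API = "https://en.wikipedia.org/w/api.php"

def fetch_intro(title):
    """Fetch the plain-text introduction of a Wikipedia article."""
    params = urllib.parse.urlencode({
        "action": "query", "format": "json", "prop": "extracts",
        "explaintext": 1, "exintro": 1, "titles": title,
    })
    with urllib.request.urlopen(API + "?" + params, timeout=10) as response:
        pages = json.load(response)["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")

def speak(text):
    """Read the text aloud with the locally installed system voice."""
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()

if __name__ == "__main__":
    speak(fetch_intro("Stockholm"))
```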
Whitney Grace, May 9, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
Paywalls Block Pleasure Reading
April 4, 2016
Have you noticed something new in the past few months on news Web sites? You click on an interesting article and are halfway through reading it when a pop-up banner blocks the screen. The only way to continue reading is to enter your email address, find the elusive X icon, or purchase a subscription. Ghacks.net tells us to expect more of these in “Read Articles Behind Paywalls By Masquerading As Googlebot.”
Big news sites such as the Financial Times, The New York Times, The Washington Post, and The Wall Street Journal are now experimenting with paywalls to work around users’ ad blockers. The downside is that content will be locked up and the sites might lose viewers, but that may be a risk they are willing to take to earn a bigger profit.
There used to be some tricks to get around paywalls:
“It is no secret that news sites allow access to news aggregators and search engines. If you check Google News or Search for instance, you will find articles from sites with paywalls listed there. In the past, news sites allowed access to visitors coming from major news aggregators such as Reddit, Digg or Slashdot, but that practice seems to be as good as dead nowadays. Another trick, to paste the article title into a search engine to read the cached story on it directly, does not seem to work properly anymore as well as articles on sites with paywalls are not usually cached anymore.”
The best way, the article says, is to make the Web site think you are Googlebot. Web sites let Googlebot roam freely so that their pages appear higher in search engine results. There are a few ways to trick a Web site into thinking you are Googlebot, depending on whether your browser is Firefox or Chrome. Check them out, but it will not be long before those tricks become old-fashioned too.
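The core of the trick is nothing more than presenting Googlebot’s published user agent string instead of your browser’s. A minimal Python sketch of the same idea follows; note that many sites verify Googlebot with a reverse DNS check of the requesting IP, so the header alone often fails against a careful paywall.

```python
import urllib.request

# Googlebot's published desktop user agent string.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

def fetch_as_googlebot(url):
    """Request a page while claiming to be Googlebot via the User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": GOOGLEBOT_UA})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")

if __name__ == "__main__":
    # example.com is a placeholder; a real paywalled URL may still refuse service
    print(fetch_as_googlebot("https://example.com")[:200])
```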
Whitney Grace, April 4, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
Wikipedia Grants Users Better Search
March 24, 2016
Wikipedia is the de facto encyclopedia for sorting fact from fiction, although academic circles shun its use (scholars do use it, but never cite it). Wikipedia does not usually make the news unless the story is tied to its fundraising campaign or Wikileaks releases sensitive information meant to remain confidential. The Register tells us that Wikipedia has made the news for another reason: “Reluctant Wikipedia Lifts Lid On $2.5m Internet Search Engine Project.” Wikipedia is better associated with the cataloging and dissemination of knowledge, but in order to use that knowledge, one must be able to search it.
Perhaps that is why the Wikimedia Foundation is “doing a Google” and will invest a Knight Foundation grant in a search-related project. The Wikimedia Foundation finally released information about the grant, which is dedicated to providing funds for organizations pursuing innovative solutions related to information, community, media, and engagement.
“The grant provides seed money for stage one of the Knowledge Engine, described as ‘a system for discovering reliable and trustworthy information on the Internet’. It’s all about search and federation. The discovery stage includes an exploration of prototypes of future versions of Wikipedia.org which are ‘open channels’ rather than an encyclopedia, analysing the query-to-content path, and embedding the Wikipedia Knowledge Engine ‘via carriers and Original Equipment Manufacturers’.”
The discovery stage will last twelve months, ending in August 2016. The biggest risk for the search project would be if Google or Yahoo decided to invest in something similar.
What is interesting is that Wikipedia co-founder Jimmy Wales denied that the Wikimedia Foundation was working on a search engine via the Knowledge Engine. Andreas Kolbe has since reported in a Wikipedia Signpost article that a search engine is indeed being built. Readers were led to believe it would merely find information spread across the Wikipedia portals; rather, it is something much more powerful.
Here is what the actual grant is funding:
“To advance new models for finding information by supporting stage one development of the Knowledge Engine by Wikipedia, a system for discovering reliable and trustworthy public information on the Internet.”
It sounds like a search engine that provides true and verifiable search results, which is what academic scholars have been after for years! Wow! Wikipedia might actually be worth a citation now.
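For context, Wikipedia already exposes full-text search through the MediaWiki action API, so the Knowledge Engine would presumably have to improve on results like the ones this minimal sketch returns. The query string is an arbitrary example.

```python
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

def wiki_search(query, limit=5):
    """Run a full-text search against Wikipedia's existing search API."""
    params = urllib.parse.urlencode({
        "action": "query", "format": "json", "list": "search",
        "srsearch": query, "srlimit": limit,
    })
    with urllib.request.urlopen(API + "?" + params, timeout=10) as response:
        hits = json.load(response)["query"]["search"]
    return [(hit["title"], hit["wordcount"]) for hit in hits]

if __name__ == "__main__":
    for title, words in wiki_search("knowledge engine"):
        print(title, "-", words, "words")
```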
Whitney Grace, March 24, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
Interview with Stephen E Arnold Reveals Insights about Content Processing
March 22, 2016
Nikola Danaylov of the Singularity Weblog interviewed technology and financial analyst Stephen E. Arnold on the latest episode of his podcast, Singularity 1 on 1. The interview, Stephen E. Arnold on Search Engines and Intelligence Gathering, offers thought-provoking ideas on topics important to the intelligence, enterprise search, and financial sectors, which use indexing and content processing methods Arnold has worked with for over 50 years.
Arnold attributes the origins of his interest in technology to a programming challenge he sought and accepted from a computer science professor, outside the realm of his college major, English. His focus on creating actionable software and his affinity for problem solving of any kind led him to leave PhD work for a job with Halliburton Nuclear. His career includes employment at Booz, Allen & Hamilton, the Courier-Journal & Louisville Times, and Ziff Communications before he started the ArnoldIT.com strategic information services business in 1991. He co-founded a search system that was sold to Lycos, Inc., worked with numerous organizations, including several intelligence and enforcement bodies such as the US Senate Police and the General Services Administration, and authored seven books and monographs on search-related topics.
With a continued emphasis on search technologies, Arnold began his blog, Beyond Search, in 2008 aiming to provide an independent source of “information about what I think are problems or misstatements related to online search and content processing.” Speaking to the relevance of the blog to his current interest in the intelligence sector of search, he asserts:
“Finding information is the core of the intelligence process. It’s absolutely essential to understand answering questions on point and so someone can do the job and that’s been the theme of Beyond Search.”
As Danaylov notes, the concept of search encompasses several areas where information discovery is key for one audience or another, whether counter-terrorism, commercial, or other purposes. Arnold agrees,
“It’s exactly the same as what the professor wanted to do in 1962. He had a collection of Latin sermons. The only way to find anything was to look at sermons on microfilm. Whether it is cell phone intercepts, geospatial data, processing YouTube videos uploaded from a specific IP address: exactly the same problem and process. The difficulty that exists is that today we need to process data in a range of file types and at much higher speeds than ever anticipated, but the processes remain the same.”
Arnold explains the iterative nature of his work:
“The proof of the value of the legacy is I don’t really do anything new, I just keep following these themes. The Dark Web Notebook is very logical. This is a new content domain. And if you’re an intelligence or information professional, you want to know, how do you make headway in that space.”
Describing his most recent book, Dark Web Notebook, Arnold calls it “a cookbook for an investigator to access information on the Dark Web.” This monograph includes profiles of little-known firms which perform high-value Dark Web indexing, and it follows a book he authored in 2015, CyberOSINT: Next Generation Information Access.
Palantir Gets a Rah Rah from Bloomberg
March 5, 2016
I posted the unicorn flier in “Palantir: A Dying Unicorn or a Mad, Mad Sign?” I read “Palantir Staff Shouldn’t Believe the Unicorn Flyers.” I assume that the alleged fliers did exist in the Shire and were not figments of a Tolkienesque imagination. (I wonder if JRR’s classes were anchored in reality.)
The write up states:
For now, Palantir people can rest easy in the Shire, a.k.a. downtown Palo Alto, Calif. The company, which was named after the “seeing stones” from the Lord of the Rings, is not at risk of an evil wizard with preferred shares coming to vaporize workers’ share value.
The write up contains a hefty dose of jargon; for example:
During the fourth quarter of 2015, 42 percent of deals had such provisions, compared with 15 percent in the previous two quarters. Investors were also given the right to block an initial public offering that didn’t meet their valuation threshold in 33 percent of deals in the fourth quarter, compared with 20 percent in the second quarter, the study said. Palantir had neither provision.
Okay.
The only hitch in the git-along is that Morgan Stanley has cut the value of its stake in Palantir.
Worth watching even if one is not an employee hoping that the value of this particular unicorn is going to morph into a Pegasus.
Stephen E Arnold, March 5, 2016
Are Search Unicorns Sub Prime Unicorns?
January 4, 2016
The question is a baffler. Navigate to “Sorting Truth from Myth at Technology Unicorns.” If the link is bad or you have to pay to read the article in the Financial Times, pony up, go to the library, or buy hard copy. Don’t complain to me, gentle reader. Publishers are in need of revenue. Now the write up:
The assumption is that a unicorn exists. What actually exist are firms with massive amounts of venture funding and billion dollar valuations. I know the money is or was real, but the “sub prime unicorn” is a confection from money thought leader Michael Moritz. A subprime unicorn is a company “built on the flimsiest of edifices.” Does this mean fairy dust or something more substantial?
According to the write up:
But the way in which private market valuations have become skewed and inflated as start-ups have delayed IPOs raises questions about the financing of innovation. Despite the excitement, venture capital has produced weak returns in recent decades — only a minority of funds have produced rewards high enough to compensate investors for illiquidity and opacity.
Why would a venture-funded start up perform better than a start up financed by mom, dad, and one’s slightly addled, but friendly, great aunt?
The article then makes a reasonably sane point:
With the rise in US interest rates, the era of ultra-cheap financing is ending. As it does, Silicon Valley’s unicorns are losing their mystique and having to work to raise equity, sometimes at valuations below those they achieved before. The promise of private financing is being tested, and there will be disappointments. It does not pay to be dazzled by mythical beasts.
Let’s think a moment about search and content processing. The mid tier consulting firms—the outfits I call azure chip outfits—have generated some pretty crazy estimates about the market size for search and content processing solutions.
The reality is at odds with these speculative, marketing fueled prognostications. Yep, I would include the wizards at IDC who wanted $3,500 to sell an eight page document with my name on it without my permission. Refresh yourself on the IDC Schubmehl maneuver at this link.
Based on my research, two enterprise search outfits broke $150 million in revenues prior to 2011: Endeca tallied an estimated $150 million in revenues and Autonomy reported $700 million in revenues. Both outfits were sold.
Since 2012, exactly zero enterprise search firms have generated more than $700 million in revenues. Yet the wild and crazy funding of search vendors has continued apace. There are a number of search and retrieval companies and some next generation content processing outfits which have ingested tens of millions of dollars.
How many of these outfits have gone public in the zero cost money environment? Based on my records, zero. Why haven’t Attivio, BA Insight, Coveo, Palantir and others cashed in on their technology, surging revenues, and market demand?
There are three reasons:
- The revenues are simply acceptable, not stunning. In the post Fast Search & Transfer era, twiddling the finances carries considerable risks. Think about a guilty verdict for a search wizard. Yep, bad.
- The technology is a rehash gilded with new jargon. Take a look at the search and content processing systems, and you find the same methods and functions that have been known and in use for more than 30 years. The flashy interfaces are new, but the plumbing still delivers precision and recall that have hit a glass ceiling at 80 to 90 percent accuracy for the top performing systems (see the sketch after this list). Looking for a recipe with good enough relevance is acceptable. Looking for a bad actor with a significant margin for error is not so good.
- The smart software performs certain functions at a level comparable to a subject matter expert when certain criteria are met. The notion of human editors riding herd on entity and synonym dictionaries is not one that makes customers weep with joy. Smart software helps with some functions, but today’s systems remain anchored in human operators, and the work these folks have to perform to keep the systems in tip top shape is expensive. Think about this human aspect in terms of how Palantir explains architects’ changes to type operators, or the role of content intake specialists using revisioning and similar field operations.
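The precision and recall ceiling mentioned in the list is easy to make concrete. This minimal sketch computes both measures for a single query using made-up document IDs.

```python
def precision_recall(retrieved, relevant):
    """Precision: share of retrieved documents that are relevant.
    Recall: share of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

# Made-up document IDs for illustration.
retrieved = ["d1", "d2", "d3", "d4", "d5"]   # what the engine returned
relevant = ["d1", "d3", "d5", "d7"]          # what actually answers the query

p, r = precision_recall(retrieved, relevant)
print("precision=%.2f recall=%.2f" % (p, r))  # precision=0.60 recall=0.75
```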
Why do I make this point in the context of unicorns? Search has one or two unicorns. I would suggest Palantir is a unicorn. When I think of Palantir, I consider this item:
To summarize, only a small number of companies reach the IPO stage.
Also, the HP Autonomy “deal” is a quasi unicorn. IBM’s investment in Watson is a potential unicorn if and when IBM releases financial data about its TV show champion.
Then there are a number of search and content processing creatures which could be hybrids of a horse and a donkey. The investors are breeders who hope that the offspring become champions. Long shots all.
The Financial Times article expresses a broad concept. The activities of the search and content processing vendors in the next 12 to 18 months will provide useful data about the genetic makeup of some technology lab creations.
Stephen E Arnold, January 4, 2016