More Data Truths: Painful Stuff
July 4, 2016
I read “Don’t Let Your Data Lake Turn into a Data Swamp.” Nice idea, but there may be a problem which resists some folks’ best efforts to convert that dicey digital real estate into a tidy subdivision. Swamps are wetlands. As water levels change, the swamps come and go, ebb and flow as it were. More annoying is the fact that swamps are not homogeneous. Fens, muskegs, and bogs add variety for the happy hiker who strays into the Vasyugan Swamp as the spring thaw progresses.
The notion of a data swamp is an interesting one. I am not certain how zeros and ones in a storage medium relate to the Okavango Delta, but let’s give this metaphor a go. The write up reveals:
Data does not move easily. This truth has plagued the world of Big Data for some time and will continue to do so. In the end, the laws of physics dictate a speed limit, no matter what else is done. However, somewhere between data at rest and the speed of light, there are many processes that must be performed to make data mobile and useful. Integrating data and managing a data pipeline are two of these necessary tasks.
Okay, no swamp thing here.
The write up shifts gears and introduces the “data pipeline” and the concept of “keeping the data lake clean.”
Let’s step back. What seems to be the motive force for this item about information in digital form has several gears:
- Large volumes of data are a mess. Okay, but not all swamps are messes. The real problem is that whoever stored data did it without figuring out what to do with the information. Collection is not application.
- The notion of a data pipeline implies movement of information from Point A to Point B or through a series of processes which convert Input A into Output B. Data pipelines are easy to talk about, but in my experience they require knowing what one wants to achieve and then constructing a system to deliver. Talking about a data pipeline is not a data pipeline in my wetland.
- The concept of pollution seems to suggest that dirty data are bad. Making certain data are accurate and normalized requires effort; a sketch of what that effort looks like follows this list.
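Here is a minimal, hypothetical sketch in Python of what a pipeline with a normalization step looks like. The record fields, the date rule, and the filter are invented for illustration; they are not from the write up:

```python
import re

def normalize_date(record):
    # Coerce a "07/04/2016" style date to ISO 8601 "2016-07-04".
    m = re.match(r"(\d{2})/(\d{2})/(\d{4})$", record.get("date", ""))
    if m:
        month, day, year = m.groups()
        record["date"] = f"{year}-{month}-{day}"
    return record

def drop_incomplete(record):
    # Discard records missing required fields; returning None drops them.
    return record if record.get("id") else None

def pipeline(records, steps):
    # Push each record through every step; a None result drops the record.
    for record in records:
        for step in steps:
            record = step(record)
            if record is None:
                break
        else:
            yield record

raw = [{"id": "a1", "date": "07/04/2016"}, {"id": None, "date": "garbage"}]
print(list(pipeline(raw, [normalize_date, drop_incomplete])))
# -> [{'id': 'a1', 'date': '2016-07-04'}]
```

The point of the toy: each step has to be designed before the data arrives, which is exactly the planning step the swamp owners skipped.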
My view is that this write up is trying to communicate the fact that Big Data is not too helpful if one does not take care of the planning before clogging a storage subsystem with digital information.
Seems obvious, but I suppose that’s why we have Love Canals and an ever-efficient Environmental Protection Agency to clean up shortcuts.
Stephen E Arnold, July 4, 2016
Google Plus Now Five Years Old
July 4, 2016
I must admit I forget about Google Plus or Google+. Try searching for Google Plus with the “+” on your keyboard. How did that work out for you? Once the “+” allowed me to bind two words together like “white+house.” Well, forget that, gentle reader. You can depend on the Alphabet Google thing to handle bound phrases automatically with super smart software. Yep, works almost every time, doesn’t it?
I read a memory jogger titled “Google+ Social Network Turns 5 Years Old, Wants a Pony and a Plastic Rocket for Its Birthday.” According to the write up, the Alphabet Google thing no longer wants money. The corporate confection needs an Equus ferus caballus and a device which could be confiscated when going through airport security. Quite a surprise for me because I thought the Alphabet Google wanted money.
I learned:
Where Google+ goes from here isn’t entirely clear. Google seems content to keep it running for the time being, using it as a vector to gain new users on Google Photos and other related tools. While it’s nowhere near the juggernaut of Facebook, and Reddit seems to be quickly swallowing up the niche community space, there are still plenty of good conversations to be had in a Google+ feed.
The write up mentions the remarkable Google Wave service. There is not a hint of Google’s other social efforts. No Buzz. No Orkut. No Friend Connect.
Let’s see. Facebook is a big winner in the social space. LinkedIn is now a Microsoft Clippy for corporate social interactions. And Google+ or Google Plus? Yep, good conversations. I find “conversations” an overused word. Is Google+ or Google Plus overused? What happened to having bonuses for Googlers linked to social media? Why aren’t Google’s services united with a single Google+ or Google Plus mandate? Oh, right. I forgot about user behavior.
I will try to remember Google+ (or is it Google Plus?). I thought Sillycon Valley outfits wanted unicorns and real Musk- and Bezos-type rockets. Oh, well.
Stephen E Arnold, July 4, 2016
Google Throws Hat in Ring as Polling Service for Political Campaigns
July 4, 2016
The article on Silicon Angle titled “Google Is Pitching Its Polling Service at Journos, Politicians… Also, Google Has a Polling Division” explores the recent discovery of Google’s pollster ambitions. Compared to other projects Google has undertaken, this desire to join Gallup and Nielsen as a premier polling service seems downright logical. Google is simply taking advantage of its data reach to create Google Consumer Surveys. The article explains,
“Google collects the polling data for the service through pop-up survey boxes before a news article is read, and through a polling application…The data itself, while only representative of people on the internet, is said to be a fair sample nonetheless, as Google selects its sample by calculating the age, location, and demographics of those participating in each poll by using their browsing and search history…the same technology used by Google’s ad services including DoubleClick and AdWords.”
Apparently Google employees have been pitching their services to presidential and congressional campaign staffers, and at least one presidential candidate ran with them. As the article states, the entire project is a “no-brainer,” even with the somewhat uncomfortable idea of politicians gaining access to Google’s massive data trove. Let’s limit this to polling before Google gets any ideas about the census and call it a day.
Chelsea Kerwin, July 4, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
Enterprise Search Is Stuck in the Past
July 4, 2016
Enterprise search is one of the driving forces behind an enterprise system because the entire purpose of the system is to encourage collaboration and to make information quickly findable. While enterprise search is an essential tool, according to Computer Weekly’s article “Beyond Keywords: Bringing Initiative to Enterprise Search,” the feature is stuck in the past.
Enterprise search is due for an upgrade. The amount of enterprise data has increased, but the underlying information management system remains the same. Structured data is easy to fit into the standard information management system; unstructured data, however, holds the most valuable information. Unstructured information is hard to categorize, but natural language processing is being used to add context. Ontotext, for example, combined natural language processing with a graph database, allowing the content indexing to make more nuanced decisions.
We need to level up basic keyword searching to something more in-depth:
“Search for most organisations is limited: enterprises are forced to play ‘keyword bingo’, rephrasing their question multiple times until they land on what gets them to their answer. The technologies we’ve been exploring can alleviate this problem by not stopping at capturing the keywords, but by capturing the meaning behind the keywords, labeling the keywords into different categories, entities or types, and linking them together and inferring new relationships.”
In other words, enterprise search needs semantic search in order to add context to the keywords. A basic keyword search returns every result that matches the keyword phrase; a context-driven search adds a layer of interpretation behind the phrases. This is not really anything new in enterprise search or any other kind of search: semantic search is context-driven search.
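To make the keyword-versus-meaning distinction concrete, here is a toy sketch in Python. The documents and the hand-built entity dictionary are invented for illustration and stand in for real NLP; this is not Ontotext’s actual pipeline:

```python
# Toy contrast between keyword matching and entity-typed (semantic) matching.
# Documents and the entity dictionary are invented for illustration.

DOCS = {
    1: "Apple reported strong quarterly earnings.",
    2: "I baked an apple pie for the picnic.",
}

# A real system would label these spans with named-entity recognition;
# a hand-built dictionary stands in for that here.
ENTITIES = {
    1: [("Apple", "Company")],
    2: [("apple", "Food")],
}

def keyword_search(term):
    # Plain keyword match: every document containing the string.
    return [d for d, text in DOCS.items() if term.lower() in text.lower()]

def semantic_search(term, entity_type):
    # Typed match: only documents where the term carries the requested type.
    return [d for d, ents in ENTITIES.items()
            if any(t.lower() == term.lower() and typ == entity_type
                   for t, typ in ents)]

print(keyword_search("apple"))              # [1, 2] -- both senses
print(semantic_search("apple", "Company"))  # [1]    -- only the company
```

A real system derives the labels automatically and stores the links in something like a graph database; the toy dictionary just makes the contrast visible.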
Whitney Grace, July 4, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
The EU Google Dust Up: The Ad Business
July 3, 2016
I read “EU turns to Google’s Ad Business in Antitrust Probe.” Poor Alphabet Google. The company wants to focus, change transportation, and reduce costs by embracing smart software. The nitpickers in the EU continue to find fault with one of Sillycon Valley’s most cherished institutions. The problem this time appears to be advertising.
The write up reports (after one clicks through a somewhat silly survey to view the article):
Google is set to be hit with a third set of antitrust charges by the European Union – this time against its advertising business.
What’s the regulated beef? I learned:
investigators are taking steps to formalize their accusations by asking companies to remove confidential material from evidence that supports claims Google abuses its dominance in online advertising. If found guilty of breaking EU competition law, Google could face a maximum fine of 10% of its annual revenue per infringement.
Alphabet Google, despite the company’s best efforts over the last decade or so, generates about 90 percent of its $70 billion in revenue from advertising. Ten percent of $70 billion works out to roughly $7 billion per infringement. A fine of that size would certainly be an interesting number when converted to the super currency, the euro.
My thought is that the Alphabet Google outfit is misunderstood. Advertising depends on people who want to use a free online search system. The advertisers pay the Alphabet Google thing to put messages in front of users. Europe tried and failed to create a Google killer. The Qwant service is chugging along but with less and less spring in its step. The Exalead system, believe it or not, is online, but does not seem to be too popular here in rural Kentucky.
I almost feel sorry for the EU. Alphabet Google should be okay, but if the company finds itself having to pay out billions to keep regulators happy, there will be less fun in the Googleplex.
Stephen E Arnold, July 3, 2016
What Is in a Name? Procedures Remain Indifferent
July 2, 2016
I read “Brexit: ‘Bayesian’ Statistics Renamed ‘Laplacian’ Statistics.” Years ago I worked briefly with a person who was, I later learned, a failed webmaster and a would-be wizard. One of that individual’s distinguishing characteristics was an outright rejection of Bayesian methods. The person thought “la place” was a non-American English way of speaking French. I wonder where that self-appointed wizard is now. Oh, well. I hope the thought leader is scrutinizing the end of Bayes.
According to the write up:
With the U.K. leaving the E.U., it’s time for “Bayesian” to exit its titular role and be replaced by “Laplacian”.
I support this completely. I assume that the ever-efficient EU bureaucracy in the nifty building in Strasbourg will hop on this intellectual bandwagon.
Stephen E Arnold, July 2, 2016
Amazon AWS Jungle Snares Some Elasticsearch Functions
July 1, 2016
Elastic’s Elasticsearch has become one of the go-to open source search and retrieval solutions. Based on Lucene, the system has put the heat on some of the other open source centric search vendors. However, search is a tricky beastie.
Navigate to “AWS Elasticsearch Service Woes” to get a glimpse of some of the snags which can poke holes in one’s ripstop hiking garb. The problems are not surprising. One does not know what issues will arise until a search system is deployed and the lucky users are banging away with their queries or a happy administrator discovers that Button A no longer works.
The write up states:
We kept coming across OOM issues due to the JVMMemoryPressure spiking, and in turn the ES service kept crapping out. Aside from some optimization work, we’d more than likely have to add more boxes/resources to the cluster, which then means more things to manage. This is when we thought, “Hey, AWS have a service for this right? Let’s give that a crack?!”. As great as having it as a service is, it certainly comes with some fairly irritating pitfalls which then cause you to approach the situation from a different angle.
One approach is to use index templates to handle shard management in AWS Elasticsearch. Sample templates are provided in the write up. The fix does not address every issue. The article also provides a link to a reindexing tool called es-tool.
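For readers who have not wrangled one, an index template is a JSON document registered with the cluster; any new index whose name matches the pattern inherits its settings, including shard counts. Here is a minimal sketch against the legacy _template API. The endpoint, pattern, and counts are hypothetical, not the write up’s actual templates, and a real AWS ES domain will also want request signing or an access policy:

```python
# Minimal sketch: register an index template so new indices matching the
# pattern get a fixed shard/replica layout. Endpoint, pattern, and counts
# are hypothetical; adjust for your cluster and Elasticsearch version.
import requests

ES_ENDPOINT = "https://search-mydomain.us-east-1.es.amazonaws.com"

template = {
    "template": "logs-*",          # legacy (pre-6.x) index-pattern key
    "settings": {
        "number_of_shards": 3,     # fixed at index creation time
        "number_of_replicas": 1,
    },
}

resp = requests.put(ES_ENDPOINT + "/_template/logs_defaults", json=template)
resp.raise_for_status()
print(resp.json())  # {"acknowledged": true} on success
```

Because shard counts are fixed when an index is created, templates are the lever you pull before the data arrives, which is why the write up leans on them.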
The most interesting comment in the article in my opinion is:
In hindsight I think it may have been worth potentially sticking with and fleshing out the old implementation of Elasticsearch, instead of having to fudge various things with the AWS ES service. On the other hand it has relieved some of the operational overhead, and in terms of scaling I am literally a couple of clicks away. If you have large amounts of data you pump into Elasticsearch and you require granular control, AWS ES is not the solution for you. However if you need a quick and simple Elasticsearch and Kibana solution, then look no further.
My takeaway is to do some thinking about the strengths and weaknesses of the AWS Elasticsearch service before chopping through the Bezos cloud jungle.
Stephen E Arnold, July 1, 2016
Voyager Search: New Release Available
July 1, 2016
Voyager Search is a vendor of Lucene-based search and retrieval software. I was not familiar with the company until I read “Voyager Search Improves Search Capabilities and Overall Usability With More Than 150 Updates to Its Version 1.9.8.” According to the write up:
In the new version, Voyager makes it easier to configure content in Navigo, its modern web app, extends its spatial content search, and improves the usability of its Navigo processing tools. Managing content in Navigo can now be done through the new personalized ‘My Voyager’ customization page, which allows customers to share saved searches and update display configurations through a drag and drop interface.
One point in the write up I noted was this statement: “An improved spatial search interface now includes the ability to draw and buffer points, lines and polygons.” The idea is that geospatial operations appear to be supported by the system.
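Voyager bills itself as Solr/Lucene based, so a buffered point search presumably boils down to something like Solr’s stock geofilt filter. A hypothetical sketch follows; the host, collection, and field names are invented, and this is generic Solr syntax, not Voyager’s documented API:

```python
# Hypothetical sketch: a "buffered point" search as a stock Solr geofilt
# query -- find documents within d kilometers of a lat,lon point.
# Host, collection, and field names are invented for illustration.
import requests

params = {
    "q": "*:*",
    "fq": "{!geofilt sfield=location pt=38.25,-85.76 d=5}",  # 5 km buffer
    "wt": "json",
}

resp = requests.get("http://localhost:8983/solr/demo/select", params=params)
resp.raise_for_status()
print(resp.json()["response"]["numFound"])
```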
I also highlighted this comment:
Voyager Search is a leading global provider of geospatial, enterprise search tools that connect, find and deliver more than 1,800 different file formats.
In my experience, support for more than 1,000 file formats suggests a large number of conversion widgets.
The company bills itself as the “only install and go Solr/Lucene search engine.” Information about the company is available at this link. A demo is available here.
Stephen E Arnold, July 1, 2016
DuckDuckGo Sees Apparent Exponential Growth
July 1, 2016
The Tor-enabled search engine DuckDuckGo has received attention recently for being a search engine that does not track users. We found their activity report, which shows a one-year average of their direct queries per day. DuckDuckGo launched in 2008 and offers an array of options to prevent “search leakage,” which their website defines as the sharing of personal information, such as the search terms queried. Explaining a few of DuckDuckGo’s more secure search options, their website states:
“Another way to prevent search leakage is by using something called a POST request, which has the effect of not showing your search in your browser, and, as a consequence, does not send it to other sites. You can turn on POST requests on our settings page, but it has its own issues. POST requests usually break browser back buttons, and they make it impossible for you to easily share your search by copying and pasting it out of your Web browser’s address bar.
Finally, if you want to prevent sites from knowing you visited them at all, you can use a proxy like Tor. DuckDuckGo actually operates a Tor exit enclave, which means you can get end to end anonymous and encrypted searching using Tor & DDG together.”
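The GET-versus-POST point is easy to demonstrate. With GET, the search term rides in the URL, where history, logs, and Referer headers can pick it up; with POST, it travels in the request body. A minimal sketch against a hypothetical search endpoint:

```python
# Minimal sketch of the GET vs. POST difference DuckDuckGo describes.
# The endpoint is hypothetical; the point is where the query ends up.
import requests

query = {"q": "embarrassing medical question"}

# GET: the search term is embedded in the URL itself, so it lands in
# browser history, proxy/server logs, and the Referer header sent to
# whatever site you click next.
get_req = requests.Request(
    "GET", "https://search.example.com/", params=query).prepare()
print(get_req.url)    # https://search.example.com/?q=embarrassing+medical+question

# POST: the term travels in the request body; the URL stays clean, at the
# cost of breaking back buttons and copy-paste sharing of the search.
post_req = requests.Request(
    "POST", "https://search.example.com/", data=query).prepare()
print(post_req.url)   # https://search.example.com/
print(post_req.body)  # q=embarrassing+medical+question
```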
Cybersecurity and privacy have become hot topics since Edward Snowden made headlines in 2013, which is notably when DuckDuckGo’s exponential growth began to take shape. Tor also gained mainstream recognition around 2013, when the Silk Road shutdown placed the Dark Web in the news. It appears that starting a search engine focused on anonymity in 2008 was not such a bad idea.
Megan Feil, July 1, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
Supercomputers Have Individual Personalities
July 1, 2016
Supercomputers like Watson are more than a novelty. They were built to be another tool for humans rather than a replacement for them, or so say comments from Watson’s chief technology officer Rob High. High was a keynote speaker at the Nvidia GPU Technology Conference in San Jose, California. The Inquirer shares the details in “Nvidia GTC: Why IBM Watson Dances Gangnam Style and Sings Like Taylor Swift.”
At the conference, High said that he did not want a computer to take over his thinking; instead, he wanted the computer to do his research for him. Research and keeping up with the latest trends in any industry consume A LOT of time, and a supercomputer could potentially eliminate some of the hassle. This requires that supercomputers become more human:
“This leads on to the fact that the way we interact with computers needs to change. High believes that cognitive computers need four skills – to learn, to express themselves with human-style interaction, to provide expertise, and to continue to evolve – all at scale. People who claim not to be tech savvy, he explained, tend to be intimidated by the way we currently interact with computers, pushing the need for a further ‘humanising’ of the process.”
Humanizing robots, in practice, means teaching them how to behave like humans. A few robots have been built with Watson as their main processor, and they can interact with humans, picking up on spoken language as well as body language and vocal tone. The goal is not for the robot to become human, but to become the best “artificial servant it can be.”
Robots and supercomputers are tools that can ease a person’s job, but the fact still remains that in some industries they can also replace human labor.
Whitney Grace, July 1, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph