June 15, 2016
I read “Data Lakes vs Data Streams: Which Is Better?” The answer seems to me to be “both.” Streams are now. Lakes are “were.” Who wants to make decisions based on historical data? On the other hand, real time data may mislead the unwary data sailor. The write up states:
The availability of these new ways [lakes and streams] of storing and managing data has created a need for smarter, faster data storage and analytics tools to keep up with the scale and speed of the data. There is also a much broader set of users out there who want to be able to ask questions of their data themselves, perhaps to aid their decision making and drive their trading strategy in real-time rather than weekly or quarterly. And they don’t want to rely on or wait for someone else such as a dedicated business analyst or other limited resource to do the analysis for them. This increased ability and accessibility is creating whole new sets of users and completely new use cases, as well as transforming old ones.
Good news for self appointed lake and stream experts. Bad news for a company trying to figure out how to generate new revenues.
The first step may be to answer some basic questions about what data are available, how reliable they are, and who “knows” about data wrangling. Finding out whether the water is polluted is a good idea before worrying about lakes and streams and diving into the murky waters.
Stephen E Arnold, June 15, 2016
March 23, 2016
The course syllabus for Stanford’s Computer Science class titled CS 349: Data Mining, Search, and the World Wide Web on Stanford.edu provides an overview of some of the technologies and advances that led to Google search. The syllabus states,
“There has been a close collaboration between the Data Mining Group (MIDAS) and the Digital Libraries Group at Stanford in the area of Web research. It has culminated in the WebBase project whose aims are to maintain a local copy of the World Wide Web (or at least a substantial portion thereof) and to use it as a research tool for information retrieval, data mining, and other applications. This has led to the development of the PageRank algorithm, the Google search engine…”
The syllabus alone offers some extremely useful insights that could help students and laypeople understand the roots of Google search. Key inclusions are the Digital Equipment Corporation (DEC) and PageRank, the algorithm named for Larry Page that enabled Google to become Google. The algorithm ranks web pages based on how many other pages link to them and on how highly ranked those linking pages are. Jon Kleinberg also played a key role by realizing that pages with lots of outgoing links to good sources (hubs, like a search engine or directory) should also be seen as important. The larger context of the course is data mining and information retrieval.
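The link-based ranking idea can be sketched in a few lines of Python. This is only a minimal illustration of the commonly described power-iteration form of PageRank, not Google's production method. The toy link graph, the damping factor of 0.85, and the fixed iteration count are assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by repeatedly passing each page's score along its links.

    `links` maps a page name to the list of pages it links to.
    The damping factor and iteration count are illustrative assumptions.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" receives the most incoming links.
toy_web = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c", "a"],
}

ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy graph the page with the most incoming links ends up on top, and the page it links to benefits from that endorsement.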
Chelsea Kerwin, March 23, 2016
March 22, 2016
The article on Beyond the Book titled Data Not Content Is Now Publishers’ Product floats a new buzzword in its discussion of the future of information: infonomics, or the study of creation and consumption of information. The article compares information to petroleum as the resource that will cause quite a stir in this century. Grace Hong, Vice-President of Strategic Markets & Development for Wolters Kluwer’s Tax & Accounting, weighs in,
“When it comes to big data – and especially when we think about organizations like traditional publishing organizations – data in and of itself is not valuable. It’s really about the insights and the problems that you’re able to solve,” Hong tells CCC’s Chris Kenneally. “From a product standpoint and from a customer standpoint, it’s about asking the right questions and then really deeply understanding how this information can provide value to the customer, not only just mining the data that currently exists.”
Hong points out that the data itself is useless unless it is used correctly. That means asking the right questions and using the best technology available to find meaning in the massive collections of information that can now be gathered. Hong suggests that it is time for publishers to seize on the market created by Big Data.
Chelsea Kerwin, March 22, 2016
March 1, 2016
For us, concepts have meaning in relationship to other concepts, but it’s easy for computers to define concepts in terms of usage statistics. The post Sense2vec with spaCy and Gensim from spaCy’s blog offers a well-written outline explaining how natural language processing works, highlighting their new Sense2vec app. This application is an upgraded version of word2vec that works with more context-sensitive word vectors. The article describes more precisely how Sense2vec works,
“The idea behind sense2vec is super simple. If the problem is that duck as in waterfowl and duck as in crouch are different concepts, the straight-forward solution is to just have two entries, duckN and duckV. We’ve wanted to try this for some time. So when Trask et al (2015) published a nice set of experiments showing that the idea worked well, we were easy to convince.
We follow Trask et al in adding part-of-speech tags and named entity labels to the tokens. Additionally, we merge named entities and base noun phrases into single tokens, so that they receive a single vector.”
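The tagging step described in the quote can be sketched in plain Python. This is not the spaCy or sense2vec code. The tokens below are tagged by hand, and the `word|TAG` key format and the `sense_key` helper are assumptions made for illustration. In a real pipeline a tagger would supply the labels, and the resulting keys would be passed to a word2vec-style trainer.

```python
from collections import Counter

# Hypothetical, hand-tagged sentences; a real pipeline would get
# these (token, tag) pairs from a part-of-speech tagger.
tagged_sentences = [
    [("The", "DET"), ("duck", "NOUN"), ("swam", "VERB")],
    [("You", "PRON"), ("should", "AUX"), ("duck", "VERB"), ("now", "ADV")],
    [("A", "DET"), ("duck", "NOUN"), ("quacked", "VERB")],
    [("natural language processing", "NOUN"), ("is", "AUX"), ("fun", "ADJ")],
]


def sense_key(token, tag):
    """Build one vocabulary key per (word, tag) pair.

    Multi-word phrases are joined with underscores so that a merged
    noun phrase becomes a single token with a single entry.
    """
    return token.lower().replace(" ", "_") + "|" + tag


def build_vocab(sentences):
    """Count how often each tagged key occurs in the corpus."""
    counts = Counter()
    for sentence in sentences:
        for token, tag in sentence:
            counts[sense_key(token, tag)] += 1
    return counts


vocab = build_vocab(tagged_sentences)
for key, count in sorted(vocab.items()):
    print(key, count)
```

Because the noun and the verb get separate keys, each would receive its own vector once the keys are used as training tokens.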
Curious about the meta definition of natural language processing from SpaCy, we queried natural language processing using Sense2vec. Its neural network is based on every word on Reddit posted in 2015. While it is a feat for NLP to learn from a dataset on one platform, such as Reddit, what about processing that scours multiple data sources?
Megan Feil, March 1, 2016
February 5, 2016
Elasticsearch is one of the most popular open source search applications, and it has been deployed for personal as well as corporate use. Elasticsearch is built on another popular open source application called Apache Lucene, and it was designed for horizontal scalability, reliability, and ease of use. Elasticsearch has become such a common piece of software that people do not realize just how useful it is. eWeek takes the opportunity to discuss the search application’s uses in “9 Ways Elasticsearch Helps Us, From Dawn To Dusk.”
“With more than 45 million downloads since 2012, the Elastic Stack, which includes Elasticsearch and other popular open-source tools like Logstash (data collection), Kibana (data visualization) and Beats (data shippers) makes it easy for developers to make massive amounts of structured, unstructured and time-series data available in real-time for search, logging, analytics and other use cases.”
How is Elasticsearch being used? The Guardian uses it daily as readers interact with its content, Microsoft Dynamics ERP and CRM use it to index and analyze social feeds, it powers Yelp, and, here is a big one, Wikimedia uses it to power the well-loved and well-used Wikipedia. We can already see how much of an impact Elasticsearch makes on our daily lives without our being aware of it. Other companies that use Elasticsearch for our and their benefit are HotelTonight, Dell, Groupon, Quizlet, and Netflix.
Elasticsearch will continue to grow as an inexpensive alternative to proprietary software, and the number of Web services and companies that use it will only continue to increase.
February 3, 2016
An article entitled Tor and the enterprise 2016 – blocking malware, darknet use and rogue nodes from Computerworld UK discusses the inevitable enterprise concerns related to anonymity networks. Tor, The Onion Router, has gained steam with mainstream internet users in the last five years. According to the article,
“It’s not hard to understand that Tor has plenty of perfectly legitimate uses (it is not our intention to stigmatise its use) but it also has plenty of troubling ones such as connecting to criminal sites on the ‘darknet’, as a channel for malware and as a way of bypassing network security. The anxiety for organisations is that it is impossible to tell which is which. Tor is not the only anonymity network designed with ultra-security in mind, The Invisible Internet Project (I2P) being another example. On top of this, VPNs and proxies also create similar risks although these are much easier to spot and block.”
The conclusion this article draws is that technology can only take the enterprise so far in mitigating risk. Its suggestion is to rely on penalties for running unauthorized applications, but this seems to be a short-sighted solution if the popularity of anonymity networks rises.
Megan Feil, February 3, 2016
February 1, 2016
Computer programmers who specialize in machine learning, artificial intelligence, data mining, data visualization, and statistics are smart individuals, but even they sometimes get stumped. Using the same form of communication as reddit and old-fashioned forums, Cross Validated is a question and answer site run by Stack Exchange. People can post questions about data and related topics and then wait for a response. One user posted a question about “Machine Learning Classifiers”:
“I have been trying to find a good summary for the usage of popular classifiers, kind of like rules of thumb for when to use which classifier. For example, if there are lots of features, if there are millions of samples, if there are streaming samples coming in, etc., which classifier would be better suited in which scenarios?”
The response the user received was that the question was too broad. Which classifier performs best depends on the data and the process that generates it. It is kind of like asking for the best way to organize books or your taxes: it depends on the content of the items in question.
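The point that no classifier wins everywhere can be seen on toy data. The sketch below is an illustration under stated assumptions, not anything from the Cross Validated thread. It compares two very simple classifiers, one-nearest-neighbor and nearest-centroid, using leave-one-out accuracy on two small made-up datasets. All function names and data points are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def one_nearest_neighbor(train, point):
    """Predict the label of the single closest training point."""
    _, label = min(train, key=lambda item: squared_distance(item[0], point))
    return label


def nearest_centroid(train, point):
    """Predict the label whose class mean is closest to the point."""
    by_label = {}
    for features, label in train:
        by_label.setdefault(label, []).append(features)
    centroids = {
        label: tuple(sum(axis) / len(rows) for axis in zip(*rows))
        for label, rows in by_label.items()
    }
    return min(centroids, key=lambda label: squared_distance(centroids[label], point))


def leave_one_out_accuracy(classifier, data):
    """Hold out each point in turn, train on the rest, and score the guess."""
    hits = 0
    for i, (features, label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        hits += classifier(train, features) == label
    return hits / len(data)


# Made-up XOR-style data: each class sits in two opposite corners.
xor_data = [
    ((0.0, 0.0), 0), ((0.1, 0.1), 0), ((1.0, 1.0), 0), ((0.9, 0.9), 0),
    ((0.0, 1.0), 1), ((0.1, 0.9), 1), ((1.0, 0.0), 1), ((0.9, 0.1), 1),
]

# Made-up blobs with one mislabeled point planted inside each blob.
noisy_blobs = [
    ((0.0, 0.0), 0), ((1.0, 0.0), 0), ((0.0, 1.0), 0), ((1.0, 1.0), 0), ((4.1, 4.1), 0),
    ((4.0, 4.0), 1), ((5.0, 4.0), 1), ((4.0, 5.0), 1), ((5.0, 5.0), 1), ((0.9, 0.9), 1),
]

for name, data in [("xor", xor_data), ("noisy blobs", noisy_blobs)]:
    print(name,
          "1-NN:", leave_one_out_accuracy(one_nearest_neighbor, data),
          "centroid:", leave_one_out_accuracy(nearest_centroid, data))
```

On the XOR-style data the neighbor-based method wins, because class means sit on top of each other. On the noisy blobs the centroid method wins, because averaging smooths over the mislabeled points. Neither result says anything about which method is better in general.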
Another user replied that there was an easy way to explain the general process of choosing a classifier. The user directed readers to the scikit-learn.org chart about “choosing the right estimator.” Other users say that the chart is incomplete, because it does not include deep learning, decision trees, and logistic regression.
We say create some other diagrams and share those. Classifiers are complex, but they are a necessity to the artificial intelligence and big data craze.
January 22, 2016
One of the best things about data and numbers is that they do not lie…usually. According to Slate’s article, “FTC Report Details How Big Data Can Discriminate Against The Poor,” big data does a huge disservice to people of lower socioeconomic status by reinforcing existing negative patterns. The Federal Trade Commission (FTC), academics, and activists have expressed concern for some time about big data analytics.
“At its worst, big data can reinforce—and perhaps even amplify—existing disparities, partly because predictive technologies tend to recycle existing patterns instead of creating new openings. They can be especially dangerous when they inform decisions about people’s access to healthcare, credit, housing, and more. For instance, some data suggests that those who live close to their workplaces are likely to maintain their employment for longer. If companies decided to take that into account when hiring, it could be accidentally discriminatory because of the racialized makeup of some neighborhoods.”
The FTC stresses that big data analytics has benefits as well. It can yield information that can create more job opportunities, transform health care delivery, give credit through “non-traditional methods,” and more.
The way big data can avoid reinforcing these problems, and even improve upon them, is to account for biases from the beginning. Large data sets can make these problems invisible or harder to recognize. Companies can use prejudiced data to justify the actions they take and even weaken the effectiveness of consumer choice.
Data is supposed to be an objective tool, but the sources behind the data can be questionable. It becomes important for third parties and the companies themselves to investigate the data sources, run multiple tests, and confirm that the data is truly objective. Otherwise, we will be dealing with social problems, and more, reinforced by bad data.
December 23, 2015
Hacking software is a problem and could become a bigger one. While some government agencies, hacktivist organizations, and software companies are trying to use it for good, terrorist groups, digital thieves, and even law enforcement agencies can use it to spy on individuals and steal their data. Technology Review shares some interesting stories about how software is being used for benign and harmful purposes in “The Growth Industry Helping Governments Hack Terrorists, Criminals, And Political Opponents.”
The company Hacking Team is discussed at length, along with its Remote Control System software, which can worm its way through security holes in a device and steal valuable information. Governments from around the globe have used the software for crime deterrence and to keep tabs on enemies, but other entities have used the software for harmful acts, including spying and hacking into political opponents’ computers.
Within the United States, it is illegal to use a Remote Control System without proper authority, but often this happens:
“When police get access to new surveillance technologies, they are often quickly deployed before any sort of oversight is in place to regulate their use. In the United States, the abuse of Stingrays—devices that sweep up information from cell phones in given area—has become common. For example, the sheriff of San Bernardino County, near Los Angeles, deployed them over 300 times without a warrant in the space of less than two years. That problem is only being addressed now, years after it emerged, with the FBI now requiring a warrant to use Stingrays, and efforts underway to force local law enforcement to do the same. It’s easy to imagine a similar pattern of abuse with hacking tools, which are far more powerful and invasive than other surveillance technologies that police currently use.”
It is scary how the software is being used and how governments are skirting their own laws to use it. It reminds me of how gun control is always a controversial topic. Whenever there is a mass shooting, debates rage about how the shooting would never have happened if there were stricter gun control to keep weapons out of the hands of psychopaths. While the shooter is blamed for the incident, people also place a lot of blame on the gun, as if it were more responsible. As spying, control, and other software becomes more powerful and ingrained in our lives, I imagine there will be debates about “software control” and determining who has the right to use certain programs.
December 21, 2015
Topology’s time has finally come, according to “The Unreasonable Usefulness of Imagining You Live in a Rubbery World,” shared by 3 Quarks Daily. The engaging article reminds us that the field of topology emphasizes connections over geometric factors like distance and direction. Think of a subway map as compared to a street map; or, as writer Jonathan Kujawa describes:
“Topologists ask a question which at first sounds ridiculous: ‘What can you say about the shape of an object if you have no concern for lengths, angles, areas, or volumes?’ They imagine a world where everything is made of silly putty. You can bend, stretch, and distort objects as much as you like. What is forbidden is cutting and gluing. Otherwise pretty much anything goes.”
Since the beginning, this perspective has been dismissed by many as purely academic. However, today’s era of networks and big data has boosted the field’s usefulness. The article observes:
“A remarkable new application of topology has emerged in the last few years. Gunnar Carlsson is a mathematician at Stanford who uses topology to extract meaningful information from large data sets. He and others invented a new field of mathematics called Topological data analysis. They use the tools of topology to wrangle huge data sets. In addition to the networks mentioned above, Big Data has given us Brobdinagian sized data sets in which, for example, we would like to be able to identify clusters. We might be able to visually identify clusters if the data points depend on only one or two variables so that they can be drawn in two or three dimensions.”
Kujawa goes on to note that one century-old tool of topology, homology, is being used to analyze real-world data, like the ways diabetes patients have responded to a specific medication. See the well-illustrated article for further discussion.
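The simplest piece of homology counts connected components, which is one way to identify clusters. The sketch below is not taken from the article or from any topological data analysis library. It links any two points closer than a chosen scale and counts the groups that result. The sample points, the `epsilon` values, and the function name are assumptions for illustration.

```python
from itertools import combinations
from math import dist


def count_components(points, epsilon):
    """Count groups of points connected by hops no longer than epsilon."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the group representative, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= epsilon:
            # Merge the two groups these points belong to.
            parent[find(i)] = find(j)

    return len({find(i) for i in range(len(points))})


# Made-up data: two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0)]

for epsilon in (0.5, 1.5, 15):
    print(epsilon, count_components(points, epsilon))
```

Watching how the count changes as the scale grows is the intuition behind the persistence idea used in topological data analysis: groupings that survive across many scales are more likely to be real structure than noise.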
Cynthia Murrell, December 21, 2015