Big Data Too Is Prone to Human Bias

August 2, 2017

Conventional wisdom holds that Big Data, being the realm of machines, is immune to human behavioral traits like discrimination. Data scientists, however, suggest otherwise.

In an article published by PHYS.ORG titled "Discrimination, Lack of Diversity, and Societal Risks of Data Mining Highlighted in Big Data," the author says:

Despite the dramatic growth in big data affecting many areas of research, industry, and society, there are risks associated with the design and use of data-driven systems. Among these are issues of discrimination, diversity, and bias.

The crux of the problem is the way data is mined and processed and the way decisions are made. At every step, humans need to be involved to tell machines how each of these processes is executed. If the person guiding the system is biased, those biases are bound to seep into the subsequent processes in some way.

Apart from decisions like granting credit, human resources, a function that is also being automated, may have diversity issues. The fundamental problem remains the same in this case too.

Big Data was touted as the next big thing and may turn out to be so, but most companies have yet to figure out how to utilize it. Streamlining the processes and making them efficient would be the next step.

Vishal Ingole, August 2, 2017

Big Data in Biomedical

July 19, 2017

The biomedical field, which is replete with unstructured data, is all set to take a giant leap toward standardization with the Biological Text Mining Unit.

According to PHYS.ORG, in a peer-reviewed article titled "Researchers Review the State-Of-The-Art Text Mining Technologies for Chemistry," the author states:

Being able to transform unstructured biomedical research data into structured databases that can be more efficiently processed by machines or queried by humans is critical for a range of heterogeneous applications.

Scientific data has a fixed vocabulary, which makes standardization and indexing easy. However, most big names in Big Data and enterprise search are concentrating their efforts on e-commerce.

Hundreds of new compounds are discovered every year. If the data pertaining to these compounds is made available to other researchers, advancements in this field will be very rapid. The major hurdle is that the data is in an unstructured format, which the Biological Text Mining Unit's standards intend to overcome.

Vishal Ingole, July 19, 2017

Does This Count As Irony?

May 16, 2017

Does this count as irony?

Palantir, which has built its data-analysis business largely on its relationships with government organizations, has a Department of Labor analysis to thank for recent charges of discrimination. No word on whether that Department used Palantir software to “sift through” the reports. Now, Business Insider tells us, “Palantir Will Shell Out $1.7 Million to Settle Claims that It Discriminated Against Asian Engineers.” Writer Julie Bort tells us that, in addition to that payout, Palantir will make job offers to eight unspecified Asians. She also explains:

The issue arose because, as a government contractor, Palantir must report its diversity statistics to the government. The Labor Department sifted through these reports and concluded that even though Palantir received a huge number of qualified Asian applicants for certain roles, it was hiring only small numbers of them. Palantir, being the big data company that it is, did its own sifting and produced a data-filled response that it said refuted the allegations and showed that in some tech titles 25%-38% of its employees were Asians. Apparently, Palantir's protestations weren't enough to satisfy government regulators, so the company agreed to settle.

For its part, Palantir insists on its innocence but says it settled in order to put the matter behind it. Bort notes the unusual nature of this case—according to the Equal Employment Opportunity Commission, African-Americans, Latin-Americans, and women are more underrepresented in tech fields than Asians. Is the Department of Labor making it a rule to analyze the hiring patterns of companies required to report diversity statistics? If it is consistent, there should soon be a number of such lawsuits regarding discrimination against other groups. We shall see.

Cynthia Murrell, May 16, 2017

Bad Big Data? Get More Data Then

March 2, 2017

I like the idea that more is better. The idea is particularly magnetic when a company cannot figure out what its own, in-house, proprietary data mean. Think of the legions of consultants from McKinsey and BCG telling executives what their own data “means.” Toss in the notion of Big Data in a giant “data lake,” and you have decision makers who cannot use the information they already have.

Well, how does one fix that problem? Easy. Get more data. That sounds like a plan, particularly when the struggling professionals are in charge of figuring out whether sales and marketing investments sort of pay for themselves.

I learned that I need more data by reading “Deepening The Data Lake: How Second-Party Data Increases AI For Enterprises.” The headline introduces the amazing data lake concept along with two giant lake front developments: More data and artificial intelligence.

Buzzwords? Heck no. Just solid post millennial reasoning; for example:

there are many marketers with surprisingly sparse data, like the food marketer who does not get many website visitors or authenticated customers downloading coupons. Today, those marketers face a situation where they want to use data science to do user scoring and modeling but, because they only have enough of their own data to fill a shallow lake, they have trouble justifying the costs of scaling the approach in a way that moves the sales needle.

I like that sales needle phrase. Marketers have to justify themselves, and many have only “sparse” data. I would suggest that marketers often have useless data, like the number of unique clicks, but that's only polluting the data lake.

The fix is interesting. I learned:

we can think of the marketer’s first-party data – media exposure data, email marketing data, website analytics data, etc. – being the water that fills a data lake. That data is pumped into a data management platform (pictured here as a hydroelectric dam), pumped like electricity through ad tech pipes (demand-side platforms, supply-side platforms and ad servers) and finally delivered to places where it is activated (in the town, where people live)… this infrastructure can exist with even a tiny bit of water but, at the end of the cycle, not enough electricity will be generated to create decent outcomes and sustain a data-driven approach to marketing. This is a long way of saying that the data itself, both in quality and quantity, is needed in ever-larger amounts to create the potential for better targeting and analytics.

Yep, more data.

And what about making sense of the additional data? I learned:

The data is also of extremely high provenance, and I would also be able to use that data in my own environment, where I could model it against my first-party data, such as site visitors or mobile IDs I gathered when I sponsored free Wi-Fi at the last Country Music Awards. The ability to gather and license those specific data sets and use them for modeling in a data lake is going to create massive outcomes in my addressable campaigns and give me an edge I cannot get using traditional ad network approaches with third-party segments. Moreover, the flexibility around data capture enables marketers to use highly disparate data sets, combine and normalize them with metadata – and not have to worry about mapping them to a predefined schema. The associative work happens after the query takes place. That means I don’t need a predefined schema in place for that data to become valuable – a way of saying that the inherent observational bias in traditional approaches (“country music fans love mainstream beer, so I’d better capture that”) never hinders the ability to activate against unforeseen insights.

Okay, I think I understand. No wonder companies hire outfits like blue chip consulting firms to figure out what is going on in their companies. Stated another way, insiders live in the swamp. Outsiders can put the swamp into a context and maybe implement some pollution control systems.

Stephen E Arnold, March 2, 2017

Big Data Needs to Go Public

December 16, 2016

Big Data touches every part of our lives, and we are often unaware. Have you ever noticed when you listen to the news, read an article, or watch a YouTube video that people use phrases such as "experts claim" or "science says"? In the past, these statements relied on less than trustworthy sources, but now they can use Big Data to back up their claims. However, popular opinion and puff pieces still need to back up their Big Data with hard fact. Nature.com says that transparency is a big deal for Big Data and that algorithm designers need to work on it in the article "More Accountability for Big-Data Algorithms."

One of the hopes is that Big Data will be used to bridge the divide between one bias and another, except that the opposite can happen. In other words, Big Data algorithms can be designed with a bias:

There are many sources of bias in algorithms. One is the hard-coding of rules and use of data sets that already reflect common societal spin. Put bias in and get bias out. Spurious or dubious correlations are another pitfall. A widely cited example is the way in which hiring algorithms can give a person with a longer commute time a negative score, because data suggest that long commutes correlate with high staff turnover.
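The commute-time example in the quote can be made concrete with a minimal "bias in, bias out" sketch. The data and scoring rule below are hypothetical, invented for illustration; the point is only that a penalty "learned" from skewed historical records ends up punishing candidates for an attribute unrelated to their ability.

```python
# A hypothetical hiring-score sketch: historical records where long commutes
# happen to correlate with staff turnover, and a naive model that turns that
# correlation into a penalty. All data here is invented for illustration.

# Historical records: (commute_minutes, stayed_two_years)
history = [(10, True), (15, True), (20, True),
           (50, False), (60, False), (70, False)]

def learned_penalty(history):
    """Gap in retention rate between short and long commutes becomes a penalty."""
    short = [stayed for mins, stayed in history if mins <= 30]
    long_ = [stayed for mins, stayed in history if mins > 30]
    return (sum(short) / len(short)) - (sum(long_) / len(long_))

def score(commute_minutes, history):
    base = 1.0  # assume two equally qualified candidates
    if commute_minutes > 30:
        # Data-driven, but it simply encodes the societal spin in the data.
        base -= learned_penalty(history)
    return base

print(score(15, history))  # full score
print(score(55, history))  # lower, purely because of the commute correlation
```

Put bias in, get bias out: nothing in the code is malicious, yet the second candidate is scored lower for a reason that has nothing to do with qualifications.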

Even worse is that people and organizations can design an algorithm to support science or facts they want to pass off as the truth. There is a growing demand for "algorithm accountability," mostly in academia. The demands are that data sets fed into the algorithms be made public. There are also plans to make algorithms that monitor algorithms for bias.

Big Data is here to stay, but relying too much on algorithms can distort the facts. This is why the human element is still needed to distinguish between fact and fiction. Minority Report is closer to being our present than ever before.

Whitney Grace, December 16, 2016

Algorithm Bias in Beauty Contests

September 16, 2016

I don’t read about beauty contests. In my college dorm, I recall that the televised broadcast of the Miss America pageant was popular among some of the residents. I used the attention grabber as my cue to head to the library so I could hide reserved books from my classmates. Every little bit helps in the dog-eat-dog world of academic achievement.

“When Artificial Intelligence Judges a Beauty Contest, White People Win” surprised me. I thought that algorithms were objective little numerical recipes. Who could fiddle with 1+1=2?

I learned:

The foundation of machine learning is data gathered by humans, and without careful consideration, the machines learn the same biases of their creators. Sometimes bias is difficult to track, but other times it’s clear as the nose on someone’s face—like when it’s a face the algorithm is trying to process and judge.

It seems that an algorithm likes white people. The write up informed me:

An online beauty contest called Beauty.ai, run by Youth Laboratories (that lists big names in tech like Nvidia and Microsoft as “partners and supporters” on the contest website), solicited 600,000 entries by saying they would be graded by artificial intelligence. The algorithm would look at wrinkles, face symmetry, amount of pimples and blemishes, race, and perceived age. However, race seemed to play a larger role than intended; of the 44 winners, 36 were white.

Oh, oh. Microsoft and its smart software seem to play a role in this drama.

What’s the fix? Better data. The write up includes this statement from a Microsoft expert:

“If a system is trained on photos of people who are overwhelmingly white, it will have a harder time recognizing non-white faces,” writes Kate Crawford, principal researcher at Microsoft Research New York City, in a New York Times op-ed. “So inclusivity matters—from who designs it to who sits on the company boards and which ethical perspectives are included. Otherwise, we risk constructing machine intelligence that mirrors a narrow and privileged vision of society, with its old, familiar biases and stereotypes.”

In the last few months, Microsoft’s folks were involved in Tay, a chatbot which allegedly learned to be racist. Then there was the translation of “Daesh” as Saudi Arabia. Now algorithms appear to favor folks of a particular stripe.

Exciting math. But Microsoft has also managed to gum up webcams and Kindle access in Windows 10. Yep, the new Microsoft is a sparkling example of smart.

Stephen E Arnold, September 16, 2016

In-Q-Tel Wants Less Latency, Fewer Humans, and Smarter Dashboards

September 15, 2016

I read “The CIA Just Invested in a Hot Startup That Makes Sense of Big Data.” I love the “just.” In-Q-Tel investments are not like bumping into a friend in Penn Station. Zoomdata, founded in 2012, has been making calls, raising venture funding (more than $45 million in four rounds from 21 investors), and staffing up to about 100 full time equivalents. With its headquarters in Reston, Virginia, the company is not exactly operating from a log cabin west of Paducah, Kentucky.

The write up explains:

Zoom Data uses something called Data Sharpening technology to deliver visual analytics from real-time or historical data. Instead of a user searching through an Excel file or creating a pivot table, Zoom Data puts what’s important into a custom dashboard so users can see what they need to know immediately.

What Zoomdata does is offer hope to its customers for less human fiddling with data and faster outputs of actionable intelligence. If you recall how IBM i2 and Palantir Gotham work, humans are needed. IBM even snagged Palantir’s jargon of AI for “augmented intelligence.”

In-Q-Tel wants more smart software with less dependence on expensive, hard to train, and often careless humans. When incoming rounds hit near a mobile operations center, it is possible to lose one’s train of thought.

Zoomdata has some Booz, Allen DNA, some MIT RNA, and protein from other essential chemicals.

The write up mentions Palantir, but does not make explicit the need to reduce to some degree the human-centric approaches which are part of the major systems’ core architecture. You have nifty cloud stuff, but you have less nifty humans in most mission critical work processes.

To speed up the outputs, software should be the answer. An investment in Zoomdata delivers three messages to me here in rural Kentucky:

  1. In-Q-Tel continues to look for ways to move along the “less wait and less weight” requirement of those involved in operations. “Weight” refers to heavy, old-fashioned systems. “Wait” refers to the latency imposed by manual processes.
  2. Zoomdata and other investments are whips to the flanks of the BAE Systems, IBMs, and Palantirs chasing government contracts. The investment focuses attention not on scope changes but on figuring out how to deal with the unacceptable complexity and latency of many existing systems.
  3. In-Q-Tel has upped the value of Zoomdata. With consolidation in the commercial intelligence business rolling along at NASCAR speeds, it won’t take long before Zoomdata finds itself going to big company meetings to learn what the true costs of being acquired are.

For more information about Zoomdata, check out the paid-for reports at this link.

Stephen E Arnold, September 15, 2016

How Collaboration and Experimentation Are Key to Advancing Machine Learning Technology

September 12, 2016

The article on CIO titled Machine Learning “Still a Cottage Industry” conveys the sentiments of a man at the heart of the industry in Australia, Professor Bob Williamson. Williamson is the Commonwealth Scientific and Industrial Research Organisation’s (CSIRO’s) Data 61 group chief scientist. His work in machine learning and data analytics led him to the conclusion that for machine learning to truly move forward, scientists must find a way to collaborate. He is quoted in the article,

“There’s these walled gardens: ‘I’ve gone and coded my models in a particular way, you’ve got your models coded in a different way, we can’t share’. This is a real challenge for the community. No one’s cracked this yet.” A number of start-ups have entered the “machine-learning-as-a-service” market, such as BigML, Wise.io and Precog, and the big names including IBM, Microsoft and Amazon haven’t been far behind. Though these MLaaSs herald some impressive results, Williamson warned businesses to be cautious.

Williamson speaks to the possibility of stagnation in machine learning due to the emphasis on data mining as opposed to experimenting. He hopes businesses will do more with their data than simply look for patterns. It is a refreshing take on the industry from an outsider/insider, a scientist more interested in the science of it all than the massive stacks of cash at stake.

Chelsea Kerwin, September 12, 2016

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
There is a Louisville, Kentucky Hidden Web/Dark Web meet up on September 27, 2016.
Information is at this link: https://www.meetup.com/Louisville-Hidden-Dark-Web-Meetup/events/233599645/

Data: Lakes, Streams, Whatever

June 15, 2016

I read “Data Lakes vs Data Streams: Which Is Better?” The answer seems to me to be “both.” Streams are now. Lakes are “were.” Who wants to make decisions based on historical data? On the other hand, real time data may mislead the unwary data sailor. The write up states:

The availability of these new ways [lakes and streams] of storing and managing data has created a need for smarter, faster data storage and analytics tools to keep up with the scale and speed of the data. There is also a much broader set of users out there who want to be able to ask questions of their data themselves, perhaps to aid their decision making and drive their trading strategy in real-time rather than weekly or quarterly. And they don’t want to rely on or wait for someone else such as a dedicated business analyst or other limited resource to do the analysis for them. This increased ability and accessibility is creating whole new sets of users and completely new use cases, as well as transforming old ones.

Good news for self appointed lake and stream experts. Bad news for a company trying to figure out how to generate new revenues.

The first step may be to answer some basic questions about what data are available, their reliability, and which person “knows” about data wrangling. Checking whether the water is polluted is a good idea before diving into murky lakes or streams.

Stephen E Arnold, June 15, 2016

Stanford Offers Course Overviewing Roots of the Google Algorithm

March 23, 2016

The course syllabus for Stanford’s Computer Science class titled CS 349: Data Mining, Search, and the World Wide Web on Stanford.edu provides an overview of some of the technologies and advances that led to Google search. The syllabus states,

“There has been a close collaboration between the Data Mining Group (MIDAS) and the Digital Libraries Group at Stanford in the area of Web research. It has culminated in the WebBase project whose aims are to maintain a local copy of the World Wide Web (or at least a substantial portion thereof) and to use it as a research tool for information retrieval, data mining, and other applications. This has led to the development of the PageRank algorithm, the Google search engine…”

The syllabus alone offers some extremely useful insights that could help students and laypeople understand the roots of Google search. Key inclusions are the Digital Equipment Corporation (DEC) and PageRank, the algorithm named for Larry Page that enabled Google to become Google. The algorithm ranks web pages based on how many other websites link to them. Jon Kleinberg also played a key role by realizing that websites with lots of links (like a search engine) should also be seen as more important. The larger context of the course is data mining and information retrieval.
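The link-counting idea behind PageRank can be sketched in a few lines. This is a toy power-iteration version on an invented four-page web, not the WebBase project's code; the graph, damping factor, and iteration count are illustrative assumptions.

```python
# A minimal PageRank sketch: each page's rank is repeatedly redistributed
# to the pages it links to, with a damping factor modeling a random surfer.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}          # start with equal rank
    for _ in range(iterations):
        new_ranks = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = ranks[page] / len(outlinks)
                for target in outlinks:
                    new_ranks[target] += damping * share
            else:  # dangling page: spread its rank evenly
                for target in pages:
                    new_ranks[target] += damping * ranks[page] / n
        ranks = new_ranks
    return ranks

# Hypothetical toy web: every other page links to C, directly or not.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
# C accumulates the most rank because the most link "votes" point at it.
```

The ranks always sum to one, so a page's score can be read as the probability that a random surfer lands on it.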


Chelsea Kerwin, March 23, 2016

