November 30, 2016
It seems obvious to us, but apparently, some folks need a reminder. Harvard Business Review proclaims, “You Don’t Need Big Data, You Need the Right Data.” Perhaps that distinction has gotten lost in the Big Data hype. Writer Maxwell Wessel points to Uber as an example. Though the company does collect a lot of data, the key is in which data it collects, and which it does not. Wessel explains:
In an era before we could summon a vehicle with the push of a button on our smartphones, humans required a thing called taxis. Taxis, while largely unconnected to the internet or any form of formal computer infrastructure, were actually the big data players in rider identification. Why? The taxi system required a network of eyeballs moving around the city scanning for human-shaped figures with their arms outstretched. While it wasn’t Intel and Hewlett-Packard infrastructure crunching the data, the amount of information processed to get the job done was massive. The fact that the computation happened inside of human brains doesn’t change the quantity of data captured and analyzed. Uber’s elegant solution was to stop running a biological anomaly detection algorithm on visual data — and just ask for the right data to get the job done. Who in the city needs a ride and where are they? That critical piece of information let the likes of Uber, Lyft, and Didi Chuxing revolutionize an industry.
In order for businesses to decide which data is worth their attention, the article suggests three guiding questions: “What decisions drive waste in your business?” “Which decisions could you automate to reduce waste?” (Example—Amazon’s pricing algorithms) and “What data would you need to do so?” (Example—Uber requires data on potential riders’ locations to efficiently send out drivers.) See the article for more notes on each of these guidelines.
November 25, 2016
Despite knowing better, we still believe that if we press a few buttons and hit Enter, computers can do all the work for us. The advent of machine learning and artificial intelligence does not dispel this belief; instead, big data vendors rely on this image to sell their wares. Big data, though, has its weaknesses, and before you deploy a solution you should read Network World’s “6 Machine Learning Misunderstandings.”
Drawing on the expertise of Juniper Networks security intelligence software engineer Roman Sinayev, the article explains some of the pitfalls to avoid before implementing big data technology. It is important to take all the variables into consideration, including the unexpected ones; otherwise, one forgotten factor could wreak havoc on your system. Also, do not forget to actually understand the data you are analyzing and its origin. Pushing forward on a project without understanding the data’s background is a guaranteed failure.
Other practical advice includes building a test model and adding more data when the model does not deliver, but some advice is new even to us:
One type of algorithm that has recently been successful in practical applications is ensemble learning – a process by which multiple models combine to solve a computational intelligence problem. One example of ensemble learning is stacking simple classifiers like logistic regressions. These ensemble learning methods can improve predictive performance more than any of these classifiers individually.
Employing more than one algorithm? It makes sense and is practical advice; why did that not cross our minds? The rest of the advice is general guidance that can be applied to any project in any field; just change the lingo and the expert providing it.
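The stacking idea from the quote above can be sketched in a few lines of Python. This is a toy illustration on synthetic data we made up, not anything from the article: two deliberately weak rule-based classifiers each watch a single feature, and a small logistic-regression meta-learner trained on their outputs beats either one alone.

```python
import math
import random

random.seed(0)

# Synthetic 2-D data: the true label is 1 only when BOTH coordinates
# exceed 0.5 (an AND rule neither base classifier can see on its own).
data = [(random.random(), random.random()) for _ in range(400)]
labels = [1 if x1 > 0.5 and x2 > 0.5 else 0 for x1, x2 in data]

# Two weak base classifiers, each watching a single coordinate.
def clf_a(point):
    return 1 if point[0] > 0.5 else 0

def clf_b(point):
    return 1 if point[1] > 0.5 else 0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Stacking: a logistic-regression meta-learner trained on the base
# classifiers' outputs, fit with plain stochastic gradient descent.
w, b, lr = [0.0, 0.0], 0.0, 0.5
for _ in range(200):
    for point, target in zip(data, labels):
        feats = [clf_a(point), clf_b(point)]
        p = sigmoid(w[0] * feats[0] + w[1] * feats[1] + b)
        err = p - target
        w = [wi - lr * err * f for wi, f in zip(w, feats)]
        b -= lr * err

def stacked(point):
    feats = [clf_a(point), clf_b(point)]
    return 1 if sigmoid(w[0] * feats[0] + w[1] * feats[1] + b) > 0.5 else 0

def accuracy(clf):
    return sum(clf(p) == t for p, t in zip(data, labels)) / len(data)
```

Each base classifier scores around 75% on this data, while the stacked model recovers the AND rule and scores near 100%, which is exactly the “improve predictive performance more than any of these classifiers individually” effect the quote describes.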
November 25, 2016
A brief write-up at the ontotext blog, “The Knowledge Discovery Quest,” presents a noble vision of the search field. Philologist and blogger Teodora Petkova observed that semantic search is the key to bringing together data from different sources and exploring connections. She elaborates:
On a more practical note, semantic search is about efficient enterprise content usage. As one of the biggest losses of knowledge happens due to inefficient management and retrieval of information. The ability to search for meaning not for keywords brings us a step closer to efficient information management.
If semantic search had a separate icon from the one traditional search has it would have been a microscope. Why? Because semantic search is looking at content as if through the magnifying lens of a microscope. The technology helps us explore large amounts of systems and the connections between them. Sharpening our ability to join the dots, semantic search enhances the way we look for clues and compare correlations on our knowledge discovery quest.
At the bottom of the post is a slideshow on this “knowledge discovery quest.” Sure, it also serves to illustrate how ontotext could help, but we can’t blame them for drumming up business through their own blog. We actually appreciate the company’s approach to semantic search, and we’d be curious to see how they manage the intricacies of content conversion and normalization. Founded in 2000, ontotext is based in Bulgaria.
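The “search for meaning, not for keywords” point can be shown with a deliberately tiny sketch. The synonym map and documents below are invented for illustration; real semantic search engines like ontotext’s are built on knowledge graphs and ontologies, not hand-built dictionaries.

```python
# Toy contrast between keyword matching and a minimal "semantic"
# expansion using a hand-built synonym map (purely illustrative).
SYNONYMS = {
    "car": {"car", "automobile", "vehicle"},
    "buy": {"buy", "purchase", "acquire"},
}

def keyword_search(query_terms, documents):
    # A document matches only if every query term appears verbatim.
    return [d for d in documents
            if all(t in d.lower().split() for t in query_terms)]

def semantic_search(query_terms, documents):
    # A document matches if each query term, or any synonym of it,
    # appears -- a crude stand-in for matching on meaning.
    expanded = [SYNONYMS.get(t, {t}) for t in query_terms]
    hits = []
    for d in documents:
        words = set(d.lower().split())
        if all(words & alternatives for alternatives in expanded):
            hits.append(d)
    return hits

docs = ["We want to purchase an automobile", "He sold his bicycle"]
assert keyword_search(["buy", "car"], docs) == []
assert semantic_search(["buy", "car"], docs) == ["We want to purchase an automobile"]
```

The keyword search misses the relevant document entirely because no query term appears verbatim; the expanded search finds it, which is the efficiency gain the quoted post is after.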
November 22, 2016
Shakespeare is regarded as the greatest writer in the English language. Many studies, however, are devoted to the theory that he did not pen all of his plays and poems. Some attribute them to Francis Bacon, Edward de Vere, Christopher Marlowe, and others. Whether Shakespeare was a singular author or one of many, two facts remain: he was a dirty old man, and it could be said he plagiarized his ideas from other writers. Shall he still be regarded as the figurehead for English literature?
Philly.com takes up the Shakespeare authorship question in the article “Penn Engineers Use Big Data To Show Shakespeare Had Coauthor On ‘Henry VI’ Plays.” Editors of a new edition of Shakespeare’s complete works listed Marlowe as a coauthor on the Henry VI plays due to a recent study at the University of Pennsylvania. Alejandro Ribeiro realized that his experience researching networks could be applied to the Shakespeare authorship question using big data.
Ribeiro learned that Henry VI was among the works for which scholars thought Shakespeare might have had a co-author, so he and lab members Santiago Segarra and Mark Eisen tackled the question with the tools of big data. Working with Shakespeare expert Gabriel Egan of De Montfort University in Leicester, England, they analyzed the proximity of certain target words in the playwright’s works, developing a statistical fingerprint that could be compared with those of other authors from his era.
Two other research groups reached the same conclusion using different analytical techniques. The results from all three studies were enough to convince Gary Taylor, lead general editor of the New Oxford Shakespeare, who decided to list Marlowe as a coauthor of Henry VI. More research has been conducted to identify other potential Shakespeare coauthors, and six more will also be credited in the New Oxford editions.
Ribeiro and his team created “word-adjacency networks” that uncovered patterns in the writing styles of Shakespeare and six other dramatists. They discovered that many scenes in Henry VI were not written in Shakespeare’s style, enough to indicate a coauthor.
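To get a feel for the technique, here is a stripped-down sketch of the word-adjacency idea, not the Penn team’s actual method or word list: count how often pairs of common function words occur near each other, normalize the counts into a stylistic profile, and compare two texts’ profiles by distance.

```python
import re
from collections import defaultdict

# A handful of function words for illustration; the actual study used a
# larger, carefully curated list.
FUNCTION_WORDS = {"the", "and", "of", "to", "in", "that", "with", "for"}

def adjacency_profile(text, window=5):
    """Count ordered pairs of function words occurring within `window`
    words of each other, normalized into a frequency profile."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = defaultdict(float)
    for i, w in enumerate(words):
        if w not in FUNCTION_WORDS:
            continue
        for j in range(i + 1, min(i + 1 + window, len(words))):
            if words[j] in FUNCTION_WORDS:
                counts[(w, words[j])] += 1.0
    total = sum(counts.values()) or 1.0
    return {pair: c / total for pair, c in counts.items()}

def profile_distance(p, q):
    """L1 distance between two adjacency profiles; smaller means the
    two texts use function words in more similar patterns."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
```

Run on known-author texts, profiles like these become the “statistical fingerprint” the article mentions: a disputed scene is attributed to whichever candidate author’s fingerprint it sits closest to.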
Some Shakespeare purists remain opposed to the theory that Shakespeare did not pen all of his plays, but big data analytics supports many of the theories that academics have advanced for generations. The dirty old man was not alone as he wrote his ditties.
November 18, 2016
I love election years! Actually, that is sarcasm. Election years bring out the worst in Americans. The media runs rampant with predictions that each nominee is the equivalent of the anti-Christ and will “doom America,” “ruin the nation,” or “destroy humanity.” The sane voter knows that whoever the next president is will probably not destroy the nation or everyday life…much. Fear, hysteria, and paranoia sell better than puff pieces, and big data supports that theory. Popular news site Newsweek shares that, “Our Trust In Big Data Shows We Don’t Trust Ourselves.”
The article starts with a new acronym: DATA. It is not that new, but Newsweek takes a new spin on it. D means dimensions or different datasets, the ability to combine multiple data streams for new insights. A is for automatic, which is self-explanatory. T stands for time and how data is processed in real time. The second A is for artificial intelligence that discovers all the patterns in the data.
Artificial intelligence is where the problems start to emerge. Big data algorithms can be unintentionally programmed with bias. In order to interpret data, artificial intelligence must learn from prior datasets. These older datasets can show human bias, such as racism, sexism, and socioeconomic prejudices.
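How bias survives training can be shown with an entirely synthetic toy, invented here for illustration: historical hiring records in which qualified candidates from one group were rejected anyway, and a simple frequency-based model fit to those labels.

```python
from collections import defaultdict

# Synthetic "historical" hiring data: qualified candidates from group B
# were rejected anyway. Rows are (group, qualified, hired).
history = [
    ("A", True, True), ("A", True, True), ("A", False, False),
    ("B", True, False), ("B", True, False), ("B", False, False),
]

def train(rows):
    """Predict the majority historical outcome for each
    (group, qualified) combination -- a minimal stand-in for a model
    fit to past decisions."""
    tally = defaultdict(list)
    for group, qualified, hired in rows:
        tally[(group, qualified)].append(hired)
    return {key: sum(outcomes) > len(outcomes) / 2
            for key, outcomes in tally.items()}

model = train(history)

# Identical qualifications, different predictions: the model has
# faithfully learned the prejudice baked into its training labels.
assert model[("A", True)] is True
assert model[("B", True)] is False
```

Nothing in the training code mentions group membership as a criterion; the bias arrives entirely through the historical labels, which is the mechanism the paragraph above describes.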
Our machines are not as objective as we believe:
But our readiness to hand over difficult choices to machines tells us more about how we see ourselves.
Instead of seeing a job applicant as a person facing their own choices, capable of overcoming their disadvantages, they become a data point in a mathematical model. Instead of seeing an employer as a person of judgment, bringing wisdom and experience to hard decisions, they become a vector for unconscious bias and inconsistent behavior. Why do we trust the machines, biased and unaccountable as they are? Because we no longer trust ourselves.
Newsweek really knows how to be dramatic. We no longer trust ourselves? No, we trust ourselves more than ever, because we rely on machines to make our simple decisions so we can concentrate on more important topics. However, what we deem important is biased. Taking the Newsweek example, what a job applicant considers an important submission, an HR representative will see as the 500th submission that week. Big data should provide us with better, more diverse perspectives.
November 15, 2016
Data crunching done by an information security firm reveals that around 55% of the activity on the Dark Web is legal and mundane, much like the clear or Open Web.
Digital Journal, which published the article “Despite its Nefarious Reputation, New Report Finds Majority of Activity on the Dark Web is Totally Legal and Mundane,” says that:
“What we’ve found is that the dark web isn’t quite as dark as you may have thought,” said Emily Wilson, Director of Analysis at Terbium Labs. “The vast majority of dark web research to date has focused on illegal activity while overlooking the existence of legal content. We wanted to take a complete view of the dark web to determine its true nature and to offer readers of this report a holistic view of dark web activity — both good and bad.”
The findings have been curated in a report, “The Truth About the Dark Web: Separating Fact from Fiction,” that puts the Dark Web in a new light. According to this report, around 55% of the content on the Dark Web is legal; porn makes up 7% of Dark Web content, and most of it is legal. Though drugs are a favorite topic, only 45% of drug-related content can be termed illegal. Fraud, extremism, and illegal weapons trading, on the other hand, make up just 5-7% of the Dark Web.
The research was conducted using a mix of machine intelligence and human intelligence, as pointed out in the article:
“Conducting research on the dark web is a difficult task because the boundaries between categories are unclear,” said Clare Gollnick, Chief Data Scientist at Terbium Labs. “We put significant effort into making sure this study was based on a representative, random sample of the dark web. We believe the end result is a fair and comprehensive assessment of dark web activity, with clear acknowledgment of the limitations involved in both dark web data specifically and broader limitations of data generally.”
The Dark Web is slowly gaining traction as users of the Open Web discover the utilities of this hidden portion of the Internet. Though the study is illuminating indeed, it fails to address how much of the illegal activity or content on the Dark Web affects the real world. For instance, what quantity of the drug trade takes place over the Dark Web? Any answers?
November 8, 2016
Iceland is a northern country that one does not think about much. It is cold, has a high literacy rate, and did we mention it was cold? Despite its frigid temperatures, Iceland is a beautiful country with a rich culture and friendly people. The country’s new endeavor shows just how friendly the Icelanders are: “Iceland Launches ‘Ask Guðmundur,’ The World’s First Human Search Engine.”
Here is what the country is doing:
The decidedly Icelandic and truly personable service will see special representatives from each of Iceland’s seven regions offer their insider knowledge to the world via Inspired By Iceland’s social media platforms (Twitter, Facebook and YouTube). Each representative shares the name Guðmundur or Guðmunda, currently one of the most popular forenames in the country with over 4,000 men and women claiming it as their own.
Visitors to the site can submit their questions and have them answered by an expert. Each of the seven Guðmundurs is a regional expert on Iceland. Iceland’s goal with the human search engine is to answer the world’s questions about the country, but to answer them in the most human way possible: with actual humans.
A human search engine is an awesome marketing campaign for Iceland. One of the best ways to encourage tourism is to introduce foreigners to the local people and customs; the more welcoming, quirky, and interesting, the better for Iceland. So go ahead, ask Guðmundur.
November 7, 2016
One of our favorite companies to track is Lucidworks, due to their commitment to open source technology and development in business enterprise systems. The San Diego Times shares that “Lucidworks Integrates IBM Watson To Fusion Enterprise Discovery Platform.” This means that Lucidworks has integrated IBM’s supercomputer into their Fusion platform to help developers create discovery applications to capture data and discover insights. In short, they have added a powerful big data algorithm.
While Lucidworks is built on open source software, adding a proprietary supercomputer will only benefit their clients. Watson has proven itself an invaluable big data tool and, paired with the Fusion platform, will do wonders for enterprise systems. Data is a key component of every industry, but understanding and implementing it is difficult:
Lucidworks’ Fusion is an application framework for creating powerful enterprise discovery apps that help organizations access all their information to make better, data-driven decisions. Fusion can process massive amounts of structured and multi-structured data in context, including voice, text, numerical, and spatial data. By integrating Watson’s ability to read 800 million pages per second, Fusion can deliver insights within seconds. Developers benefit from this platform by cutting down the work and time it takes to create enterprise discovery apps from months to weeks.
With the Watson upgrade to Lucidworks’ Fusion platform, users gain natural language processing and machine learning. It makes the Fusion platform act more like a Star Trek computer that can provide data analysis and even interpret results.
November 4, 2016
DNA does not lie. Or rather, DNA does not lie if the testing is conducted accurately by an experienced geneticist. Right now it is popular for people to get their DNA tested to discover where their ancestors came from. Many testers are surprised when they receive their results, because they learn their ancestors came from unexpected places. Black Americans are especially eager to learn about their genetics, due to their slave ancestry and lack of familial records. For many Black Americans, DNA is the only way they can learn where their roots originated, but Africa is not entirely cataloged.
According to Science Daily’s article “Major Racial Bias Found In Leading Genomics Database,” if you have African ancestry and get a DNA test, it will be difficult to pinpoint your results. The two largest genomics databases that geneticists refer to contain a measurable bias toward European genes. From a logical standpoint, this is understandable, as Africa has the largest genetic diversity and remains a developing continent without the best access to scientific advances. These factors present challenges for geneticists as they try to solve the African genetic puzzle.
It also weighs heavily on Black Americans, because they are missing a significant component of their genetic make-up that could reveal vital health information. Most Black Americans today carry a percentage of European ancestry. While the European side of their DNA can be traced, their African heritage is more likely to yield clouded results. On a financial scale, it is more expensive to test Black Americans’ genetics due to the lack of information, and the results are still not going to be as accurate as those for a European genome.
“This groundbreaking research by Dr. O’Connor and his team clearly underscores the need for greater diversity in today’s genomic databases,” says UM SOM Dean E. Albert Reece, MD, PhD, MBA, who is also Vice President of Medical Affairs at the University of Maryland and the John Z. and Akiko Bowers Distinguished Professor at UM SOM. “By applying the genetic ancestry data of all major racial backgrounds, we can perform more precise and cost-effective clinical diagnoses that benefit patients and physicians alike.”
While Africa is a large continent, the Human Genome Project and other genetic organizations should apply for grants to fund expeditions there. Geneticists and biologists would then canvass Africa, collect cheek swabs from willing populations, return with the DNA to sequence it, and add the results to the databases. Would it be expensive? Yes, but it would advance medical knowledge and reveal more information about human history. After all, we all originate from Mother Africa.
November 4, 2016
The article titled “Companies are Falling Short in Data Management” on IT ProPortal describes the obstacles facing many businesses when it comes to data management optimization. Why does this matter? The article states that big data analytics and the internet of things will combine to form an over $300 billion industry by 2020. Companies that fail to build up their capabilities will lose out—big. The article explains,
More than two thirds of data management leaders believe they have an effective data management strategy. They also believe they are approaching data cleansing and analytics the right way…The [SAS] report also says that approximately 10 per cent of companies it calls ‘laggards’, believe the same thing. The problem is – there are as many ‘laggards’, as there are leaders in the majority of industries, which leads SAS to a conclusion that ‘many companies are falling short in data management’.
In order to avoid this trend, company leaders must identify the obstacles impeding their path. A better focus on staff training and development is only possible after recognizing that a lack of internal skills is one of the most common issues. Additionally, companies must clearly define their data strategy and disseminate the vision among all levels of personnel.