Add Free Search to the Free Tibet Slogan

December 13, 2016

China is notorious for censoring its people's access to the Internet.  I have heard and made more than one pun about the Great Firewall of China.  There is a new search engine in China, but it is not in Chinese, says Quartz: "How Censored Is China's First Tibetan Language Search Engine? It Omits the Dalai Lama's Web Site."

Yongzin is the first Tibetan language search engine.  It is supposed to act as a unified portal for all the major Tibetan language Web sites in China.  There are seven million Tibetan people in China, but the two big Chinese search engines, Baidu and Sogou, do not support the Tibetan language.  Google is banned in China.

Yongzin rips off Google in colors and function.  The Chinese government has dealt with tense issues related to the country of Tibet for decades:

The Chinese government wants the service to act as a propaganda tool too. In the future, Yongzin will provide data for the government to guide public opinion across Tibet, and monitor information in Tibetan online for “information security” purposes, Tselo, who’s in charge of Yongzin’s development, told state media (link in Chinese) at Monday’s (Aug. 22) launch event.

When people search Yongzin for Tibet-related keywords, such as Dalai Lama and Tibetan tea, China's censorship shows itself at work.  Nothing related to the Dalai Lama is shown, not even his Web site; instead, the results include an article about illegal publications.

China wants to position itself as guardian of Tibetan culture, but instead it proffers a Chinese-washed version of Tibet rather than the real thing.  It is another reason why the Free Tibet campaign is still important.

Whitney Grace, December 13, 2016

How Big a Hurdle Is Encryption Really?

December 12, 2016

At first blush, the recent Wiretap Report 2015 from United States Courts would seem to contradict law enforcement’s constant refrain that encryption is making their jobs difficult. Motherboard declares, “Feds and Cops Encountered Encryption in Only 13 Wiretaps in 2015.” This small number is down from 2014. Isn’t this evidence that law enforcement agencies are exaggerating their troubles? The picture is not quite so simple. Reporter Lorenzo Franceschi-Bicchierai writes:

Both FBI director James Comey, as well as Deputy Attorney General Sally Yates, argued last year that the Wiretap Report is not a good indicator. Yates said that the Wiretap Report only reflects the number of interception requests 'that are sought' and not those where an investigator doesn't even bother asking for a wiretap 'because the provider has asserted that an intercept solution does not exist.'

Obtaining a wiretap order in criminal investigations is extremely resource-intensive as it requires a huge investment in agent and attorney time,’ Yates wrote, answering questions from the chairman of the Senate’s Judiciary Committee, Sen. Chuck Grassley (R-IA). ‘It is not prudent for agents and prosecutors to devote resources to this task if they know in advance that the targeted communications cannot be intercepted.

That's why Comey promised the agency is working on improving data collection 'to better explain' the problem with encryption when data is in motion. It's unclear when these new, improved numbers will come out.

Of course, to what degree encryption actually hampers law enforcement is only one piece of a complex issue—whether we should mandate that law enforcement be granted “back doors” to every device they’d like to examine. There are the crucial civil rights concerns, and the very real possibility that where law enforcement can get in, so too can hackers. It is a factor, though, that we must examine objectively. Perhaps when we get that “better” data from the FBI, the picture will be more clear.

Cynthia Murrell, December 12, 2016

IBM Thinks Big on Data Unification

December 7, 2016

So far, the big data phenomenon has underwhelmed. We have developed several good ways to collect, store, and analyze data. However, those several ways have resulted in separate, individually developed systems that do not play well together. IBM hopes to fix that, we learn from “IBM Announces a Universal Platform for Data Science” at Forbes. They call the project the Data Science Experience. Writer Greg Satell explains:

Consider a typical retail enterprise, which has separate operations for purchasing, point-of-sale, inventory, marketing and other functions. All of these are continually generating and storing data as they interact with the real world in real time. Ideally, these systems would be tightly integrated, so that data generated in one area could influence decisions in another.

The reality, unfortunately, is that things rarely work together so seamlessly. Each of these systems stores information differently, which makes it very difficult to get full value from data. To understand how, for example, a marketing campaign is affecting traffic on the web site and in the stores, you often need to pull it out of separate systems and load it into excel sheets.

That, essentially, has been what’s been holding data science back. We have the tools to analyze mountains of data and derive amazing insights in real time. New advanced cognitive systems, like Watson, can then take that data, learn from it and help guide our actions. But for all that to work, the information has to be accessible.”

The article acknowledges the progress that has been made in this area, citing the open-source Hadoop framework and the Spark processing engine for their ability to tap into clusters of data around the world and analyze that data as a single set. Incompatible systems, however, still vex many organizations.
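
As a rough sketch of what that looks like in practice (our example, not the article's; the file paths and column names below are hypothetical), Spark can pull records from separate stores and treat the joined result as one queryable data set:

```python
# A minimal PySpark sketch (hypothetical paths and columns): read data held in
# two separate systems and analyze the combined result as a single data set.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unify-example").getOrCreate()

# Point-of-sale records from a Hadoop cluster; marketing spend from a CSV export
pos = spark.read.parquet("hdfs:///retail/pos/")             # store_id, day, revenue
marketing = spark.read.csv("s3a://exports/marketing.csv",   # store_id, day, spend
                           header=True, inferSchema=True)

# Join the two sources, then ask one question of the unified data
combined = pos.join(marketing, ["store_id", "day"])
combined.groupBy("store_id").avg("revenue", "spend").show()
```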

The article closes with an interesting observation—that many business people’s mindsets are stuck in the past. Planning far ahead is considered prudent, as is taking ample time to make any big decision. Technology has moved past that, though, and now such caution can render the basis for any decision obsolete as soon as it is made. As Satell puts it, we need “a more Bayesian approach to strategy, where we don’t expect to predict things and be right, but rather allow data streams to help us become less wrong over time.” Can the humans adapt to this way of thinking? It is reassuring to have a plan; I suspect only the most adaptable among us will feel comfortable flying by the seat of our pants.
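
To make the "less wrong over time" idea concrete, here is a toy Bayesian update (our illustration, not Satell's): a conversion-rate estimate starts out vague and narrows as each new batch of observations arrives, instead of waiting for one final verdict.

```python
# Toy Beta-Binomial updating (illustrative numbers only): each batch of data
# shifts the estimate rather than confirming or refuting a fixed prediction.
from scipy.stats import beta

a, b = 1, 1                               # uniform prior on an unknown conversion rate
batches = [(12, 88), (9, 91), (20, 80)]   # (successes, failures) per week, made up

for successes, failures in batches:
    a, b = a + successes, b + failures    # posterior after seeing the batch
    lo, hi = beta.ppf([0.05, 0.95], a, b)
    print(f"estimate {a / (a + b):.3f}, 90% interval ({lo:.3f}, {hi:.3f})")
```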

Cynthia Murrell, December 7, 2016

Emphasize Data Suitability over Data Quantity

November 30, 2016

It seems obvious to us, but apparently, some folks need a reminder. Harvard Business Review proclaims, “You Don’t Need Big Data, You Need the Right Data.” Perhaps that distinction has gotten lost in the Big Data hype. Writer Maxwell Wessel points to Uber as an example. Though the company does collect a lot of data, the key is in which data it collects, and which it does not. Wessel explains:

In an era before we could summon a vehicle with the push of a button on our smartphones, humans required a thing called taxis. Taxis, while largely unconnected to the internet or any form of formal computer infrastructure, were actually the big data players in rider identification. Why? The taxi system required a network of eyeballs moving around the city scanning for human-shaped figures with their arms outstretched. While it wasn’t Intel and Hewlett-Packard infrastructure crunching the data, the amount of information processed to get the job done was massive. The fact that the computation happened inside of human brains doesn’t change the quantity of data captured and analyzed. Uber’s elegant solution was to stop running a biological anomaly detection algorithm on visual data — and just ask for the right data to get the job done. Who in the city needs a ride and where are they? That critical piece of information let the likes of Uber, Lyft, and Didi Chuxing revolutionize an industry.

In order for businesses to decide which data is worth their attention, the article suggests three guiding questions: “What decisions drive waste in your business?” “Which decisions could you automate to reduce waste?” (Example—Amazon’s pricing algorithms) and “What data would you need to do so?” (Example—Uber requires data on potential riders’ locations to efficiently send out drivers.) See the article for more notes on each of these guidelines.

Cynthia Murrell, November 30, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Machine Learning Does Not Have All the Answers

November 25, 2016

Despite our broader knowledge, we still believe that if we press a few buttons and hit enter, computers can do all the work for us.  The advent of machine learning and artificial intelligence does not dispel this belief; instead, big data vendors rely on this image to sell their wares.  Big data, though, has its weaknesses, and before you deploy a solution you should read Network World's "6 Machine Learning Misunderstandings."

Juniper Networks security intelligence software engineer Roman Sinayev explains some of the pitfalls to avoid before implementing big data technology.  It is important to take all the variables into consideration, including unexpected ones; otherwise, one forgotten factor could wreak havoc on your system.  Also, do not forget to actually understand the data you are analyzing and its origin.  Pushing forward on a project without understanding the data's background is a guaranteed fail.

Other practical advice is to build a test model and to add more data when the model does not deliver, but some advice that is new even to us is:

One type of algorithm that has recently been successful in practical applications is ensemble learning – a process by which multiple models combine to solve a computational intelligence problem. One example of ensemble learning is stacking simple classifiers like logistic regressions. These ensemble learning methods can improve predictive performance more than any of these classifiers individually.

Employing more than one algorithm?  It makes sense and is practical advice; why did that not cross our minds?  The rest of the advice offered is general stuff that can be applied to any project in any field; just change the lingo and the expert providing it.
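
For the curious, here is a small sketch of the stacking idea the quote describes, using scikit-learn's StackingClassifier (our example, not Network World's; the synthetic data and choice of base models are arbitrary):

```python
# A minimal stacking sketch: several simple classifiers are combined by a
# final logistic regression, often beating any single one of them.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

stack = StackingClassifier(
    estimators=[("nb", GaussianNB()),
                ("tree", DecisionTreeClassifier(max_depth=3)),
                ("logit", LogisticRegression(max_iter=1000))],
    final_estimator=LogisticRegression(),
)

for name, model in [("naive bayes alone", GaussianNB()), ("stacked ensemble", stack)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```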

Whitney Grace, November 25, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph


The Noble Quest Behind Semantic Search

November 25, 2016

A brief write-up at the ontotext blog, “The Knowledge Discovery Quest,” presents a noble vision of the search field. Philologist and blogger Teodora Petkova observed that semantic search is the key to bringing together data from different sources and exploring connections. She elaborates:

On a more practical note, semantic search is about efficient enterprise content usage. As one of the biggest losses of knowledge happens due to inefficient management and retrieval of information. The ability to search for meaning not for keywords brings us a step closer to efficient information management.

If semantic search had a separate icon from the one traditional search has it would have been a microscope. Why? Because semantic search is looking at content as if through the magnifying lens of a microscope. The technology helps us explore large amounts of systems and the connections between them. Sharpening our ability to join the dots, semantic search enhances the way we look for clues and compare correlations on our knowledge discovery quest.

At the bottom of the post is a slideshow on this “knowledge discovery quest.” Sure, it also serves to illustrate how ontotext could help, but we can’t blame them for drumming up business through their own blog. We actually appreciate the company’s approach to semantic search, and we’d be curious to see how they manage the intricacies of content conversion and normalization. Founded in 2000, ontotext is based in Bulgaria.

Cynthia Murrell, November 25, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Exit Shakespeare, for He Had a Coauthor

November 22, 2016

Shakespeare is regarded as the greatest writer in the English language.  Many studies, however, are devoted to the theory that he did not pen all of his plays and poems.  Some attribute them to Francis Bacon, Edward de Vere, Christopher Marlowe, and others.  Whether Shakespeare was a singular author or one of many, two facts remain: he was a dirty old man, and it could be said he plagiarized his ideas from other writers.  Shall he still be regarded as the figurehead for English literature?

Philly.com takes up the Shakespeare authorship question in the article "Penn Engineers Use Big Data To Show Shakespeare Had Coauthor On 'Henry VI' Plays."  Editors of a new edition of Shakespeare's complete works listed Marlowe as a coauthor on the Henry VI plays due to a recent study at the University of Pennsylvania.  Alejandro Ribeiro realized that his experience researching networks could be applied to the Shakespeare authorship question using big data.

Ribeiro learned that Henry VI was among the works for which scholars thought Shakespeare might have had a co-author, so he and lab members Santiago Segarra and Mark Eisen tackled the question with the tools of big data.  Working with Shakespeare expert Gabriel Egan of De Montfort University in Leicester, England, they analyzed the proximity of certain target words in the playwright’s works, developing a statistical fingerprint that could be compared with those of other authors from his era.

Two other research groups reached the same conclusion using other analytical techniques.  The results from all three studies were enough to convince Gary Taylor, lead general editor of the New Oxford Shakespeare, who decided to list Marlowe as a coauthor of Henry VI.  More research has been conducted to identify other potential Shakespeare coauthors, and six more writers will also be credited in the New Oxford editions.

Ribeiro and his team created "word-adjacency networks" that captured patterns in the writing styles of Shakespeare and six other dramatists.  They discovered that many scenes in Henry VI were not written in Shakespeare's style, enough to indicate a coauthor.
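
As a rough illustration of the word-adjacency idea (a drastic simplification of the Penn team's method, with a made-up function-word list), one can count how often common function words appear near one another in a text and compare the resulting profiles between authors:

```python
# A toy word-adjacency profile (illustrative only): count how often common
# function words occur within a small window of one another, then compare
# two texts' normalized profiles.
import re
from collections import Counter

FUNCTION_WORDS = ["the", "and", "of", "to", "in", "that", "my", "with", "for", "but"]

def adjacency_profile(text, window=5):
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, w in enumerate(words):
        if w not in FUNCTION_WORDS:
            continue
        for v in words[i + 1 : i + 1 + window]:
            if v in FUNCTION_WORDS:
                counts[(w, v)] += 1
    total = sum(counts.values()) or 1
    return {pair: n / total for pair, n in counts.items()}

def profile_distance(p, q):
    pairs = set(p) | set(q)
    return sum(abs(p.get(k, 0) - q.get(k, 0)) for k in pairs)

# Usage idea: compare a disputed scene against profiles built from known works,
# e.g. profile_distance(adjacency_profile(scene), adjacency_profile(known_corpus))
```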

Some Shakespeare purists remain opposed to the theory that Shakespeare did not pen all of his plays, but big data analytics supports many of the theories that other academics have advanced for generations.  The dirty old man was not alone as he wrote his ditties.

Whitney Grace, November 22, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph


Big Data Teaches Us We Are Big Paranoid

November 18, 2016

I love election years!  Actually, that is sarcasm.  Election years bring out the worst in Americans.  The media runs rampant with predictions that each nominee is the equivalent of the anti-Christ and will "doom America," "ruin the nation," or "destroy humanity."  The sane voter knows that whoever the next president is will probably not destroy the nation or everyday life…much.  Fear, hysteria, and paranoia sell better than puff pieces, and big data supports that theory.  Popular news site Newsweek makes the case in "Our Trust In Big Data Shows We Don't Trust Ourselves."

The article starts with a new acronym: DATA.  It is not that new, but Newsweek puts a new spin on it.  D means dimensions or different datasets, the ability to combine multiple data streams for new insights.  A is for automatic, which is self-explanatory.  T stands for time and how data is processed in real time.  The second A is for artificial intelligence that discovers all the patterns in the data.

Artificial intelligence is where the problems start to emerge.  Big data algorithms can be unintentionally programmed with bias.  In order to interpret data, artificial intelligence must learn from prior datasets.  These older datasets can show human bias, such as racism, sexism, and socioeconomic prejudices.
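
A toy example (ours, not Newsweek's) shows the mechanism: if historical decisions encode a penalty against one group, a model trained on those decisions quietly reproduces the disparity.

```python
# A minimal sketch of bias propagation: hypothetical hiring data in which past
# decisions penalized one group, and a classifier trained on those labels
# repeats the pattern.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000
group = rng.integers(0, 2, n)                 # a protected attribute, 0 or 1
skill = rng.normal(0, 1, n)                   # identically distributed in both groups
# Historical "hired" labels embed a penalty against group 1
hired = (skill - 0.8 * group + rng.normal(0, 0.5, n)) > 0

model = LogisticRegression().fit(np.column_stack([group, skill]), hired)

# The trained model reproduces the historical disparity
for g in (0, 1):
    X_g = np.column_stack([np.full(n, g), skill])
    print(f"predicted hire rate for group {g}: {model.predict(X_g).mean():.2f}")
```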

Our machines are not as objective as we believe:

But our readiness to hand over difficult choices to machines tells us more about how we see ourselves.

Instead of seeing a job applicant as a person facing their own choices, capable of overcoming their disadvantages, they become a data point in a mathematical model. Instead of seeing an employer as a person of judgment, bringing wisdom and experience to hard decisions, they become a vector for unconscious bias and inconsistent behavior.  Why do we trust the machines, biased and unaccountable as they are? Because we no longer trust ourselves.”

Newsweek really knows how to be dramatic.  We no longer trust ourselves?  No, we trust ourselves more than ever, because we rely on machines to make our simple decisions so we can concentrate on more important topics.  However, what we deem important is biased.  Taking the Newsweek example, what a job applicant considers an important submission, an HR representative will see as the 500th submission that week.  Big data should provide us with better, more diverse perspectives.

Whitney Grace, November 18, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Most Dark Web Content Is Legal and Boring

November 15, 2016

Data crunching done by an information security firm reveals that around 55% of Dark Web content is legal and mundane, much like the clear or Open Web.

Digital Journal, which published the article "Despite its Nefarious Reputation, New Report Finds Majority of Activity on the Dark Web is Totally Legal and Mundane," says that:

What we’ve found is that the dark web isn’t quite as dark as you may have thought,” said Emily Wilson, Director of Analysis at Terbium Labs. “The vast majority of dark web research to date has focused on illegal activity while overlooking the existence of legal content. We wanted to take a complete view of the dark web to determine its true nature and to offer readers of this report a holistic view of dark web activity — both good and bad.

The findings have been curated in a report, "The Truth About the Dark Web: Separating Fact from Fiction," that puts the Dark Web in a new light. According to the report, around 55% of the content on the Dark Web is legal; porn makes up 7% of Dark Web content, and most of it is legal. Drugs, though a favorite topic, are less illicit than expected: only 45% of drug-related content can be termed illegal. Fraud, extremism, and illegal weapons trading, on the other hand, make up just 5-7% of the Dark Web.

The research was conducted using a mix of machine intelligence and human intelligence, as pointed out in the article:

Conducting research on the dark web is a difficult task because the boundaries between categories are unclear,” said Clare Gollnick, Chief Data Scientist at Terbium Labs. “We put significant effort into making sure this study was based on a representative, random sample of the dark web. We believe the end result is a fair and comprehensive assessment of dark web activity, with clear acknowledgment of the limitations involved in both dark web data specifically and broader limitations of data generally.

The Dark Web is slowly gaining traction as users of the Open Web find uses for this hidden portion of the Internet. Though the study is indeed illuminating, it fails to address how much of the illegal activity or content on the Dark Web affects the real world. For instance, what quantity of the drug trade takes place over the Dark Web? Any answers?

Vishal Ingole, November 15, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Iceland Offers the First Human Search Engine

November 8, 2016

Iceland is a northern country that one does not think about much.  It is cold, has a high literacy rate, and did we mention it was cold?  Despite its frigid temperatures, Iceland is a beautiful country with a rich culture and friendly people.  One article shares just how friendly the Icelanders are with their new endeavor: "Iceland Launches 'Ask Guðmundur,' The World's First Human Search Engine."

Here is what the country is doing:

The decidedly Icelandic and truly personable service will see special representatives from each of Iceland’s seven regions offer their insider knowledge to the world via Inspired By Iceland’s social media platforms (Twitter, Facebook and YouTube).   Each representative shares the name Guðmundur or Guðmunda, currently one of the most popular forenames in the country with over 4,000 men and women claiming it as their own.

Visitors to the site can submit their questions and have them answered by an expert.  Each of the seven Guðmundurs is an Icelandic regional expert.  Iceland's goal with the human search engine is not just to answer the world's questions about the country, but to answer them in the most human way possible: with actual humans.

A human search engine is an awesome marketing campaign for Iceland.  One of the best ways to encourage tourism is to introduce foreigners to the local people and customs; the more welcoming, quirky, and interesting they are, the better for Iceland.  So go ahead, ask Guðmundur.

Whitney Grace, November 8, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
